Why don't personalised IR benchmarks exist? (per-user relevance labels, not just per-query)
Every major IR benchmark (BEIR, MS MARCO, TREC) has a single relevance label per query-document pair: the same label for all users. This works only if you assume the optimal ranking is identical for every user asking the same question. That assumption seems obviously wrong for heterogeneous user populations. A domain expert and a novice asking the same clinical question should get different ranked results.

MovieLens and similar recommendation benchmarks measure user preference (taste). That's different from epistemic state: what the user already knows vs what they still need.

TREC Session Track (2011-2014) is the closest thing I've found: multi-turn sessions with some user context. But it's old and doesn't have per-user relevance ground truth.

Is anyone working on benchmarks where:

- The same document has different relevance scores for different users
- Relevance is conditioned on documented prior knowledge, not just on the query text

Have I missed something? Genuine question before I go down the path of building one.
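
To make the ask concrete, here is a minimal sketch of what per-user relevance labels could look like, assuming qrels keyed by (user, query) instead of by query alone. Everything in it is hypothetical: the user IDs, document IDs, grades, and the `PER_USER_QRELS` layout are made up for illustration and do not come from any existing benchmark. The metric is plain nDCG with linear gains, scored once per user against that user's own labels.

```python
import math

# Hypothetical per-user qrels: (user_id, query_id) -> {doc_id: graded relevance}.
# All IDs and grades are invented for illustration only.
PER_USER_QRELS = {
    ("expert", "q1"): {"review_article": 0, "trial_report": 3, "guideline": 2},
    ("novice", "q1"): {"review_article": 3, "trial_report": 0, "guideline": 2},
}


def dcg(gains):
    """Discounted cumulative gain with linear gains and a log2 rank discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg(ranking, qrels, k=10):
    """nDCG@k of one ranking against one user's labels; unlabeled docs count as 0."""
    gains = [qrels.get(doc, 0) for doc in ranking[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def evaluate(ranking, query_id):
    """Score the same ranking separately for every user who has labels for the query."""
    return {
        user: ndcg(ranking, labels)
        for (user, qid), labels in PER_USER_QRELS.items()
        if qid == query_id
    }


# One system output, judged against two users' labels for the same query.
ranking = ["trial_report", "guideline", "review_article"]
scores = evaluate(ranking, "q1")
print(scores)
```

With these made-up labels the same ranking is ideal for the expert (nDCG of 1.0) and noticeably worse for the novice (about 0.65). A single per-query label would hide that gap, which is the thing I'd want a benchmark to expose.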