Abstract
Different learned sparse retrieval (LSR) models offer different trade-offs between effectiveness and efficiency. However, while there are standardized and interoperable tools to assess LSR effectiveness, there is no agreed-upon methodology for evaluating efficiency, and datasets with high-quality relevance judgments are too large for repeated efficiency experiments, e.g., across different hardware. To promote the evaluation of LSR~models for both effectiveness and efficiency, we introduce the \lsrBenchmark, which measures the retrieval effectiveness and the efficiency of each step in an LSR~pipeline (document embedding, indexing, query embedding, and retrieval). To ensure tractability and extensibility, we apply current corpus subsampling methods to eleven TREC tasks, precompute embeddings with eleven LSR~models per task, and provide eight retrieval systems as baselines. For the benchmark's hosted version, a modular~API and tools for evaluating effectiveness and efficiency make submitting new approaches easy. Our experiments show that the chosen embedding model significantly affects the efficiency of a retrieval system and that LSR is more effective but less efficient than BM25, an efficiency gap our benchmark helps to track as new LSR models are published.