
Resource III: Evaluation Tooling for Retrieval and RecSys

Session Information

  • CoRECT: A Framework for Evaluating Embedding Compression Techniques at Scale

    Laura Caspari, Michael Dinzinger, Kanishka Ghosh Dastidar, Christofer Fellicious, Jelena Mitrović and Michael Granitzer

  • GREAT: Group Recommender Evaluation and Analysis Tool

    Ariel Smith, David Contreras, Maria Salamo and Ludovico Boratto

  • Evaluating the Efficiency and Effectiveness of Learned Sparse Retrieval with the lsr_benchmark

    Maik Fröbe, Ferdinand Schlatt, Cosimo Rulli, Tim Hagen, Jan Heinrich Merker, Gijs Hendriksen, Carlos Lassance, Franco Maria Nardini, Rossano Venturini and Martin Potthast

  • An Open SERP Mining Infrastructure for the Archive Query Log

    Jan Heinrich Merker, Simon Ruth, Harrisen Scells and Martin Potthast

  • RoutIR: Fast Serving of Retrieval Pipelines for Retrieval-Augmented Generation

    Eugene Yang, Andrew Yates, Dawn Lawrie, James Mayfield and Trevor Adriaanse

Apr 01, 2026 14:30 - 16:00(Europe/Amsterdam)
Venue : Chemie



Sub Sessions

CoRECT: A Framework for Evaluating Embedding Compression Techniques at Scale

Resource | Evaluation research | Machine Learning and Large Language Models | Search and ranking | 02:30 PM - 04:00 PM (Europe/Amsterdam)
Dense retrieval systems have proven to be effective across various benchmarks, but require substantial memory to store large search indices. Recent advances in embedding compression show that index sizes can be greatly reduced with minimal loss in ranking quality. However, existing studies often overlook the role of corpus complexity -- a critical factor, as recent work shows that both corpus size and document length strongly affect dense retrieval performance. In this paper, we introduce CoRECT (Controlled Retrieval Evaluation of Compression Techniques), a framework for large-scale evaluation of embedding compression methods, supported by a newly curated dataset collection. To demonstrate its utility, we benchmark eight representative types of compression methods. Notably, we show that non-learned compression achieves substantial index size reduction, even on up to 100M passages, with statistically insignificant performance loss. However, selecting the optimal compression method remains challenging, as performance varies across models. Such variability highlights the necessity of CoRECT to enable consistent comparison and informed selection of compression methods. All code, data, and results are available on GitHub and HuggingFace.
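The non-learned compression the abstract refers to can be illustrated with a generic technique such as int8 scalar quantization; the sketch below is illustrative only and does not reflect CoRECT's actual API or the methods it benchmarks.

```python
# Illustrative sketch of non-learned embedding compression via int8
# scalar quantization (a generic technique, not CoRECT's actual API).
import numpy as np

def quantize(emb: np.ndarray):
    """Map float32 embeddings to int8, keeping min and step for decoding."""
    lo, hi = float(emb.min()), float(emb.max())
    scale = (hi - lo) / 255.0 or 1.0  # quantization step size
    q = np.round((emb - lo) / scale - 128.0).astype(np.int8)
    return q, lo, scale

def dequantize(q: np.ndarray, lo: float, scale: float) -> np.ndarray:
    """Approximate reconstruction of the original float32 embeddings."""
    return (q.astype(np.float32) + 128.0) * scale + lo

rng = np.random.default_rng(0)
emb = rng.standard_normal((4, 8)).astype(np.float32)
q, lo, scale = quantize(emb)
recovered = dequantize(q, lo, scale)
# int8 storage is 4x smaller than float32; rounding error stays within
# one quantization step.
assert q.dtype == np.int8
assert np.abs(recovered - emb).max() <= scale
```

Methods like this need no training data, which is one reason index-size reductions of this kind can transfer across corpora; whether ranking quality survives is exactly what a controlled framework has to measure.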
Presenters
LC
Laura Caspari
University Of Passau
Co-Authors
MD
Michael Dinzinger
University Of Passau
KD
Kanishka Ghosh Dastidar
CF
Christofer Fellicious
JM
Jelena Mitrović
MG
Michael Granitzer
University Of Passau

GREAT: Group Recommender Evaluation and Analysis Tool

Resource | 02:30 PM - 04:00 PM (Europe/Amsterdam)
Previous research on group recommender systems (GRSs) has shown that group dynamics strongly influence decision-making, yet collaborative filtering (CF)-based GRSs rarely account for social interactions, largely because suitable analytical tools are lacking. This paper introduces a community resource for studying live groups as they engage with a CF-based recommender system through a domain-independent graphical interface that records interaction signals (such as suggestions, views, and favorites) and integrates them into the recommendation process. A live user study with 72 participants organized into 18 groups demonstrates the system's effectiveness in capturing and analyzing user interactions. Results show that incorporating interaction awareness enhances group satisfaction and reveals underlying social dynamics, offering new opportunities for adaptive GRSs responsive to real-time user behavior. The source code and dataset are available online.
Presenters
AS
Ariel Smith
Universidad Arturo Prat
Co-Authors
DC
David Contreras
Smart Society Research Group, La Salle-Universitat Ramon Llull
MS
Maria Salamo
Departament Of Mathematics And Computer Science, Universitat De Barcelona
LB
Ludovico Boratto
Associate Professor Of Computer Science, University Of Cagliari

Evaluating the Efficiency and Effectiveness of Learned Sparse Retrieval with the lsr_benchmark

Resource | 02:30 PM - 04:00 PM (Europe/Amsterdam)
Different learned sparse retrieval (LSR) models offer different trade-offs between effectiveness and efficiency. However, while there are standardized and interoperable tools to assess LSR effectiveness, there is no agreed-upon methodology for evaluating efficiency, and datasets with high-quality relevance judgments are too large for repeated efficiency experiments, e.g., across different hardware. To promote the evaluation of LSR models for effectiveness and efficiency, we introduce the lsr_benchmark, which measures retrieval effectiveness and efficiency of each step in an LSR pipeline (document embedding, indexing, query embedding, and retrieval). To ensure tractability and extensibility, we apply current corpus subsampling methods to eleven TREC tasks, precompute embeddings with eleven LSR models per task, and provide eight retrieval systems as baselines. For the benchmark's hosted version, a modular API and tools for evaluating effectiveness and efficiency make submitting new approaches easy. Our experiments show that the chosen embedding model significantly affects the efficiency of a retrieval system and that LSR is more effective but less efficient than BM25, an efficiency gap our benchmark helps to track as new LSR models are published.
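The four pipeline steps the benchmark times can be sketched with toy term-weight dictionaries standing in for a learned model's sparse vectors; this is a conceptual illustration, not the lsr_benchmark's own code.

```python
# Toy sketch of the four LSR pipeline steps the benchmark measures:
# document embedding, indexing, query embedding, and retrieval.
from collections import defaultdict

# "Document embedding": in a real LSR model these term weights would be
# produced by a neural encoder; here they are hand-picked toy values.
docs = {"d1": {"sparse": 2.0, "retrieval": 1.5},
        "d2": {"dense": 1.8, "retrieval": 0.5}}

# Indexing: build an inverted index over the sparse document vectors.
index = defaultdict(list)
for doc_id, vec in docs.items():
    for term, w in vec.items():
        index[term].append((doc_id, w))

def retrieve(query_vec, top_k=10):
    """Score documents by sparse dot product and return the best top_k."""
    scores = defaultdict(float)
    for term, qw in query_vec.items():
        for doc_id, dw in index[term]:
            scores[doc_id] += qw * dw
    return sorted(scores.items(), key=lambda x: -x[1])[:top_k]

# Query embedding would come from the same learned model; toy vector here.
print(retrieve({"sparse": 1.0, "retrieval": 1.0}))
# d1 scores 2.0 + 1.5 = 3.5, d2 scores 0.5
```

Timing each of these stages separately is what lets a benchmark attribute an efficiency gap to the encoder, the index, or the scoring loop rather than to the pipeline as a whole.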
Presenters
MF
Maik Fröbe
PhD Student, Friedrich-Schiller-Universität Jena
Co-Authors
FS
Ferdinand Schlatt
Friedrich-Schiller-Universität Jena
CR
Cosimo Rulli
Researcher, ISTI-CNR
TH
Tim Hagen
Research Assistant And PhD Student, University Of Kassel
Jan Heinrich Merker
Friedrich-Schiller-Universität Jena
GH
Gijs Hendriksen
PhD Candidate, Radboud University
CL
Carlos Lassance
Cohere
FN
Franco Maria Nardini
Research Director, ISTI-CNR
RV
Rossano Venturini
University Of Pisa
Martin Potthast
University Of Kassel, Hessian.AI, And ScaDS.AI

An Open SERP Mining Infrastructure for the Archive Query Log

Resource | 02:30 PM - 04:00 PM (Europe/Amsterdam)
Query logs are key resources for studying search engine interactions and improving retrieval effectiveness, but are rarely publicly available. In the past, search providers only shared small subsets of their own logs to curb competition and to ensure privacy. The Archive Query Log (AQL) offers an open alternative: mining query logs from archived search engine result pages (SERPs). While the AQL-22 prototype demonstrated the feasibility of this approach, its limited scalability and maintainability hindered widespread adoption by the research community. We re-implement the crawling and parsing of the AQL on open infrastructure, using standard tools, a new framework for storing SERPs, and following FAIR data principles. The extended and continuously crawled AQL-25 corpus contains 553 million SERPs from 775 search providers, mined from six web archives, of which 223 million SERPs (44 TB; 40%) have so far been downloaded and parsed. We demonstrate the use of this new AQL mining framework in two typical analysis scenarios: a temporal analysis now implemented as a single Elasticsearch query, and a batch-processing analysis using Ray. Our resource equips researchers with all the tools needed to analyze SERPs.
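A temporal analysis expressed as a single Elasticsearch query might look like the request body below, built here as a Python dict. The index and field names are illustrative assumptions, not the AQL's actual schema.

```python
# Hedged sketch of a temporal SERP analysis as one Elasticsearch request
# body. Field names ("query_text", "capture_date") are assumptions for
# illustration, not the AQL-25 corpus's real mapping.
query = {
    "query": {"match": {"query_text": "weather"}},
    "aggs": {
        "serps_over_time": {
            "date_histogram": {
                "field": "capture_date",
                "calendar_interval": "year",
            }
        }
    },
    "size": 0,  # return only aggregation buckets, no individual hits
}
# An Elasticsearch client would POST this body to the index's _search
# endpoint; the response buckets give per-year counts of matching SERPs.
```

Pushing the whole analysis into one aggregation keeps the heavy lifting server-side, which matters at the scale of hundreds of millions of SERPs.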
Presenters
Jan Heinrich Merker
Friedrich-Schiller-Universität Jena
Co-Authors
SR
Simon Ruth
University Of Kassel
HS
Harry Scells
Assistant Professor, University Of Tübingen
Martin Potthast
University Of Kassel, Hessian.AI, And ScaDS.AI

RoutIR: Fast Serving of Retrieval Pipelines for Retrieval-Augmented Generation

Resource | Applications | 02:30 PM - 04:00 PM (Europe/Amsterdam)
Retrieval models are key components of Retrieval-Augmented Generation (RAG) systems, which generate search queries, process the documents returned, and generate a response. RAG systems are often dynamic and may involve multiple rounds of retrieval. While many state-of-the-art retrieval methods are available through academic IR platforms, these platforms are typically designed for the Cranfield paradigm in which all queries are known up front and can be batch processed offline. This simplification accelerates research but leaves state-of-the-art retrieval models unable to support downstream applications that require online services, such as arbitrary dynamic RAG pipelines that involve looping, feedback, or even self-organizing agents. In this work, we introduce RoutIR, a Python package that provides a simple and efficient HTTP API that wraps arbitrary retrieval methods, including first-stage retrieval, reranking, query expansion, and result fusion. By providing a minimal JSON configuration file specifying the retrieval models to serve, RoutIR can be used to construct and query retrieval pipelines on-the-fly using any available models (e.g., fusing the results of several first-stage retrieval methods followed by reranking). The API automatically performs asynchronous query batching and result caching by default. While many state-of-the-art retrieval methods are already supported by the package, RoutIR is also easily expandable by implementing the Engine abstract class. The package is publicly available on GitHub: http://github.com/hltcoe/routir.
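One of the pipeline stages mentioned above, result fusion, can be illustrated with generic reciprocal rank fusion (RRF) over several first-stage ranked lists. This is a standard technique sketched for illustration, not RoutIR's actual implementation; the constant k=60 is a common default from the RRF literature.

```python
# Generic reciprocal rank fusion (RRF) sketch: merge several ranked
# doc-id lists into one ranking. Not RoutIR's code; k=60 is a common
# default constant for RRF.
def rrf(rankings, k=60):
    """Fuse ranked doc-id lists; a higher fused score means a better rank."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1/(k + rank) for every doc it returns.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["d3", "d1", "d2"]    # hypothetical first-stage run
dense = ["d1", "d2", "d3"]   # hypothetical second first-stage run
print(rrf([bm25, dense]))
```

Because RRF only needs ranks, not scores, it composes cleanly behind a single HTTP endpoint that fans a query out to several engines and fuses whatever comes back.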
Presenters
Eugene Yang
Research Scientist, Human Language Technology Center Of Excellence, Johns Hopkins University
Co-Authors
AY
Andrew Yates
Johns Hopkins University, HLTCOE
DL
Dawn Lawrie
Senior Research Scientist, HLTCOE At Johns Hopkins University
James Mayfield
Principal Computer Scientist, JHU HLTCOE
TA
Trevor Adriaanse
Johns Hopkins University