Multi-Step Semantic Reasoning in Generative Retrieval Steven Dong, Yubao Tang and Maarten de Rijke
SSEmb: A Joint Structural and Semantic Embedding Framework for Mathematical Formula Retrieval Ruyin Li and Xiaoyu Chen
On the Viability of Exploiting Large Language Models for Misinformation Annotation Pablo Landrove, Marcos Fernandez-Pichel and David E. Losada
Incorporating Q&A Nuggets into Retrieval-Augmented Generation Laura Dietz, Bryan Li, Gabrielle Liu, Jia-Huei Ju, Eugene Yang, Dawn Lawrie, William Walden and James Mayfield
Evolving Mixture of Low-Rank Experts for Continual User Modeling Jeevan Thapa, Sinan Zhao and Koyoshi Shindo
Personalized Autocompletion of Interactions with LLM-based Chatbots Shani Goren, Nachshon Cohen, Oren Kalinsky, Tomer Stav, Yaron Fairstein, Yuri Rapoport, Ram Yazdi, Alex Libov and Guy Kushilevitz
Evaluating Large Language Models as Domain-Specific Retrieval Agents: A Study on Cybersecurity Challenge Benchmarks Omed Abed, Md. Samiul Haque, Patrick-Benjamin Bök and Matteo Große-Kampmann
Large Language Models as Assessors: On the Impact of Relevance Scales Riccardo Zamolo, Riccardo Lunardi, Michael Soprano, Gianluca Demartini, Stefano Mizzaro and Kevin Roitero
Analyzing AI Evaluation Benchmarks Through Information Retrieval and Network Science Gaia Simeoni, Michael Soprano, Riccardo Lunardi, Kevin Roitero and Stefano Mizzaro
Evaluating Retrieval-Augmented Generation Systems on Unanswerable, Uncheatable, Realistic, Multi-hop Queries Gabrielle Liu, Bryan Li, Arman Cohan, William Walden and Eugene Yang
DARE: A Dialectical Framework for Adversarial and Evidence-Aware RAG Saisab Sadhu, Dwaipayan Roy and Tannay Basu
Do We Still Need Text for Video Retrieval in the Era of Vision-Language Models? Jiaqi Zhan, Xinyu Zhang, Shengyao Zhuang, Xueguang Ma and Jimmy Lin
Query Performance Prediction using a Child-focused Definition of Relevance Hrishita Chakrabarti and Maria Soledad Pera
ReFormeR: Learning and Applying Explicit Query Reformulation Patterns Amin Bigdeli, Mert Incesu, Negar Arabzadeh, Charles L. A. Clarke and Ebrahim Bagheri
One Word is Enough: Minimal Adversarial Perturbations for Neural Text Ranking Tanmay Karmakar, Sourav Saha, Debapriyo Majumdar and Surjyanee Halder
Text vs. Speech? Detecting Audio Deepfakes on Instagram Karla Schäfer
MiNER: A Two-Stage Pipeline for Metadata Extraction from Municipal Meeting Minutes Rodrigo Batista, Filipe Cunha, Purificação Silvano, Nuno Guimarães, Alípio Jorge, Evelin Amorim and Ricardo Campos
Revisiting Human-vs-LLM judgments on the TREC Podcast Track Watheq Mansour, J. Shane Culpepper, Joel Mackenzie and Andrew Yates
Forward Index Compression for Learned Sparse Retrieval Sebastian Bruch, Martino Fontana, Franco Maria Nardini, Cosimo Rulli and Rossano Venturini
Analyzing AI Evaluation Benchmarks Through Information Retrieval and Network Science
Short papers | IR evaluation | 01:30 PM - 02:30 PM (Europe/Amsterdam) | 2026/03/30 11:30:00 UTC - 2026/03/30 12:30:00 UTC
Many analyses of Information Retrieval (IR) evaluation benchmarks have been performed, using many different approaches. Benchmarking also plays a central role in evaluating the capabilities of Large Language Models (LLMs). However, recent concerns have emerged regarding the robustness of benchmarks and the reliability of leaderboard rankings. In this paper, we apply an IR approach to LLM evaluation: we analyze LLM benchmark results through the lens of network science. Adapting a method developed for TREC test collections, we construct a bipartite graph between models and benchmark questions and apply Kleinberg's HITS algorithm to uncover latent structure in the evaluation data. In this framework, model hubness quantifies a model's tendency to perform well on easy questions, while question hubness captures a question's ability to discriminate between more and less effective models. This graph-based view provides a principled way to diagnose benchmark-induced biases and to assess the reliability of model rankings. We conduct experiments on seven multiple-choice QA benchmarks with a pool of 34 LLMs. Through this IR-inspired approach, we show that model rankings on leaderboards are strongly influenced by subsets of easy questions.
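For readers unfamiliar with the adaptation, the sketch below runs the standard HITS recursion on a bipartite model-question graph built from a binary correctness matrix. It is an illustrative simplification under assumed variable names and toy data, not the authors' code.

    import numpy as np

    def hits_bipartite(correct, iters=50):
        """correct[i, j] = 1 if model i answers benchmark question j correctly."""
        model_score = np.ones(correct.shape[0])      # hub-like score per model
        question_score = np.ones(correct.shape[1])   # authority-like score per question
        for _ in range(iters):
            model_score = correct @ question_score      # models credited for the questions they solve
            question_score = correct.T @ model_score    # questions credited for the models that solve them
            model_score /= np.linalg.norm(model_score) + 1e-12
            question_score /= np.linalg.norm(question_score) + 1e-12
        return model_score, question_score

    # Toy example: 3 models, 4 questions; question 0 is "easy" (every model solves it).
    correct = np.array([[1, 1, 1, 0],
                        [1, 1, 0, 0],
                        [1, 0, 0, 0]], dtype=float)
    print(hits_bipartite(correct))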
Large Language Models as Assessors: On the Impact of Relevance Scales
Short papers | IR evaluation | Large Language Models | 01:30 PM - 02:30 PM (Europe/Amsterdam) | 2026/03/30 11:30:00 UTC - 2026/03/30 12:30:00 UTC
Traditionally, the task of relevance judgment has relied on human annotators, but recent advances in Large Language Models (LLMs) have prompted growing interest in their use as proxies for fully- or semi-automated relevance judgments. In this setting, a key yet underexplored factor is the relevance scale adopted when judging relevance. Relevance scales range from binary to fine-grained, and their impact on the effectiveness of LLM-based judgments, the effect of scale conversions, and their role in the presence of potential data contamination remain unknown. In this paper, we systematically investigate how different scales, as well as conversions between them, affect the ability of LLMs to provide reliable point-wise relevance judgments, exploring this across multiple prompting strategies and model sizes. Using a popular TREC collection, we compare model outputs against both crowd and expert annotations, analyzing their alignment, stability, and potential data contamination issues.
Kevin Roitero Tenure Track Assistant Professor, University Of Udine
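As a concrete illustration of the scale conversions studied in this paper, the snippet below collapses a graded 0-3 relevance label into binary and three-level labels. The threshold and mapping are assumptions for illustration, not the paper's protocol.

    def graded_to_binary(label, threshold=2):
        # Treat grades >= threshold as relevant (the threshold is an assumed choice).
        return int(label >= threshold)

    def graded_to_three_level(label):
        # Merge the two middle grades of a 0-3 scale (again, an assumed mapping).
        return {0: 0, 1: 1, 2: 1, 3: 2}[label]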
SSEmb: A Joint Structural and Semantic Embedding Framework for Mathematical Formula Retrieval
Short papers | 01:30 PM - 02:30 PM (Europe/Amsterdam) | 2026/03/30 11:30:00 UTC - 2026/03/30 12:30:00 UTC
Formula retrieval is an important topic in Mathematical Information Retrieval. We propose SSEmb, a novel embedding framework capable of capturing both the structural and the semantic features of formulas. Structurally, we employ Graph Contrastive Learning to encode formulas represented as Shared-substructure Operator Graphs. To enhance structural diversity while preserving the mathematical validity of these formula graphs, we introduce a graph data augmentation approach based on a substitution strategy. Semantically, we utilize Sentence-BERT to encode the surrounding text of formulas. Finally, for each query and its candidates, structural and semantic similarities are calculated separately and then fused through a weighted scheme. In the ARQMath-3 Formula Retrieval Task, SSEmb outperforms existing embedding-based methods by over 5 percentage points on P'@10 and nDCG'@10. Furthermore, SSEmb improves the performance of all runs from other methods and achieves state-of-the-art results when combined with Approach0.
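The weighted fusion step can be pictured as in the sketch below, which combines a structural and a semantic cosine similarity with a single weight. The weight name alpha and the cosine choice are illustrative assumptions, not the released SSEmb code.

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def ssemb_style_score(struct_q, struct_c, sem_q, sem_c, alpha=0.5):
        # struct_*: graph-encoder embeddings of the query / candidate formulas;
        # sem_*: Sentence-BERT embeddings of their surrounding text.
        return alpha * cosine(struct_q, struct_c) + (1.0 - alpha) * cosine(sem_q, sem_c)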
ReFormeR: Learning and Applying Explicit Query Reformulation Patterns
Short papers | Generative IR | Search and ranking | 01:30 PM - 02:30 PM (Europe/Amsterdam) | 2026/03/30 11:30:00 UTC - 2026/03/30 12:30:00 UTC
We present ReFormeR, a pattern-guided approach for query reformulation. Instead of prompting a language model to generate reformulations of a query directly, ReFormeR first elicits short reformulation patterns from pairs of initial queries and empirically stronger reformulations, consolidates them into a compact library of transferable reformulation patterns, and then selects an appropriate reformulation pattern for a new query given its retrieval context. The selected pattern constrains query reformulation to controlled operations such as sense disambiguation, vocabulary grounding, or discriminative facet addition. Our approach thus makes the reformulation policy explicit through these reformulation patterns, guiding the LLM towards targeted and effective query reformulations. Our extensive experiments on TREC DL 2019, DL 2020, and DL Hard show consistent improvements over classical feedback methods and recent LLM-based query reformulation and expansion approaches.
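To make the pattern-guided idea concrete, a hypothetical sketch might look like this; the pattern names, scoring function, and prompt wording are all assumptions, not the ReFormeR implementation.

    PATTERNS = {
        "sense_disambiguation": "Make the intended sense of ambiguous terms explicit.",
        "vocabulary_grounding": "Replace colloquial terms with the corpus vocabulary.",
        "facet_addition": "Add one discriminative facet that narrows the intent.",
    }

    def select_pattern(query, retrieved_snippets, similarity):
        # similarity(text_a, text_b) -> float; any text matcher can play this role.
        context = query + " " + " ".join(retrieved_snippets)
        return max(PATTERNS, key=lambda name: similarity(PATTERNS[name], context))

    def reformulation_prompt(query, pattern_name):
        return (f"Rewrite the query using only this operation: {PATTERNS[pattern_name]}\n"
                f"Query: {query}\nReformulation:")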
Query Performance Prediction using a Child-focused Definition of Relevance
Short papers | IR applications | IR evaluation | Societally-motivated IR research | User aspects in IR | 01:30 PM - 02:30 PM (Europe/Amsterdam) | 2026/03/30 11:30:00 UTC - 2026/03/30 12:30:00 UTC
Query performance prediction (QPP) methods have primarily been tailored to mainstream users and thus rely on the traditional concept of relevance. For children, however, relevance goes beyond content-based resource-query matching, which is why we gauge the performance of existing QPP methods in estimating the fit of resources retrieved in response to child-formulated queries. Outcomes from our empirical exploration of various QPP methods, using a traditional and a child-focused definition of relevance on two datasets, reveal limitations in how well existing methods adapt to the context of children's information retrieval.
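As one example of the kind of post-retrieval predictor evaluated in such studies, the sketch below computes NQC and correlates predictions with per-query effectiveness. The choice of NQC and Kendall's tau is an assumption for illustration, not a claim about this paper's exact setup.

    import numpy as np
    from scipy.stats import kendalltau

    def nqc(top_scores, corpus_score):
        """Normalized Query Commitment: std of the top-k retrieval scores,
        normalized by the query's corpus-level score."""
        return float(np.std(top_scores) / (abs(corpus_score) + 1e-12))

    def predictor_quality(predictions, per_query_effectiveness):
        # per_query_effectiveness can be computed under either definition of
        # relevance (traditional or child-focused) to compare the two settings.
        tau, _ = kendalltau(predictions, per_query_effectiveness)
        return tau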
Text vs. Speech? Detecting Audio Deepfakes on Instagram
Short papers | Explainability methods | IR applications | Societally-motivated IR research | 01:30 PM - 02:30 PM (Europe/Amsterdam) | 2026/03/30 11:30:00 UTC - 2026/03/30 12:30:00 UTC
With the growing use of AI, deepfakes are becoming an increasingly prevalent threat. At the same time, the performance of most detectors drops significantly on unseen data, while generation models keep improving and leave fewer artefacts. We examined deepfakes published on Instagram using the SocialDF dataset. In addition to analysing the deepfakes in the frequency domain with audio deepfake detectors, we transcribed the speech and analysed the text (e.g., emotion and topics) and the audio content (e.g., emotion and music genre). We found that audio deepfake detectors struggle to identify real-world deepfakes on Instagram. Furthermore, current audio deepfake detection relies on audio artefacts alone and does not use content. We suggest using both the speech recording and its content: this approach improves results on real-world data and provides an explanation for the classification. Using content information, we outperformed frequency-based detection with an F1-score of 74.3%.
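The proposed combination of signal- and content-based evidence can be sketched as a small late-fusion classifier. The feature set and toy data below are purely illustrative assumptions, not the paper's pipeline.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fused_features(detector_score, text_emotion_score, music_present):
        # detector_score: output of a frequency-domain audio deepfake detector;
        # text_emotion_score / music_present: content features from the transcript and audio.
        return [detector_score, text_emotion_score, float(music_present)]

    # Toy training data (labels: 1 = deepfake, 0 = genuine).
    X = np.array([fused_features(0.9, 0.1, 0), fused_features(0.2, 0.7, 1),
                  fused_features(0.8, 0.6, 1), fused_features(0.1, 0.3, 0)])
    y = np.array([1, 0, 1, 0])
    clf = LogisticRegression().fit(X, y)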
MiNER: A Two-Stage Pipeline for Metadata Extraction from Municipal Meeting Minutes
Short papers | IR applications | Machine learning | 01:30 PM - 02:30 PM (Europe/Amsterdam) | 2026/03/30 11:30:00 UTC - 2026/03/30 12:30:00 UTC
Municipal meeting minutes are official documents of local governance, exhibiting heterogeneous formats and writing styles. Effective information retrieval (IR) requires identifying metadata such as meeting number, date, location, participants, and start/end times, elements that are rarely standardized or easy to extract automatically. Existing named entity recognition (NER) models are ill-suited to this task, as they are not adapted to such domain-specific categories. In this paper, we propose a two-stage pipeline for metadata extraction from municipal minutes. First, a question answering (QA) model identifies the opening and closing text segments containing metadata. Transformer-based models (BERTimbau and XLM-RoBERTa, with and without a CRF layer) are then applied for fine-grained entity extraction and enhanced through delexicalization. To evaluate the proposed pipeline, we benchmark both open-weight (Phi) and closed-weight (Gemini) LLMs, assessing predictive performance, inference cost, and carbon footprint. Our results demonstrate strong in-domain performance, better than that of larger general-purpose LLMs. However, cross-municipality evaluation reveals reduced generalization, reflecting the variability and linguistic complexity of municipal records. This work establishes the first benchmark for metadata extraction from municipal meeting minutes, providing a solid foundation for future research in this domain.
Ricardo Campos Professor, University Of Beira Interior / INESC TEC
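A minimal sketch of the two-stage idea using Hugging Face pipelines is shown below; the default models are placeholders, whereas the paper fine-tunes BERTimbau / XLM-RoBERTa variants for this domain.

    from transformers import pipeline

    # Stage 1: a QA model locates the segment that carries the metadata.
    qa = pipeline("question-answering")          # placeholder default model
    # Stage 2: a token-classification (NER) model labels entities in that segment.
    ner = pipeline("token-classification", aggregation_strategy="simple")

    def extract_metadata(minutes_text):
        opening = qa(question="Which passage opens the meeting?", context=minutes_text)
        segment = minutes_text[opening["start"]:opening["end"]]
        return ner(segment)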
Forward Index Compression for Learned Sparse Retrieval
Short papers | 01:30 PM - 02:30 PM (Europe/Amsterdam) | 2026/03/30 11:30:00 UTC - 2026/03/30 12:30:00 UTC
Text retrieval using learned sparse representations of queries and documents has, over the years, evolved into a highly effective approach to search. Thanks to recent advances in approximate nearest neighbor search, with the emergence of highly efficient algorithms such as the inverted index-based Seismic and the graph-based HNSW, retrieval with sparse representations has become viable in practice. In this work, we scrutinize the efficiency of sparse retrieval algorithms and focus particularly on the size of a data structure that is common to all algorithmic flavors and that constitutes a substantial fraction of the overall index size: the forward index. In particular, we seek compression techniques that reduce the storage footprint of the forward index without compromising search quality or inner product computation latency. Examining various integer compression techniques, we find that StreamVByte achieves the best trade-off between memory footprint, retrieval accuracy, and latency. We then improve on StreamVByte by introducing DotVByte, a new algorithm tailored to inner product computation. Experiments on MS MARCO show that our improvements lead to significant space savings while maintaining retrieval efficiency.
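To illustrate the general idea (plain variable-byte coding here, not StreamVByte or the authors' DotVByte), a forward-index entry can store compressed term ids that are decoded on the fly during inner-product computation.

    def vbyte_encode(values):
        out = bytearray()
        for v in values:
            while v >= 128:
                out.append((v & 0x7F) | 0x80)
                v >>= 7
            out.append(v)
        return bytes(out)

    def vbyte_decode(data):
        values, cur, shift = [], 0, 0
        for byte in data:
            cur |= (byte & 0x7F) << shift
            if byte & 0x80:
                shift += 7
            else:
                values.append(cur)
                cur, shift = 0, 0
        return values

    def inner_product(query_weights, doc_term_bytes, doc_weights):
        # query_weights: dict term_id -> weight; the document's term ids are
        # decoded from the compressed forward-index entry as we go.
        return sum(query_weights.get(t, 0.0) * w
                   for t, w in zip(vbyte_decode(doc_term_bytes), doc_weights))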
Revisiting Human-vs-LLM judgments on the TREC Podcast Track
Short papers | Large Language Models | Search and ranking | 01:30 PM - 02:30 PM (Europe/Amsterdam) | 2026/03/30 11:30:00 UTC - 2026/03/30 12:30:00 UTC
Using large language models (LLMs) to annotate relevance is an increasingly important problem in the information retrieval community. While some studies demonstrate that LLMs can achieve high agreement with ground truth (human) judgments, other studies have argued for the opposite conclusion. To the best of our knowledge, these studies have primarily focused on classic ad-hoc text search scenarios. In this paper, we analyze agreement between LLMs and human experts and explore the impact that disagreement has on system rankings. In contrast to prior studies, we focus on a collection composed of audio files transcribed into two-minute segments: the TREC 2020 and 2021 Podcast Tracks. We employ five different LLMs to re-assess all of the query-segment pairs originally annotated by TREC assessors. Furthermore, we re-assess a small subset of pairs on which the LLMs and TREC assessors disagree most, and find that human experts tend to agree with the LLMs more than with the TREC assessors. Our results reinforce Sormunen's insight from 2002 that relying on a single assessor leads to lower agreement.
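Agreement between LLM and TREC judgments can be quantified with a chance-corrected statistic such as Cohen's kappa; the toy labels below are illustrative, and kappa is one common choice rather than necessarily the measure used in this paper.

    from sklearn.metrics import cohen_kappa_score

    trec_judgments = [2, 0, 1, 2, 0, 1]   # toy graded labels from TREC assessors
    llm_judgments  = [2, 0, 0, 2, 1, 1]   # toy LLM labels for the same query-segment pairs
    print(cohen_kappa_score(trec_judgments, llm_judgments, weights="quadratic"))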
Do We Still Need Text for Video Retrieval in the Era of Vision-Language Models?
Short papers | 01:30 PM - 02:30 PM (Europe/Amsterdam) | 2026/03/30 11:30:00 UTC - 2026/03/30 12:30:00 UTC
Effective video retrieval has historically relied heavily on textual descriptions and metadata. However, recent advances in vision-language models (VLMs) prompt the question: Are text features still essential for effective video retrieval? In this work, we investigate this question using a unified multimodal retrieval framework based on advanced VLM embeddings. Evaluating on the comprehensive and multilingual MultiVENT 2.0 dataset from the MAGMaR shared task, we show that multimodal retrieval systems, combining visual frames, audio signals, and textual descriptions, surpass traditional text-only retrieval performance. Remarkably, our results demonstrate that retrieval based solely on non-text modalities (vision and audio) achieves performance comparable to text-based methods, indicating that explicit text input may no longer be strictly necessary.
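A unified multimodal score of the kind described here can be obtained by late fusion over whichever modality embeddings are present; the sketch below is an illustrative simplification, and dropping the text entry yields the non-text variant.

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def video_score(query_emb, modality_embs):
        # modality_embs: {"frames": vec, "audio": vec, "text": vec or None}
        sims = [cosine(query_emb, v) for v in modality_embs.values() if v is not None]
        return sum(sims) / len(sims)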
DARE: A Dialectical Framework for Adversarial and Evidence-Aware RAG
Short papers | Generative IR | IR applications | Large Language Models | Retrieval-Augmented Generation | System aspects | 01:30 PM - 02:30 PM (Europe/Amsterdam) | 2026/03/30 11:30:00 UTC - 2026/03/30 12:30:00 UTC
Retrieval-Augmented Generation (RAG) systems are susceptible to factual inconsistencies when retrieved evidence is conflicting, a common issue with open-domain sources. Prevailing multi-agent approaches attempt to resolve this through unstructured debates that treat all information sources as equally credible. Concurrently, reliability-aware systems address source quality, but typically only as a weighting factor during final aggregation, failing to integrate this crucial signal into the reasoning process itself. This paper proposes DARE (a Dialectical Adversarial RAG Engine), a novel framework that implements a formal dialectical process to resolve such conflicts through an evidence-aware adversarial agent that initiates a structured cross-examination of claims made by other agents. This process forces each claim to be defended against the complete set of source documents, allowing the system to dynamically infer an argument's credibility from its logical resilience. By structuring the debate as a formal dialectic, DARE provides a more robust and principled mechanism for synthesizing truth from unreliable and conflicting information. Our empirical analysis confirms this: DARE outperforms state-of-the-art baselines in terms of exact-match accuracy.
Tannay Basu Indian Institute Of Science Education And Research Bhopal
Multi-Step Semantic Reasoning in Generative Retrieval
Short papers | Generative IR | Search and ranking | 01:30 PM - 02:30 PM (Europe/Amsterdam) | 2026/03/30 11:30:00 UTC - 2026/03/30 12:30:00 UTC
Generative retrieval (GR) models encode a corpus within model parameters and generate relevant document identifiers directly for a given query. While this paradigm shows promise in retrieval tasks, existing GR models struggle with complex queries in numerical contexts, such as those involving semantic reasoning over financial reports, due to limited reasoning capabilities. This limitation leads to suboptimal retrieval accuracy and hinders practical applicability. We propose ReasonGR, a framework designed to enhance multi-step semantic reasoning in numerical contexts within GR. ReasonGR employs a structured prompting strategy combining task-specific instructions with stepwise reasoning guidance to better address complex retrieval queries. Additionally, it integrates a reasoning-focused adaptation module to improve learning of reasoning-related parameters. Experiments on the FinQA dataset, which contains financial queries over complex documents, demonstrate that ReasonGR improves retrieval accuracy and consistency, indicating its potential for advancing GR models in reasoning-intensive retrieval scenarios.
Presenters: Steven Dong, Student, University of Amsterdam. Co-authors: Yubao Tang, University of Amsterdam; Maarten de Rijke, Distinguished University Professor, University of Amsterdam.
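ReasonGR's structured prompting strategy (task-specific instructions plus stepwise reasoning guidance) can be pictured roughly as below; the wording and step list are illustrative assumptions, not the actual prompt.

    def reasongr_style_prompt(query):
        return (
            "Task: generate the identifier of the document that answers the financial query.\n"
            "Reason step by step before answering:\n"
            "1. Identify the entities and reporting period mentioned in the query.\n"
            "2. Determine which numerical quantities are required.\n"
            "3. Decide which report section would contain them.\n"
            f"Query: {query}\n"
            "Document identifier:"
        )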
On the Viability of Exploiting Large Language Models for Misinformation Annotation
Short papers | IR evaluation | Large Language Models | 01:30 PM - 02:30 PM (Europe/Amsterdam) | 2026/03/30 11:30:00 UTC - 2026/03/30 12:30:00 UTC
This paper investigates the potential of LLMs for automatically annotating the usefulness, supportiveness, and credibility of search results. These aspects, while essential to the construction of misinformation benchmarks, are expensive and difficult to obtain at scale. Our comparative study suggests that, under certain conditions, LLMs can provide reasonable estimates of usefulness and supportiveness. In contrast, credibility judgments generated by LLMs show almost no agreement with human assessments. This raises concerns about using LLMs to assist in the construction of collections that require annotations going beyond relevance.
Evolving Mixture of Low-Rank Experts for Continual User Modeling
Short papers | Machine learning | Recommender systems | User aspects in IR | 01:30 PM - 02:30 PM (Europe/Amsterdam) | 2026/03/30 11:30:00 UTC - 2026/03/30 12:30:00 UTC
Building a user model that incorporates diverse tasks remains a major challenge. While continual learning offers an alternative to multi-task learning by eliminating the need for retraining on all past tasks, prior work trains the whole network backbone along with task-specific masks, which is computationally inefficient. Recent prompt-based parameter-efficient continual user modeling (PECUM) addresses this challenge by training only a few parameters, thus reducing the training cost. However, prompt tuning can yield homogeneous task embeddings and converge slowly compared to adapters. Hence, we propose a novel framework that integrates SVD-decomposed low-rank adapters into continual user modeling, which can be interpreted as a relaxed mixture of rank-1 experts. We further develop a novel attention mechanism that selectively weighs experts trained on semantically similar past tasks, and we jointly learn their attention coefficients along with newly added adapters, enabling interference-free knowledge transfer. We show the effectiveness of our proposed method on two real-world datasets.
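The relaxed mixture of rank-1 experts can be sketched as an attention-weighted sum of outer products added to a frozen weight; the shapes and names below are illustrative assumptions, not the paper's code.

    import numpy as np

    def softmax(x):
        e = np.exp(x - np.max(x))
        return e / e.sum()

    def adapted_weight(W, U, V, task_query, task_keys):
        # W: (d_out, d_in) frozen backbone weight.
        # U: (d_out, r), V: (d_in, r); columns are the rank-1 experts u_i, v_i.
        # task_keys: (r, k), one key per expert; task_query: (k,) for the new task.
        attn = softmax(task_keys @ task_query)      # attention over past-task experts
        delta = (U * attn) @ V.T                    # sum_i attn_i * u_i v_i^T
        return W + delta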
Personalized Autocompletion of Interactions with LLM-based Chatbots
Short papers | Conversational search and recommendation | Large Language Models | Recommender systems | 01:30 PM - 02:30 PM (Europe/Amsterdam) | 2026/03/30 11:30:00 UTC - 2026/03/30 12:30:00 UTC
Composing messages in chatbot interactions is often time-consuming, making autocompletion an appealing way to reduce user effort. Different users have different preferences and therefore different expectations from autocompletion solutions. We study how personalization can improve the autocompletion process, evaluating four schemes defined along two axes: generation vs. ranking, and prior messages vs. external features. Experiments on the WildChat and PRISM datasets with the Mistral-7B and Phi-3.5-mini models show consistent gains. Our results highlight personalization as a key factor in building effective chatbot autocomplete systems, and assist researchers and practitioners in deciding where and how to invest in improving these solutions.
Evaluating Large Language Models as Domain-Specific Retrieval Agents: A Study on Cybersecurity Challenge Benchmarks
Short papers | IR evaluation | Large Language Models | System aspects | 01:30 PM - 02:30 PM (Europe/Amsterdam) | 2026/03/30 11:30:00 UTC - 2026/03/30 12:30:00 UTC
Large Language Models are increasingly used as retrieval and reasoning agents in specialized domains. This study evaluates their performance on cybersecurity Capture-the-Flag challenges, reframed as structured retrieval tasks where models must infer information from textual and code-based evidence. Using three public benchmarks, NYU CSAW, CyBench, and InterCode-CTF, we compare five recent LLMs within a unified and reproducible evaluation framework. Results show significant variation across datasets and task categories, with performance differences across models. The proposed benchmark provides a foundation for assessing domain-specific retrieval and reasoning.
Evaluating Retrieval-Augmented Generation Systems on Unanswerable, Uncheatable, Realistic, Multi-hop Queries
Short papers | Large Language Models | Retrieval-Augmented Generation | 01:30 PM - 02:30 PM (Europe/Amsterdam) | 2026/03/30 11:30:00 UTC - 2026/03/30 12:30:00 UTC
Real-world use cases often present RAG systems with complex queries for which relevant information is missing from the corpus or is incomplete. In these settings, RAG systems must be able to reject unanswerable, out-of-scope queries and identify failures of retrieval and multi-hop reasoning. Despite this, existing RAG benchmarks rarely reflect realistic task complexity for multi-hop or out-of-scope questions, which can often be cheated via disconnected reasoning (i.e., solved without genuine multi-hop inference) or require only simple factual recall. This limits the ability of such benchmarks to uncover the limitations of existing RAG systems. To address this gap, we present the first pipeline for automatic, difficulty-controlled creation of uncheatable, realistic, unanswerable, and multi-hop queries (CRUMQs), adaptable to any corpus and domain. We use our pipeline to create CRUMQs over two popular RAG datasets and demonstrate its effectiveness via benchmark experiments on leading retrieval-augmented LLMs. Results show that, compared to prior RAG benchmarks, CRUMQs are highly challenging for RAG systems and achieve up to an 81.0% reduction in cheatability scores. More broadly, our pipeline offers a simple way to enhance benchmark difficulty and realism and to drive development of more capable RAG systems.
William Walden, Human Language Technology Center of Excellence, Johns Hopkins University; Eugene Yang, Research Scientist, Human Language Technology Center of Excellence, Johns Hopkins University