
Poster Session (Short papers)

Session Information

Short papers:

Multi-Step Semantic Reasoning in Generative Retrieval
Steven Dong, Yubao Tang and Maarten de Rijke

SSEmb: A Joint Structural and Semantic Embedding Framework for Mathematical Formula Retrieval
Ruyin Li and Xiaoyu Chen

On the Viability of Exploiting Large Language Models for Misinformation Annotation
Pablo Landrove, Marcos Fernandez-Pichel and David E. Losada

Incorporating Q&A Nuggets into Retrieval-Augmented Generation
Laura Dietz, Bryan Li, Gabrielle Liu, Jia-Huei Ju, Eugene Yang, Dawn Lawrie, William Walden and James Mayfield

Evolving Mixture of Low-Rank Experts for Continual User Modeling
Jeevan Thapa, Sinan Zhao and Koyoshi Shindo

Personalized Autocompletion of Interactions with LLM-based Chatbots
Shani Goren, Nachshon Cohen, Oren Kalinsky, Tomer Stav, Yaron Fairstein, Yuri Rapoport, Ram Yazdi, Alex Libov and Guy Kushilevitz

Evaluating Large Language Models as Domain-Specific Retrieval Agents: A Study on Cybersecurity Challenge Benchmarks
Omed Abed, Md. Samiul Haque, Patrick-Benjamin Bök and Matteo Große-Kampmann

Large Language Models as Assessors: On the Impact of Relevance Scales
Riccardo Zamolo, Riccardo Lunardi, Michael Soprano, Gianluca Demartini, Stefano Mizzaro and Kevin Roitero

Analyzing AI Evaluation Benchmarks Through Information Retrieval and Network Science
Gaia Simeoni, Michael Soprano, Riccardo Lunardi, Kevin Roitero and Stefano Mizzaro

Evaluating Retrieval-Augmented Generation Systems on Unanswerable, Uncheatable, Realistic, Multi-hop Queries
Gabrielle Liu, Bryan Li, Arman Cohan, William Walden and Eugene Yang

DARE: A Dialectical Framework for Adversarial and Evidence-Aware RAG
Saisab Sadhu, Dwaipayan Roy and Tannay Basu

Do We Still Need Text for Video Retrieval in the Era of Vision-Language Models?
Jiaqi Zhan, Xinyu Zhang, Shengyao Zhuang, Xueguang Ma and Jimmy Lin

Query Performance Prediction using a Child-focused Definition of Relevance
Hrishita Chakrabarti and Maria Soledad Pera

ReFormeR: Learning and Applying Explicit Query Reformulation Patterns
Amin Bigdeli, Mert Incesu, Negar Arabzadeh, Charles L. A. Clarke and Ebrahim Bagheri

One Word is Enough: Minimal Adversarial Perturbations for Neural Text Ranking
Tanmay Karmakar, Sourav Saha, Debapriyo Majumdar and Surjyanee Halder

Text vs. Speech? Detecting Audio Deepfakes on Instagram
Karla Schäfer

MiNER: A Two-Stage Pipeline for Metadata Extraction from Municipal Meeting Minutes
Rodrigo Batista, Filipe Cunha, Purificação Silvano, Nuno Guimarães, Alípio Jorge, Evelin Amorim and Ricardo Campos

Revisiting Human-vs-LLM judgments on the TREC Podcast Track
Watheq Mansour, J. Shane Culpepper, Joel Mackenzie and Andrew Yates

Forward Index Compression for Learned Sparse Retrieval
Sebastian Bruch, Martino Fontana, Franco Maria Nardini, Cosimo Rulli and Rossano Venturini

Mar 30, 2026, 13:30 - 14:30 (Europe/Amsterdam)
Venue: Chemie & Chaos

Sub Sessions

Analyzing AI Evaluation Benchmarks Through Information Retrieval and Network Science

Short papers | IR evaluation | 01:30 PM - 02:30 PM (Europe/Amsterdam) | 2026/03/30 11:30:00 UTC - 2026/03/30 12:30:00 UTC
Many analyses have been performed on Information Retrieval (IR) evaluation benchmarks, with many different approaches. Benchmarking also plays a central role in evaluating the capabilities of Large Language Models (LLMs). However, recent concerns have emerged regarding the robustness of benchmarks and the reliability of leaderboard rankings. In this paper, we apply an IR approach to LLM evaluation: we analyze LLM benchmark results through the lens of network science. Adapting a method developed for TREC test collections, we construct a bipartite graph between models and benchmark questions and apply Kleinberg's HITS algorithm to uncover latent structure in the evaluation data. In this framework, model hubness quantifies a model's tendency to perform well on easy questions, while question hubness captures its ability to discriminate between more and less effective models. This graph-based view provides a principled way to diagnose benchmark-induced biases and to assess the reliability of model rankings. We conduct experiments on seven multiple-choice QA benchmarks with a pool of 34 LLMs. Through this IR-inspired approach, we show that model rankings on leaderboards are strongly influenced by subsets of easy questions.
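
As a rough, illustrative sketch of the graph-based analysis described above (toy data only; not the authors' implementation), a bipartite model-question graph can be built from a correctness matrix and analyzed with Kleinberg's HITS via networkx:

```python
# Illustrative sketch only: link each model to the benchmark questions it
# answers correctly, then run HITS to obtain hub/authority scores.
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
n_models, n_questions = 5, 20
correct = rng.random((n_models, n_questions)) > 0.4   # correct[m, q]: model m solves question q (toy data)

G = nx.DiGraph()
for m in range(n_models):
    for q in range(n_questions):
        if correct[m, q]:
            G.add_edge(f"model_{m}", f"question_{q}")  # edge: model -> question it solves

hubs, authorities = nx.hits(G, max_iter=1000)
# Model hubness: tendency to perform well on "easy" (high-authority) questions.
model_hubness = {n: s for n, s in hubs.items() if n.startswith("model_")}
for name, score in sorted(model_hubness.items(), key=lambda kv: -kv[1]):
    print(name, round(score, 4))
```
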
Presenters
Gaia Simeoni
Co-Authors
Michael Soprano
University Of Udine
Riccardo Lunardi
University Of Udine
Kevin Roitero
Tenure Track Assistant Professor, University Of Udine
Stefano Mizzaro
University Of Udine

Large Language Models as Assessors: On the Impact of Relevance Scales

Short papers | IR evaluation | Large Language Models | 01:30 PM - 02:30 PM (Europe/Amsterdam) | 2026/03/30 11:30:00 UTC - 2026/03/30 12:30:00 UTC
Traditionally, the task of relevance judgment has relied on human annotators, but recent advances in Large Language Models (LLMs) have prompted growing interest in their use as proxies for fully- or semi-automated relevance judgments. In this setting, a key yet underexplored factor is the relevance scale adopted when judging relevance. Relevance scales range from binary to fine-grained ones, and their impact on the effectiveness of LLM-based judgments, the effect of scale conversions, and their role in the presence of potential data contamination are as yet unknown. In this paper, we systematically investigate how different scales, as well as conversions between them, affect the ability of LLMs to provide reliable point-wise relevance judgments, across multiple prompting strategies and model sizes. Using a popular TREC collection, we compare model outputs against both crowd and expert annotations, analyzing their alignment, stability, and potential data contamination issues.
Presenters
Riccardo Zamolo
Co-Authors
Riccardo Lunardi
University Of Udine
Michael Soprano
University Of Udine
Gianluca Demartini
The University Of Queensland
Stefano Mizzaro
University Of Udine
Kevin Roitero
Tenure Track Assistant Professor, University Of Udine

SSEmb: A Joint Structural and Semantic Embedding Framework for Mathematical Formula Retrieval

Short papers | 01:30 PM - 02:30 PM (Europe/Amsterdam) | 2026/03/30 11:30:00 UTC - 2026/03/30 12:30:00 UTC
Formula retrieval is an important topic in Mathematical Information Retrieval. We propose SSEmb, a novel embedding framework capable of capturing both structural and semantic features of formulas. Structurally, we employ Graph Contrastive Learning to encode formulas represented as Shared-substructure Operator Graphs. To enhance structural diversity while preserving mathematical validity of these formula graphs, we introduce a novel graph data augmentation approach through a substitution strategy. Semantically, we utilize Sentence-BERT to encode the surrounding text of formulas. Finally, for each query and its candidates, structural and semantic similarities are calculated separately and then fused through a weighted scheme. In the ARQMath-3 Formula Retrieval Task, SSEmb outperforms existing embedding-based methods by over 5 percentage points on P'@10 and nDCG'@10. Furthermore, SSEmb enhances the performance of all runs of other methods and achieves state-of-the-art results when combined with Approach0.
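
A minimal sketch of the fusion step the abstract describes, assuming a simple convex combination of cosine similarities (the weight alpha and the embedding sizes are illustrative choices, not values from the paper):

```python
# Sketch of late fusion of structural and semantic similarity (assumed form).
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def fused_score(query, candidate, alpha=0.6):
    s_struct = cosine(query["graph_emb"], candidate["graph_emb"])  # operator-graph embedding (GCL)
    s_sem = cosine(query["text_emb"], candidate["text_emb"])       # Sentence-BERT embedding of surrounding text
    return alpha * s_struct + (1.0 - alpha) * s_sem                # weighted fusion of the two similarities

q = {"graph_emb": np.random.rand(128), "text_emb": np.random.rand(384)}
c = {"graph_emb": np.random.rand(128), "text_emb": np.random.rand(384)}
print(fused_score(q, c))
```
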
Presenters
Ruyin Li
Co-Authors
Xiaoyu Chen
Associate Professor, Beihang University

One Word is Enough: Minimal Adversarial Perturbations for Neural Text Ranking

Short papers | IR evaluation | Machine learning | Search and ranking | 01:30 PM - 02:30 PM (Europe/Amsterdam) | 2026/03/30 11:30:00 UTC - 2026/03/30 12:30:00 UTC
Presenters
Tanmay Karmakar
Indian Statistical Institute Kolkata
Co-Authors
Sourav Saha
Indian Statistical Institute Kolkata
Debapriyo Majumdar
Assistant Professor, Indian Statistical Institute
Surjyanee Halder
Indian Statistical Institute Kolkata

ReFormeR: Learning and Applying Explicit Query Reformulation Patterns

Short papers | Generative IR | Search and ranking | 01:30 PM - 02:30 PM (Europe/Amsterdam) | 2026/03/30 11:30:00 UTC - 2026/03/30 12:30:00 UTC
We present ReFormeR, a pattern-guided approach for query reformulation. Instead of prompting a language model to generate reformulations of a query directly, ReFormeR first elicits short reformulation patterns from pairs of initial queries and empirically stronger reformulations, consolidates them into a compact library of transferable reformulation patterns, and then selects an appropriate reformulation pattern for a new query given its retrieval context. The selected pattern constrains query reformulation to controlled operations such as sense disambiguation, vocabulary grounding, or discriminative facet addition, to name a few. As such, our proposed approach makes the reformulation policy explicit through these reformulation patterns, guiding the LLM towards targeted and effective query reformulations. Our extensive experiments on TREC DL 2019, DL 2020, and DL Hard show consistent improvements over classical feedback methods and recent LLM-based query reformulation and expansion approaches.
Presenters
Amin Bigdeli
PhD Student, University Of Waterloo
Co-Authors
Mert Incesu
University Of Toronto
Negar Arabzadeh
University Of California, Berkeley
Charles Clarke
Professor, University Of Waterloo
Ebrahim Bagheri
University Of Toronto

Query Performance Prediction using a Child-focused Definition of Relevance

Short papers | IR applications | IR evaluation | Societally-motivated IR research | User aspects in IR | 01:30 PM - 02:30 PM (Europe/Amsterdam) | 2026/03/30 11:30:00 UTC - 2026/03/30 12:30:00 UTC
Query performance prediction (QPP) methods have primarily been tailored to mainstream users and thus rely on the traditional concept of relevance. For children, however, relevance goes beyond content-based resource-query matching, which is why we gauge the performance of existing QPP methods in estimating the fit of resources retrieved in response to child-formulated queries. Outcomes from our empirical exploration of various QPP methods, using a traditional and a child-focused definition of relevance on two datasets, reveal limitations in the adaptability of existing methods to the context of children's information retrieval.
Presenters
Hrishita Chakrabarti
PhD Candidate, Delft University Of Technology
Co-Authors
Maria Soledad Pera
Delft University Of Technology

Text vs. Speech? Detecting Audio Deepfakes on Instagram

Short papers | Explainability methods | IR applications | Societally-motivated IR research | 01:30 PM - 02:30 PM (Europe/Amsterdam) | 2026/03/30 11:30:00 UTC - 2026/03/30 12:30:00 UTC
With the increasing use of AI, deepfakes are becoming an increasingly prevalent threat in today's world. At the same time, the performance of most detectors drops significantly when faced with unseen data, whereas generation models are improving, resulting in fewer artefacts. We examined deepfakes published on Instagram, using the SocialDF dataset. In addition to analysing the deepfakes in the frequency domain using audio deepfake detectors, we transcribed the speech and analysed the text (e.g. emotion and topics) and the audio content (e.g. emotion and music genre). We found that audio deepfake detectors struggle to identify real-world deepfakes on Instagram. Furthermore, current audio deepfake detection uses audio artefacts only. Content is not used for detection purposes. We suggest using both the speech recording and the content. This approach improves results on real-world data and provides an explanation for the classification. Using content information, we outperformed frequency-based detection with an F1-score of 74.3%.
Presenters
Karla Schäfer
Fraunhofer SIT, ATHENE

MiNER: A Two-Stage Pipeline for Metadata Extraction from Municipal Meeting Minutes

Short papers | IR applications | Machine learning | 01:30 PM - 02:30 PM (Europe/Amsterdam) | 2026/03/30 11:30:00 UTC - 2026/03/30 12:30:00 UTC
Municipal meeting minutes are official documents of local governance, exhibiting heterogeneous formats and writing styles. Effective information retrieval (IR) requires identifying metadata such as meeting number, date, location, participants, and start/end times, elements that are rarely standardized or easy to extract automatically. Existing named entity recognition (NER) models are ill-suited to this task, as they are not adapted to such domain-specific categories. In this paper, we propose a two-stage pipeline for metadata extraction from municipal minutes. First, a question answering (QA) model identifies the opening and closing text segments containing metadata. Transformer-based models (BERTimbau and XLM-RoBERTa with and without a CRF layer) are then applied for fine-grained entity extraction and enhanced through deslexicalization. To evaluate our proposed pipeline, we benchmark both open-weight (Phi) and closed-weight (Gemini) LLMs, assessing predictive performance, inference cost, and carbon footprint. Our results demonstrate strong in-domain performance, better than larger general-purpose LLMs. However, cross-municipality evaluation reveals reduced generalization reflecting the variability and linguistic complexity of municipal records. This work establishes the first benchmark for metadata extraction from municipal meeting minutes, providing a solid foundation for future research in this domain.
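
A rough sketch of the QA-then-NER flow described above, using Hugging Face pipelines; the model names are placeholders (the paper's fine-tuned checkpoints are not assumed to be public), so this only illustrates the two-stage structure:

```python
# Stage 1: a QA model locates the opening segment carrying the metadata.
# Stage 2: a token-classification model extracts fine-grained entities from it.
from transformers import pipeline

qa = pipeline("question-answering",
              model="deepset/xlm-roberta-base-squad2")        # example multilingual QA model
ner = pipeline("token-classification",
               model="your-org/minutes-metadata-ner",         # hypothetical fine-tuned NER model
               aggregation_strategy="simple")

minutes_text = open("minutes.txt", encoding="utf-8").read()   # placeholder input file

hit = qa(question="Where and when did the meeting start?", context=minutes_text)
start = max(0, hit["start"] - 200)
segment = minutes_text[start:hit["end"] + 200]                 # opening segment around the QA span

for ent in ner(segment):
    print(ent["entity_group"], "->", ent["word"])
```
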
Presenters
Rodrigo Batista
Student, Faculdade De Ciências, Universidade Do Porto, Porto
Co-Authors
Luis Filipe Cunha
PhD Student, University Of Porto | INESC TEC
Purificação Silvano
Professor, University Of Porto; INESC TEC
Nuno Guimarães
Researcher, INESC TEC
Alípio Jorge
Professor, Universidade Do Porto / INESC TEC
Evelin Amorim
University Of Porto; INESC TEC
Ricardo Campos
Professor, University Of Beira Interior / INESC TEC

Forward Index Compression for Learned Sparse Retrieval

Short papers | 01:30 PM - 02:30 PM (Europe/Amsterdam) | 2026/03/30 11:30:00 UTC - 2026/03/30 12:30:00 UTC
Text retrieval using learned sparse representations of queries and documents has, over the years, evolved into a highly effective approach to search. It is thanks to recent advances in approximate nearest neighbor search---with the emergence of highly efficient algorithms such as the inverted index-based (Seismic) and the graph-based (HNSW)---that retrieval with sparse representations became viable in practice. In this work, we scrutinize the efficiency of sparse retrieval algorithms and focus particularly on the size of a data structure that is common to all algorithmic flavors and that constitutes a substantial fraction of the overall index size: the forward index. In particular, we seek compression techniques to reduce the storage footprint of the forward index without compromising search quality or inner product computation latency. In our examination with various integer compression techniques, we report that StreamVByte achieves the best trade-off between memory footprint, retrieval accuracy, and latency. We then improve StreamVByte by introducing DotVByte, a new algorithm tailored to inner product computation. Experiments on MSMARCO show that our improvements lead to significant space savings while maintaining retrieval efficiency.
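
For illustration only, a classic variable-byte (VByte) codec for the quantized integer values held in a forward index; the paper's StreamVByte and DotVByte are SIMD-oriented codecs with a different byte layout, so this sketch merely shows the space/decoding trade-off the abstract discusses:

```python
# Toy VByte codec: each 7-bit group goes in one byte; the high bit marks
# the final byte of a value.
def vbyte_encode(values):
    out = bytearray()
    for v in values:
        while v >= 128:
            out.append(v & 0x7F)
            v >>= 7
        out.append(v | 0x80)
    return bytes(out)

def vbyte_decode(data):
    values, v, shift = [], 0, 0
    for b in data:
        if b & 0x80:                       # last byte of this value
            values.append(v | ((b & 0x7F) << shift))
            v, shift = 0, 0
        else:
            v |= (b & 0x7F) << shift
            shift += 7
    return values

weights = [3, 130, 7, 70000, 12]           # e.g. quantized impact scores
blob = vbyte_encode(weights)
assert vbyte_decode(blob) == weights
print(len(blob), "bytes vs", 4 * len(weights), "bytes uncompressed")
```
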
Presenters
Sebastian Bruch
Northeastern University
Co-Authors
Martino Fontana
University Of Pisa
Franco Maria Nardini
Research Director, ISTI-CNR
Cosimo Rulli
Researcher, ISTI-CNR
Rossano Venturini
University Of Pisa

Revisiting Human-vs-LLM judgments on the TREC Podcast Track

Short papers | Large Language Models | Search and ranking | 01:30 PM - 02:30 PM (Europe/Amsterdam) | 2026/03/30 11:30:00 UTC - 2026/03/30 12:30:00 UTC
Using large language models (LLMs) to annotate relevance is an increasingly important problem in the information retrieval community. While some studies demonstrate that LLMs can achieve high agreement with ground-truth (human) judgments, other studies have argued for the opposite conclusion. To the best of our knowledge, these studies have primarily focused on classic ad-hoc text search scenarios. In this paper, we conduct an analysis of agreement between LLMs and human experts, and explore the impact disagreement has on system rankings. In contrast to prior studies, we focus on a collection composed of audio files that are transcribed into two-minute segments -- the TREC 2020 and 2021 Podcast Tracks. We employ five different LLMs to re-assess all of the query-segment pairs that were originally annotated by TREC assessors. Furthermore, we re-assess a small subset of pairs where the LLMs and TREC assessors have the highest disagreement, and find that human experts tend to agree with the LLMs more than with the TREC assessors. Our results reinforce Sormunen's 2002 insight that relying on a single assessor leads to lower agreement.
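
A minimal sketch (toy labels; assumed setup, not the authors' code) of how agreement between human and LLM relevance labels can be quantified, using Cohen's kappa from scikit-learn:

```python
# Agreement between TREC assessor grades and LLM-produced grades.
from sklearn.metrics import cohen_kappa_score

trec_labels = [2, 0, 1, 1, 0, 2, 1]   # toy relevance grades from human assessors
llm_labels  = [2, 0, 1, 2, 0, 1, 1]   # toy grades from an LLM judge on the same pairs

print("Cohen's kappa:", cohen_kappa_score(trec_labels, llm_labels))
# For ranking impact, one would score each system under both label sets
# (e.g. nDCG) and correlate the two system orderings, e.g. with Kendall's tau.
```
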
Presenters
Watheq Mansour
PhD Student, The University Of Queensland
Co-Authors
J. Shane Culpepper
The University Of Queensland
Joel Mackenzie
Senior Lecturer, The University Of Queensland
Andrew Yates
Johns Hopkins University, HLTCOE

Do We Still Need Text for Video Retrieval in the Era of Vision-Language Models?

Short papers | 01:30 PM - 02:30 PM (Europe/Amsterdam) | 2026/03/30 11:30:00 UTC - 2026/03/30 12:30:00 UTC
Effective video retrieval has historically relied heavily on textual descriptions and metadata. However, recent advances in vision-language models (VLMs) prompt the question: Are text features still essential for effective video retrieval? In this work, we investigate this question using a unified multimodal retrieval framework based on advanced VLM embeddings. Evaluating on the comprehensive and multilingual MultiVENT 2.0 dataset from the MAGMaR shared task, we show that multimodal retrieval systems, combining visual frames, audio signals, and textual descriptions, surpass traditional text-only retrieval performance. Remarkably, our results demonstrate that retrieval based solely on non-text modalities (vision and audio) achieves performance comparable to text-based methods, indicating that explicit text input may no longer be strictly necessary.
Presenters
Jiaqi Zhan
University Of Waterloo
Co-Authors
Xinyu Zhang
University Of Waterloo
Shengyao Zhuang
CSIRO
Xueguang Ma
PhD Candidate, University Of Waterloo
Jimmy Lin
Professor, University Of Waterloo

DARE: A Dialectical Framework for Adversarial and Evidence-Aware RAG

Short papers | Generative IR | IR applications | Large Language Models | Retrieval-Augmented Generation | System aspects | 01:30 PM - 02:30 PM (Europe/Amsterdam) | 2026/03/30 11:30:00 UTC - 2026/03/30 12:30:00 UTC
Retrieval-Augmented Generation (RAG) systems are susceptible to factual inconsistencies when retrieved evidence is conflicting, a common issue with open-domain sources. Prevailing multi-agent approaches attempt to resolve this through unstructured debates that treat all information sources as equally credible. Concurrently, reliability-aware systems address source quality but typically only as a weighting factor during final aggregation, failing to integrate this crucial signal into the reasoning process itself. This paper proposes DARE (Dialectical Adversarial RAG Engine), a novel framework that implements a formal dialectical process to resolve such conflicts through an evidence-aware adversarial agent that initiates a structured cross-examination of claims made by other agents. This process forces each claim to be defended against the complete set of source documents, allowing the system to dynamically infer an argument's credibility based on its logical resilience. By structuring the debate as a formal dialectic, DARE provides a more robust and principled mechanism for synthesizing truth from unreliable and conflicting information. Our empirical analysis confirms this: DARE outperforms state-of-the-art approaches in exact-match accuracy.
Presenters
Saisab Sadhu
Master's Student, Indian Institute Of Science Education And Research Bhopal
Co-Authors
Dwaipayan Roy
Assistant Professor, Indian Institute Of Science Education And Research Kolkata
Tannay Basu
Indian Institute Of Science Education And Research Bhopal

Multi-Step Semantic Reasoning in Generative Retrieval

Short papers | Generative IR | Search and ranking | 01:30 PM - 02:30 PM (Europe/Amsterdam) | 2026/03/30 11:30:00 UTC - 2026/03/30 12:30:00 UTC
Generative retrieval (GR) models encode a corpus within model parameters and generate relevant document identifiers directly for a given query. While this paradigm shows promise in retrieval tasks, existing GR models struggle with complex queries in numerical contexts, such as those involving semantic reasoning over financial reports, due to limited reasoning capabilities. This limitation leads to suboptimal retrieval accuracy and hinders practical applicability. We propose ReasonGR, a framework designed to enhance multi-step semantic reasoning in numerical contexts within GR. ReasonGR employs a structured prompting strategy combining task-specific instructions with stepwise reasoning guidance to better address complex retrieval queries. Additionally, it integrates a reasoning-focused adaptation module to improve learning of reasoning-related parameters. Experiments on the FinQA dataset, which contains financial queries over complex documents, demonstrate that ReasonGR improves retrieval accuracy and consistency, indicating its potential for advancing GR models in reasoning-intensive retrieval scenarios.
Presenters
Steven Dong
Student, University Of Amsterdam
Co-Authors
Yubao Tang
University Of Amsterdam
Maarten De Rijke
Distinguished University Professor, University Of Amsterdam

On the Viability of Exploiting Large Language Models for Misinformation Annotation

Short papers | IR evaluation | Large Language Models | 01:30 PM - 02:30 PM (Europe/Amsterdam) | 2026/03/30 11:30:00 UTC - 2026/03/30 12:30:00 UTC
This paper investigates the potential of LLMs for automatically annotating the usefulness, supportiveness, and credibility of search results. These aspects, while essential to the construction of misinformation benchmarks, are expensive and difficult to obtain at scale. Our comparative study suggests that, under certain conditions, LLMs can provide reasonable estimates of usefulness and supportiveness. In contrast, credibility judgments generated by LLMs show almost no agreement with human assessments. This raises concerns for the exploitation of LLMs to assist in the construction of collections that require annotations that go beyond relevance.
Presenters
Pablo Landrove
Universidade De Santiago De Compostela
Co-Authors
Marcos Fernández-Pichel
Assistant Professor, Universidade De Santiago De Compostela
David E. Losada
Universidade De Santiago De Compostela

Incorporating Q&A Nuggets into Retrieval-Augmented Generation

Short papers | Retrieval-Augmented Generation | 01:30 PM - 02:30 PM (Europe/Amsterdam) | 2026/03/30 11:30:00 UTC - 2026/03/30 12:30:00 UTC
Presenters
Laura Dietz
Associate Professor, University Of New Hampshire
Co-Authors
Bryan Li
University Of Pennsylvania
Gabrielle Liu
Yale University
Jia-Huei Ju
PhD Student, University Of Amsterdam
Eugene Yang
Research Scientist, Human Language Technology Center Of Excellence, Johns Hopkins University
Dawn Lawrie
Senior Research Scientist, HLTCOE At Johns Hopkins University
William Walden
Human Language Technology Center Of Excellence, Johns Hopkins University
James Mayfield
Principal Computer Scientist, JHU HLTCOE

Evolving Mixture of Low-Rank Experts for Continual User Modeling

Short papers | Machine learning | Recommender systems | User aspects in IR | 01:30 PM - 02:30 PM (Europe/Amsterdam) | 2026/03/30 11:30:00 UTC - 2026/03/30 12:30:00 UTC
Building a user model that incorporates diverse tasks remains a big challenge. While continual learning offers an alternative to multi-task learning by eliminating the need for retraining on all past tasks, prior works train the whole network backbone along with task-specific masks, which becomes computationally inefficient. Recent prompt-based parameter-efficient continual user modeling (PECUM) addresses this challenge by training only a few parameters, thus reducing the training cost. However, prompt tuning can yield homogeneous task embeddings and converge slowly compared to adapters. Hence, we propose a novel framework to integrate SVD-decomposed low-rank adapters into continual user modeling, which can be interpreted as a relaxed mixture of rank-1 experts. We further develop a novel attention framework that selectively weighs experts trained by semantically similar past tasks, and we jointly learn their attention coefficients along with newly added adapters, enabling interference-free knowledge transfer. We show the effectiveness of our proposed method on two real-world datasets.
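
As a rough sketch of the core idea (assumptions: PyTorch, a single linear layer, randomly initialized rank-1 factors; this is not the paper's architecture), low-rank adapters can be viewed as a mixture of rank-1 experts whose contributions are weighted by learned attention coefficients:

```python
# Frozen base layer plus an attention-weighted sum of rank-1 updates u_k v_k^T.
import torch
import torch.nn as nn

class Rank1ExpertMixture(nn.Module):
    def __init__(self, d_in, d_out, n_experts):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        for p in self.base.parameters():              # backbone stays frozen
            p.requires_grad_(False)
        self.u = nn.Parameter(torch.randn(n_experts, d_out) * 0.01)  # left factors
        self.v = nn.Parameter(torch.randn(n_experts, d_in) * 0.01)   # right factors
        self.attn = nn.Parameter(torch.zeros(n_experts))              # expert attention logits

    def forward(self, x):
        w = torch.softmax(self.attn, dim=0)                           # expert weights
        delta = torch.einsum("k,ko,ki->oi", w, self.u, self.v)        # low-rank weight update
        return self.base(x) + x @ delta.T

layer = Rank1ExpertMixture(64, 32, n_experts=4)
print(layer(torch.randn(8, 64)).shape)   # torch.Size([8, 32])
```
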
Presenters
Jeevan Thapa
Rochester Institute Of Technology
Co-Authors
Koyoshi Shindo
Senior Applied Scientist, Zillow Group

Personalized Autocompletion of Interactions with LLM-based Chatbots

Short papers | Conversational search and recommendation | Large Language Models | Recommender systems | 01:30 PM - 02:30 PM (Europe/Amsterdam) | 2026/03/30 11:30:00 UTC - 2026/03/30 12:30:00 UTC
Composing messages in chatbot interactions is often time-consuming, making autocompletion an appealing way to reduce user effort. Different users have different preferences and therefore different expectations from autocompletion solutions. We study how personalization can improve the autocompletion process, evaluating four schemes defined along two axes: generation vs. ranking, and prior messages vs. external features. Experiments on the WildChat and PRISM datasets with the Mistral-7B and Phi-3.5-mini models show consistent gains. Our results highlight personalization as a key factor in building effective chatbot autocomplete systems, and assist researchers and practitioners in deciding where and how to invest in improving these solutions.

Evaluating Large Language Models as Domain-Specific Retrieval Agents: A Study on Cybersecurity Challenge Benchmarks

Short papers | IR evaluation | Large Language Models | System aspects | 01:30 PM - 02:30 PM (Europe/Amsterdam) | 2026/03/30 11:30:00 UTC - 2026/03/30 12:30:00 UTC
Large Language Models are increasingly used as retrieval and reasoning agents in specialized domains. This study evaluates their performance on cybersecurity Capture-the-Flag challenges, reframed as structured retrieval tasks where models must infer information from textual and code-based evidence. Using three public benchmarks, NYU CSAW, CyBench, and InterCode-CTF, we compare five recent LLMs within a unified and reproducible evaluation framework. Results show significant variation across datasets and task categories, with performance differences across models. The proposed benchmark provides a foundation for assessing domain-specific retrieval and reasoning.
Presenters
Omed Abed
Hochschule Rhein-Waal
Co-Authors
Md. Samiul Haque
Patrick-Benjamin Bök
Rhine-Waal University Of Applied Sciences
Matteo Große-Kampmann
Rhine-Waal University Of Applied Sciences

Evaluating Retrieval-Augmented Generation Systems on Unanswerable, Uncheatable, Realistic, Multi-hop Queries

Short papers | Large Language Models | Retrieval-Augmented Generation | 01:30 PM - 02:30 PM (Europe/Amsterdam) | 2026/03/30 11:30:00 UTC - 2026/03/30 12:30:00 UTC
Real-world use cases often present RAG systems with complex queries for which relevant information is missing from the corpus or is incomplete. In these settings, RAG systems must be able to reject unanswerable, out-of-scope queries and identify failures of retrieval and multi-hop reasoning. Despite this, existing RAG benchmarks rarely reflect realistic task complexity for multi-hop or out-of-scope questions, which can often be cheated via disconnected reasoning (i.e., solved without genuine multi-hop inference) or require only simple factual recall. This limits the ability of such benchmarks to uncover limitations of existing RAG systems. To address this gap, we present the first pipeline for automatic, difficulty-controlled creation of uncheatable, realistic, unanswerable, and multi-hop queries (CRUMQs), adaptable to any corpus and domain. We use our pipeline to create CRUMQs over two popular RAG datasets and demonstrate its effectiveness via benchmark experiments on leading retrieval-augmented LLMs. Results show that, compared to prior RAG benchmarks, CRUMQs are highly challenging for RAG systems, achieving up to an 81.0% reduction in cheatability scores. More broadly, our pipeline offers a simple way to enhance benchmark difficulty and realism and to drive development of more capable RAG systems.
Presenters
Gabrielle Liu
Yale University
Co-Authors
Bryan Li
University Of Pennsylvania
Arman Cohan
William Walden
Human Language Technology Center Of Excellence, Johns Hopkins University
Eugene Yang
Research Scientist, Human Language Technology Center Of Excellence, Johns Hopkins University