
Reproducibility II: Retrieval

Session Information

  • Fast, Compact, Dynamic Indexing for Learned Sparse Retrieval Systems
  • Down with the Hierarchy: The 'H' in HNSW Stands for "Hubs"
  • Multivector Reranking in the Era of Strong First-Stage Retrievers
  • Temporal Fact Conflicts in LLMs: Reproducibility Insights from Unifying DYNAMICQA and MULAN
Mar 30, 2026 16:00 - 17:00 (Europe/Amsterdam)
Venue : Chaos

Sub Sessions

Fast, Compact, Dynamic Indexing for Learned Sparse Retrieval Systems

Reproducibility 04:00 PM - 05:00 PM (Europe/Amsterdam) 2026/03/30 14:00:00 UTC - 2026/03/30 15:00:00 UTC
Learned sparse retrieval (LSR) is an emerging paradigm that uses pretrained language models to assign learned weights to the terms of a document, enabling practitioners to deploy next-generation rankers within their existing lexical retrieval pipelines. Although LSR systems have been found to provide strong increases in effectiveness over traditional statistical approaches, this boost comes at the cost of both indexing and retrieval efficiency. In this work, we explore the application of LSR to a practical online setting where new documents must be indexed and searchable as soon as they arrive. In particular, we create a clean-room re-implementation of the current state-of-the-art linked block dynamic indexing approach, and propose a set of important augmentations that enable efficient online indexing and query processing to generalize to the learned sparse regime. Our results over two traditional and two LSR models, and a multitude of experimental settings, demonstrate the practicality of our approach, allowing new documents to be queried at a reasonable latency while maintaining fast insertion ability.
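The linked-block idea behind this kind of dynamic index can be sketched in a few lines of Python. This is an illustrative toy (the class name, block size, and term-at-a-time scoring are assumptions for exposition, not the authors' implementation): postings for each term live in a chain of fixed-size blocks, so inserting a new document only appends to the tail block rather than rewriting a whole postings list.

```python
from collections import defaultdict

BLOCK_SIZE = 4  # tiny for illustration; real systems use much larger blocks

class LinkedBlockIndex:
    """Toy dynamic inverted index: each term's postings are a chain of
    fixed-size blocks, so inserting a document only appends to the tail
    block (or links a fresh one) instead of rewriting the whole list."""

    def __init__(self):
        self.postings = defaultdict(list)  # term -> list of blocks

    def insert(self, doc_id, term_weights):
        # term_weights: learned (or statistical) term -> weight map
        for term, weight in term_weights.items():
            blocks = self.postings[term]
            if not blocks or len(blocks[-1]) == BLOCK_SIZE:
                blocks.append([])  # link a new tail block
            blocks[-1].append((doc_id, weight))

    def search(self, query_weights, k=10):
        # Term-at-a-time dot-product scoring over the block chains.
        scores = defaultdict(float)
        for term, qw in query_weights.items():
            for block in self.postings.get(term, []):
                for doc_id, dw in block:
                    scores[doc_id] += qw * dw
        return sorted(scores.items(), key=lambda kv: -kv[1])[:k]
```

The point of the block chain is that insertion cost is bounded by the number of query terms in the new document, which is what makes documents searchable as soon as they arrive.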
Presenters
BR
Billy Rule
The University Of Queensland
Co-Authors
JM
Joel Mackenzie
Senior Lecturer, The University Of Queensland

Down with the Hierarchy: The 'H' in HNSW Stands for "Hubs"

Reproducibility 04:00 PM - 05:00 PM (Europe/Amsterdam) 2026/03/30 14:00:00 UTC - 2026/03/30 15:00:00 UTC
Driven by recent breakthrough advances in neural representation learning, approximate near-neighbor (ANN) search over vector embeddings has emerged as a critical computational workload. With the introduction of the seminal Hierarchical Navigable Small World (HNSW) algorithm, graph-based indexes have established themselves as the overwhelmingly dominant paradigm for efficient and scalable ANN search. As the name suggests, HNSW searches a layered hierarchical graph to quickly identify neighborhoods of similar points to a given query vector. But is this hierarchy even necessary? A rigorous experimental analysis to answer this question would provide valuable insights into the nature of algorithm design for ANN search and motivate directions for future work in this increasingly crucial domain. To that end, we conduct an extensive reproducibility study covering more large-scale datasets than prior investigations of this question. We ultimately find that a flat navigable small world graph retains all of the benefits of HNSW on high-dimensional datasets, with latency and recall performance essentially identical to the original algorithm but with less memory overhead. Furthermore, we go a step beyond prior works and study why the hierarchy of HNSW provides no benefit in high dimensions, hypothesizing that navigable small world graphs contain a well-connected, frequently traversed highway of hub nodes that maintain the same purported function as the hierarchical layers. We present compelling empirical evidence that the Hub Highway Hypothesis holds for real datasets and investigate the mechanisms by which the highway forms. The implications of this hypothesis may also provide future research directions in developing enhancements to graph-based ANN search.
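The "flat" variant the paper studies reduces to a greedy beam search over a single-layer navigable graph. The sketch below is an illustrative reduction under assumed inputs (a precomputed adjacency map and a vector table), not the authors' code or the full HNSW algorithm: it expands the closest unvisited node until no neighbour can improve the current top-`ef` frontier.

```python
import heapq
import math

def flat_graph_search(graph, vectors, query, entry, ef=8):
    """Greedy beam search over a flat navigable small world graph (no
    hierarchical layers): expand the closest unvisited candidate until
    no neighbour can improve the current top-ef frontier."""
    def dist(node):
        return math.dist(vectors[node], query)

    visited = {entry}
    candidates = [(dist(entry), entry)]   # min-heap of nodes to expand
    frontier = [(-dist(entry), entry)]    # max-heap of best ef found so far
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(frontier) >= ef and d > -frontier[0][0]:
            break  # nothing left that can enter the frontier
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(nb)
            if len(frontier) < ef or dn < -frontier[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(frontier, (-dn, nb))
                if len(frontier) > ef:
                    heapq.heappop(frontier)
    return sorted((-negd, n) for negd, n in frontier)
```

Under the Hub Highway Hypothesis, the well-connected hub nodes such a search traverses early play the same routing role that the upper layers of HNSW were designed to provide.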
Presenters
BM
Blaise Munyampirwa
Argmax
Co-Authors
VL
Vihan Lakshman
PhD Student, Massachusetts Institute Of Technology
BC
Benjamin Coleman
Google

Multivector Reranking in the Era of Strong First-Stage Retrievers

Reproducibility 04:00 PM - 05:00 PM (Europe/Amsterdam) 2026/03/30 14:00:00 UTC - 2026/03/30 15:00:00 UTC
Learned multivector representations power modern search systems with strong retrieval effectiveness, but their real-world use is limited by the high cost of exhaustive token-level retrieval. Therefore, most systems adopt a gather-and-refine strategy, where a lightweight gather phase selects candidates for full scoring. However, this approach requires expensive searches over large token-level indexes and often misses the documents that would rank highest under full similarity. In this paper, we reproduce several state-of-the-art multivector retrieval methods on two publicly available datasets, providing a clear picture of the current multivector retrieval field and observing the inefficiency of token-level gathering. Building on top of that, we show that replacing the token-level gather phase with a single-vector document retriever, specifically a learned sparse retriever (LSR), produces a smaller and more semantically coherent candidate set. This recasts the gather-and-refine pipeline into the well-established two-stage retrieval architecture. As retrieval latency decreases, query encoding with two neural encoders becomes the dominant computational bottleneck. To mitigate this, we integrate recent inference-free LSR methods, demonstrating that they preserve the retrieval effectiveness of the dual-encoder pipeline while substantially reducing query encoding time. Finally, we investigate multiple reranking configurations that balance efficiency, memory, and effectiveness, and we introduce two optimization techniques that prune low-quality candidates early. Empirical results show that these techniques improve retrieval efficiency by up to 1.8 times with no loss in quality. Overall, our two-stage approach achieves over 24 times speedup over state-of-the-art multivector retrieval systems, while maintaining comparable or superior retrieval quality.
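The recast pipeline (a sparse gather stage followed by a multivector refine stage) can be sketched as follows. The function names, toy index, and hand-written vectors are assumptions for illustration; real systems obtain both the sparse weights and the token vectors from learned encoders.

```python
def sparse_gather(inverted_index, query_terms, k):
    """Stage 1: cheap single-vector sparse retrieval to gather candidates."""
    scores = {}
    for term, qw in query_terms.items():
        for doc_id, dw in inverted_index.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + qw * dw
    return sorted(scores, key=scores.get, reverse=True)[:k]

def maxsim(query_vecs, doc_vecs):
    """Late-interaction (MaxSim) scoring: each query token keeps its
    best-matching document token score, summed over query tokens."""
    return sum(
        max(sum(q * d for q, d in zip(qv, dv)) for dv in doc_vecs)
        for qv in query_vecs
    )

def refine(candidates, query_vecs, doc_token_vecs):
    """Stage 2: full multivector scoring on the gathered candidates only."""
    scored = [(maxsim(query_vecs, doc_token_vecs[d]), d) for d in candidates]
    return sorted(scored, reverse=True)
```

The efficiency argument is that `refine` touches only the `k` gathered documents, so the expensive token-level MaxSim computation never runs over the full collection or a large token-level index.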
Presenters
SM
Silvio Martinico
PhD Student, University Of Pisa
Co-Authors
FN
Franco Maria Nardini
Research Director, ISTI-CNR
CR
Cosimo Rulli
Researcher, ISTI-CNR
RV
Rossano Venturini
University Of Pisa

Temporal Fact Conflicts in LLMs: Reproducibility Insights from Unifying DYNAMICQA and MULAN

Reproducibility 04:00 PM - 05:00 PM (Europe/Amsterdam) 2026/03/30 14:00:00 UTC - 2026/03/30 15:00:00 UTC
Large Language Models (LLMs) often struggle with temporal fact conflicts due to outdated or evolving information in their training data. Two recent studies with accompanying datasets report opposite conclusions on whether external context can effectively resolve such conflicts. DYNAMICQA evaluates how effective external context is in shifting the model's output distribution, finding that temporal facts are more resistant to change. In contrast, MULAN examines how often external context changes memorised facts, concluding that temporal facts are easier to update. In this reproducibility paper, we first reproduce experiments from both benchmarks. We then reproduce the experiments of each study on the dataset of the other to investigate the source of their disagreement. To enable direct comparison of findings, we standardise both datasets to align with the evaluation settings of each study. Importantly, using an LLM, we synthetically generate realistic natural language contexts to replace MULAN's programmatically constructed statements when reproducing the findings of DYNAMICQA. Our analysis reveals strong dataset dependence: MULAN's findings generalise under both methodological frameworks, whereas applying MULAN's evaluation to DYNAMICQA yields mixed outcomes. Finally, while the original studies only considered 7B LLMs, we reproduce these experiments across LLMs of varying sizes, revealing how scale influences the encoding and updating of temporal facts. Our results highlight how dataset design, evaluation metrics, and model size shape LLM behaviour in the presence of temporal knowledge conflicts. Our code and data are publicly available at https://anonymous.4open.science/r/Temporal-Conflict-Repro-30B7.
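The MULAN-style question ("how often does external context change a memorised fact?") boils down to a rate over conflicting items. The sketch below uses a hypothetical per-item record layout (the field names are assumptions, not either benchmark's actual schema) to make the measurement concrete: compare the model's closed-book answer, the answer asserted by the retrieved context, and the model's answer once that context is in the prompt.

```python
def context_update_rate(records):
    """Fraction of conflicting facts where supplying external context
    flipped the model from its memorised answer to the context's answer.
    Each record holds the model's closed-book answer ('memorised'), the
    answer asserted by the context ('context_answer'), and the model's
    answer with that context in the prompt ('with_context')."""
    conflicts = [r for r in records if r["memorised"] != r["context_answer"]]
    if not conflicts:
        return 0.0
    flipped = sum(
        1 for r in conflicts if r["with_context"] == r["context_answer"]
    )
    return flipped / len(conflicts)
```

A DYNAMICQA-style analysis instead measures how much the context shifts the model's output distribution rather than whether the top answer flips, which is one source of the divergent conclusions the paper investigates.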
Presenters
RD
Ritajit Dey
PhD Student, University Of Glasgow
Co-Authors
IO
Iadh Ounis
Professor, University Of Glasgow
GM
Graham McDonald
Senior Lecturer, University Of Glasgow
YM
Yashar Moshfeghi
Associate Professor, University Of Strathclyde
