ECIR 2026 · Reproducibility II: Retrieval
2026-03-30, 16:00 - 17:00 (Europe/Amsterdam)
Contact: conference-secretariat@blueboxevents.nl
Fast, Compact, Dynamic Indexing for Learned Sparse Retrieval Systems
Reproducibility · 04:00 PM - 05:00 PM (Europe/Amsterdam) · 2026/03/30 14:00:00 UTC - 2026/03/30 15:00:00 UTC
Learned sparse retrieval (LSR) is an emerging paradigm that uses pretrained language models to assign learned weights to the terms of a document, enabling practitioners to deploy next-generation rankers within their existing lexical retrieval pipelines. Although LSR systems have been found to provide strong increases in effectiveness over traditional statistical approaches, this boost comes at the cost of both indexing and retrieval efficiency. In this work, we explore the application of LSR to a practical online setting where new documents must be indexed and searchable as soon as they arrive. In particular, we create a clean-room re-implementation of the current state-of-the-art linked block dynamic indexing approach, and propose a set of important augmentations that enable efficient online indexing and query processing to generalize to the learned sparse regime. Our results over two traditional and two LSR models, and a multitude of experimental settings, demonstrate the practicality of our approach, allowing new documents to be queried at a reasonable latency while maintaining fast insertion ability.
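To make the setting concrete, the following is a minimal, illustrative sketch (not the paper's implementation) of a dynamic inverted index over learned term weights: newly inserted documents become searchable immediately, and queries are scored by the sparse dot product between query and document weights. The class and method names are assumptions for illustration only.

```python
from collections import defaultdict


class DynamicLSRIndex:
    """Toy inverted index for learned sparse vectors (illustrative only)."""

    def __init__(self):
        # term -> list of (doc_id, learned_weight) postings
        self.postings = defaultdict(list)

    def insert(self, doc_id, term_weights):
        """Append postings so the document is searchable as soon as it arrives."""
        for term, weight in term_weights.items():
            self.postings[term].append((doc_id, weight))

    def search(self, query_weights, k=10):
        """Rank documents by the sparse dot product of query and document weights."""
        scores = defaultdict(float)
        for term, q_w in query_weights.items():
            for doc_id, d_w in self.postings.get(term, []):
                scores[doc_id] += q_w * d_w
        return sorted(scores.items(), key=lambda x: -x[1])[:k]
```

A real system would organize postings into blocks and apply dynamic-pruning optimizations during query processing; the append-only insertion path above is only meant to convey why online indexing and fast retrieval pull in opposite directions.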
Joel Mackenzie, Senior Lecturer, The University of Queensland
Down with the Hierarchy: The 'H' in HNSW Stands for "Hubs"
Reproducibility · 04:00 PM - 05:00 PM (Europe/Amsterdam) · 2026/03/30 14:00:00 UTC - 2026/03/30 15:00:00 UTC
Driven by recent breakthrough advances in neural representation learning, approximate near-neighbor (ANN) search over vector embeddings has emerged as a critical computational workload. With the introduction of the seminal Hierarchical Navigable Small World (HNSW) algorithm, graph-based indexes have established themselves as the overwhelmingly dominant paradigm for efficient and scalable ANN search. As the name suggests, HNSW searches a layered hierarchical graph to quickly identify neighborhoods of similar points to a given query vector. But is this hierarchy even necessary? A rigorous experimental analysis to answer this question would provide valuable insights into the nature of algorithm design for ANN search and motivate directions for future work in this increasingly crucial domain. To that end, we conduct an extensive reproducibility study covering more large-scale datasets than prior investigations of this question. We ultimately find that a flat navigable small world graph retains all of the benefits of HNSW on high-dimensional datasets, with latency and recall performance essentially identical to the original algorithm but with less memory overhead. Furthermore, we go a step beyond prior works and study why the hierarchy of HNSW provides no benefit in high dimensions, hypothesizing that navigable small world graphs contain a well-connected, frequently traversed highway of hub nodes that maintain the same purported function as the hierarchical layers. We present compelling empirical evidence that the Hub Highway Hypothesis holds for real datasets and investigate the mechanisms by which the highway forms. The implications of this hypothesis may also provide future research directions in developing enhancements to graph-based ANN search.
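The core operation the abstract refers to, for both HNSW and a flat navigable small world graph, is a greedy best-first traversal toward the query. The sketch below shows that traversal on a flat graph with no hierarchy; the function name, `ef` parameter, and graph encoding are illustrative assumptions, not the authors' code.

```python
import heapq


def greedy_search(graph, vectors, query, entry, ef=4):
    """Best-first search over a flat navigable small world graph (toy sketch).

    graph:   node -> list of neighbor nodes
    vectors: node -> point (tuple of floats)
    Returns up to `ef` nodes as (squared_distance, node), nearest first.
    """
    def dist(v):
        return sum((a - b) ** 2 for a, b in zip(vectors[v], query))

    visited = {entry}
    candidates = [(dist(entry), entry)]          # min-heap: frontier to expand
    best = [(-dist(entry), entry)]               # max-heap: current top-ef results
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -best[0][0] and len(best) >= ef:  # frontier worse than worst result
            break
        for nb in graph[node]:
            if nb not in visited:
                visited.add(nb)
                d_nb = dist(nb)
                if len(best) < ef or d_nb < -best[0][0]:
                    heapq.heappush(candidates, (d_nb, nb))
                    heapq.heappush(best, (-d_nb, nb))
                    if len(best) > ef:
                        heapq.heappop(best)
    return sorted((-d, n) for d, n in best)
```

HNSW runs essentially this loop once per layer, using each layer's result as the next layer's entry point; the paper's finding is that on high-dimensional data the flat traversal alone, seeded from a fixed entry point, already reaches the same neighborhoods, plausibly because highly connected hub nodes play the routing role the upper layers were designed for.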
Multivector Reranking in the Era of Strong First-Stage Retrievers
Reproducibility · 04:00 PM - 05:00 PM (Europe/Amsterdam) · 2026/03/30 14:00:00 UTC - 2026/03/30 15:00:00 UTC
Learned multivector representations power modern search systems with strong retrieval effectiveness, but their real-world use is limited by the high cost of exhaustive token-level retrieval. Therefore, most systems adopt a gather-and-refine strategy, where a lightweight gather phase selects candidates for full scoring. However, this approach requires expensive searches over large token-level indexes and often misses the documents that would rank highest under full similarity. In this paper, we reproduce several state-of-the-art multivector retrieval methods on two publicly available datasets, providing a clear picture of the current multivector retrieval field and observing the inefficiency of token-level gathering. Building on that, we show that replacing the token-level gather phase with a single-vector document retriever, specifically a learned sparse retriever (LSR), produces a smaller and more semantically coherent candidate set. This recasts the gather-and-refine pipeline into the well-established two-stage retrieval architecture. As retrieval latency decreases, query encoding with two neural encoders becomes the dominant computational bottleneck. To mitigate this, we integrate recent inference-free LSR methods, demonstrating that they preserve the retrieval effectiveness of the dual-encoder pipeline while substantially reducing query encoding time. Finally, we investigate multiple reranking configurations that balance efficiency, memory, and effectiveness, and we introduce two optimization techniques that prune low-quality candidates early. Empirical results show that these techniques improve retrieval efficiency by up to 1.8 times with no loss in quality. Overall, our two-stage approach achieves a speedup of over 24 times compared to state-of-the-art multivector retrieval systems, while maintaining comparable or superior retrieval quality.
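The two-stage pipeline described above can be sketched in a few lines: a single-vector first stage produces a candidate set, and an exact late-interaction (MaxSim) score refines it. This is a schematic sketch under assumed data layouts (score dictionaries and per-document token vectors), not the paper's system; the function names and cutoff parameters are hypothetical.

```python
def maxsim(query_vecs, doc_vecs):
    """Late-interaction score: each query token contributes its best-matching
    document token's dot product, summed over all query tokens."""
    return sum(
        max(sum(q * d for q, d in zip(qv, dv)) for dv in doc_vecs)
        for qv in query_vecs
    )


def gather_and_refine(lsr_scores, doc_token_vecs, query_token_vecs,
                      k_gather=100, k_final=10):
    """Stage 1 (gather): take the top candidates from a single-vector LSR ranking.
    Stage 2 (refine): rerank those candidates with the exact MaxSim score."""
    candidates = sorted(lsr_scores, key=lsr_scores.get, reverse=True)[:k_gather]
    reranked = sorted(
        candidates,
        key=lambda d: maxsim(query_token_vecs, doc_token_vecs[d]),
        reverse=True,
    )
    return reranked[:k_final]
```

The efficiency argument in the abstract corresponds to the shapes here: the expensive `maxsim` computation touches only `k_gather` documents selected by the cheap first stage, rather than every posting in a token-level index.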
Temporal Fact Conflicts in LLMs: Reproducibility Insights from Unifying DYNAMICQA and MULAN
Reproducibility · 04:00 PM - 05:00 PM (Europe/Amsterdam) · 2026/03/30 14:00:00 UTC - 2026/03/30 15:00:00 UTC
Large Language Models (LLMs) often struggle with temporal fact conflicts due to outdated or evolving information in their training data. Two recent studies with accompanying datasets report opposite conclusions on whether external context can effectively resolve such conflicts. DYNAMICQA evaluates how effective external context is in shifting the model's output distribution, finding that temporal facts are more resistant to change. In contrast, MULAN examines how often external context changes memorised facts, concluding that temporal facts are easier to update. In this reproducibility paper, we first reproduce experiments from both benchmarks. We then reproduce the experiments of each study on the dataset of the other to investigate the source of their disagreement. To enable direct comparison of findings, we standardise both datasets to align with the evaluation settings of each study. Importantly, using an LLM, we synthetically generate realistic natural language contexts to replace MULAN's programmatically constructed statements when reproducing the findings of DYNAMICQA. Our analysis reveals strong dataset dependence: MULAN's findings generalise under both methodological frameworks, whereas applying MULAN's evaluation to DYNAMICQA yields mixed outcomes. Finally, while the original studies only considered 7B LLMs, we reproduce these experiments across LLMs of varying sizes, revealing how scale influences the encoding and updating of temporal facts. Our results highlight how dataset design, evaluation metrics, and model size shape LLM behaviour in the presence of temporal knowledge conflicts. Our code and data are publicly available at https://anonymous.4open.science/r/Temporal-Conflict-Repro-30B7.