An Empirical Study of Model Casing in Learned Sparse Retrieval

This abstract has open access
Abstract Summary
Learned Sparse Retrieval (LSR) methods construct sparse lexical representations of queries and documents that can be efficiently searched using inverted indexes. Existing LSR approaches have almost exclusively relied on uncased backbone models, whose vocabularies exclude case-sensitive distinctions, thereby reducing vocabulary mismatch. However, most recent state-of-the-art language models are only available in cased versions. Despite this shift, the impact of backbone model casing on LSR has not been studied, potentially posing a risk to the viability of the method going forward. To fill this gap, we systematically evaluate paired cased and uncased versions of the same backbone models across multiple datasets to assess their suitability for LSR. Our findings show that LSR models with cased backbone models by default perform substantially worse than their uncased counterparts; performing lowercasing during preprocessing eliminates this gap. Moreover, our token-level analysis reveals that, under lowercasing, cased models almost entirely suppress cased vocabulary items and behave effectively as uncased models, explaining their restored performance. This result broadens the applicability of recent cased models to the LSR setting and facilitates the integration of stronger backbone architectures into sparse retrieval. The complete code and implementation for this project are available at: \url{https://anonymous.4open.science/r/Uncased-vs-cased-models-in-LSR-F75D}
Abstract ID :
NKDR52
Submission Type
Submission Topics
PhD Student
,
University Of Glasgow
PhD student
,
University Of Amsterdam
TKH AI Technology
TKH AI Technology
Senior Lecturer
,
University Of Glasgow
Johns Hopkins University, HLTCOE

Abstracts With Same Type

Abstract ID
Abstract Title
Abstract Topic
Submission Type
Primary Author
NKDR51
Search and rankingSocietally-motivated IR research
Full papers
Martim Baltazar
NKDR15
ApplicationsMachine Learning and Large Language Models
Full papers
Saeedeh Javadi
NKDR49
Societally-motivated IR researchUser aspects in IR
Full papers
Niall McGuire
NKDR177
ApplicationsSearch and ranking
Full papers
Danyang Hou
NKDR184
ApplicationsEvaluation research
Full papers
Danyang Hou
NKDR193
ApplicationsSearch and ranking
Full papers
Danyang Hou
NKDR39
ApplicationsMachine Learning and Large Language Models
Full papers
Sarmistha Das
1 visits