Large Language Models as Assessors: On the Impact of Relevance Scales

Abstract Summary
Traditionally, the task of relevance judgment has relied on human annotators, but recent advances in Large Language Models (LLMs) have prompted growing interest in their use as proxies providing fully or semi-automated judgments of relevance. In this setting, a key yet underexplored factor is the relevance scale adopted when judging relevance. Relevance scales range from binary to fine-grained, and their impact on the effectiveness of LLM-based judgments, the effect of scale conversions, and their role in the presence of potential data contamination remain unknown. In this paper, we systematically investigate how different scales, as well as conversions between them, affect the ability of LLMs to provide reliable point-wise relevance judgments, across multiple prompting strategies and model sizes. Using a popular TREC collection, we compare model outputs against both crowd and expert annotations, analyzing their alignment, stability, and potential data contamination issues.
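Since the paper's actual prompts, scales, and conversion rules are not given here, the following is only a minimal sketch of the point-wise setup the abstract describes: a hypothetical graded (0-3) judging prompt, a scale conversion collapsing graded labels to binary, and a chance-corrected agreement measure (Cohen's kappa) against reference labels. The prompt template, the conversion threshold, and the example data are all illustrative assumptions, not the paper's method.

```python
# Illustrative sketch only: a point-wise graded relevance prompt, a
# graded-to-binary scale conversion, and agreement against reference labels.
# All names, the 0-3 scale, and the threshold are assumptions for this sketch.

POINTWISE_PROMPT = (
    "Given a query and a document, judge how relevant the document is to the "
    "query on a scale from 0 (not relevant) to 3 (perfectly relevant). "
    "Answer with a single digit.\n"
    "Query: {query}\nDocument: {document}\nRelevance (0-3):"
)

def to_binary(graded_label: int, threshold: int = 2) -> int:
    """Collapse a 0-3 graded label to binary: relevant iff label >= threshold."""
    return int(graded_label >= threshold)

def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Chance-corrected agreement between two equal-length label sequences."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

if __name__ == "__main__":
    # The prompt would be filled per query-document pair, e.g.:
    # POINTWISE_PROMPT.format(query="...", document="...")
    # Hypothetical LLM graded outputs vs. expert binary reference labels.
    llm_graded = [3, 0, 2, 1, 3, 0]
    expert_binary = [1, 0, 1, 1, 1, 0]
    llm_binary = [to_binary(g) for g in llm_graded]
    print(f"kappa after 0-3 -> binary conversion: "
          f"{cohens_kappa(llm_binary, expert_binary):.3f}")
```

In this sketch, changing the conversion threshold changes which graded labels count as relevant, which is one way the choice of scale and conversion can shift measured agreement with human annotators.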
Abstract ID: NKDR207
Submission Type: Short papers
Author Affiliations: University of Udine; The University of Queensland

Abstracts With Same Type

Abstract ID | Abstract Topic | Submission Type | Primary Author
NKDR99 | Machine learning; Search and ranking | Short papers | Mr. Amir Khosrojerdi
NKDR115 | IR applications; Large Language Models | Short papers | Omar Adjali
NKDR108 | IR evaluation; Search and ranking | Short papers | Ms. PAYEL SANTRA
NKDR112 | Machine learning; Search and ranking | Short papers | Amirabbas Afzali
NKDR82 | Generative IR; IR applications; Large Language Models; Retrieval-Augmented Generation; System aspects | Short papers | Saisab Sadhu