Abstract
Traditionally, the task of relevance judgment has relied on human annotators, but recent advances in Large Language Models (LLMs) have prompted growing interest in their use as proxies for providing fully- or semi-automated relevance judgments. In this setting, a key yet underexplored factor is the relevance scale adopted when judging relevance. Relevance scales range from binary to fine-grained, and their impact on the effectiveness of LLM-based judgments, the effect of converting between scales, and their role in the presence of potential data contamination remain unknown. In this paper, we systematically investigate how different scales, as well as conversions between them, affect the ability of LLMs to provide reliable point-wise relevance judgments, exploring this across multiple prompting strategies and model sizes. Using a popular TREC collection, we compare model outputs against both crowd and expert annotations, analyzing their alignment, stability, and potential data contamination issues.