Abstract
Traditionally, the task of relevance judgment has relied on human annotators, but recent advances in Large Language Models (LLMs) have prompted growing interest in their use as proxies for providing fully- or semi-automated relevance judgments. In this setting, a key yet underexplored factor is the relevance scale adopted when judging relevance. Relevance scales range from binary to fine-grained, and their impact on the effectiveness of LLM-based judgments, the effect of converting between scales, and their role in the presence of potential data contamination remain unknown. In this paper, we systematically investigate how different scales, as well as conversions between them, affect the ability of LLMs to provide reliable point-wise relevance judgments, exploring this across multiple prompting strategies and model sizes. Using a popular TREC collection, we compare model outputs against both crowd and expert annotations, analyzing their alignment, stability, and potential data contamination issues.