Analyzing AI Evaluation Benchmarks Through Information Retrieval and Network Science

This abstract has open access
Abstract Summary
Information Retrieval (IR) evaluation benchmarks have been analyzed extensively and from many different perspectives. Benchmarking also plays a central role in evaluating the capabilities of Large Language Models (LLMs). However, concerns have recently emerged regarding the robustness of benchmarks and the reliability of leaderboard rankings. In this paper, we apply an IR approach to LLM evaluation: we analyze LLM benchmark results through the lens of network science. Adapting a method developed for TREC test collections, we construct a bipartite graph between models and benchmark questions and apply Kleinberg's HITS algorithm to uncover latent structure in the evaluation data. In this framework, model hubness quantifies a model's tendency to perform well on easy questions, while question hubness captures a question's ability to discriminate between more and less effective models. This graph-based view provides a principled way to diagnose benchmark-induced biases and to assess the reliability of model rankings. We conduct experiments on seven multiple-choice QA benchmarks with a pool of 34 LLMs. Through this IR-inspired approach, we show that model rankings on leaderboards are strongly influenced by subsets of easy questions.
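To make the graph construction concrete, the following is a minimal sketch of HITS run on a bipartite model-question graph. It is an illustration under stated assumptions, not the method used in the paper: edge weights here are plain binary correctness (1 if a model answered a question correctly), whereas the TREC-derived method may weight and direct edges differently. In this simplified version models receive hub scores and questions receive authority scores; the question hubness described in the abstract would require edges in the reverse direction as well, which the sketch omits. The names `hits_bipartite`, `toy_correct`, and the toy data are hypothetical.

```python
import math


def hits_bipartite(correct, iterations=100, tol=1e-9):
    """Power-iteration HITS on a model-by-question matrix.

    correct[m][q] is the weight of the edge from model m to question q
    (assumed here: 1 if the model answered correctly, else 0).
    Returns (model_hub, question_authority), each L2-normalized.
    """
    n_models = len(correct)
    n_questions = len(correct[0])
    hub = [1.0] * n_models
    auth = [1.0] * n_questions
    for _ in range(iterations):
        # Authority update: add up the hub scores of models linking to q.
        new_auth = _normalize([
            sum(correct[m][q] * hub[m] for m in range(n_models))
            for q in range(n_questions)
        ])
        # Hub update: add up the authority scores of questions m links to.
        new_hub = _normalize([
            sum(correct[m][q] * new_auth[q] for q in range(n_questions))
            for m in range(n_models)
        ])
        delta = max(abs(a - b) for a, b in zip(new_hub, hub))
        hub, auth = new_hub, new_auth
        if delta < tol:
            break
    return hub, auth


def _normalize(values):
    """Scale a vector to unit L2 norm (returned unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in values))
    return [x / norm for x in values] if norm > 0 else values


# Hypothetical toy data: 3 models (rows) x 4 questions (columns).
toy_correct = [
    [1, 1, 1, 0],  # model_a
    [1, 1, 0, 0],  # model_b
    [1, 0, 0, 0],  # model_c
]
model_hub, question_auth = hits_bipartite(toy_correct)
```

On this toy matrix, the question that every model answers correctly gets the highest authority score, and a question that no model answers gets zero. Model hub scores are sums over those question scores, so under this simplified weighting they are dominated by performance on the easiest questions, which is the kind of effect the abstract reports for leaderboard rankings.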
Abstract ID:
NKDR208

University of Udine
University of Udine
Tenure Track Assistant Professor, University of Udine
University of Udine
