Abstract
Information Retrieval (IR) evaluation benchmarks have been analyzed extensively, using a wide range of approaches. Benchmarking also plays a central role in evaluating the capabilities of Large Language Models (LLMs). However, concerns have recently emerged regarding the robustness of benchmarks and the reliability of leaderboard rankings. In this paper, we apply an IR approach to LLM evaluation: we analyze LLM benchmark results through the lens of network science. Adapting a method developed for TREC test collections, we construct a bipartite graph between models and benchmark questions and apply Kleinberg's HITS algorithm to uncover latent structure in the evaluation data. In this framework, model hubness quantifies a model's tendency to perform well on easy questions, while question hubness captures a question's ability to discriminate between more and less effective models. This graph-based view provides a principled way to diagnose benchmark-induced biases and to assess the reliability of model rankings. We conduct experiments on seven multiple-choice QA benchmarks with a pool of 34 LLMs. Through this IR-inspired approach, we show that model rankings on leaderboards are strongly influenced by subsets of easy questions.
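To make the HITS step concrete, the sketch below runs hub/authority power iteration on a model-question bipartite graph derived from a binary correctness matrix. This is only an illustrative simplification, not the paper's exact pipeline: the array `correct` and the function `hits_bipartite` are hypothetical names, and the paper's actual graph construction (edge weighting, per-benchmark handling) may differ.

```python
# Minimal HITS sketch on a model-question bipartite graph (illustrative only).
# correct[m, q] = 1 if model m answers question q correctly, else 0.
import numpy as np

def hits_bipartite(correct: np.ndarray, n_iter: int = 100, tol: float = 1e-9):
    """Power-iterate hub scores for models and authority scores for questions."""
    n_models, n_questions = correct.shape
    hubs = np.ones(n_models)        # model-side scores
    auths = np.ones(n_questions)    # question-side scores
    eps = 1e-12                     # avoid division by zero on empty graphs
    for _ in range(n_iter):
        # A question scores high if it is answered correctly by high-scoring models.
        new_auths = correct.T @ hubs
        new_auths /= (np.linalg.norm(new_auths) + eps)
        # A model scores high if it answers high-scoring questions correctly.
        new_hubs = correct @ new_auths
        new_hubs /= (np.linalg.norm(new_hubs) + eps)
        if np.abs(new_hubs - hubs).sum() + np.abs(new_auths - auths).sum() < tol:
            hubs, auths = new_hubs, new_auths
            break
        hubs, auths = new_hubs, new_auths
    return hubs, auths

# Toy usage: 3 models x 4 questions.
correct = np.array([[1, 1, 1, 0],
                    [1, 1, 0, 0],
                    [1, 0, 0, 0]])
model_scores, question_scores = hits_bipartite(correct)
```

In this toy example, questions answered correctly by every model (the "easy" ones) accumulate the largest question-side scores, which is the intuition behind diagnosing how much leaderboard rankings depend on easy subsets.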