Towards Quantitative Summarization Evaluation: An Integrated Atomic-Based Evaluation Framework and Dataset for Text Summarization

This abstract is open access.
Abstract Summary
Despite the dramatic advances in Large Language Models (LLMs), traditional summarization benchmarks are critically saturated, failing to differentiate state-of-the-art models or reflect real-world user needs. Existing evaluation datasets suffer from fundamental constraints in instructional diversity, narrow domain coverage, and homogeneous text lengths, missing the practical demands of modern summarization. To address these limitations, we introduce SumBench, a challenging benchmark derived from an in-depth analysis of user requirements and designed to stress-test advanced LLM capabilities. SumBench incorporates a diverse range of goal-oriented instructions, cross-domain long documents (up to 32K tokens), and complex domain knowledge. Crucially, we integrate the Atomic Summarization Evaluation Framework (ATMSumE), which leverages atomic decompositions of instructions and references to enable fine-grained, multi-dimensional assessment across instruction adherence, key point coverage, and factual accuracy. Our analysis of SumBench reveals systemic LLM limitations: performance substantially degrades when source length exceeds 16K tokens, showing pronounced weaknesses in completeness and factuality within specialized domains. Critically, failure modes are highly task-dependent: completeness gaps emerge in Timeline and Global Summarization, while reasoning-intensive tasks incur higher factual error rates. These results empirically establish that LLM performance is strongly modulated by document length, domain complexity, and instruction type, providing an evidence-based roadmap for robust model development. The benchmark and tools will be publicly released.
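The atomic-evaluation idea described above can be sketched in a few lines: decompose the reference into atomic facts, check each against the candidate summary (coverage), and symmetrically check the summary's own atoms against the reference (factual accuracy). This is only an illustrative sketch: the function names are placeholders, and the `entails` check stands in for whatever NLI model or LLM judge ATMSumE actually uses, which the abstract does not specify.

```python
# Illustrative sketch of atomic-based summary scoring (not the paper's
# implementation). Atoms are assumed to be pre-extracted short facts.

def entails(text: str, fact: str) -> bool:
    # Placeholder entailment check; a real system would use an NLI
    # model or an LLM judge instead of substring matching.
    return fact.lower() in text.lower()

def atomic_scores(summary_atoms, reference_atoms, summary_text, reference_text):
    # Key-point coverage: fraction of reference atoms supported by the summary.
    coverage = sum(entails(summary_text, a) for a in reference_atoms) / len(reference_atoms)
    # Factual accuracy: fraction of summary atoms supported by the reference.
    accuracy = sum(entails(reference_text, a) for a in summary_atoms) / len(summary_atoms)
    return {"coverage": coverage, "factual_accuracy": accuracy}
```

A summary that states half of the reference's atoms, all of them correctly, would score 0.5 coverage and 1.0 factual accuracy under this sketch; instruction adherence would be scored the same way against atoms decomposed from the instruction rather than the reference.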
Abstract ID :
NKDR203
Submission Type
PhD Student
Key Laboratory of AI Safety, Institute of Computing Technology, University of Chinese Academy of Sciences
Institute of Computing Technology, Chinese Academy of Sciences

Abstracts With Same Type

Abstract ID | Abstract Topic                                           | Submission Type | Primary Author
NKDR52      | Search and ranking                                       | Full papers     | Emmanouil Georgios Lionis
NKDR51      | Search and ranking; Societally-motivated IR research     | Full papers     | Martim Baltazar
NKDR15      | Applications; Machine Learning and Large Language Models | Full papers     | Saeedeh Javadi
NKDR49      | Societally-motivated IR research; User aspects in IR     | Full papers     | Niall McGuire
NKDR177     | Applications; Search and ranking                         | Full papers     | Danyang Hou
NKDR184     | Applications; Evaluation research                        | Full papers     | Danyang Hou
NKDR193     | Applications; Search and ranking                         | Full papers     | Danyang Hou
NKDR39      | Applications; Machine Learning and Large Language Models | Full papers     | Sarmistha Das