Abstract
Despite dramatic advances in Large Language Models (LLMs), traditional summarization benchmarks are critically saturated, failing to differentiate state-of-the-art models or to reflect real-world user needs. Existing evaluation datasets suffer from fundamental constraints: limited instructional diversity, narrow domain coverage, and homogeneous text lengths, and thus fail to capture the practical demands of modern summarization. To address these limitations, we introduce SumBench, a challenging benchmark derived from an in-depth analysis of user requirements and designed to stress-test advanced LLM capabilities. SumBench incorporates a diverse range of goal-oriented instructions, cross-domain long documents (up to 32K tokens), and complex domain knowledge. Crucially, we integrate the Atomic Summarization Evaluation Framework (ATMSumE), which leverages atomic decompositions of instructions and references to enable fine-grained, multi-dimensional assessment across instruction adherence, key point coverage, and factual accuracy. Our analysis on SumBench reveals systemic LLM limitations: performance degrades substantially when source length exceeds 16K tokens, with pronounced weaknesses in completeness and factuality within specialized domains. Moreover, failure modes are highly task-dependent: completeness gaps emerge in Timeline and Global Summarization, while reasoning-intensive tasks incur higher factual error rates. These results empirically establish that LLM performance is strongly modulated by document length, domain complexity, and instruction type, providing an evidence-based roadmap for robust model development. The benchmark and tools will be publicly released.