Abstract
Large Language Models (LLMs) are increasingly used as retrieval and reasoning agents in specialized domains. This study evaluates their performance on cybersecurity Capture-the-Flag (CTF) challenges, reframed as structured retrieval tasks in which models must infer information from textual and code-based evidence. Using three public benchmarks (NYU~CSAW, CyBench, and InterCode-CTF), we compare five recent LLMs within a unified, reproducible evaluation framework. Results reveal substantial variation in performance across datasets, task categories, and models. The proposed benchmark provides a foundation for assessing domain-specific retrieval and reasoning.