Evaluating Large Language Models as Domain-Specific Retrieval Agents: A Study on Cybersecurity Challenge Benchmarks

This abstract has open access
Abstract Summary
Large Language Models are increasingly used as retrieval and reasoning agents in specialized domains. This study evaluates their performance on cybersecurity Capture-the-Flag challenges, reframed as structured retrieval tasks in which models must infer information from textual and code-based evidence. Using three public benchmarks (NYU CSAW, CyBench, and InterCode-CTF), we compare five recent LLMs within a unified, reproducible evaluation framework. Results vary substantially across datasets, task categories, and models. The proposed benchmark provides a foundation for assessing domain-specific retrieval and reasoning.
Abstract ID :
NKDR79
Submission Type
Short papers
Hochschule Rhein-Waal (Rhine-Waal University of Applied Sciences)

Abstracts With Same Type

Abstract ID | Abstract Topic | Submission Type | Primary Author
NKDR99  | Machine learning; Search and ranking | Short papers | Mr. Amir Khosrojerdi
NKDR115 | IR applications; Large Language Models | Short papers | Omar Adjali
NKDR108 | IR evaluation; Search and ranking | Short papers | Ms. Payel Santra
NKDR112 | Machine learning; Search and ranking | Short papers | Amirabbas Afzali
NKDR82  | Generative IR; IR applications; Large Language Models; Retrieval-Augmented Generation; System aspects | Short papers | Saisab Sadhu