Large Language Models are increasingly used as retrieval and reasoning agents in specialized domains. This study evaluates their performance on cybersecurity Capture-the-Flag challenges, reframed as structured retrieval tasks where models must infer information from textual and code-based evidence. ...
IR evaluation, Large Language Models, System aspects, Short papers