How Do LLMs Cite? A Mechanistic Interpretation of Attribution in RAG

Abstract Summary
With the rising popularity of large language models (LLMs) for applications ranging from everyday information seeking to high-stakes settings such as complex medical question answering, hallucinated content poses a serious risk. Retrieval-Augmented Generation (RAG) aims to enhance the trustworthiness of LLMs by grounding their outputs in external documents, often using inline citations for verifiability. However, if those citations are not faithful, i.e., if they do not accurately reflect the source of the information used during answer generation, the mere presence of references to trusted sources may misguide user trust. We argue that understanding citation faithfulness, and developing a reliable framework for evaluating it, requires a mechanistic approach that considers the model's internals rather than mere observations of its inputs and outputs. This paper offers the first mechanistic account of how a large language model decides whether to attach an inline citation while answering a factoid question. Through activation patching, we identify an "attributional ensemble" of attention heads and MLP layers responsible for the citation generation process in Llama-3.1-8B-Instruct. Our findings suggest that citation decisions rely heavily on shallow heuristics such as entity co-reference matching, raising concerns about the trustworthiness of such citations.
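To make the methodology concrete, the following is a minimal sketch of layer-level activation patching on a causal LM: cache a layer's residual-stream activations from a "clean" run, overwrite the corresponding activations during a "corrupted" run, and measure how the logit of a citation token shifts. The prompts, layer index, and citation token below are illustrative assumptions for a toy setup, not the paper's actual experimental design.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # gated model; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

# Hypothetical clean/corrupted prompt pair for a factoid question with a source document.
clean = "Document [1]: Paris is the capital of France.\nQ: What is the capital of France?\nA:"
corrupt = "Document [1]: Berlin is the capital of Germany.\nQ: What is the capital of France?\nA:"

clean_ids = tok(clean, return_tensors="pt").input_ids.to(model.device)
corrupt_ids = tok(corrupt, return_tensors="pt").input_ids.to(model.device)

LAYER = 15  # hypothetical layer to patch
cache = {}

def save_hook(module, inp, out):
    # Decoder layers return a tuple; out[0] is the residual-stream hidden states.
    cache["resid"] = out[0].detach()

def patch_hook(module, inp, out):
    # Overwrite the corrupted run's activations with the cached clean ones.
    patched = out[0].clone()
    n = min(patched.shape[1], cache["resid"].shape[1])
    patched[:, :n] = cache["resid"][:, :n]
    return (patched,) + out[1:]

block = model.model.layers[LAYER]

with torch.no_grad():
    h = block.register_forward_hook(save_hook)
    model(clean_ids)          # clean run: cache activations at LAYER
    h.remove()

    base_logits = model(corrupt_ids).logits      # corrupted baseline
    h = block.register_forward_hook(patch_hook)
    patched_logits = model(corrupt_ids).logits   # corrupted run with clean activations
    h.remove()

cite_id = tok.encode(" [1]", add_special_tokens=False)[0]  # first token of the citation marker
print("corrupt :", base_logits[0, -1, cite_id].item())
print("patched :", patched_logits[0, -1, cite_id].item())

A layer (or head) whose patched activations move the citation-token logit back toward the clean run's behavior is a candidate member of the attributional ensemble described in the abstract.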
Abstract ID: NKDR157
University of Amsterdam
Assistant Professor, University of Amsterdam