Tutorial on Mechanistic Interpretability

Abstract Summary
"This tutorial introduces mechanistic interpretability, a growing research area within the broader interpretability community that seeks to reverse-engineer model components to understand how neural models perform tasks. While this area has rapidly advanced in NLP, yielding insights into the inner workings of large Transformer-based models and enabling model diagnostics, controllability, and safety, it remains largely unexplored in IR. This tutorial provides a foundational overview of mechanistic interpretability in NLP, covering its key goals and core methods. We then zoom in on its early applications in IR, examining the few existing studies in depth and discussing how these methods can be adapted to retrieval settings. Through an interactive coding session, participants will gain a practical understanding of how to design, implement, and analyze mechanistic interpretability experiments. By the end, attendees will be equipped with the conceptual and practical foundation needed to initiate their own research and help strengthen the emerging interpretability and explainability community within IR."
Abstract ID: NKDR229
Submission Type: Tutorials
Authors
PhD Candidate, Brown University
Assistant Professor, University of Amsterdam
University of Tübingen

Abstracts With Same Type

Abstract ID: NKDR174
Submission Type: Tutorials
Primary Author: Joel Mackenzie