Master the science of understanding how transformer-based language models actually work internally, from attention patterns to emergent behaviors and circuit-level analysis.
Feature-Level Understanding: At the most granular level, mechanistic interpretability seeks to understand what individual neurons and attention heads represent, how they activate in response to different inputs, and what computational roles they play.
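As a concrete illustration, the sketch below records how a single unit in one of GPT-2's MLP blocks responds to different inputs, using a PyTorch forward hook. This is a minimal sketch, assuming the Hugging Face transformers library; the layer and neuron indices are arbitrary placeholders, not units known to represent anything in particular.

```python
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

# Arbitrary layer and neuron indices, chosen purely for illustration.
LAYER, NEURON = 5, 123

captured = {}

def record_mlp_output(module, inputs, output):
    # Output of the MLP block: shape (batch, seq_len, hidden_size).
    captured["mlp"] = output.detach()

handle = model.h[LAYER].mlp.register_forward_hook(record_mlp_output)

for text in ["The Eiffel Tower is in Paris.", "Bananas are yellow."]:
    encoded = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        model(**encoded)
    # How strongly this one unit fires at each token position.
    print(text, captured["mlp"][0, :, NEURON].tolist())

handle.remove()
```

Comparing such activation traces across many inputs is typically the first step toward a hypothesis about what, if anything, a unit represents.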
Circuit-Level Analysis: Moving up the hierarchy, researchers identify how collections of features work together to implement specific algorithms or perform particular computational tasks, forming interpretable "circuits" within the network.
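One common step in circuit analysis is ablating a single attention head and measuring how much a behavior of interest degrades. The sketch below uses the head_mask argument of GPT-2 in Hugging Face transformers to zero out one head and compares next-token logits with and without it. The prompt and the targeted head are illustrative assumptions, not a verified circuit component.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Hypothetical head to ablate, chosen only for illustration.
LAYER, HEAD = 9, 6

prompt = "When Mary and John went to the store, John gave a drink to"
inputs = tokenizer(prompt, return_tensors="pt")

# head_mask: 1.0 keeps a head, 0.0 zeroes its attention weights.
head_mask = torch.ones(model.config.n_layer, model.config.n_head)
head_mask[LAYER, HEAD] = 0.0

with torch.no_grad():
    clean_logits = model(**inputs).logits[0, -1]
    ablated_logits = model(**inputs, head_mask=head_mask).logits[0, -1]

target = tokenizer(" Mary")["input_ids"][0]
print("clean logit for ' Mary':  ", clean_logits[target].item())
print("ablated logit for ' Mary':", ablated_logits[target].item())
```

Repeating this over all heads, and over combinations of heads, is one rough way to map which components participate in a given behavior.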
Algorithmic Understanding: At the highest level, the goal is to understand the overall algorithms that models learn to solve problems, how information flows through the network, and how complex behaviors emerge from simpler components.
Activation Patching: This technique involves systematically replacing internal activations during a forward pass with activations recorded from a run on a different (counterfactual) input, then measuring how the substitution changes the model's output. It helps identify causal relationships between internal representations and behaviors, rather than merely correlational ones.
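A minimal activation-patching sketch, again assuming GPT-2 small via Hugging Face transformers: the residual stream at one arbitrarily chosen layer and the final token position is recorded on a "corrupted" prompt and spliced into a run on the "clean" prompt, and the effect is read off the logit difference between two candidate completions. The layer choice, prompts, and completion tokens are illustrative assumptions.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

LAYER = 8  # hypothetical layer at which to patch the residual stream

clean = tokenizer("The Eiffel Tower is located in the city of", return_tensors="pt")
corrupt = tokenizer("The Colosseum is located in the city of", return_tensors="pt")

stored = {}

def save_hook(module, inputs, output):
    # GPT2Block returns a tuple; output[0] is the residual-stream hidden state.
    stored["resid"] = output[0][:, -1, :].detach().clone()

def patch_hook(module, inputs, output):
    hidden = output[0].clone()
    hidden[:, -1, :] = stored["resid"]  # overwrite the final position
    return (hidden,) + output[1:]

block = model.transformer.h[LAYER]

# 1) Run the corrupted prompt and cache the activation to be patched in.
handle = block.register_forward_hook(save_hook)
with torch.no_grad():
    model(**corrupt)
handle.remove()

# 2) Run the clean prompt with the cached activation spliced in.
handle = block.register_forward_hook(patch_hook)
with torch.no_grad():
    patched_logits = model(**clean).logits[0, -1]
handle.remove()

# 3) Baseline run on the clean prompt, with no intervention.
with torch.no_grad():
    clean_logits = model(**clean).logits[0, -1]

paris = tokenizer(" Paris")["input_ids"][0]
rome = tokenizer(" Rome")["input_ids"][0]
print("clean   Paris-Rome logit diff:", (clean_logits[paris] - clean_logits[rome]).item())
print("patched Paris-Rome logit diff:", (patched_logits[paris] - patched_logits[rome]).item())
```

Sweeping this patch over layers and token positions shows where in the network the information that distinguishes the two prompts is carried.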
Gradient-Based Analysis: By examining the gradients of a model's outputs with respect to its inputs and intermediate activations during backpropagation, researchers can estimate which parts of the network are most influential for specific predictions or behaviors.
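For example, a simple gradient-times-input attribution takes the gradient of a chosen output logit with respect to the input embeddings and scores each token by the dot product of gradient and embedding. The sketch below makes the same GPT-2 assumptions as above; the prompt and target token are arbitrary.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

# Embed the tokens manually so gradients can be taken w.r.t. the embeddings.
embeds = model.transformer.wte(inputs["input_ids"]).detach().requires_grad_(True)

outputs = model(inputs_embeds=embeds, attention_mask=inputs["attention_mask"])
target = tokenizer(" Paris")["input_ids"][0]

# Backpropagate from the logit of the target completion token.
outputs.logits[0, -1, target].backward()

# Gradient-times-input attribution per token (summed over the embedding dim).
attributions = (embeds.grad * embeds).sum(dim=-1)[0]
for tok_id, score in zip(inputs["input_ids"][0], attributions):
    print(f"{tokenizer.decode([int(tok_id)]):>12s}  {score.item():+.3f}")
```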
Probing and Linear Readouts: These methods involve training simple classifiers on internal representations to determine what information is linearly accessible at different layers and processing stages.
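A minimal probing sketch, assuming GPT-2 hidden states and scikit-learn: a logistic-regression probe is trained on last-token representations from one layer to predict a toy binary property. The layer index, sentences, and labels are illustrative stand-ins for a real probing dataset, and training accuracy on eight examples is only a sanity check, not evidence of linear accessibility.

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

LAYER = 6  # hypothetical layer to probe

# Tiny toy dataset: does the sentence describe an animal (1) or a place (0)?
sentences = [
    ("The cat slept on the warm windowsill.", 1),
    ("The dog chased the ball across the yard.", 1),
    ("A horse galloped through the open field.", 1),
    ("An owl hooted softly in the dark forest.", 1),
    ("Paris is famous for its museums and cafes.", 0),
    ("The village sits at the foot of the mountain.", 0),
    ("Tokyo is one of the largest cities on Earth.", 0),
    ("The harbor town is quiet in the winter.", 0),
]

def last_token_hidden(text):
    encoded = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**encoded, output_hidden_states=True)
    # hidden_states[LAYER] has shape (batch, seq_len, hidden); take the last token.
    return out.hidden_states[LAYER][0, -1].numpy()

X = [last_token_hidden(text) for text, _ in sentences]
y = [label for _, label in sentences]

# A linear probe: if it separates the classes on held-out data, the property
# is linearly accessible in this layer's representations.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", probe.score(X, y))
```

In practice, probes are trained and evaluated on much larger, held-out datasets, and compared against baselines such as probing the embedding layer or a randomly initialized model.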