## Tags
- Part of: [[Machine learning]] [[Artificial Intelligence]] [[Capability]] [[AI safety]]
- Related: [[Mathematical theory of artificial intelligence]]
- Includes:
- Additional:

## Definitions
- The study of taking a trained [[Artificial neural networks|artificial neural network]] and analyzing its weights to reverse-engineer the algorithms learned by the model.

## Main resources
- [AI interpretability wiki](https://aiinterpretability.miraheze.org/wiki/Main_Page)
- [Chris Olah - Looking Inside Neural Networks with Mechanistic Interpretability - YouTube](https://www.youtube.com/watch?v=2Rdp9GvcYOE)
    - ![Chris Olah - Looking Inside Neural Networks with Mechanistic Interpretability - YouTube](https://www.youtube.com/watch?v=2Rdp9GvcYOE)
- [Concrete Steps to Get Started in Transformer Mechanistic Interpretability — Neel Nanda](https://www.neelnanda.io/mechanistic-interpretability/getting-started)
    - <iframe src="https://www.neelnanda.io/mechanistic-interpretability/getting-started" allow="fullscreen" allowfullscreen="" style="height:100%;width:100%; aspect-ratio: 16 / 5; "></iframe>
- [A Comprehensive Mechanistic Interpretability Explainer & Glossary — Neel Nanda](https://www.neelnanda.io/mechanistic-interpretability/glossary)
    - <iframe src="https://www.neelnanda.io/mechanistic-interpretability/glossary" allow="fullscreen" allowfullscreen="" style="height:100%;width:100%; aspect-ratio: 16 / 5; "></iframe>
- [Mechanistic Interpretability - NEEL NANDA (DeepMind) - YouTube](https://www.youtube.com/watch?v=_Ygf0GnlwmY)
    - ![Mechanistic Interpretability - NEEL NANDA (DeepMind) - YouTube](https://www.youtube.com/watch?v=_Ygf0GnlwmY)
- [ARENA](https://www.arena.education/)

## Landscapes
- [GitHub - JShollaj/awesome-llm-interpretability: A curated list of Large Language Model (LLM) Interpretability resources.](https://github.com/JShollaj/awesome-llm-interpretability)
- By theme:
    - [[Superposition]]
        - [Towards Monosemanticity: Decomposing Language Models With Dictionary
Learning](https://transformer-circuits.pub/2023/monosemantic-features/index.html)
    - [[Grokking]]
        - [[Grokking Modular Addition]]
            - [[2301.05217] Progress measures for grokking via mechanistic interpretability](https://arxiv.org/abs/2301.05217)
    - [[Induction heads]]
        - [Induction heads - illustrated — LessWrong](https://www.lesswrong.com/posts/TvrfY4c9eaGLeyDkE/induction-heads-illustrated)
    - [[Grokking Group Operations using Representation theory]]
        - [[2302.03025] A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations](https://arxiv.org/abs/2302.03025)
    - [[Top down representation engineering]]
        - [Representation Engineering: A Top-Down Approach to AI Transparency](https://www.ai-transparency.org/)
    - [[Fractal data manifold dimensions]]
        - [[2004.10802] A Neural Scaling Law from the Dimension of the Data Manifold](https://arxiv.org/abs/2004.10802)
    - [[Mathematical theory of artificial intelligence]]

## Brainstorming
[[Thoughts AI mechinterp]]

## Resources
[[Resources theory reverse engineering mechinterp and alignment AI]]
[[Links AI mechinterp]]

## Contents

## Deep dives
- [GitHub - JShollaj/awesome-llm-interpretability: A curated list of Large Language Model (LLM) Interpretability resources.](https://github.com/JShollaj/awesome-llm-interpretability)
- [An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2 — AI Alignment Forum](https://www.alignmentforum.org/posts/NfFST5Mio7BCAQHPA/an-extremely-opinionated-annotated-list-of-my-favourite-1)

## Written by AI (may include hallucinated factually incorrect information)

# The complete map of mechanistic interpretability

Mechanistic interpretability (mech interp) is the subfield of AI research that reverse-engineers neural networks into human-understandable algorithms by analyzing internal weights, activations, and circuits. This map catalogs **every major concept, technique, result, and tool** in the field as of early 2026, organized into ten sections.
Each entry includes a one-sentence explanation and a canonical source URL.

---

## 1. Core concepts and foundations

These are the theoretical primitives that define how the field thinks about neural networks.

**What is mechanistic interpretability.** The practice of reverse-engineering neural networks into human-understandable components and algorithms, analogous to reverse-engineering a compiled program into source code. ([source](https://www.transformer-circuits.pub/2022/mech-interp-essay))

**Superposition hypothesis.** Neural networks represent more features than they have dimensions by encoding features as nearly-orthogonal directions in activation space, tolerating small amounts of interference—a strategy that is especially advantageous when features are sparse. ([source](https://transformer-circuits.pub/2022/toy_model/index.html))

**Features and representations.** A "feature" is a human-understandable property of the input that is represented as a direction in a neural network's activation space, and identifying these features is the fundamental first step in decomposing network behavior. ([source](https://distill.pub/2020/circuits/zoom-in/))

**Polysemanticity.** The phenomenon where individual neurons respond to multiple, semantically unrelated concepts (e.g., a neuron activating for both cat faces and car fronts), making individual neurons poor units of analysis and motivating the search for better decompositions. ([source](https://transformer-circuits.pub/2022/toy_model/index.html))

**Monosemanticity.** The desirable property where a unit of analysis responds to a single, coherent concept; Anthropic demonstrated that sparse autoencoders can decompose polysemantic neurons into thousands of monosemantic features representing concepts like DNA sequences, legal language, or Hebrew text.
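The superposition hypothesis above can be illustrated numerically. This is a toy sketch, not code from the cited papers: all shapes and numbers are synthetic. It packs eight sparse "features" into a four-dimensional activation space as nearly-orthogonal random directions and shows that a dot-product readout recovers the active feature, up to small interference.

```python
import numpy as np

# Toy superposition sketch (synthetic data): 8 features, only 4 dimensions.
rng = np.random.default_rng(0)
n_features, d_model = 8, 4

# One random unit direction per feature (hypothetical feature embedding).
directions = rng.normal(size=(n_features, d_model))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# A sparse input: only feature 2 is active.
x = np.zeros(n_features)
x[2] = 1.0

# Superposed activation = sum of the active feature directions.
activation = x @ directions

# Read each feature back out by projecting onto its direction.
# The active feature comes back at full strength; inactive features
# see only small interference because the directions are not exactly
# orthogonal.
readout = directions @ activation
```

With sparse inputs the interference terms rarely overwhelm the true signal, which is the intuition for why networks tolerate this encoding.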
([source](https://transformer-circuits.pub/2023/monosemantic-features))

**Linear representation hypothesis.** The proposal that high-level semantic concepts are encoded as linear directions in activation space, such that concepts can be identified via linear probes and behavior can be steered by adding vectors along these directions. ([source](https://arxiv.org/abs/2311.03658))

**Residual stream.** The central communication channel of a transformer, where each attention head and MLP layer reads from and writes to a shared running sum via residual connections, making computation interpretable as a series of independent additions. ([source](https://transformer-circuits.pub/2021/framework/index.html))

**Circuits.** Subgraphs of neural networks consisting of features (nodes) connected by weights (edges) that together implement a specific, identifiable computation—such as a curve detector built from edge detectors, or an induction circuit that performs pattern completion. ([source](https://distill.pub/2020/circuits/zoom-in/))

**Universality.** The hypothesis that different neural networks trained on similar tasks independently learn similar features and circuits, suggesting these representations reflect genuine structure in the data rather than arbitrary solutions. ([source](https://distill.pub/2020/circuits/zoom-in/))

**Privileged basis.** A layer has a privileged basis when architectural features (such as element-wise ReLU) make the standard basis directions special, potentially encouraging features to align with individual neurons; the residual stream theoretically has no privileged basis, though in practice the Adam optimizer's per-dimension normalizers create one.
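The additivity of the residual stream, described above, is an exact linear identity, and it is what makes per-component attribution possible. A minimal sketch with toy weights (no real model involved): the final logits decompose exactly into the sum of each component's direct contribution.

```python
import numpy as np

# Residual-stream sketch (toy numbers): every component writes additively
# into a shared running sum, so the output decomposes exactly.
rng = np.random.default_rng(1)
d_model, d_vocab = 16, 10

embed = rng.normal(size=d_model)           # the token embedding's write
attn_out = rng.normal(size=d_model)        # an attention head's write
mlp_out = rng.normal(size=d_model)         # an MLP layer's write
W_U = rng.normal(size=(d_model, d_vocab))  # unembedding matrix

# The stream is just the sum of all writes...
resid_final = embed + attn_out + mlp_out
logits = resid_final @ W_U

# ...so the logits split exactly into per-component contributions
# (the identity behind direct logit attribution, covered later).
per_component = np.stack([embed @ W_U, attn_out @ W_U, mlp_out @ W_U])
```

Because matrix multiplication distributes over the sum, `per_component.sum(axis=0)` equals `logits` exactly, not approximately.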
([source](https://transformer-circuits.pub/2023/privileged-basis/index.html))

**Feature geometry.** The study of how features organize spatially in representation space when in superposition—Anthropic's toy models revealed features arrange into specific geometric structures such as digons, triangles, pentagons, and tetrahedra, governed by phase transitions. ([source](https://transformer-circuits.pub/2022/toy_model/index.html))

**Bottleneck superposition.** Superposition occurring in bottleneck layers (such as the residual stream or attention keys/queries) where more features must be encoded than there are dimensions available. ([source](https://transformer-circuits.pub/2022/toy_model/index.html))

**Computation in superposition.** The ability of neural networks to perform meaningful computations (e.g., absolute value) on superposed representations, suggesting networks may be "noisily simulating" larger, sparser networks. ([source](https://transformer-circuits.pub/2022/toy_model/index.html))

**Multi-dimensional features.** Features that occupy more than one dimension in activation space (e.g., circular features representing days of the week), extending beyond the one-feature-per-direction assumption of the linear representation hypothesis. ([source](https://arxiv.org/abs/2405.14860))

**Nonlinear representation hypothesis.** The emerging idea that some features may be represented along nonlinear manifolds in activation space rather than as simple linear directions, challenging assumptions underlying most current interpretability methods. ([source](https://arxiv.org/abs/2405.14860))

---

## 2. Key techniques and methods

These are the primary tools researchers use to peer inside neural networks and establish causal claims about their internal computations.

### 2a.
Sparse autoencoders and dictionary learning

**Sparse Autoencoders (SAEs).** An unsupervised dictionary learning method that trains an autoencoder with a sparsity penalty to decompose polysemantic neural network activations into interpretable, monosemantic features that each correspond to a single concept. ([source](https://transformer-circuits.pub/2023/monosemantic-features))

**Dictionary learning.** The broader framework of decomposing data into a sparse weighted combination of learned basis elements ("atoms"), originating from Olshausen & Field's computational neuroscience work on sparse coding of natural images. ([source](https://www.nature.com/articles/381607a0))

**TopK SAEs.** SAEs using a top-k activation function that directly controls sparsity by retaining only the k largest activations, eliminating L1 penalties and shrinkage artifacts; proposed by Gao et al. at OpenAI with clean scaling laws on GPT-4. ([source](https://arxiv.org/abs/2406.04093))

**Gated SAEs.** SAEs with separate gating and magnitude estimation pathways that solve the shrinkage problem of L1-penalized SAEs by applying the sparsity penalty only to the gate, achieving a Pareto improvement on models up to Gemma 7B. ([source](https://arxiv.org/abs/2404.16014))

**JumpReLU SAEs.** SAEs using a discontinuous JumpReLU activation with learnable thresholds, trained via straight-through estimators to directly optimize L0 sparsity, achieving state-of-the-art fidelity on Gemma 2 9B. ([source](https://arxiv.org/abs/2407.14435))

**End-to-end SAEs.** SAEs trained with task-specific loss (KL divergence on model logits) rather than just reconstruction, significantly improving downstream cross-entropy loss fidelity.
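The TopK SAE forward pass described above is simple enough to sketch. This is an illustrative toy with random untrained weights, not code from the Gao et al. paper; the names `W_enc`, `W_dec`, `b_enc`, `b_dec` are conventional but assumed here.

```python
import numpy as np

# Toy TopK sparse autoencoder forward pass (untrained random weights).
rng = np.random.default_rng(2)
d_model, d_sae, k = 8, 32, 4  # dictionary is 4x wider than the activation

W_enc = rng.normal(size=(d_model, d_sae)) * 0.1
W_dec = rng.normal(size=(d_sae, d_model)) * 0.1
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_model)

def topk_sae(x, k=k):
    """Encode, keep only the k largest feature activations, decode."""
    pre = (x - b_dec) @ W_enc + b_enc
    acts = np.maximum(pre, 0.0)            # ReLU feature activations
    if k < acts.size:
        threshold = np.sort(acts)[-k]      # k-th largest activation
        acts = np.where(acts >= threshold, acts, 0.0)
    recon = acts @ W_dec + b_dec           # sparse reconstruction
    return acts, recon

x = rng.normal(size=d_model)
acts, recon = topk_sae(x)
```

The top-k step is the whole trick: at most k of the 32 dictionary features fire per input, so sparsity is enforced architecturally rather than via an L1 penalty.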
([source](https://arxiv.org/abs/2503.17272))

**Residual stream SAEs.** SAEs applied specifically to residual stream activations, the main approach in Anthropic's Scaling Monosemanticity work, chosen because the residual stream is smaller than MLP layers and helps mitigate cross-layer superposition. ([source](https://transformer-circuits.pub/2024/scaling-monosemanticity/))

**Attention SAEs.** SAEs applied to attention layer outputs to decompose attention head computations into interpretable features. ([source](https://arxiv.org/abs/2404.16014))

**MLP SAEs.** SAEs applied specifically to MLP layer outputs to decompose the nonlinear computations of MLP sublayers into sparse interpretable features. ([source](https://arxiv.org/abs/2404.16014))

### 2b. Intervention and patching methods

**Activation patching / causal tracing.** Running a model on clean and corrupted inputs, then selectively restoring ("patching") specific activations from the clean run into the corrupted run to identify which components causally mediate a particular model behavior. ([source](https://arxiv.org/abs/2202.05262))

**Path patching.** Tracing the causal effect of specific information along particular paths through a network's computational graph by replacing activations only along the path of interest, enabling isolation of how specific components communicate. ([source](https://arxiv.org/abs/2211.00593))

**Attribution patching.** A fast, gradient-based first-order approximation to activation patching that estimates each component's causal effect by computing the product of activation difference and gradient, enabling efficient circuit discovery in a single forward-backward pass.
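Activation patching, described above, can be sketched on a toy two-layer linear "model" (entirely hypothetical weights and inputs): run clean and corrupted inputs, then patch the clean hidden activation into the corrupted run and compare outputs.

```python
import numpy as np

# Toy activation patching: a hypothetical two-layer linear model.
rng = np.random.default_rng(3)
W1 = rng.normal(size=(4, 4))
W2 = rng.normal(size=(4, 2))

def forward(x, patched_hidden=None):
    hidden = x @ W1
    if patched_hidden is not None:
        hidden = patched_hidden  # the intervention: restore clean activation
    return hidden, hidden @ W2

x_clean = np.array([1.0, 0.0, 1.0, 0.0])
x_corrupt = np.array([0.0, 1.0, 0.0, 1.0])

h_clean, out_clean = forward(x_clean)
_, out_corrupt = forward(x_corrupt)
_, out_patched = forward(x_corrupt, patched_hidden=h_clean)
```

Patching the full hidden layer restores the clean output exactly here because this toy model has no other path from input to output; in a real transformer one patches a single head or layer and measures how much of a behavioral metric (e.g., a logit difference) is recovered.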
([source](https://www.neelnanda.io/mechanistic-interpretability/attribution-patching))

**Edge attribution patching.** An extension of attribution patching applied to individual edges (connections between components) rather than nodes, using gradient-based approximations to efficiently estimate each edge's causal importance. ([source](https://aclanthology.org/2024.blackboxnlp-1.25.pdf))

**Causal scrubbing.** Redwood Research's evaluation method that tests interpretability hypotheses by resampling activations—replacing every activation the hypothesis claims is irrelevant with values from different inputs—and checking whether model behavior is preserved. ([source](https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-redwood-research))

**Interchange interventions.** Swapping a neural representation obtained from processing one input ("source") into the model's computation on a different input ("base") to test whether that representation has the causal role predicted by an interpretable high-level causal model. ([source](https://arxiv.org/abs/2106.02997))

**Subspace activation patching.** Patching in specific low-dimensional subspaces of activations (found via methods like DAS) rather than entire activation vectors, enabling more fine-grained causal analysis. ([source](https://arxiv.org/abs/2311.17030))

**Activation patching on SAE features.** Combining SAEs with activation patching to perform causal interventions at the level of individual interpretable SAE features, enabling fine-grained, human-interpretable causal analysis via sparse feature circuits. ([source](https://arxiv.org/abs/2403.19647))

### 2c. Ablation methods

**Zero ablation.** Setting the activations of a specific model component to zero during a forward pass to measure its importance by observing the resulting change in model output.
([source](https://transformer-circuits.pub/2021/framework/index.html))

**Mean ablation.** Replacing a component's activations with their average value across a dataset, providing a less destructive baseline than zero ablation by preserving expected activation magnitude. ([source](https://arxiv.org/abs/2309.16042))

**Resample ablation.** Replacing a component's activations with values from a different, randomly sampled input, destroying input-specific information while maintaining realistic activation statistics. ([source](https://arxiv.org/abs/2309.16042))

### 2d. Attribution and lens methods

**Direct logit attribution (DLA).** Measuring each model component's direct contribution to final output logits by projecting its output through the unembedding matrix, exploiting the additive structure of the residual stream. ([source](https://transformer-circuits.pub/2021/framework/index.html))

**Logit lens.** Applying the model's final unembedding matrix to intermediate layer representations to decode them into vocabulary-space probability distributions, revealing what the model "believes" at each layer. ([source](https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens))

**Tuned lens.** An improved version of the logit lens that trains a learned affine transformation for each layer to map intermediate hidden states to vocabulary distributions, correcting for representation drift across layers. ([source](https://arxiv.org/abs/2303.08112))

**Probing classifiers.** Training simple classifiers (typically linear) on frozen intermediate activations to test what information is encoded at each layer. ([source](https://arxiv.org/abs/1610.01644))

**Gradient-based attribution.** Computing the gradient of a model's output with respect to input features or internal components to produce saliency maps indicating which features most influence the prediction.
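The logit lens described above is a one-liner per layer: project each intermediate residual stream through the unembedding. A minimal sketch with toy weights (a real implementation would also apply the final layer norm before unembedding):

```python
import numpy as np

# Logit-lens sketch: decode each layer's residual stream into vocab space.
rng = np.random.default_rng(4)
d_model, d_vocab, n_layers = 8, 5, 3

W_U = rng.normal(size=(d_model, d_vocab))                 # unembedding
layer_writes = [rng.normal(size=d_model) for _ in range(n_layers)]

resid = rng.normal(size=d_model)  # embedding of the current token
lens = []
for write in layer_writes:
    resid = resid + write         # each layer writes into the stream
    lens.append(resid @ W_U)      # decode the intermediate state
```

The lens at the final layer coincides with the model's actual logits; earlier entries show how the "belief" over the vocabulary evolves layer by layer.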
([source](https://arxiv.org/abs/1312.6034))

**Integrated gradients.** An axiomatic attribution method computing feature importance by integrating gradients along a straight-line path from a baseline input to the actual input, satisfying Sensitivity and Implementation Invariance. ([source](https://arxiv.org/abs/1703.01365))

### 2e. Steering and editing methods

**Activation engineering / steering vectors.** Adding computed "steering vectors"—derived from differences in activations between contrastive prompt pairs—to a model's forward pass to controllably shift output properties like sentiment, truthfulness, or topic. ([source](https://arxiv.org/abs/2308.10248))

**Representation engineering (RepE).** A top-down approach to AI transparency that reads and controls high-level cognitive phenomena (honesty, harmlessness, power-seeking) by identifying and manipulating directions in representation space. ([source](https://arxiv.org/abs/2310.01405))

**Inference-time intervention (ITI).** Improving LLM truthfulness by using linear probes to identify attention heads encoding truth-related information, then shifting their activations along the "truthful direction" during inference. ([source](https://arxiv.org/abs/2306.03341))

**Contrastive activation addition (CAA).** Computing steering vectors by averaging the difference in residual stream activations between paired positive and negative behavioral examples, then adding them at all token positions during inference. ([source](https://arxiv.org/abs/2312.06681))

**ROME (Rank-One Model Editing).** Editing specific factual associations by performing a rank-one update to the weight matrix of a critical mid-layer MLP module, informed by causal tracing. ([source](https://arxiv.org/abs/2202.05262))

**MEMIT (Mass-Editing Memory in a Transformer).** Scaling factual knowledge editing to thousands of simultaneous associations by spreading updates across multiple critical MLP layers using a least-squares objective.
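The contrastive-activation-addition recipe above reduces to a mean difference plus a scaled addition. A toy sketch with synthetic "activations" (the positive class is shifted along dimension 0 by construction, so we know what the steering vector should recover):

```python
import numpy as np

# CAA sketch on synthetic data: positive examples are shifted along dim 0.
rng = np.random.default_rng(5)
d_model = 6

pos_acts = rng.normal(size=(20, d_model)) + np.array([2, 0, 0, 0, 0, 0])
neg_acts = rng.normal(size=(20, d_model))

# Steering vector = mean activation difference between the paired classes.
steering_vector = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def forward_with_steering(resid, alpha=1.0):
    """Add the steering vector to the residual stream during inference."""
    return resid + alpha * steering_vector

resid = rng.normal(size=d_model)
steered = forward_with_steering(resid, alpha=2.0)
```

The recovered vector is concentrated along the direction in which the contrast pairs actually differ, which is exactly what makes adding it shift behavior in that direction.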
([source](https://arxiv.org/abs/2210.07229))

**Concept erasure (LEACE).** A closed-form method for removing specified concepts from representations by computing an optimal linear projection that provably prevents any linear classifier from detecting the target concept. ([source](https://arxiv.org/abs/2306.03819))

### 2f. Visualization and automated methods

**Feature visualization.** Generating synthetic inputs via optimization that maximally activate specific neurons or features, revealing what visual features the network has learned at each level of abstraction. ([source](https://distill.pub/2017/feature-visualization/))

**Max activating dataset examples.** Finding real inputs from a dataset that produce the highest activation values for a given neuron or feature, providing concrete examples of what that unit responds to. ([source](https://distill.pub/2020/circuits/zoom-in/))

**Feature dashboards.** Visual interfaces aggregating key information about individual SAE features—including top activating examples, activation histograms, logit effects, and auto-generated explanations—for rapid human interpretation. ([source](https://www.neuronpedia.org/))

**Automated interpretability.** Using a large language model (GPT-4) to automatically generate natural-language explanations of what individual neurons respond to, then scoring explanations by having the LLM simulate activations. ([source](https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html))

**ACDC (Automatic Circuit DisCovery).** An automated algorithm that iteratively prunes edges from the computational graph by measuring (via activation patching) whether each edge's contribution to a task-specific metric exceeds a threshold.
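Finding max activating dataset examples, as described above, is just a sorted scan over activations. A toy sketch with a synthetic "dataset" and a single hypothetical ReLU neuron:

```python
import numpy as np

# Max-activating-examples sketch: scan a synthetic dataset and return the
# inputs that most excite one toy neuron.
rng = np.random.default_rng(6)
dataset = rng.normal(size=(100, 8))  # 100 inputs, 8 input dimensions
w_neuron = rng.normal(size=8)        # the neuron's input weights

def max_activating_examples(data, weights, top_k=5):
    acts = np.maximum(data @ weights, 0.0)   # ReLU neuron activations
    order = np.argsort(acts)[::-1][:top_k]   # indices of top examples
    return order, acts[order]

idx, top_acts = max_activating_examples(dataset, w_neuron)
```

In practice the same loop runs over millions of tokens, and the returned examples (with their surrounding context) are what a researcher reads to guess the feature's meaning.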
([source](https://arxiv.org/abs/2304.14997))

**Neuron2Graph (N2G).** Automatically constructing an interpretable graph representation of a neuron's activation behavior by taking maximally activating examples, computing token saliency, and creating a searchable trie structure. ([source](https://arxiv.org/abs/2305.19911))

**Sparse feature circuits.** Circuits described in terms of SAE features rather than polysemantic neurons, discovered via attribution-patching-based indirect effects, yielding causally implicated subnetworks of human-interpretable features. ([source](https://arxiv.org/abs/2403.19647))

---

## 3. Architectural components under the microscope

Each transformer component has been studied individually to understand its computational role.

**Attention heads.** Individual attention heads are treated as independent, additive computational units that each read from and write to the residual stream, with each head implementing its own information-moving function analyzable via its QK and OV circuits. ([source](https://transformer-circuits.pub/2021/framework/index.html))

**MLP layers.** Feed-forward layers operate as key-value memories where each neuron's incoming weights correlate with textual patterns and outgoing weights induce distributions over output tokens, constituting two-thirds of a transformer's parameters. ([source](https://arxiv.org/abs/2012.14913))

**Induction heads.** Attention heads implementing a pattern-completion algorithm of the form [A][B]…[A] → [B], typically via a two-head circuit, constituting a proposed primary mechanism for the majority of in-context learning in transformers. ([source](https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html))

**Copy/suppression heads.** Attention heads (notably L10H7 in GPT-2 Small) that suppress naive copying behavior by attending to tokens that earlier layers have predicted and reducing their logit scores, improving calibration across ~77% of behavior.
([source](https://arxiv.org/abs/2310.04625))

**Previous token heads.** Heads whose attention pattern predominantly attends to the immediately preceding token, serving as a key component in induction circuits by marking each token's predecessor. ([source](https://arxiv.org/abs/2211.00593))

**Backup heads.** Heads that do not normally perform a function but take it over when primary heads are ablated—the "Hydra effect" of emergent self-repair that occurs even in models trained without dropout. ([source](https://arxiv.org/abs/2307.15771))

**QK and OV circuits.** Each attention head decomposes into two independent circuits: the QK circuit (W_Q^T W_K) determining which tokens attend to which, and the OV circuit (W_O W_V) determining what information is moved when attention is paid. ([source](https://transformer-circuits.pub/2021/framework/index.html))

**Layer norm.** Layer normalization introduces a nonlinear, context-dependent scaling operation that complicates interpretability; it is typically "folded in" to adjacent weight matrices during analysis but contributes to self-repair effects. ([source](https://transformer-circuits.pub/2021/framework/index.html))

**Embedding and unembedding matrices.** The embedding matrix converts tokens into residual stream vectors and the unembedding projects back to vocabulary distributions; together they capture bigram statistics (W_E^T W_U) in zero-layer transformers and enable the logit lens technique. ([source](https://transformer-circuits.pub/2021/framework/index.html))

**Skip connections / residual connections.** The architectural feature enabling the "residual stream" view: by adding each layer's output to a running sum rather than replacing it, they make the transformer interpretable as independent modules reading from and writing to a shared channel.
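The QK/OV decomposition above can be checked directly with toy weights. Note a convention detail: with weight matrices stored as `(d_model, d_head)`, as below, the effective QK map is `W_Q @ W_K.T`, which matches the W_Q^T W_K of the transposed convention used in the framework paper.

```python
import numpy as np

# QK/OV sketch: collapse a head's four weight matrices into its two
# effective d_model x d_model bilinear maps (toy dimensions).
rng = np.random.default_rng(7)
d_model, d_head = 8, 2

W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))

QK = W_Q @ W_K.T  # attention score between residual-stream directions
OV = W_V @ W_O    # what the head writes, as a function of what it reads

q_vec = rng.normal(size=d_model)  # residual vector at the query position
k_vec = rng.normal(size=d_model)  # residual vector at the key position

# The attention score via the low-rank factors equals the score via QK.
score_factored = (q_vec @ W_Q) @ (k_vec @ W_K)
```

Both `QK` and `OV` are low-rank (rank at most `d_head`), which is why a head can be analyzed as two small, independent linear maps.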
([source](https://transformer-circuits.pub/2021/framework/index.html))

**Negative heads.** Attention heads that systematically reduce the logit of the correct output token—first identified as "negative name mover heads" in the IOI circuit, later comprehensively explained as performing copy suppression. ([source](https://arxiv.org/abs/2310.04625))

---

## 4. Discovered circuits and behaviors

These are specific algorithms researchers have reverse-engineered from trained models, providing concrete evidence that mech interp can uncover real computation.

**Indirect Object Identification (IOI) circuit.** A circuit of **26 attention heads** in GPT-2 Small, grouped into 7 classes (duplicate token, S-inhibition, name mover heads, etc.), implementing an algorithm to predict indirect objects in sentences like "John and Mary went to the store, John gave the bag to [Mary]." ([source](https://arxiv.org/abs/2211.00593))

**Induction circuit.** A two-attention-head circuit where a previous token head copies predecessor identity and an induction head completes patterns [A][B]…[A] → [B], constituting the proposed primary mechanism for in-context learning. ([source](https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html))

**Greater-than circuit.** A circuit in GPT-2 Small that computes whether a year is greater than a preceding year in sentences like "The war lasted from 1745 to 17__", using specific attention heads for copying and MLPs for sharpening outputs. ([source](https://arxiv.org/abs/2305.00586))

**Docstring circuit.** A circuit in a 4-layer attention-only transformer predicting repeated argument names in Python docstrings, featuring composition between heads across layers including previous token heads and copying heads.
([source](https://www.alignmentforum.org/posts/u6KXXmKFbXfWzoAXn/a-circuit-for-python-docstrings-in-a-4-layer-attention-only))

**Modular addition / grokking.** One-layer transformers trained on modular addition learn a **Fourier multiplication algorithm** mapping inputs onto rotations in ℝ² using trigonometric identities, and grokking is explained as three continuous phases (memorization, circuit formation, cleanup). ([source](https://arxiv.org/abs/2301.05217))

**Copy suppression.** A mechanism implemented by "negative heads" (notably L10H7 in GPT-2 Small) that suppresses naive copying by attending to tokens earlier layers predicted and reducing their logit scores, improving overall calibration. ([source](https://arxiv.org/abs/2310.04625))

**Successor heads.** Attention heads found across multiple architectures (GPT-2, Pythia, Llama-2, 31M to 12B parameters) that increment tokens with natural orderings (Monday→Tuesday, 2→3), using abstract "mod-10" numeric features shared across model families. ([source](https://arxiv.org/abs/2312.09230))

**Factual recall circuits.** Factual associations are stored as localized, directly-editable computations in middle-layer MLP modules, which are decisive when processing the last token of the subject entity, as revealed by causal tracing and validated by ROME. ([source](https://arxiv.org/abs/2202.05262))

**Othello-GPT world models.** A GPT trained solely to predict legal Othello moves develops an emergent internal board representation extractable via probes and causally interventionable, demonstrating that sequence models can learn world models from next-token prediction. ([source](https://arxiv.org/abs/2210.13382))

**S-inhibition heads.** Attention heads in the IOI circuit (heads 7.3, 7.9, 8.6, 8.10 in GPT-2 Small) that inhibit the subject token by writing negative signals into the queries of Name Mover Heads, steering attention toward the indirect object.
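The trigonometric identity behind the Fourier multiplication algorithm for modular addition can be verified directly: composing a rotation by 2πa/p with a rotation by 2πb/p gives a rotation by 2π(a+b)/p, so reading off the resulting angle recovers (a + b) mod p. This sketch only demonstrates the math, not the learned network itself.

```python
import numpy as np

p = 113  # modulus used in the grokking paper (Progress measures, 2301.05217)

def rotation(n, p=p):
    """2D rotation matrix by angle 2*pi*n/p."""
    theta = 2 * np.pi * n / p
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

def mod_add_via_rotations(a, b, p=p):
    """Compute (a + b) mod p by composing rotations and reading the angle."""
    composed = rotation(a) @ rotation(b)  # rotation by 2*pi*(a+b)/p
    angle = np.arctan2(composed[1, 0], composed[0, 0])
    return int(round(angle * p / (2 * np.pi))) % p

assert mod_add_via_rotations(50, 100) == (50 + 100) % p  # 150 mod 113 = 37
```

The trained model does something analogous in its embeddings and MLP: it represents each input on a few such circles, multiplies the trig terms, and reads out the answer frequency by frequency.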
([source](https://arxiv.org/abs/2211.00593))

**Name mover heads.** Heads in the IOI circuit (notably 9.9, 10.0, 9.6) that attend to the indirect object name and directly move its information to the final position via the OV circuit, boosting the correct name's logit. ([source](https://arxiv.org/abs/2211.00593))

**Negative name mover heads.** Heads in the IOI circuit that move name information with a negative contribution to the logit difference, acting as a form of self-correction. ([source](https://arxiv.org/abs/2211.00593))

**Acronym circuit.** A circuit of 8 attention heads in GPT-2 Small, grouped into three classes (previous token, information mover, letter mover heads), predicting three-letter acronyms by extracting the first letter of each word. ([source](https://arxiv.org/abs/2405.04156))

**Gender bias circuit.** Circuits involved in pronoun gender agreement identified through causal mediation analysis, with specific attention heads mediating gender bias in language models. ([source](https://arxiv.org/abs/2004.12265))

---

## 5. Evaluation and validation

How do we know if an interpretation is correct? These methods provide the field's answer.

**Faithfulness.** Whether a proposed circuit or interpretation accurately reflects the model's true internal computation, operationalized by testing whether the circuit alone reproduces the full model's behavior when all other components are ablated. ([source](https://arxiv.org/abs/2211.00593))

**Completeness.** Whether an explanation accounts for all components that materially contribute to the model's behavior, tested by verifying that ablating the complement of the circuit does not significantly degrade performance. ([source](https://arxiv.org/abs/2211.00593))

**Minimality.** Whether a circuit uses only necessary components, tested by verifying that removing any individual component causes meaningful degradation.
([source](https://arxiv.org/abs/2211.00593))

**Causal scrubbing as evaluation.** Rigorously testing mechanistic hypotheses by treating them as claims about which activations can be resampled without affecting behavior, then performing behavior-preserving resampling to check. ([source](https://www.alignmentforum.org/s/h95ayYYwMebGEYN5y))

**Ablation-based evaluation.** A family of validation methods (zero, mean, resample ablation) that replace activations inside or outside a circuit and measure performance effects, with recent work showing these metrics are sensitive to methodological choices. ([source](https://arxiv.org/abs/2407.08734))

**Feature splitting.** A phenomenon where, as SAE dictionary size increases, a single concept (e.g., "math") splits into finer sub-features ("algebra," "geometry"), reflecting hierarchical structure in the model's representations. ([source](https://arxiv.org/abs/2409.14507))

**Feature absorption.** A failure mode where seemingly monosemantic parent features fail to activate in contexts they should, because their activation gets "absorbed" into more specific child features due to sparsity optimization. ([source](https://arxiv.org/abs/2409.14507))

**KL divergence for circuit evaluation.** Using KL divergence between full model and circuit output distributions as a task-agnostic completeness metric, capturing distributional differences rather than just single logit comparisons. ([source](https://arxiv.org/abs/2304.14997))

**Faithfulness metrics robustness.** Recent work showing that standard circuit faithfulness metrics (logit difference recovery, KL divergence) are sensitive to ablation type, granularity, and evaluation methodology. ([source](https://arxiv.org/abs/2407.08734))

**Ground truth circuits in toy models (Tracr).** Using models with known, hand-coded circuits—compiled from human-readable RASP programs via the Tracr compiler—as evaluation benchmarks where ground truth is known by construction.
([source](https://arxiv.org/abs/2301.05062))

**Hypothesis testing the circuit hypothesis.** Formal statistical approaches to evaluating whether discovered circuits are significant explanations of model behavior versus artifacts. ([source](https://arxiv.org/abs/2211.00593))

---

## 6. Scaling and frontier research

The field's cutting edge: applying mech interp to production-scale models and developing the next generation of methods.

**Scaling Monosemanticity.** Anthropic's landmark work scaling SAEs to **Claude 3 Sonnet**, extracting **34 million interpretable features** from a production model, demonstrating that dictionary learning can identify abstract, multimodal, and safety-relevant features in frontier LLMs. ([source](https://transformer-circuits.pub/2024/scaling-monosemanticity/))

**Crosscoders.** SAEs trained simultaneously across multiple model layers (or across base/chat model pairs) that identify shared cross-layer features, reducing redundant feature duplication and simplifying circuit analysis. ([source](https://transformer-circuits.pub/2024/crosscoders/index.html))

**Transcoders.** SAE variants that approximate an MLP sublayer's input-output function (rather than reconstructing a single activation vector), enabling input-invariant circuit analysis by replacing dense MLP computation with a wider, sparsely-activating layer. ([source](https://arxiv.org/abs/2406.11944))

**Circuit tracing / attribution graphs.** Anthropic's March 2025 method using cross-layer transcoders to build attribution graphs that trace the step-by-step computation of language models, applied to Claude 3.5 Haiku to reveal multi-step reasoning, planning, and safety-relevant circuits.
([source](https://transformer-circuits.pub/2025/attribution-graphs/methods.html))

**On the Biology of a Large Language Model.** The companion paper applying attribution graphs to study Claude 3.5 Haiku's internal reasoning across a wide range of phenomena including multi-step reasoning, poetry planning, medical diagnosis chains, and hallucination mechanisms. ([source](https://transformer-circuits.pub/2025/attribution-graphs/biology.html))

**QK attributions for attention decomposition.** Anthropic's July 2025 extension of attribution graphs that decomposes attention patterns as bilinear functions of feature activations on query and key positions, enabling full circuit tracing through attention. ([source](https://transformer-circuits.pub/2025/attention-qk/index.html))

**Feature steering at scale.** Clamping SAE feature activations in Claude 3 Sonnet to causally steer behavior—exemplified by "Golden Gate Claude," where clamping a bridge feature caused the model to identify as the Golden Gate Bridge. ([source](https://transformer-circuits.pub/2024/scaling-monosemanticity/))

**Sparse probing at scale.** Using sparse linear probes on SAE features or model activations to study representations and classify safety-relevant behaviors in large models. ([source](https://transformer-circuits.pub/2024/scaling-monosemanticity/))

**Multi-layer / deep SAEs.** SAE architectures operating across multiple layers simultaneously, including Matryoshka SAEs learning multi-level features and RouteSAE using routers to dynamically integrate multi-layer activations. ([source](https://aclanthology.org/2025.emnlp-main.346.pdf))

**SAE feature geometry and structure.** Research demonstrating that some features have non-linear (e.g., circular) geometric structure in activation space—days of the week and months of the year form circles used for modular arithmetic.
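Feature steering of the "Golden Gate Claude" kind reduces to clamping one SAE latent before decoding and writing the result back into the residual stream. This toy sketch uses a made-up random decoder, so shapes and the clamp value are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_sae = 32, 256
W_dec = rng.normal(size=(d_sae, d_model)) * 0.1   # toy SAE decoder
b_dec = np.zeros(d_model)

def steer(f, feature_idx, clamp_value):
    """Clamp one SAE feature to a fixed value, then decode back into
    the residual stream; the decoded vector would replace the model's
    original activation at that site."""
    f = f.copy()
    f[:, feature_idx] = clamp_value
    return f @ W_dec + b_dec

f = np.maximum(0.0, rng.normal(size=(4, d_sae)))  # toy feature activations
baseline = f @ W_dec + b_dec
steered = steer(f, feature_idx=7, clamp_value=10.0)

# The intervention moves every activation along one decoder direction.
delta = steered - baseline
```

Because only one latent changes, `delta` is always a multiple of that feature's decoder row, which is what makes the intervention interpretable.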
([source](https://arxiv.org/abs/2405.14860))

**Mechanistic interpretability for safety.** Research investigating how SAEs, circuit tracing, and probes can identify deception, sycophancy, and dangerous-content features in models, enabling monitoring, steering, and alignment verification. ([source](https://transformer-circuits.pub/2024/scaling-monosemanticity/))

**Alignment implications.** The discovery that SAE features for deception, sycophancy, and power-seeking exist in production models and can potentially be monitored or suppressed, informing concrete alignment strategies. ([source](https://transformer-circuits.pub/2024/scaling-monosemanticity/))

**Computational interpretability.** Research examining the theoretical computational complexity and tractability of fully decomposing neural network computations into human-understandable circuits. ([source](https://transformer-circuits.pub/2025/attribution-graphs/methods.html))

---

## 7. Tools and libraries

The open-source infrastructure that makes mech interp research possible for the broader community.

**TransformerLens.** Neel Nanda's open-source library for mechanistic interpretability of GPT-style language models, providing HookPoints on every activation, caching, and utilities for patching, DLA, and circuit analysis across **50+ supported models**. ([source](https://github.com/TransformerLensOrg/TransformerLens))

**SAELens.** A comprehensive library (by Joseph Bloom, Curt Tigges, et al.) for training, analyzing, and visualizing SAEs on language models, supporting TopK, Gated, and JumpReLU architectures with TransformerLens and HuggingFace integration. ([source](https://github.com/jbloomAus/SAELens))

**Neuronpedia.** An open-source web platform (by Johnny Lin / Decode Research) for exploring, annotating, and sharing SAE features, providing interactive feature dashboards, circuit tracing visualization, and community-driven interpretability research.
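The tuned-lens idea mentioned below, fitting an affine translator per layer so intermediate hidden states decode sensibly through the shared unembedding, can be sketched with a least-squares fit. The real library trains against KL divergence to the final-layer logits; plain least squares on synthetic hidden states is a simplification here, and all shapes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, n = 16, 40, 200

W_U = rng.normal(size=(d_model, vocab))      # shared unembedding
h_final = rng.normal(size=(n, d_model))      # final-layer hidden states
# Intermediate layer: a linearly transformed, noisy version of the finals.
A_true = rng.normal(size=(d_model, d_model))
h_mid = h_final @ A_true + 0.01 * rng.normal(size=(n, d_model))

# Fit an affine translator h_mid -> h_final by least squares, so that
# translated states decode through W_U like the final-layer states do.
X = np.hstack([h_mid, np.ones((n, 1))])      # append bias column
coef, *_ = np.linalg.lstsq(X, h_final, rcond=None)
translated = X @ coef

# The logit lens decodes h_mid directly; the tuned lens translates first.
raw_err = np.abs(h_mid @ W_U - h_final @ W_U).mean()
tuned_err = np.abs(translated @ W_U - h_final @ W_U).mean()
```

The translated states decode far more faithfully than the raw intermediate states, which is the tuned lens's advertised advantage over the logit lens.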
([source](https://www.neuronpedia.org/))

**nnsight.** A Python library (by Fiotto-Kaufman et al., Northeastern/NDIF) for interpreting and intervening on any PyTorch model's internals through a deferred-execution tracing system, supporting local and remote execution on large models. ([source](https://github.com/ndif-team/nnsight))

**pyvene.** Stanford NLP's library for performing customizable interventions on PyTorch model internals, supporting activation patching, causal abstraction, and interchange interventions via a unified configuration-based API. ([source](https://github.com/stanfordnlp/pyvene))

**CircuitsVis.** A React/Python visualization library by Alan Cooney and Neel Nanda for creating interactive attention pattern displays, colored token views, and activation visualizations in Jupyter notebooks and web applications. ([source](https://github.com/TransformerLensOrg/CircuitsVis))

**Baukit.** David Bau's lightweight PyTorch toolkit for tracing and editing internal activations, providing Trace/TraceDict utilities and interactive widgets—used extensively in ROME and knowledge editing research. ([source](https://github.com/davidbau/baukit))

**Activation Atlas.** An interactive visualization technique (Carter et al., 2019) using feature inversion on millions of activations to create explorable 2D maps of learned features, revealing how networks organize visual concepts hierarchically. ([source](https://distill.pub/2019/activation-atlas/))

**Transformer Debugger (OpenAI).** OpenAI's tool combining automated interpretability with SAEs for investigating transformer behaviors, enabling code-free exploration of model decisions, attention patterns, and neuron ablation. ([source](https://github.com/openai/transformer-debugger))

**tuned-lens library.** The implementation of the tuned lens technique, training affine probes at each layer to decode hidden states into vocabulary distributions as a more reliable alternative to the logit lens.
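Most of the libraries above share one core pattern: register a hook at a named activation site, run the model, and read or overwrite the value passing through. A dependency-free caricature of that pattern follows; `HookPoint` and `TinyModel` are made-up classes for illustration, not the TransformerLens or baukit APIs.

```python
class HookPoint:
    """Identity layer that lets callers observe or replace the value
    passing through it -- the core trick behind hooking libraries."""
    def __init__(self, name):
        self.name = name
        self.hooks = []

    def __call__(self, x):
        for fn in self.hooks:
            out = fn(x, self)
            if out is not None:
                x = out          # a hook may overwrite the activation
        return x

class TinyModel:
    def __init__(self):
        self.hook_mid = HookPoint("blocks.0.hook_mid")

    def forward(self, x):
        x = x * 2
        x = self.hook_mid(x)     # named intervention site
        return x + 1

model = TinyModel()

# Caching run: record the mid-layer value without changing it.
cache = {}
model.hook_mid.hooks.append(lambda v, hp: cache.setdefault(hp.name, v))
out_cached = model.forward(3)    # mid value 6 is cached; output is 7

# Patching run: overwrite the mid-layer value with 0.
model.hook_mid.hooks = [lambda v, hp: 0]
out_patched = model.forward(3)   # mid value forced to 0; output is 1
```

Real libraries add batching, context managers, and naming schemes over model layers, but the observe/overwrite contract is the same.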
([source](https://github.com/AlignmentResearch/tuned-lens))

**SAE Vis / SAE Dashboard.** Libraries for generating interactive feature dashboards displaying top-activating examples, activation distributions, and logit effects for individual SAE features, integrated with SAELens and Neuronpedia. ([source](https://github.com/jbloomAus/SAELens))

**Attribution Graphs Frontend.** Anthropic's open-source frontend code for interactive exploration of attribution graphs from their circuit tracing work. ([source](https://github.com/anthropics/attribution-graphs-frontend))

---

## 8. Related subfields and approaches

Mechanistic interpretability draws from and connects to many adjacent research areas.

**Developmental interpretability.** Studying how neural network representations, circuits, and computational structures emerge through phase transitions during training, using tools from Singular Learning Theory. ([source](https://devinterp.com/))

**Singular learning theory (SLT) connections.** Watanabe's algebraic-geometric framework provides tools (the local learning coefficient / RLCT) for understanding phase transitions and model complexity in neural networks; Jesse Hoogland and Daniel Murfet (Timaeus) lead efforts connecting SLT to alignment. ([source](https://www.lesswrong.com/s/SfFQE8DXbgkjk62JK/p/TjaeCWvLZtEDAS5Ex))

**Toy models of superposition.** Anthropic's foundational study using small ReLU networks demonstrating that neural networks represent more features than dimensions via superposition, revealing phase changes and geometric structures. ([source](https://transformer-circuits.pub/2022/toy_model/index.html))

**Compressed sensing connections.** The Toy Models of Superposition paper explicitly notes that feature superposition is "very closely related to the long-studied topic of compressed sensing in mathematics," connecting sparse recovery theory to dictionary learning.
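The toy-model setup is small enough to write down directly: more sparse features than hidden dimensions, compressed through a matrix W and recovered with a ReLU. The dimensions and sparsity level below are illustrative choices, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, d_hidden = 20, 5    # more features than dimensions

# Random unit-norm feature directions packed into a small space.
W = rng.normal(size=(n_features, d_hidden))
W /= np.linalg.norm(W, axis=1, keepdims=True)
b = np.zeros(n_features)

def toy_model(x):
    """x -> ReLU(x W W^T + b): compress to d_hidden, then reconstruct."""
    h = x @ W                    # (batch, d_hidden) bottleneck
    return np.maximum(0.0, h @ W.T + b)

# Sparse inputs: each feature active with low probability.
x = (rng.random((64, n_features)) < 0.05) * rng.random((64, n_features))
x_hat = toy_model(x)

# Interference between features shows up as nonzero off-diagonal W W^T.
gram = W @ W.T
interference = np.abs(gram - np.diag(np.diag(gram))).max()
```

Because 20 directions cannot be orthogonal in 5 dimensions, `interference` is unavoidably nonzero; superposition works only because the sparsity of `x` makes simultaneous interference rare.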
([source](https://transformer-circuits.pub/2022/toy_model/index.html))

**Information theory approaches.** Using information-theoretic tools (mutual information, entropy, information bottleneck) to measure and understand what neural network representations encode. ([source](https://arxiv.org/abs/1703.00810))

**Geometric interpretability.** Understanding representations through their geometric structure—polytopes in superposition, representation manifolds, and spatial organization of features. ([source](https://transformer-circuits.pub/2022/toy_model/index.html))

**Concept bottleneck models.** Models forcing intermediate predictions through human-interpretable concepts (e.g., "bone spur" for arthritis prediction), enabling both interpretability and test-time intervention on concept predictions. ([source](https://arxiv.org/abs/2007.04612))

**TCAV (Testing with Concept Activation Vectors).** Using directional derivatives along learned Concept Activation Vectors to quantify how important a user-defined high-level concept (e.g., "striped") is to a model's classification. ([source](https://arxiv.org/abs/1711.11279))

**Network pruning connections.** Network pruning—removing unnecessary weights or neurons—is related to finding minimal circuits, as both seek to identify the smallest sufficient computational subgraph. ([source](https://arxiv.org/abs/2404.14082))

**Knowledge neurons.** Specific MLP neurons whose activation is positively correlated with the expression of particular factual knowledge, and which can be manipulated to edit facts without fine-tuning. ([source](https://arxiv.org/abs/2104.08696))

**Causal abstraction.** A framework aligning neural network representations with variables in interpretable causal models, with interchange interventions experimentally verifying that neural representations have the causal properties of their aligned variables.
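The TCAV recipe above can be sketched end to end: separate concept activations from random activations with a linear direction, then score the fraction of inputs whose class-logit gradient points along that direction. Everything here is synthetic, and the difference-of-means CAV is a stand-in for the paper's trained linear classifier.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10

# Synthetic layer activations: concept examples are shifted along a
# hidden direction; random counterexamples are plain noise.
concept_dir = np.zeros(d)
concept_dir[0] = 1.0
concept_acts = rng.normal(size=(50, d)) + 3.0 * concept_dir
random_acts = rng.normal(size=(50, d))

# CAV: difference of means here, standing in for a linear classifier.
cav = concept_acts.mean(0) - random_acts.mean(0)
cav /= np.linalg.norm(cav)

# Per-input gradient of the class logit w.r.t. this layer. Synthetic:
# a fixed linear readout, so every input shares the same gradient.
readout = concept_dir + 0.1 * rng.normal(size=d)
grads = np.tile(readout, (200, 1))

# TCAV score: fraction of inputs whose directional derivative along
# the CAV is positive.
tcav_score = float((grads @ cav > 0).mean())
```

In the real method the gradients vary per input, so the score lands strictly between 0 and 1 and is compared against scores for random directions as a significance check.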
([source](https://arxiv.org/abs/2106.02997))

**Distributed alignment search (DAS).** A gradient descent method for finding alignments between interpretable causal variables and distributed neural representations in non-standard bases, removing the need for brute-force search. ([source](https://arxiv.org/abs/2303.02536))

**Natural abstractions hypothesis.** John Wentworth's hypothesis that certain abstractions are "natural" in the sense that a wide variety of cognitive systems will converge on using approximately the same low-dimensional summaries for high-dimensional systems. ([source](https://www.lesswrong.com/posts/cy3BhHrGinZCp3LXE/testing-the-natural-abstraction-hypothesis-project-intro))

**Microscope AI.** Chris Olah's concept of using AI as a "microscope"—training a predictive model and then using interpretability to inspect what it learned, extracting knowledge without deploying the model as an acting agent. ([source](https://www.alignmentforum.org/posts/X2i9dQQK3gETCyqh2/chris-olah-s-views-on-agi-safety))

**Eliciting latent knowledge (ELK).** ARC's central open problem: training a reporter to convey what an AI model internally "believes" to be true, rather than what it predicts a human would believe, to detect cases like sensor tampering. ([source](https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit))

**Mechanistic anomaly detection.** ARC's approach of using mechanistic explanations to flag outputs produced by "unusual reasons" (abnormal internal mechanisms), applicable to detecting deceptive alignment and backdoors. ([source](https://www.alignment.org/blog/mechanistic-anomaly-detection-and-elk/))

**Representation topology.** Studying neural activation manifolds via persistent homology and topological data analysis to understand how representations are organized and connected in high-dimensional space.
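The interchange-intervention step at the heart of DAS can be shown concretely: rotate two representations into a candidate basis, swap a subspace, rotate back. In DAS the rotation is learned by gradient descent to maximize interchange accuracy; here it is simply a random orthonormal matrix, so the mechanics are real but the alignment is not.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 3   # representation size, intervened-subspace size

# Orthonormal basis (random here; DAS optimizes it by gradient descent).
R, _ = np.linalg.qr(rng.normal(size=(d, d)))

def interchange(base, source, R, k):
    """Swap the first k coordinates of the rotated representations,
    then rotate back into the model's original basis."""
    zb, zs = base @ R, source @ R
    zb[..., :k] = zs[..., :k]      # interchange the aligned variable
    return zb @ R.T

base = rng.normal(size=d)
source = rng.normal(size=d)
patched = interchange(base, source, R, k)

# Check the bookkeeping: in the rotated basis, the first k coordinates
# of the patched vector match the source, and the rest match the base.
z = patched @ R
```

If the learned subspace really encodes a causal variable, running the model on `patched` should produce the counterfactual behavior the causal model predicts for that swap.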
([source](https://arxiv.org/abs/1905.12200))

**Binding problem in neural networks.** The question of how networks bind multiple features (e.g., color and shape) to specific objects in their representations, connected to superposition and feature composition. ([source](https://transformer-circuits.pub/2022/toy_model/index.html))

---

## 9. Key organizations and research groups

The institutional landscape driving mech interp forward.

**Anthropic Interpretability Team.** Led by Chris Olah, this team pioneered the Circuits framework, SAEs for decomposing superposition, and scaled interpretability to frontier models—extracting millions of features from Claude and developing circuit tracing. ([source](https://transformer-circuits.pub/))

**EleutherAI.** An open-source AI research collective contributing interpretability tools (tuned lens), open models (Pythia), and research that enables the broader community to conduct mech interp research. ([source](https://www.eleuther.ai/))

**Redwood Research.** An AI safety lab known for developing causal scrubbing, a rigorous method for evaluating circuit-level hypotheses, and for work on AI control strategies. ([source](https://www.redwoodresearch.org/))

**Google DeepMind.** Home to Neel Nanda's open-source mech interp initiative, Gated/JumpReLU SAE research (Rajamanoharan et al.), and the Gemma Scope project providing open SAEs for Gemma 2. ([source](https://deepmind.google/research/))

**OpenAI Interpretability.** Contributions include automated interpretability (Bills et al.), the Transformer Debugger tool, TopK SAEs (Gao et al.), and early Circuits research when Chris Olah was at the organization. ([source](https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html))

**Apart Research.** An organization running mechanistic interpretability hackathons and research sprints, lowering barriers to entry and producing community-driven results.
([source](https://www.apartresearch.com/))

**MATS (ML Alignment & Theory Scholars).** An independent fellowship in Berkeley and London connecting researchers with top alignment mentors, training the next generation through structured 12-week research sprints. ([source](https://www.matsprogram.org/))

**Open Source Mechanistic Interpretability.** Neel Nanda's initiative to democratize mech interp through TransformerLens, educational content, the "200 Concrete Open Problems" sequence, and active community building. ([source](https://github.com/TransformerLensOrg/TransformerLens))

**Center for AI Safety (CAIS).** A research and field-building organization supporting AI safety research including interpretability, providing compute grants and publishing safety benchmarks. ([source](https://www.safe.ai/))

**Conjecture.** An AI safety company (founded by Connor Leahy) conducting research on interpretability, cognitive emulation, and alignment approaches focused on understanding and controlling AI systems. ([source](https://conjecture.dev/))

**FAR AI.** A research organization working on adversarial robustness, interpretability applications, and alignment, with published work on exploiting interpretability insights for attacks and defenses. ([source](https://far.ai/))

**Alignment Forum.** The primary community platform where mech interp research is discussed, debated, and published, serving as the intellectual hub for the alignment research community. ([source](https://www.alignmentforum.org/))

---

## 10. Applications

Where mechanistic interpretability meets the real world—from safety to model editing to scientific understanding.

**Detecting deception and monitoring.** Using linear probes on residual stream activations to detect when a model engages in deceptive reasoning or produces outputs for anomalous internal reasons, enabling real-time safety monitoring.
([source](https://www.anthropic.com/research/probes-catch-sleeper-agents))

**Sleeper agent / backdoor detection.** Anthropic demonstrated that deliberately backdoored "sleeper agent" models resist standard safety training, but simple linear probes on internal activations can detect backdoor triggers with **>99% AUROC**. ([source](https://arxiv.org/abs/2401.05566))

**Model editing.** Modifying specific model behaviors through targeted weight changes (ROME, MEMIT) informed by mechanistic understanding of where and how factual knowledge is stored. ([source](https://arxiv.org/abs/2202.05262))

**Understanding in-context learning.** Mechanistic explanations showing that in-context learning emerges through discrete phase transitions and is implemented via induction heads—attention circuits that match and copy patterns from context. ([source](https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html))

**Understanding chain-of-thought.** Investigating whether models actually use their generated reasoning steps internally or arrive at answers through separate circuits, with measurement of faithfulness in chain-of-thought reasoning. ([source](https://arxiv.org/abs/2305.04388))

**Understanding hallucinations.** Using mech interp to trace how factual recall circuits operate and fail, revealing when models confabulate by identifying internal mechanisms that produce unfaithful outputs. ([source](https://transformer-circuits.pub/2025/attribution-graphs/biology.html))

**Bias detection and mitigation.** Using causal mediation analysis to find neurons or features encoding protected attributes and understanding how these causally influence model outputs, enabling targeted debiasing interventions. ([source](https://arxiv.org/abs/2004.12265))

**Targeted adversarial attacks.** Using mechanistic insights about internal feature representations and circuits to craft more effective, targeted adversarial attacks against neural networks.
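The probing recipe behind these detection results fits in a few lines: a logistic probe trained on residual-stream activations, scored by AUROC. The activations below are synthetic (a hidden "trigger" direction plus noise), so any score this sketch produces says nothing about the >99% figure from the cited paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 32, 400

# Synthetic activations: "triggered" examples are shifted along a
# hidden direction that the probe must discover.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d)) + 4.0 * labels[:, None] * direction

# Logistic-regression probe trained with plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    z = np.clip(acts @ w + b, -30, 30)       # clip for numerical safety
    p = 1.0 / (1.0 + np.exp(-z))
    w -= 0.5 * acts.T @ (p - labels) / n
    b -= 0.5 * float((p - labels).mean())

scores = acts @ w + b

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    return float((pos[:, None] > neg[None, :]).mean())

probe_auroc = auroc(scores, labels)
```

The same probe, trained once, can then run on every forward pass at negligible cost, which is what makes activation probing attractive for deployment-time monitoring.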
([source](https://arxiv.org/abs/2404.14082))

**Feature-based content filtering.** Using SAE features to implement content moderation by identifying and intervening on specific interpretable features corresponding to harmful content categories. ([source](https://transformer-circuits.pub/2024/scaling-monosemanticity/))

**Unlearning.** Using mechanistic understanding of where specific knowledge or capabilities are stored to enable targeted "forgetting" (e.g., copyrighted or dangerous content) without degrading other capabilities. ([source](https://arxiv.org/abs/2310.10683))

**Jailbreak detection.** Monitoring internal activation patterns or SAE features associated with refusal/compliance circuits to detect when a model is being jailbroken into harmful outputs. ([source](https://www.anthropic.com/research/probes-catch-sleeper-agents))

**Improving model robustness.** Using interpretability insights about how models process information to identify vulnerabilities, improve out-of-distribution generalization, and strengthen defenses against distribution shift. ([source](https://arxiv.org/abs/2404.14082))

---

## Conclusion

This map covers **155+ distinct concepts** across the ten pillars of mechanistic interpretability—from the theoretical foundations (superposition, circuits, linear representations) through the complete toolkit (SAEs, activation patching, automated interpretability), the landmark empirical results (IOI circuit, induction heads, Othello-GPT), evaluation methodology, scaling to frontier models, open-source infrastructure, adjacent research fields, institutional landscape, and real-world applications.

Three structural observations emerge from this survey. First, the field has converged on a dominant paradigm: **SAE-based feature decomposition combined with causal intervention methods**, with Anthropic's 2025 circuit tracing work representing the current apex of this approach.
Second, the gap between toy-model understanding and frontier-model understanding has narrowed dramatically—attribution graphs now trace reasoning in Claude 3.5 Haiku, a production model. Third, the field's open-source culture (TransformerLens, SAELens, Neuronpedia) has been essential to its rapid growth, enabling hundreds of independent researchers to contribute discoveries. The central open challenge remains whether these methods can scale to provide the comprehensive, reliable understanding needed for safety guarantees as models grow more capable.

More: [[AI-written Mechanistic interpretability]]