
Interpreting LLMs

We have a microscope, not an X-ray. Explore how researchers are beginning to understand what happens inside language models.

What does an LLM actually "know"? For years, neural networks were black boxes. Researchers could measure what goes in and what comes out, but the internal representations remained opaque. A new field — mechanistic interpretability — is changing that. Using sparse autoencoders and attribution graphs, we can now identify meaningful features inside models and trace how information flows through them. This demo walks you through the key ideas.
Polysemantic Neurons: Why Individual Neurons Don't Tell Us Much

What We'd Hope For

One neuron per concept: dog · cat · car · tree · happy · sad
Monosemantic — one neuron, one concept

What Actually Happens

N1: dog · loyalty · four-legged furniture
N2: cat · independence · ancient Egypt
N3: car · speed · German engineering
N4: tree · family · data structure
N5: happy · yellow · 440 Hz
N6: sad · rain · minor key
Polysemantic — one neuron, many concepts (superposition)

Click any neuron on the right to see example sentences that activate it. Individual neurons are like tangled wires in a cable — each wire carries multiple signals simultaneously. The model packs far more concepts than it has neurons (superposition). We need a way to untangle them.
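The untangling problem can be sketched numerically: pack more concept directions than neurons into an activation space, and notice that a single coordinate (one neuron) mixes everything, while projecting onto a concept's direction still separates signals. A toy sketch with made-up dimensions and random directions, not a real model:

```python
import math
import random

random.seed(0)

D, N = 64, 200  # 64 "neurons" hold 200 concepts: superposition

def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Each concept gets a (nearly orthogonal) random direction in activation space.
concepts = [unit([random.gauss(0, 1) for _ in range(D)]) for _ in range(N)]

# The activation vector when concepts 3 and 7 are both present:
active = {3, 7}
act = [sum(concepts[c][i] for c in active) for i in range(D)]

# Any single neuron (coordinate) mixes contributions from every concept...
print("neuron 0 reads:", round(act[0], 3))

# ...but projecting onto each concept's direction largely separates them:
scores = [sum(a * w for a, w in zip(act, c)) for c in concepts]
top2 = sorted(range(N), key=lambda i: -scores[i])[:2]
print("highest-scoring concepts:", sorted(top2))
```

With nearly orthogonal directions, the two present concepts score close to 1 while absent concepts hover near 0; that residual interference is the price of superposition.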

From Neurons to Features
Sparse Autoencoder
Example features: Golden Gate Bridge · code syntax · French language · sarcasm · DNA sequences
Why Sparse?

At any given moment, only a tiny fraction of features are active. This sparsity is what makes them interpretable — each active feature tells us something specific.
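A sparse autoencoder is, structurally, a wide encoder/decoder pair trained with an L1 penalty that pushes most features to zero. A minimal forward-pass sketch with made-up sizes and random (untrained) weights; a real SAE learns W_enc and W_dec by gradient descent, which drives the active fraction far down:

```python
import random

random.seed(1)

D_MODEL, D_FEATS = 16, 64  # overcomplete: far more features than neurons

# Hypothetical random weights; a trained SAE would learn these.
W_enc = [[random.gauss(0, 0.2) for _ in range(D_FEATS)] for _ in range(D_MODEL)]
W_dec = [[random.gauss(0, 0.2) for _ in range(D_MODEL)] for _ in range(D_FEATS)]

def relu(x):
    return x if x > 0.0 else 0.0

def sae_forward(act):
    # Encode: project into the wider feature space; ReLU keeps features >= 0.
    feats = [relu(sum(act[i] * W_enc[i][j] for i in range(D_MODEL)))
             for j in range(D_FEATS)]
    # Decode: reconstruct the original activation from the features.
    recon = [sum(feats[j] * W_dec[j][i] for j in range(D_FEATS))
             for i in range(D_MODEL)]
    return feats, recon

act = [random.gauss(0, 1) for _ in range(D_MODEL)]
feats, recon = sae_forward(act)

# Training objective: reconstruction error plus an L1 penalty on features.
mse = sum((a - r) ** 2 for a, r in zip(act, recon)) / D_MODEL
loss = mse + 0.01 * sum(feats)
active_frac = sum(f > 0 for f in feats) / D_FEATS
print(f"features active: {active_frac:.0%}")
```

The L1 term is the "sparse" in sparse autoencoder: it penalizes every nonzero feature, so the model only keeps a feature active when it genuinely helps reconstruction.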

Feature Gallery

Click a feature card to see example texts and their activation strengths.

Tracing Information Flow

Attribution graphs show how features connect across layers — revealing how the model actually reasons. Select an example and click Animate to watch information flow.
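One way to picture an attribution edge: the contribution of an upstream feature to a downstream feature is roughly its activation times the effective connection weight between them. A toy sketch; the feature names and all numbers are invented for illustration:

```python
# Toy attribution edges: contribution(i -> j) ~= activation(i) * weight(i -> j).
# Feature names and numbers are made up to illustrate ranking edges.
upstream = {"Texas": 1.8, "capital": 1.5, "say a city": 0.9}
weight_to_austin = {"Texas": 0.7, "capital": 0.6, "say a city": 0.2}

edges = {name: a * weight_to_austin[name] for name, a in upstream.items()}

# Rank edges by how strongly each upstream feature drives "Austin".
for name, contrib in sorted(edges.items(), key=lambda kv: -kv[1]):
    print(f"{name:>10} -> Austin: {contrib:.2f}")
```

An attribution graph is many such edges stitched across layers, pruned to the strongest contributions, so you can read off a candidate circuit for a single prompt.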

Golden Gate Claude: When Features Go Wild

In May 2024, Anthropic researchers found a "Golden Gate Bridge" feature inside Claude. They amplified its activation — and the model became obsessed. Every response, no matter the topic, pivoted to the Golden Gate Bridge.


Why this matters: If amplifying a feature reliably changes behavior, that feature is causally real — not just a correlation. This is how researchers move from "we found a pattern" to "we understand a mechanism."
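The intervention itself is conceptually simple: add a scaled copy of the feature's direction to the model's activations at some layer. A sketch with tiny made-up vectors; real steering hooks into the residual stream of a transformer:

```python
# Feature steering sketch: nudge an activation vector along a feature direction.
# The vectors are tiny made-up stand-ins for real residual-stream states.
def steer(act, direction, strength):
    return [a + strength * d for a, d in zip(act, direction)]

golden_gate = [0.6, -0.2, 0.7, 0.1]  # hypothetical "Golden Gate" direction
act = [0.3, 0.5, -0.1, 0.2]          # hypothetical activation at one layer

steered = steer(act, golden_gate, strength=10.0)

# After a large nudge, the activation points mostly along the feature
# direction, which is why every answer drifts toward the bridge.
dot = sum(s * d for s, d in zip(steered, golden_gate))
print("alignment with feature:", round(dot, 2))
```

Setting strength to a large negative value suppresses the feature instead, which is the other half of the causal test.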

What We Don't Know Yet

  • Attribution graphs work well on only ~25% of prompts studied so far
  • Attention mechanisms aren't yet captured by these methods
  • Scaling to frontier models is an active research challenge
  • DeepMind found their SAEs underperformed simpler baselines on some safety tasks
  • "We have a microscope, not an X-ray — we see some things clearly but miss a lot"

Key Insight

Neurons are not concepts. The right unit of analysis is a "feature" — a direction in activation space that corresponds to a human-understandable concept. Sparse autoencoders extract these features, and attribution graphs show how information flows through them. We can even intervene on features to change model behavior. But this is still early — a powerful microscope, not a complete picture.

  1. Neurons ≠ concepts — features (directions in activation space) are the interpretable unit
  2. Internal representations — models form intermediate concepts that appear in neither the input nor the output (genuine multi-step reasoning)
  3. Causal intervention — suppressing or amplifying features changes behavior, proving they matter
  4. Still early — attribution graphs give clear insight on only ~25% of prompts studied; attention not yet integrated; a microscope, not an X-ray