
Interpreting LLMs

We have a microscope, not an X-ray. Explore how researchers are beginning to understand what happens inside language models.

What does an LLM actually "know"? For years, neural networks were black boxes. Researchers could measure what goes in and what comes out, but the internal representations remained opaque. A new field — mechanistic interpretability — is changing that. Using sparse autoencoders and attribution graphs, we can now identify meaningful features inside models and trace how information flows through them. This demo walks you through the key ideas.
Polysemantic Neurons: Why Individual Neurons Don't Tell Us Much

What We'd Hope For

One neuron per concept: dog · cat · car · tree · happy · sad
Monosemantic — one neuron, one concept

What Actually Happens

N1: dog · loyalty · four-legged furniture
N2: cat · independence · ancient Egypt
N3: car · speed · German engineering
N4: tree · family · data structure
N5: happy · yellow · 440 Hz
N6: sad · rain · minor key
Polysemantic — one neuron, many concepts (superposition)

Click any neuron on the right to see example sentences that activate it. Individual neurons are like tangled wires in a cable — each wire carries multiple signals simultaneously. The model packs far more concepts than it has neurons (superposition). We need a way to untangle them.
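The untangling problem can be sketched numerically: pack more concept directions than neurons into an activation space, and notice that a single coordinate (one neuron) mixes everything, while projecting onto a concept's direction still separates signals. A toy sketch with made-up dimensions and random directions, not a real model:

```python
import math
import random

random.seed(0)

D, N = 64, 200  # 64 "neurons" hold 200 concepts: superposition

def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Each concept gets a (nearly orthogonal) random direction in activation space.
concepts = [unit([random.gauss(0, 1) for _ in range(D)]) for _ in range(N)]

# The activation vector when concepts 3 and 7 are both present:
active = {3, 7}
act = [sum(concepts[c][i] for c in active) for i in range(D)]

# Any single neuron (coordinate) mixes contributions from every concept...
print("neuron 0 reads:", round(act[0], 3))

# ...but projecting onto each concept's direction largely separates them:
scores = [sum(a * w for a, w in zip(act, c)) for c in concepts]
top2 = sorted(range(N), key=lambda i: -scores[i])[:2]
print("highest-scoring concepts:", sorted(top2))
```

With nearly orthogonal directions, the two present concepts score close to 1 while absent concepts hover near 0; that residual interference is the price of superposition.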

From Neurons to Features
Sparse Autoencoder
Example features: Golden Gate Bridge · code syntax · French language · sarcasm · DNA sequences
Why Sparse?

At any given moment, only a tiny fraction of features are active. This sparsity is what makes them interpretable — each active feature tells us something specific.
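A sparse autoencoder is, structurally, a wide encoder/decoder pair trained with an L1 penalty that pushes most features to zero. A minimal forward-pass sketch with made-up sizes and random (untrained) weights; a real SAE learns W_enc and W_dec by gradient descent, which drives the active fraction far down:

```python
import random

random.seed(1)

D_MODEL, D_FEATS = 16, 64  # overcomplete: far more features than neurons

# Hypothetical random weights; a trained SAE would learn these.
W_enc = [[random.gauss(0, 0.2) for _ in range(D_FEATS)] for _ in range(D_MODEL)]
W_dec = [[random.gauss(0, 0.2) for _ in range(D_MODEL)] for _ in range(D_FEATS)]

def relu(x):
    return x if x > 0.0 else 0.0

def sae_forward(act):
    # Encode: project into the wider feature space; ReLU keeps features >= 0.
    feats = [relu(sum(act[i] * W_enc[i][j] for i in range(D_MODEL)))
             for j in range(D_FEATS)]
    # Decode: reconstruct the original activation from the features.
    recon = [sum(feats[j] * W_dec[j][i] for j in range(D_FEATS))
             for i in range(D_MODEL)]
    return feats, recon

act = [random.gauss(0, 1) for _ in range(D_MODEL)]
feats, recon = sae_forward(act)

# Training objective: reconstruction error plus an L1 penalty on features.
mse = sum((a - r) ** 2 for a, r in zip(act, recon)) / D_MODEL
loss = mse + 0.01 * sum(feats)
active_frac = sum(f > 0 for f in feats) / D_FEATS
print(f"features active: {active_frac:.0%}")
```

The L1 term is the "sparse" in sparse autoencoder: it penalizes every nonzero feature, so the model only keeps a feature active when it genuinely helps reconstruction.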

Feature Gallery

Click a feature card to see example texts and their activation strengths.

Tracing Information Flow

Attribution graphs show how features connect across layers — revealing how the model actually reasons. Select an example and click Animate to watch information flow.
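One way to picture an attribution edge: the contribution of an upstream feature to a downstream feature is roughly its activation times the effective connection weight between them. A toy sketch; the feature names and all numbers are invented for illustration:

```python
# Toy attribution edges: contribution(i -> j) ~= activation(i) * weight(i -> j).
# Feature names and numbers are made up to illustrate ranking edges.
upstream = {"Texas": 1.8, "capital": 1.5, "say a city": 0.9}
weight_to_austin = {"Texas": 0.7, "capital": 0.6, "say a city": 0.2}

edges = {name: a * weight_to_austin[name] for name, a in upstream.items()}

# Rank edges by how strongly each upstream feature drives "Austin".
for name, contrib in sorted(edges.items(), key=lambda kv: -kv[1]):
    print(f"{name:>10} -> Austin: {contrib:.2f}")
```

An attribution graph is many such edges stitched across layers, pruned to the strongest contributions, so you can read off a candidate circuit for a single prompt.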

Golden Gate Claude: When Features Go Wild

In May 2024, Anthropic researchers found a "Golden Gate Bridge" feature inside Claude. They amplified its activation — and the model became obsessed. Every response, no matter the topic, pivoted to the Golden Gate Bridge.


Why this matters: If amplifying a feature reliably changes behavior, that feature is causally real — not just a correlation. This is how researchers move from "we found a pattern" to "we understand a mechanism."
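The intervention itself is conceptually simple: add a scaled copy of the feature's direction to the model's activations at some layer. A sketch with tiny made-up vectors; real steering hooks into the residual stream of a transformer:

```python
# Feature steering sketch: nudge an activation vector along a feature direction.
# The vectors are tiny made-up stand-ins for real residual-stream states.
def steer(act, direction, strength):
    return [a + strength * d for a, d in zip(act, direction)]

golden_gate = [0.6, -0.2, 0.7, 0.1]  # hypothetical "Golden Gate" direction
act = [0.3, 0.5, -0.1, 0.2]          # hypothetical activation at one layer

steered = steer(act, golden_gate, strength=10.0)

# After a large nudge, the activation points mostly along the feature
# direction, which is why every answer drifts toward the bridge.
dot = sum(s * d for s, d in zip(steered, golden_gate))
print("alignment with feature:", round(dot, 2))
```

Setting strength to a large negative value suppresses the feature instead, which is the other half of the causal test.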

What We Don't Know Yet

  • Attribution graphs work well on only ~25% of prompts studied so far
  • Attention mechanisms aren't yet captured by these methods
  • Scaling to frontier models is an active research challenge
  • DeepMind found their SAEs underperformed simpler baselines on some safety tasks
  • "We have a microscope, not an X-ray — we see some things clearly but miss a lot"

Key Insight

Neurons are not concepts. The right unit of analysis is a "feature" — a direction in activation space that corresponds to a human-understandable concept. Sparse autoencoders extract these features, and attribution graphs show how information flows through them. We can even intervene on features to change model behavior. But this is still early — a powerful microscope, not a complete picture.

  1. Neurons ≠ concepts — features (directions in activation space) are the interpretable unit
  2. Internal representations — models form intermediate concepts that appear in neither the input nor the output (genuine multi-step reasoning)
  3. Causal intervention — suppressing or amplifying features changes behavior, proving they matter
  4. Still early — attribution graphs give clear insight on only ~25% of prompts studied; attention not yet integrated; a microscope, not an X-ray