We have a microscope, not an X-ray. Explore how researchers are beginning to understand what happens inside language models.
Click any neuron on the right to see example sentences that activate it. Individual neurons are like tangled wires in a cable: each one carries several unrelated signals at once. Because the model has far more concepts to represent than neurons to represent them with, it packs multiple concepts into each neuron (superposition). We need a way to untangle them.
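To make superposition concrete, here is a minimal numpy sketch (an illustration, not the code behind this visualization): it packs 512 concept directions into a 64-dimensional activation space. Random high-dimensional directions are nearly orthogonal, so a sparse set of active concepts can still be read back out with only small interference.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k = 64, 512   # 64 neurons, 512 concepts: more concepts than neurons

# Assign each concept a random unit-norm direction in activation space.
directions = rng.standard_normal((k, d))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# A sparse "thought": only 5 of the 512 concepts are active.
features = np.zeros(k)
active = rng.choice(k, size=5, replace=False)
features[active] = rng.uniform(1.0, 2.0, size=5)

# Superpose the active concepts into a single 64-dim activation vector.
activation = features @ directions            # shape (d,)

# Read each concept back out by projecting onto its direction.
readout = directions @ activation             # shape (k,)

inactive = np.setdiff1d(np.arange(k), active)
print("truly active    :", sorted(active))
print("top-5 readouts  :", sorted(np.argsort(readout)[-5:]))
print("max interference:", float(np.abs(readout[inactive]).max()))
```

With only a few concepts active, the projections recover them cleanly; make the feature vector dense and the interference terms swamp the signal, which is why the sparsity below matters.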
At any given moment, only a tiny fraction of features are active. This sparsity is what makes them interpretable — each active feature tells us something specific.
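As a back-of-the-envelope illustration (with made-up numbers, not measurements from this demo), here is what that sparsity looks like for a hypothetical dictionary of 16,384 features where each feature fires rarely:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up feature activations for 1,000 tokens over 16,384 features,
# shaped like a sparse dictionary's output: mostly exact zeros.
n_tokens, n_features = 1_000, 16_384
acts = np.maximum(rng.standard_normal((n_tokens, n_features)) - 3.0, 0.0)

l0 = (acts > 0).sum(axis=1)   # number of features active per token
print(f"mean active features per token: {l0.mean():.1f} of {n_features}")
print(f"fraction active: {l0.mean() / n_features:.3%}")
```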
Click a feature card to see example texts and their activation strengths.
Attribution graphs show how features connect across layers — revealing how the model actually reasons. Select an example and click Animate to watch information flow.
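A rough sketch of what one edge in such a graph means, under a simplifying assumption made here (a locally linear feature-to-feature map, not the exact attribution method behind this animation): the contribution of upstream feature i to downstream feature j is the upstream activation scaled by the weight connecting them, and the edges into each downstream feature sum exactly to its value.

```python
import numpy as np

rng = np.random.default_rng(2)

n_up, n_down = 8, 6
upstream = np.maximum(rng.standard_normal(n_up), 0.0)   # upstream feature activations
W = rng.standard_normal((n_up, n_down)) * 0.5           # hypothetical connecting weights

# Attribution of upstream feature i to downstream feature j:
# its activation scaled by the weight that links them.
edges = upstream[:, None] * W                           # shape (n_up, n_down)

# The edges into each downstream feature sum to that feature's value,
# so the graph fully decomposes downstream activity into upstream parts.
downstream = upstream @ W
assert np.allclose(edges.sum(axis=0), downstream)

# Keep only the strongest edges, as the animation does visually.
for i, j in np.argwhere(np.abs(edges) > 0.5):
    print(f"feature {i} -> feature {j}: {edges[i, j]:+.2f}")
```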
Neurons are not concepts. The right unit of analysis is a "feature" — a direction in activation space that corresponds to a human-understandable concept. Sparse autoencoders extract these features, and attribution graphs show how information flows through them. We can even intervene on features to change model behavior. But this is still early — a powerful microscope, not a complete picture.
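To ground the last two claims, here is a forward-pass-only sketch of a sparse autoencoder and a feature intervention. The weights are random stand-ins for a trained SAE (which would be fit with a reconstruction loss plus a sparsity penalty), and the feature index is hypothetical; in practice you would use trained weights and a feature you have already identified.

```python
import numpy as np

rng = np.random.default_rng(3)

d_model, n_features = 64, 1024   # hypothetical sizes

# Random stand-ins for a *trained* sparse autoencoder's weights.
W_enc = rng.standard_normal((d_model, n_features)) * 0.1
b_enc = -2.0 * np.ones(n_features)   # negative bias keeps most features at zero
W_dec = rng.standard_normal((n_features, d_model)) * 0.1

def encode(x):
    # ReLU encoder: feature activations are sparse and non-negative.
    return np.maximum(x @ W_enc + b_enc, 0.0)

def decode(f):
    # Decoder reconstructs the activation as a sum of feature directions.
    return f @ W_dec

x = rng.standard_normal(d_model)   # a model activation vector
f = encode(x)
print("active features:", int((f > 0).sum()), "of", n_features)

# Intervention: clamp one (hypothetical) feature high, decode the change,
# and add the difference back into the model's activation to steer it.
FEATURE = 123
f_steered = f.copy()
f_steered[FEATURE] = 10.0
x_steered = x + decode(f_steered) - decode(f)
```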