The Cocktail Party

An interactive guide to the Attention mechanism in LLMs

1. How Attention Works
2. What's Stored in the Model
3. Neurons vs Attention

The Cocktail Party

Click on any word to see how it "pays attention" to the others.

Each word looks at every other word and decides: who is relevant to me right now? Some words matter a lot, others are nearly invisible.
Query — What do I need?
You shout at the party: "Who can help me move?"
Key — What can I offer?
Everyone wears a name tag: "Strong & free", "Has a truck"...
Value — My actual useful content
Matched people share: "I can come at 9am, I'll bring boxes."
💡
Every word does this with every other word, all at the same time. That's why it's called "attention" — the model learns where to focus for each word.
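The party analogy maps directly onto scaled dot-product attention: queries are compared against keys, the scores become weights, and the weights mix the values. A minimal NumPy sketch (the dimensions and random vectors are made up for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating, for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Each word's query is compared with every word's key ("shouting at
    # the party"), scaled, then softmaxed into attention weights.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (seq, seq) relevance scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights         # weighted mix of the values

# Toy example: 3 "words", each a 4-dimensional vector (random numbers)
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
out, w = attention(Q, K, V)
print(w.sum(axis=-1))  # each word's attention budget sums to 1.0
```

Every row of `w` is one word's attention pattern: how it splits a fixed budget of focus across all the words, including itself.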

What Does the Model Actually Store?

Not the answers — the rules for finding them.

The "Training Course"

Before the party, every guest takes a course that teaches three skills:

1. How to ask good questions → produces the Query
2. How to write your name tag → produces the Key
3. How to share useful information → produces the Value
✕ Untrained Model
Guests took a terrible course. They ask random questions, write confusing name tags, and share useless information.
→ Attention goes everywhere randomly
✓ Trained Model
Guests went through a course refined over billions of examples. Sharp questions, accurate tags, exactly the right info.
→ Attention goes to the right places
💡
The model doesn't store "cat is important when you see sat." It stores the recipe that allows any word, in any sentence, to figure out who to pay attention to — on the fly.

Where Does the Knowledge Live?

Two systems, one model — they work together.

🗄️
The Archives
Feed-Forward Layers (Neurons)
Stores facts and knowledge
Each neuron memorized specific things during training. Stronger connections between neurons = more confident knowledge.
Examples of what's stored:
"Paris → capital of France" "Water boils → 100°C" "Dog → animal, pet, loyal" "Shakespeare → playwright"
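One common way to picture the archives: a feed-forward layer is a two-matrix lookup, where each hidden unit pairs a pattern detector (its input row) with the "fact" it contributes when that pattern fires (its output row). A minimal sketch with invented dimensions and random weights in place of trained ones:

```python
import numpy as np

def feed_forward(x, W_in, W_out):
    # Classic transformer FFN: expand, apply ReLU, project back down.
    # Each hidden unit acts like one archive entry: it fires when its
    # pattern appears, and adds its stored contribution to the output.
    hidden = np.maximum(0, x @ W_in)  # which memorized patterns fire?
    return hidden @ W_out             # sum of what the firing entries add

rng = np.random.default_rng(2)
d_model, d_hidden = 8, 32             # hidden layer is typically ~4x wider
W_in = rng.standard_normal((d_model, d_hidden))
W_out = rng.standard_normal((d_hidden, d_model))

x = rng.standard_normal((3, d_model))  # 3 word vectors
y = feed_forward(x, W_in, W_out)
print(y.shape)  # (3, 8): same shape in and out
```

Note that each word is processed alone here; no word looks at any other. That is exactly what separates the archives from the meeting room.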
💬
The Meeting Room
Attention Layers (Multi-Head)
Stores rules for communication
No facts stored here — only the skill of figuring out which words matter for each other, right now. Each head specializes in a different question:
Head 1: "Who did the action?"
Head 2: "Where did it happen?"
Head 3: "Positive or negative?"
Head 4: "Related earlier word?"
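"Multi-head" is mechanically simple: split the vectors into chunks, run attention separately on each chunk (so each head can learn its own question), then concatenate the results. A minimal sketch, with self-attention (queries, keys, and values all from the same words) and made-up sizes:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(Q, K, V, n_heads):
    # Split each vector into n_heads chunks; each head runs its own
    # attention over its chunk, then the answers are glued back together.
    seq, d_model = Q.shape
    d_head = d_model // n_heads
    outputs = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)
        outputs.append(softmax(scores) @ V[:, s])
    return np.concatenate(outputs, axis=-1)  # back to (seq, d_model)

rng = np.random.default_rng(3)
x = rng.standard_normal((5, 8))  # 5 "words", 8 dimensions
out = multi_head_attention(x, x, x, n_heads=4)
print(out.shape)  # (5, 8)
```

Because each head sees a different slice and develops its own weights during training, the heads end up asking different questions without anyone assigning them roles.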
Attention → Neurons → Attention → Neurons → Attention → Neurons → Output

Click to animate — watch the layers alternate
💡
Communicate → Recall → Communicate → Recall. Each round deepens understanding. The attention layers figure out relationships, the feed-forward layers contribute knowledge. Together, layer after layer, they build meaning.
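The Communicate → Recall loop can be sketched as a stack of simplified transformer-style blocks. This leaves out pieces a real model has (layer normalization, projection matrices, positional information) and uses random weights, but it shows the alternation and the residual additions that let each round build on the last:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x):
    # "Communicate": every word mixes in information from the others
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def feed_forward(x, W_in, W_out):
    # "Recall": each word is processed alone, against stored knowledge
    return np.maximum(0, x @ W_in) @ W_out

rng = np.random.default_rng(4)
d = 8
x = rng.standard_normal((5, d))  # 5 word vectors entering the stack

for _ in range(3):  # three blocks: communicate, recall, repeat
    x = x + self_attention(x)                        # meeting room
    W_in = rng.standard_normal((d, 4 * d))
    W_out = rng.standard_normal((4 * d, d))
    x = x + feed_forward(x, W_in, W_out)             # archives
print(x.shape)  # (5, 8): same shape throughout, meaning accumulates
```

The `x + ...` residual additions matter: each layer contributes an update on top of what is already there, rather than replacing it, which is why understanding can deepen round after round.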