robert@barcik.training

The Cocktail Party

An interactive guide to the Attention mechanism in LLMs


Click on any word to see how it "pays attention" to the others.

Each word looks at every other word and decides: who is relevant to me right now? Some words matter a lot, others are nearly invisible.
Query
What do I need?
You shout at the party: "Who can help me move?"
Key
What can I offer?
Everyone wears a name tag: "Strong & free", "Has a truck"...
Value
My actual useful content
Matched people share: "I can come at 9am, I'll bring boxes."
💡
Every word does this with every other word, all at the same time. That's why it's called "attention" — the model learns where to focus for each word.
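The party analogy above maps directly onto scaled dot-product attention: compare each word's Query against every word's Key, turn the match scores into weights, and mix the Values accordingly. Here is a minimal numpy sketch, with random vectors standing in for real learned word representations (the numbers are purely illustrative):

```python
import numpy as np

def softmax(x):
    # Subtract the row max before exponentiating for numerical stability
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Each word's Query ("what do I need?") is compared against every
    # word's Key ("name tag"), scaled by sqrt of the key dimension.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n_words, n_words) relevance grid
    weights = softmax(scores)         # each row sums to 1: who to listen to
    return weights @ V, weights       # weighted mix of Values ("content")

# Toy example: 3 "guests", each represented by a 4-dimensional vector
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, w = attention(Q, K, V)
print(w.round(2))  # one row of attention weights per word
```

Note that all rows are computed in one matrix multiplication: that is the "every word, all at the same time" part.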

What Does the Model Actually Store?

Not the answers — the rules for finding them.

The "Training Course"

Before the party, every guest takes a course that teaches three skills:

1
How to ask good questions
→ produces the Query
2
How to write your name tag
→ produces the Key
3
How to share useful information
→ produces the Value
✕ Untrained Model
Guests took a terrible course. They ask random questions, write confusing name tags, and share useless information.
→ Attention goes everywhere randomly
✓ Trained Model
Guests went through a course refined over billions of examples. Sharp questions, accurate tags, exactly the right info.
→ Attention goes to the right places
💡
The model doesn't store "cat is important when you see sat." It stores the recipe that allows any word, in any sentence, to figure out who to pay attention to — on the fly.
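The stored "recipe" is concretely a set of learned projection matrices. A sketch of the idea, with random matrices standing in for trained weights and the dimensions chosen arbitrarily for illustration:

```python
import numpy as np

d_model, d_k = 8, 4
rng = np.random.default_rng(1)
# The "training course" the model actually stores: three weight matrices.
# (Random here; in a real model these are tuned over billions of examples.)
W_Q = rng.normal(size=(d_model, d_k))   # how to ask good questions
W_K = rng.normal(size=(d_model, d_k))   # how to write your name tag
W_V = rng.normal(size=(d_model, d_k))   # how to share useful information

def qkv(x):
    # x: (n_words, d_model) embeddings for ANY sentence.
    # The same three matrices turn every word into its Query, Key, and
    # Value on the fly; no sentence-specific answers are stored anywhere.
    return x @ W_Q, x @ W_K, x @ W_V

x = rng.normal(size=(5, d_model))       # a 5-word sentence
Q, K, V = qkv(x)
print(Q.shape, K.shape, V.shape)
```

The same `W_Q`, `W_K`, `W_V` are reused for every sentence the model ever sees, which is exactly why the recipe generalizes.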

Where Does the Knowledge Live?

Two systems, one model — they work together.

🗄️
The Archives
Feed-Forward Layers (Neurons)
Stores facts and knowledge
Each neuron memorized specific things during training. Stronger connections between neurons = more confident knowledge.
Examples of what's stored:
"Paris → capital of France"
"Water boils → 100°C"
"Dog → animal, pet, loyal"
"Shakespeare → playwright"
💬
The Meeting Room
Attention Layers (Multi-Head)
Stores rules for communication
No facts are stored here, only the skill of figuring out which words matter to each other right now.
Each head specializes in a different question:
Head 1: "Who did the action?"
Head 2: "Where did it happen?"
Head 3: "Positive or negative?"
Head 4: "Related earlier word?"
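"Multi-head" simply means running several independent copies of attention side by side, each with its own learned projections, and concatenating their answers. A minimal sketch with random weights and made-up dimensions (a real head does not literally ask a named question; specializations emerge during training):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def head(x, W_q, W_k, W_v):
    # One head = one "question", with its own projection matrices.
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    w = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return w @ V

def multi_head(x, heads):
    # Run every head on the same words, then concatenate their answers.
    return np.concatenate([head(x, *h) for h in heads], axis=-1)

d_model, d_head, n_heads = 8, 2, 4
rng = np.random.default_rng(2)
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]
x = rng.normal(size=(5, d_model))       # a 5-word sentence
print(multi_head(x, heads).shape)       # 4 heads x 2 dims = 8 per word
```

Because each head sees the same words through different projections, they can attend to different relationships at once.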
Block 1: Attention + Neurons → Block 2: Attention + Neurons → Block 3: Attention + Neurons → Output
Click to animate — watch comprehension deepen through each block
💡
Communicate → Recall → Communicate → Recall. Each round deepens understanding. The attention layers figure out relationships, the feed-forward layers contribute knowledge. Together, layer after layer, they build meaning.
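The Communicate → Recall loop can be sketched as a stack of simplified transformer blocks: an attention step followed by a feed-forward step, with residual connections so each round adds to what came before. This is a bare-bones illustration with random weights; real blocks also include layer normalization and trained parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def communicate(x, W_q, W_k, W_v):
    # "Meeting room": words exchange information via attention.
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def recall(x, W1, W2):
    # "Archives": a feed-forward layer applied to each word separately.
    return np.maximum(0, x @ W1) @ W2   # ReLU non-linearity

d = 8
rng = np.random.default_rng(3)
x = rng.normal(size=(5, d))             # a 5-word sentence
for _ in range(3):                      # Block 1 → Block 2 → Block 3
    W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
    W1, W2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))
    x = x + communicate(x, W_q, W_k, W_v)   # residual: add, don't replace
    x = x + recall(x, W1, W2)
print(x.shape)  # same 5 words, progressively deeper representations
```

The shape never changes from block to block; what deepens is the content of each word's vector.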