The Cocktail Party

An interactive guide to the Attention mechanism in LLMs

1. How Attention Works
2. What's Stored in the Model
3. Neurons vs Attention

The Cocktail Party

Click on any word to see how it "pays attention" to the others.

Each word looks at every other word and decides: who is relevant to me right now? Some words matter a lot, others are nearly invisible.
Query — What do I need?
You shout at the party: "Who can help me move?"
Key — What can I offer?
Everyone wears a name tag: "Strong & free", "Has a truck"...
Value — My actual useful content
Matched people share: "I can come at 9am, I'll bring boxes."
💡
Every word does this with every other word, all at the same time. That's why it's called "attention" — the model learns where to focus for each word.
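The party analogy maps directly onto scaled dot-product attention: queries are compared against keys, the scores become weights, and the weights mix the values. A minimal NumPy sketch (the dimensions and random vectors are made up for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating, for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Each word's query is compared with every word's key ("shouting at
    # the party"), scaled, then softmaxed into attention weights.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (seq, seq) relevance scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights         # weighted mix of the values

# Toy example: 3 "words", each a 4-dimensional vector (random numbers)
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
out, w = attention(Q, K, V)
print(w.sum(axis=-1))  # each word's attention budget sums to 1.0
```

Every row of `w` is one word's attention pattern: how it splits a fixed budget of focus across all the words, including itself.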

What Does the Model Actually Store?

Not the answers — the rules for finding them.

The "Training Course"

Before the party, every guest takes a course that teaches three skills:

1. How to ask good questions → produces the Query
2. How to write your name tag → produces the Key
3. How to share useful information → produces the Value
✕ Untrained Model
Guests took a terrible course. They ask random questions, write confusing name tags, and share useless information.
→ Attention goes everywhere randomly
✓ Trained Model
Guests went through a course refined over billions of examples. Sharp questions, accurate tags, exactly the right info.
→ Attention goes to the right places
💡
The model doesn't store "cat is important when you see sat." It stores the recipe that allows any word, in any sentence, to figure out who to pay attention to — on the fly.

Where Does the Knowledge Live?

Two systems, one model — they work together.

🗄️
The Archives
Feed-Forward Layers (Neurons)
Stores facts and knowledge
Each neuron memorized specific things during training. Stronger connections between neurons = more confident knowledge.
Examples of what's stored:
"Paris → capital of France" "Water boils → 100°C" "Dog → animal, pet, loyal" "Shakespeare → playwright"
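One common way to picture the archives: a feed-forward layer is a two-matrix lookup, where each hidden unit pairs a pattern detector (its input row) with the "fact" it contributes when that pattern fires (its output row). A minimal sketch with invented dimensions and random weights in place of trained ones:

```python
import numpy as np

def feed_forward(x, W_in, W_out):
    # Classic transformer FFN: expand, apply ReLU, project back down.
    # Each hidden unit acts like one archive entry: it fires when its
    # pattern appears, and adds its stored contribution to the output.
    hidden = np.maximum(0, x @ W_in)  # which memorized patterns fire?
    return hidden @ W_out             # sum of what the firing entries add

rng = np.random.default_rng(2)
d_model, d_hidden = 8, 32             # hidden layer is typically ~4x wider
W_in = rng.standard_normal((d_model, d_hidden))
W_out = rng.standard_normal((d_hidden, d_model))

x = rng.standard_normal((3, d_model))  # 3 word vectors
y = feed_forward(x, W_in, W_out)
print(y.shape)  # (3, 8): same shape in and out
```

Note that each word is processed alone here; no word looks at any other. That is exactly what separates the archives from the meeting room.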
💬
The Meeting Room
Attention Layers (Multi-Head)
Stores rules for communication
No facts stored here — only the skill of figuring out which words matter for each other, right now. Each head specializes in a different question:
Head 1: "Who did the action?"
Head 2: "Where did it happen?"
Head 3: "Positive or negative?"
Head 4: "Related earlier word?"
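"Multi-head" is mechanically simple: split the vectors into chunks, run attention separately on each chunk (so each head can learn its own question), then concatenate the results. A minimal sketch, with self-attention (queries, keys, and values all from the same words) and made-up sizes:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(Q, K, V, n_heads):
    # Split each vector into n_heads chunks; each head runs its own
    # attention over its chunk, then the answers are glued back together.
    seq, d_model = Q.shape
    d_head = d_model // n_heads
    outputs = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)
        outputs.append(softmax(scores) @ V[:, s])
    return np.concatenate(outputs, axis=-1)  # back to (seq, d_model)

rng = np.random.default_rng(3)
x = rng.standard_normal((5, 8))  # 5 "words", 8 dimensions
out = multi_head_attention(x, x, x, n_heads=4)
print(out.shape)  # (5, 8)
```

Because each head sees a different slice and develops its own weights during training, the heads end up asking different questions without anyone assigning them roles.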
Attention → Neurons → Attention → Neurons → Attention → Neurons → Output

Click to animate — watch the layers alternate
💡
Communicate → Recall → Communicate → Recall. Each round deepens understanding. The attention layers figure out relationships, the feed-forward layers contribute knowledge. Together, layer after layer, they build meaning.
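The Communicate → Recall loop can be sketched as a stack of simplified transformer-style blocks. This leaves out pieces a real model has (layer normalization, projection matrices, positional information) and uses random weights, but it shows the alternation and the residual additions that let each round build on the last:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x):
    # "Communicate": every word mixes in information from the others
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def feed_forward(x, W_in, W_out):
    # "Recall": each word is processed alone, against stored knowledge
    return np.maximum(0, x @ W_in) @ W_out

rng = np.random.default_rng(4)
d = 8
x = rng.standard_normal((5, d))  # 5 word vectors entering the stack

for _ in range(3):  # three blocks: communicate, recall, repeat
    x = x + self_attention(x)                        # meeting room
    W_in = rng.standard_normal((d, 4 * d))
    W_out = rng.standard_normal((4 * d, d))
    x = x + feed_forward(x, W_in, W_out)             # archives
print(x.shape)  # (5, 8): same shape throughout, meaning accumulates
```

The `x + ...` residual additions matter: each layer contributes an update on top of what is already there, rather than replacing it, which is why understanding can deepen round after round.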