A single head only does one thing
The attention layer you have built so far computes one weighted average per query. That weighted average can emphasize different keys for different queries, but the kind of relationship it captures (syntactic agreement? coreference? long-range topic? next-token continuation?) is one fixed thing per layer.
Trained transformers, when probed, turn out to want several attention patterns at once. One head specializes in attending to the previous token. Another tracks subject-verb agreement. Another keeps an eye on the start-of-sequence token. A single head can’t do all of these simultaneously because softmax is a single probability distribution per query.
The fix is to run several attention heads in parallel, each operating on a different subspace of the same input.
Reshape, don't widen
Take the model dimension $d_{\text{model}}$ and split it into $h$ equal chunks of width $d_k = d_{\text{model}} / h$. Each chunk is one head’s working subspace.
Each head $i$ gets its own $W_Q^{(i)}$, $W_K^{(i)}$, and $W_V^{(i)}$, each of shape $d_{\text{model}} \times d_k$. Per head, the operation is exactly the scaled dot-product attention from the last lesson, just operating in a smaller subspace.
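The "reshape, don't widen" idea can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the sizes (`n = 5`, `h = 4`), the random weights, and the choice to pack all heads' query weights into one `d_model x d_model` matrix are assumptions made for the example. It checks that one big projection followed by a reshape gives the same result as running each head's own `d_model x d_k` projection.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, h = 5, 24, 4      # sequence length, model width, head count (illustrative)
d_k = d_model // h            # per-head width: 24 / 4 = 6

X = rng.normal(size=(n, d_model))
# Assumption: all heads' query weights are stored side by side in one matrix.
W_Q = rng.normal(size=(d_model, d_model))

# One projection, then split the feature axis into (h, d_k) and put heads first.
Q = X @ W_Q                                        # (n, d_model)
Q_heads = Q.reshape(n, h, d_k).transpose(1, 0, 2)  # (h, n, d_k)

# Equivalent view: head i owns columns [i*d_k, (i+1)*d_k) of W_Q.
Q_per_head = np.stack(
    [X @ W_Q[:, i * d_k:(i + 1) * d_k] for i in range(h)]
)

print(Q_heads.shape)                     # (4, 5, 6)
print(np.allclose(Q_heads, Q_per_head))  # True
```

The same split applies to the key and value projections. Packing the heads into one matrix is a common layout choice because it turns `h` small matrix multiplies into one large one.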
Click through the head counts and watch the shape passport at the top of the widget update. The model dimension stays at 24; what changes is how many heads slice it. The mini-heatmap for each head shows that head’s attention pattern on the same input. They look different, because each head sees a different subspace and attends accordingly.
Concat the heads, project them down
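After each head computes its output $\text{head}_i$ of shape $n \times d_k$ (with $n$ the sequence length), concatenate along the feature axis:
$$\text{Concat}(\text{head}_1, \dots, \text{head}_h) \in \mathbb{R}^{n \times d_{\text{model}}}$$
Then mix them through a learned output projection $W_O$ of shape $d_{\text{model}} \times d_{\text{model}}$:
$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\, W_O$$
That’s the whole multi-head attention layer. Output shape: $n \times d_{\text{model}}$, same as the input, ready to feed into a residual connection.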
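Putting the pieces together, the whole layer fits in a few lines of NumPy. Treat this as a sketch under stated assumptions: no batch axis, no mask, no biases, randomly initialized weights, and helper names (`softmax`, `multi_head_attention`) invented for this example. It shows the split into heads, per-head scaled dot-product attention, concatenation, and the output projection.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    scores = scores - scores.max(axis=axis, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """Unmasked, unbatched multi-head self-attention (illustrative helper)."""
    n, d_model = X.shape
    d_k = d_model // h

    def split(M):
        # (n, d_model) -> (h, n, d_k): one slice of the feature axis per head.
        return M.reshape(n, h, d_k).transpose(1, 0, 2)

    Q, K, V = split(X @ W_Q), split(X @ W_K), split(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, n, n)
    weights = softmax(scores, axis=-1)                 # one distribution per query, per head
    heads = weights @ V                                # (h, n, d_k)
    # Concatenate heads along the feature axis, then mix with W_O.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_O, weights

rng = np.random.default_rng(0)
n, d_model, h = 5, 24, 4   # illustrative sizes
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V, W_O = (
    rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)

out, weights = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)
print(out.shape)      # (5, 24)
print(weights.shape)  # (4, 5, 5)
```

The returned `weights` array holds one `n x n` attention pattern per head, which is the quantity a per-head heatmap would display.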
The role of $W_O$ is small but real. It is the only place inside the attention layer where information from different heads can mix; without it, each head’s output would stay confined to its own block of the feature axis on the way out of the layer. In standard implementations $W_O$ is a full $d_{\text{model}} \times d_{\text{model}}$ matrix, so every output feature can draw on every head.
Per-head dimension
With model dimension $d_{\text{model}}$ and $h$ heads, what is $d_k$ per head?
Param count is conserved
What is the ratio of total parameters in $h$-head attention to parameters in single-head attention (with $d_k = d_{\text{model}}$)? Multiply the per-head count across all heads.
Why this is free
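Notice from the widget what doesn’t change as you crank $h$ up: the parameter count.
Each head’s $W_Q$ is $d_{\text{model}} \times d_k$, so the total across all $h$ heads is $h \cdot d_{\text{model}} \cdot d_k = d_{\text{model}}^2$. Identical to a single head with $d_k = d_{\text{model}}$. Same for $W_K$ and $W_V$. The FLOPs for the attention computation also work out the same: each head does $O(n^2 d_k)$ work, and there are $h$ of them, giving $O(n^2 d_{\text{model}})$.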
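The bookkeeping above is easy to check with plain arithmetic. The two helper functions below are hypothetical names written for this sketch. They count weights only (no biases), include the output projection (whose size does not depend on the head count), and use a rough multiply-add count for the score and weighted-sum steps rather than an exact FLOP model.

```python
def attention_param_count(d_model, h):
    """Weights in W_Q, W_K, W_V (split across h heads) plus W_O, ignoring biases."""
    assert d_model % h == 0, "d_model must be divisible by h"
    d_k = d_model // h
    per_head_qkv = 3 * d_model * d_k   # one head's W_Q, W_K, and W_V
    w_o = d_model * d_model            # output projection, independent of h
    return h * per_head_qkv + w_o

def attention_score_flops(n, d_model, h):
    """Rough multiply-add count for Q K^T and weights @ V, summed over heads."""
    d_k = d_model // h
    per_head = 2 * n * n * d_k         # n*n*d_k for scores, n*n*d_k for the weighted sum
    return h * per_head

d_model, n = 24, 10                    # illustrative sizes
for h in (1, 2, 3, 4, 6, 8, 12):
    print(h, attention_param_count(d_model, h), attention_score_flops(n, d_model, h))
```

Every row reports the same two totals (2304 weights and 4800 multiply-adds for these sizes), because the factor of $h$ from counting heads cancels the $1/h$ in $d_k$.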
So multi-head doesn’t cost more parameters or compute than a single big head. What it gives you is several attention patterns operating in parallel. Empirically this matters: single-head transformers underperform multi-head transformers at the same total compute. The patterns that emerge across heads are more diverse than what a single softmax can capture, even one with the same total dimension to play with.
A small caveat: this lesson said “free”, but there is a subtle cost. With very large $h$, each head’s $d_k$ might be too small to have rich enough subspaces; you’d be pinching the per-head capacity below what it needs to model the relationships you want. Most production transformers settle on $h$ between 8 and 32, with $d_k$ between 64 and 128. There is a sweet spot, and it isn’t at the extremes.
What's left
You now have the entire single-block attention operation: scaled dot-product, packed into matrix form, masked when needed, position-aware via PE, and run in parallel across heads. That is the core operation of a transformer.
The next (and final) lesson of this module covers how much this operation actually costs to run, and the one inference trick (the KV-cache) that makes streaming generation feasible at production sequence lengths.