Every position queries, every position answers
The last two lessons built attention with one query against keys. That’s enough to define the operation but not enough to do useful work: a transformer layer needs every position to attend to every other position, in one parallel sweep.
The leap is small. Take the input sequence $X \in \mathbb{R}^{n \times d}$, where each row is a token’s vector. We’re going to make every row both a querier and a query target: same data, three roles.
Three projections of the same X
For each token vector, build a query, a key, and a value via three independent linear layers:
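$$Q = XW_Q, \qquad K = XW_K, \qquad V = XW_V$$

with $W_Q, W_K \in \mathbb{R}^{d \times d_k}$ and $W_V \in \mathbb{R}^{d \times d_v}$. $Q$, $K$, $V$ are matrices of shape $n \times d_k$, $n \times d_k$, $n \times d_v$. Each row is the query, key, or value for one token.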
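Here is a minimal NumPy sketch of the three projections. The sizes (`n = 4`, `d = 8`, `d_k = 3`, `d_v = 5`) are illustrative assumptions, and the random matrices stand in for learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_k, d_v = 4, 8, 3, 5          # illustrative sizes, not from the lesson

X = rng.normal(size=(n, d))          # one row per token
W_q = rng.normal(size=(d, d_k))      # random stand-ins for learned weights
W_k = rng.normal(size=(d, d_k))
W_v = rng.normal(size=(d, d_v))

# Three views of the same X, one per role.
Q, K, V = X @ W_q, X @ W_k, X @ W_v

print(Q.shape, K.shape, V.shape)     # (4, 3) (4, 3) (4, 5)
```

Row `i` of each result depends only on row `i` of `X`: the projections do not mix tokens. Mixing happens later, in the attention step.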
In real code the three projections are usually fused into one matrix multiply for speed:
import torch.nn as nn                       # PyTorch; x has shape (..., d)
qkv = nn.Linear(d, 3 * d, bias=False)(x)    # one matmul; bias=False, since nn.Linear adds a bias by default
q, k, v = qkv.split(d, dim=-1)              # then three views, each d wide

This is the canonical nanoGPT pattern, and it is the reason a learner who sees qkv instead of q, k, v in code shouldn’t panic: they’re the same three projections, packed.
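To see that the packed form really is the same three projections, here is a small NumPy check. It assumes $d_k = d_v = d$, as the fused snippet does, and uses random weights with made-up sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 4                       # illustrative sizes; assumes d_k = d_v = d

X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

# Pack the three weight matrices side by side: shape (d, 3d).
W_qkv = np.concatenate([W_q, W_k, W_v], axis=1)

qkv = X @ W_qkv                   # one matmul, shape (n, 3d)
q, k, v = np.split(qkv, 3, axis=-1)

# Each slice should equal the corresponding separate projection.
fused_matches = (
    np.allclose(q, X @ W_q)
    and np.allclose(k, X @ W_k)
    and np.allclose(v, X @ W_v)
)
print(fused_matches)
```

The fusion is purely a performance choice: one large matmul instead of three small ones, with identical numbers coming out.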
The point worth carrying with you: Q, K, V are not three different inputs. They are three views of the same $X$, computed by three different matrices. A single token at position $i$ contributes its query (when it is doing the asking), its key (when others might match against it), and its value (the payload it delivers if matched).
The matrix form, in one line
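Apply scaled dot-product attention to all $n$ queries in parallel:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$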
That is the equation you came to this module for. Let’s walk through it shape by shape.
- $QK^\top$ is an $n \times n$ matrix of all pairwise scores. Cell $(i, j)$ is $q_i \cdot k_j$: how strongly query $i$ matches key $j$.
- Divide by $\sqrt{d_k}$: last lesson’s variance fix.
- Softmax along the last axis (the key axis): every row becomes a probability distribution over the $n$ keys.
- Multiply by $V$: each row of the output is a weighted average of the value vectors, weighted by that row’s attention distribution.
- Output: $n \times d_v$. Same number of rows as input. Same shape passport, plus mixing.
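The walk-through above can be sketched in NumPy with the shape of every intermediate written next to it. This is one possible implementation under assumed sizes (`n = 6`, `d_k = 4`, `d_v = 5`), not the lesson's reference code.

```python
import numpy as np

def softmax(scores):
    # Subtract each row's max before exponentiating; the result is unchanged
    # but large scores no longer overflow.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    n, d_k = Q.shape
    scores = Q @ K.T                    # (n, n): cell (i, j) is q_i . k_j
    scaled = scores / np.sqrt(d_k)      # (n, n): variance fix
    weights = softmax(scaled)           # (n, n): each row sums to 1
    out = weights @ V                   # (n, d_v): weighted average of values
    return out, weights

rng = np.random.default_rng(0)
n, d_k, d_v = 6, 4, 5                   # illustrative sizes
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))

out, weights = attention(Q, K, V)
print(weights.shape, out.shape)         # (6, 6) (6, 5)
```

Reading the trailing comments top to bottom gives you the shape passport: `(n, d_k)` in, `(n, n)` in the middle, `(n, d_v)` out.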
The “shape passport” (the chain of tensor shapes) is the single most useful debugging tool when reading attention code. If you can recite it, you can localize a bug in someone else’s transformer in about forty-five seconds.
Shape passport drill
With tokens, model dimension, and key dimension, has shape . How many scalar entries does have?
Self vs. cross: same op, different bindings
A quick clarifier in case you read older transformer papers and got confused by encoder–decoder diagrams.
- Self-attention: $Q$, $K$, $V$ all come from the same sequence $X$, and the formula above is unchanged. This is what GPT-style decoder-only models use.
- Cross-attention: $Q$ comes from one sequence, $K$ and $V$ from another. Used in encoder–decoder translation models, where the decoder’s queries attend to the encoder’s keys/values. Same equation, different argument bindings.
The implementation is identical. The codepath is one function. Don’t think of cross-attention as a different mechanism: think of it as the same Attention(Q, K, V) call with different inputs in two of its three slots. We will spend the rest of this module on self-attention because it is what a decoder-only transformer needs.
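A short NumPy sketch of "one function, different bindings". The names `X_dec` and `X_enc` are hypothetical stand-ins for a decoder-side and an encoder-side sequence. The same projection weights are reused for both calls only to keep the example short; real models typically give each attention layer its own weights.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output (n_q, d_v)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d, d_k, d_v = 8, 4, 4                   # illustrative sizes
W_q = rng.normal(size=(d, d_k))
W_k = rng.normal(size=(d, d_k))
W_v = rng.normal(size=(d, d_v))

X_dec = rng.normal(size=(3, d))         # hypothetical decoder-side sequence
X_enc = rng.normal(size=(7, d))         # hypothetical encoder-side sequence

# Self-attention: all three arguments are built from the same sequence.
self_out = attention(X_dec @ W_q, X_dec @ W_k, X_dec @ W_v)

# Cross-attention: queries from one sequence, keys and values from another.
cross_out = attention(X_dec @ W_q, X_enc @ W_k, X_enc @ W_v)

print(self_out.shape, cross_out.shape)  # (3, 4) (3, 4)
```

In both calls the output has one row per query. The number of keys (3 versus 7) only changes the width of the intermediate score matrix.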
Mask the future
At training time, a decoder-only language model wants every position to predict the next token from positions up to and including itself. So position $i$ must not be allowed to read positions $j > i$. The future is what we are trying to predict: leaking it would let the model memorize the answer.
But at training time we do feed the entire $n$-length sequence into the layer in one forward pass. We don’t recompute attention $n$ times for different prefixes. So the question is: how do you, in a single call, force every row $i$ to ignore the positions $j > i$?
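You add an upper-triangular mask of $-\infty$ to the scores before the softmax:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V, \qquad M_{ij} = \begin{cases} 0 & j \le i \\ -\infty & j > i \end{cases}$$

After softmax, $e^{-\infty} = 0$, so every entry above the diagonal becomes a hard zero. The remaining entries renormalize across the unmasked positions, and each row is still a valid probability distribution.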
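A minimal NumPy sketch of the mask, assuming a tiny sequence of `n = 4` positions and random scores in place of real $QK^\top / \sqrt{d_k}$ values.

```python
import numpy as np

def softmax(scores):
    # Rows always keep at least one finite entry (the diagonal), so the
    # row max is finite and exp(-inf) cleanly evaluates to 0.0.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n = 4                                        # illustrative sequence length
scores = rng.normal(size=(n, n))             # stand-in for Q @ K.T / sqrt(d_k)

# True strictly above the diagonal: key position j is later than query i.
future = np.triu(np.ones((n, n), dtype=bool), k=1)
mask = np.where(future, -np.inf, 0.0)        # -inf for the future, 0 elsewhere

weights = softmax(scores + mask)

print(np.round(weights, 2))
```

Row 0 can only see itself, so its single weight is exactly 1. Each later row spreads its weight over itself and the positions before it.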
Toggle the mask off and back on. Off: every query reads every key (this is what an encoder like BERT does). On: every row’s distribution lives only on the diagonal and below; the upper triangle is zeroed. The row sums on the right stay at 1.00 either way, because the masking happens in score space, not weight space.
The same Q, K, V trains as a bidirectional encoder or as a causal decoder, depending on a single line of code: whether you add the mask. At the level of the attention layer, that is the whole difference between a BERT-style encoder and a GPT-style decoder, and it lives in this one tensor.
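The "single line of code" can be sketched as a `causal` flag on an otherwise unchanged function. The flag name is an assumption for illustration. The check at the end confirms that one masked pass gives the same rows as recomputing unmasked attention on each prefix, which is the work the mask saves.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def attention(Q, K, V, causal=False):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if causal:
        # The one extra step: hide key positions later than the query.
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
n, d_k = 5, 4                                # illustrative sizes
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_k))

# One forward pass with the mask on.
causal_out = attention(Q, K, V, causal=True)

# The slow alternative: rerun unmasked attention on every prefix and keep
# only the last row, i.e. the newest position's output.
prefix_out = np.stack(
    [attention(Q[: i + 1], K[: i + 1], V[: i + 1])[-1] for i in range(n)]
)

single_pass_matches = np.allclose(causal_out, prefix_out)
print(single_pass_matches)
```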
Inside the mask
With the causal mask on at , what is the value of the attention weight (query at position 2 attending to key at position 3)?
Aside on interpretability. It is tempting to look at the heatmap above and read off “the model’s reasoning.” Don’t. Jain and Wallace (NAACL 2019) showed that one can construct alternative attention distributions that yield the same predictions, and that attention is often uncorrelated with gradient-based feature importance. Wiegreffe and Pinter (EMNLP 2019) refined the picture but did not rescue the strong claim. Treat attention heatmaps as a useful diagnostic, not a faithful explanation.
The bug nobody mentions out loud
There is something quietly broken about the operation you just built. Watch.
Start with the chips in identity order and the positional encoding off. Tap any two chips to swap them. Notice the verdict: outputs match permutation. What’s happening: when you swap positions $i$ and $j$ in $X$, the rows of $Q$, $K$, $V$ swap; the rows and columns of $QK^\top$ swap; the softmax operates row-wise so it commutes with that swap; the final multiplication by $V$ produces an output that is just the original output with the corresponding rows swapped.
Formally, for any permutation matrix $P$:

$$\mathrm{Attention}(PX) = P\,\mathrm{Attention}(X)$$

That property is called permutation equivariance. It is the reason a transformer is fast and parallel: every position is processed identically, no recurrence, no fixed window. It is also the reason a transformer, in the form we have built so far (causal mask off, no positional encoding), literally cannot tell dog bites man from man bites dog. The model has no notion of position. It operates over sets, not sequences.
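Permutation equivariance is easy to check numerically. The sketch below uses unmasked self-attention with random weights, made-up sizes, and a random permutation; none of these values come from the widget.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # Unmasked self-attention, no positional information.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
n, d = 5, 8                                  # illustrative sizes
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(n)
P = np.eye(n)[perm]                          # (P @ X)[i] == X[perm[i]]

permute_then_attend = self_attention(P @ X, W_q, W_k, W_v)
attend_then_permute = P @ self_attention(X, W_q, W_k, W_v)

max_diff = np.abs(permute_then_attend - attend_then_permute).max()
print(max_diff)                              # zero up to floating-point rounding
```

Shuffling the tokens before the layer or shuffling the outputs after it gives the same rows, which is what "operates over sets" means in practice.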
Permutation equivariance, in one number
With positional encoding off and any permutation $P$ applied to the input, what is $\lVert \mathrm{Attention}(PX) - P\,\mathrm{Attention}(X) \rVert$ (the difference between attention on the permuted input and the permutation of the original output)?
The fix is the next lesson
Now flip positional encoding on in the widget and swap two chips. The verdict flips to outputs do not match. Adding position information to the inputs breaks the permutation equivariance, on purpose. We want the model to know that position 0 is the start of the sequence and position $n - 1$ is the most recent token, because the meaning of a sentence depends on order.
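The same numerical check shows the equivariance breaking. The position vectors `pos` below are arbitrary random stand-ins, one per slot; they are an assumption for illustration and not any particular encoding scheme.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
n, d = 5, 8                                  # illustrative sizes
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
pos = rng.normal(size=(n, d))                # hypothetical per-slot position vectors

perm = np.array([1, 0, 2, 3, 4])             # swap the first two tokens
P = np.eye(n)[perm]

def with_positions(tokens):
    # Position belongs to the slot, not the token, so it is added after
    # the tokens have been arranged.
    return self_attention(tokens + pos, W_q, W_k, W_v)

no_pos_match = np.allclose(
    self_attention(P @ X, W_q, W_k, W_v), P @ self_attention(X, W_q, W_k, W_v)
)
pos_match = np.allclose(with_positions(P @ X), P @ with_positions(X))

print(no_pos_match, pos_match)               # True False
```

Once each slot carries its own signal, swapping two tokens changes what the layer sees, so the outputs are no longer a reshuffle of the originals.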
There are several ways to inject position. Sinusoidal encodings, learned absolute encodings, rotary encodings: each has a different geometry and a different trade-off. That’s what the next lesson covers.
The key thing to leave this lesson with: positional encoding is not decoration. It is the fix to a specific, named symptom: the fact that scaled dot-product attention, on its own, is blind to order. Once you have seen the bug, the patch is unsurprising.