Patching the bug we already saw
Last lesson ended with a working attention layer that cannot tell dog bites man from man bites dog. The operation is permutation-equivariant: a rearrangement of the input rows produces a corresponding rearrangement of the output rows, and nothing else.
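To make the equivariance claim concrete, here is a minimal NumPy sketch. It assumes a single-head, unmasked self-attention with random weights; the function names, sizes, and weights are illustrative, not the previous lesson's actual code. Reversing the input rows reverses the output rows and changes nothing else.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stabilize before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head, unmasked self-attention with no positional signal."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
n_tokens, d = 3, 4  # hypothetical sizes
X = rng.normal(size=(n_tokens, d))  # stand-in embeddings for "dog", "bites", "man"
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = np.array([2, 1, 0])  # reverse the word order
out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)

# Permuting the inputs only permutes the outputs in the same way.
equivariant = np.allclose(out_perm, out[perm])
```

Note that a causal mask breaks this symmetry on its own, which is why the sketch leaves masking out.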
We need to inject position into the operation so that two tokens at different positions produce different attention behavior, even when they are the same token. There are three approaches in widespread use, each with a different geometry. We’ll cover all three at the level of “what does it actually do to the math,” not derivation.
Sinusoidal: a multi-scale clock added to the input
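The original transformer (Vaswani et al., 2017) injects position as a fixed vector added to each token embedding before it enters the first attention layer:
$$x_p = e_p + PE(p)$$
Here $e_p$ is the embedding of the token at position $p$, and $d$ is the embedding dimension. The vector $PE(p)$ has $d/2$ pairs of components. Each pair is a 2-D point on the unit circle, rotating at a different frequency:
$$\big(PE(p)_{2i},\ PE(p)_{2i+1}\big) = \big(\sin(\omega_i p),\ \cos(\omega_i p)\big), \qquad \omega_i = 10000^{-2i/d}, \quad i = 0, \dots, d/2 - 1.$$
So $PE(p)$ is a clock with $d/2$ hands, each hand turning at its own rate. The fastest hand (smallest $i$) sweeps the unit circle in $2\pi \approx 6.28$ steps. The slowest (largest $i$) takes about $2\pi \cdot 10000 \approx 62{,}832$ steps to make one revolution. Together, the hands give a unique fingerprint for each position over a very large range.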
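The clock picture can be sketched in a few lines of NumPy. This is an illustrative implementation of the sinusoidal scheme from Vaswani et al. (2017), with sine on even indices and cosine on odd indices; the function name `sinusoidal_pe` and the sizes used below are assumptions for the example.

```python
import numpy as np

def sinusoidal_pe(num_positions, d, base=10000.0):
    """Return a (num_positions, d) array of sinusoidal encodings; d must be even."""
    positions = np.arange(num_positions)[:, None]  # shape (P, 1)
    i = np.arange(d // 2)[None, :]                 # shape (1, d/2), one entry per hand
    omega = base ** (-2.0 * i / d)                 # angular frequency of each hand
    angles = positions * omega                     # shape (P, d/2)
    pe = np.empty((num_positions, d))
    pe[:, 0::2] = np.sin(angles)                   # even indices: sine
    pe[:, 1::2] = np.cos(angles)                   # odd indices: cosine
    return pe

pe = sinusoidal_pe(num_positions=128, d=16)        # hypothetical sizes

# Every (sin, cos) pair sits on the unit circle.
pair_norms = np.hypot(pe[:, 0::2], pe[:, 1::2])

# Wavelength of each hand in position steps: 2*pi / omega_i.
wavelengths = 2 * np.pi / (10000.0 ** (-2.0 * np.arange(8) / 16))
```

In use, `pe[:seq_len]` would be added to a `(seq_len, d)` matrix of token embeddings before the first attention layer.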
Why those weird wavelengths
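The choice of geometric wavelengths from $2\pi$ to $10000 \cdot 2\pi$ is not aesthetic. It is what makes the encoding relatively shift-invariant.
For each frequency band $\omega_i$, the (sin, cos) pair at position $p + k$ is a 2-D rotation of the (sin, cos) pair at position $p$:
$$\begin{pmatrix} \sin(\omega_i (p + k)) \\ \cos(\omega_i (p + k)) \end{pmatrix} = \begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix} \begin{pmatrix} \sin(\omega_i p) \\ \cos(\omega_i p) \end{pmatrix}$$
That means the encoding $PE(p + k)$ at position $p + k$ is a fixed linear function of the encoding $PE(p)$ at position $p$ for any offset $k$: the matrix depends on $k$ but not on $p$. The attention layer can, in principle, learn that linear function and use it to read off relative positions. Sinusoidal encodings give the model a smooth, position-invariant way to think about offsets.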
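The rotation claim is easy to check numerically. The sketch below picks one frequency band (the band index, dimension, and offset are arbitrary choices for illustration) and verifies that a single matrix built from the offset alone maps the pair at any position to the pair at that position plus the offset.

```python
import numpy as np

omega = 10000.0 ** (-2 * 3 / 16)  # one hypothetical band: pair index 3, dimension 16

def pair(p):
    """The (sin, cos) pair of this band at position p."""
    return np.array([np.sin(omega * p), np.cos(omega * p)])

def shift_matrix(k):
    """Matrix that advances any (sin, cos) pair by k positions; it never sees p."""
    c, s = np.cos(omega * k), np.sin(omega * k)
    return np.array([[c, s], [-s, c]])

k = 5
M = shift_matrix(k)

# The same matrix works at every starting position.
ok = all(np.allclose(M @ pair(p), pair(p + k)) for p in range(100))
```

Because `M` is orthogonal, shifting never stretches or shrinks the pair, which is what makes it a pure rotation.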
The slowest hand
For dimension index $i = 0$, the wavelength of the sinusoid is $2\pi \approx 6.28$ positions (one revolution every $2\pi$ position steps). For the last pair (exponent $2i/d \approx 1$), what is the wavelength in positions, to the nearest integer?
Learned absolute: skip the math, buy the table
GPT-2 and GPT-3 swap sinusoidal for a learned table. Maintain a parameter matrix $P \in \mathbb{R}^{L_{\max} \times d}$, one row per position, $L_{\max}$ rows total. At every position, look up the corresponding row and add it to the token embedding.
Trade-offs:
- Pro: zero math. Just gradient descent. Whatever positions the training data exercises, the table learns whatever embedding helps the loss.
- Pro: simpler to explain.
- Con: hard cap at $L_{\max}$, the number of rows in the table. If you trained with a 2048-position table and you want to run inference at length 4096, you have nothing to feed positions 2048–4095. The model cannot extrapolate.
- Con: each new context length needs a retrain.
For a research-scale model with a fixed context budget, this is fine. For a frontier model that wants to scale context windows after pre-training, it is the wrong shape.
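The lookup and its hard cap can be sketched as follows. The table here is filled with random numbers as a stand-in for trained parameters, and the names `pos_table` and `add_learned_positions` are hypothetical; this is not GPT-2's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
max_positions, d = 2048, 8  # hypothetical table size and embedding width

# In a real model these rows are parameters updated by gradient descent.
pos_table = rng.normal(scale=0.02, size=(max_positions, d))

def add_learned_positions(token_embeddings):
    """Add row p of the table to the token embedding at position p."""
    seq_len = token_embeddings.shape[0]
    if seq_len > max_positions:
        raise ValueError(
            f"sequence length {seq_len} exceeds table size {max_positions}"
        )
    return token_embeddings + pos_table[:seq_len]

short = add_learned_positions(np.zeros((16, d)))  # fits in the table

try:
    add_learned_positions(np.zeros((4096, d)))    # positions 2048-4095 have no rows
    too_long_failed = False
except ValueError:
    too_long_failed = True
```

The failure is structural rather than numerical: there is simply no row to look up past the end of the table.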
RoPE: position lives inside the score
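Rotary positional embeddings (Su et al., 2021) take a different angle entirely. Don’t add anything to the embedding. Instead, rotate the query and key vectors by an angle proportional to their positions, inside the attention computation. In 2-D, for a query $q$ at position $m$ and a key $k$ at position $n$:
$$\tilde{q}_m = R(m\theta)\, q, \qquad \tilde{k}_n = R(n\theta)\, k, \qquad R(\alpha) = \begin{pmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{pmatrix}$$
(Generalize: $q$ and $k$ live in $\mathbb{R}^d$. Treat each consecutive pair of dimensions as a 2-D plane and apply its own rotation at frequency $\theta_i$. The $\theta_i$ are geometrically spaced like sinusoidal PE.)
The killer property: the dot product after rotation depends only on the offset $n - m$, not on the absolute positions $m$ or $n$.
$$\tilde{q}_m^\top \tilde{k}_n = q^\top R(m\theta)^\top R(n\theta)\, k = q^\top R\big((n - m)\theta\big)\, k$$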
That’s it. The position information makes it through the rotation but the dot product reads off only the relative offset, exactly what an attention head wants in order to model “this token is three back” without caring about its absolute index.
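A small NumPy sketch of the offset-only property, assuming the consecutive-pair layout described above and geometrically spaced frequencies with base 10000. The function name `rope` and the vector size are illustrative; some production implementations pair dimensions differently.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate each consecutive pair (x[2i], x[2i+1]) by the angle pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)  # one frequency per pair
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)  # hypothetical query and key

def score(m, n):
    """Attention logit for a query at position m and a key at position n."""
    return rope(q, m) @ rope(k, n)

reference = score(0, 2)                               # offset 2 at the origin
shifted = [score(m, m + 2) for m in (1, 5, 100)]      # offset 2 elsewhere
different_offset = score(0, 3)                        # offset 3 for contrast
```

Every entry of `shifted` matches `reference` up to floating-point error, while `different_offset` does not: the score sees the offset and nothing else about position.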
Watch what happens. Set the positions $m$ and $n$ so that the offset $n - m$ is 2. Read the rotated dot product. Now press +1 twice. Both $m$ and $n$ have moved up by 2, and the offset is still 2. The rotated dot product is the same number to the last decimal. Press −1 to slide both positions back down. Same number. The yellow reference row computes the same dot product at a canonical pair of positions with the same offset, to confirm that what you’re seeing is the geometry, not numerical luck.
In a real transformer, the rotation angle $\theta$ is small (frequencies as before) and the rotation is per-pair, not for the whole vector. The widget exaggerates the rotation per step so the geometry is visible.
Shifting the whole sequence
A model is using RoPE. We shift every token’s position by +5 (so the token at position 0 moves to position 5, the token at position 1 moves to position 6, etc.). By how much do the attention scores change?
What to use, what to remember
- Sinusoidal: cheap, fixed, no parameters, the textbook answer. Worth knowing because it is the scheme from the original transformer paper and the reference point for everything that followed.
- Learned absolute: GPT-2/3 era. Simple, hard cap at training-time context length.
- RoPE: what most modern open-weight models (LLaMA family, most fine-tunable LLMs) actually use. Encodes relative position, extrapolates better, plugs in cleanly per layer.
If you’re building a transformer today, default to RoPE unless you have a specific reason not to. If you’re reading a 2017–2018-era paper, you’ll see sinusoidal. If you’re reading the original GPT-2 code, you’ll see a learned table. All three are solving the same named problem: attention without position information cannot tell sequences from sets, and we wanted sequences.
We have now built a single attention head that scales correctly, masks the future, and respects position. The next two lessons widen and run it: the multi-head lesson lets several attention patterns operate in parallel on the same input; the cost-and-cache lesson shows you why context windows are expensive and how production inference dodges most of that cost.