The 12 d² rule
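Per-block parameter count, ignoring biases and LayerNorm (both negligible):
$$P_{\text{block}} \approx \underbrace{4d^2}_{\text{attention}} + \underbrace{8d^2}_{\text{FFN}} = 12d^2$$
Two-thirds of this is FFN. The four attention projections (each $d \times d$) sum to $4d^2$. The two FFN matrices (first $d \times 4d$, then $4d \times d$) sum to $8d^2$. Per block, the FFN owns twice the parameter count of attention.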
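As a quick check of this arithmetic, here is a minimal Python sketch. The function name `block_params` and the `ffn_mult` argument are illustrative choices, not part of the lesson. It assumes four $d \times d$ attention projections and a two-matrix FFN with hidden width $4d$, and it ignores biases and LayerNorm.

```python
def block_params(d: int, ffn_mult: int = 4) -> dict:
    """Weight-matrix parameters in one transformer block (no biases, no LayerNorm)."""
    attention = 4 * d * d            # W_Q, W_K, W_V, W_O: four d x d matrices
    ffn = 2 * d * (ffn_mult * d)     # d -> ffn_mult*d, then ffn_mult*d -> d
    return {"attention": attention, "ffn": ffn, "total": attention + ffn}


if __name__ == "__main__":
    d = 1024                         # example width only, not a GPT-2 small value
    counts = block_params(d)
    print(counts)
    print(counts["total"] == 12 * d * d)        # True: the 12 d^2 rule
    print(counts["ffn"] / counts["total"])      # FFN share of the block
```

Changing `ffn_mult` is a cheap way to see how much of the rule depends on the $4\times$ expansion convention rather than on attention.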
This is a 2017 convention that keeps holding up. Every standard pre-LN transformer follows it. Modern gated FFNs (SwiGLU in LLaMA) reduce the hidden-width ratio from 4 to ~$8/3$ because the gate adds a third matrix; three matrices at that width still total about $8d^2$, so the underlying pattern of “FFN dominates” is preserved.
Per-block params at GPT-2 dimensions
GPT-2 small has $d = 768$. What is the per-block parameter count (assume no biases, ignore LayerNorm)?
Where GPT-2 small's 124M actually goes
Walk the derivation, with the standard configuration ($V = 50257$, $T = 1024$, $d = 768$, $N = 12$, weight-tied):
| Component | Formula | Value |
|---|---|---|
| Token embedding (tied with LM head) | $V \cdot d$ | 38.6 M |
| Learned positional encoding | $T \cdot d$ | 0.79 M |
| 12 blocks of attention | $N \cdot 4d^2$ | 28.3 M |
| 12 blocks of FFN | $N \cdot 8d^2$ | 56.6 M |
| LayerNorms (per-block + final) | $(2N + 1) \cdot 2d$ | 0.04 M (negligible) |
| Total | | ≈ 124.4 M |
GPT-2 reports 124M for small. The arithmetic accounts for nearly every parameter; the small remainder is biases (which we ignored).
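The table can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the function name `gpt2_param_budget` is made up for illustration, the configuration is the standard GPT-2 small one ($V = 50257$, $T = 1024$, $d = 768$, $N = 12$), and the bias layout (biases on each of the four linear layers in a block, $9d$ per block) is an assumption about the GPT-2 implementation rather than something the lesson states.

```python
def gpt2_param_budget(V: int, T: int, d: int, N: int) -> dict:
    """Parameter budget for a weight-tied GPT-2-style model with learned positions."""
    budget = {
        "token_embedding": V * d,          # shared with the LM head (weight tying)
        "positional": T * d,               # learned positional table
        "attention": N * 4 * d * d,        # four d x d projections per block
        "ffn": N * 8 * d * d,              # d x 4d and 4d x d per block
        "layernorm": (2 * N + 1) * 2 * d,  # gain + bias; two per block plus a final one
    }
    budget["total_no_bias"] = sum(budget.values())
    # Assumption: linear-layer biases of size 3d (fused QKV), d (attention output),
    # 4d (FFN up) and d (FFN down), i.e. 9d per block. The table ignores these.
    budget["linear_biases"] = N * 9 * d
    return budget


if __name__ == "__main__":
    for name, count in gpt2_param_budget(V=50257, T=1024, d=768, N=12).items():
        print(f"{name:>16}: {count / 1e6:8.2f} M")
```

The `linear_biases` entry is there only to show how small the ignored remainder is next to the weight matrices.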
Two things this table makes loud:
- At GPT-2 small’s scale, the embedding alone is 31% of the total. That’s a quirk of small models with large vocabularies. As $d$ grows, the embedding’s $Vd$ stays linear in $d$ but the blocks’ $12Nd^2$ goes quadratic, so embeddings shrink as a share of the budget.
- Of the non-embedding parameters, two-thirds live in the FFN (56.6 / (28.3 + 56.6) ≈ 67%). Attention is the famous part of a transformer. The FFN is where the weight is.
Now scale this up. Snap the widget below to GPT-2 small and watch it land near 124 M. Then snap to GPT-2 medium ($d = 1024$, $N = 24$, ~355 M). Then GPT-2 XL ($d = 1600$, $N = 48$, ~1.5 B). Then GPT-3 175B. Notice the embedding’s relative slice collapsing as you go.
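Without the widget, the same trend can be sketched in Python. The configurations below are assumptions taken from the published GPT-2 and GPT-3 model descriptions, not from the lesson; `embedding_share` is a hypothetical helper, and it drops LayerNorm and biases since both are negligible at every size.

```python
# (V, T, d, N) per model. Assumed values from the published model descriptions.
CONFIGS = {
    "GPT-2 small":  (50257, 1024, 768, 12),
    "GPT-2 medium": (50257, 1024, 1024, 24),
    "GPT-2 XL":     (50257, 1024, 1600, 48),
    "GPT-3 175B":   (50257, 2048, 12288, 96),
}


def embedding_share(V: int, T: int, d: int, N: int) -> tuple:
    """Return (token-embedding fraction of total, approximate total parameters)."""
    token_emb = V * d                       # tied with the LM head
    total = token_emb + T * d + 12 * N * d * d
    return token_emb / total, total


if __name__ == "__main__":
    for name, cfg in CONFIGS.items():
        share, total = embedding_share(*cfg)
        print(f"{name:>13}: total ~ {total / 1e6:10.1f} M, embedding share {share:6.2%}")
```

The share falls because the numerator grows like $d$ while the $12Nd^2$ term grows like $N d^2$.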
GPT-2 small total
With $V = 50257$, $T = 1024$, $d = 768$, $N = 12$, weight tying on, estimate the total parameter count of GPT-2 small to the nearest million.
FLOPs follow params, almost exactly
The Kaplan-and-friends rule of thumb (Kaplan et al. 2020 and follow-ups):
$$C_{\text{token}} \approx 6P$$
That is: training compute per token, in FLOPs, is roughly six times the non-embedding parameter count $P$. The factor of 6 comes from one forward pass (≈ $2P$ FLOPs, one multiply and one add per parameter) plus a backward pass (≈ $4P$ FLOPs, accounting for gradients with respect to both inputs and parameters). Within a small constant, training one token through any transformer costs $6P$ FLOPs of arithmetic.
This rule of thumb is what makes scaling laws so clean. Total training compute = (FLOPs per token) × (tokens trained on) = $6PD$, where $D$ is the number of training tokens. That lets you predict performance from $P$ and $D$ alone, ignoring most architectural details. It’s why “Chinchilla-optimal” training ratios talk about tokens per parameter, $D/P$; they’re statements about the $P$-vs-$D$ trade-off in a constant-FLOPs budget, derived from this approximation.
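A small Python sketch makes the trade-off concrete. Everything named here is illustrative: `train_flops` and `split_budget` are hypothetical helpers, $P$ stands for non-embedding parameters and $D$ for training tokens, and the tokens-per-parameter value of 20 is used only as a commonly quoted Chinchilla-style example, not as a figure from the lesson.

```python
def train_flops(P: float, D: float) -> float:
    """Approximate training compute: 6 FLOPs per parameter per token."""
    return 6.0 * P * D


def split_budget(C: float, tokens_per_param: float) -> tuple:
    """Solve C = 6 * P * D with D = tokens_per_param * P, returning (P, D)."""
    P = (C / (6.0 * tokens_per_param)) ** 0.5
    return P, tokens_per_param * P


if __name__ == "__main__":
    # Roughly GPT-2 small's non-embedding parameter count, one token.
    print(f"FLOPs per token at P = 85M: {train_flops(85e6, 1.0):.3g}")

    # A fixed compute budget, split at an assumed tokens-per-parameter ratio.
    C = 1e21                              # hypothetical budget in FLOPs
    P, D = split_budget(C, tokens_per_param=20.0)
    print(f"P ~ {P:.3g} parameters, D ~ {D:.3g} tokens")
```

Because the budget is a product, doubling $P$ at fixed compute halves $D$; the ratio argument is what picks a point on that curve.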
The T² term: when it actually dominates
M15 ended on the cost of attention being $O(T^2 d)$, quadratic in sequence length $T$. Time to put that in context.
Attention’s per-block cost has two parts:
- Projections ($Q$, $K$, $V$, and the output projection): $\approx 8Td^2$ FLOPs, linear in $T$
- The matmul mixing ($QK^\top$ and $AV$, with $A$ the attention weights): $\approx 4T^2 d$ FLOPs, quadratic in $T$
Both counts treat a multiply-add as 2 FLOPs. The crossover is where $4T^2 d = 8Td^2$, i.e., $T = 2d$. Below that, projection FLOPs dominate. Above that, the matmul dominates.
For GPT-2 small with $d = 768$, the crossover is at $T = 1536$. Below that (including the standard $T = 1024$ context) attention FLOPs are dominated by projections, not by $T^2$. This is why “long context” is a research-frontier problem: at $T \gg 2d$ and the same $d$, the matmul dominates and grows quadratically forever. Below the crossover, scaling is cheap; above it, scaling is expensive.
The FFN, meanwhile, is $\approx 16Td^2$ FLOPs, linear in $T$, dominant at all reasonable contexts at GPT-2-small scale. As we said: FFN is where the weight is, and where most of the FLOPs are.
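The three costs can be tabulated with a short Python sketch. The constants are assumptions of this sketch, not figures quoted from the lesson: a multiply-add counts as 2 FLOPs, attention has four $d \times d$ projections, the FFN hidden width is $4d$, and no savings from causal masking are counted. `block_forward_flops` is a hypothetical helper name.

```python
def block_forward_flops(T: int, d: int) -> dict:
    """Approximate forward-pass FLOPs for one block over a length-T sequence."""
    return {
        "attn_projections": 8 * T * d * d,  # four (T x d) @ (d x d) matmuls
        "attn_mixing": 4 * T * T * d,       # Q @ K^T, then attention weights @ V
        "ffn": 16 * T * d * d,              # (T x d) @ (d x 4d), then (T x 4d) @ (4d x d)
    }


if __name__ == "__main__":
    d = 768                                 # GPT-2 small width
    for T in (256, 1024, 2 * d, 8192):
        flops = block_forward_flops(T, d)
        largest = max(flops, key=flops.get)
        row = ", ".join(f"{k} = {v / 1e9:.2f} GFLOPs" for k, v in flops.items())
        print(f"T = {T:5d}: {row} (largest: {largest})")
```

Under these assumptions the projection and mixing terms meet at $T = 2d$, which is where the quadratic term starts to take over the attention cost.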
Doubling sequence length
At long enough context that the matmul dominates, by what factor does attention compute grow when you double sequence length $T$?
What's already done, and what's next
You have the entire transformer architecture. Six lessons in this module:
- One block, top to bottom: what a transformer block is.
- Why pre-LN: why modern wirings put LayerNorm where they do.
- The residual stream as the object: the framing that makes everything that follows tractable.
- Stacking N and the full forward pass: the seven-line forward pass.
- Counting the cost: where parameters and FLOPs go.
Plus everything from earlier modules that this module just reused: token embeddings (M14), softmax cross-entropy loss (M14), gradients flowing through residuals (M13), layer norm (M13), the MLP (M11), full multi-head attention (M15). The transformer is small. The transformer is six modules of prerequisites and one module of assembly.
The last thing left for the language-model arc is how to actually use the trained model. M17 covers tokenization (BPE, that is, byte-pair encoding, the boundary between “raw text” and “token IDs the model sees”) and sampling (temperature, top-k, top-p, beam search, the decisions you make at inference time about how to turn the next-token distribution into the next token). Then the M18 capstone trains a tiny GPT in your browser, end-to-end.