The Transformer Block · 14 min

Why pre-LN

Two transformer-block architectures differ by one wire: whether LayerNorm sits before each sub-layer or after the residual add. One trains without warmup at any depth. The other doesn't. Find out why.


Two architectures, one wire of difference

There are two canonical wirings for the transformer block. Both have a residual + LayerNorm + sub-layer pattern around each of the two sub-layers. They differ in which side of the residual add the LayerNorm sits on.

Pre-LN (modern default):

$$x \;\to\; x + \mathrm{Sub}(\mathrm{LN}(x))$$

Post-LN (the original 2017 wiring):

$$x \;\to\; \mathrm{LN}(x + \mathrm{Sub}(x))$$

In pre-LN the residual trunk (the unmodified $x$) is never normalized between blocks. The LayerNorm is upstream of the sub-layer’s input (a “private” normalization just for that sub-layer’s computation). In post-LN, the LN sits on the trunk after each residual add, normalizing every block’s accumulated state.

You can read which architecture a piece of code is by asking one question: does $x$ ever pass through an LN before the next residual add? If no, pre-LN. If yes, post-LN.
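The two wirings can be sketched in a few lines of plain Python. This is a minimal illustration rather than any library's implementation: `layer_norm` omits the learned gain and bias, and `toy_sublayer` is a hypothetical stand-in for attention or the FFN.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for attention or the FFN: any vector-to-vector map."""
    return [0.5 * v + 1.0 for v in x]

def pre_ln_step(x, sub):
    # x -> x + Sub(LN(x)): the LN only touches the sub-layer's input.
    delta = sub(layer_norm(x))
    return [a + b for a, b in zip(x, delta)]

def post_ln_step(x, sub):
    # x -> LN(x + Sub(x)): the LN sits on the trunk, after the residual add.
    summed = [a + b for a, b in zip(x, sub(x))]
    return layer_norm(summed)

x = [10.0, -20.0, 30.0, 40.0]
pre_out = pre_ln_step(x, toy_sublayer)    # trunk keeps the scale of x
post_out = post_ln_step(x, toy_sublayer)  # trunk is re-normalized
```

In this sketch `pre_out` stays at roughly the scale of `x`, because the trunk itself never passed through the LN, while `post_out` comes back with zero mean and unit variance no matter how large `x` was.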

Read off the architecture

In a pre-LN block, does the residual trunk $x$ pass through any LayerNorm before the next block’s residual add? Enter 0 for no, 1 for yes.

Watch the gradients diverge

The architectural difference would be a footnote if it didn’t have a sharp empirical consequence. It does. Xiong et al. (2020) proved that at depth $L$:

  • Post-LN’s expected gradient norm at the top of the stack is $\Theta(\ln L)$, growing with depth.
  • Pre-LN’s is $\Theta(\ln L \,/\, \sqrt{L})$, shrinking with depth.

In plain words: the deeper a post-LN model is, the bigger the gradient pulse on its last block at step zero. That pulse is what blows up training unless you ramp the learning rate up gradually with a warmup schedule. Pre-LN doesn’t have the pulse; its gradient norm stays bounded (and small) at every depth.
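To get a feel for the two rates quoted above, they can be evaluated at a few depths. This is only arithmetic on the stated asymptotic forms: the hidden constant factors are assumed to be 1, and the function names are illustrative.

```python
import math

def post_ln_rate(depth):
    # Rate quoted above for post-LN, constant factor assumed to be 1.
    return math.log(depth)

def pre_ln_rate(depth):
    # Rate quoted above for pre-LN, constant factor assumed to be 1.
    return math.log(depth) / math.sqrt(depth)

for depth in (4, 12, 24, 48):
    post, pre = post_ln_rate(depth), pre_ln_rate(depth)
    print(f"L={depth:>2}  post-LN ~ {post:.2f}  pre-LN ~ {pre:.2f}  ratio ~ {post / pre:.2f}")
```

Under that assumption the post : pre ratio is exactly $\sqrt{L}$, so the gap widens as blocks are added. Note that $\ln L / \sqrt{L}$ peaks at $L = e^2 \approx 7.4$ and only shrinks beyond that depth.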

[Interactive figure: gradient norm ‖∂L / ∂θ_ℓ‖ per layer (vertical axis, 0.00 to 3.54) against layer index ℓ (horizontal axis, ℓ = 1 to 24), one curve each for post-LN and pre-LN. Readout at this depth: post-LN max norm 3.22, pre-LN max norm 0.66, post : pre ratio 4.9×, verdict “warmup required” for post-LN.]

Qualitative profile after Xiong et al. 2020. Post-LN's gradient grows toward the output as L increases; pre-LN's flattens at O(1/√L).

Drag the depth slider. At $L = 4$, the two curves are similar; both architectures train fine. At $L = 12$, post-LN’s profile starts climbing toward the output. By $L = 24$, the ratio is past 3× and the verdict flips: warmup is required to keep post-LN stable. By $L = 48$, pre-LN’s curve has flattened into the floor while post-LN’s ramp has gotten steeper.

This is the main reason modern LLMs (GPT-2/3, LLaMA, Mistral, Qwen, Gemma, etc.) use pre-LN. They train deep with constant or simply scheduled learning rates. Post-LN at the same depth would need careful warmup tuning to avoid divergence in the first hundred steps.
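The warmup that post-LN depends on is a ramp on the learning rate over the first steps of training. The lesson does not specify a schedule, so the sketch below assumes a plain linear ramp; `warmup_lr`, `base_lr`, and `warmup_steps` are illustrative names and values, not taken from any particular recipe.

```python
def warmup_lr(step, base_lr, warmup_steps):
    """Linear warmup: ramp up to base_lr over warmup_steps, then hold it."""
    if warmup_steps <= 0:
        return base_lr  # no warmup: full learning rate from step 0
    return base_lr * min(1.0, (step + 1) / warmup_steps)

# Hypothetical values, chosen only for illustration.
base_lr, warmup_steps = 3e-4, 100
schedule = [warmup_lr(s, base_lr, warmup_steps) for s in (0, 49, 99, 500)]
```

The first update is taken at a small fraction of `base_lr`, which is what damps a large initial gradient on the top block. Setting `warmup_steps=0` gives the constant schedule.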

The price pre-LN pays

There is no free lunch and pre-LN has a small, real cost.

Because pre-LN’s residual trunk is never normalized between blocks, the magnitude of the residual stream tends to grow as more blocks are stacked; each block’s delta is added to whatever came before, with no rescaling. Without intervention, the activations entering the language-modeling head arrive at a scale that keeps growing with depth.

The fix is a single LayerNorm at the very top of the stack, after the last block, before the unembedding. Just one, not one per block. Pre-LN models always include it. Post-LN models don’t need it because every block’s output is already normalized by construction.
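The growth of the trunk, and the effect of the single final LayerNorm, can be seen in a toy simulation. This is a sketch under strong assumptions: the sub-layer is a hypothetical random elementwise map (not attention or an FFN), the LayerNorm has no learned parameters, and the width, depths, and seed are arbitrary.

```python
import math
import random

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms(x):
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def toy_sublayer(h, rng):
    # Hypothetical sub-layer: random elementwise weights, so the delta has
    # roughly unit scale whenever its input h has unit scale.
    return [rng.gauss(0.0, 1.0) * v for v in h]

def run_pre_ln_trunk(depth, width=256, seed=0):
    """Stack `depth` toy pre-LN steps and return the un-normalized trunk."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        delta = toy_sublayer(layer_norm(x), rng)  # x + Sub(LN(x))
        x = [a + b for a, b in zip(x, delta)]
    return x

shallow = run_pre_ln_trunk(depth=4)
deep = run_pre_ln_trunk(depth=48)
head_input = layer_norm(deep)  # the single final LN before the unembedding

print(f"trunk RMS at depth 4:  {rms(shallow):.2f}")
print(f"trunk RMS at depth 48: {rms(deep):.2f}")
print(f"after final LN:        {rms(head_input):.2f}")
```

In this toy model each block adds a roughly unit-scale delta, so the trunk's magnitude climbs with depth, and the final LN brings it back to unit scale before the head sees it.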

A second cost: very well-tuned post-LN models can sometimes reach slightly lower final loss than pre-LN at moderate depth. With careful warmup and hyperparameter sweeps, post-LN can exploit its normalized trunk, which gives every layer an input at a stable scale by construction. The trade is depth scalability against about half a percentage point of final perplexity. Modern recipes overwhelmingly take depth.

Spot the post-LN model

Of these four widely-deployed models (GPT-2, BERT-original, T5, LLaMA-2), which one uses post-LN?

Build it yourself

The block has four wires you can choose: where each sub-layer’s LayerNorm sits, and whether each sub-layer’s residual is connected. Only the two LayerNorm placements are commonly debated; the residual switches are not. Turning either residual off breaks the gradient highway and the model becomes untrainable above a few layers.

sub-layer 1 (attention)

sub-layer 2 (FFN)

x = x + MHA(LN(x))
x = x + FFN(LN(x))
✓ pre-LN block
Modern default. Each sub-layer normalizes its input, computes a delta, and adds the unnormalized delta to the residual trunk. Gradient norm stays O(1/√L) at depth, no warmup needed. You will need a final LN at the top of the stack.

Walk through the obvious ones first. Set everything to pre + on; the verdict says ✓ pre-LN block. Switch both LNs to post; you get ✓ post-LN block. Then experiment: mix one pre with one post; turn off a residual; turn off a LayerNorm. Each broken combination shows you specifically what fails. The mixed pre/post case isn’t strictly “broken” but no real transformer ships with it; it inherits the warmup-required cost of post-LN without getting the normalized-output benefit.
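The switch combinations can also be written out as code. The sketch below is a hypothetical reconstruction of the builder's logic, not its actual source: it covers only the LayerNorm placement and residual switches, and the function names and verdict strings are illustrative.

```python
def sublayer_line(name, ln, residual):
    """Return the update rule for one sub-layer as a string.

    name: "MHA" or "FFN"; ln: "pre" or "post"; residual: True if connected.
    """
    if ln not in ("pre", "post"):
        raise ValueError(f"ln must be 'pre' or 'post', got {ln!r}")
    if ln == "pre":
        body = f"{name}(LN(x))"
        return f"x = x + {body}" if residual else f"x = {body}"
    inner = f"x + {name}(x)" if residual else f"{name}(x)"
    return f"x = LN({inner})"

def verdict(attn_ln, ffn_ln, attn_res=True, ffn_res=True):
    """Classify a block from its four switches (labels are illustrative)."""
    if not (attn_res and ffn_res):
        return "broken: residual path cut"
    if attn_ln == ffn_ln:
        return f"{attn_ln}-LN block"
    return "mixed pre/post block"

# All-pre, residuals on: the modern default.
print(sublayer_line("MHA", "pre", True))
print(sublayer_line("FFN", "pre", True))
print(verdict("pre", "pre"))
```

Checking the residual switches first mirrors the point above: a cut residual breaks the block regardless of where the LayerNorms sit.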

The takeaway: modern transformers settle on pre-LN with both residuals on. That decision sits in three places of the model file (two per block, one at the top of the stack) and propagates through everything: optimizer schedule, training stability, depth scalability, even the logit-lens trick we’ll use in the next lesson. High gradient norms in the wrong place are how thousands of training runs have ended early.
