Training Dynamics & Modern Tricks · 16 min

The tricks that make depth possible

Residual connections, learning-rate warmup, gradient clipping. The three things that turn a 12-layer transformer from "diverges in 200 steps" into "converges cleanly."


The degradation problem

Around 2014–2015, papers kept reporting the same uncomfortable result. You’d train a 20-layer plain network on a benchmark such as CIFAR-10. It worked. You’d add more layers (30, 50, 56) and the loss got worse. Not just worse on the test set: worse on the training set. Adding capacity to the model was making optimization harder, not easier.

Predict before you read further: what should happen if you take a working 20-layer network and stack 36 more layers on top? Surely the network can learn to make the extra layers act as identities and recover the 20-layer performance.

It can’t. The optimizer can’t find the identity mapping in those extra layers. By the time gradient information has flowed back through 36 more layers of saturating nonlinearities, it’s noise. The deeper network is harder to optimize than the shallower one, even though the shallower one is technically a special case of it.

This is the degradation problem, and the fix that solves it is one of the simplest ideas in deep learning.

Residuals: y = x + F(x)

Instead of asking each block to compute the output directly, ask it to compute the correction you should add to the input:

$$y \;=\; x \;+\; F(x;\, \theta).$$

The block has an explicit identity path (the bare $x$) plus a correction path ($F(x)$). If the correction is useless, $F$ can learn to output zero (easy), and the block reduces to the identity. If the correction is useful, $F$ learns it. Either way, the optimizer doesn’t have to reconstruct an identity from scratch through layers of nonlinearity.

This is the structural change that lets you go from 30 layers to 150 layers without the degradation problem. Nearly every modern deep architecture, including the transformer, is built on it.
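As a minimal sketch of the block above, here is y = x + F(x) in plain Python on lists of floats. The names `residual_block`, `zero_branch`, and `small_branch` are illustrative assumptions, not part of the lesson; the point is only that a branch that outputs zero turns the block into an exact identity.

```python
from typing import Callable, List

Vector = List[float]


def residual_block(x: Vector, f: Callable[[Vector], Vector]) -> Vector:
    """Return y = x + F(x): the identity path plus the branch's correction."""
    correction = f(x)
    return [xi + ci for xi, ci in zip(x, correction)]


def zero_branch(x: Vector) -> Vector:
    # A branch that has learned "no correction needed": it outputs zero.
    return [0.0 for _ in x]


def small_branch(x: Vector) -> Vector:
    # Hypothetical useful correction: nudge each feature by 10% of its value.
    return [0.1 * xi for xi in x]


x = [1.0, -2.0, 0.5]
y_identity = residual_block(x, zero_branch)    # the block acts as the identity
y_corrected = residual_block(x, small_branch)  # the input plus a small correction
```

With `zero_branch` the output equals the input element for element, which is the "easy" identity the optimizer gets for free.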

The gradient identity (one line; the whole point)

Backprop through a residual block:

$$\frac{\partial L}{\partial x} \;=\; \frac{\partial L}{\partial y}\,\bigl(I \;+\; \tfrac{\partial F}{\partial x}\bigr).$$

The identity matrix in there is the entire reason this works.

At initialization, the residual block’s $F$ is small (especially with the GPT-2 init scaling from the previous lesson, where its output projection is initialized near zero). So $\partial F/\partial x \approx 0$, and

$$\frac{\partial L}{\partial x} \;\approx\; \frac{\partial L}{\partial y}.$$

The gradient passes through unattenuated. It doesn’t matter how saturated the nonlinearity inside $F$ is or how deep the stack is. The identity term guarantees the gradient always has a route to earlier layers that does not multiply through $F$. Vanishing gradients can no longer kill you.

Residuals are not “letting the network ignore layers.” They are gradient highways. The Jacobian on the shortcut is the identity, and backprop cannot break that.
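To make the gradient-highway claim concrete, the sketch below backpropagates through a toy scalar stack of tanh layers, with and without the shortcut. Everything specific here is an assumption chosen for illustration: the scalar setting, the shared weight 0.5, the branch scale 0.1 standing in for a small-at-init branch, and the helper name `input_gradient`.

```python
import math


def input_gradient(depth, weight, x0, residual, branch_scale=1.0):
    """Return d(output)/d(input) for a stack of scalar tanh layers.

    Plain layer:    h_next = F(h)      with local derivative dF/dh
    Residual layer: h_next = h + F(h)  with local derivative 1 + dF/dh
    where F(h) = branch_scale * tanh(weight * h).
    """
    h = x0
    grad = 1.0  # running product of the local derivatives (chain rule)
    for _ in range(depth):
        t = math.tanh(weight * h)
        f = branch_scale * t
        df = branch_scale * weight * (1.0 - t * t)
        if residual:
            h = h + f
            grad *= 1.0 + df   # identity term keeps the factor near 1
        else:
            h = f
            grad *= df         # factor below 1 shrinks the gradient each layer
    return grad


plain = input_gradient(depth=12, weight=0.5, x0=1.0, residual=False)
with_skip = input_gradient(depth=12, weight=0.5, x0=1.0, residual=True,
                           branch_scale=0.1)
```

In this toy setup each plain layer multiplies the gradient by at most 0.5, so twelve layers shrink it by more than three orders of magnitude, while every residual factor is 1 plus something small and the product stays of order one.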

[Interactive figure: gradient magnitude per layer for a depth-12 stack, drawn from the loss (top) to the input (bottom). In the captured state the magnitude rises from 1.020 at the loss end to 1.268 at the input; readouts: ∂L/∂x at layer 1 = 1.268, orders of magnitude lost = +0.1.]

Each row is one layer; bar width is its gradient magnitude on a log₁₀ axis. Backprop multiplies by $\partial h_i/\partial h_{i-1}$ at each step. Without residuals, that Jacobian shrinks the gradient at every layer (worse for tanh than ReLU); by the time you reach the input, the signal is gone. Flip residuals ON and $\partial h_i/\partial h_{i-1}$ becomes $I + \partial F/\partial x \approx I$, and the gradient cruises through unchanged. The identity term in the Jacobian is the entire point.

Toggle the residual shortcuts off. Switch to tanh. The gradient at the input side collapses by ten or twenty orders of magnitude: that’s vanishing gradients, made visible. Toggle them back on and the bar stays roughly uniform from top to bottom.

Pre-LN beats post-LN, full stop

There are two places you can put the LayerNorm. The original Transformer paper put it after the residual addition:

$$\text{post-LN:}\qquad y \;=\; \text{LN}(x \;+\; F(x)).$$

Modern transformers (GPT-2, GPT-3, LLaMA, and most large language models since 2019) put the normalization inside the residual branch, before $F$:

$$\text{pre-LN:}\qquad y \;=\; x \;+\; F(\text{LN}(x)).$$

Pre-LN is unambiguously better. The original post-LN architecture required learning-rate warmup just to not diverge in the first 200 steps, and even then it became progressively harder to train as depth increased. Pre-LN can train stably with little or no warmup at much greater depth. The reason: post-LN normalizes the residual stream itself, so the shortcut is no longer a pure identity and the gradient must pass through a LayerNorm at every block; pre-LN keeps the residual stream pristine and only normalizes what flows into the sub-block.

We will use pre-LN for the transformer in this course. Unconditionally.
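The two placements can be compared side by side with a small sketch. It assumes a LayerNorm without learnable gain and bias, and uses a hypothetical zero-output branch to isolate what happens to the shortcut; the function names are illustrative, not from the lesson.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize to zero mean and unit variance (no learnable gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]


def post_ln_block(x: Vector, f: Callable[[Vector], Vector]) -> Vector:
    """post-LN: y = LN(x + F(x)); the residual stream itself is normalized."""
    return layer_norm([xi + fi for xi, fi in zip(x, f(x))])


def pre_ln_block(x: Vector, f: Callable[[Vector], Vector]) -> Vector:
    """pre-LN: y = x + F(LN(x)); only the branch input is normalized."""
    return [xi + fi for xi, fi in zip(x, f(layer_norm(x)))]


def zero_branch(v: Vector) -> Vector:
    # Hypothetical branch that contributes nothing, to isolate the shortcut.
    return [0.0 for _ in v]


x = [3.0, -1.0, 4.0, 2.0]
pre_out = pre_ln_block(x, zero_branch)    # x passes through untouched
post_out = post_ln_block(x, zero_branch)  # x is rescaled even though F(x) = 0
```

Even with a branch that does nothing, the post-LN block changes its input, which is one way to see that its shortcut is not a pure identity.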

Residual init scaling

A 12-layer transformer has two residual additions per block (attention + MLP). Following the GPT-2 prescription with base std 0.02, what std should the output projections of each residual sub-block be initialized with? Answer to four decimal places is fine.

Adam's early instability, and why warmup is principled

Adam keeps a running estimate of the second moment of each parameter’s gradient (call it $\hat v_t$) and divides each step by $\sqrt{\hat v_t}$. This adaptive rescaling is most of why Adam works.

For small $t$, that estimate is built from very few gradient samples. Its variance is high. The per-parameter step sizes are unreliable. If a noisy gradient at step 5 happens to be much larger than typical, Adam interprets that as “this parameter has high gradient variance” and dampens future steps for it, even though the original signal was just noise.
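A small numerical sketch shows how much more a single outlier distorts the estimate early than late. It assumes the standard Adam recurrence with bias correction and the common default β₂ = 0.999; the gradient stream (all ones with one spike of 10) and the helper names are made up for illustration.

```python
def second_moment_history(grads, beta2=0.999):
    """Return Adam's bias-corrected second-moment estimate after each step."""
    v = 0.0
    history = []
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1.0 - beta2) * g * g     # exponential moving average of g^2
        history.append(v / (1.0 - beta2 ** t))    # bias correction for small t
    return history


def grads_with_spike(n_steps, spike_step, typical=1.0, spike=10.0):
    """Toy gradient stream: constant magnitude except one outlier step."""
    return [spike if t == spike_step else typical for t in range(1, n_steps + 1)]


early = second_moment_history(grads_with_spike(1000, spike_step=5))
late = second_moment_history(grads_with_spike(1000, spike_step=900))

# sqrt(v_hat) is what Adam divides the step by; 1.0 would be the "true" scale here.
early_divisor = early[4] ** 0.5    # right after a spike at step 5
late_divisor = late[899] ** 0.5    # right after the same spike at step 900
```

In this toy stream the early spike inflates the divisor several-fold, so the steps that follow are strongly damped, while the same spike late in training barely moves it.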

The fix is to take small steps until the statistics stabilize. Linear warmup ramps the learning rate from zero to its peak over the first $t_{\text{warmup}}$ steps:

$$\eta_t \;=\; \eta_{\max} \cdot \min\!\left(\frac{t}{t_{\text{warmup}}},\; 1\right).$$

After that the LR usually decays; cosine, linear, or 1/√t schedules are all reasonable. Warmup is not folklore; it is what you do when your second-moment estimator is still unreliable.
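One way to write the schedule described here is linear warmup followed by cosine decay. This is a sketch under assumptions: `total_steps` and `min_lr` are parameters introduced for the example, and the numbers in the usage lines are arbitrary rather than the course's settings.

```python
import math


def learning_rate(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if warmup_steps > 0 and step < warmup_steps:
        # eta_t = eta_max * t / t_warmup during the ramp
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


# Arbitrary illustration: peak 1e-3, 100 warmup steps, 1000 steps in total.
lr_start = learning_rate(0, 1e-3, 100, 1000)      # start of the ramp
lr_halfway = learning_rate(50, 1e-3, 100, 1000)   # halfway up the ramp
lr_peak = learning_rate(100, 1e-3, 100, 1000)     # end of warmup
lr_end = learning_rate(1000, 1e-3, 100, 1000)     # end of cosine decay
```

During the ramp the function matches the formula above term for term; the decay shape after the peak is a design choice, and a linear or 1/√t tail would slot into the same place.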

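[Interactive figure: training loss versus training steps for the selected warmup length and LayerNorm placement. In the captured state the run diverged: final loss ∞ (NaN), survived? no.]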

Linear warmup ramps the learning rate from 0 to its peak over the first $t_{\text{warmup}}$ steps. Post-LN is unstable early on, while Adam’s second-moment estimator hasn’t stabilized yet, and that is what diverges when the first steps are too aggressive. Warmup buys time. Pre-LN doesn’t need any: the residual stream stays bounded by construction.

Start in post-LN mode with warmup at 0. Loss diverges within twenty steps. Drag warmup up to ~30; the first step is small, the gradient amplifier has time to decay, and the run survives. Switch to pre-LN and the run is stable even with no warmup. Two routes to the same fix; pre-LN is the cleaner one, which is why the field landed there.

Where on the warmup ramp

We use 200 steps of linear warmup with peak learning rate $\eta_{\max} = 6 \times 10^{-4}$. What is the learning rate at step 50? Express as a decimal (for example, 0.00015 for $1.5 \times 10^{-4}$).

Gradient clipping: the seatbelt, not the brakes

Sometimes a gradient comes back enormous, orders of magnitude bigger than usual. A single such step can push your weights so far from a good region that loss diverges and you cannot recover. The pragmatic guardrail is gradient clipping: if the global gradient norm exceeds some threshold $\tau$, rescale the gradient down to that threshold.

$$g \;\leftarrow\; g \cdot \min\!\left(1,\; \frac{\tau}{\lVert g \rVert_2}\right).$$

A typical value is $\tau = 1.0$ for transformers. This is a symptom-fix, not a cure. Exploding gradients are usually caused by something more fundamental: bad init, a missing normalization layer, a recurrent dynamic with eigenvalues greater than 1. Clipping is the seatbelt. The seatbelt does not fix bad driving; it just keeps a single bad moment from killing you.

Use it. But if clipping fires often, do not assume it has solved the underlying problem.
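The clipping rule can be sketched in a few lines of plain Python. Here gradients are assumed to be a list of flat float lists, one per parameter tensor, and `clip_by_global_norm` is an illustrative name; returning the pre-clip norm is an addition that makes it easy to log how often clipping fires.

```python
import math
from typing import List, Tuple


def clip_by_global_norm(grads: List[List[float]],
                        max_norm: float) -> Tuple[List[List[float]], float]:
    """Rescale all gradients together so their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    # g <- g * min(1, tau / ||g||_2); guard against division by zero.
    scale = min(1.0, max_norm / total_norm) if total_norm > 0.0 else 1.0
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm


big, big_norm = clip_by_global_norm([[3.0, 4.0], [0.0]], max_norm=1.0)
small, small_norm = clip_by_global_norm([[0.3, 0.4]], max_norm=1.0)
```

The first call has norm 5 and is scaled down to norm 1 with its direction preserved; the second is already under the threshold and comes back unchanged. Tracking the returned norm over training is a cheap way to notice when the seatbelt is doing too much of the work.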

Putting it all together

The transformer block we will train later in this course looks like this, conceptually:

x = x + Attention(LayerNorm(x))
x = x + MLP(LayerNorm(x))

Pre-LN inside each branch. Identity shortcut on the outside. Output projections initialized with std = 0.02/√(2·N_layers) so the residual stream stays bounded with depth. AdamW with weight decay around 0.1, linear warmup over the first couple hundred steps, cosine decay after that. Gradient clipping at norm 1.0.
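The scaled-init rule is easy to check numerically. The sketch below assumes two residual additions per layer and, as a simplification, treats each addition as contributing independent noise with the scaled std, so that variances add; the function name and the 48-layer example are hypothetical.

```python
import math


def residual_projection_std(base_std: float, n_layers: int,
                            residuals_per_layer: int = 2) -> float:
    """std = base_std / sqrt(residuals_per_layer * n_layers)."""
    n_additions = residuals_per_layer * n_layers
    return base_std / math.sqrt(n_additions)


# Hypothetical 48-layer model with base std 0.02.
n_layers = 48
std = residual_projection_std(0.02, n_layers)

# Under the independence assumption, the 2 * n_layers contributions add in
# variance, and the total comes back to base_std ** 2 regardless of depth.
summed_variance = 2 * n_layers * std ** 2
```

Doubling the depth shrinks each projection's std by a factor of √2, which is what keeps the accumulated variance from growing with the number of layers in this simplified picture.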

Every line of that comes from this module. None of it is optional polish. Without residuals the gradient dies in the deeper layers; without LayerNorm the activations drift and saturate; without the scaled init the residual stream’s variance grows with depth; without warmup AdamW’s first steps are unreliable; without clipping a single bad batch can take you out.

You now have the toolbox. The next module starts the descent into actual sequence modeling.
