Different parameters, different gradient scales
A neural network has millions of parameters. Not all of them want the same learning rate.
Consider a word-embedding table. The row for “the” gets nudged every time “the” appears in a training example: hundreds of times per batch. Its gradient is big and frequent. The row for “gazpacho” gets nudged maybe once per epoch. Its gradient is small and rare.
A single global learning rate $\eta$ forces a compromise. Big enough for “gazpacho” to move? Too big for “the”: training diverges. Small enough for “the” to be stable? “Gazpacho” never learns.
You need a per-parameter step size that shrinks for parameters with big gradients and grows for parameters with small ones.
RMSProp: divide by the running RMS
The idea: for each parameter $\theta_i$, keep an EMA of the squared gradient component $g_i^2$. The square root of that EMA is a rough estimate of how big $|g_i|$ typically is. Divide the update by it.
Let $v$ be a vector holding the EMA of squared gradients elementwise:
$$v_t = \beta_2\, v_{t-1} + (1 - \beta_2)\, g_t^2$$
where $g_t^2$ is elementwise squaring and $\beta_2$ is a second momentum coefficient (longer memory than the $\beta$ for velocity, since squared-gradient statistics are noisy and benefit from more smoothing).
Then the update, with learning rate $\eta$, is
$$\theta_{t+1} = \theta_t - \eta\, \frac{g_t}{\sqrt{v_t} + \epsilon}$$
The $\epsilon$ (a small constant such as $10^{-8}$) avoids division by zero. The division is elementwise. Each parameter’s step is automatically scaled by its own recent gradient magnitude.
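A minimal NumPy sketch of the RMSProp step described above. The function name `rmsprop_step` and the hyperparameter values (`lr=0.01`, `beta2=0.9`, `eps=1e-8`) are illustrative assumptions, not values taken from this lesson.

```python
import numpy as np

def rmsprop_step(theta, grad, v, lr=0.01, beta2=0.9, eps=1e-8):
    """One RMSProp update; returns the new parameters and the new EMA state."""
    # EMA of the elementwise squared gradient, (1 - beta2)-weighted.
    v = beta2 * v + (1.0 - beta2) * grad**2
    # Each coordinate is divided by its own running RMS.
    theta = theta - lr * grad / (np.sqrt(v) + eps)
    return theta, v

# Two coordinates whose gradients differ by a factor of 1000
# (illustrative values only).
theta = np.zeros(2)
v = np.zeros(2)
grad = np.array([10.0, 0.01])
for _ in range(50):
    theta, v = rmsprop_step(theta, grad, v)

print(theta)  # both coordinates have moved by nearly the same amount
```

With plain gradient descent the first coordinate would move 1000 times farther than the second. Here the division by the running RMS cancels the gradient scale, so both coordinates travel almost the same distance.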
RMSProp update
With , , and gradient sequence , , compute .
(Use the $(1 - \beta_2)$-weighted convention: $v_t = \beta_2\, v_{t-1} + (1 - \beta_2)\, g_t^2$.)
Now add momentum back: that's Adam, almost
Adam combines both: momentum on the gradient itself (smoothed direction) and RMSProp (adaptive per-parameter scale).
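$$m_t = \beta_1\, m_{t-1} + (1 - \beta_1)\, g_t$$
$$v_t = \beta_2\, v_{t-1} + (1 - \beta_2)\, g_t^2$$
$m$ is the momentum (EMA of gradients). $v$ is the RMSProp term (EMA of squared gradients). The tentative update, with learning rate $\eta$:
$$\theta_{t+1} = \theta_t - \eta\, \frac{m_t}{\sqrt{v_t} + \epsilon}$$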
This is almost Adam. There’s a subtle problem at the start of training.
The cold-start problem, made visible.
$m$ and $v$ both start at zero. At step 1, $m_1 = (1 - \beta_1)\, g_1 = 0.1\, g_1$ for $\beta_1 = 0.9$. That’s ten times too small. The momentum estimate is biased toward zero for the first few steps: the EMA hasn’t had time to accumulate.
Same for $v$, which is worse: $\beta_2 = 0.999$, so $v$ is biased for thousands of steps.
Below: scrub the step counter $t$. The first two bars are the raw $m_t$ and $v_t$: they crawl up from zero. The third and fourth are the bias-corrected versions $\hat{m}_t$ and $\hat{v}_t$: 1 from step 1, exactly. The whole job of bias correction is to make those bars unit-scale immediately.
Watch the raw $v_t$ in particular when $\beta_2 = 0.999$. It takes hundreds of steps just to look reasonable. Without correction, Adam’s adaptive denominator would be wildly wrong for the entire start of training.
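The cold start can also be reproduced numerically. The sketch below feeds a constant signal of 1 into a zero-initialized EMA and records the raw, uncorrected value at each step. The helper name `raw_ema_trace` and the constant-signal setup are assumptions made for illustration.

```python
def raw_ema_trace(beta, steps, signal=1.0):
    """Uncorrected EMA of a constant signal, starting from zero."""
    trace = []
    ema = 0.0
    for _ in range(steps):
        ema = beta * ema + (1.0 - beta) * signal
        trace.append(ema)
    return trace

raw_m = raw_ema_trace(beta=0.9, steps=1000)    # momentum-style EMA
raw_v = raw_ema_trace(beta=0.999, steps=1000)  # squared-gradient-style EMA

for t in (1, 10, 100, 1000):
    print(f"t={t:4d}  raw m={raw_m[t - 1]:.4f}  raw v={raw_v[t - 1]:.4f}")
```

The true value is 1 throughout. The decay-0.9 EMA gets close within a few dozen steps, while the decay-0.999 EMA is still below a tenth of the true value after 100 steps.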
Bias correction: the fix
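For an EMA $m_t$ with decay $\beta$ and zero init, the bias-corrected estimate is
$$\hat{m}_t = \frac{m_t}{1 - \beta^t}$$
At step 1, $1 - \beta^1 = 1 - \beta$, so $\hat{m}_1 = \frac{(1 - \beta)\, g_1}{1 - \beta} = g_1$. The corrected estimate is the true observation.
At step 100 with $\beta = 0.9$, $1 - \beta^{100}$ is essentially 1. The correction becomes a no-op. As the EMA “warms up,” the correction fades.
$v$ with $\beta_2 = 0.999$ takes much longer to warm up (thousands of steps), so bias correction matters there for much longer.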
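A small sketch of the correction and of how long it stays relevant. The helper names `bias_corrected` and `warmup_steps`, and the 1% tolerance, are hypothetical choices for illustration.

```python
import math

def bias_corrected(ema, beta, t):
    """Divide a zero-initialized EMA by (1 - beta**t) to undo its startup bias."""
    return ema / (1.0 - beta**t)

def warmup_steps(beta, tol=0.01):
    """Smallest step t at which beta**t <= tol, i.e. the correction
    factor (1 - beta**t) is within tol of 1."""
    return math.ceil(math.log(tol) / math.log(beta))

# A constant signal g: the raw EMA starts far too small,
# the corrected one recovers g from the first step.
g, beta, ema = 3.7, 0.999, 0.0
corrected = []
for t in range(1, 51):
    ema = beta * ema + (1.0 - beta) * g
    corrected.append(bias_corrected(ema, beta, t))

print(corrected[0], corrected[-1])
print(warmup_steps(0.9), warmup_steps(0.999))
```

Under this 1% criterion the decay-0.9 EMA is warmed up after a few dozen steps, while the decay-0.999 EMA needs several thousand.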
Bias-corrected first step
With , , and , compute the bias-corrected .
Adam: the whole thing
Bolt it all together. At each step :
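- Compute the gradient $g_t = \nabla_\theta L(\theta_t)$.
- Update momentum: $m_t = \beta_1\, m_{t-1} + (1 - \beta_1)\, g_t$.
- Update variance: $v_t = \beta_2\, v_{t-1} + (1 - \beta_2)\, g_t^2$.
- Bias-correct: $\hat{m}_t = m_t / (1 - \beta_1^t)$, $\hat{v}_t = v_t / (1 - \beta_2^t)$.
- Update weights: $\theta_{t+1} = \theta_t - \eta\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$.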
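Default hyperparameters: $\eta = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$. These work out of the box on an astonishing range of problems. The reason: dividing by $\sqrt{\hat{v}_t}$, with bias correction, makes each coordinate’s effective step about $\eta$ in magnitude early in training, regardless of gradient magnitude. Adam has automatic unit-step behavior.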
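The five steps above can be sketched in NumPy as follows. The function name `adam_step` and the example gradient values are assumptions for illustration. The defaults are the standard Adam ones, with `lr` standing for the learning rate.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1). Returns (theta, m, v)."""
    m = beta1 * m + (1.0 - beta1) * grad       # EMA of gradients
    v = beta2 * v + (1.0 - beta2) * grad**2    # EMA of squared gradients
    m_hat = m / (1.0 - beta1**t)               # bias-corrected momentum
    v_hat = v / (1.0 - beta2**t)               # bias-corrected variance
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Gradients spanning six orders of magnitude (illustrative values).
theta0 = np.zeros(3)
grad = np.array([1e3, 1.0, -1e-3])
theta1, m1, v1 = adam_step(theta0, grad, np.zeros(3), np.zeros(3), t=1)
first_step = theta1 - theta0

print(first_step)  # each entry has magnitude close to lr, sign opposite to grad
```

At step 1 the bias-corrected ratio reduces to the gradient divided by its own absolute value, so every coordinate moves by roughly the learning rate, whatever its gradient scale.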
Adam isn't always best
You’ll see “just use Adam” as default advice. That’s mostly right. But it’s worth knowing when it’s wrong.
On certain convex tasks and some computer-vision architectures, plain SGD with momentum consistently generalizes better than Adam, even when Adam’s training loss is lower. Wilson et al. 2017 gave a famous linearly-separable classification example where Adam converges to a solution with ~50% test error while SGD achieves zero.
The hand-wave: Adam’s adaptive per-parameter scaling changes the geometry and can converge to sharper minima that generalize worse. For transformers (the focus of this course) Adam (or its sibling AdamW) is essentially always the right choice. For vision ResNets, SGD+momentum often wins.
Default to Adam. Know that “default” isn’t “always.”
One more twist: AdamW
Adam with weight decay (the regularizer that shrinks weights toward zero) has a subtle bug. If you add the decay to the gradient (as $g_t + \lambda\, \theta_t$, with decay coefficient $\lambda$), then Adam’s adaptive denominator rescales it, and the effective regularization becomes per-parameter: large-gradient parameters get less regularization, small-gradient parameters get more. Almost always the opposite of what you want.
AdamW (Loshchilov & Hutter, 2017) fixes this by applying the decay outside the adaptive step:
$$\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\, \theta_t \right)$$
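A hedged NumPy sketch of the decoupled update, to make the difference concrete. The name `adamw_step`, the `weight_decay` value, and the example numbers are assumptions. Library implementations may arrange the decay term slightly differently (for instance, shrinking the weights before the adaptive step).

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.1):
    """One AdamW update at step t (t starts at 1). Returns (theta, m, v)."""
    # The moment estimates see only the loss gradient, never the decay term.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad**2
    m_hat = m / (1.0 - beta1**t)
    v_hat = v / (1.0 - beta2**t)
    # The decay is added outside the adaptive ratio, so it is not rescaled.
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v

# Same weight value, very different gradient scales (illustrative values).
theta = np.array([1.0, 1.0])
grad = np.array([100.0, 0.01])
zeros = np.zeros(2)

with_decay, _, _ = adamw_step(theta, grad, zeros, zeros, t=1, weight_decay=0.1)
no_decay, _, _ = adamw_step(theta, grad, zeros, zeros, t=1, weight_decay=0.0)
shrink = no_decay - with_decay  # the part of the step caused by decay alone

print(shrink)
```

Both coordinates shrink by the same amount, the learning rate times the decay coefficient times the weight, even though their gradients differ by a factor of 10,000. That independence from gradient scale is what "decoupled" means here.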
This is the optimizer nearly every transformer in production uses. nanoGPT uses it. GPT-3 used it. Llama uses it. The W earns its letter.