Schedules, Pathologies, and What nanoGPT Actually Does

Why a schedule at all?

Adam with the default $\eta = 10^{-3}$ often works out of the box on small problems, but for large models the learning rate isn’t one number: it’s a function of time.

Two things go wrong with a constant $\eta$:

At the start, the model is freshly initialized and Adam’s moment estimates are still settling. Large steps in this regime can blow up the run.
At the end, you want to refine: smaller steps to zero in on a good minimum. A large $\eta$ keeps the iterate bouncing around in the noise ball.

The fix: start small (warm up), peak, and decay. That’s a learning-rate schedule.

Design one yourself.

Three knobs decide a schedule:

Warmup endpoint $T_w$: how long to ramp up.
Peak $\eta_{\max}$: the highest LR you ever use.
Decay shape: cosine (modern), linear, step (classic).

Compare the three shapes back-to-back. Cosine lingers near the peak early in the decay (its slope starts at zero), then tails off smoothly toward the end. Linear decays at a constant rate. Step drops once near $0.6T$: the old-school approach. Cosine is the usual default for transformers.
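To put numbers on the comparison, here is a minimal sketch of the three decay shapes as a fraction of peak LR. The helper name `decay_factor`, the choice to decay toward zero, and the step level of 0.1 are illustrative assumptions; the drop point 0.6 mirrors the text's $0.6T$, measured here as progress through the decay phase for simplicity.

```python
import math

def decay_factor(progress: float, shape: str) -> float:
    """Fraction of peak LR at `progress` in [0, 1] through the decay phase.

    Hypothetical helper: cosine and linear decay from 1.0 to 0.0; step
    drops once to 0.1 at progress 0.6 (the 0.1 level is an assumption).
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    if shape == "cosine":
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    if shape == "linear":
        return 1.0 - progress
    if shape == "step":
        return 1.0 if progress < 0.6 else 0.1
    raise ValueError(f"unknown shape: {shape}")

# Tabulate the three shapes at a few points along the decay.
for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    row = {s: round(decay_factor(p, s), 3) for s in ("cosine", "linear", "step")}
    print(p, row)
```

A quarter of the way through the decay, cosine is still at about 85% of peak while linear is at 75%; at three quarters, cosine is at about 15% against linear's 25%.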

Linear warmup, in formulas

The simplest start: linearly ramp the learning rate from near-zero up to its peak over the first $T_w$ steps.

$$\eta_t^{\text{warm}} \;=\; \eta_{\max} \cdot \frac{t}{T_w}, \qquad t \le T_w.$$

At step 0 the LR is zero (or near zero, if the implementation offsets $t$ by one). At step $T_w$ (typically 1000–10,000 for a large model), it reaches the peak. Warmup prevents early blowups, especially with Adam, whose second-moment estimates are built from only a handful of gradients in the first steps and can be erratic.
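The warmup formula translates almost directly into code. This is a minimal sketch, not any particular library's implementation; the function name `warmup_lr` and the example values are assumptions chosen for illustration.

```python
def warmup_lr(t: int, eta_max: float, warmup_steps: int) -> float:
    """Linear warmup: eta_max * t / warmup_steps for 0 <= t <= warmup_steps."""
    if warmup_steps <= 0:
        raise ValueError("warmup_steps must be positive")
    if not 0 <= t <= warmup_steps:
        raise ValueError("t must lie in [0, warmup_steps]")
    return eta_max * t / warmup_steps

# Illustrative values (assumptions, not taken from a specific run).
for t in (0, 250, 1000, 2000):
    print(t, warmup_lr(t, eta_max=6e-4, warmup_steps=2000))
```

The LR grows in proportion to `t`, so halfway through warmup it sits at half the peak.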

Mid-warmup LR

With $T_w = 1000$ and $\eta_{\max} = 3 \times 10^{-4}$, what is $\eta_t$ at $t = 500$?

Cosine decay: the post-warmup workhorse

After warmup, decay smoothly from $\eta_{\max}$ to a small $\eta_{\min}$ following a cosine curve, where $T$ is the total number of training steps:

$$\eta_t^{\text{cos}} \;=\; \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min}) \!\left(1 + \cos\!\left(\pi \cdot \tfrac{t - T_w}{T - T_w}\right)\right), \qquad T_w \le t \le T.$$

Why cosine? Three reasons:

It’s parameter-free beyond the peak and the horizon: no fiddly drop points.
It holds the LR near the peak early in the decay (the cosine is flat at the top), so the high-LR exploration phase is not cut short.
It tails off smoothly, avoiding the abrupt jolts of step decay.

nanoGPT, GPT-3, and Llama all use warmup followed by cosine decay. It’s become the reference schedule for transformer training.
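Putting the warmup and cosine formulas together gives a single function of the step. The sketch below follows those two formulas; the name `lr_at`, the behavior past the horizon (hold at the floor), and the example values for `eta_min` and `total_steps` are assumptions made for illustration.

```python
import math

def lr_at(t: int, eta_max: float, eta_min: float,
          warmup_steps: int, total_steps: int) -> float:
    """Linear warmup to eta_max, then cosine decay to eta_min at total_steps."""
    if not 0 < warmup_steps < total_steps:
        raise ValueError("need 0 < warmup_steps < total_steps")
    if t < warmup_steps:
        # Linear ramp from 0 up to eta_max.
        return eta_max * t / warmup_steps
    if t >= total_steps:
        # Past the horizon: hold at the floor (an assumption; the text is silent here).
        return eta_min
    progress = (t - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1 at the start, 0 at the end
    return eta_min + (eta_max - eta_min) * cosine

# Illustrative configuration (values are assumptions for the example).
cfg = dict(eta_max=6e-4, eta_min=6e-5, warmup_steps=2000, total_steps=600_000)
for t in (0, 1000, 2000, 301_000, 600_000):
    print(t, lr_at(t, **cfg))
```

One common way to apply such a function in PyTorch is to compute the value each step and write it into the `lr` entry of every group in `optimizer.param_groups` before calling `optimizer.step()`.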

Gradient clipping: the safety net for cliffs

Real loss landscapes have cliffs: regions where the gradient magnitude jumps by orders of magnitude over a short distance. Common in recurrent nets, occasional in transformers. One step through a cliff can throw the optimizer into the void.

The fix, applied after computing the gradient but before the update: project the gradient onto a ball of radius $c$. If $\|g\| > c$, rescale to length $c$:

$$g \leftarrow g \cdot \min\!\left(1, \;\frac{c}{\|g\|}\right).$$

nanoGPT uses $c = 1$. In a healthy run it rarely changes the step much once training has settled, but when a freak batch produces a gradient 100× the typical magnitude, clipping saves the run.
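The clipping rule is short enough to write out by hand. This is a minimal sketch on a plain list of floats, with the hypothetical name `clip_by_norm`; in PyTorch, `torch.nn.utils.clip_grad_norm_` applies the same rescaling using the norm taken over all parameter gradients together.

```python
import math

def clip_by_norm(grad: list[float], c: float) -> list[float]:
    """Rescale `grad` so its Euclidean norm is at most `c`."""
    if c <= 0:
        raise ValueError("c must be positive")
    norm = math.sqrt(sum(g * g for g in grad))
    # min(1, c / norm) leaves small gradients alone; guard the zero vector.
    scale = min(1.0, c / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

print(clip_by_norm([0.3, 0.4], c=1.0))    # norm 0.5: returned unchanged
print(clip_by_norm([30.0, 40.0], c=1.0))  # norm 50: rescaled to norm 1
```

Clipping changes the length of the gradient but not its direction, which is why it is safe to apply on every step.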

Saddles dominate high dimensions

Time to earn the slogan from Lesson 6.3.

In 1D, a random critical point is a min or a max with 50/50 odds. In 2D, if the signs of the two Hessian eigenvalues behave like independent coin flips, the probability that both are positive (so it’s a local min) is $1/4$. Generalize: in $d$ dimensions, the same heuristic gives a probability of roughly $2^{-d}$ that a random critical point is a local minimum (all $d$ eigenvalues positive).

At $d = 1000$ (a small neural network), this is astronomically small. The critical points you encounter in high-dimensional loss landscapes are overwhelmingly saddles, not minima.
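To see how small "astronomically small" is, the sketch below evaluates the $2^{-d}$ heuristic in log space. It assumes the independent-sign model behind that estimate; the function name is hypothetical.

```python
import math

def log10_prob_all_positive(d: int) -> float:
    """log10 of 2**(-d): chance that d independent fair sign flips are all positive."""
    return -d * math.log10(2.0)

for d in (1, 2, 10, 100, 1000):
    print(d, log10_prob_all_positive(d))
```

For $d = 1000$ the result is about $-301$, that is, a probability on the order of $10^{-301}$.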

So when training “stalls” (loss barely dropping, gradient norm small), the prior should be: we’re passing through a saddle, not sitting at a minimum. Momentum and Adam’s adaptive scaling both help escape saddles by keeping the update non-zero even when the raw gradient is small. Vanilla SGD can get stuck near saddles for a long time. Dauphin et al. (2014) is the canonical reference.

The overparameterized regime

One more fact to rewire before we close out.

Classical optimization frames training as “find the global minimum of $L(\mathbf{w})$.” For deep networks, this picture is misleading. Many modern networks have more parameters than training examples. In that regime the loss surface has infinitely many zero-loss minima: entire manifolds of parameter settings that fit the training data perfectly.

So “reaching the global minimum” isn’t the goal. Which zero-loss minimum you reach determines how the model generalizes. Flat, wide minima tend to generalize well; sharp, narrow minima tend to generalize badly. SGD’s noise (Lesson 10.2) biases convergence toward flat minima, which is thought to be a big part of why SGD-trained models generalize at all.

Practical implication: the validation metric, not the training loss, is the stopping criterion. Stop when held-out performance plateaus, even if training loss is still dropping. That is the standard early-stopping rule.
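"Stop when held-out performance plateaus" needs a concrete definition of a plateau. The sketch below uses one common convention, a patience counter; the function name and the `patience` and `min_delta` parameters are assumptions, not something the text specifies.

```python
def early_stop_step(val_losses: list[float], patience: int = 3,
                    min_delta: float = 0.0) -> int:
    """Index of the evaluation at which training would stop.

    Stops once `patience` consecutive evaluations fail to beat the best
    validation loss by more than `min_delta`. If that never happens,
    returns the index of the last evaluation.
    """
    if not val_losses:
        raise ValueError("val_losses must be non-empty")
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i
    return len(val_losses) - 1

# Made-up validation curve: improves, then plateaus from index 3 onward.
history = [2.0, 1.5, 1.2, 1.1, 1.12, 1.11, 1.13, 1.2]
print(early_stop_step(history, patience=3))
```

In practice you would also keep the checkpoint from the best evaluation, since the stopping point is by construction a few evaluations past it.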

Weight-decay contribution under AdamW

In AdamW, with $\eta = 10^{-3}$, $\lambda = 0.1$, and $\mathbf{w} = 1.0$, what is the weight-decay contribution to $\Delta \mathbf{w}$? (Ignore the gradient term.)
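Because AdamW decouples weight decay from the adaptive step, the decay part of the update is simply $-\eta \lambda \mathbf{w}$. The sketch below works the question's numbers; the function name is hypothetical and the gradient term is deliberately left out.

```python
def adamw_decay_delta(w: float, lr: float, weight_decay: float) -> float:
    """Decoupled weight-decay part of one AdamW step: -lr * weight_decay * w."""
    return -lr * weight_decay * w

delta = adamw_decay_delta(w=1.0, lr=1e-3, weight_decay=0.1)
print(delta)  # about -1e-4
```

Each step shrinks the weight by a factor of $1 - \eta\lambda$, independent of the gradient statistics Adam tracks.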

Read nanoGPT's configure_optimizers

Everything in this module is now visible inside one production training loop. Karpathy’s configure_optimizers in nanoGPT does essentially this, annotated:

import torch

# Assumes `model` is an nn.Module (for example nanoGPT's GPT) defined elsewhere.
# Collect (name, parameter) pairs for everything that will be trained.
params = [(n, p) for n, p in model.named_parameters() if p.requires_grad]

# All parameters split into two groups: decay and no-decay.
decay_params = [p for n, p in params if p.dim() >= 2]       # matmul weights, embeddings
no_decay_params = [p for n, p in params if p.dim() < 2]     # biases, LayerNorm

# AdamW: weight decay is decoupled from the adaptive step.
optimizer = torch.optim.AdamW(
    [
        {'params': decay_params, 'weight_decay': 0.1},        # regularize 2D+ tensors
        {'params': no_decay_params, 'weight_decay': 0.0},     # don't regularize biases or LayerNorm
    ],
    lr=6e-4,                                                    # peak LR for GPT-2-small
    betas=(0.9, 0.95),                                          # β₁ and β₂
    eps=1e-8,                                                   # for numerical stability
)

# Schedule: warmup for 2000 steps, then cosine decay.
# In the training loop, after loss.backward() and before optimizer.step():
#   torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # gradient clipping

Every knob here is something you now understand. AdamW because of decoupled decay. betas=(0.9, 0.95): $\beta_2 = 0.95$ instead of the default $0.999$ because transformer gradients are less stationary than in generic problems, so a shorter memory for squared gradients tracks them better. Weight decay $0.1$ only on tensors with two or more dimensions (matmul weights and embeddings). Peak LR $6 \times 10^{-4}$ with warmup + cosine. Gradient clip at 1.0.
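The "shorter memory" behind the $\beta_2$ choice can be quantified with the usual rule of thumb that an exponential moving average with decay $\beta$ averages over roughly $1/(1-\beta)$ recent steps. A minimal sketch, with a hypothetical helper name:

```python
def ema_horizon(beta: float) -> float:
    """Rough averaging window, in steps, of an EMA with decay `beta`: 1 / (1 - beta)."""
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must lie in [0, 1)")
    return 1.0 / (1.0 - beta)

for beta in (0.9, 0.95, 0.999):
    print(beta, ema_horizon(beta))
```

With $\beta_2 = 0.95$ the squared-gradient estimate reflects roughly the last 20 steps; with the default $0.999$ it reflects roughly the last 1000.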

Everything has a reason, and you know every reason.

What we didn't cover, and why not

A few things we skipped on purpose:

Second-order methods (Newton, quasi-Newton, L-BFGS). Beautiful convergence, but Newton’s method needs the Hessian, which is too big to store for neural nets, and quasi-Newton approximations such as L-BFGS are hard to make work with noisy minibatch gradients. Rarely used at scale.
Other adaptive methods (AdaBelief, LAMB, Sophia). Active research. Sometimes win on specific tasks; not universally better.
Sharpness-aware minimization (SAM). Actively seeks flat minima. Real effect, real cost: two forward/backward passes per step.

You now have the core vocabulary to read any optimizer paper. The rest is empirical.