What does it mean to minimize a function?

Pick a starting point. Watch SGD struggle.

The navigator below has presets; start with the narrow ravine. Drag the start point off-axis, set $\eta$ around $0.04$ , and run plain SGD (no momentum). Watch the trajectory zig-zag.

Then crank $\eta$ higher. It gets worse, not better: the iterates start oscillating across the ravine. Try $\eta$ above the divergence threshold and the trajectory shoots off the edge.

This is the ravine pathology, the defining failure mode of vanilla gradient descent. The rest of the module is the toolkit for fixing it. But first, the basics.

The gradient-descent update rule

From Module 6: the gradient $\nabla L(\mathbf{w})$ points in the direction of steepest ascent. To descend, walk against it:

\mathbf{w}_{t+1} \;=\; \mathbf{w}_t \;-\; \eta\, \nabla L(\mathbf{w}_t).

Read it aloud: new $\mathbf{w}$ equals old $\mathbf{w}$ minus a step-size $\eta$ times the gradient. The scalar $\eta$ (eta) is called the learning rate. Everything interesting about optimization is buried in how you pick that one number.

One step by hand

For $L(w) = w^2 + 3$ with $w_0 = 2$ and $\eta = 0.25$ , compute $w_1$ .

Why against the gradient, not along it

If you walk along the gradient, you’re walking toward the direction of fastest increase (the worst direction). If you walk against it, you’re walking toward the direction of fastest decrease. The sign in the update rule is load-bearing. Get it wrong and the loss explodes.

A common confusion: “gradient descent moves toward the minimum.” It doesn’t. It moves against the gradient, which in curved landscapes is almost never the same thing as “toward the minimum.” In a narrow valley that direction is roughly perpendicular to the valley’s axis. You end up zig-zagging, as you saw above. We’ll fix that with momentum in Lesson 10.3.

The three regimes of step size

On the 1D parabola $f(x) = x^2$ , GD is $x \leftarrow x - 2\eta x = (1 - 2\eta) x$ . Three behaviors, depending on $\eta$ :

$\eta < 0.5$ : monotone decay. Small, steady steps. Slow but safe.
$\eta = 0.5$ : one-shot convergence to zero. Best possible for this problem.
$0.5 < \eta < 1$ : oscillation. The iterates bounce across the minimum but still shrink in magnitude.
$\eta \ge 1$ : divergence. Iterates blow up.

The general rule: the biggest stable learning rate is $2/L$ , where $L$ is the largest eigenvalue of the Hessian (which, for our parabola, equals $2$ ). Above $2/L$ , GD diverges. Below it but close, you oscillate. Well below, you crawl.

Threshold of divergence

For $f(x) = x^2$ , what is the largest learning rate above which GD diverges?

Another curvature

For $f(x) = 5 x^2$ , what is the divergence threshold $\eta_{\max} = 2/L$ ?

(Hint: for $f = c x^2$ , the second derivative is $2c$ , so $L = 2c$ .)

2D: when the curvature isn't the same in every direction

Make the function $f(x, y) = \tfrac{1}{2}x^2 + 50 y^2$ . The Hessian has eigenvalues $1$ and $100$ . A learning rate safe for $y$ (below $2/100 = 0.02$ ) is painfully slow for $x$ (which would like $\eta$ near $2$ ). Along the $y$ -axis GD bounces; along the $x$ -axis it crawls. The trajectory becomes a tight zig-zag hugging the long axis of the valley.

This is the ravine problem you saw in step 1. Nearly every real loss landscape has one axis that is 100× or 10,000× steeper than another. Vanilla GD cannot handle that ratio well.

Switch the navigator above to SGD + momentum and run the same starting point. Watch it rocket down the ravine. Momentum is the fix for this exact failure mode.

Condition number

A 2D quadratic’s Hessian has eigenvalues $\lambda_{\min} = 1$ and $\lambda_{\max} = 100$ . Compute the condition number $\kappa = \lambda_{\max} / \lambda_{\min}$ .

(The bigger this number, the more GD struggles. A transformer’s loss-landscape condition number is typically astronomical.)

The whole module, in one sentence

Gradient descent has two problems:

Vanilla GD is bad at anisotropic landscapes. Fix: momentum (Lesson 10.3), adaptive per-parameter step sizes (Lesson 10.4).
You can’t afford to compute the full gradient. Fix: stochastic/mini-batch gradient estimation (Lesson 10.2).

Everything else in this module (Adam, AdamW, cosine schedules, warmup, gradient clipping) is engineering on top of those two fixes. By the end you’ll read nanoGPT’s configure_optimizers and understand every line.