Optimization · 16 min

What does it mean to minimize a function?

Gradients point uphill. Walk against them and you go downhill. The only subtle part is how far to step, and it turns out that one dial has a lot of opinions.

0 / 0

Pick a starting point. Watch SGD struggle.

The navigator below has presets; start with the narrow ravine. Drag the start point off-axis, set η\eta around 0.040.04, and run plain SGD (no momentum). Watch the trajectory zig-zag.

narrow ravine · f = 0.5x² + 10y²

-3-2-1123-2-112
step
0
x, y
(+1.80, +0.50)
L(x, y)
+4.120
‖∇L‖
+10.161

Then crank η\eta higher. It gets worse, not better: the iterates start oscillating across the ravine. Try η\eta above the divergence threshold and the trajectory shoots off the edge.

This is the ravine pathology, the defining failure mode of vanilla gradient descent. The rest of the module is the toolkit for fixing it. But first, the basics.

The gradient-descent update rule

From Module 6: the gradient L(w)\nabla L(\mathbf{w}) points in the direction of steepest ascent. To descend, walk against it:

wt+1  =  wt    ηL(wt).\mathbf{w}_{t+1} \;=\; \mathbf{w}_t \;-\; \eta\, \nabla L(\mathbf{w}_t).

Read it aloud: new w\mathbf{w} equals old w\mathbf{w} minus a step-size η\eta times the gradient. The scalar η\eta (eta) is called the learning rate. Everything interesting about optimization is buried in how you pick that one number.

One step by hand

For L(w)=w2+3L(w) = w^2 + 3 with w0=2w_0 = 2 and η=0.25\eta = 0.25, compute w1w_1.

Why against the gradient, not along it

If you walk along the gradient, you’re walking toward the direction of fastest increase (the worst direction). If you walk against it, you’re walking toward the direction of fastest decrease. The sign in the update rule is load-bearing. Get it wrong and the loss explodes.

A common confusion: “gradient descent moves toward the minimum.” It doesn’t. It moves against the gradient, which in curved landscapes is almost never the same thing as “toward the minimum.” In a narrow valley that direction is roughly perpendicular to the valley’s axis. You end up zig-zagging, as you saw above. We’ll fix that with momentum in Lesson 10.3.

The three regimes of step size

On the 1D parabola f(x)=x2f(x) = x^2, GD is xx2ηx=(12η)xx \leftarrow x - 2\eta x = (1 - 2\eta) x. Three behaviors, depending on η\eta:

  • η<0.5\eta < 0.5: monotone decay. Small, steady steps. Slow but safe.
  • η=0.5\eta = 0.5: one-shot convergence to zero. Best possible for this problem.
  • 0.5<η<10.5 < \eta < 1: oscillation. The iterates bounce across the minimum but still shrink in magnitude.
  • η1\eta \ge 1: divergence. Iterates blow up.

The general rule: the biggest stable learning rate is 2/L2/L, where LL is the largest eigenvalue of the Hessian (which, for our parabola, equals 22). Above 2/L2/L, GD diverges. Below it but close, you oscillate. Well below, you crawl.

Threshold of divergence

For f(x)=x2f(x) = x^2, what is the largest learning rate above which GD diverges?

Another curvature

For f(x)=5x2f(x) = 5 x^2, what is the divergence threshold ηmax=2/L\eta_{\max} = 2/L?

(Hint: for f=cx2f = c x^2, the second derivative is 2c2c, so L=2cL = 2c.)

2D: when the curvature isn't the same in every direction

Make the function f(x,y)=12x2+50y2f(x, y) = \tfrac{1}{2}x^2 + 50 y^2. The Hessian has eigenvalues 11 and 100100. A learning rate safe for yy (below 2/100=0.022/100 = 0.02) is painfully slow for xx (which would like η\eta near 22). Along the yy-axis GD bounces; along the xx-axis it crawls. The trajectory becomes a tight zig-zag hugging the long axis of the valley.

This is the ravine problem you saw in step 1. Nearly every real loss landscape has one axis that is 100× or 10,000× steeper than another. Vanilla GD cannot handle that ratio well.

Switch the navigator above to SGD + momentum and run the same starting point. Watch it rocket down the ravine. Momentum is the fix for this exact failure mode.

Condition number

A 2D quadratic’s Hessian has eigenvalues λmin=1\lambda_{\min} = 1 and λmax=100\lambda_{\max} = 100. Compute the condition number κ=λmax/λmin\kappa = \lambda_{\max} / \lambda_{\min}.

(The bigger this number, the more GD struggles. A transformer’s loss-landscape condition number is typically astronomical.)

The whole module, in one sentence

Gradient descent has two problems:

  1. Vanilla GD is bad at anisotropic landscapes. Fix: momentum (Lesson 10.3), adaptive per-parameter step sizes (Lesson 10.4).
  2. You can’t afford to compute the full gradient. Fix: stochastic/mini-batch gradient estimation (Lesson 10.2).

Everything else in this module (Adam, AdamW, cosine schedules, warmup, gradient clipping) is engineering on top of those two fixes. By the end you’ll read nanoGPT’s configure_optimizers and understand every line.

Lesson complete

Nice tinkering.