Optimization · 18 min

Momentum: Gradient Descent with a Memory

Teach gradient descent to remember its past moves. Persistent directions amplify; oscillating ones cancel. Ravines stop being slow.

0 / 0

The problem we're trying to fix

Module 10 Lesson 1 showed you the ravine problem. In a loss landscape like f(x,y)=12x2+50y2f(x, y) = \tfrac{1}{2} x^2 + 50 y^2, vanilla SGD zig-zags across the yy-axis (bouncing, because curvature is high there) while crawling along the xx-axis (slow, because curvature is low). It spends almost all its energy on the axis that doesn’t matter.

The fix is so simple it’s almost cheating. Give gradient descent a short-term memory. The gradient at step tt is a fresh observation; blend it with the running average of past gradients and use the blend as your step direction.

Exponential moving average: a 60-second refresher

An exponential moving average (EMA) of a scalar sequence x1,x2,x_1, x_2, \ldots is

yˉt  =  βyˉt1  +  (1β)xt.\bar{y}_t \;=\; \beta \, \bar{y}_{t-1} \;+\; (1 - \beta)\, x_t.

At each step, take β\beta (typically 0.9 or 0.99) fraction of the previous average and add the new observation weighted by 1β1 - \beta. Older observations decay exponentially: the weight of xtkx_{t-k} in yˉt\bar{y}_t is (1β)βk(1-\beta)\beta^k.

The effective window (how far back the average “remembers”) is roughly 1/(1β)1/(1-\beta). β=0.9\beta = 0.9 remembers about the last 10 observations; β=0.99\beta = 0.99 about the last 100; β=0.999\beta = 0.999 about the last 1000.

EMA by hand

For an EMA with β=0.99\beta = 0.99, what is the approximate effective window?

The momentum update

Now apply the EMA idea to gradients. Keep a velocity vector v\mathbf{v} that is an EMA of past gradients:

vt  =  βvt1  +  L(wt).\mathbf{v}_t \;=\; \beta\, \mathbf{v}_{t-1} \;+\; \nabla L(\mathbf{w}_t).

(Note: this form doesn’t include the (1β)(1-\beta) factor. It’s the convention used in most ML libraries for momentum, and it means the effective step size becomes η1/(1β)\eta \cdot 1/(1-\beta) along persistent directions, amplifying them. Sue the conventions, they all work.)

Then update weights using the velocity instead of the raw gradient:

wt+1  =  wt    ηvt.\mathbf{w}_{t+1} \;=\; \mathbf{w}_t \;-\; \eta\, \mathbf{v}_t.

What this does: along the ravine’s steep, oscillating axis, the gradient flips sign every step. Successive v\mathbf{v} updates partially cancel, and the velocity along that axis stays small. Along the ravine’s shallow, persistent axis, the gradient has the same sign every step. Successive updates add, and the velocity compounds.

Result: momentum naturally damps oscillations and accelerates consistent motion. It is the exact right tool for ravines.

Momentum by hand

With β=0.9\beta = 0.9, v0=0\mathbf{v}_0 = 0, and gradient history g1=g2=g3=1g_1 = g_2 = g_3 = 1 along one axis, compute v3\mathbf{v}_3 using vt=βvt1+gt\mathbf{v}_t = \beta \mathbf{v}_{t-1} + g_t.

Race it yourself

Pull up the navigator. Pick the narrow ravine. Drag the start point off-axis, somewhere like (1.8,0.5)(1.8, 0.5). Set η=0.02\eta = 0.02. Run SGD. Watch it zig-zag.

Now switch to SGD + momentum with β=0.9\beta = 0.9. Hit Reset, then Run. Watch it glide.

narrow ravine · f = 0.5x² + 10y²

-3-2-1123-2-112
step
0
x, y
(+1.80, +0.50)
L(x, y)
+4.120
‖∇L‖
+10.161

You’ll see: the momentum trajectory overshoots a bit at first (it accumulates too much velocity in the wrong direction), then settles into a fast, smooth descent. That’s the behavior you want.

Try β=0.5\beta = 0.5 (short memory, mild damping) vs β=0.95\beta = 0.95 (long memory, aggressive acceleration) and feel the difference. There’s no single right β\beta; you tune it per problem.

What β actually controls

Pedagogical warning: β=0.9\beta = 0.9 does NOT mean “90% of the update is the previous update.” It means “90% of the velocity is the previous velocity, plus the new gradient.” The recursion is on velocity, not on position.

One consequence: along a direction where the gradient is constant, the velocity approaches g/(1β)g / (1 - \beta) in the limit. So the effective step along that direction is ηg/(1β)\eta \cdot g / (1 - \beta), which for β=0.9\beta = 0.9 is 10ηg10 \eta g, ten times the vanilla SGD step. That’s where the “acceleration” comes from.

Along a direction where the gradient flips sign every step, the velocity is dominated by cancellation. It stays small. That’s where the “damping” comes from.

The formal convergence story

For a quadratic loss with Hessian eigenvalues in [\lambda_\min, \lambda_\max]:

  • Vanilla GD converges for \eta < 2 / \lambda_\max. Effective rate along the slowest direction: 1 - \eta \lambda_\min. For condition number \kappa = \lambda_\max / \lambda_\min = 100, you crawl.
  • Momentum converges for \eta < (2 + 2\beta) / \lambda_\max. It raises the ceiling on η\eta by a factor of 20\approx 20 for β=0.9\beta = 0.9. And it changes the effective rate: the slowest direction’s contraction improves from 1 - \eta \lambda_\min to roughly 1 - \sqrt{\eta \lambda_\min}. That square root is the theorem: momentum gives you asymptotically better conditioning.

You don’t need the proof. You need the picture: momentum turns the ravine from “nearly uncrossable” into “merely narrow.”

Distill’s Why Momentum Really Works by Gabriel Goh is the canonical visual write-up. Read it when you want more.

Limitations of plain momentum

Momentum fixes the zig-zag along curvature-axis pairs. But it uses one learning rate for all parameters.

In a neural network, different parameters live at different gradient scales. The weights of an embedding table that sees the word “the” get nudged by large gradients every step; the weights of an embedding that sees a rare word get tiny gradients rarely. A single η\eta forces you to choose: big enough for the rare gradients (diverges on “the”), or small enough for “the” (crawls on the rare word).

The fix is per-parameter adaptive step sizes. That’s what RMSProp does. Then combine momentum and RMSProp and you have Adam. Next lesson.

Lesson complete

Nice tinkering.