Optimization · 16 min

Batches, Stochastic, and the Noise Ball

Real datasets are too big to gradient-check every step. Use a sample. You introduce noise, which turns out to help, not hurt, in ways that matter.

0 / 0

The cost of a full-batch step

To compute the exact gradient of a neural-network loss, you need to run every training example through the model and average the per-example gradients. For GPT-2 trained on OpenWebText, that’s roughly 300 billion tokens. One full-batch step would take weeks on a single GPU.

You don’t have weeks.

So you approximate. Pick a small random batch of training examples, compute the gradient on just those, use it as a stand-in for the true gradient. Noisy, but the average of enough noisy estimates converges to the right thing. And each noisy estimate is much cheaper.

The stochastic gradient estimator

Formally: the loss is an average over training examples,

L(w)  =  1Ni=1Ni(w).L(\mathbf{w}) \;=\; \frac{1}{N} \sum_{i=1}^{N} \ell_i(\mathbf{w}).

Pick a random mini-batch BB of B|B| examples. The stochastic gradient estimator is

g^  =  1BiBi(w).\hat{\mathbf{g}} \;=\; \frac{1}{|B|} \sum_{i \in B} \nabla \ell_i(\mathbf{w}).

Two facts about g^\hat{\mathbf{g}}:

  • It’s unbiased: E[g^]=L(w)\mathbb{E}[\hat{\mathbf{g}}] = \nabla L(\mathbf{w}). On average, it’s the true gradient.
  • It has variance proportional to 1/B1/|B|. Bigger batches → less noisy estimates.

Use g^\hat{\mathbf{g}} in the GD update exactly where L\nabla L used to go:

wt+1  =  wt    ηg^t.\mathbf{w}_{t+1} \;=\; \mathbf{w}_t \;-\; \eta\, \hat{\mathbf{g}}_t.

That’s stochastic gradient descent.

Compute an estimate

A mini-batch of B=4|B| = 4 examples has per-example gradients along the xx-axis of (1,2,3,4)(1, 2, 3, 4). What is the stochastic gradient estimate g^x\hat g_x?

Three regimes: batch, mini-batch, pure SGD

Three sizes of batch, three tradeoffs:

  • Full batch (B=N|B| = N): the entire dataset. Zero variance per estimate. Infeasibly slow.
  • Pure SGD (B=1|B| = 1): one example at a time. Maximum noise, each step as cheap as possible.
  • Mini-batch (B32|B| \sim 3220482048): somewhere in between. Hardware-friendly (GPUs vectorize well), variance tamed by 1/B1/|B|.

Mini-batch is what everyone actually uses. It’s the sweet spot. Pure SGD is too noisy; full batch is too slow. Typical: 64–512 for language models.

Variance scaling

If pure SGD (B=1|B| = 1) has estimator variance 1.01.0, what’s the estimator variance at B=25|B| = 25?

The noise ball

Here’s the consequence that surprises people. With a fixed learning rate η\eta and stochastic gradients, SGD does not converge to a point. It converges to a ball around the minimum, a region within which it keeps bouncing, because every step has residual noise.

Below: 120 SGD runs from the same starting point on the same quadratic, plotted as a scatter of endpoints. The blue circle is the theoretical noise-ball radius, ησg/B\eta \sigma_g / \sqrt{|B|}. Adjust the controls and watch the cloud breathe.

SGD endpoint cloud · 120 seeds, 80 steps each

-1-0.50.51-0.6-0.4-0.20.20.40.6
empirical radius
0.145
theory: η σ / √|B|
0.050

bigger batch → smaller cloud · smaller η → smaller cloud · this is a ball, not a point.

Three ways to shrink the ball:

  • Smaller learning rate η\eta → smaller residual noise per step.
  • Bigger batch B|B| → less noisy estimates.
  • Or: start with a large ball for exploration, shrink the ball (by shrinking η\eta) when close to the minimum. That’s what a learning-rate schedule does (Lesson 10.5).

Important consequence: you cannot use “the gradient is zero” as a stopping criterion in practice; the stochastic gradient is never exactly zero even at the minimum.

Estimate the noise ball radius

With η=0.01\eta = 0.01, per-example gradient standard deviation σg=0.5\sigma_g = 0.5, and mini-batch size B=4|B| = 4, estimate the noise-ball radius ησg/B\eta \sigma_g / \sqrt{|B|}.

Noise is a feature

Counterintuitively: the noise in SGD is helpful, not just a cost you pay for cheap gradients.

First, it helps escape saddle points. At a saddle, the exact gradient is zero and full-batch GD stalls forever. A noisy gradient estimate almost never points exactly at zero, so SGD gets nudged off the saddle and keeps moving.

Second, it acts as an implicit regularizer. SGD tends to converge to flat minima (wide, shallow basins where the loss is low over a large region). Those generalize better than sharp, narrow minima, which full-batch GD is happy to fall into. There’s a whole subfield of “SGD generalization” devoted to explaining this empirical fact.

This is why “just use full-batch if you can afford it” turns out to be wrong advice for deep learning. You want some noise. You just don’t want the noise from a mini-batch of 1.

Stopping criteria in the stochastic world

Because the gradient never exactly zeros out, you need practical stopping criteria:

  • Loss plateau: track a moving average of training loss. When it stops decreasing for NN evaluations, stop.
  • Validation metric plateau: check a held-out set periodically; stop when it stops improving (this is early stopping, the most important criterion for generalization).
  • Fixed budget: run for NN steps or TT hours, then stop. Works when you’ve tuned the run before.

In a real training loop, you use some combination of all three. nanoGPT uses “fixed step budget with a cosine schedule.” You’ll see why in Lesson 10.5.

Lesson complete

Nice tinkering.