A network that won't move
Build a 10-layer MLP. 200 hidden units per layer. Tanh activations. Initialize every weight from $\mathcal{N}(0, 1)$, a “reasonable”-looking standard-normal sample. Now feed in a batch of unit-variance inputs.
By layer 2 or 3, every tanh is pinned at $\pm 1$. By layer 10, every neuron is fully saturated; derivatives are essentially zero everywhere. Now run backprop. The gradient is the chain of all those tiny derivatives multiplied together. It vanishes. The optimizer cannot move parameters whose gradient is numerically zero. The loss does nothing. Your network is dead, and the loss curve looks like the “dead curve” pathology from the previous lesson.
There is no error message. No NaN. The optimizer ran “successfully.” It just produced no learning. And the bug is one number: the standard deviation of the initial weights.
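The forward half of that experiment can be sketched in plain Python. This is a minimal sketch, not the lesson's own code: the function name `saturation_by_layer`, the small batch size, and the 0.99 cutoff for calling a unit "saturated" are all illustrative assumptions. It runs only the forward pass and reports what fraction of tanh outputs sit close to $\pm 1$ after each layer.

```python
import math
import random


def saturation_by_layer(sigma_w, width=200, depth=10, batch=4, seed=0):
    """Return, per layer, the fraction of tanh outputs with |a| > 0.99."""
    rng = random.Random(seed)
    # Unit-variance inputs: one list of `width` features per example.
    acts = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(batch)]
    fractions = []
    for _ in range(depth):
        # Fresh weight matrix for this layer, every entry drawn from N(0, sigma_w^2).
        weights = [[rng.gauss(0.0, sigma_w) for _ in range(width)]
                   for _ in range(width)]
        acts = [
            [math.tanh(sum(w * a for w, a in zip(row, example))) for row in weights]
            for example in acts
        ]
        # 0.99 is an arbitrary cutoff chosen for illustration.
        saturated = sum(abs(a) > 0.99 for example in acts for a in example)
        fractions.append(saturated / (batch * width))
    return fractions
```

Calling `saturation_by_layer(1.0)` should show most units saturated at every depth, while a much smaller standard deviation such as `1 / math.sqrt(200)` should leave almost none saturated.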
Why does it matter so much? Algebra.
Drag the slider all the way to the right ($\sigma_W \approx 0.4$) and watch every layer’s histogram collapse to a wall at $\pm 1$: that’s tanh saturation. Drag to the far left ($\sigma_W \approx 0.01$) and the activations decay toward zero. Hit the He preset and the histograms stabilize to a clean Gaussian at every depth. The layer-1 gradient readout in the corner tells you whether the network can learn at all.
Variance through one linear layer
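A linear layer computes $y = \sum_{i=1}^{n_{\text{in}}} w_i x_i$, where the $w_i$ are i.i.d. with mean 0 and variance $\sigma_W^2$, and the $x_i$ are i.i.d. with mean 0 and variance $\sigma_x^2$, all independent.
Compute the variance of $y$. Independence lets us drop cross-covariance terms:
$$\operatorname{Var}(y) = \sum_{i=1}^{n_{\text{in}}} \operatorname{Var}(w_i x_i) = \sum_{i=1}^{n_{\text{in}}} \mathbb{E}[w_i^2]\,\mathbb{E}[x_i^2] = n_{\text{in}}\,\sigma_W^2\,\sigma_x^2$$
So $\operatorname{Var}(y) = n_{\text{in}}\,\sigma_W^2\,\sigma_x^2$. The variance of the output of one layer is $n_{\text{in}}$ times the variance of the inputs, scaled by the variance of the weights.
That extra factor of $n_{\text{in}}$ is what kills you. With $\sigma_W^2 = 1$ (the bad init we started with) and $n_{\text{in}} = 200$, every layer multiplies variance by 200. After ten layers, variance is $200^{10} \approx 10^{23}$. Tanh saturates; gradients die.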
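The one-layer variance rule can be checked numerically. The sketch below is a Monte Carlo estimate under the same assumptions as the derivation (zero-mean, independent weights and inputs); the names `output_variance` and `predicted_variance`, and the sample count, are illustrative choices rather than anything from the lesson.

```python
import random
import statistics


def predicted_variance(n_in, sigma_w, sigma_x):
    """Closed-form prediction: n_in * Var(w) * Var(x)."""
    return n_in * sigma_w ** 2 * sigma_x ** 2


def output_variance(n_in, sigma_w, sigma_x, samples=2000, seed=0):
    """Estimate Var(y) for y = sum_i w_i * x_i with fresh w and x per sample."""
    rng = random.Random(seed)
    ys = []
    for _ in range(samples):
        y = sum(rng.gauss(0.0, sigma_w) * rng.gauss(0.0, sigma_x)
                for _ in range(n_in))
        ys.append(y)
    return statistics.pvariance(ys)


# Compare the estimate with the formula for the "bad" init from the text.
print(output_variance(200, 1.0, 1.0), predicted_variance(200, 1.0, 1.0))
```

The estimate is noisy, so expect agreement with the formula only to within a few percent at this sample count.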
The variance-preservation rule
You want activations to keep roughly the same scale, layer after layer. So pick the variance of the weights to kill the factor of $n_{\text{in}}$:
$$\sigma_W^2 = \frac{1}{n_{\text{in}}}$$
Now $\operatorname{Var}(y) = \sigma_x^2$. Activation variance is preserved across the layer. Stack ten and the input still looks like the input.
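This is Xavier initialization in its simplest form (also called Glorot, after Xavier Glorot). The forward-only version is $\sigma_W^2 = 1/n_{\text{in}}$. The symmetric version, which considers both forward and backward variance preservation, is $\sigma_W^2 = 2/(n_{\text{in}} + n_{\text{out}})$. Either works; the Xavier initializers in modern libraries (for example PyTorch’s xavier_uniform_ and Keras’s glorot_uniform) use the symmetric form.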
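As a small sketch, the two Xavier variants reduce to two one-line formulas for the weight standard deviation. The helper names below are hypothetical, and the sampler draws from a normal distribution for simplicity (library implementations also offer uniform variants with the same variance).

```python
import math
import random


def xavier_std_forward(n_in):
    """Forward-only Xavier: Var(w) = 1 / n_in."""
    return math.sqrt(1.0 / n_in)


def xavier_std_symmetric(n_in, n_out):
    """Symmetric Xavier: Var(w) = 2 / (n_in + n_out)."""
    return math.sqrt(2.0 / (n_in + n_out))


def sample_xavier_weights(n_in, n_out, rng, symmetric=True):
    """Draw an n_out x n_in weight matrix from N(0, std^2)."""
    std = xavier_std_symmetric(n_in, n_out) if symmetric else xavier_std_forward(n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]


weights = sample_xavier_weights(400, 100, random.Random(0))
print(xavier_std_forward(400), xavier_std_symmetric(400, 100))
```

When a layer is square ($n_{\text{in}} = n_{\text{out}}$), the two formulas give the same value, so the choice only matters for layers that change width.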
Xavier in numbers
A fully-connected layer has 256 inputs and a tanh nonlinearity. Using Xavier (forward-only) initialization, what is $\sigma_W$? (Answer to four decimal places is fine.)
Why Xavier breaks for ReLU
The variance derivation assumed activations had mean 0. Tanh roughly does, since it’s symmetric around the origin. But ReLU is not symmetric. ReLU zeros out everything below zero. Half the inputs, in expectation, go to zero. So the mean square $\mathbb{E}[x^2]$ of a ReLU layer’s output, which is the quantity the next layer’s variance actually depends on, is half what the linear analysis predicts.
If you use Xavier with ReLU, you’ve lost half your variance per layer, and your activations decay toward zero exponentially with depth. Same kind of failure, opposite direction.
The fix is to compensate. Double the weight variance to make up for what ReLU eats:
$$\sigma_W^2 = \frac{2}{n_{\text{in}}}$$
This is He initialization (also called Kaiming, after Kaiming He). It is the default for any layer followed by a ReLU or ReLU variant, which means it is the default for almost everything you’ll build.
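The difference between the two rules under ReLU can be seen in a small forward-pass simulation. This is a sketch under illustrative assumptions (width 200, depth 10, a batch of 4, normal weights, no biases), and `mean_square_by_layer` is a hypothetical helper name. It tracks the mean square of the activations after each ReLU layer for weight variance $1/n_{\text{in}}$ and $2/n_{\text{in}}$.

```python
import math
import random


def mean_square_by_layer(weight_var_numerator, width=200, depth=10, batch=4, seed=0):
    """Mean square of ReLU activations per layer, with Var(w) = numerator / width."""
    rng = random.Random(seed)
    sigma_w = math.sqrt(weight_var_numerator / width)
    acts = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(batch)]
    history = []
    for _ in range(depth):
        weights = [[rng.gauss(0.0, sigma_w) for _ in range(width)]
                   for _ in range(width)]
        # Linear layer followed by ReLU.
        acts = [
            [max(0.0, sum(w * a for w, a in zip(row, example))) for row in weights]
            for example in acts
        ]
        total = sum(a * a for example in acts for a in example)
        history.append(total / (batch * width))
    return history


xavier = mean_square_by_layer(1.0)  # Var(w) = 1 / n_in
he = mean_square_by_layer(2.0)      # Var(w) = 2 / n_in
for layer, (x_ms, he_ms) in enumerate(zip(xavier, he), start=1):
    print(layer, x_ms, he_ms)
```

Because both runs share a seed and ReLU is positively homogeneous, the Xavier column is the He column scaled by $2^{-l}$ at layer $l$, which makes the per-layer halving visible directly. The He column itself fluctuates around its starting scale rather than staying exactly constant, since the width is finite.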
He in numbers
A fully-connected layer has 512 inputs and a ReLU nonlinearity. Using He initialization, what is $\sigma_W$?
Why you can't initialize to zero
You might think “if random matters, what about all-zeros? It’s symmetric, deterministic, easy.”
It does not work. If every weight is zero (or any constant), every neuron in a layer computes the same output. Forward pass: identical activations. Backward pass: identical gradients. Update: identical step. The neurons stay identical forever. You haven’t built a hidden layer of $n$ neurons; you’ve built one neuron $n$ times, and the layer collapses to a single effective unit.
This is symmetry breaking: randomness exists not for any deep statistical reason but because two neurons that start the same and see the same gradients will always be the same. Random init kicks the network out of that symmetry on the first step.
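The symmetry argument can be demonstrated on a toy network. The sketch below trains a tiny one-hidden-layer tanh MLP with hand-written SGD on made-up random data; the function name `train_hidden_rows`, the layer sizes, the learning rate, and the constant 0.5 are all illustrative assumptions. A nonzero constant is used so that the weights do move, which shows that they move in lockstep.

```python
import math
import random


def train_hidden_rows(init, steps=20, lr=0.1, n_in=3, n_hidden=4, seed=0):
    """Train a tiny tanh MLP with SGD and return the hidden-layer weight rows.

    `init` is a callable taking the RNG and returning one initial weight.
    """
    rng = random.Random(seed)
    # Synthetic regression data: 8 random (input, target) pairs.
    data = [([rng.gauss(0.0, 1.0) for _ in range(n_in)], rng.gauss(0.0, 1.0))
            for _ in range(8)]
    W1 = [[init(rng) for _ in range(n_in)] for _ in range(n_hidden)]
    w2 = [init(rng) for _ in range(n_hidden)]
    for _ in range(steps):
        for x, target in data:
            h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
            err = sum(v * hj for v, hj in zip(w2, h)) - target  # dL/dy for L = err^2 / 2
            for j in range(n_hidden):
                # Backprop through the output weight and the tanh.
                grad_pre = err * w2[j] * (1.0 - h[j] ** 2)
                w2[j] -= lr * err * h[j]
                W1[j] = [w - lr * grad_pre * xi for w, xi in zip(W1[j], x)]
    return W1


constant_rows = train_hidden_rows(lambda rng: 0.5)
random_rows = train_hidden_rows(lambda rng: rng.gauss(0.0, 0.5))
print(all(row == constant_rows[0] for row in constant_rows))
print(all(row == random_rows[0] for row in random_rows))
```

With the constant init every hidden row receives the same update at every step, so the rows stay equal after training; with the random init they start different and remain different.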
The transformer's initialization, exactly
Modern transformers add one more wrinkle. Even with He init in each individual layer, the residual stream (the running sum $x_{l+1} = x_l + f_l(x_l)$) has variance that grows with depth. The fix from the GPT-2 paper, used by every nanoGPT-shaped codebase since:
$$\sigma = \frac{0.02}{\sqrt{2\,n_{\text{layer}}}}$$
applied to the output projection of each residual sub-block. (The factor of 2 is because each transformer block has two residual additions: one around attention, one around the MLP.) Everything else in the network uses standard $\sigma = 0.02$.
This is what c_proj.weight in a nanoGPT codebase is doing when it scales by $1/\sqrt{2\,n_{\text{layer}}}$: it’s keeping the variance of the residual stream constant with depth, regardless of how many blocks you stack.
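The arithmetic behind that scaling is short enough to sketch. The helper names below are hypothetical, and the base standard deviation of 0.02 is the value nanoGPT uses for its other weights. The second function rests on an idealization: it assumes each of the $2\,n_{\text{layer}}$ residual additions contributes independently, in proportion to the weight variance of its output projection.

```python
import math


def residual_proj_std(n_layer, base_std=0.02):
    """Std for residual output projections: base_std / sqrt(2 * n_layer)."""
    return base_std / math.sqrt(2 * n_layer)


def summed_weight_variance(n_layer, base_std=0.02):
    """Total projection weight variance across the 2 * n_layer residual additions."""
    return 2 * n_layer * residual_proj_std(n_layer, base_std) ** 2


for n_layer in (6, 12, 48):
    print(n_layer, residual_proj_std(n_layer), summed_weight_variance(n_layer))
```

Under that idealization the summed variance equals `base_std ** 2` for every depth, which is the sense in which the scaling keeps the residual stream from growing as blocks are stacked.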
Initialization is one line of variance algebra. It is also the difference between a 12-layer transformer that converges and one that sits at chance forever.