Two ways to keep weights small (and what they really mean)

Regularization is bigger than λ-times-norm-squared

A lot of people who took a machine-learning course will tell you “regularization means L2.” It does not. Regularization is anything that biases search toward simpler hypotheses or constrains the hypothesis space. Weight decay is one item under that umbrella. So is dropout. So is data augmentation. So is early stopping. So is the noise that mini-batch SGD injects for free.

In this lesson we cover two (weight decay and dropout) because they both give the network a smaller effective capacity than the parameter count would suggest, and they do it for completely different reasons.

Weight decay = L2

The naive idea: add a penalty to the loss that grows with the size of the weights. Pay a tax on big weights and gradient descent will keep them small.

$$\mathcal{L}_{\text{total}}(w) \;=\; \mathcal{L}_{\text{data}}(w) \;+\; \frac{\lambda}{2}\,\lVert w \rVert_2^2.$$

Differentiate with respect to $w$ and the gradient picks up an extra $\lambda w$ term. With plain gradient descent and learning rate $\eta$, the update becomes

$$w_{t+1} \;=\; (1 - \eta\lambda)\,w_t \;-\; \eta\,\nabla \mathcal{L}_{\text{data}}(w_t).$$

That $(1 - \eta\lambda)$ factor is why this is also called weight decay: every step, weights shrink by a tiny multiplicative factor before the data gradient gets to push them around.
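The two ways of writing the update can be checked against each other numerically. The following is a minimal sketch, not code from this course: the toy quadratic data loss, the function names, and the values of `lr` and `lam` are all assumptions chosen for illustration.

```python
def data_grad(w):
    # Gradient of a toy data loss 0.5 * sum((w_i - 1)^2).
    # The target value 1.0 is an arbitrary assumption for this sketch.
    return [wi - 1.0 for wi in w]


def step_penalized(w, lr, lam):
    # Gradient descent on L_data + (lam / 2) * ||w||^2:
    # the penalty contributes lam * w to the gradient.
    g = data_grad(w)
    return [wi - lr * (gi + lam * wi) for wi, gi in zip(w, g)]


def step_decay(w, lr, lam):
    # The same update written as "shrink by (1 - lr * lam), then follow the data gradient".
    g = data_grad(w)
    return [(1.0 - lr * lam) * wi - lr * gi for wi, gi in zip(w, g)]


w0 = [2.0, -3.0, 0.5]
penalized = step_penalized(w0, lr=0.1, lam=0.01)
decayed = step_decay(w0, lr=0.1, lam=0.01)
print(penalized)
print(decayed)
```

Both functions should print the same three numbers up to floating-point rounding, which is the algebraic identity behind the name "weight decay" for plain gradient descent.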

L2 has a Bayesian story

There is a deeper reason L2 works, and it isn’t “small weights are nicer.” It is that minimizing the L2-regularized loss is exactly maximum-a-posteriori estimation under a Gaussian prior on weights.

Start from $w_{\text{MAP}} = \arg\max_w p(w \mid \mathcal{D})$. By Bayes, $p(w \mid \mathcal{D}) \propto p(\mathcal{D} \mid w)\,p(w)$. Take the negative log:

$$-\log p(w \mid \mathcal{D}) \;=\; -\log p(\mathcal{D} \mid w) \;-\; \log p(w) \;+\; C.$$

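Pick a Gaussian prior $w \sim \mathcal{N}(0,\, \sigma^2 I)$. Then $-\log p(w) = \tfrac{1}{2\sigma^2}\lVert w \rVert_2^2 + C'$. Treat the data loss as the negative log-likelihood, $\mathcal{L}_{\text{data}}(w) = -\log p(\mathcal{D} \mid w)$, match term-for-term against the regularized loss, and you get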

$$\lambda \;=\; \frac{1}{\sigma^2}.$$

A narrower prior (you believe weights are tightly clustered around zero) means $\sigma^2$ is small, which means $\lambda$ is large. Strong shrinkage corresponds to a confident prior. They are reciprocals.
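To see the correspondence numerically, here is a small sketch under stated assumptions: the prior variance `sigma2 = 0.01` and the two weight vectors are made-up values, and the helper names are hypothetical. It compares how much the Gaussian negative log-prior and the L2 penalty each change between two weight settings. The additive constant $C'$ cancels in the difference, so the two gaps should agree when $\lambda = 1/\sigma^2$.

```python
import math


def neg_log_gaussian_prior(w, sigma2):
    # -log N(w; 0, sigma2 * I), including the normalizing constant.
    d = len(w)
    quad = sum(wi * wi for wi in w) / (2.0 * sigma2)
    const = 0.5 * d * math.log(2.0 * math.pi * sigma2)
    return quad + const


def l2_penalty(w, lam):
    # The (lam / 2) * ||w||^2 term from the regularized loss.
    return 0.5 * lam * sum(wi * wi for wi in w)


sigma2 = 0.01        # assumed prior variance (illustrative only)
lam = 1.0 / sigma2   # matching L2 strength

w_a = [0.3, -0.2]
w_b = [0.05, 0.1]

prior_gap = neg_log_gaussian_prior(w_a, sigma2) - neg_log_gaussian_prior(w_b, sigma2)
penalty_gap = l2_penalty(w_a, lam) - l2_penalty(w_b, lam)
print(prior_gap, penalty_gap)
```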

Regularization is not a hack you bolt onto the loss. It is the prior the modeler is asserting about the world.

Prior to penalty

You believe weights should sit tightly around zero, with prior variance $\sigma^2 = 0.25$. What is the corresponding L2 strength $\lambda$?

L1 picks corners; L2 picks circles

Same recipe, different prior. If you swap the Gaussian prior for a Laplace prior with scale $b$, then $-\log p(w) = \tfrac{1}{b}\sum_i \lvert w_i \rvert + C''$, and you’ve derived L1 with $\lambda = 1/b$:

$$\mathcal{L}_{\text{L1}}(w) \;=\; \mathcal{L}_{\text{data}}(w) \;+\; \lambda \sum_i \lvert w_i \rvert.$$

The geometric difference is the whole reason L1 is used. The prior’s level sets (the points of equal prior density) are circles for the Gaussian and diamonds for the Laplace. The MAP solution sits where a level set of the data loss first touches a level set of the prior, and that contact tends to happen at whichever feature of the prior’s shape sticks out the most. The diamond has corners, and the corners sit on the axes. So L1’s MAP solution tends to push individual coordinates of $w$ all the way to zero.

That is sparsity. L2 shrinks every weight toward zero but rarely lands exactly on it; L1 zeroes some out completely. The corners did the work.
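For one special case the two MAP solutions have closed forms, which makes the difference easy to see. The sketch below assumes a separable quadratic data loss $\tfrac{1}{2}\lVert w - w^\star \rVert_2^2$ around a data minimum $w^\star$; that loss, the function names, and the value of `lam` are assumptions made for illustration. A differently scaled loss would move the value of $\lambda$ at which a coordinate reaches zero.

```python
import math


def map_l1(w_star, lam):
    # Soft-thresholding: per-coordinate minimizer of 0.5 * (w - w*)^2 + lam * |w|.
    # Any coordinate with |w*| <= lam is set exactly to zero.
    return [math.copysign(max(abs(v) - lam, 0.0), v) for v in w_star]


def map_l2(w_star, lam):
    # Per-coordinate minimizer of 0.5 * (w - w*)^2 + (lam / 2) * w^2.
    # Every coordinate is scaled by 1 / (1 + lam); none reaches zero for finite lam.
    return [v / (1.0 + lam) for v in w_star]


w_star = [1.6, 0.5]   # data minimum
lam = 0.6             # assumed penalty strength

l1_solution = map_l1(w_star, lam)
l2_solution = map_l2(w_star, lam)
print("L1:", l1_solution)
print("L2:", l2_solution)
```

Under these assumptions the L1 solution has its second coordinate at exactly zero, while the L2 solution keeps both coordinates nonzero.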

[Interactive figure: L1 vs. L2 MAP solution. Initial state: λ = 0.00, data minimum (1.60, 0.50), MAP w* (1.60, 0.50), total loss 0.]

Blue dot is the data minimum (where the unregularized loss is zero). Coral dot is the MAP solution. Dashed shape is the prior level set passing through MAP: a circle for the Gaussian prior, a diamond for the Laplace prior. Crank λ on L1 and you'll watch the coral dot slide toward the nearest axis and stick. That sticking is sparsity, and the diamond's corners are why it happens.

The data minimum sits at $(1.6, 0.5)$. Toggle to L1 and slide $\lambda$ past about 1, and you’ll watch the smaller coordinate snap exactly to zero. With L2, both coordinates shrink continuously and neither reaches the axis.

AdamW: the decoupling fix

A subtlety worth flagging once: when you use Adam, its per-parameter step rescaling interacts badly with adding $\lambda w$ to the gradient. The penalty gets divided by the square root of Adam’s running second-moment estimate along with the rest of the gradient, so weights with a history of large gradients get shrunk less than weights with small gradients. The decay strength ends up depending on gradient history rather than on $\lambda$ alone.

The fix is decoupled weight decay: subtract $\eta \lambda w$ directly from the parameter, outside Adam’s adaptive rescaling. That’s AdamW. We will use AdamW with $\lambda \approx 0.1$ when we train the transformer. It is the standard way to combine weight decay with an adaptive optimizer.
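The difference between the two variants shows up even on a single scalar weight. This is a simplified sketch of one Adam step, not the implementation in any particular library: `adam_step`, `first_step`, and the hyperparameter values are assumptions. The data gradient is set to zero, so any movement of the weight comes from the decay alone.

```python
import math


def adam_step(w, data_grad, state, lr, wd, decoupled,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on a scalar weight w. Returns the new weight."""
    # Coupled L2 sends the penalty gradient wd * w through Adam's rescaling.
    # Decoupled decay keeps it out and subtracts lr * wd * w directly.
    grad = data_grad if decoupled else data_grad + wd * w
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * grad * grad
    m_hat = state["m"] / (1.0 - beta1 ** state["t"])   # bias-corrected first moment
    v_hat = state["v"] / (1.0 - beta2 ** state["t"])   # bias-corrected second moment
    adaptive = lr * m_hat / (math.sqrt(v_hat) + eps)
    decay = lr * wd * w if decoupled else 0.0
    return w - adaptive - decay


def first_step(wd, decoupled):
    # Start at w = 1 with zero data gradient and fresh optimizer state.
    state = {"t": 0, "m": 0.0, "v": 0.0}
    return adam_step(1.0, 0.0, state, lr=0.01, wd=wd, decoupled=decoupled)


for wd in (0.1, 1.0):
    print(wd, first_step(wd, decoupled=False), first_step(wd, decoupled=True))
```

In this setup the coupled variant moves the weight by roughly `lr` whether `wd` is 0.1 or 1.0, because Adam normalizes the penalty gradient by its own magnitude. The decoupled variant shrinks the weight by `lr * wd * w`, so a tenfold larger `wd` gives a tenfold larger shrink.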

Dropout: the radical idea

The other technique. At training time, take every activation in a layer and zero it out independently with probability $p$. Then continue forward as if those neurons didn’t exist for this step. The next step you sample a fresh random mask. The next step, again.

The effect is that no single neuron can rely on any specific other neuron being there. Co-adaptation between units (where one neuron quietly compensates for another) gets discouraged because the compensator might vanish at any moment.

Dropout is not “make the network smaller.” All neurons are present at inference. Dropout is noise injection: the model has to find a solution that is robust to random subnetworks of itself.

Why scaling matters: the expected-value contract

Now the part that trips everybody up. Suppose a neuron normally outputs $a = 4$. Apply dropout with drop probability $p = 0.25$ and no rescaling. During training, the expected output is

$$\mathbb{E}[\tilde a] \;=\; (1 - p)\cdot a \;=\; 0.75 \times 4 \;=\; 3.$$

But during inference there’s no dropout. The output is 4. Train and test see different distributions of activations; the rest of the network was tuned for the train-time scale.

The standard fix is inverted dropout: at training time, multiply each surviving activation by $1/(1-p)$. Now

$$\mathbb{E}[\tilde a] \;=\; (1 - p)\cdot \frac{a}{1 - p} \;=\; a.$$

Train-time and inference-time expectations match, and inference doesn’t have to do anything weird. The major frameworks, including PyTorch and TensorFlow, implement dropout this way.

In code:

import torch
mask = (torch.rand_like(a) > p).float()   # a: activation tensor; 1 keeps a unit (probability 1 - p), 0 drops it
a = a * mask / (1.0 - p)   # training only; eval skips this entirely
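The expected-value contract can also be checked empirically. The sketch below uses only the Python standard library instead of torch so that it runs anywhere; `inverted_dropout`, the seed, and the sample count are assumptions for illustration. It reuses the $a = 4$, $p = 0.25$ example from above.

```python
import random


def inverted_dropout(activations, p, training, rng):
    # Eval mode: identity, no mask and no scaling.
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    # Training mode: keep each unit with probability 1 - p and scale survivors by 1 / (1 - p).
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)   # fixed seed so the run is repeatable
a, p, n = 4.0, 0.25, 100_000

samples = [inverted_dropout([a], p, training=True, rng=rng)[0] for _ in range(n)]
train_mean = sum(samples) / n
eval_out = inverted_dropout([a], p, training=False, rng=rng)[0]

print("train-mode mean:", train_mean)
print("eval-mode output:", eval_out)
```

Each training-mode sample is either 0 or $4 / 0.75 \approx 5.33$, and the average over many draws should land close to the eval-mode value of 4.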

Inverted dropout

A neuron outputs $a = 5.0$ and survives a dropout step with $p = 0.2$. What is its scaled value during training under inverted dropout?

Why dropout and L2 are not the same thing

They both regularize. They are not interchangeable.

L2 shrinks all weights uniformly: every coordinate of $w$ pays the same multiplicative tax per step. Dropout’s regularization is adaptive: it punishes co-adapted features (neurons that rely on others to compensate) much more than independently useful ones. For linear regression with dropout applied to the inputs, the expected loss works out to the ordinary loss plus a data-dependent L2 penalty in which each weight is scaled by the squared magnitude of its input feature. Plain L2 is data-independent.
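The linear-model claim can be verified exactly on a single training example by enumerating every dropout mask. This is a sketch under stated assumptions: squared-error loss, inverted dropout applied to the inputs of a linear predictor, and made-up values for `w`, `x`, `y`, and `p`. The function names are hypothetical.

```python
from itertools import product


def expected_dropout_loss(w, x, y, p):
    # Exact expectation of (y - w . x_tilde)^2 over all 2^d input masks,
    # where x_tilde_j = x_j * mask_j / (1 - p).
    keep = 1.0 - p
    total = 0.0
    for mask in product([0, 1], repeat=len(x)):
        prob = 1.0
        pred = 0.0
        for wj, xj, mj in zip(w, x, mask):
            prob *= keep if mj else p
            pred += wj * xj * mj / keep
        total += prob * (y - pred) ** 2
    return total


def plain_loss_plus_penalty(w, x, y, p):
    # Squared error without dropout, plus p / (1 - p) * sum_j (x_j * w_j)^2.
    pred = sum(wj * xj for wj, xj in zip(w, x))
    penalty = p / (1.0 - p) * sum((xj * wj) ** 2 for wj, xj in zip(w, x))
    return (y - pred) ** 2 + penalty


w = [0.5, -1.0, 2.0]
x = [1.0, 3.0, 0.2]
y = 1.5
p = 0.4

lhs = expected_dropout_loss(w, x, y, p)
rhs = plain_loss_plus_penalty(w, x, y, p)
print(lhs, rhs)
```

The two numbers should agree up to rounding. The penalty on each weight is multiplied by $x_j^2$, so a weight attached to a large input feature is penalized more heavily, which a plain L2 term would not do.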

In practice you use both. The transformer block in this course will apply dropout to the attention output and to the MLP output, and AdamW will apply weight decay across all projection matrices.

[Interactive figure: eight units (1 to 8) under inverted dropout. Initial state: p = 0.40, scale 1/(1−p) = 1.667, samples 0, |coral − gray| = 0.000.]

Each step: a binary mask drops each unit with probability p; surviving units multiply by 1/(1−p). The expectation lands on the gray bars by construction. With p = 0, scaling does nothing and samples equal eval. As p grows, individual samples get noisier, but their average still locks onto the eval-mode value within a few dozen draws.

Hit sample × 50 a few times and watch the coral bars (running mean of training-mode samples) lock onto the gray bars (eval-mode true values). That’s what the $1/(1-p)$ scaling is for: the train-time expectation matches the eval-time output exactly.