A friend with the wrong dictionary
You are sending coded telegrams to a friend. The truth is that you write the letter e half the time and t the other half. So you’d want a code optimized for those frequencies: short codeword for e, short codeword for t.
But your friend has the wrong model of your letter habits. They assume you write e 90% of the time and t only 10%. They tune their code accordingly: a very short codeword for e, a punitively long one for t.
Every telegram you send costs more than it should. The extra cost, per letter, has a name: KL divergence. That’s the whole lesson.
Cross-entropy is the cost of believing Q when the truth is P
The formal definition mirrors entropy exactly. Replace the inner $\log_2 p(x)$ with $\log_2 q(x)$:
$$H(P, Q) = -\sum_x p(x) \log_2 q(x)$$
In words: the average surprise you actually suffer when the outcomes come from $P$ but you score them under $Q$. Note the asymmetry baked into the formula: $P$ provides the weights, $Q$ provides the log. Swap them and you get a different number.
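To make the definition concrete, here is a minimal sketch that evaluates it on the telegram example from the opening (truth 50/50, friend's belief 90/10). The function name `cross_entropy` and the list layout are illustrative choices, not part of the lesson's widget.

```python
import math

def cross_entropy(p, q):
    """H(P, Q) in bits: weights come from p, log-probabilities from q."""
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0:
            continue              # outcomes that never occur contribute nothing
        if q_x == 0:
            return math.inf       # the model ruled out something that happens
        total -= p_x * math.log2(q_x)
    return total

# Telegram example: the truth is 50/50, the friend assumes 90/10.
P = [0.5, 0.5]
Q = [0.9, 0.1]

print(f"H(P, P) = {cross_entropy(P, P):.3f} bits")   # 1.000: entropy of the truth
print(f"H(P, Q) = {cross_entropy(P, Q):.3f} bits")   # 1.737: cost of believing Q
print(f"H(Q, P) = {cross_entropy(Q, P):.3f} bits")   # 1.000: roles swapped
```

Scoring the fair source with the 90/10 model costs about 1.737 bits per letter instead of 1, and swapping the two arguments gives a different number, as the formula predicts.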
The widget below has two pmfs side by side. The red bars are $P$ (the truth). The blue bars are $Q$ (your model). The right panel renders $H(P)$ and $H(P, Q)$ as nested bars; the orange band is the gap, which is exactly $D_{\mathrm{KL}}(P \,\|\, Q)$. Drag any bar.
Gibbs' inequality, visible as a barrier
Lock $P$ on the widget (the checkbox), then drag $Q$ around. Try to make the orange bar shorter than the red bar. You can’t. As $Q$ approaches $P$, the gap closes; when $Q = P$, the two bars touch and the KL band disappears. But $H(P, Q)$ never drops below $H(P)$.
This is Gibbs’ inequality:
$$H(P, Q) \ge H(P), \quad \text{with equality iff } Q = P.$$
The cleanest mental model: $H(P)$ is the entropy of the truth, the irreducible average surprise you’d suffer with a perfect model. Any imperfect $Q$ costs you extra. That extra cost is the KL.
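The barrier can also be probed without the widget. The sketch below fixes a hypothetical three-outcome truth `P`, draws many random models, and records how far each cross-entropy sits above the entropy floor. The values of `P`, the seed, and the helper names are assumptions made for illustration.

```python
import math
import random

def entropy(p):
    return -sum(p_x * math.log2(p_x) for p_x in p if p_x > 0)

def cross_entropy(p, q):
    # Assumes q is strictly positive wherever p is.
    return -sum(p_x * math.log2(q_x) for p_x, q_x in zip(p, q) if p_x > 0)

def random_pmf(n, rng):
    # The small offset keeps every entry strictly positive.
    raw = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]

rng = random.Random(0)
P = [0.5, 0.3, 0.2]            # hypothetical locked truth
floor = entropy(P)

# Gap between cross-entropy and entropy for 10,000 random models Q.
gaps = [cross_entropy(P, random_pmf(len(P), rng)) - floor for _ in range(10_000)]

print(f"H(P) = {floor:.4f} bits")
print(f"smallest gap over random Q: {min(gaps):.6f} bits")
print(f"gap at Q = P: {cross_entropy(P, P) - floor:.6f} bits")
```

No random model gets below the floor, and the gap is zero only when the model equals the truth. A random search is of course not a proof, only a sanity check of the inequality.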
Truth-matches-model
$P = (0.5, 0.5)$ and $Q = (0.5, 0.5)$, a fair coin matched perfectly by a fair-coin model.
What is $H(P, Q)$ in bits?
Model thinks the coin is biased
$P = (0.5, 0.5)$, a fair coin. $Q$, a model that thinks heads is overwhelmingly likely.
What is $H(P, Q)$ in bits? (Three decimals.)
KL divergence: the cost of being wrong
The amount by which cross-entropy exceeds entropy has its own name:
$$D_{\mathrm{KL}}(P \,\|\, Q) = H(P, Q) - H(P) = \sum_x p(x) \log_2 \frac{p(x)}{q(x)}$$
Three properties to internalize:
- Non-negative. $D_{\mathrm{KL}}(P \,\|\, Q) \ge 0$, with equality iff $P = Q$. This is Gibbs’ inequality restated.
- Asymmetric. $D_{\mathrm{KL}}(P \,\|\, Q) \ne D_{\mathrm{KL}}(Q \,\|\, P)$ in general. Click the Swap P ↔ Q button and the numbers change.
- Infinite when Q has a zero where P doesn’t. If $q(x) = 0$ but $p(x) > 0$, the model said “this is impossible” and reality went and did it. The cost is unbounded. This is part of why label smoothing exists in classification: softened targets keep the model from being pushed toward assigning exact zero to any class.
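The three properties can be checked on small examples. This is a sketch, reusing the 50/50 versus 90/10 pair from the telegram story. `mix_with_uniform` is a hypothetical helper that simply keeps model probabilities away from zero; it is not the label-smoothing procedure itself, which modifies the targets.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in bits; returns inf if q is zero where p is not."""
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0:
            continue
        if q_x == 0:
            return math.inf
        total += p_x * math.log2(p_x / q_x)
    return total

def mix_with_uniform(q, eps):
    """Hypothetical helper: blend q with the uniform pmf so no entry is zero."""
    n = len(q)
    return [(1 - eps) * q_x + eps / n for q_x in q]

P = [0.5, 0.5]
Q = [0.9, 0.1]

print(kl_divergence(P, P))                          # 0.0: equality case
print(kl_divergence(P, Q), kl_divergence(Q, P))     # about 0.737 vs 0.531 bits

hard = [1.0, 0.0]                                   # model that rules out outcome 2
print(kl_divergence(P, hard))                       # inf
print(kl_divergence(P, mix_with_uniform(hard, 0.1)))  # finite once the zero is gone
```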
Compute the KL
Using the values from the previous problems ($H(P, Q)$ in bits and $H(P) = 1$ bit), what is $D_{\mathrm{KL}}(P \,\|\, Q)$?
Asymmetry is not a bug, it's a different question
KL is not a distance. The word “divergence” exists precisely to warn you off that intuition. Cross-entropy is not a distance either; $H(P, P) = H(P)$, not zero.
Why does $D_{\mathrm{KL}}(P \,\|\, Q)$ differ from $D_{\mathrm{KL}}(Q \,\|\, P)$? Because they are asking different things. $D_{\mathrm{KL}}(P \,\|\, Q)$ asks “if the truth is $P$, how badly does the model $Q$ describe it?” (weights are over $P$). $D_{\mathrm{KL}}(Q \,\|\, P)$ asks “if reality is $Q$ and we tried to use $P$ as a code, how much extra would it cost?” (weights are over $Q$). Two questions, two answers, two numbers.
The continuous version makes this picture geometric. Below, $P$ (red) is a bimodal truth with two bumps. $Q$ (blue) is a single Gaussian, draggable. Both KL directions are computed by numerical integration.
Try the two “Best Q” buttons. They snap $Q$ to the optima for the two KL directions. Notice: the forward-KL (“mass-covering”) optimum is a wide Gaussian centered between the two bumps; it has to put mass everywhere $P$ does. The reverse-KL (“mode-seeking”) optimum is a narrow Gaussian parked entirely on one bump; it’s allowed to ignore mass it isn’t responsible for.
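The two optima can be reproduced offline with a rough sketch. It assumes a specific truth (an equal mixture of unit-variance Gaussians at -3 and +3), a Riemann sum on [-12, 12], and a coarse grid search over the mean and width of the single Gaussian. None of these settings come from the widget; they are illustrative choices.

```python
import math

def log_normal_pdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def log_p(x):
    # Assumed truth: equal mixture of N(-3, 1) and N(+3, 1), combined in log space.
    a = math.log(0.5) + log_normal_pdf(x, -3.0, 1.0)
    b = math.log(0.5) + log_normal_pdf(x, 3.0, 1.0)
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

DX = 0.05
XS = [-12.0 + DX * i for i in range(481)]      # integration grid on [-12, 12]
LOG_P = [log_p(x) for x in XS]

def both_kls(mu, sigma):
    """Riemann-sum estimates (nats) of forward KL(P||Q) and reverse KL(Q||P)."""
    forward = reverse = 0.0
    for x, lp in zip(XS, LOG_P):
        lq = log_normal_pdf(x, mu, sigma)
        forward += math.exp(lp) * (lp - lq) * DX   # weights from P
        reverse += math.exp(lq) * (lq - lp) * DX   # weights from Q
    return forward, reverse

# Coarse grid: means from -4 to 4 (step 0.5), widths from 0.5 to 4 (step 0.25).
candidates = [(-4.0 + 0.5 * i, 0.5 + 0.25 * j) for i in range(17) for j in range(15)]
scores = {c: both_kls(*c) for c in candidates}
best_forward = min(candidates, key=lambda c: scores[c][0])
best_reverse = min(candidates, key=lambda c: scores[c][1])

print("forward-KL optimum (mu, sigma):", best_forward)
print("reverse-KL optimum (mu, sigma):", best_reverse)
```

Under these assumptions the forward search should settle on a wide Gaussian near the midpoint, while the reverse search should pick a narrow Gaussian sitting on one of the two bumps. Which bump it picks is arbitrary, since the two are symmetric.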
Why supervised ML uses forward KL (and not reverse)
You’ll see “forward KL” everywhere in this course. Here’s why: when you minimize $H(P, Q_\theta)$ over your model’s parameters $\theta$ (which is what every loss step does), the empirical $P$ stays fixed and only $Q_\theta$ moves. Subtract $H(P)$ (a constant in $\theta$) and you get $D_{\mathrm{KL}}(P \,\|\, Q_\theta)$, the forward KL. So the gradient descent step is mass-covering by construction.
Reverse KL ($D_{\mathrm{KL}}(Q_\theta \,\|\, P)$) shows up in variational inference and in some RL formulations; those are mode-seeking, sometimes intentionally. For our purposes (training generative LMs), forward KL is the right tool and it falls out of MLE automatically.
You don’t need to derive any of this in m9. Just know that “cross-entropy loss” and “forward KL up to an additive constant” have the same gradient.
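The “same gradient” claim can be checked numerically on a toy model. The sketch below assumes a hypothetical fixed distribution `P` over two outcomes and a one-parameter sigmoid model; it compares finite-difference slopes of the cross-entropy and of the forward KL at a few parameter values. All names and numbers are illustrative.

```python
import math

P = [0.7, 0.3]    # hypothetical fixed empirical distribution

def model(theta):
    """One-parameter model Q_theta over two outcomes, via a sigmoid."""
    s = 1.0 / (1.0 + math.exp(-theta))
    return [s, 1.0 - s]

def cross_entropy(p, q):
    return -sum(p_x * math.log(q_x) for p_x, q_x in zip(p, q))   # nats

def entropy(p):
    return -sum(p_x * math.log(p_x) for p_x in p if p_x > 0)

def forward_kl(p, q):
    return sum(p_x * math.log(p_x / q_x) for p_x, q_x in zip(p, q) if p_x > 0)

def slope(loss, theta, h=1e-5):
    """Central finite-difference estimate of d loss / d theta."""
    return (loss(theta + h) - loss(theta - h)) / (2 * h)

def ce(theta):
    return cross_entropy(P, model(theta))

def kl(theta):
    return forward_kl(P, model(theta))

for theta in (-2.0, 0.0, 1.5):
    print(f"theta={theta:+.1f}  CE-KL={ce(theta) - kl(theta):.6f}  "
          f"dCE={slope(ce, theta):+.6f}  dKL={slope(kl, theta):+.6f}")
print(f"H(P) = {entropy(P):.6f}")
```

The difference between the two losses stays equal to the entropy of `P` at every parameter value, so their slopes agree.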
Where this lands: every loss curve is a cross-entropy
The number your transformer’s training loop prints at every iteration is $H(P, Q_\theta)$, the average cross-entropy of the model on the current batch. Subtract the (constant) data entropy and you have KL. Both interpretations are correct; pick the one that’s useful at the moment.
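As a rough illustration of what that printed number is, the sketch below computes the mean negative log-probability of the observed tokens for a tiny made-up batch. The logits, targets, and vocabulary size are hypothetical, and a real training loop would use a framework's built-in loss rather than this hand-rolled version.

```python
import math

def log_softmax(logits):
    m = max(logits)                                   # subtract max for stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def batch_loss(batch_logits, targets):
    """Mean negative log-probability (nats) the model gives the observed tokens."""
    losses = [-log_softmax(logits)[t] for logits, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)

# Hypothetical batch: 3 positions, vocabulary of 4 tokens.
batch_logits = [
    [2.0, 0.5, -1.0, 0.0],
    [0.1, 0.2, 3.0, -0.5],
    [0.0, 0.0, 0.0, 0.0],     # uniform prediction: costs log(4) nats
]
targets = [0, 2, 3]

loss = batch_loss(batch_logits, targets)
print(f"loss = {loss:.4f} nats = {loss / math.log(2):.4f} bits")
```

Each term is the surprise of the model at the token that actually occurred, so the batch average is a sample estimate of the cross-entropy between the data and the model. Frameworks usually report it in nats; dividing by ln 2 converts to bits.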
You now know enough about distributions-on-distributions to read 90% of the ML literature. The remaining piece is the specific case of classification: when $P$ is a one-hot label, what does this whole apparatus simplify to? That’s the next lesson, where softmax stops being a magic incantation.