Information Theory Basics · 15 min

Two distributions in the same room

When the truth is P but your model is Q, the average surprise you suffer is cross-entropy. The excess over H(P) is KL divergence, which is non-negative, asymmetric, and the thing label smoothing exists to fix.

0 / 0

A friend with the wrong dictionary

You are sending coded telegrams to a friend. The truth is that you write the letter e half the time and t the other half. So you’d want a code optimized for those frequencies: short codeword for e, short codeword for t.

But your friend has the wrong model of your letter habits. They assume you write e 90% of the time and t only 10%. They tune their code accordingly: a very short codeword for e, a punitively long one for t.

Every telegram you send costs more than it should. The extra cost, per letter, has a name: KL divergence. That’s the whole lesson.

Cross-entropy is the cost of believing Q when the truth is P

The formal definition mirrors entropy exactly. Replace the inner logP(x)\log P(x) with logQ(x)\log Q(x):

H(P,Q)  =  xP(x)logQ(x)  =  ExP ⁣[logQ(x)].H(P, Q) \;=\; -\sum_{x} P(x)\,\log Q(x) \;=\; \mathbb{E}_{x \sim P}\!\bigl[-\log Q(x)\bigr].

In words: the average surprise you actually suffer when the outcomes come from PP but you score them under QQ. Note the asymmetry baked into the formula: PP provides the weights, QQ provides the log. Swap them and you get a different number.

The widget below has two pmfs side by side. The red bars are PP (the truth). The blue bars are QQ (your model). The right panel renders H(P)H(P) and H(P,Q)H(P, Q) as nested bars; the orange band is the gap, which is exactly DKL(PQ)D_{\mathrm{KL}}(P \,\|\, Q). Drag any bar.

P (truth)Q (model)readoutheads0.50tails0.50heads0.90tails0.10H(P,Q) = 1.737H(P) = 1.000KL = 0.73701.91
H(P)1.000 bits
H(P, Q)1.737 bits
D_KL(P‖Q)0.737 bits

Gibbs' inequality, visible as a barrier

Lock PP on the widget (the checkbox), then drag QQ around. Try to make the orange H(P,Q)H(P, Q) bar shorter than the red H(P)H(P) bar. You can’t. As QQ approaches PP, the gap closes; as Q=PQ = P, the two bars touch and the KL band disappears. But H(P,Q)H(P, Q) never drops below H(P)H(P).

This is Gibbs’ inequality:

H(P,Q)    H(P),equality iff P=Q.H(P, Q) \;\ge\; H(P), \qquad \text{equality iff } P = Q.

The cleanest mental model: H(P)H(P) is the entropy of the truth, the irreducible average surprise you’d suffer with a perfect model. Any imperfect QQ costs you extra. That extra cost is the KL.

Truth-matches-model

P=(0.5,0.5)P = (0.5, 0.5) and Q=(0.5,0.5)Q = (0.5, 0.5), a fair coin matched perfectly by a fair-coin model.

What is H(P,Q)H(P, Q) in bits?

Model thinks the coin is biased

P=(0.5,0.5)P = (0.5, 0.5), a fair coin. Q=(0.9,0.1)Q = (0.9, 0.1), a model that thinks heads is overwhelmingly likely.

What is H(P,Q)H(P, Q) in bits? (Three decimals.)

KL divergence: the cost of being wrong

The amount by which cross-entropy exceeds entropy has its own name:

DKL(PQ)  =  H(P,Q)H(P)  =  xP(x)logP(x)Q(x).D_{\mathrm{KL}}(P \,\|\, Q) \;=\; H(P, Q) - H(P) \;=\; \sum_x P(x) \log \frac{P(x)}{Q(x)}.

Three properties to internalize:

  1. Non-negative. DKL(PQ)0D_{\mathrm{KL}}(P \,\|\, Q) \ge 0, with equality iff P=QP = Q. This is Gibbs’ inequality restated.
  2. Asymmetric. DKL(PQ)DKL(QP)D_{\mathrm{KL}}(P \,\|\, Q) \ne D_{\mathrm{KL}}(Q \,\|\, P) in general. Click the Swap P ↔ Q button and the numbers change.
  3. Infinite when Q has a zero where P doesn’t. If P(x)>0P(x) > 0 but Q(x)=0Q(x) = 0, the model said “this is impossible” and reality went and did it. The cost is unbounded. This is why label smoothing exists in classification: you never want your model to assign exact zero to any class.

Compute the KL

Using the values from the previous problems (H(P,Q)1.737H(P, Q) \approx 1.737 bits and H(P)=1H(P) = 1 bit), what is DKL(PQ)D_{\mathrm{KL}}(P \,\|\, Q)?

Asymmetry is not a bug, it's a different question

KL is not a distance. The word “divergence” exists precisely to warn you off that intuition. Cross-entropy is not a distance either; H(P,P)=H(P)H(P, P) = H(P), not zero.

Why does DKL(PQ)D_{\mathrm{KL}}(P \,\|\, Q) differ from DKL(QP)D_{\mathrm{KL}}(Q \,\|\, P)? Because they are asking different things. DKL(PQ)D_{\mathrm{KL}}(P \,\|\, Q) asks “if the truth is PP, how badly does the model QQ describe it?” (weights are over PP). DKL(QP)D_{\mathrm{KL}}(Q \,\|\, P) asks “if reality is QQ and we tried to use PP as a code, how much extra would it cost?” (weights are over QQ). Two questions, two answers, two numbers.

The continuous version makes this picture geometric. Below, PP (red) is a bimodal truth with two bumps. QQ (blue) is a single Gaussian, draggable. Both KL directions are computed by numerical integration.

-4-2024μσP (bimodal truth)Q (unimodal, drag to fit)
D_KL(P ‖ Q) 0.756 nats
D_KL(Q ‖ P) 1.243 nats
μ = 0.00 σ = 1.50

Try the two “Best Q” buttons. They snap to the optima for the two KL directions. Notice: the forward-KL (“mass-covering”) optimum is a wide Gaussian centered between the two bumps; it has to put mass everywhere PP does. The reverse-KL (“mode-seeking”) optimum is a narrow Gaussian parked entirely on one bump; it’s allowed to ignore mass it isn’t responsible for.

Why supervised ML uses forward KL (and not reverse)

You’ll see “forward KL” everywhere in this course. Here’s why: when you minimize H(P^data,Qθ)H(\hat P_{\text{data}}, Q_\theta) over your model’s parameters θ\theta (which is what we do on every loss step) the empirical P^data\hat P_{\text{data}} stays fixed and only QθQ_\theta moves. Subtract H(P^data)H(\hat P_{\text{data}}) (a constant in θ\theta) and you get DKL(P^dataQθ)D_{\mathrm{KL}}(\hat P_{\text{data}} \,\|\, Q_\theta), or forward KL. So the gradient descent step is mass-covering by construction.

Reverse KL (DKL(QθP)D_{\mathrm{KL}}(Q_\theta \,\|\, P)) shows up in variational inference and in some RL formulations; those are mode-seeking, sometimes intentionally. For our purposes (training generative LMs), forward KL is the right tool and it falls out of MLE automatically.

You don’t need to derive any of this in m9. Just know that “cross-entropy loss” and “forward KL up to an additive constant” are the same gradient.

Where this lands: every loss curve is a cross-entropy

The number your transformer’s training loop prints at every iteration is H(P^batch,Qθ)H(\hat P_{\text{batch}}, Q_\theta), the average cross-entropy of the model on the current batch. Subtract the (constant) data entropy and you have KL. Both interpretations are correct; pick the one that’s useful at the moment.

You now know enough about distributions-on-distributions to read 90% of the ML literature. The remaining piece is the specific case of classification: when PP is a one-hot label, what does this whole apparatus simplify to? That’s the next lesson, where softmax stops being a magic incantation.

Lesson complete

Nice tinkering.