Information Theory Basics · 12 min

How surprised should you be?

Rare events carry more information than common ones. Turn that intuition into a single number, then average that number over a distribution to get entropy, the quantity behind the loss every language model in this course minimizes.


Which coin flip tells you more?

Two coins are flipped. The first is fair ($P(H) = 0.5$). The second is a trick coin that always lands heads ($P(H) = 1.0$).

The fair coin lands heads. Mildly interesting; you expected this half the time. The trick coin lands heads. Completely unsurprising; you knew this with certainty before it happened.

Information content is the name for how much news an outcome carries. The trick coin coming up heads has zero news in it. The fair coin coming up heads has one bit of news. Module 9 is about putting numbers on that intuition, and then averaging them into the single number every loss curve in this course will be measuring.

Information content: I(x) = −log P(x)

The formal definition has exactly the properties you want:

$$I(x) = -\log P(x).$$

Common outcomes (large $P$) carry little information ($-\log$ of something near 1 is near 0). Rare outcomes carry a lot ($-\log$ of something tiny is huge). And for independent events $A$ and $B$, information adds: $I(A \cap B) = -\log\bigl(P(A) P(B)\bigr) = I(A) + I(B)$. Two equally likely independent surprises are, together, twice as surprising as one. The formula respects every intuition you had in the last step.
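To make the additivity property concrete, here is a minimal Python sketch. The function name `information_content` and the example probabilities are illustrative choices, not part of the lesson's own code.

```python
import math

def information_content(p, base=2.0):
    """Surprise -log_base(p) of an outcome with probability p, 0 < p <= 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log(p, base)

# Two independent events with illustrative probabilities.
p_a, p_b = 0.5, 0.1

joint = information_content(p_a * p_b)                         # I(A and B)
separate = information_content(p_a) + information_content(p_b)  # I(A) + I(B)

# Independence means P(A and B) = P(A) * P(B), so the two values agree
# up to floating-point rounding.
print(joint, separate)
```

Because the log turns products into sums, the two printed values match for any pair of independent events, not just this one.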

(Convention: $0 \log 0 := 0$, because $\lim_{p \to 0^+} p \log p = 0$. We need this so impossible events don't blow up the average. The convention follows from the limit; it is not added just for convenience.)

A 1-in-8 outcome

An outcome has probability $P(x) = 1/8$.

What is $I(x)$ in bits?

Bits, nats, bans: same quantity, three units

You may have noticed I didn't say what base the logarithm uses. That's because the base changes the number but not the quantity being measured. Three conventions:

  • $\log_2$ → bits. Most natural for communication and storage.
  • $\ln$ → nats. Most natural for calculus, because $d(\ln x)/dx = 1/x$.
  • $\log_{10}$ → bans (or "hartleys"). Used by Turing's group at Bletchley Park; almost extinct otherwise.

Modern ML defaults to nats, because torch.log and numpy.log are natural logs and the gradients of common losses are cleaner. The choice of base is a unit choice, like meters vs. feet: the numbers differ by a constant factor, the quantity does not. The widget below converts among them; play with it for a moment.
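The conversion is a single multiplication by a constant. As a rough sketch (the `convert` helper and the `NATS_PER_UNIT` table are hypothetical names made up for this example):

```python
import math

# How many nats one unit is worth: 1 bit = ln 2 nats, 1 ban = ln 10 nats.
NATS_PER_UNIT = {"bits": math.log(2), "nats": 1.0, "bans": math.log(10)}

def convert(value, src, dst):
    """Convert an information quantity between 'bits', 'nats', and 'bans'."""
    return value * NATS_PER_UNIT[src] / NATS_PER_UNIT[dst]

one_bit_in_nats = convert(1.0, "bits", "nats")  # ln 2, about 0.693
one_ban_in_bits = convert(1.0, "bans", "bits")  # log2(10), about 3.32

print(f"{one_bit_in_nats:.4f} nats per bit, {one_ban_in_bits:.4f} bits per ban")
```

Routing every conversion through nats keeps the table to three entries instead of one entry per pair of units.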

[Interactive widget: unit converter. Example reading: 3.0000 bits = 2.0794 nats = 0.9031 bans; perplexity = e^(nats) = 2^(bits) = 8.000.]

3 bits in nats

Convert 3 bits to nats. (Two decimals is fine.)

Entropy is expected surprise

Information content is a property of a single outcome. Entropy is a property of a whole distribution. The move is exactly the one from module 8: take an expectation.

$$H(P) = \mathbb{E}_{x \sim P}\bigl[I(x)\bigr] = -\sum_{x} P(x) \log P(x).$$

In words: the average information content of a draw from $P$. The widget below shows three panels for the same pmf. The left panel is $P(x_i)$. The middle panel is the per-outcome surprise $-\log_2 P(x_i)$. The right panel is the product $P(x_i) \cdot (-\log_2 P(x_i))$, stacked into a single bar whose total height is exactly $H(P)$ in bits.

[Interactive widget, three panels: P(xᵢ) (draggable); −log₂ P(xᵢ) in bits; stacked P·(−log₂ P) = H. Default state: four outcomes x₁ to x₄, each with P = 0.25, surprise 2.00 bits, and contribution 0.50, giving H(P) = log₂ 4 = 2.000 bits = 1.386 nats.]

Drag any bar on the left; everything else updates in the same frame. The middle bars get taller as you shrink an outcome (rare means more surprising). But the right-panel stack (the entropy) depends on the product, so rarer-but-more-surprising contributions don't simply increase $H$. The two factors fight each other.
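The three panels described above can be reproduced in a few lines of Python. This is a sketch of the computation only, not the widget's actual code; the helper names and the skewed example pmf are assumptions chosen for illustration.

```python
import math

def entropy_bits(pmf):
    """Shannon entropy in bits, using the convention 0 * log 0 = 0."""
    if not math.isclose(sum(pmf), 1.0):
        raise ValueError("pmf must sum to 1")
    return sum(-p * math.log2(p) for p in pmf if p > 0)

def contributions(pmf):
    """Rows of (P, surprise in bits, P * surprise), skipping outcomes with P = 0."""
    return [(p, -math.log2(p), -p * math.log2(p)) for p in pmf if p > 0]

uniform = [0.25, 0.25, 0.25, 0.25]   # the widget's default state
skewed = [0.5, 0.25, 0.125, 0.125]   # illustrative non-uniform pmf

print(entropy_bits(uniform))  # 2.0
print(entropy_bits(skewed))   # 1.75

# Left, middle, and right panel values for the skewed pmf.
for p, surprise, term in contributions(skewed):
    print(f"P={p:.3f}  surprise={surprise:.2f} bits  contribution={term:.3f}")
```

In the skewed case the rarest outcomes have the largest surprise (3 bits) but contribute less to the total than the two more likely outcomes, which is the tug-of-war between the two factors.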

Maximum certainty

Click the One-hot (H = 0) preset. The pmf is now concentrated entirely on one outcome.

What is $H(P)$ in bits?

A slightly biased coin is still nearly maximum-entropy

Now switch to two outcomes (the widget has four bars, so either picture the panels collapsed to two or work directly from the binary entropy formula below). The fair coin $P = (0.5, 0.5)$ has $H = 1$ bit. A coin with $P = (0.6, 0.4)$ has

$$H_2(0.6) = -0.6 \log_2 0.6 - 0.4 \log_2 0.4 \approx 0.971 \text{ bits.}$$

Tilt the coin a full 60/40 and you barely move the entropy. Entropy is a concave function of the pmf and is maximized at the uniform distribution, so its first-order change vanishes there: small perturbations from uniformity have only a second-order effect on $H$. This is one reason a model can make qualitatively wrong predictions and still post a respectable cross-entropy.
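A short numerical check of the second-order claim. The quadratic approximation used here, $H_2(0.5 + d) \approx 1 - (2/\ln 2)\, d^2$ bits, is the standard Taylor expansion of binary entropy around $p = 0.5$; it is added for illustration and is not stated in the lesson itself. The function names are likewise illustrative.

```python
import math

def binary_entropy_bits(p):
    """Entropy in bits of a coin with P(heads) = p, with 0 * log 0 = 0."""
    return sum(-q * math.log2(q) for q in (p, 1.0 - p) if q > 0)

def quadratic_approx_bits(p):
    """Second-order Taylor expansion of binary entropy around p = 0.5."""
    d = p - 0.5
    return 1.0 - (2.0 / math.log(2)) * d * d

# Compare the exact value with the quadratic approximation.
for p in (0.5, 0.55, 0.6, 0.9):
    exact = binary_entropy_bits(p)
    approx = quadratic_approx_bits(p)
    print(f"p={p:.2f}  H2={exact:.4f}  quadratic={approx:.4f}")
```

Near $p = 0.5$ the two columns agree closely; by $p = 0.9$ the approximation has drifted, which shows where "small perturbation" stops being a fair description.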

Why ML defaults to nats (and why it doesn't matter)

You will see the same loss reported in three units by three different papers. PyTorch reports loss in nats by default (because log is natural). The compression community reports in bits per byte or bits per character. Information-theoretic textbooks switch freely. Two facts hold:

  1. The numerical value of the loss differs by a constant factor between units: $1 \text{ bit} = \ln 2 \approx 0.693 \text{ nats}$.
  2. The perplexity (which we'll meet in lesson 9.4) is base-invariant: $2^{\text{bits}} = e^{\text{nats}}$. So when you compare models by perplexity, you don't have to care.
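Both facts can be checked numerically. In this sketch the loss value is a made-up number standing in for whatever a training run reports, and the function names are illustrative.

```python
import math

def perplexity_from_nats(loss_nats):
    """Perplexity from a loss measured in nats."""
    return math.exp(loss_nats)

def perplexity_from_bits(loss_bits):
    """Perplexity from a loss measured in bits."""
    return 2.0 ** loss_bits

loss_nats = 1.5                       # hypothetical loss in nats
loss_bits = loss_nats / math.log(2)   # same loss in bits (fact 1)

# Fact 2: both routes give the same perplexity, up to rounding.
print(perplexity_from_nats(loss_nats), perplexity_from_bits(loss_bits))
```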

In practice, for the rest of this course: nats are the default; bits per character is reported for language-modeling benchmarks; the widget above lets you flip between them anytime.

Uniform on 32

A uniform distribution assigns probability $1/32$ to each of $32$ outcomes.

What is its entropy in bits?

Where this lands: entropy is the floor of the loss curve

The training loss your transformer minimizes is literally $\mathbb{E}[-\log Q(x)]$, the average surprise per token under your model $Q$. That's the next lesson's punchline: when the true distribution is $P$ and your model is $Q$, the average surprise you suffer is cross-entropy $H(P, Q)$, which is bounded below by $H(P)$, the entropy of the data itself.
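A small numerical illustration of that lower bound, ahead of the next lesson's proper treatment. The distributions `P` and `Q` are invented for the example, and the helper names are assumptions; this is a sketch, not how a training framework computes its loss.

```python
import math

def cross_entropy_nats(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x) in nats; outcomes with P(x) = 0 are skipped.

    Raises ValueError if Q assigns zero probability to an outcome P allows,
    since the cross-entropy is infinite in that case.
    """
    return sum(-px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def entropy_nats(p):
    """H(P) is the cross-entropy of P with itself."""
    return cross_entropy_nats(p, p)

P = [0.7, 0.2, 0.1]   # hypothetical true distribution
Q = [0.5, 0.3, 0.2]   # hypothetical model distribution

print(f"H(P)    = {entropy_nats(P):.4f} nats")
print(f"H(P, Q) = {cross_entropy_nats(P, Q):.4f} nats")
```

The gap between the two printed numbers is the cost of modeling `P` with `Q`; it shrinks to zero only when `Q` equals `P`.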

So entropy isn't an abstraction. It's the floor under every loss curve in this course. Shannon's estimate of the entropy of English (≈ 1.0–1.3 bits per character) is an estimate of where that floor sits for English text: no language model can beat the true entropy of its data on average. The m18 capstone's training run is a race toward that floor.

Next lesson: when two distributions meet (the truth $P$ and the model $Q$), what numbers describe their disagreement?
