Probability & Statistics · 12 min

What does probability measure?

Probability is a rule that hands numbers in [0, 1] to sets of outcomes. Random variables, pmfs, and densities are how we drag that rule onto the real line so we can do math.

Roll two dice. How many ways to get a 7?

You probably know the answer: six. $(1,6), (2,5), (3,4), (4,3), (5,2), (6,1)$. There are $36$ possible rolls total. So the probability of summing to seven is $6/36 = 1/6$.

You just used the only definition of probability we need to start: count the ways the thing could happen, divide by the total number of ways anything could happen. Everything in this module is a polished version of that move.

Two ingredients are hiding in that sentence. First, the set of all outcomes (every pair $(d_1, d_2)$), which we’ll call $\Omega$ (capital omega). Second, the event we cared about: the subset where $d_1 + d_2 = 7$. Call that $A$. Then $P(A) = |A| / |\Omega|$ when every outcome is equally likely.
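If you'd rather let a machine do the counting, here is a minimal sketch of that exact move. The names `omega`, `event_a`, and `p_a` are ours, chosen for illustration; exact fractions keep the arithmetic honest.

```python
from fractions import Fraction
from itertools import product

# Sample space: every ordered pair (d1, d2) from two six-sided dice.
omega = list(product(range(1, 7), repeat=2))

# Event A: the subset of outcomes whose faces sum to 7.
event_a = [(d1, d2) for d1, d2 in omega if d1 + d2 == 7]

# Equally likely outcomes, so P(A) = |A| / |Omega|.
p_a = Fraction(len(event_a), len(omega))

print(len(omega), len(event_a), p_a)  # 36 6 1/6
```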

Sample space, events

$\Omega$ (“omega”) is the sample space: the set of every possible outcome of whatever random thing we’re doing. For two dice, $\Omega$ has 36 elements. For a coin flip, $\Omega = \{H, T\}$. For “pick a real number between 0 and 1,” $\Omega$ is uncountable, but we’ll deal with that later.

An event is any subset $A \subseteq \Omega$. “Sum is 7” is one event. “First die is even” is another. “Both dice are 1” is an event with one element. “The dice land” is the event $A = \Omega$, which obviously happens with probability 1.

The whole game is: pick a sample space, pick events you care about, assign each event a number in $[0, 1]$ that says how likely it is. Probability is just that: a rule for handing out numbers to sets.

The three axioms (don't be scared)

The rule isn’t allowed to do whatever it wants. It has to satisfy three constraints:

  1. Non-negative. $P(A) \ge 0$ for every event. No negative probabilities.
  2. Normalized. $P(\Omega) = 1$. Something definitely happens.
  3. Additive on disjoint events. If $A$ and $B$ can’t both happen at once ($A \cap B = \emptyset$), then $P(A \cup B) = P(A) + P(B)$.

That’s it. Those three sentences are the entire foundation of probability. Everything else (Bayes’ theorem, expectations, the Gaussian, perplexity, the loss curve your language model will descend in module 14) is a consequence of three lines.

(There’s a fourth, deeper foundation called measure theory that gives this its full mathematical home. For everything you’ll ever do with a transformer, those three lines are enough. We mention measure theory so you know the rabbit hole exists. We are not going in.)
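To see the three axioms as something you can check rather than just read, here is a small sketch on the two-dice sample space. It assumes the counting rule (every outcome equally likely); `prob`, `sum_is_7`, and `both_ones` are illustrative names of ours, and a spot check on two events is of course not a proof.

```python
from fractions import Fraction
from itertools import product

# Two-dice sample space, stored as a set so events are literal subsets.
omega = frozenset(product(range(1, 7), repeat=2))

def prob(event):
    """Counting rule: P(A) = |A| / |Omega| for equally likely outcomes."""
    return Fraction(len(event & omega), len(omega))

sum_is_7 = frozenset(o for o in omega if sum(o) == 7)
both_ones = frozenset({(1, 1)})

# Axiom 1: non-negative.
assert prob(sum_is_7) >= 0 and prob(both_ones) >= 0
# Axiom 2: normalized.
assert prob(omega) == 1
# Axiom 3: additive on disjoint events (these two share no outcome).
assert sum_is_7 & both_ones == frozenset()
assert prob(sum_is_7 | both_ones) == prob(sum_is_7) + prob(both_ones)

print(prob(sum_is_7 | both_ones))  # 7/36
```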

Add two events that overlap

The third axiom only handles disjoint events. When the two events overlap, you have to subtract the double-count.

Given $P(A) = 0.6$, $P(B) = 0.5$, $P(A \cap B) = 0.3$, what is $P(A \cup B)$?
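The subtract-the-double-count rule is $P(A \cup B) = P(A) + P(B) - P(A \cap B)$. Here is a sketch that checks it on a different pair of events, so the exercise above stays yours: “sum is 7” and “first die is even” on the two-dice space. The helper `prob` and the set names are illustrative.

```python
from fractions import Fraction
from itertools import product

omega = set(product(range(1, 7), repeat=2))

def prob(event):
    """Counting rule for equally likely outcomes."""
    return Fraction(len(event), len(omega))

a = {o for o in omega if o[0] + o[1] == 7}  # sum is 7
b = {o for o in omega if o[0] % 2 == 0}     # first die is even

# Count the union directly, then rebuild it from the pieces.
lhs = prob(a | b)
rhs = prob(a) + prob(b) - prob(a & b)

print(lhs, rhs)  # 7/12 7/12
```

The outcomes $(2,5)$, $(4,3)$, and $(6,1)$ sit in both events; without the subtraction they would be counted twice.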

A random variable is just a function

So far events are subsets of $\Omega$. That makes sense for “is the sum seven?” But to do math we want to talk about the sum as a number we can add, average, square.

A random variable is a function $X: \Omega \to \mathbb{R}$ that turns each outcome into a number. For the two-dice example, $X = D_1 + D_2$ maps the outcome $(3, 4)$ to the number $7$. $X$ isn’t random; it’s a deterministic function, but the input is random, so the output ends up random too.

This is the most common notational sleight-of-hand in probability. When we write $P(X = 7)$, we mean $P(\{\omega \in \Omega : X(\omega) = 7\})$, the probability of the event “those outcomes for which $X$ spits out 7.” The function makes the event invisible. Now you know it’s there.
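In code the sleight-of-hand is easy to expose: the random variable is an ordinary function, and $P(X = 7)$ is a count over the outcomes it sends to 7. A minimal sketch, assuming equally likely dice outcomes; `prob_X_equals` is a hypothetical helper name.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))

def X(outcome):
    """Random variable X = D1 + D2: a deterministic map from outcome to number."""
    d1, d2 = outcome
    return d1 + d2

def prob_X_equals(value):
    # The event hidden inside "X = value": every outcome that X maps to it.
    preimage = [w for w in omega if X(w) == value]
    return Fraction(len(preimage), len(omega))

print(X((3, 4)))         # 7
print(prob_X_equals(7))  # 1/6
```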

Pmf and pdf: two shapes, one job

Almost every random variable you’ll meet in this course is one of two shapes.

Discrete $X$ takes values in a countable set, like the integers $\{0, 1, \ldots, 50256\}$, which is literally the vocabulary of a GPT-style tokenizer. Its distribution is a probability mass function $p_X(x) = P(X = x)$. Non-negative. Sums to 1.

Continuous $X$ takes values in $\mathbb{R}$, like the weight of a single parameter in your model. Its distribution is a probability density function $f_X(x)$. Non-negative. Integrates to 1: $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$. The probability of any single value is exactly zero; you have to ask about an interval: $P(a \le X \le b) = \int_a^b f_X(x)\,dx$.

Here is the trap that gets everyone. $f_X(x)$ is not a probability. It is a density, probability per unit $x$. Densities are allowed to exceed 1. Watch this:

Densities can be huge, they're not probabilities

Drag the inflection point of this Gaussian inward to shrink $\sigma$. The peak of the curve climbs straight past $1.0$ (the dashed line). If density were probability, that’d be a violation. But it’s fine: the curve gets narrower as it gets taller, and the area stays at 1.

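[Interactive widget: Gaussian density curve with draggable μ and σ, x-axis from -4 to 4, dashed line at density 1.0. Initial state: μ = 0.00, σ = 1.00, peak = 0.399; empirical mean and s not yet available.]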

Stop confusing density with probability; that’s the entire move. For a pmf, the height of a bar is the probability of that value. For a pdf, the height is just a density; only the area under the curve is a probability. Two different shapes, two different jobs.
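You can reproduce the tall-and-narrow effect numerically. The sketch below assumes the standard Gaussian density formula (the one this lesson writes out in its Gaussian section) and a hypothetical $\sigma = 0.2$; `area` is a plain midpoint-rule integrator of ours, not a library call.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian density: a height, not a probability."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def area(f, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

sigma = 0.2  # hypothetical narrow bell
peak = gaussian_pdf(0.0, sigma=sigma)
total = area(lambda x: gaussian_pdf(x, sigma=sigma), -2.0, 2.0)

# The peak is close to 2, well above 1; the area is still about 1.
print(round(peak, 3), round(total, 6))
```

With $\sigma = 1$ the same function gives a peak of about 0.399, matching the widget's starting state.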

Make a pdf valid

Consider $f(x) = c x$ on the interval $[0, 2]$, and $f(x) = 0$ elsewhere. For this to be a valid pdf, the area under the curve has to equal $1$.

What is $c$?
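The recipe is the same for any non-negative shape: measure its area, then divide by it. Here is a sketch on a deliberately different, hypothetical shape ($x^2$ on $[0, 3]$) so the exercise above stays unspoiled. The integrator `area` is a simple midpoint rule we wrote for illustration.

```python
def area(f, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def shape(x):
    # Hypothetical unnormalized shape, not the one from the exercise.
    return x ** 2

lo, hi = 0.0, 3.0

# The constant that makes the total area equal to 1.
c = 1.0 / area(shape, lo, hi)

def pdf(x):
    """Normalized density: c * shape inside [lo, hi], zero elsewhere."""
    return c * shape(x) if lo <= x <= hi else 0.0

print(round(c, 4))                   # 0.1111
print(round(area(pdf, lo, hi), 6))   # 1.0
```

By hand, $\int_0^3 x^2\,dx = 9$, so the constant for this shape is $1/9$.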

The three discrete distributions you'll see forever

Three pmfs underlie almost everything that follows in this course. Memorize their shapes; you’ll meet their names from now until the capstone.

Bernoulli $\mathrm{Bern}(p)$ is a single biased coin. $P(X=1) = p$, $P(X=0) = 1-p$. One parameter. The entire universe of binary classification (is this email spam?) is a Bernoulli with $p$ learned by a neural net.

Categorical $\mathrm{Cat}(\mathbf{p})$ generalizes Bernoulli to $K$ options. A probability vector $\mathbf{p} = (p_1, \ldots, p_K)$, each $p_k \ge 0$, summing to 1. The output of a language model at every position is a categorical over the vocabulary. This is the most important distribution in this course. Drag the bars to set $\mathbf{p}$, then click Drop one; a uniform random number falls onto the cumulative-sum staircase and lands in one of the bins. That is exactly how we’ll sample tokens from a trained transformer in module 18.

[Interactive widget: categorical sampler. Panels: p (drag the bars), CDF + u, samples so far. Initial bars: a = 0.100, b = 0.350, c = 0.250, d = 0.200, e = 0.100. Button: drop a ball. 0 samples.]
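The staircase trick fits in a few lines. This is a minimal sketch, not the widget's actual code: it uses the five bar heights shown above, a hypothetical helper `drop_one`, and an arbitrarily chosen seed so the run is repeatable.

```python
import random
from collections import Counter
from itertools import accumulate

labels = ["a", "b", "c", "d", "e"]
p = [0.100, 0.350, 0.250, 0.200, 0.100]

def drop_one(labels, p, u):
    """Return the label whose staircase step the uniform draw u lands on."""
    # accumulate(p) yields the cumulative sums: the tops of the staircase.
    for label, step in zip(labels, accumulate(p)):
        if u < step:
            return label
    return labels[-1]  # guard against float round-off in the last step

rng = random.Random(0)  # fixed seed, chosen arbitrarily
n = 10_000
counts = Counter(drop_one(labels, p, rng.random()) for _ in range(n))

# Empirical frequencies should sit close to the bar heights.
for label, pk in zip(labels, p):
    print(label, pk, counts[label] / n)
```

A linear scan is fine for five bins. For a vocabulary of tens of thousands of tokens you would more likely binary-search the cumulative sums or call a library sampler, but the idea is the same.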

Uniform over $\{1, \ldots, K\}$ is the special case $p_k = 1/K$: perfect ignorance. Every token equally likely. We’ll meet it in module 9 as the maximum-entropy distribution: the categorical that knows literally nothing.

The Gaussian, in three minutes

The continuous distribution that shows up everywhere in ML is the Gaussian (also called normal), parameterized by a mean $\mu$ and a standard deviation $\sigma$:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$

Two parameters, one shape, enormous downstream consequences. Weight initialization draws from a Gaussian. BatchNorm and LayerNorm assume Gaussian-ish activations. The “noise” in stochastic gradient descent is approximately Gaussian by the Central Limit Theorem, which we’ll meet two lessons from now.

Drag the red dot horizontally to move $\mu$. Drag the blue dot to widen or narrow $\sigma$. Sample 10k draws and watch the histogram (teal bars) settle onto the curve. The empirical mean $\bar x$ and empirical standard deviation $s$ are how you estimate $\mu$ and $\sigma$ from data; they’re chasing the true values as the sample grows.

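[Interactive widget: Gaussian density curve with draggable μ and σ and a sample histogram, x-axis from -4 to 4. Current state: μ = 0.50, σ = 1.20, peak = 0.332; empirical mean and s not yet available.]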

We’ll come back to this widget in lessons 8.3 and 8.5 to extract more out of it. For now, just notice: changing $\mu$ slides the bell. Changing $\sigma$ squishes or spreads it. The empirical histogram converges on the curve. Three different sentences, one picture.
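You can watch the estimates chase the true values without the widget. A sketch using only Python's standard library, with the widget's $\mu = 0.5$ and $\sigma = 1.2$; the helper `estimate` and the seed are illustrative choices of ours.

```python
import random
import statistics

mu, sigma = 0.5, 1.2  # the true parameters we pretend not to know

def estimate(n, rng):
    """Draw n Gaussian samples; return (empirical mean, empirical std)."""
    draws = [rng.gauss(mu, sigma) for _ in range(n)]
    x_bar = statistics.fmean(draws)
    s = statistics.stdev(draws)  # sample standard deviation (n - 1 divisor)
    return x_bar, s

rng = random.Random(0)  # fixed seed, chosen arbitrarily
for n in (10, 100, 10_000):
    x_bar, s = estimate(n, rng)
    print(n, round(x_bar, 3), round(s, 3))
```

Small samples can land noticeably off; by 10k draws both numbers should sit within a few hundredths of 0.5 and 1.2.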

Where this lands: the transformer view

Every line we’ll write about probability, expectations, sampling, and likelihood lives inside one of these shapes. The final layer of a GPT-style transformer takes a context and produces a probability vector over the vocabulary: a $\mathrm{Cat}(\mathbf{p})$ with $K \approx 50{,}000$. “Sampling a token” is dropping a uniform on the staircase you just played with. “Cross-entropy loss” (module 9) is built on the log of that vector. “Temperature” (module 17) is a one-line transform on the same vector before sampling.

The transformer is huge. The math underneath it is not. The Gaussian shows up when we initialize weights. The categorical shows up at every position of every forward pass. Bernoulli underlies any binary decision. These three distributions, plus the rules from the three axioms, get you to module 18.

Next lesson: when two random variables share a world (which is what a context is for a language model), how do joint, marginal, and conditional probabilities connect them? Bayes’ theorem is the punchline.
