Buy a lottery ticket. What is it worth?
A lottery ticket costs $1. With probability \(p = 10^{-7}\) (one in ten million) you win $1,000,000. Otherwise you win nothing.
Almost every individual ticket loses you a dollar. But what’s the average outcome, weighted by how often each outcome happens? Let \(X\) be the net winnings:
\[ E[X] = 10^{-7} \cdot 999{,}999 + (1 - 10^{-7}) \cdot (-1) = -0.90 \]
That number (about minus 90 cents per ticket) is the expectation of \(X\). Even though you’ll never see that exact outcome on a single ticket (each ticket either loses $1 or nets $999,999), it’s the right summary of “what happens on average over many tickets.”
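If you want to check the arithmetic, here is a minimal sketch that recomputes the ticket's expectation with exact fractions. The variable names are made up for this illustration; the numbers are the ones from the example.

```python
from fractions import Fraction

# Ticket parameters taken from the example above.
p_win = Fraction(1, 10_000_000)   # one in ten million
prize = 1_000_000                 # dollars paid out on a win
cost = 1                          # dollars paid for the ticket

# Net winnings for each outcome, paired with its probability.
outcomes = [
    (prize - cost, p_win),        # win: net +$999,999
    (-cost, 1 - p_win),           # lose: net -$1
]

# Probability-weighted sum of the net winnings.
expected_net = sum(value * prob for value, prob in outcomes)
print(expected_net, float(expected_net))   # prints: -9/10 -0.9
```

Exact fractions are used only so the result comes out as a clean minus 9/10 rather than a float that is merely close to it.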
Expectations are how distributions become single numbers. We’re about to see why three of them (mean, variance, covariance) are the only statistics the math underneath weight initialization, BatchNorm, and the noise in SGD actually cares about.
Expectation is a probability-weighted sum
For a discrete random variable \(X\) with pmf \(p(x)\), the expectation is just
\[ E[X] = \sum_x x \, p(x) \]
Multiply each possible value by its probability; add them up. For a continuous random variable with density \(f(x)\), replace the sum with an integral:
\[ E[X] = \int x \, f(x) \, dx \]
Same idea, two notations. The expectation has a name because it shows up everywhere, but at heart it is the same probability-weighted sum every time.
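To see the "same idea, two notations" point concretely, here is a small sketch, assuming a fair six-sided die for the discrete case and a uniform density on the interval from 0 to 1 for the continuous case. Both function names are invented for this example, and the integral is approximated with a plain midpoint Riemann sum.

```python
def discrete_expectation(values, probs):
    """Probability-weighted sum: sum over x of x * p(x)."""
    return sum(x * p for x, p in zip(values, probs))


def continuous_expectation(density, lo, hi, steps=100_000):
    """Midpoint Riemann sum approximating the integral of x * f(x) on [lo, hi]."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * width      # midpoint of the i-th slice
        total += x * density(x) * width
    return total


# Fair die: values 1..6, each with probability 1/6. True mean is 3.5.
die_mean = discrete_expectation(range(1, 7), [1 / 6] * 6)

# Uniform density on [0, 1]: f(x) = 1. True mean is 0.5.
uniform_mean = continuous_expectation(lambda x: 1.0, 0.0, 1.0)

print(die_mean, uniform_mean)
```

The integral version is the sum version with thinner and thinner slices, which is all the change of notation amounts to.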
Expectation is linear (this saves you constantly)
The single most-used fact about expectations:
\[ E[aX + bY] = a \, E[X] + b \, E[Y] \quad \text{for any constants } a, b \]
It does not require independence. It does not require anything. Linearity of expectation holds for every pair of random variables, including ones that are correlated, including ones where \(Y\) is just a copy of \(X\). Whenever a calculation has a sum inside an expectation, you can split it.
(A common trap, for later: \(E[XY] = E[X] \, E[Y]\) is the product rule, and it only holds under independence. Linearity holds always; multiplicativity does not. Don’t confuse the two.)
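A quick way to convince yourself: enumerate two tiny joint distributions exactly. This sketch assumes fair 0/1 coins; in one case \(Y\) is a copy of \(X\), in the other they are independent. The helper names `expect` and `check` are invented for the illustration.

```python
from fractions import Fraction
from itertools import product

half = Fraction(1, 2)


def expect(joint, g):
    """E[g(X, Y)] for a joint pmf given as {(x, y): probability}."""
    return sum(prob * g(x, y) for (x, y), prob in joint.items())


def check(joint):
    """Return (does linearity hold?, does the product rule hold?)."""
    e_x = expect(joint, lambda x, y: x)
    e_y = expect(joint, lambda x, y: y)
    e_sum = expect(joint, lambda x, y: x + y)
    e_prod = expect(joint, lambda x, y: x * y)
    return e_sum == e_x + e_y, e_prod == e_x * e_y


# Dependent case: Y is an exact copy of a fair 0/1 coin X.
copied = {(0, 0): half, (1, 1): half}
# Independent case: X and Y are two separate fair coins.
independent = {(x, y): half * half for x, y in product((0, 1), repeat=2)}

print("copy:", check(copied))              # (True, False)
print("independent:", check(independent))  # (True, True)
```

Linearity survives the fully dependent case; the product rule does not, because there \(E[XY] = E[X^2] = 1/2\) while \(E[X] \, E[Y] = 1/4\).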
Expectation of a categorical
\(X\) is categorical over the values \(x_1, \dots, x_k\) with probability vector \(p = (p_1, \dots, p_k)\).
What is \(E[X]\)?
Variance: spread, in squared units
Two distributions can have the same mean and look completely different. The mean of a fair 0/1 coin flip is \(0.5\). The mean of “always return \(0.5\)” is also \(0.5\). They are not the same distribution.
Variance measures how much a distribution spreads out around its mean:
\[ \mathrm{Var}(X) = E\big[(X - E[X])^2\big] = E[X^2] - (E[X])^2 \]
The right-hand identity is worth knowing: sometimes the squared form \(E[X^2] - (E[X])^2\) is easier to compute than the centered form. The standard deviation \(\sigma = \sqrt{\mathrm{Var}(X)}\) has the same units as \(X\) itself, which is why we usually report standard deviations to humans and variances to formulas.
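As a concrete check, here is a sketch that computes the variance of a fair six-sided die both ways, using exact fractions. The die is an assumed example, not one the text fixes.

```python
from fractions import Fraction
from math import sqrt

values = range(1, 7)          # faces of a fair die
prob = Fraction(1, 6)         # each face is equally likely

mean = sum(prob * x for x in values)                          # E[X] = 7/2

# Centered form: E[(X - E[X])^2]
var_centered = sum(prob * (x - mean) ** 2 for x in values)

# Squared form: E[X^2] - (E[X])^2
var_shortcut = sum(prob * x * x for x in values) - mean ** 2

# Standard deviation, back in the same units as X.
std = sqrt(var_centered)

print(mean, var_centered, var_shortcut, std)
```

Both forms give 35/12, and the standard deviation is roughly 1.71 pips, a number you can compare directly against the faces of the die.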
Variances add (under independence). Standard deviations don't.
Take \(X_1, \dots, X_n\) independent, each with variance \(\sigma^2\). Let \(S = \sum_{i=1}^n X_i\) and \(\bar{X} = S / n\). Then
\[ \mathrm{Var}(S) = n \sigma^2, \qquad \mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n} \]
The first formula is why neural-network weight initialization scales by \(1/\sqrt{n}\): when \(n\) inputs each contribute a Gaussian unit of noise, the variance of the sum scales with \(n\), so its standard deviation scales with \(\sqrt{n}\). Dividing by \(\sqrt{n}\) keeps the signal’s variance roughly constant from layer to layer. That single fact (variances add under independence) is the heart of the derivation of He/Xavier initialization, which we’ll re-derive carefully in module 13.
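Here is a rough simulation of that scaling argument. It is a simplified sketch, not the actual He or Xavier formula: it assumes 128 independent unit-variance Gaussian inputs, Gaussian weights, no nonlinearity, and a made-up helper called `preactivation`.

```python
import random
from statistics import pvariance

random.seed(0)   # fixed seed so the run is repeatable


def preactivation(n_inputs, weight_std):
    """One weighted sum of n_inputs unit-variance inputs with random weights."""
    return sum(
        random.gauss(0.0, weight_std) * random.gauss(0.0, 1.0)
        for _ in range(n_inputs)
    )


n_inputs, trials = 128, 1000

# Weights with standard deviation 1: variance of the sum grows like n_inputs.
unscaled = [preactivation(n_inputs, 1.0) for _ in range(trials)]

# Weights with standard deviation 1/sqrt(n_inputs): variance stays near 1.
scaled = [preactivation(n_inputs, n_inputs ** -0.5) for _ in range(trials)]

var_unscaled = pvariance(unscaled)
var_scaled = pvariance(scaled)
print(var_unscaled, var_scaled)
```

With these settings the first variance should land near 128 and the second near 1, up to sampling noise.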
The second formula is the reason averages over more samples are tighter estimates. Quadruple your sample size, you halve your error bar.
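Both formulas can be verified exactly rather than by simulation. The sketch below assumes fair dice again: it builds the pmf of a sum of \(n\) independent dice by repeated convolution and compares the variances. The function names are invented for this example.

```python
from fractions import Fraction


def convolve(pmf_a, pmf_b):
    """pmf of A + B for independent A and B, each given as {value: probability}."""
    out = {}
    for a, pa in pmf_a.items():
        for b, pb in pmf_b.items():
            out[a + b] = out.get(a + b, 0) + pa * pb
    return out


def variance(pmf):
    """Variance of a pmf given as {value: probability}."""
    mean = sum(p * x for x, p in pmf.items())
    return sum(p * (x - mean) ** 2 for x, p in pmf.items())


die = {face: Fraction(1, 6) for face in range(1, 7)}
sigma2 = variance(die)   # variance of a single roll


def var_of_sum(n):
    """Exact variance of the sum of n independent fair dice."""
    total = die
    for _ in range(n - 1):
        total = convolve(total, die)
    return variance(total)


for n in (1, 4, 16):
    var_sum = var_of_sum(n)
    var_mean = var_sum / n ** 2   # Var(S / n) = Var(S) / n^2
    print(n, var_sum == n * sigma2, var_mean == sigma2 / n)
```

Going from 4 dice to 16 cuts the variance of the average by a factor of four, so the standard deviation, the error bar, is halved.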
Variance of a Bernoulli
Let \(X \sim \mathrm{Bernoulli}(p)\).
What is \(\mathrm{Var}(X)\)?
Covariance: how two variables co-vary
Variance is one variable’s spread. Covariance generalizes to two:
\[ \mathrm{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big] = E[XY] - E[X] \, E[Y] \]
Positive covariance means “when \(X\) is above its mean, \(Y\) tends to be too.” Negative covariance means “when \(X\) is high, \(Y\) tends to be low.” Zero covariance means no linear association, though it does not mean independence; nonlinear relationships can have zero covariance.
Notice the symmetric structure with variance: \(\mathrm{Var}(X) = \mathrm{Cov}(X, X)\). They are the same operation; variance is just covariance with itself.
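The "zero covariance is not independence" warning is easy to make concrete. This sketch uses a standard textbook example, chosen here as an assumption: \(X\) uniform on \(\{-1, 0, 1\}\) and \(Y = X^2\), so \(Y\) is completely determined by \(X\).

```python
from fractions import Fraction

third = Fraction(1, 3)

# Joint pmf of (X, Y) with X uniform on {-1, 0, 1} and Y = X**2.
joint = {(x, x * x): third for x in (-1, 0, 1)}


def expect(g):
    """E[g(X, Y)] under the joint pmf above."""
    return sum(p * g(x, y) for (x, y), p in joint.items())


e_x = expect(lambda x, y: x)
e_y = expect(lambda x, y: y)

# Covariance, computed with both forms of the definition.
cov_centered = expect(lambda x, y: (x - e_x) * (y - e_y))
cov_shortcut = expect(lambda x, y: x * y) - e_x * e_y

# Independence would require P(X=0, Y=0) == P(X=0) * P(Y=0).
p_joint = joint[(0, 0)]
p_x0 = sum(p for (x, _), p in joint.items() if x == 0)
p_y0 = sum(p for (_, y), p in joint.items() if y == 0)
is_independent = p_joint == p_x0 * p_y0

print(cov_centered, cov_shortcut, is_independent)   # prints: 0 0 False
```

The covariance is exactly zero because the relationship is a symmetric parabola, with no linear trend for covariance to pick up.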
Covariance is a dot product of centered data (pay attention here)
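Here’s the surprise. Given paired observations \((x_1, y_1), \dots, (x_n, y_n)\), define the centered vectors \(\tilde{x} = (x_1 - \bar{x}, \dots, x_n - \bar{x})\) and \(\tilde{y}\) likewise. Then the sample covariance is
\[ \widehat{\mathrm{Cov}}(x, y) = \frac{1}{n - 1} \, \tilde{x} \cdot \tilde{y} \]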
Covariance is literally a dot product. The same operation you spent a whole module on in module 7. The “similarity” interpretation of the dot product, “do these two vectors point the same way?”, is exactly the covariance question, “do these two variables tend to move together?”
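The sister identity is just as good. Correlation (the Pearson coefficient) is the cosine of the angle between \(\tilde{x}\) and \(\tilde{y}\):
\[ r = \frac{\tilde{x} \cdot \tilde{y}}{\lVert \tilde{x} \rVert \, \lVert \tilde{y} \rVert} = \cos \theta \]
So \(r\) lives in \([-1, 1]\), by the Cauchy–Schwarz inequality you already know.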
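The two identities translate almost word for word into code. This is a minimal sketch with made-up data and helper names; the \(n - 1\) divisor in the covariance is a choice made here (dividing by \(n\) would rescale the covariance but leave \(r\) unchanged).

```python
from math import sqrt


def centered(values):
    """Subtract the mean from every entry."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def sample_cov(xs, ys):
    """Dot product of the centered vectors, divided by n - 1."""
    return dot(centered(xs), centered(ys)) / (len(xs) - 1)


def pearson_r(xs, ys):
    """Cosine of the angle between the centered vectors."""
    cx, cy = centered(xs), centered(ys)
    return dot(cx, cy) / (sqrt(dot(cx, cx)) * sqrt(dot(cy, cy)))


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
rising = [2.0, 4.0, 6.0, 8.0, 10.0]    # exactly 2 * xs
falling = [5.0, 4.0, 3.0, 2.0, 1.0]    # xs reversed
noisy = [2.0, 1.0, 4.0, 3.0, 5.0]      # mostly rising

print(sample_cov(xs, rising), pearson_r(xs, rising))
print(pearson_r(xs, falling))
print(pearson_r(xs, noisy))
```

A perfectly rising line gives a cosine of 1, a perfectly falling one gives minus 1, and the noisy pairing lands in between at about 0.8.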
Drag points; watch r move
Drag the red points into a straight diagonal. Watch \(r\) approach \(+1\). Drag them into a perpendicular line. Watch \(r\) approach \(0\). Drag them into the opposite diagonal. Watch \(r\) approach \(-1\).
The coral arrows from the teal centroid are the centered vectors as 2D displacements. The covariance number is the running average of the products of their \(x\)-component and \(y\)-component. The number \(r\) is the cosine of the angle between \(\tilde{x}\) and \(\tilde{y}\) viewed as vectors in \(\mathbb{R}^n\). Two ways of looking at the same arithmetic.
Sample versus population: what changes when you only have data
So far the formulas have been about the population, the underlying distribution, with \(\mu\) and \(\sigma^2\) and a known pmf. In practice you have a sample: \(n\) data points, period.
The natural estimators are
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i, \qquad s^2 = \frac{1}{n - 1} \sum_{i=1}^n (x_i - \bar{x})^2 \]
The sample mean \(\bar{x}\) estimates \(\mu\). The sample variance \(s^2\) estimates \(\sigma^2\). The reason the variance denominator is \(n - 1\) rather than \(n\) is “Bessel’s correction”: by using \(\bar{x}\) instead of the (unknown) true \(\mu\), we’ve eaten one degree of freedom, and dividing by \(n - 1\) makes the estimator unbiased. We’ll come back to this when we derive MLE for a Gaussian in the next lesson; the maximum-likelihood estimator of variance divides by \(n\) and is biased.
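You can watch the bias appear without any randomness. The sketch below assumes a fair die and samples of size 3, enumerates all 216 equally likely samples, and averages each estimator over them. The helper name `sum_sq_dev` is invented for the illustration.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
true_var = Fraction(35, 12)   # variance of one fair die roll


def sum_sq_dev(sample):
    """Sum of squared deviations from the sample's own mean."""
    mean = Fraction(sum(sample), len(sample))
    return sum((x - mean) ** 2 for x in sample)


n = 3
samples = list(product(faces, repeat=n))   # all 6**n equally likely samples

# Average each estimator over every possible sample: its exact expectation.
avg_unbiased = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)
avg_biased = sum(sum_sq_dev(s) / n for s in samples) / len(samples)

print(avg_unbiased == true_var)                        # True
print(avg_biased == true_var * Fraction(n - 1, n))     # True
```

Dividing by \(n\) undershoots the true variance by exactly the factor \((n - 1)/n\), which is the degree of freedom the sample mean used up.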
The law of large numbers, in one line
Sample averages converge to the true mean as the sample grows:
\[ \bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i \longrightarrow \mu \quad \text{as } n \to \infty \]
Roll the same die 10 times; \(\bar{X}_n\) might be 2.7. Roll it 10,000 times; \(\bar{X}_n\) will be very close to 3.5. The LLN is what makes empirical estimation work at all.
Use the Gaussian widget below to see it. Pick a \(\mu\) and \(\sigma\). Click “Sample 10k.” Watch the empirical mean and empirical standard deviation in the readout converge on the true values you set.
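If you would rather roll the die in code, here is a small simulation sketch. The seed and the sample sizes are arbitrary choices for this example, so the exact printed averages depend on them.

```python
import random

random.seed(0)   # fixed seed so the run is repeatable


def mean_of_rolls(n):
    """Average of n simulated rolls of a fair six-sided die."""
    return sum(random.randint(1, 6) for _ in range(n)) / n


# Sample means for growing sample sizes; the true mean is 3.5.
means = {n: mean_of_rolls(n) for n in (10, 100, 10_000, 100_000)}

for n, value in means.items():
    print(n, value, abs(value - 3.5))
```

The small samples can miss 3.5 by a few tenths, while the 100,000-roll average is typically within a few hundredths of it.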
The central limit theorem, in one line
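Take any distribution with mean \(\mu\) and finite variance \(\sigma^2\) (any shape at all). Average \(n\) i.i.d. draws from it. As \(n\) grows, the distribution of the (properly scaled) sample mean approaches a Gaussian:
\[ \sqrt{n} \, (\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2) \]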
This is the central limit theorem (CLT). It is the reason error bars in physics, the noise in stochastic gradient descent, and the histogram of “average of a hundred rolls” all look Gaussian, regardless of what they started as. Sums and averages of many independent things forget their individual shape.
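Here is a sketch of that forgetting in action, assuming a deliberately lopsided starting point: the exponential distribution with rate 1, which has mean 1 and standard deviation 1. If the standardized averages are close to Gaussian, about 95% of them should fall within 1.96 of zero. The sample size, trial count, and seed are arbitrary choices for the example.

```python
import random
from math import sqrt

random.seed(0)   # fixed seed so the run is repeatable

n, trials = 100, 2000
mu, sigma = 1.0, 1.0   # mean and standard deviation of Exponential(rate=1)


def standardized_mean():
    """sqrt(n) * (sample mean - mu) / sigma for one batch of n skewed draws."""
    draws = [random.expovariate(1.0) for _ in range(n)]
    sample_mean = sum(draws) / n
    return sqrt(n) * (sample_mean - mu) / sigma


z_values = [standardized_mean() for _ in range(trials)]

# For a standard Gaussian, about 95% of the mass lies within +/- 1.96.
within = sum(abs(z) <= 1.96 for z in z_values) / trials
print(within)
```

Individual exponential draws are badly skewed, yet the averages of a hundred of them already behave very nearly like a bell curve.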
(Big caveat: the CLT is about averages, not individual data points. Token frequencies, file sizes, incomes: those are not Gaussian. Intuition built on the CLT often does not transfer to raw quantities like these; we’ll see this trip up loss-curve interpretations in module 13.)
We are not going to use the CLT to do confidence intervals; those are out of scope. We mention it because every “noise is Gaussian” sentence in the rest of this course derives from this one line.
Where this lands: three numbers, used everywhere
Expectation \(E[\cdot]\) is in every loss function: the loss we minimize is the expectation of the per-example NLL. Variance \(\mathrm{Var}(\cdot)\) is in every weight initializer: the unit-variance rule keeps signal-flow predictable. Mean and variance together are in every BatchNorm / LayerNorm: both compute a mean and subtract; both compute a variance and divide by its square root.
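The "subtract a mean, divide by the square root of a variance" step shared by BatchNorm and LayerNorm can be sketched in a few lines. This is only the core arithmetic under stated assumptions: a plain list of activations, a small `eps` added for numerical safety, the divide-by-\(n\) variance, and no learned scale or shift, all of which are simplifications chosen for this example.

```python
from math import sqrt
from statistics import fmean, pvariance


def standardize(values, eps=1e-5):
    """Subtract the mean, then divide by sqrt(variance + eps)."""
    mean = fmean(values)
    var = pvariance(values, mean)   # population variance (divides by n)
    return [(v - mean) / sqrt(var + eps) for v in values]


# Hypothetical activations with mean 5 and population variance 4.
activations = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
normalized = standardize(activations)

print(fmean(normalized), pvariance(normalized))
```

After the transform the values have mean essentially 0 and variance essentially 1. The two layers differ in which values they pool into the mean and variance, not in this arithmetic.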
The transformer is doing one of these three things on almost every line. Next lesson: how all of statistical learning (every loss, every “train this thing to fit data” command) is one principle, applied over and over. Maximum likelihood. That’s the next stop.