Probability & Statistics · 18 min

What you can say about a random variable in one number

Mean, variance, covariance: three summary statistics that capture almost everything we need from a distribution. Weight initialization, BatchNorm, and the noise in stochastic gradient descent all live inside these three numbers.


Buy a lottery ticket. What is it worth?

A lottery ticket costs $1. With probability \(10^{-7}\) (one in ten million) you win $1,000,000. Otherwise you win nothing.

Almost every individual ticket loses you a dollar. But what’s the average outcome, weighted by how often each outcome happens? Let \(X\) be the net winnings:

\[ \mathbb{E}[X] \;=\; \underbrace{10^{-7} \cdot 999{,}999}_{\text{win case}} \;+\; \underbrace{(1 - 10^{-7}) \cdot (-1)}_{\text{lose case}} \;\approx\; -0.9. \]

That number (about minus 90 cents per ticket) is the expectation of \(X\). Even though you’ll never see the \(-0.9\) outcome on a single ticket (each ticket either loses $1 or nets $999,999), it’s the right summary of “what happens on average over many tickets.”
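The arithmetic above can be checked exactly. This is a minimal sketch using Python's `fractions` module so that no floating-point rounding gets in the way; the names `p_win`, `outcomes`, and `expected_net` are illustrative, not part of the lesson.

```python
from fractions import Fraction

# Each entry pairs a net outcome in dollars with its probability.
p_win = Fraction(1, 10_000_000)
outcomes = [
    (999_999, p_win),     # win $1,000,000 after paying $1 for the ticket
    (-1, 1 - p_win),      # lose the $1 ticket price
]

# A valid distribution has probabilities that sum to one.
total_probability = sum(p for _, p in outcomes)

# Expectation: multiply each outcome by its probability and add.
expected_net = sum(value * p for value, p in outcomes)
print(expected_net, float(expected_net))
```

With exact fractions the two terms combine to exactly nine tenths of a dollar lost per ticket.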

Expectations are how distributions become single numbers. We’re about to see why three of them (mean, variance, covariance) are the only statistics that the math underneath weight initialization, BatchNorm, and the noise in SGD actually cares about.

Expectation is a probability-weighted sum

For a discrete random variable with pmf \(p\), the expectation is just

\[ \mathbb{E}[X] \;=\; \sum_x x \cdot p(x). \]

Multiply each possible value by its probability; add them up. For a continuous random variable with density \(f\), replace the sum with an integral:

\[ \mathbb{E}[X] \;=\; \int x \, f(x) \, dx. \]

Same idea, two notations. The expectation has a name because it shows up everywhere, but at heart it is the same probability-weighted sum every time.
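To see that the sum and the integral are the same recipe, here is a rough sketch that computes one expectation of each kind. The fair die and the density \(f(x) = 2x\) on \([0, 1]\) are examples chosen for illustration, and the integral is approximated with a simple midpoint rule rather than solved analytically.

```python
from fractions import Fraction

def discrete_expectation(pmf):
    """Probability-weighted sum for a pmf given as {value: probability}."""
    return sum(value * prob for value, prob in pmf.items())

def continuous_expectation(density, lo, hi, steps=1000):
    """Midpoint-rule approximation of the integral of x * f(x) over [lo, hi]."""
    dx = (hi - lo) / steps
    midpoints = (lo + (i + 0.5) * dx for i in range(steps))
    return sum(x * density(x) * dx for x in midpoints)

# Discrete case: a fair six-sided die.
fair_die = {face: Fraction(1, 6) for face in range(1, 7)}
die_mean = discrete_expectation(fair_die)

# Continuous case: density f(x) = 2x on [0, 1], whose exact mean is 2/3.
triangle_mean = continuous_expectation(lambda x: 2 * x, 0.0, 1.0)

print(die_mean, triangle_mean)
```

The midpoint rule is itself a weighted sum: each slice contributes its value times the probability mass `f(x) * dx` in that slice.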

Expectation is linear (this saves you constantly)

The single most-used fact about expectations:

\[ \mathbb{E}[a X + b Y] \;=\; a\, \mathbb{E}[X] + b\, \mathbb{E}[Y]. \]

It does not require independence. It does not require anything beyond the expectations themselves being finite. Linearity of expectation holds for every pair of random variables, including ones that are correlated, including ones where \(Y\) is just a copy of \(X\). Whenever a calculation has a sum inside an expectation, you can split it.

(A common trap, for later: \(\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]\) is the product rule, and it is only guaranteed under independence. Linearity holds always; multiplicativity does not. Don’t confuse the two.)
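A small exact check makes the contrast concrete. This sketch takes the most dependent case possible, \(Y = X\) with \(X\) a fair die roll (an example chosen here, not taken from the lesson), and compares both rules; the helper `expect` is a hypothetical name.

```python
from fractions import Fraction

fair_die = {face: Fraction(1, 6) for face in range(1, 7)}

def expect(g):
    """E[g(X)] for X a fair die roll, computed exactly."""
    return sum(g(x) * p for x, p in fair_die.items())

e_x = expect(lambda x: x)

# Linearity with Y = X (perfectly dependent): E[2X + 3Y] vs 2E[X] + 3E[Y].
lhs_linear = expect(lambda x: 2 * x + 3 * x)
rhs_linear = 2 * e_x + 3 * e_x

# Product rule with Y = X: E[XY] is E[X^2], which differs from E[X]^2.
e_xy = expect(lambda x: x * x)
product_of_means = e_x * e_x

print(lhs_linear == rhs_linear)     # linearity survives dependence
print(e_xy == product_of_means)     # the product rule does not
```

The gap between `e_xy` and `product_of_means` is exactly the variance of the die, which is where the next section picks up.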

Expectation of a categorical

\(X\) is categorical over the values \(\{0, 1, 2\}\) with probability vector \((0.1, 0.5, 0.4)\).

What is \(\mathbb{E}[X]\)?

Variance: spread, in squared units

Two distributions can have the same mean and look completely different. The mean of a coin flip is \(0.5\). The mean of “always return \(0.5\)” is also \(0.5\). They are not the same distribution.

Variance measures how much a distribution spreads out around its mean:

\[ \mathrm{Var}(X) \;=\; \mathbb{E}\!\left[(X - \mu)^2\right] \;=\; \mathbb{E}[X^2] - \mathbb{E}[X]^2. \]

Here \(\mu = \mathbb{E}[X]\). The right-hand identity is worth knowing: sometimes the squared form is easier to compute than the centered form. The standard deviation \(\sigma = \sqrt{\mathrm{Var}(X)}\) has the same units as \(X\) itself, which is why we usually report standard deviations to humans and variances to formulas.
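As a quick sanity check that the two forms of the variance agree, here is a sketch that evaluates both exactly for a fair six-sided die (an illustrative choice; the variable names are not from the lesson).

```python
from fractions import Fraction
import math

fair_die = {face: Fraction(1, 6) for face in range(1, 7)}

def expect(g):
    """E[g(X)] for X a fair die roll, computed exactly."""
    return sum(g(x) * p for x, p in fair_die.items())

mu = expect(lambda x: x)

# Centered form: E[(X - mu)^2].
var_centered = expect(lambda x: (x - mu) ** 2)

# Shortcut form: E[X^2] - E[X]^2.
var_shortcut = expect(lambda x: x * x) - mu ** 2

# Standard deviation, back in the same units as X.
sigma = math.sqrt(var_centered)

print(var_centered, var_shortcut, sigma)
```

One practical note, stated as a general caution rather than something the lesson covers: in floating point the shortcut form can lose precision when the mean is large relative to the spread, so the centered form is often the safer one to code.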

Variances add (under independence). Standard deviations don't.

Take \(X_1, X_2, \ldots, X_n\) independent, each with variance \(\sigma^2\). Let \(S_n = X_1 + \cdots + X_n\) and \(\bar X_n = S_n / n\). Then

\[ \mathrm{Var}(S_n) = n \sigma^2, \qquad \mathrm{Var}(\bar X_n) = \frac{\sigma^2}{n}. \]

The first formula is why neural-network weight initialization scales by \(1 / \sqrt{\text{fan-in}}\): when \(n\) inputs each contribute an independent unit of noise, the variance of the sum scales with \(n\), so its standard deviation scales with \(\sqrt{n}\). Dividing by \(\sqrt{n}\) keeps the signal’s variance roughly constant from layer to layer. That single fact (variances add under independence) is the core of the derivation of He/Xavier initialization, which we’ll re-derive carefully in module 13.
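The scaling argument can be tried out with a small simulation. This is a rough sketch, not a real initializer: it assumes unit-variance Gaussian inputs, independent zero-mean Gaussian weights, and a fan-in of 64, all chosen here for illustration, and the function name `preactivation` is hypothetical.

```python
import math
import random
import statistics

random.seed(0)
fan_in = 64
trials = 2000

def preactivation(weight_std):
    """One weighted sum of fan_in unit-variance inputs with N(0, weight_std^2) weights."""
    return sum(
        random.gauss(0.0, weight_std) * random.gauss(0.0, 1.0)
        for _ in range(fan_in)
    )

# Unit-variance weights: the variance of the sum grows with fan_in.
unscaled = [preactivation(1.0) for _ in range(trials)]

# Weights with standard deviation 1 / sqrt(fan_in): the variance stays near 1.
scaled = [preactivation(1.0 / math.sqrt(fan_in)) for _ in range(trials)]

var_unscaled = statistics.variance(unscaled)
var_scaled = statistics.variance(scaled)
print(var_unscaled, var_scaled)
```

The unscaled variance should land near the fan-in and the scaled one near 1, up to sampling noise from the finite number of trials.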

The second formula is the reason averages over more samples are tighter estimates. Quadruple your sample size, you halve your error bar.
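Both formulas can be verified exactly on a small case by enumerating every outcome. The sketch below uses \(n\) fair dice (an illustrative choice) and exact fractions; `variance` here is a local helper for equally likely outcomes, not a library function.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)

def variance(values):
    """Population variance of equally likely values, computed exactly."""
    values = [Fraction(v) for v in values]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

sigma2 = variance(faces)  # variance of a single fair die

for n in (1, 2, 3, 4):
    rolls = list(product(faces, repeat=n))  # all 6**n equally likely outcomes
    var_sum = variance(sum(r) for r in rolls)
    var_mean = variance(Fraction(sum(r), n) for r in rolls)
    print(n, var_sum, var_mean)
    assert var_sum == n * sigma2     # Var(S_n) = n * sigma^2
    assert var_mean == sigma2 / n    # Var(mean) = sigma^2 / n
```

Going from one die to four quarters the variance of the average, which halves its standard deviation.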

Variance of a Bernoulli

Let \(X \sim \mathrm{Bern}(0.3)\).

What is \(\mathrm{Var}(X)\)?

Covariance: how two variables co-vary

Variance is one variable’s spread. Covariance generalizes to two:

\[ \mathrm{Cov}(X, Y) \;=\; \mathbb{E}\!\left[(X - \mu_X)(Y - \mu_Y)\right]. \]

Positive covariance means “when \(X\) is above its mean, \(Y\) tends to be too.” Negative covariance means “when \(X\) is high, \(Y\) tends to be low.” Zero covariance means no linear association, though it does not mean independence; nonlinear relationships can have zero covariance.

Notice the symmetric structure with variance: \(\mathrm{Var}(X) = \mathrm{Cov}(X, X)\). They are the same operation; variance is just covariance with itself.
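The claim that zero covariance does not imply independence is easy to check on a tiny example. In this sketch (an example chosen here, not from the lesson) \(X\) is uniform on \(\{-1, 0, 1\}\) and \(Y = X^2\), so \(Y\) is completely determined by \(X\), yet the covariance is exactly zero.

```python
from fractions import Fraction

# X is uniform on {-1, 0, 1}; Y = X**2 is a deterministic function of X.
xs = [-1, 0, 1]
p = Fraction(1, 3)

e_x = sum(x * p for x in xs)
e_y = sum(x ** 2 * p for x in xs)

# Cov(X, Y) = E[(X - E[X]) (Y - E[Y])], summed over the three outcomes.
cov_xy = sum((x - e_x) * (x ** 2 - e_y) * p for x in xs)

# Independence would require P(X=0, Y=0) == P(X=0) * P(Y=0).
p_joint = p          # X = 0 forces Y = 0, so the joint probability is 1/3
p_product = p * p    # P(X=0) = 1/3 and P(Y=0) = 1/3

print(cov_xy, p_joint, p_product)
```

The relationship is a parabola, which is symmetric around the mean of \(X\), so the positive and negative products cancel.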

Covariance is a dot product of centered data (pay attention here)

Here’s the surprise. Given \(n\) paired observations \(\{(x_i, y_i)\}_{i=1}^n\), define the centered vectors \(\tilde x = (x_1 - \bar x, \ldots, x_n - \bar x)\) and \(\tilde y\) likewise. Then the sample covariance is

\[ \widehat{\mathrm{Cov}}(X, Y) \;=\; \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y) \;=\; \frac{1}{n-1} \langle \tilde x, \tilde y \rangle. \]

Covariance is literally a dot product. The same operation you spent a whole module on in module 7. The “similarity” interpretation of the dot product, “do these two vectors point the same way?”, is exactly the covariance question, “do these two variables tend to move together?”

The sister identity is just as good. Correlation \(r\) (the Pearson coefficient) is the cosine of the angle between \(\tilde x\) and \(\tilde y\):

\[ r \;=\; \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y} \;=\; \cos \theta(\tilde x, \tilde y). \]

So \(r\) lives in \([-1, +1]\), by the Cauchy–Schwarz inequality you already know.
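Here is a minimal sketch of both identities in plain Python: sample covariance as a scaled dot product of centered vectors, and Pearson's \(r\) as their cosine. The data are made up for illustration (roughly \(y = 2x\) plus noise), and `centered` and `dot` are local helpers.

```python
import math
import statistics

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]   # made-up data, roughly y = 2x

def centered(v):
    """Subtract the sample mean from every entry."""
    m = sum(v) / len(v)
    return [vi - m for vi in v]

def dot(a, b):
    """Ordinary dot product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

xc, yc = centered(x), centered(y)
n = len(x)

# Sample covariance: dot product of the centered vectors, divided by n - 1.
cov_xy = dot(xc, yc) / (n - 1)

# Pearson r: cosine of the angle between the centered vectors.
r = dot(xc, yc) / (math.sqrt(dot(xc, xc)) * math.sqrt(dot(yc, yc)))

# Clamp before acos so floating-point rounding cannot push r outside [-1, 1].
angle_deg = math.degrees(math.acos(max(-1.0, min(1.0, r))))

# Variance is covariance with itself; the stdlib value should agree.
var_x = dot(xc, xc) / (n - 1)
print(cov_xy, r, angle_deg, var_x, statistics.variance(x))
```

Note that the \(1/(n-1)\) factors cancel in \(r\), which is why the cosine needs no normalizing constant.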

Drag points; watch r move

Drag the red points into a straight diagonal. Watch \(r\) approach \(+1\). Drag them into a perpendicular line. Watch \(r\) approach \(0\). Drag them into the opposite diagonal. Watch \(r\) approach \(-1\).

[Interactive scatter plot. Example readouts: x̄ = −0.17, ȳ = −0.13, s_x = 1.70, s_y = 1.21, Cov(X, Y) = 2.04, r = 0.992, ∠ = arccos r = 7.2°.]

Drag points. The teal crosshair is the centroid (x̄, ȳ). Coral arrows are the centered vectors (xi − x̄, yi − ȳ). Cov(X, Y) is the average of the products of those arrow components; r is their cosine similarity. Try pulling points into a tight diagonal and watch r approach ±1.

The coral arrows from the teal centroid are the centered pairs \((\tilde x_i, \tilde y_i)\) as 2D displacements. The number \(\mathrm{Cov}(X, Y)\) is the running average of the products of their \(x\)-component and \(y\)-component. The number \(r\) is the cosine of the angle between \(\tilde x\) and \(\tilde y\) viewed as vectors in \(\mathbb{R}^n\). Two ways of looking at the same arithmetic.

Sample versus population: what changes when you only have data

So far the formulas have been about the population, the underlying distribution, with \(\mu\) and \(\sigma\) and a known pmf. In practice you have a sample: \(n\) data points, period.

The natural estimators are

\[ \bar x \;=\; \frac{1}{n}\sum_{i=1}^n x_i, \qquad s^2 \;=\; \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar x)^2. \]

The sample mean estimates \(\mu\). The sample variance estimates \(\sigma^2\). The reason the variance denominator is \(n - 1\) rather than \(n\) is “Bessel’s correction”: by using \(\bar x\) instead of the (unknown) true \(\mu\), we’ve eaten one degree of freedom, and dividing by \(n - 1\) makes the estimator unbiased. We’ll come back to this when we derive MLE for a Gaussian in the next lesson; the maximum-likelihood estimator of variance divides by \(n\) and is biased.
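Unbiasedness is a statement about the average of the estimator over all possible samples, and for a small case that average can be computed exactly. The sketch below enumerates every sample of size 3 from a fair die (an illustrative setup) and averages both versions of the estimator; `sum_sq_dev` is a local helper.

```python
from fractions import Fraction
from itertools import product

faces = [Fraction(f) for f in range(1, 7)]
mu = sum(faces) / 6
sigma2 = sum((f - mu) ** 2 for f in faces) / 6   # true variance of a fair die

n = 3
samples = list(product(faces, repeat=n))          # all 6**n equally likely samples

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample's own mean."""
    xbar = sum(sample) / len(sample)
    return sum((x - xbar) ** 2 for x in sample)

# Average each estimator over every possible sample.
mean_unbiased = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)
mean_biased = sum(sum_sq_dev(s) / n for s in samples) / len(samples)

print(sigma2, mean_unbiased, mean_biased)
```

Dividing by \(n - 1\) recovers the true variance on average, while dividing by \(n\) comes out low by a factor of \((n-1)/n\).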

The law of large numbers, in one line

Sample averages converge to the true mean as the sample grows:

\[ \bar X_n \;\xrightarrow{n \to \infty}\; \mathbb{E}[X]. \]

Roll the same die 10 times; \(\bar x\) might be 2.7. Roll it 10,000 times; \(\bar x\) will be very close to 3.5. The LLN is what makes empirical estimation work at all.

Use the Gaussian widget below to see it. Pick a \(\mu\) and \(\sigma\). Click “Sample 10k.” Watch the empirical mean \(\bar x\) and empirical standard deviation \(s\) in the readout converge on the true values you set.
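The die experiment is easy to simulate. This is a small sketch using Python's `random` module with a fixed seed so runs are repeatable; the sample sizes and the helper name `sample_mean` are chosen here for illustration.

```python
import random

random.seed(0)

def sample_mean(n):
    """Average of n simulated rolls of a fair six-sided die."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

# Larger samples should land closer to the true mean of 3.5.
means = {n: sample_mean(n) for n in (10, 100, 10_000)}

for n, m in means.items():
    print(n, m, abs(m - 3.5))
```

The small-sample averages can wander noticeably; the large-sample one should sit within a few hundredths of 3.5.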

[Interactive Gaussian widget. Controls: μ = −1.00, σ = 1.50. Readouts: x̄ n/a, s n/a, peak 0.266.]

The central limit theorem, in one line

Take any distribution with mean \(\mu\) and finite variance \(\sigma^2\), whatever its shape. Average \(n\) i.i.d. draws from it. As \(n\) grows, the distribution of the (properly scaled) sample mean approaches a Gaussian:

\[ \frac{\bar X_n - \mu}{\sigma / \sqrt{n}} \;\xrightarrow{d}\; \mathcal{N}(0, 1). \]

This is the central limit theorem (CLT). It is the reason error bars in physics, the noise in stochastic gradient descent, and the histogram of “average of a hundred rolls” all look Gaussian, regardless of what they started as. Sums and averages of many independent things forget their individual shape.
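A quick simulation shows the effect on a visibly non-Gaussian starting point. This sketch assumes an Exponential(1) distribution (mean 1, standard deviation 1, strongly skewed), averages 100 draws at a time, standardizes, and checks how much mass falls within one and two units of zero. For a standard normal those fractions are about 0.68 and 0.95.

```python
import math
import random

random.seed(0)
n, trials = 100, 2000
mu, sigma = 1.0, 1.0   # mean and standard deviation of Exponential(rate=1)

def standardized_mean():
    """(sample mean - mu) / (sigma / sqrt(n)) for n exponential draws."""
    xbar = sum(random.expovariate(1.0) for _ in range(n)) / n
    return (xbar - mu) / (sigma / math.sqrt(n))

z = [standardized_mean() for _ in range(trials)]

# Fractions of standardized means within 1 and 2 units of zero.
within_one = sum(abs(v) <= 1 for v in z) / trials
within_two = sum(abs(v) <= 2 for v in z) / trials
print(within_one, within_two)
```

The individual draws are far from bell-shaped, but their standardized averages already behave roughly like a standard normal at \(n = 100\).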

(Big caveat: the CLT is about averages, not individual data points. Token frequencies, file sizes, incomes: those are not Gaussian. Intuition built on the CLT often does not transfer to quantities like these; we’ll see this trip up loss-curve interpretations in module 13.)

We are not going to use the CLT to do confidence intervals; those are out of scope. We mention it because every “noise is Gaussian” sentence in the rest of this course derives from this one line.

Where this lands: three numbers, used everywhere

\(\mathbb{E}[X]\) is in every loss function: the loss we minimize is the expectation of the per-example negative log-likelihood (NLL). \(\mathrm{Var}(X)\) is in every weight initializer: the unit-variance rule keeps signal flow predictable. \(\mathrm{Cov}(X, Y)\), in its special case \(\mathrm{Var}(X) = \mathrm{Cov}(X, X)\), is in every BatchNorm / LayerNorm: both compute a mean and subtract; both compute a variance and divide.
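The "compute a mean and subtract, compute a variance and divide" step can be sketched in a few lines. This is a simplified illustration, not the BatchNorm or LayerNorm used by any particular library: it omits the learned scale and shift, and the small constant `eps` and the divide-by-\(n\) variance are assumptions made here.

```python
import math
import statistics

def normalize(values, eps=1e-5):
    """Subtract the mean, then divide by the standard deviation (with a small eps)."""
    mean = statistics.fmean(values)
    var = statistics.pvariance(values, mu=mean)   # divide-by-n variance
    return [(v - mean) / math.sqrt(var + eps) for v in values]

batch = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # made-up activations
normed = normalize(batch)

normed_mean = statistics.fmean(normed)
normed_var = statistics.pvariance(normed)
print(normed_mean, normed_var)
```

After normalization the values have mean approximately 0 and variance approximately 1, whatever scale the inputs arrived at.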

The transformer is doing one of these three things on almost every line. Next lesson: how all of statistical learning (every loss, every “train this thing to fit data” command) is one principle, applied over and over. Maximum likelihood. That’s the next stop.
