One biased coin. Ten flips. What's your best guess?
You flip a coin 10 times and get 7 heads, 3 tails. Someone hands you a Bernoulli model with unknown parameter $p$ (the probability of heads) and asks: what’s your best estimate of $p$?
You’d say $\hat{p} = 0.7$, and you’d be right. The formal reason is maximum likelihood estimation (MLE). The idea fits in one sentence: the best estimate of a parameter is the value that makes the data you actually saw most probable. That sentence is the whole machinery of supervised learning. We’re about to make it concrete.
Likelihood: data fixed, parameter varying
For an i.i.d. dataset $x_1, \dots, x_N$ drawn from a distribution with parameter $\theta$, the likelihood is
$$L(\theta) = \prod_{i=1}^{N} p(x_i \mid \theta)$$
It looks like a probability but it isn’t, exactly. The data is fixed; the unknown is $\theta$. We’re asking, for each candidate $\theta$, “how plausible is the dataset I have under that choice?”
Note the i.i.d. assumption hiding in the product form. It says: the data points are independent draws from the same distribution. That’s a real assumption. It doesn’t hold for tokens in a sentence (that’s why language models exist), but it’s the right starting point.
Why take logs first: every time, without exception
The likelihood is a product. Products of small numbers shrink fast: ten flips of a fair coin ($p = 0.5$) give a likelihood of $0.5^{10} \approx 10^{-3}$; a hundred flips give $0.5^{100} \approx 10^{-30}$. Keep multiplying and your floating-point arithmetic underflows to zero. Worse, derivatives of long products are a product-rule nightmare.
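To see the underflow for yourself, here is a minimal sketch in plain Python. The function names and the 2000-flip dataset are illustrative choices, not part of the lesson; the coin is assumed fair so the arithmetic is easy to check by hand.

```python
import math

def likelihood(flips, p):
    """Raw likelihood: multiply per-flip probabilities (1 = heads, 0 = tails)."""
    result = 1.0
    for x in flips:
        result *= p if x == 1 else 1.0 - p
    return result

def log_likelihood(flips, p):
    """Same quantity on the log scale: a sum of per-flip log-probabilities."""
    return sum(math.log(p if x == 1 else 1.0 - p) for x in flips)

flips_10 = [1, 0] * 5        # 10 flips
flips_2000 = [1, 0] * 1000   # 2000 flips (hypothetical larger dataset)

print(likelihood(flips_10, 0.5))        # 0.5**10, still representable
print(likelihood(flips_2000, 0.5))      # 0.5**2000 underflows to 0.0 in float64
print(log_likelihood(flips_2000, 0.5))  # 2000 * log(0.5), a perfectly ordinary number
```

The raw product collapses to exactly zero, so every candidate $p$ would look equally bad. The sum of logs stays well inside floating-point range.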
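Logs save us. Define the log-likelihood
$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{N} \log p(x_i \mid \theta)$$
Three things matter here. First, the product became a sum: no underflow, easy derivatives. Second, $\log$ is strictly increasing, so $\arg\max_\theta L(\theta) = \arg\max_\theta \ell(\theta)$: the maximizer is the same. Third, we usually minimize negative log-likelihood (NLL) instead, so it looks like a loss function:
$$\mathrm{NLL}(\theta) = -\ell(\theta) = -\sum_{i=1}^{N} \log p(x_i \mid \theta)$$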
Memorize that last line. It’s the entire shape of supervised learning.
MLE for a Bernoulli: derive it once, use it forever
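For a single Bernoulli with parameter $p$, observe $h$ heads and $t$ tails in $N = h + t$ flips. The log-likelihood is
$$\ell(p) = h \log p + t \log(1 - p)$$
Set the derivative to zero:
$$\frac{d\ell}{dp} = \frac{h}{p} - \frac{t}{1 - p} = 0 \quad\Longrightarrow\quad \hat{p} = \frac{h}{h + t} = \frac{h}{N}$$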
The MLE is the empirical frequency. Count the heads; divide by the total. That’s it.
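If you’d rather check the calculus numerically, the sketch below compares the closed-form answer with a brute-force search over candidate values of $p$. The grid resolution of 0.001 and the function name are arbitrary choices made for illustration.

```python
import math

def bernoulli_log_likelihood(p, heads, tails):
    """Log-likelihood of `heads` heads and `tails` tails under Bernoulli(p)."""
    return heads * math.log(p) + tails * math.log(1.0 - p)

heads, tails = 7, 3

# Closed form from the derivation: heads divided by total flips.
closed_form = heads / (heads + tails)

# Brute force: evaluate the log-likelihood on a grid of candidates in (0, 1).
grid = [k / 1000 for k in range(1, 1000)]
grid_best = max(grid, key=lambda p: bernoulli_log_likelihood(p, heads, tails))

print(closed_form, grid_best)  # both land on the empirical frequency
```

The grid search is only a sanity check. In practice you use the closed form, since counting is cheaper than searching.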
Below, click any flip to toggle it. The red curve is the likelihood $L(p)$, visibly peaked; the blue curve is the log-likelihood $\ell(p)$, same maximum, much friendlier shape. Drag the dashed sun-colored line across either plot; “Snap to MLE” jumps you to the optimum.
7 heads, 3 tails: find p̂
You observe 7 heads in 10 flips. What is the MLE $\hat{p}$?
Categorical: same move, more options
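The categorical generalization is immediate. With $K$ classes and observed counts $n_1, \dots, n_K$ summing to $N$, the log-likelihood is
$$\ell(\mathbf{p}) = \sum_{k=1}^{K} n_k \log p_k \quad \text{subject to} \quad \sum_{k=1}^{K} p_k = 1$$
A Lagrange-multiplier minute (we’ll skip the algebra) gives
$$\hat{p}_k = \frac{n_k}{N}$$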
Count each class; divide by the total. Same shape as Bernoulli. This is why training a bigram language model by maximum likelihood on a corpus and “just counting bigrams” produce exactly the same model in the limit: that’s the bigram MLE result we’re about to demonstrate.
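Count-and-normalize is short enough to write out in full. A minimal sketch, with a made-up list of observations and a hypothetical helper name `categorical_mle`:

```python
from collections import Counter

def categorical_mle(observations):
    """MLE of a categorical distribution: class count divided by total count."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Six illustrative draws from a three-class distribution.
draws = ["a", "b", "a", "c", "a", "b"]
p_hat = categorical_mle(draws)

for label, prob in sorted(p_hat.items()):
    print(label, prob)
```

One consequence worth noticing: a class that never appears in the data gets probability zero under the MLE, which is why it is simply absent from the returned dictionary.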
Gaussian: MLE picks the sample mean and variance
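For data $x_1, \dots, x_N$ drawn from $\mathcal{N}(\mu, \sigma^2)$, the same calculus gives
$$\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad \hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{\mu})^2$$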
The MLE of the mean is the sample mean. The MLE of the variance divides by $N$, not $N - 1$: it is biased (it slightly underestimates $\sigma^2$ in small samples). The unbiased estimator from the last lesson divides by $N - 1$. Both estimators exist for good reasons; MLE just gives the biased one.
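The gap between the two variance estimators is easy to see on a small sample. The sketch below computes the MLE by hand and compares it with `statistics.variance` from the Python standard library, which divides by $N - 1$. The eight data points are made up for illustration.

```python
import statistics

def gaussian_mle(xs):
    """MLE of a Gaussian: sample mean, and variance with a 1/N divisor."""
    n = len(xs)
    mu_hat = sum(xs) / n
    var_hat = sum((x - mu_hat) ** 2 for x in xs) / n
    return mu_hat, var_hat

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mu_hat, var_mle = gaussian_mle(data)
var_unbiased = statistics.variance(data)  # divides by N - 1

print(mu_hat)        # sample mean
print(var_mle)       # divides by N, so it is the smaller of the two
print(var_unbiased)  # divides by N - 1
```

The two variances differ by exactly the factor $(N - 1)/N$, so the disagreement fades as the sample grows.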
MLE under Gaussian noise IS least squares
This is the punchline that connects MLE to half of classical ML.
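Suppose your model says $y_i = f_\theta(x_i) + \varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$ i.i.d. The log-likelihood, dropping constants in $\theta$, becomes
$$\ell(\theta) = -\frac{1}{2\sigma^2} \sum_{i=1}^{N} \bigl(y_i - f_\theta(x_i)\bigr)^2 + \text{const}$$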
Maximizing $\ell(\theta)$ is the same as minimizing the sum of squared errors, because for any fixed $\sigma$ the factor in front of the sum is a negative constant. The two phrases, “fit by maximum likelihood under Gaussian noise” and “fit by least squares,” are exactly the same procedure. Linear regression, your first ML algorithm, was MLE all along.
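Here is a small numerical check of that equivalence, assuming a straight-line model $f_\theta(x) = \text{slope} \cdot x + \text{intercept}$ and a fixed noise scale $\sigma = 1$. The data and function names are invented for the example. The line is fitted with the textbook least-squares formulas, and then the Gaussian NLL is evaluated at that fit and at a few nudged alternatives.

```python
import math

def fit_line_least_squares(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

def gaussian_nll(slope, intercept, xs, ys, sigma=1.0):
    """Negative log-likelihood of the data under y = slope*x + intercept + N(0, sigma^2)."""
    n = len(xs)
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return 0.5 * n * math.log(2 * math.pi * sigma ** 2) + sse / (2 * sigma ** 2)

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

slope, intercept = fit_line_least_squares(xs, ys)
nll_at_fit = gaussian_nll(slope, intercept, xs, ys)

# Nudge each parameter away from the least-squares fit and compare NLL values.
nudges = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]
nll_nudged = [gaussian_nll(slope + ds, intercept + di, xs, ys) for ds, di in nudges]

print(slope, intercept, nll_at_fit)
print(all(v > nll_at_fit for v in nll_nudged))  # every nudge makes the NLL worse
```

The first term of `gaussian_nll` does not depend on the slope or intercept, which is the “dropping constants” step from the derivation made visible in code.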
A language model is a categorical MLE, literally
Here’s the move that earns this lesson its place in the course. A bigram language model says: given the previous character $c$, the next character is categorical with some probability vector $\mathbf{p}_c$. That’s 27 separate categoricals (one per row), each with its own probability vector to estimate.
By the result two steps up, the MLE for each row is just count-and-normalize: count how many times each character followed $c$ in the corpus, divide by the total. No gradient descent. No neural network. Just MLE.
Type a few names below. Each name updates the bigram count matrix. Click a row label on the left (e.g. a) to see the conditional distribution $P(\cdot \mid \text{a})$: that’s exactly $\hat{\mathbf{p}}_{\text{a}}$ for the row “characters that follow a.”
What you’re looking at (the row of bars on the right) is the language model. A 27-row table of categorical MLEs. Module 14 will replace this table with a neural network, and the punchline of that lesson is that the network, trained by gradient descent on the right loss, converges to exactly this table in the limit. The neural-network approach is just a fancier way of doing the same MLE.
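The same table can be built in a few lines. This is a sketch, not the code behind the widget: the function name is made up, the two names are an arbitrary toy corpus, and a plain `.` stands in for the start/end marker.

```python
from collections import Counter, defaultdict

BOUNDARY = "."  # stand-in for the start/end marker

def bigram_mle(names):
    """Per-row categorical MLE: count which character follows which, then normalize."""
    counts = defaultdict(Counter)
    for name in names:
        chars = [BOUNDARY] + list(name) + [BOUNDARY]
        for prev, nxt in zip(chars, chars[1:]):
            counts[prev][nxt] += 1

    table = {}
    for prev, row in counts.items():
        total = sum(row.values())
        table[prev] = {nxt: n / total for nxt, n in row.items()}
    return table

table = bigram_mle(["emma", "ava"])

# Row "a": what follows the character a in this toy corpus.
print(table["a"])
# Row ".": which characters start a name.
print(table[BOUNDARY])
```

Each row of `table` sums to one, and each row is estimated independently of the others, which is exactly the “separate categoricals” picture.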
Bigram MLE on a tiny corpus
Train a bigram model on the two-word corpus ab, ac (each implicitly bracketed by start/end markers ·).
What is the MLE $\hat{P}(\text{b} \mid \text{a})$?
NLL is the form you'll actually minimize
We never talk about “maximizing the likelihood” in code. We minimize the negative log-likelihood
$$\mathrm{NLL}(\theta) = -\sum_{i=1}^{N} \log p(x_i \mid \theta)$$
This is the loss every classifier in this course will minimize. Cross-entropy loss in PyTorch is literally this expression for a categorical distribution with one-hot targets. The “loss” the training loop spits out at iter 200 of the capstone is the average NLL of the model’s predictions on a batch of tokens.
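To make the expression concrete without pulling in a framework, here is a plain-Python sketch of average NLL for a categorical model. It is not PyTorch’s implementation. It assumes the model outputs raw scores (logits) that are turned into probabilities with a softmax, and the batch values and function names are invented for the example.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities; subtracting the max avoids overflow in exp."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def average_nll(batch_logits, targets):
    """Mean of -log p(target) over a batch; targets are class indices."""
    losses = []
    for logits, target in zip(batch_logits, targets):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Two illustrative examples with three classes each.
batch_logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
targets = [0, 2]

print(average_nll(batch_logits, targets))

# A model that is uniform over 3 classes pays log(3) per example.
print(average_nll([[0.0, 0.0, 0.0]], [1]), math.log(3))
```

With a one-hot target, only the probability assigned to the correct class enters the loss, which is why indexing `probs[target]` is all the “cross-entropy” needs here.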
The next lesson is module 9. It will rename NLL as cross-entropy and explain why $-\log p$ has a beautiful interpretation as “the model’s uncertainty in bits.” For now, the name to remember is NLL, and the takeaway is: every loss curve you’ll watch for the rest of this course is an NLL going down.
After that, one more module 8 lesson on the inverse problem: given a trained categorical, how do you actually generate samples from it? We’ve been counting to learn the distribution; next we’ll roll dice to draw from it.