Sequence Modeling: Bigrams to RNNs · 14 min

Predicting the next character

A language model is just a table, one row per "what came before" and one column per "what could come next." Learn the table by counting; sample from a row to generate text.

0 / 0

A language model, with no neural anything

We will build a language model in the next ten minutes. No tensors, no GPUs, no gradient descent. Just counting.

The corpus is a list of 32,000 baby names. The model will read it and learn one fact: given a character, which character tends to come next. That fact is a table, and that table is the model.

Start small. Type a name into the box below. Watch every two-character pair you typed light up a cell of the matrix.

corpus 0 words
bigrams counted 0
Try 'emma'. Each adjacent pair (including the start/end token ·) adds 1 to a cell.

Corpus is empty. Add a word above and watch the matrix fill in.

The dot · is the start/end marker. So emma becomes ·emma·: five characters, four bigrams: ·→e, e→m, m→m, m→a, a→·. Add a few more names and you can already see structure: a shows up in a lot of last-character-of-name positions, so the column under · is dense in that row.

The vocabulary

A token is a unit our model knows about. For this whole lesson, a token is one character: the 26 lowercase letters, plus a special start/end marker ·.

So our vocabulary has V=27|V| = 27 tokens. That gives us a 27×2727 \times 27 matrix of “this token, then that token” pairs: 729 cells. Every name in the corpus contributes a handful of bigrams; the matrix is just the running count.

Why a single start/end marker for both? Because then ·a is “the model sees a name starting with a” and is “the model sees a name ending with a.” One token does both jobs, and it lets the model learn endings as gracefully as it learns beginnings.

Sequences need the chain rule of probability

We want the model to assign a probability to a whole sequence of characters, a whole word like ada. Not to single characters in isolation. Probability of a word, not probability of a letter.

The trick is one identity from probability class: the joint probability of a sequence factors into a product of conditionals.

P(x1,x2,,xT)=t=1TP(xtx1,,xt1).P(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \ldots, x_{t-1}).

That is exact, with no approximation. The probability of ada (treated as the sequence ·, a, d, a, ·) is the probability of the first token, times the probability of the second given the first, times the probability of the third given the first two, and so on.

The factorization buys us something important: instead of needing the probability of every possible word (there are infinitely many) we only need to know one thing per step: given what I’ve seen so far, what’s likely to come next.

A short sequence, by hand

A model gives you these three conditional probabilities:

P(a)=0.13,P(da)=0.16,P(d)=0.40.P(a \mid \cdot) = 0.13, \qquad P(d \mid a) = 0.16, \qquad P(\cdot \mid d) = 0.40.

The full sequence for ad is ·, a, d, ·. Use the chain rule to compute the joint probability P(‘ad’)P(\,\text{`ad'}\,). Round to four decimals.

The Markov assumption: context shrunk to one token

The chain rule has a problem in practice: P(xtx1,,xt1)P(x_t \mid x_1, \ldots, x_{t-1}) depends on the whole history. As soon as the history is long, no corpus on earth contains every possible history often enough to estimate that conditional reliably.

So we cheat. We truncate the history to just the most recent token:

P(xtx1,,xt1)    P(xtxt1).P(x_t \mid x_1, \ldots, x_{t-1}) \;\approx\; P(x_t \mid x_{t-1}).

This is the Markov assumption, and a model that uses it is a bigram model. “Bigram” because the conditional only ever looks at pairs.

It is a real approximation: e after m in emma does not actually depend only on the m. But it’s tractable: we can fill in P(xtxt1)P(x_t \mid x_{t-1}) for every pair (xt1,xt)(x_{t-1}, x_t) from the corpus by counting. That is what the matrix you’ve been building is.

A row is a probability distribution

Click any row label on the left of the matrix below. The row lights up, and a bar chart appears showing P(row letter)P(\,\cdot\,|\,\text{row letter}\,): the probability of every possible next character given that the current character is the row letter.

corpus 50 words
bigrams counted 341
row a
Row a is focused; hover any cell to read its conditional probability. Distribution below.
P( · | a ) row sum = 1.000
· 35%
b 6%
c 2%
d 2%
h 6%
i 6%
l 6%
m 6%
n 11%
r 9%
t 2%
u 4%
v 6%
y 2%
  • emma
  • olivia
  • ava
  • isabella
  • sophia
  • charlotte
  • amelia
  • mia
  • harper
  • evelyn
  • abigail
  • emily
  • ella
  • elizabeth
  • sofia
  • avery
  • scarlett
  • grace
  • chloe
  • victoria
  • riley
  • aria
  • lily
  • aubrey
  • zoey
  • penelope
  • lillian
  • addison
  • layla
  • natalie
  • camila
  • hannah
  • brooklyn
  • zoe
  • nora
  • leah
  • savannah
  • audrey
  • claire
  • eleanor
  • skylar
  • ellie
  • samantha
  • stella
  • paisley
  • violet
  • mila
  • allison
  • alexa
  • anna

Three things to notice. First, the bars in the row sum to 1: every row of the matrix, after dividing by its row sum, is a categorical probability distribution. Second, that distribution has structure: after a comes n and r and · (end of name) more often than q or z. Third, every row is a different distribution; the model knows that what follows a is different from what follows q.

The row is the model’s answer to “what’s likely to come after this character.” Pick any row, and you’ve picked the model’s prediction.

The model is the matrix

There is no fancier statement of what a bigram language model is. It is one matrix:

NRV×V,Nij=number of times token j followed token i in the corpus.N \in \mathbb{R}^{|V| \times |V|}, \qquad N_{ij} = \text{number of times token } j \text{ followed token } i \text{ in the corpus}.

You sample text from it like this. Start with ·. Look at row ·. That row, normalized, is a distribution over what comes first. Sample one character from that distribution: say it’s a. Now look at row a. Sample again: say it’s n. Look at row n. Sample again. Keep going until you sample ·, which means “stop.” That’s a generated name.

That’s the entire algorithm. Counting builds the table; sampling reads from it. There are no parameters to learn beyond what counting gives you, and there are no neural networks anywhere. We will spend the next few lessons making this model strictly better, but the shape will not change much: we will always be predicting the next token from a row of probabilities.

A tiny row by sight

Here’s a tiny corpus of four names: emma, olivia, ava, ada. In this corpus, the row for m has exactly two observed next-character counts: m → m once and m → a once.

Click the row for m in the matrix below to see the same thing visually.

corpus 4 words
bigrams counted 20
row m
Row m is focused; hover any cell to read its conditional probability. Distribution below.
P( · | m ) row sum = 1.000
a 50%
m 50%
  • emma
  • olivia
  • ava
  • ada

The two bars in that row have equal mass. The check below asks you to turn that visual fact into a probability.

Read a probability off the matrix

In the tiny corpus emma, olivia, ava, ada, the row for m has exactly two observed next-character counts: m → m once and m → a once.

What is P(am)P(a \mid m)? Express your answer as a decimal.

Why this matters past this lesson

Two things will return.

First, the row-of-a-matrix-as-distribution shape is the output shape of every language model we will ever build in this course, including the transformer at the end. The transformer’s last operation is softmax(logits) applied to a vector of length V|V|. That vector is a row. We will get fancier about how we compute that row (eventually it will depend on thousands of previous tokens through layers of attention), but it will still be a row, and it will still be sampled from in exactly the way we sampled from a bigram row.

Second, the attention weights inside the transformer are themselves rows of a matrix, normalized to sum to 1, that tell the model “given that I’m at position tt, how much should I look at each earlier position?” The mental move you just made (“this row is a categorical distribution; the column with the most mass is the one most likely to be picked”) is the same mental move you’ll make when you read your first attention heatmap.

The neural networks come next. The matrix view does not go away.

Lesson complete

Nice tinkering.