How a tokenizer is built

The merge loop

Byte-pair encoding (Sennrich et al. 2016, adapted from Gage 1994) is one for-loop. Start from the alphabet of the corpus. Count how often every adjacent pair of tokens appears. Pick the most frequent pair, fuse it into a new token, replace it everywhere. Repeat until you’ve made enough new tokens.

That’s the entire algorithm. It is deterministic, it is greedy, and the only state it produces is an ordered list of merges.

Five word-counts is enough to see it work. The widget below holds the canonical Hugging Face / Sennrich teaching corpus:

\{\text{hug}{:}10,\ \text{pug}{:}5,\ \text{pun}{:}12,\ \text{bun}{:}4,\ \text{hugs}{:}5\}

The histogram on the right is the only thing the algorithm consults at each step. It picks the tallest bar, fuses that pair, and recomputes. Hit “merge once” twice and watch what happens to the corpus on the left.

Counting pairs by hand

At step 0, before any merges, every word is just a sequence of characters. Counts of adjacent pairs:

(u, g): 10 (in hug) + 5 (in pug) + 5 (in hugs) = 20
(p, u): 5 (in pug) + 12 (in pun) = 17
(u, n): 12 (in pun) + 4 (in bun) = 16
(h, u): 10 (in hug) + 5 (in hugs) = 15
(g, s): 5 (in hugs) = 5

(u, g) wins. Merge ① is (u, g) → ug. Every u g becomes a single token ug. Now hug is [h, ug], pug is [p, ug], hugs is [h, ug, s].

The pair counts change. Recompute against the new token sequences:

(p, ug): 5 (in the merged pug) = 5
(u, n): still 16; pun and bun were not touched.

Watch the widget bar move.

Pair count after merge ①

After running merge ① on the corpus above, what is the count of the pair (p, ug)?

The merge list is the tokenizer

The output of training isn’t a model. It’s a list:

1.  (u, g) → ug
2.  (u, n) → un
3.  (h, ug) → hug
…

To tokenize a new word at inference time, replay the merges in order. Start with single characters. Apply merge 1 wherever it matches. Apply merge 2. Apply merge 3. Stop when you reach the end of the list.

The widget exposes this. After two merges ((u, g) then (u, n)) type new words into the bottom box and watch them tokenize against the current merge table.

Tokenize 'hugs' after two merges

With merges $(u, g) \to ug$ and $(u, n) \to un$ applied in that order, how many tokens does 'hugs' produce?

The tokenizer is a frozen artifact

Three things follow from this construction that learners frequently get wrong:

The tokenizer is not learned. No gradients. No optimizer. The merge list is computed once by a frequency-counting batch job and never updated. It is a fixed pre-processor.
The merge list ships with the weights. If you swap the tokenizer for a different one, every embedding-table row is now mapped to the wrong token, and the model becomes noise. You can’t change the tokenizer without retraining the model.
The order of merges matters. Applying merge ② before merge ① can produce a different tokenization. The list is replayed in the order it was trained.

You will see this hard rule a lot in practice: when a model card says “tokenizer: cl100k_base,” that means exact binary-identical merge list, or the model breaks.

Bytes, not codepoints

GPT-2 made one important variation. Instead of starting from “the set of distinct Unicode characters in the corpus,” it started from the 256 raw UTF-8 bytes. Every input string (Cyrillic, emoji, whitespace, control characters) turns into bytes the same way, and BPE runs over the byte stream.

Why does that matter?

The tokenizer is total. Every possible input is representable, because every input is bytes. No <UNK> token, no failures at inference.
Cross-language behavior. Languages whose alphabet is not in the training corpus still work; they just tokenize into more pieces. Korean, Arabic, emoji.
The base vocabulary size is fixed at 256. All learned merges go on top of that.

GPT-2 trained 50,001 merges on top of those 256 bytes (vocab = 50,257 once you add a handful of special tokens). GPT-4 trained more, with cl100k_base ≈ 100,000.

Byte-level base vocab

What is the size of GPT-2’s byte-level BPE vocabulary before any merges have been applied?

The strawberry failure, in full

Now strawberry is fully explained. The byte-level BPE tokenizer that GPT-4 was trained with happens to have learned merges for st, raw, and berry (three common subword pieces) but never learned a single token for strawberry. So at inference, the word decomposes into three ids: [496, 675, 15717] for cl100k_base.

Run the inspector below, leave it on BPE mode, and try strawberry. Try unbelievable. Try a date. Try a code snippet.

tokens 9

bytes 10

bytes/token 1.11

vocab 64 merges

character word BPE (trained on Shakespeare)

strawberry

show merge table (64 merges)

1 t + h → th
2 r + e → re
3 i + t → it
4 th + e → the
5 e + n → en
6 a + n → an
7 o + u → ou
8 i + r → ir
9 it + i → iti
10 iti + z → itiz
11 itiz + en → itizen
12 i + s → is
13 u + s → us
14 o + r → or
15 ir + s → irs
16 irs + t → irst
17 w + e → we
18 l + l → ll
19 v + e → ve
20 F + irst → First
21 C + itizen → Citizen
22 Citizen + : → Citizen:
23 e + a → ea
24 t + o → to
25 e + s → es
26 i + n → in
27 i + e → ie
28 n + o → no
29 i + c → ic
30 p + ea → pea
31 pea + k → peak
32 A + ll → All
33 All + : → All:
34 o + l → ol
35 ve + d → ved
36 k + no → kno
37 kno + w → know
38 l + e → le
39 ' + t → 't
40 a + t → at
41 ou + r → our
42 o + n → on
43 e + r → er
44 c + e → ce
45 m + e → me
46 s + peak → speak
47 a + re → are
48 ol + ved → olved
49 a + r → ar
50 y + , → y,
51 e + c → ec
52 g + o → go
53 s + u → su
54 l + d → ld
55 the + y → they
56 b + u → bu
57 v + en → ven
58 f + or → for
59 o + re → ore
60 p + r → pr
61 e + d → ed
62 the + r → ther
63 ea + r → ear
64 speak + . → speak.

The tokenizer in the widget is a small one, trained on a paragraph of Shakespeare. The exact tokens differ from GPT-4’s. The shape of the failure is identical: single, opaque, multi-character chunks where the model has no operation that introspects the letters inside them.

When the model is asked “how many R’s in strawberry,” it sees [496, 675, 15717] (three integers) and is asked to perform a counting operation on the characters of those tokens, an operation the embedding layer does not expose. Letter-counting tasks are off-distribution. So is reversing strings. So is ASCII-art. So is multi-digit arithmetic, partly because BPE merges multi-digit numbers into chunks unevenly (123 might be one token, 1234 might be two: 12 and 34).

This is the price of subword tokenization. It is not the model being careless. It is the model being given different inputs than the inputs the question demands. Next lesson: how the model uses these inputs to learn.

Lesson complete