Tokenization, Training & Sampling · 14 min

Text becomes integers

The model never sees characters or bytes. It sees integer ids into a fixed vocabulary, picked once when the tokenizer was trained. Three flavors of tokenizer (character, word, subword) and the costs of each.


The interface, restated

For the entire transformer arc you have been thinking about token sequences. Time to look at the boundary where text turns into them.

The model never sees a string. It never sees a UTF-8 byte. The forward pass starts with an embedding lookup, and the input to that lookup is an integer, an index into the model’s vocabulary. The vocabulary is a fixed list of strings, picked once at tokenizer-training time and shipped alongside the weights.

Two functions live at this boundary:

$$\mathrm{encode}: \text{string} \to \text{list of ids},\qquad \mathrm{decode}: \text{list of ids} \to \text{string}$$

That’s the whole interface. Everything in M11–M16 sits on the right of encode. Everything in this lesson sits on the left.
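
To make the boundary concrete, here is a minimal sketch in Python. The five-entry vocabulary, the names `vocab`, `encode`, `decode`, and the tiny `embedding` table are all made up for illustration; a real tokenizer and model would supply their own.

```python
# Toy vocabulary: a fixed list of strings. A token's id is its index in the list.
vocab = ["a", "b", "c", " ", "!"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}


def encode(text: str) -> list[int]:
    """string -> list of ids (one id per character in this toy example)."""
    return [token_to_id[ch] for ch in text]


def decode(ids: list[int]) -> str:
    """list of ids -> string."""
    return "".join(vocab[i] for i in ids)


# Hypothetical embedding table: one row of 2 floats per vocabulary entry.
embedding = [[0.1 * i, -0.1 * i] for i in range(len(vocab))]

ids = encode("cab!")
vectors = [embedding[i] for i in ids]  # the lookup that starts the forward pass

print(ids)          # [2, 0, 1, 4]
print(decode(ids))  # cab!
```

Everything after `encode` works with `ids` and `vectors` only; the original string is never consulted again.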

The default failure mode

Type your name into the widget below. Then try strawberry. Then try a number. The default mode is BPE, a real subword tokenizer trained inline on a paragraph of Shakespeare.

[Interactive tokenizer widget, BPE mode. Input: "How many R's in strawberry?"]
tokens: 24 · bytes: 27 · bytes/token: 1.13 · vocab: 64 merges

Merge table (64 merges):
  1. t + h → th
  2. r + e → re
  3. i + t → it
  4. th + e → the
  5. e + n → en
  6. a + n → an
  7. o + u → ou
  8. i + r → ir
  9. it + i → iti
  10. iti + z → itiz
  11. itiz + en → itizen
  12. i + s → is
  13. u + s → us
  14. o + r → or
  15. ir + s → irs
  16. irs + t → irst
  17. w + e → we
  18. l + l → ll
  19. v + e → ve
  20. F + irst → First
  21. C + itizen → Citizen
  22. Citizen + : → Citizen:
  23. e + a → ea
  24. t + o → to
  25. e + s → es
  26. i + n → in
  27. i + e → ie
  28. n + o → no
  29. i + c → ic
  30. p + ea → pea
  31. pea + k → peak
  32. A + ll → All
  33. All + : → All:
  34. o + l → ol
  35. ve + d → ved
  36. k + no → kno
  37. kno + w → know
  38. l + e → le
  39. ' + t → 't
  40. a + t → at
  41. ou + r → our
  42. o + n → on
  43. e + r → er
  44. c + e → ce
  45. m + e → me
  46. s + peak → speak
  47. a + re → are
  48. ol + ved → olved
  49. a + r → ar
  50. y + , → y,
  51. e + c → ec
  52. g + o → go
  53. s + u → su
  54. l + d → ld
  55. the + y → they
  56. b + u → bu
  57. v + en → ven
  58. f + or → for
  59. o + re → ore
  60. p + r → pr
  61. e + d → ed
  62. the + r → ther
  63. ea + r → ear
  64. speak + . → speak.

strawberry is not a token. The tiny tokenizer above breaks it into many small pieces, and even GPT-4’s much larger vocabulary typically splits the bare word into about three chunks, most of which are not words. The colored chunks are the actual integers the model receives. If you ask GPT-4 “how many R’s are in strawberry,” the model is being handed a few ids and asked to count letters in chunks it has no built-in operation to look inside.

The model isn’t bad at counting. It is doing something else entirely.

Character-level: every codepoint is a token

The simplest possible tokenizer. The vocabulary is just the set of distinct characters that appear in the corpus. For tiny-shakespeare, that’s 65: 52 letters (cased) plus 13 other characters (whitespace, punctuation, and a single digit). Switch the widget above to character and watch each glyph become its own colored chunk.

The capstone (M18) trains on the tiny-shakespeare corpus with exactly this tokenizer. 1,115,394 characters total, split 90/10 into a 1,003,854-character training stream and a 111,540-character validation stream.
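
A character-level tokenizer and the 90/10 split can be sketched in a few lines. This is an illustrative sketch, not the capstone's code: the helper names `build_char_tokenizer` and `split_90_10` are hypothetical, the two-line `corpus` is a stand-in for the full file, and truncating with `int(0.9 * n)` is an assumption (one that happens to reproduce the counts quoted above).

```python
def build_char_tokenizer(corpus: str):
    """Vocabulary = sorted set of distinct characters; id = position in that list."""
    chars = sorted(set(corpus))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}

    def encode(text: str) -> list[int]:
        return [stoi[ch] for ch in text]

    def decode(ids: list[int]) -> str:
        return "".join(itos[i] for i in ids)

    return chars, encode, decode


def split_90_10(ids: list[int]):
    """First 90% of the stream for training, the rest for validation."""
    n_train = int(0.9 * len(ids))  # assumed rounding rule: truncate
    return ids[:n_train], ids[n_train:]


# Stand-in sample; the real corpus is the full tiny-shakespeare text.
corpus = "First Citizen:\nBefore we proceed any further, hear me speak.\n"
chars, encode, decode = build_char_tokenizer(corpus)
ids = encode(corpus)
train_ids, val_ids = split_90_10(ids)

print(len(chars), len(train_ids), len(val_ids))

# The same rounding rule applied to the corpus size quoted in the text.
n_total = 1_115_394
n_train = int(0.9 * n_total)
print(n_train, n_total - n_train)  # 1003854 111540
```

Because every character of the corpus is in the vocabulary by construction, `decode(encode(corpus))` returns the corpus unchanged.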

Two strengths of character-level:

  • No out-of-vocabulary problem. Every input is representable.
  • Tiny vocabulary. $\lvert V \rvert = 65$ means the embedding table and the unembedding matrix are minuscule. At $d_{\text{model}} = 128$, that’s $65 \times 128 = 8{,}320$ parameters per matrix. Practically free.

One serious cost: every word becomes ~5 tokens, so the sequence the transformer has to process is ~5× longer than it would be with a word-level tokenizer. Attention is $O(T^2)$ in the sequence length $T$. Five times the tokens means twenty-five times the attention cost. Long sequences are how character-level pays for the rest.
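
The quadratic penalty is easy to check by counting query-key pairs. A small sketch, assuming a hypothetical 200-word passage and the rough 5-characters-per-word figure from the paragraph above:

```python
def attention_pairs(seq_len: int) -> int:
    """Number of query-key pairs in full self-attention over seq_len tokens."""
    return seq_len * seq_len


words = 200          # hypothetical passage length
chars_per_word = 5   # rough figure used in the text

word_level_T = words
char_level_T = words * chars_per_word

ratio = attention_pairs(char_level_T) // attention_pairs(word_level_T)
print(char_level_T, ratio)  # 1000 25
```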

Char-level tiny-shakespeare

What is vocab_size for the character-level tokenizer on tiny-shakespeare?

Word-level: vocab explodes, OOV strikes

Switch the widget to word. Now every whitespace-separated unit is its own token. Punctuation gets its own token. This feels intuitive, because it’s how humans segment language.

Two reasons it doesn’t scale:

  • Vocabulary explosion. English has hundreds of thousands of distinct word forms (run, runs, ran, running, runner, runners, rerun, outrun…). Modern dictionary-style tokenizers built this way sit around $\lvert V \rvert = 500{,}000$ or more.
  • Out-of-vocabulary at inference. Any string the tokenizer didn’t see during training has no id. Old NLP systems had a special <UNK> token for this. Models trained with <UNK> learn to predict <UNK> in places where the real next token was rare, which means they hallucinate “unknown” instead of producing the actual word.
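
The out-of-vocabulary failure can be reproduced with a toy word-level tokenizer. This is a sketch under stated assumptions: the regex split, the helper names, and the two-sentence training text are all invented for illustration and do not describe any particular historical system.

```python
import re

UNK = "<UNK>"


def split_words(text: str) -> list[str]:
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def build_word_vocab(corpus: str) -> dict[str, int]:
    """Assign ids in order of first appearance; id 0 is reserved for <UNK>."""
    vocab = {UNK: 0}
    for tok in split_words(corpus):
        vocab.setdefault(tok, len(vocab))
    return vocab


def encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Unseen words have no id of their own, so they collapse to <UNK>."""
    return [vocab.get(tok, vocab[UNK]) for tok in split_words(text)]


def decode(ids: list[int], vocab: dict[str, int]) -> str:
    id_to_tok = {i: tok for tok, i in vocab.items()}
    return " ".join(id_to_tok[i] for i in ids)


vocab = build_word_vocab("the cat runs. the dog ran.")
ids = encode("the cat outruns the dog.", vocab)

print(ids)                 # [1, 2, 0, 1, 5, 4]
print(decode(ids, vocab))  # the cat <UNK> the dog .
```

The word "outruns" is gone after encoding: the id stream carries no trace of what it was, even though "runs" is in the vocabulary.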

The vocab cost is real. The embedding table at the top of the model is $\lvert V \rvert \times d_{\text{model}}$. The unembedding at the bottom (often weight-tied, same matrix transposed) is the same shape. Doubling the vocabulary doubles both.
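
The size arithmetic is a single multiplication. A small sketch with hypothetical helper names, using $d_{\text{model}} = 128$ from this lesson, 4 bytes per float32 parameter, and the convention 1 MB = 10^6 bytes (an assumption; some tools use 2^20):

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Parameter count of a |V| x d_model embedding (or unembedding) matrix."""
    return vocab_size * d_model


def embedding_megabytes(vocab_size: int, d_model: int, bytes_per_param: int = 4) -> float:
    """Storage in MB, assuming 1 MB = 10**6 bytes and float32 by default."""
    return embedding_params(vocab_size, d_model) * bytes_per_param / 1e6


d_model = 128
for name, vocab_size in [("character-level", 65), ("GPT-2 BPE", 50_257)]:
    params = embedding_params(vocab_size, d_model)
    size_mb = embedding_megabytes(vocab_size, d_model)
    print(f"{name}: {params:,} params, {size_mb:.3f} MB")

# Doubling the vocabulary doubles the table.
assert embedding_params(2 * 65, d_model) == 2 * embedding_params(65, d_model)
```

Plugging in a word-level vocabulary size shows how quickly the table comes to dominate a small model.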

What word-level costs you

Take a word-level English tokenizer with $\lvert V \rvert = 500{,}000$ entries, paired with $d_{\text{model}} = 128$ and stored in float32. What’s the embedding-table size in megabytes?

Subword: the compromise that won

Character-level is too long. Word-level is too big and has an OOV problem. Subword tokenization picks pieces somewhere in between: common words become a single token, rare words decompose into smaller chunks the tokenizer has seen before.

The dominant subword algorithm is byte-pair encoding (BPE). Train it once on a corpus, ship the resulting merge list with the model, and you get:

  • Common words (the, and, with) become single tokens.
  • Inflections (runs, running) usually share a stem token plus a suffix token.
  • Brand-new words still tokenize without <UNK>; they just decompose into more pieces.
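
Encoding with a trained merge list can be sketched as follows. This is one common way to apply BPE merges (repeatedly merge the highest-priority adjacent pair), offered as an assumption about how a tokenizer like the widget's behaves rather than a description of its actual code. The `merges` list is a small subset of the 64-entry merge table shown earlier in this lesson, kept in the same relative order, so results for other inputs may differ from the widget.

```python
def bpe_encode(text: str, merges: list[tuple[str, str]]) -> list[str]:
    """Split text into characters, then apply merges in priority order."""
    ranks = {pair: i for i, pair in enumerate(merges)}  # lower index = higher priority
    pieces = list(text)
    while len(pieces) > 1:
        pairs = [(pieces[i], pieces[i + 1]) for i in range(len(pieces) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no adjacent pair is in the merge list
        merged, i = [], 0
        while i < len(pieces):
            if i + 1 < len(pieces) and (pieces[i], pieces[i + 1]) == best:
                merged.append(pieces[i] + pieces[i + 1])
                i += 2
            else:
                merged.append(pieces[i])
                i += 1
        pieces = merged
    return pieces


# Subset of the lesson's merge table (entries 1, 4, 8, 15, 16, 20, 43, 55).
merges = [
    ("t", "h"), ("th", "e"), ("i", "r"), ("ir", "s"),
    ("irs", "t"), ("F", "irst"), ("e", "r"), ("the", "y"),
]

print(bpe_encode("they", merges))        # ['they']
print(bpe_encode("First", merges))       # ['First']
print(bpe_encode("strawberry", merges))  # ['s', 't', 'r', 'a', 'w', 'b', 'er', 'r', 'y']
```

Words that were frequent in the training paragraph collapse to one piece, while an unseen word such as strawberry still encodes, just into more pieces and with no `<UNK>`.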

GPT-2 and GPT-3 use BPE over raw UTF-8 bytes (vocab of 50,257). GPT-4 uses a larger byte-level BPE (cl100k_base, vocab ≈ 100,000). LLaMA uses a different family (SentencePiece) with similar economics.

The tokenizer in this widget’s BPE mode is real. It was trained inline on a paragraph of Shakespeare, with 64 merges. It is small: the merge table fits on one screen, and you can see exactly which pieces it learned. Click “show merge table” to inspect what it’s actually doing under the hood. Next lesson: how it got there.
