Now make it talk

The model is a distribution, not a sentence

You spent the last three lessons training a transformer. The output of that transformer, at every position, is not a character. It is a probability distribution over all 65 characters in the tokenizer. The model says: “given the last 64 characters, here is how likely each of the 65 next characters is.”

To get an actual character out, something downstream of the model has to pick one. That picker is the sampler. The sampler is not part of the model. It is a small piece of code that runs after the model has done its work, takes the distribution, and converts it into a single token.

The three knobs you are about to drag (temperature, top-k, top-p) all live inside the sampler. None of them change the model. The 206,016 weights are frozen. What changes is which slice of the model’s distribution you allow the sampler to draw from.

The widget below loads a reference checkpoint that we trained once for you (2,000 iters, seed hamlet, ~840 KB), and exposes a prompt plus those three knobs. Press regenerate.

[Interactive widget: a prompt box; a temperature slider (default 0.80; 0.1 = greedy-ish, 2.0 = entropy explosion); a top-k slider (default 40; 1 = argmax, 65 = no truncation); a top-p slider (default 0.95; nucleus mass, 1.0 = no truncation); a sampler seed field (same seed + same knobs = identical sample); a histogram with the legend "nucleus (sampled)", "top-k but outside nucleus", "outside top-k"; and the generated text.]

The histogram in the middle is the model’s actual output distribution at the most recent generated position, sorted from most-probable character (leftmost bar) to least-probable. The blue bars are the nucleus, the set of characters the sampler is actually allowed to draw from. Orange-dim bars are inside top-k but outside the nucleus. Grey bars are outside top-k entirely.

If you want to use the checkpoint you trained in the last lesson instead of the reference one, hit “load your own.”

Drag temperature to 0.1

Set the temperature slider all the way down to 0.1 and regenerate.

The output collapses. You will see something like the the the the the the the the the the, or nd, and the the the the the the. The model has become aggressively repetitive.

Here is what temperature does. Before the softmax, each logit $z_i$ is divided by the temperature $\tau$:

$p_i = \frac{\exp(z_i / \tau)}{\sum_j \exp(z_j / \tau)}$

At $\tau = 1.0$ that is just the model’s raw distribution. At $\tau = 0.1$ every logit is multiplied by 10 before the softmax. That stretches the gap between the largest logit and the rest tenfold, and because the softmax exponentiates those gaps, nearly all of the mass lands on the single most-probable character. The histogram shows one tall bar and 64 invisible ones.

The model still has a distribution over 65 characters. It is just that the distribution is now a near-Dirac, all the mass at one location. The sampler “samples” from it but always returns the same character. From there, the next position’s distribution depends on the (now repetitive) prefix, so the model keeps proposing the same continuation, which makes the next argmax the same character, which is how the the the the happens.
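The temperature formula above can be sketched in a few lines of Python. This is a minimal illustration, not the widget's code: the function name `softmax_with_temperature` and the four-entry logit vector are made up for the example.

```python
import math

def softmax_with_temperature(logits, tau):
    """Return softmax(logits / tau) as a list of probabilities."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a four-character vocabulary (illustration only).
logits = [2.0, 1.0, 0.5, -1.0]

for tau in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, tau)
    print(f"tau={tau}:", [round(p, 3) for p in probs])
```

Printing the three rows side by side shows the same ordering of characters at every temperature; only how sharply the mass concentrates on the first one changes.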

This is called degenerate repetition, and it is the canonical failure mode of low-temperature and greedy decoding. Greedy decoding is the limiting case: $\tau \to 0$, or equivalently top-k = 1. It is also called argmax sampling.

High-confidence softmax

Suppose the model has a two-character vocabulary for the sake of this question (this is not the real capstone vocab; we are just doing math on a small case). It outputs logits [5.0, 3.0]. The sampler runs with temperature 0.5 and no top-k or top-p truncation.

What is the probability assigned to the top-1 token?

Drag temperature to 2.0

Now the other direction. Slide the temperature to 2.0 and regenerate.

The output looks like JuX 9 vK,!Lq.Q?Z3Pj Y?xK i! Tn4. Random characters with no structure. This is the entropy explosion failure mode.

The math mirrors the temperature-0.1 case. Dividing each logit by 2.0 halves every gap between logits, so the softmax flattens. At $\tau = 2.0$ the model still outputs a distribution, but now the top character only has, say, 5% probability instead of 30%. The sampler draws from something much closer to a uniform distribution. The 65-character vocabulary contains a lot of punctuation and capital letters that English usage forbids in most positions, and at high temperature the sampler picks them freely.

The histogram visualizes this. At $\tau = 0.1$ you saw one tall bar. At $\tau = 2.0$ you see 65 short bars of comparable height. The model’s underlying preferences are still there (the leftmost bar is still the tallest) but the gap between most-probable and least-probable has been compressed.
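One way to put a number on "collapse" versus "explosion" is the entropy of the post-temperature distribution. The sketch below is an illustration under assumptions: the logits are made up, and `softmax_t` and `entropy_bits` are hypothetical helper names, not part of the lesson's code.

```python
import math

def softmax_t(logits, tau):
    """Temperature softmax, computed stably."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical logits for a five-character vocabulary (illustration only).
logits = [3.0, 2.0, 1.0, 0.0, -1.0]
max_entropy = math.log2(len(logits))  # entropy of a perfectly uniform distribution

for tau in (0.1, 0.8, 2.0, 100.0):
    h = entropy_bits(softmax_t(logits, tau))
    print(f"tau={tau}: entropy={h:.3f} bits (uniform would be {max_entropy:.3f})")
```

Entropy rises with temperature and approaches the uniform ceiling as $\tau$ grows, which is the sense in which high temperature is an entropy explosion.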

So temperature is a single number that controls the trade-off between coherence and surprise. Low temperature: coherent but repetitive. High temperature: surprising but incoherent. The sweet spot for character-level Shakespeare tends to be around 0.8.

Reset temperature to 0.8 before moving on.

Top-k: throw away the long tail

Set top-k to 1 and regenerate. The output looks just like low-temperature decoding: you get the argmax every time, even at $\tau = 0.8$. The histogram shows 64 grey bars and 1 tall blue one.

Slide top-k up to 5. Now only the five most-probable characters are eligible. The histogram shows 5 colored bars and 60 grey ones. The output is still pretty coherent because the bottom-60 characters of the distribution at any given position are almost always terrible candidates that the model has correctly assigned low probability to. Truncating them away doesn’t hurt much.

Slide top-k up to 40. The histogram shows 40 bars in color, 25 in grey. The output is essentially identical to “no truncation at all” at this checkpoint, because the model already concentrates almost all of its mass in the top ~30 characters.

This is what top-k does. After the temperature softmax, it sorts the distribution and zeros out everything outside the top k. The remaining probabilities are then renormalized to sum to 1. The sampler draws from that renormalized distribution.

Top-k is a hard cutoff. With k = 25, for instance, it says “the 26th-likeliest character is never allowed, regardless of context.” This works fine on average but can be wrong in specific contexts: sometimes the right character genuinely is the 26th most probable, and top-k will refuse to sample it.
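The sort, zero, and renormalize steps described above can be sketched as follows. This is a minimal version with a made-up five-entry distribution; `top_k_filter` is a hypothetical name, and how the widget breaks ties between equal probabilities is not specified in this lesson (this sketch keeps the earlier index).

```python
def top_k_filter(probs, k):
    """Keep the k most probable entries, zero the rest, renormalize to sum to 1."""
    if not 1 <= k <= len(probs):
        raise ValueError("k must be between 1 and len(probs)")
    # Indices sorted from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0 for i in range(len(probs))]

# Hypothetical post-temperature distribution (illustration only).
probs = [0.05, 0.50, 0.10, 0.30, 0.05]
filtered = top_k_filter(probs, 2)
print(filtered)
```

With k = 2 only the entries that started at 0.50 and 0.30 survive, and they are rescaled so that they sum to 1 again.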

Top-p: throw away the long tail more carefully

Set top-k back to 65 (no top-k truncation). Now slide top-p from 1.0 down toward 0.5.

Watch the histogram. At $p = 1.0$ all bars are blue (the nucleus is the whole distribution). At $p = 0.95$ the trailing bars fade out of nucleus color: the leading characters already hold 95% of the probability mass, so the rest is the trailing 5%, which top-p discards. At $p = 0.5$ only the first few bars remain in nucleus color; the rest are dimmed.

The nucleus-boundary marker (the teal dashed line) shows where the prefix-sum of probabilities crosses your $p$ threshold. The set of characters to the left of that line is the nucleus. The sampler draws from inside the nucleus, renormalized.

Top-p is the adaptive cousin of top-k. Instead of a hard count, it asks “how much probability mass do I want?” If the model is very confident at this position (one tall bar, many short ones), top-p 0.95 will give you a small nucleus (maybe 3-5 characters). If the model is genuinely uncertain (many bars of comparable height), top-p 0.95 will give you a larger nucleus. The sampler adapts to the model’s confidence at every position.
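A small sketch makes the adaptive behaviour concrete. It assumes one common convention, that the character whose probability pushes the running sum past $p$ is itself included in the nucleus; the widget may draw the boundary slightly differently. `top_p_filter` and both example distributions are made up for illustration.

```python
def top_p_filter(probs, p):
    """Keep the shortest most-probable prefix whose mass reaches p, then renormalize."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus = []
    cumulative = 0.0
    for i in order:
        nucleus.append(i)          # assumption: the crossing entry is included
        cumulative += probs[i]
        if cumulative >= p:
            break
    keep = set(nucleus)
    return [probs[i] / cumulative if i in keep else 0.0 for i in range(len(probs))]

# Two hypothetical distributions: one confident, one uncertain.
confident = [0.90, 0.06, 0.02, 0.01, 0.01]
uncertain = [0.25, 0.22, 0.20, 0.18, 0.15]

for name, dist in (("confident", confident), ("uncertain", uncertain)):
    kept = top_p_filter(dist, 0.95)
    size = sum(1 for q in kept if q > 0)
    print(f"{name}: nucleus size {size}")
```

The same threshold of 0.95 keeps two entries of the confident distribution and all five of the uncertain one, whereas a fixed top-k would keep the same count in both cases.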

Many modern sampling setups use top-p, and settings around $\tau \approx 0.8$ and $p \approx 0.9$ to $0.95$ are common choices. The defaults a given API ships with vary by provider, so check them rather than assume.

When sampling stops being random

You set temperature very close to 0 and top-k to 1. There is no randomness left; the sampler always picks the single most-probable character. People call this greedy decoding (or argmax sampling).

What is the probability the sampler assigns to the chosen character at every step?

Prompt engineering

Type ROMEO: into the prompt textbox (it is the default) and regenerate. The model produces a dialog-formatted continuation, a character name followed by a line of speech, because tiny-shakespeare is largely written in that format and the model picked up the convention.

Try QUEEN:. Same structural pattern. Try let me not to the marriage, the opening of Sonnet 116. The model will not produce the rest of the sonnet (it does not have anywhere near the capacity to memorize specific lines), but the continuation will be vaguely sonnet-like in shape.

The “prompt” is just the context the model conditions on. The transformer reads the last 64 characters and emits a distribution for the next one. Whatever you type seeds the autoregressive loop. The model does not understand your prompt the way a chat model does; it just continues from it.
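The autoregressive loop itself is short. The sketch below swaps the real transformer for a toy stand-in, `toy_model`, that simply favours the next character in a five-character vocabulary; the vocabulary, the stand-in, and the function names are all hypothetical. Only the 64-character context window comes from the lesson.

```python
import random

VOCAB = list("abc \n")   # hypothetical tiny vocabulary, not the real 65 characters
CONTEXT_LEN = 64         # the model only ever sees the last 64 characters

def toy_model(context):
    """Stand-in for the transformer: returns one probability per VOCAB entry.

    It puts 0.9 on the character that follows the last context character
    in VOCAB and spreads the remaining 0.1 over the others.
    """
    last = context[-1] if context else VOCAB[-1]
    nxt = (VOCAB.index(last) + 1) % len(VOCAB)
    probs = [0.1 / (len(VOCAB) - 1)] * len(VOCAB)
    probs[nxt] = 0.9
    return probs

def generate(prompt, n_new, rng):
    """Extend the prompt one character at a time."""
    text = prompt
    for _ in range(n_new):
        context = text[-CONTEXT_LEN:]      # truncate to the context window
        probs = toy_model(context)         # the model: context -> distribution
        next_char = rng.choices(VOCAB, weights=probs, k=1)[0]  # the sampler: one draw
        text += next_char                  # the draw becomes part of the next context
    return text

sample = generate("ab", 20, random.Random(0))
print(repr(sample))
```

The prompt never receives special treatment: it is just the initial value of `text`, and every generated character joins it as context for the next step.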

This is what people call in-context learning in the small. The model adapts its next-token distribution to match the patterns in the input. It is not learning anything in the gradient-update sense (the weights are frozen). It is conditioning. And conditioning is most of what a language model does.

Same knobs, different seed

Change the sampler seed from go to flow (or anything). Regenerate. Keep the prompt and the three knobs identical.

You get a different sample.

Now change the seed back to go. Regenerate. You get the exact same sample as the first time.

This is the same byte-for-byte determinism story from lesson 18.3. The model’s distribution at every position is a deterministic function of the input. The sampler’s draws are pseudo-random, but the PRNG is seeded by your seed string. Same seed, same draws, same output. Different seed, different draws, different output, but the space of outputs the sampler is choosing from is identical because the model is the same.
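The seed behaviour can be reproduced with Python's standard `random` module, which accepts a string as a seed. This is a sketch only: the widget's actual PRNG and the way it turns a seed string into PRNG state are not described in this lesson, and `sample_sequence` and the three-entry distribution are made up.

```python
import random

def sample_sequence(probs, n, seed):
    """Draw n indices from a fixed distribution using a PRNG seeded by `seed`."""
    rng = random.Random(seed)          # string seeds are hashed deterministically
    indices = range(len(probs))
    return [rng.choices(indices, weights=probs, k=1)[0] for _ in range(n)]

# Hypothetical fixed distribution standing in for the model's output.
probs = [0.5, 0.3, 0.2]

first = sample_sequence(probs, 20, "go")
other = sample_sequence(probs, 20, "flow")
again = sample_sequence(probs, 20, "go")

print(first)
print(other)
print(first == again)
```

The two runs seeded with "go" match draw for draw, while the "flow" run will almost certainly differ, even though all three draw from the same distribution.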

This is how research reproducibility works. Reporting “we sampled with $\tau = 0.8$, top-p $= 0.95$, seed = 42” alongside the prompt and the top-k setting pins the sample down: anyone with the same checkpoint and the same sampler code can reproduce the exact same generation.

What you're not editing

You have been turning knobs for 25 minutes. You have changed the temperature, the top-k, the top-p, the prompt, the seed. None of these changed the model. The 206,016 weights are the same weights they were when the page loaded.

You are editing the post-processing of the model’s distribution. The model produces 65 probabilities at every position; the sampler decides which one to draw. That is the whole sampling story.
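Put together, the whole sampler fits in one function. The sketch below assumes a particular order (temperature softmax, then top-k, then top-p applied to the renormalized top-k survivors, then one draw), which is consistent with the histogram legend but is not confirmed as the widget's implementation. `sample_next` and the logits are hypothetical.

```python
import math
import random

def sample_next(logits, tau, top_k, top_p, rng):
    """Return the index of one sampled token; the logits are read, never modified."""
    # 1. Temperature softmax.
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 2. Top-k: keep the k most probable indices.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_mass = sum(probs[i] for i in order)

    # 3. Top-p: shortest prefix of the survivors whose renormalized mass reaches top_p.
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i] / k_mass
        if cumulative >= top_p:
            break

    # 4. One draw; rng.choices rescales the weights, so no explicit renormalization.
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]

# Hypothetical logits for a four-character vocabulary (illustration only).
logits = [0.1, 2.0, -1.0, 1.5]
rng = random.Random("go")
print(sample_next(logits, tau=0.8, top_k=1, top_p=1.0, rng=rng))   # top-k = 1: argmax
print(sample_next(logits, tau=0.8, top_k=4, top_p=0.95, rng=rng))
```

Every knob appears only as an argument to the sampler, and `logits` comes out of the call exactly as it went in.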

If you wanted to change what the model knows, you would have to train more (with more data, or for more steps, or on different text). If you wanted to change what the model says given what it knows, you have these three knobs. Production LLMs expose the same three knobs, plus a few others (frequency penalty, presence penalty, logit bias), and that is the entire sampler surface.

Next lesson: the credits roll. The course is closing on itself.
