The Transformer Block · 18 min

The residual stream as the object

Stop thinking of the residual stream as plumbing and start thinking of it as the noun the model is operating on. Every block reads from it, computes a delta, writes the delta back. The forward pass is a sum of corrections layered onto a bigram floor.


A reframe before the math

The previous two lessons treated the residual stream as plumbing: a wire that carries information between blocks, with the blocks doing the “real” work. That framing is fine for getting through the architecture diagram. It’s wrong for understanding what the model is actually doing.

The reframe (due to Anthropic’s mech-interp lineage, Elhage et al. 2021) is this: the residual stream is the noun. The blocks are read/write operations on it. Every sub-layer reads the current state of the stream, computes a delta, and adds the delta back. The state of the model at any layer is the stream, not the block.

This isn’t a stylistic choice. It changes what you can do.

Composition by addition

Pre-LN’s residual structure means every sub-layer’s output is added to the trunk. So after $N$ blocks the residual stream is:

$$h_N \;=\; h_0 \;+\; \sum_{\ell=1}^{N} \Delta_\ell$$

where each $\Delta_\ell$ is the sum of that block’s two sub-layer contributions. The whole forward pass, the entire computation that turns input embeddings into the state that gets unembedded into logits, is a sum of vectors. Blocks compose only by addition: nothing nonlinear is applied to the stream between them. The nonlinearities (LayerNorm, softmax attention, and GELU) all sit inside the computation of an individual $\Delta_\ell$.

This has two consequences worth pinning down:

  1. You can attribute any final-layer activation to a particular block by reading off that block’s contribution to the relevant dimension. If block 3’s $\Delta$ wrote $+0.7$ into dimension 17 of the residual stream at position 5, that contribution is unambiguous and additive.
  2. You can decompose the forward pass into “paths”: direct path (just embedding → unembedding), single-block paths, two-block paths. This is the substrate of circuit-level analysis.
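To make the additive bookkeeping concrete, here is a minimal NumPy sketch. It is not the lesson’s toy model: the sizes, the random weights, and the `toy_sublayer` helper are all hypothetical stand-ins for real attention and FFN sub-layers. The only thing it is meant to show is that a pre-LN read/compute/add loop leaves you with $h_N = h_0 + \sum_\ell \Delta_\ell$, and that a single coordinate can be attributed block by block.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_pos, n_blocks = 32, 8, 4  # toy sizes, chosen only for illustration


def layer_norm(x, eps=1e-5):
    # Per-position normalization; learned gain and bias are omitted in this sketch.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def toy_sublayer(W):
    # Hypothetical stand-in for attention or FFN: read LN(stream), return a delta.
    return lambda h: 0.5 * np.tanh(layer_norm(h) @ W)


# Two sub-layers per block (standing in for "attn" then "ffn").
sublayers = [toy_sublayer(rng.normal(size=(d, d)) / np.sqrt(d))
             for _ in range(2 * n_blocks)]

h0 = rng.normal(size=(n_pos, d))  # stand-in for the embedded input
h = h0.copy()
deltas = []
for f in sublayers:
    delta = f(h)       # read the current stream and compute a delta
    deltas.append(delta)
    h = h + delta      # write: add the delta back to the stream

# Each block's delta is the sum of its two sub-layer contributions.
block_deltas = [deltas[2 * i] + deltas[2 * i + 1] for i in range(n_blocks)]

# The final state is h0 plus the sum of all block deltas.
assert np.allclose(h, h0 + sum(block_deltas))

# Attribution: what each block wrote into dimension 17 at position 5.
per_block = np.array([bd[5, 17] for bd in block_deltas])
print("embedding:", h0[5, 17])
print("per block:", per_block)
print("total    :", h0[5, 17] + per_block.sum(), "vs", h[5, 17])
```

Note that each delta is still computed from the stream state at its own layer, so the deltas are not independent of one another. The sum is exact bookkeeping, not a claim that blocks can be reordered.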

Look at the stream

Below is the same scripted toy model from lesson 16.1, with a new mode flipped on: decompose. Click any cell in the grid to pick a residual-stream state. Then flip the toggle in the vector pane to see the full additive breakdown: which sources contributed what to that vector.

[Interactive grid: residual stream. Rows: states inside the stack (h₀ (embed), then “after attn” and “after ffn” for blocks 1 through 4). Columns: token positions 0 to 4 ("h", "e", "l", "l", "o"). Vector pane: the selected state, shown across dims 0 to 15.]

Pick a cell deep in the stack (say, position 4 at the top row). In raw mode you see the cumulative vector: sea bars positive, coral bars negative. Switch to decompose. The same vector is now split into one row per source: $h_0$ at the top (orange), then attn-1 (sea), ffn-1 (coral), attn-2, ffn-2, all the way up. Each row is one sub-layer’s contribution to this exact vector. They sum component-wise to the raw vector you saw before.

Notice that block 3’s FFN contributes much more to position 4 than to other positions; that’s the scripted late-layer “output specialization” baked into the demo data. In a real trained model you would see real specialization, but the structural rule is the same: every block’s contribution at every position is additive, isolated, and inspectable.

When a block does nothing

At some position, block $\ell$’s delta vector is exactly the zero vector. What does block $\ell$ contribute to that token’s residual stream, in scalar terms?

The bigram floor

There is a particularly tidy special case of the additive picture: what does the model predict if you turn off every block?

With $N = 0$ (every block replaced by the identity map) the forward pass collapses to:

$$\mathrm{logits} \;=\; \mathrm{LN}(W_E[t] + p_t)\, W_E^\top$$

If we ignore the positional encoding and the LayerNorm gain, this is exactly the bigram model from m14: take the token embedding for the input token, dot it against every token’s embedding, including its own (because of weight tying, the embedding and unembedding are the same matrix), then softmax. The result is $P(\text{next} \mid \text{this token})$ as a function of nothing but the current token.

Every transformer is a bigram model with a stack of additive corrections. Each block buys you context-dependent edits to the bigram baseline. Turn the blocks off, you get the bigram. Turn them all on, you get a full transformer.
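The zero-block case is small enough to write out directly. The sketch below assumes a random stand-in for $W_E$ over an eight-symbol vocabulary and drops the positional term $p_t$ and the LayerNorm gain, as the text does; the names `W_E`, `layer_norm`, and `bigram_floor` are illustrative only, and the probabilities it prints will not match any particular demo or trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["a", "e", "h", "l", "o", "r", "t", "·"]
d = 16
W_E = rng.normal(size=(len(vocab), d))  # random stand-in for tied embeddings


def layer_norm(x, eps=1e-5):
    # Normalization only; the learned gain and bias are ignored in this sketch.
    return (x - x.mean()) / np.sqrt(x.var() + eps)


def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def bigram_floor(token):
    # N = 0: no blocks, so the stream is just the token embedding.
    h0 = W_E[vocab.index(token)]        # positional term p_t ignored
    logits = layer_norm(h0) @ W_E.T     # weight tying: unembed with W_E^T
    return softmax(logits)


probs = bigram_floor("h")
for tok, p in zip(vocab, probs):
    print(f"{tok}: {p:.1%}")
```

Because nothing but `W_E[t]` enters the computation, the output is a function of the current token alone, which is what makes it a bigram predictor.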

[Interactive demo: next-token prediction for input token "h", block budget k = 0 / 4.]
k = 0: identity-only; every block replaced with the identity map. The output is exactly the bigram floor: $\mathrm{softmax}(W_E[\text{h}] \cdot W_E^\top)$.
Next-token probabilities at k = 0: a 4.3%, e 5.3%, h 46.4%, l 8.9%, o 2.9%, r 13.2%, t 9.2%, · 9.8%.

Drag the block budget from $k = 0$ (bigram floor) to $k = 4$ (full model). At $k = 0$, the next-token distribution after “h” is whatever the embedding geometry gives you, basically random, because token embeddings at initialization don’t know about co-occurrence. As $k$ grows, each block’s delta nudges the residual stream toward the direction of the right next token. By $k = 4$, the model has committed to “e” with high probability.

The point isn’t which block does what; it’s that the answer is built up additively from a bigram baseline. You can stop the computation at any depth and read off “the model’s current best guess.” That trick is called the logit lens (nostalgebraist 2020), and it works here because of weight tying plus pre-LN: every intermediate residual-stream vector lives in the same space, so it can be decoded by the same final LayerNorm and unembedding matrix.
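Here is a rough sketch of that early-exit readout. Everything in it is an assumption made for illustration: the weights are random, each “block” is a single tanh layer rather than attention plus FFN, and `logit_lens` is a hypothetical helper, not a library function. With untrained weights the readouts will not converge on a sensible token; the sketch only shows the mechanics of decoding every partial sum with one shared unembedding.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, d, n_blocks = 8, 16, 4
W_E = rng.normal(size=(vocab_size, d))  # tied embedding / unembedding (random)
block_weights = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_blocks)]


def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def decode(h):
    # The same readout is applied at every depth.
    return softmax(layer_norm(h) @ W_E.T)


def logit_lens(token_id):
    h = W_E[token_id]
    readouts = [decode(h)]                   # k = 0: the bigram floor
    for W in block_weights:
        h = h + np.tanh(layer_norm(h) @ W)   # toy block: add a delta
        readouts.append(decode(h))           # decode the stream after k blocks
    return np.stack(readouts)                # shape (n_blocks + 1, vocab_size)


lens = logit_lens(token_id=2)
for k, row in enumerate(lens):
    print(f"k={k}: argmax={row.argmax()}, p={row.max():.2f}")
```

Row $k$ of the result plays the role of the block-budget slider: it is the distribution you get by stopping after $k$ blocks.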

Bandwidth contention

One more consequence of the additive framing, slightly more advanced but important enough to flag.

The residual stream has $d$ dimensions. Each block’s two sub-layers want to write into it. With $h$ heads per block and an FFN of width $4d$, the number of computational dimensions that want to deposit information into the stream at each layer is much larger than $d$. In GPT-2-small (with $h = 12$, $d_{\text{head}} = 64$, $d_{\text{ff}} = 3072$, $d = 768$):

  • Attention writes via $W_O$, which has $d \cdot d = d^2$ params and outputs $d$ dims.
  • FFN writes via $W_2$, which has $4d \cdot d = 4d^2$ params and outputs $d$ dims.

That’s $5d^2$ parameters trying to fit information into a $d$-dimensional output stream per block. They have to negotiate. The result, empirically, is that heads and FFN neurons specialize: some neurons appear to delete information already in the stream (writing in the negative of an existing direction), some heads’ $W_O$ deliberately operates in subspaces that don’t overlap with other heads.

This bandwidth pressure is one plausible reason architectures with very many small heads tend to underperform fewer larger ones at fixed parameter count: each head gets a tinier subspace to work in, and the residual stream is the same width regardless. Modern LLMs typically settle on $d_{\text{head}} \in [64, 128]$, which is consistent with this picture.
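The GPT-2-small numbers quoted above are easy to check by hand. The short script below just does the arithmetic for the two write matrices, ignoring bias terms; the variable names are my own.

```python
# GPT-2-small shape constants, as quoted in the text.
d, n_heads, d_head, d_ff = 768, 12, 64, 3072

assert n_heads * d_head == d   # the heads' outputs concatenate to width d
assert d_ff == 4 * d           # FFN hidden width is 4d

w_o_params = d * d             # attention output projection W_O
w_2_params = d_ff * d          # FFN output projection W_2
write_params = w_o_params + w_2_params

print(w_o_params)              # 589824   (= d^2)
print(w_2_params)              # 2359296  (= 4 d^2)
print(write_params)            # 2949120  (= 5 d^2)
print(write_params // d)       # 3840     (= 5d write parameters per stream dim)
print(d_ff // d)               # 4        (FFN hidden dims per stream dim)
```

So each of the 768 stream dimensions is the target of 3,840 write-side weights per block, before counting biases.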

How many ways into a stream?

GPT-2-small has $d = 768$ and FFN width $4d = 3072$. What is the ratio (FFN hidden dims) : (residual-stream dims), i.e., how many “computational dimensions” want to write at each layer relative to $d$?

Why this matters past M16

Three reasons to keep the residual-stream framing in your head, even though we won’t directly use it for the next two lessons:

  1. Interpretability research lives here. Every interpretability technique worth knowing (logit lens, tuned lens, attribution patching, sparse autoencoders, induction-head circuits) is some operation on the residual stream. They don’t make sense if you think of the model as “blocks transforming a vector.”
  2. Editing the model becomes possible. Once you see that block 4 wrote a specific direction into the stream at position 5, you can intervene: zero out that specific direction and observe what changes. This is how Anthropic’s circuits work proceeds.
  3. The bookkeeping is order-independent. $h_N = h_0 + \sum_\ell \Delta_\ell$ doesn’t care about order, because vector addition is commutative and associative. The model does (each $\Delta_\ell$ depends on the cumulative state up to its layer), but the bookkeeping doesn’t. This is why thinking in terms of paths through the stream is tractable.
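The intervention in item 2 can be sketched as a projection: remove the component of one position’s residual vector along a chosen direction and leave everything else alone. This is an illustrative sketch on random data, and `ablate_direction` is a hypothetical helper rather than an API from any interpretability library. In practice the direction would come from a block’s delta or a learned probe, and you would rerun the rest of the forward pass on the edited stream.

```python
import numpy as np


def ablate_direction(h, direction, pos):
    """Return a copy of h with `direction` projected out at one position."""
    u = direction / np.linalg.norm(direction)   # unit vector along the direction
    edited = h.copy()
    edited[pos] = h[pos] - (h[pos] @ u) * u     # subtract the component along u
    return edited


rng = np.random.default_rng(0)
h = rng.normal(size=(8, 16))      # toy stream: 8 positions, 16 dims
direction = rng.normal(size=16)   # stand-in for "the direction block 4 wrote"

edited = ablate_direction(h, direction, pos=5)
u = direction / np.linalg.norm(direction)

print("component along u before:", h[5] @ u)
print("component along u after :", edited[5] @ u)   # approximately zero
```

Only the component along `u` at position 5 changes; every other position, and everything orthogonal to `u` at position 5, is left untouched.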

Lesson 16.4 takes the residual stream and the block as given, stacks the block $N$ times, adds the final LayerNorm and the unembedding, and finishes the architecture. Lesson 16.5 counts what it costs.
