The wall, restated
The Bengio MLP needs to grow with the context window. Twice the context, twice the parameters. That works for short contexts but not for long ones: the model gets too big to fit, too slow to train, and too data-hungry to learn.
We want context that scales without exploding parameters. The recurrent neural network’s answer: instead of storing the context as a concatenation of the last few characters, store it as a single vector $h_t$ that gets updated at every step.
Same parameters, every step. The vector remembers; the parameters don’t change.
The cell, in one line
The RNN cell takes the current input and the previous hidden state, and produces a new hidden state:

$$h_t = \tanh(W_x x_t + W_h h_{t-1} + b)$$

That’s the whole architecture.
- $x_t$ is the input at step $t$ (for us, a one-hot character vector).
- $h_{t-1}$ is the hidden state carried over from the previous step. $h_0$ is initialized to zeros (or, sometimes, learned).
- $W_x$, $W_h$, $b$ are learned parameters. Same $W_x$, same $W_h$, same $b$ at every timestep.
- $\tanh$ keeps each component of $h_t$ bounded in $(-1, 1)$.
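The cell described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the sizes (`V = 5` characters, hidden size `d = 3`), the random initialization, and the function name `rnn_cell` are assumptions made for the example.

```python
import numpy as np

def rnn_cell(x_t, h_prev, W_x, W_h, b):
    """One step of the recurrence: tanh(W_x x_t + W_h h_prev + b)."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

# Hypothetical sizes, chosen only for illustration.
V, d = 5, 3
rng = np.random.default_rng(0)
W_x = rng.normal(scale=0.5, size=(d, V))  # input weights
W_h = rng.normal(scale=0.5, size=(d, d))  # recurrent weights
b = np.zeros(d)                           # bias

x_1 = np.zeros(V)
x_1[2] = 1.0                              # one-hot vector for character index 2
h_0 = np.zeros(d)                         # initial hidden state

h_1 = rnn_cell(x_1, h_0, W_x, W_h, b)
print(h_1.shape)                          # the new state has the same size as the old one
```

Calling `rnn_cell` again with `h_1` in place of `h_0` gives the next state; the parameters passed in stay the same.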
Note one quiet thing: because $x_t$ is one-hot, $W_x x_t$ is just one column of $W_x$ (the column indexed by the character). So $W_x$ acts as the input-side embedding lookup, exactly like the embedding matrix in the Bengio MLP.
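A quick check of the one-hot observation, under assumed sizes (a 3-by-5 matrix filled with consecutive integers so the columns are easy to recognize; the values are purely illustrative):

```python
import numpy as np

V, d = 5, 3                                    # hypothetical vocabulary and hidden sizes
W_x = np.arange(d * V, dtype=float).reshape(d, V)

idx = 2                                        # index of the current character
x = np.zeros(V)
x[idx] = 1.0                                   # one-hot encoding of that character

via_matmul = W_x @ x                           # full matrix-vector product
via_lookup = W_x[:, idx]                       # direct column lookup

print(np.array_equal(via_matmul, via_lookup))  # the two agree
```

In practice the lookup form is usually preferred, since it skips the multiplications by zero.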
Roll. Then unroll.
There is one cell. One. The widget starts in “rolled” view: a single box with a self-loop, because the output of the cell becomes its own next input. That self-loop is the recurrence.
Press Unrolled and the widget redraws the same recurrence as a chain of cells, one per timestep. They are the same cell. They share their parameters. The cell boxes you see in the unrolled view are not separate models; they’re the same model evaluated at different times. The arrow between them is the wire that carries $h_t$ forward.
This is unrolling. It’s how we draw an RNN for the purpose of running backprop on it: take the temporal recurrence and lay it out spatially as a deep feedforward graph. Once it’s flat, the chain rule from m12 applies to it without modification.
Scrub the timestep cursor. Notice that $h_t$ (the bar vector under each cell) gets overwritten every step. There is no growing memory bank. There is one vector that the cell rewrites each time it sees a new input.
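The rolled and unrolled views compute the same thing, which a small sketch can confirm. The sizes, the random weights, and the three-character input below are assumptions for illustration; `cell` is a hypothetical helper implementing the recurrence.

```python
import numpy as np

def cell(x_t, h_prev, W_x, W_h, b):
    # Same function, same parameters, whichever timestep calls it.
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

rng = np.random.default_rng(1)
V, d = 4, 3                                  # hypothetical sizes
W_x = rng.normal(scale=0.5, size=(d, V))
W_h = rng.normal(scale=0.5, size=(d, d))
b = rng.normal(scale=0.1, size=d)
xs = np.eye(V)[[0, 3, 1]]                    # three one-hot inputs (rows of the identity)

# Rolled view: one cell, one vector, overwritten in a loop.
h = np.zeros(d)
for x_t in xs:
    h = cell(x_t, h, W_x, W_h, b)

# Unrolled view: the same cell written out once per timestep.
h0 = np.zeros(d)
h1 = cell(xs[0], h0, W_x, W_h, b)
h2 = cell(xs[1], h1, W_x, W_h, b)
h3 = cell(xs[2], h2, W_x, W_h, b)

print(np.allclose(h, h3))                    # both views end in the same state
```

The unrolled version keeps `h1`, `h2`, `h3` around only because it names them; the loop version holds a single vector at any moment.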
One cell, by hand
The cell has hidden dim . Suppose:
Compute the first component of . (Use .) Round to three decimals.
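A by-hand computation of this kind can be checked numerically. The sketch below uses made-up values: hidden size 2, a two-character vocabulary, and the specific entries of `W_x`, `W_h`, and `b` are all assumptions for illustration, and it takes the initial hidden state to be zero.

```python
import numpy as np

# Hypothetical parameters, chosen so the arithmetic is easy to follow by hand.
W_x = np.array([[0.5, -0.2],
                [-0.3, 0.4]])
W_h = np.array([[0.1, 0.2],
                [0.3, -0.1]])
b = np.array([0.1, 0.0])

x_1 = np.array([1.0, 0.0])      # one-hot: selects column 0 of W_x
h_0 = np.zeros(2)               # zero initial state, so W_h @ h_0 contributes nothing

pre = W_x @ x_1 + W_h @ h_0 + b # first component: 0.5 + 0 + 0.1 = 0.6
h_1 = np.tanh(pre)

print(round(float(h_1[0]), 3))  # tanh(0.6), i.e. 0.537
```

With a zero initial state the recurrent term drops out, so the first step reduces to the tanh of one column of `W_x` plus the bias.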
Hidden state is just a vector
A common misconception is that the “hidden state” of an RNN is some special, mystical “memory” object, different in kind from a layer activation in a feedforward network.
It is not. Look at the unrolled graph above. $h_t$ is the output of one tanh layer of one cell. It’s a vector of numbers. The next “layer” (the cell at step $t+1$) happens to be the same cell that produced it. That’s the only special thing.
The bars under each cell in the widget make this concrete. There is no growing memory bank. There is no list of past tokens kept somewhere. There is one vector of numbers, and at each step the cell does:

$$h_t = \tanh(W_x x_t + W_h h_{t-1} + b)$$

…and writes it on top of the old vector. That is the entire memory mechanism. Anything the model “remembers” about timestep 1 by timestep 50 has had to survive 49 such overwrites.
That last point is going to matter a lot in the next lesson.
The output: one more matrix
The hidden state is the model’s internal state. To turn it into a prediction over the next character, project to vocabulary size and softmax:

$$p_t = \mathrm{softmax}(W_y h_t + b_y)$$

$W_y$ has shape $V \times d$ (vocabulary size by hidden size). Notice the output also has shape $V$: a row of probabilities, like every model in this course. That row is what we sample from at generation time and what we evaluate the NLL loss against at training time.
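The output step can be sketched as follows. The sizes, the random weights, and the stand-in hidden state are assumptions for illustration, and `softmax` is a small helper defined here rather than a library call.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector of logits."""
    z = z - np.max(z)            # shifting by the max does not change the result
    e = np.exp(z)
    return e / e.sum()

V, d = 5, 3                      # hypothetical vocabulary and hidden sizes
rng = np.random.default_rng(2)
W_y = rng.normal(scale=0.5, size=(V, d))  # output projection
b_y = np.zeros(V)                          # output bias
h_t = np.tanh(rng.normal(size=d))          # stand-in for a hidden state

p_t = softmax(W_y @ h_t + b_y)   # one probability per character
print(p_t.shape, p_t.sum())
```

Sampling the next character then amounts to drawing an index with these probabilities, for example with `rng.choice(V, p=p_t)`.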
Total parameters of the whole RNN-LM:
- $W_x$: $d \times V$ (the input-side embedding lookup)
- $W_h$: $d \times d$ (the recurrence)
- $b$: $d$
- $W_y$: $V \times d$ (the output projection)
- $b_y$: $V$
The total is $2dV + d^2 + d + V$ for hidden size $d$ and vocabulary size $V$, independent of how long the sequence is. That’s the entire payoff for switching from the Bengio MLP: the parameter count no longer scales with context window.
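The tally above is easy to reproduce in code. In this sketch the function name `rnn_lm_param_count` and the example sizes (`V = 30`, `d = 16`) are hypothetical; note that sequence length never appears as an argument.

```python
def rnn_lm_param_count(V, d):
    """Per-tensor parameter counts for vocabulary size V and hidden size d."""
    return {
        "W_x": d * V,   # input weights
        "W_h": d * d,   # recurrent weights
        "b": d,         # hidden bias
        "W_y": V * d,   # output projection
        "b_y": V,       # output bias
    }

counts = rnn_lm_param_count(V=30, d=16)   # illustrative sizes only
total = sum(counts.values())
print(counts, total)                      # total is 1262 for these sizes
```

Doubling the sequence length leaves `total` unchanged, because no tensor's shape depends on it.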
W_y parameter count
For a character-level RNN with and hidden size , how many parameters are in just the output projection $W_y$?
The forward pass, all together
Putting it together for a sequence $x_1, \dots, x_T$:
    h = zeros(d)                            # h_0
    loss = 0
    for t in 1..T:
        x_t = onehot(token[t])
        h = tanh(W_x @ x_t + W_h @ h + b)
        p = softmax(W_y @ h + b_y)
        loss += -log(p[target[t]])          # NLL of this step
    loss /= T

That’s the whole forward pass of an RNN language model. Notice three things.
One: W_x, W_h, b, W_y, b_y are referenced inside the loop. The same five tensors get used at every iteration. Weight sharing falls out of writing the recurrence; you don’t have to enforce it manually.
Two: h gets reassigned every iteration. The recurrence overwrites it. There is no list of past hidden states stored unless we ask for one (which we will, in order to backprop through them; that’s bookkeeping, not memory).
Three: The loss accumulates additively across timesteps. We’re predicting every next character, so each step contributes its own NLL.
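For readers who want to run the loop, here is one possible NumPy version. It is a sketch under stated assumptions: the function name `rnn_lm_loss`, the tiny sizes, the random initialization, and the toy token sequence are all hypothetical, and targets are taken to be the input tokens shifted by one position.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                      # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

def rnn_lm_loss(tokens, targets, params):
    """Mean next-character NLL over one sequence."""
    W_x, W_h, b, W_y, b_y = params
    d, V = W_x.shape
    h = np.zeros(d)                        # h_0
    loss = 0.0
    for tok, tgt in zip(tokens, targets):
        x_t = np.zeros(V)
        x_t[tok] = 1.0                     # one-hot input
        h = np.tanh(W_x @ x_t + W_h @ h + b)   # overwrite the hidden state
        p = softmax(W_y @ h + b_y)         # distribution over the next character
        loss += -np.log(p[tgt])            # NLL of this step
    return loss / len(tokens)

V, d = 5, 8                                # illustrative sizes
rng = np.random.default_rng(0)
params = (
    rng.normal(scale=0.1, size=(d, V)),    # W_x
    rng.normal(scale=0.1, size=(d, d)),    # W_h
    np.zeros(d),                           # b
    rng.normal(scale=0.1, size=(V, d)),    # W_y
    np.zeros(V),                           # b_y
)
seq = [0, 3, 1, 4, 2, 2]                   # toy token ids
tokens, targets = seq[:-1], seq[1:]        # predict each next token
loss = rnn_lm_loss(tokens, targets, params)
print(loss)
```

With small random weights the predictions are close to uniform, so the loss should sit near `log(V)`; that makes a handy sanity check before any training.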
This is the model. Next lesson: how to train it (BPTT) and what catastrophically goes wrong with it on long sequences (the bottleneck that motivates attention).