Forward or reverse: pick a direction.
Below is a 2-to-2 function $f: \mathbb{R}^2 \to \mathbb{R}^2$. Its Jacobian at the marked base point is a $2 \times 2$ matrix $J$.
Two modes:
- Forward (JVP): drag the input tangent $v$ on the left. The widget computes the output tangent $Jv$ and shows it on the right.
- Reverse (VJP): drag the output cotangent $u$ on the right. The widget computes the input cotangent $u^\top J$ and shows it on the left.
Same Jacobian, opposite direction of action. The pedagogy of this whole lesson is in that toggle.
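The two modes can also be sketched in a few lines of plain Python. The function $f(x, y) = (xy,\ x + y)$, the base point, and the tangent and cotangent values below are assumptions picked for illustration; the widget's own function may differ.

```python
# Hypothetical 2-to-2 function (an assumption, not the widget's function):
#   f(x, y) = (x * y, x + y)
def jacobian(x, y):
    """Jacobian of f at (x, y): one row per output, one column per input."""
    return [[y, x],
            [1.0, 1.0]]

def jvp(J, v):
    """Forward mode: J @ v maps an input tangent to an output tangent."""
    return [sum(J[i][j] * v[j] for j in range(len(v))) for i in range(len(J))]

def vjp(u, J):
    """Reverse mode: u^T @ J maps an output cotangent to an input cotangent."""
    return [sum(u[i] * J[i][j] for i in range(len(u))) for j in range(len(J[0]))]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

J = jacobian(2.0, 3.0)   # base point (2, 3) gives [[3, 2], [1, 1]]
v = [1.0, 0.5]           # input tangent (arbitrary choice)
u = [1.0, -1.0]          # output cotangent (arbitrary choice)

out_tangent = jvp(J, v)    # [4.0, 1.5]
in_cotangent = vjp(u, J)   # [2.0, 1.0]

# Same Jacobian, opposite direction: both pairings give the same number.
print(dot(u, out_tangent), dot(in_cotangent, v))   # 2.5 2.5
```

The final line checks the identity $u^\top (Jv) = (u^\top J)\,v$, which is what ties the two modes to one and the same matrix.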
The loss, as a function
A neural network has parameters: weights and biases. Collect them into $\theta$, a single tuple with one entry per parameter.
For a given training example, the network produces a prediction, compares it to the target, and outputs a single number: the loss. So the loss is a function $L: \mathbb{R}^n \to \mathbb{R}$ of the $n$ parameters.
Many inputs (millions for a real model). One output (a scalar).
Training means: nudge $\theta$ in the direction that reduces $L$. The direction that reduces $L$ fastest is $-\nabla L$. Training is: compute $\nabla L$, step against it, repeat.
The only real question: how do you compute $\nabla L$ when $L$ is the composition of thirty operations, each itself a vector-valued function?
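The compute, step, repeat loop can be sketched on a toy problem. The two-parameter quadratic loss, the starting point, the learning rate, and the step count below are all assumptions chosen so the gradient can be written by hand; a real network would get its gradient from the backward pass instead.

```python
# Hypothetical two-parameter loss (an assumption for illustration):
#   L(theta) = (theta0 - 1)^2 + (theta1 + 2)^2, minimized at (1, -2).
def loss(theta):
    return (theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2

def grad(theta):
    """Analytic gradient of the hypothetical loss above."""
    return [2.0 * (theta[0] - 1.0), 2.0 * (theta[1] + 2.0)]

theta = [0.0, 0.0]      # arbitrary starting point
learning_rate = 0.1     # step size, chosen arbitrarily
initial_loss = loss(theta)

for _ in range(100):
    g = grad(theta)                                              # compute the gradient
    theta = [t - learning_rate * gi for t, gi in zip(theta, g)]  # step against it

final_loss = loss(theta)
print(initial_loss, final_loss, theta)
```

Each iteration shrinks the distance to the minimum by a constant factor here, which is a property of this particular quadratic and not of training in general.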
A net is a composition
The network as a chain:
$L = \ell \circ f_k \circ \cdots \circ f_2 \circ f_1$
where $f_i$ is the $i$-th layer (linear + nonlinearity) and $\ell$ is the final loss function. The input is the example’s features; the output is one number.
By the chain rule, the Jacobian of $L$ with respect to the parameters is the product of all those layer Jacobians:
$J_L = J_\ell \, J_{f_k} \cdots J_{f_2} \, J_{f_1}$
$L$ has one output, so $J_L$ is a row, which is exactly $\nabla L$, written sideways.
The trick: don't materialize the Jacobians
If each layer’s Jacobian were a dense $d \times d$ matrix and you multiplied them all together, you’d be moving giant matrices for nothing. You don’t need the intermediate matrix products. You only need the final row: the gradient.
Here’s the move. The leftmost factor, the loss Jacobian $J_\ell$, is a row (because $\ell$ is scalar-output). When you multiply a $1 \times p$ row by a $p \times q$ matrix, you get a $1 \times q$ row. That row is a vector–Jacobian product (VJP):
$u^\top J$
where $u^\top$ is the row you are carrying and $J$ is one layer’s Jacobian.
Instead of building the full product of layer Jacobians, you keep a single row around and resize it by taking its VJP with one layer’s Jacobian at a time. In the written product you work left to right; in the computation graph you consume layers right to left: output first, input last.
That’s why it’s called the backward pass. At the end, the row you have is $\nabla L$. Every entry is the partial derivative of the loss with respect to one parameter, $\partial L / \partial \theta_i$. loss.backward() walks exactly this path and writes those numbers into parameter.grad.
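A minimal sketch of the row-carrying sweep, compared against materializing the matrix product first. The three small Jacobians are made-up numbers (assumptions), and the helper names `matmul` and `vjp` are hypothetical, not part of any library.

```python
def matmul(A, B):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def vjp(row, J):
    """Row vector times matrix: the vector-Jacobian product."""
    return [sum(row[i] * J[i][j] for i in range(len(row)))
            for j in range(len(J[0]))]

# Hypothetical Jacobians for a chain: input -> f1 -> f2 -> loss
J_f1 = [[1.0, 2.0],
        [0.0, 1.0]]
J_f2 = [[2.0, 0.0],
        [1.0, 1.0]]
J_loss = [[1.0, -1.0]]   # 1 x 2 row, because the loss is scalar-output

# Wasteful route: build the 2 x 2 product of layer Jacobians, then apply the row.
materialized = matmul(J_loss, matmul(J_f2, J_f1))

# Backward pass: carry one row, output side first, input side last.
row = J_loss[0]
for J in (J_f2, J_f1):
    row = vjp(row, J)

print(materialized, row)   # [[1.0, 1.0]] [1.0, 1.0]
```

Both routes give the same row, but the sweep never stores anything larger than one row plus the Jacobian of the layer it is currently visiting.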
Forward mode vs reverse mode
Two ways to walk a chain of Jacobians. You played with both:
- Forward (JVP): multiply from the right. Push a tangent vector from the input side through each layer. Good when $n$ (inputs) is small.
- Reverse (VJP): multiply from the left. Pull a cotangent from the output side back through each layer. Good when $m$ (outputs) is small. Perfect for scalar-output losses.
A neural-network loss has $m = 1$ output and millions of inputs. Reverse mode wins by a factor of roughly $n$, the number of inputs. That’s why deep learning frameworks all use reverse-mode autodiff, and why .backward() exists on the scalar loss tensor, not on individual parameters.
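The pass count behind that comparison can be made concrete on a tiny chain. The two Jacobians below are made-up numbers (assumptions) for a map with 3 inputs and 1 output: forward mode recovers one gradient entry per pass, while reverse mode recovers all of them in one.

```python
def jvp(J, v):
    """Forward mode: matrix times column vector."""
    return [sum(J[i][j] * v[j] for j in range(len(v))) for i in range(len(J))]

def vjp(u, J):
    """Reverse mode: row vector times matrix."""
    return [sum(u[i] * J[i][j] for i in range(len(u))) for j in range(len(J[0]))]

# Hypothetical chain: 3 inputs -> 2 hidden values -> 1 output
J1 = [[1.0, 0.0, 2.0],
      [0.0, 1.0, -1.0]]   # 2 x 3
J2 = [[3.0, 1.0]]         # 1 x 2

n = 3

# Forward mode: one full pass per input coordinate.
forward_grad = []
forward_passes = 0
for k in range(n):
    e_k = [1.0 if j == k else 0.0 for j in range(n)]   # k-th basis tangent
    out = jvp(J2, jvp(J1, e_k))
    forward_grad.append(out[0])
    forward_passes += 1

# Reverse mode: a single pass starting from the cotangent [1.0] at the output.
reverse_grad = vjp(vjp([1.0], J2), J1)
reverse_passes = 1

print(forward_grad, forward_passes)   # [3.0, 1.0, 5.0] 3
print(reverse_grad, reverse_passes)   # [3.0, 1.0, 5.0] 1
```

With three inputs the gap is 3 passes against 1; with millions of parameters the same count is what makes forward mode impractical for computing a loss gradient.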
Work a real example by hand
Let’s do one. Loss , for a fixed training example , . Evaluate at .
Forward pass:
- .
- Error .
- Loss .
Backward pass (chain rule, walked from the loss outward):
- .
- .
- .
- .
So $\nabla L = (\partial L/\partial w,\ \partial L/\partial b)$. Each coordinate was computed by multiplying “the gradient so far” (a single number at this scale, because every intermediate value here is scalar) by the local partial at one edge. At scale, those scalars become rows and the partials become layer Jacobians, but the algorithm is identical.
This is, literally, what happens every time loss.backward() runs.
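The same forward and backward walk can be sketched in code. The squared-error form $L = (wx + b - y)^2$ and every numeric value below are assumptions chosen for illustration, not necessarily the numbers of the worked example; the finite-difference check is included only to confirm the hand-derived partials.

```python
def loss(w, b, x, y):
    """Assumed loss: squared error of a one-weight, one-bias linear model."""
    return (w * x + b - y) ** 2

def forward_backward(w, b, x, y):
    # Forward pass
    y_hat = w * x + b          # prediction
    e = y_hat - y              # error
    L = e ** 2                 # loss

    # Backward pass: start at the loss and multiply by one local partial per edge
    dL_de = 2.0 * e            # d(e^2)/de
    dL_dyhat = dL_de * 1.0     # de/dy_hat = 1
    dL_dw = dL_dyhat * x       # dy_hat/dw = x
    dL_db = dL_dyhat * 1.0     # dy_hat/db = 1
    return L, dL_dw, dL_db

# Hypothetical values (assumptions): example (x, y) = (2, 5), point (w, b) = (1, 0)
x, y, w, b = 2.0, 5.0, 1.0, 0.0
L, dL_dw, dL_db = forward_backward(w, b, x, y)
print(L, dL_dw, dL_db)         # 9.0 -12.0 -6.0

# Central finite differences as an independent check
h = 1e-6
fd_dw = (loss(w + h, b, x, y) - loss(w - h, b, x, y)) / (2 * h)
fd_db = (loss(w, b + h, x, y) - loss(w, b - h, x, y)) / (2 * h)
print(fd_dw, fd_db)
```

The finite-difference values should agree with the chain-rule results up to rounding error, since the assumed loss is quadratic in both parameters.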
Compute ∂L/∂w
Same loss with , , . Confirm .
And ∂L/∂b
Same setup. Confirm .
Where this lands
You just did, by hand, what loss.backward() does for a network of two parameters. In Module 12 we’ll build micrograd, a tiny scalar autograd engine that walks a computation graph and fills in .grad fields using exactly this procedure, then vectorize it from scalars to tensors. Module 15 wires it through an attention layer, and Module 18 uses all of it to train a character-level transformer in your browser.
The algorithm is already known to you. Everything else is bookkeeping at scale.
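As a preview of that bookkeeping, here is a minimal sketch of a scalar autograd value. It is not the micrograd implementation and not what Module 12 will necessarily build: the `Value` class, its fields, and the example loss and numbers are all hypothetical, chosen only to show a graph walk that fills in .grad fields.

```python
class Value:
    """A scalar that remembers how it was computed (hypothetical sketch)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents            # Values this one was computed from
        self._local_grads = local_grads    # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __sub__(self, other):
        return Value(self.data - other.data, (self, other), (1.0, -1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        # Order the graph so each node comes after everything it depends on.
        order, seen = [], set()

        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p in v._parents:
                    visit(p)
                order.append(v)

        visit(self)
        self.grad = 1.0                    # dL/dL = 1
        # Walk from the output back to the inputs, accumulating gradients.
        for v in reversed(order):
            for p, g in zip(v._parents, v._local_grads):
                p.grad += v.grad * g


# Assumed example: L = (w*x + b - y)^2 with made-up values
w, b = Value(1.0), Value(0.0)
x, y = Value(2.0), Value(5.0)
e = w * x + b - y
L = e * e
L.backward()
print(L.data, w.grad, b.grad)   # 9.0 -12.0 -6.0
```

Gradients are accumulated with `+=` so that a value used more than once, like `e` in `e * e`, receives the sum of the contributions from every use.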