Drag x. Watch two rates become one.
The composition $(g(x))^2$ is two jobs in a row: first apply the inner function $g$, then square. Drag $x$ on the widget below. The left readouts show the inner rate ($g'(x)$) and the outer rate ($2u$, evaluated at $u = g(x)$). The right readout shows their product, which is the actual rate of the whole composition.
Two things to notice:
- The total rate is the inner rate times the outer rate. Always.
- When the inner rate hits zero (at any $x$ where the inner function’s derivative $g'(x) = 0$), the total rate goes to zero too. One zero link zeroes the whole chain.
That multiplication rule (multiply rates along a chain of compositions) is arguably the most important formula in differential calculus and, not by coincidence, the mechanism by which neural networks are trained.
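The two observations above can be checked numerically. This is a minimal sketch that assumes, purely for illustration, that the inner function is `math.sin` (the widget's actual inner function is not shown in the text); the helper names `inner`, `inner_rate`, `outer_rate`, and `total_rate` are hypothetical.

```python
import math

# Assumption for illustration only: inner function is sin, outer is squaring.
def inner(x):
    return math.sin(x)

def inner_rate(x):
    # Derivative of sin(x).
    return math.cos(x)

def outer_rate(u):
    # Derivative of u**2 with respect to u.
    return 2 * u

def total_rate(x):
    # Outer rate evaluated at the inner output, times the inner rate.
    return outer_rate(inner(x)) * inner_rate(x)

for x in (0.5, 1.0, math.pi / 2):
    print(
        f"x={x:.4f}  inner rate={inner_rate(x):+.4f}  "
        f"outer rate={outer_rate(inner(x)):+.4f}  total={total_rate(x):+.4f}"
    )
```

With this choice the inner rate vanishes at $x = \pi/2$, so the last row shows the total rate collapsing to (numerically) zero even though the outer rate does not.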
Functions that eat functions
When you write $(g(x))^2$, you’re doing two jobs in a row. First, take $x$, pass it to $g$; call the result $u$. Then take $u$, square it.
We call this composition and write it $f(g(x))$ or $(f \circ g)(x)$, where $g$ runs first (the inner function) and $f$ runs second (the outer). In our case $g$ is the widget’s inner function and $f(u) = u^2$.
The derivative problem: nudge $x$ by a tiny amount. How does the final output change? Both functions are amplifying (or dampening) that nudge. The output’s sensitivity to $x$ depends on both rates at once.
A simpler case first: watch the rates multiply
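Forget the widget’s inner function for a second. Pick $g(x) = 3x$ (a simple stretch) and $f(u) = u^2$.
The composition is $f(g(x)) = (3x)^2 = 9x^2$, which we already know how to differentiate:
$$\frac{d}{dx}\, 9x^2 = 18x$$
Now try it the long way, using both rates:
- $g'(x) = 3$ (the inner rate, a tripler stays a tripler).
- $f'(u) = 2u$, so at $u = 3x$, $f'(3x) = 6x$.
- Multiply them: $f'(g(x)) \cdot g'(x) = 6x \cdot 3 = 18x$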
Same answer. The rates multiplied. Not a coincidence: that’s the whole game.
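For readers who want to see it numerically, here is a small sketch that takes the tripler $g(x) = 3x$ as the inner function and squaring as the outer one, then compares the product of rates against a finite-difference estimate. The helper `numeric_rate` and the step size `h` are illustrative choices, not part of the lesson.

```python
def g(x):
    # Inner function: a simple stretch by 3.
    return 3 * x

def f(u):
    # Outer function: squaring.
    return u ** 2

def numeric_rate(fn, x, h=1e-6):
    # Central-difference estimate of the derivative of fn at x.
    return (fn(x + h) - fn(x - h)) / (2 * h)

def chain_rate(x):
    inner_rate = 3           # g'(x)
    outer_rate = 2 * g(x)    # f'(u) = 2u, evaluated at u = g(x)
    return outer_rate * inner_rate

x = 2.0
print("product of rates :", chain_rate(x))
print("finite difference:", numeric_rate(lambda t: f(g(t)), x))
```

Both lines should agree with $18x$ at the chosen point, up to floating-point error in the finite-difference estimate.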
The chain rule
For a composition $h(x) = f(g(x))$, the derivative is:
$$h'(x) = f'(g(x)) \cdot g'(x)$$
Read it left to right: the rate of the outer function, evaluated at the inner function’s output, times the rate of the inner function at $x$.
The nickname chain rule (you multiply along a chain of rates) will haunt you. Get comfortable.
Your turn
Let and .
Compute .
Why this IS backpropagation
Here is the secret the rest of this course is organized around.
A neural network is a long function composition. The input goes through a layer, then another layer, then another, then a softmax, then a loss function. The output (the loss) is
$$L = \mathrm{loss}(\mathrm{softmax}(\mathrm{layer}_3(\mathrm{layer}_2(\mathrm{layer}_1(x)))))$$
To train it, we need the rate of the loss with respect to every single parameter. And the chain rule says the way to find it is:
Walk the chain from output back to input, multiplying local rates as you go.
That walk is called backpropagation. Every deep learning framework is, under the hood, a really efficient implementation of the chain rule. loss.backward() in PyTorch is the chain rule applied across a graph of operations, a few million times over.
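To make the backward walk concrete, here is a toy sketch in plain Python. It is not how PyTorch implements `loss.backward()`; it assumes a hypothetical three-stage scalar chain (triple, then sine, then square) and only shows the idea of recording what each stage saw on the way forward, then multiplying local rates on the way back.

```python
import math

# Hypothetical toy chain: each stage is (function, local rate of that function).
stages = [
    (lambda x: 3 * x, lambda x: 3.0),     # stretch by 3
    (math.sin, math.cos),                 # sine and its derivative
    (lambda u: u ** 2, lambda u: 2 * u),  # square and its derivative
]

def forward(x):
    """Run the chain, remembering the input each stage saw."""
    inputs = []
    for fn, _ in stages:
        inputs.append(x)
        x = fn(x)
    return x, inputs

def backward(inputs):
    """Walk from output back to input, multiplying local rates."""
    rate = 1.0
    for (_, local_rate), seen in zip(reversed(stages), reversed(inputs)):
        rate *= local_rate(seen)
    return rate

output, inputs = forward(0.4)
print("output:", output)
print("rate of output with respect to input:", backward(inputs))
```

Real frameworks do the same bookkeeping over a graph of tensor operations with many parameters rather than a single scalar chain, but the multiply-as-you-walk-back structure is the same.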
In Module 12 we’ll build this ourselves, one + node at a time. But the idea you just practised (rates multiply through compositions) is the whole thing. You now know how neural networks learn. The rest is plumbing.