Single-variable Calculus: Derivatives & the Chain Rule · 16 min

The Chain Rule

When functions nest, their rates multiply. That multiplication is what trains every neural network.


Drag x. Watch two rates become one.

The composition $(\sin x)^2$ is two jobs in a row: first $\sin$, then square. Drag $x$ on the widget below. The left readouts show the inner rate ($g'(x) = \cos x$) and the outer rate ($f'(u) = 2u$, evaluated at $u = \sin x$). The right readout shows their product, which is the actual rate of the whole composition.

[Interactive widget: $f(g(x)) = (\sin x)^2$. Drag the slider; watch the two slopes multiply. Panels: inner $g(x) = \sin x$; outer $f(u) = u^2$ with $u = g(x)$. Sample readout at $u = 0.84$: $g'(x) = +0.54$ (rate of the inner) $\times$ $f'(g(x)) = +1.68$ (rate of the outer) $=$ $(f \circ g)'(x) = +0.91$ (total rate).]

Two things to notice:

  • The total rate is the inner rate times the outer rate. Always.
  • When the inner rate hits zero (at $x = \pi/2$, where $\cos x = 0$), the total rate goes to zero too. One zero link zeroes the whole chain.

That multiplication rule (multiply rates along a chain of compositions) is the most important formula in differential calculus and, not by coincidence, the mechanism by which every neural network in the world is trained.
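To see the multiplication outside the widget, here is a minimal sketch that recomputes the three readouts for $(\sin x)^2$. It assumes the slider sat at $x = 1$ (a guess based on the displayed $u = 0.84 \approx \sin 1$), and the helper name `rates` is made up for this illustration.

```python
import math

def rates(x):
    """Return (inner rate, outer rate, total rate) for (sin x)^2 at x."""
    inner = math.cos(x)        # g'(x) for g(x) = sin x
    outer = 2 * math.sin(x)    # f'(u) = 2u, evaluated at u = sin x
    return inner, outer, inner * outer

# Assumed slider position: x = 1, since sin(1) is about 0.84.
inner, outer, total = rates(1.0)
print(f"{inner:+.2f} x {outer:+.2f} = {total:+.2f}")

# At x = pi/2 the inner rate is (numerically) zero, so the product is too.
_, _, at_peak = rates(math.pi / 2)
print(abs(at_peak) < 1e-12)
```

The second check is written as a tolerance test because `math.cos(math.pi / 2)` is a tiny floating-point number rather than an exact zero.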

Functions that eat functions

When you write $(\sin x)^2$, you’re doing two jobs in a row. First, take $x$, pass it to $\sin$; call the result $u$. Then take $u$, square it.

$$x \;\longrightarrow\; \boxed{\;\sin\;} \;\longrightarrow\; u \;\longrightarrow\; \boxed{\;(\cdot)^2\;} \;\longrightarrow\; \text{output}$$

We call this composition and write it $f(g(x))$ or $(f \circ g)(x)$, where $g$ runs first (the inner function) and $f$ runs second (the outer). In our case $g(x) = \sin(x)$ and $f(u) = u^2$.

The derivative problem: nudge $x$ by a tiny amount. How does the final output change? Both functions are amplifying (or dampening) that nudge. The output’s sensitivity to $x$ depends on both rates at once.
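One way to feel this is to actually nudge $x$ and follow the nudge through both stages. The sketch below does that for $(\sin x)^2$. The starting point `x = 1.0` and nudge size `h = 1e-6` are arbitrary choices for illustration, and the `*_gain` names are invented here.

```python
import math

x, h = 1.0, 1e-6   # h is the tiny nudge (illustrative value)

u = math.sin(x)                          # inner output before the nudge
du = math.sin(x + h) - u                 # how far the inner output moved
dout = math.sin(x + h) ** 2 - u ** 2     # how far the final output moved

inner_gain = du / h       # stage 1 amplification, close to cos(x)
outer_gain = dout / du    # stage 2 amplification, close to 2u
total_gain = dout / h     # end-to-end amplification

print(inner_gain, outer_gain, total_gain)
```

Notice that `total_gain` equals `inner_gain * outer_gain` up to rounding, simply because `dout / h = (dout / du) * (du / h)`. Shrinking the nudge turns those ratios into derivatives.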

A simpler case first: watch the rates multiply

Forget $\sin$ for a second. Pick $g(x) = 3x$ (a simple stretch) and $f(u) = u^2$.

The composition is $f(g(x)) = (3x)^2 = 9x^2$, which we already know how to differentiate:

$$\frac{d}{dx}\, 9x^2 \;=\; 18x.$$

Now try it the long way, using both rates:

  • $g'(x) = 3$ (the inner rate, a tripler stays a tripler).
  • $f'(u) = 2u$, so at $u = g(x) = 3x$, $f'(g(x)) = 2 \cdot 3x = 6x$.
  • Multiply them: $6x \cdot 3 = 18x$.

Same answer. The rates multiplied. Not a coincidence: that’s the whole game.

The chain rule

For a composition $f(g(x))$, the derivative is:

$$(f \circ g)'(x) \;=\; f'\!\big(g(x)\big) \cdot g'(x)$$

Read it left to right: the rate of the outer function, evaluated at the inner function’s output, times the rate of the inner function at $x$.
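The formula translates almost word for word into code. Here is a minimal sketch, applied to the stretch-then-square pair $g(x) = 3x$, $f(u) = u^2$ and checked against a numerical slope. The names `chain_rule` and `central_difference` are hypothetical helpers written for this example, and the test point `x = 0.5` is arbitrary.

```python
def chain_rule(f_prime, g, g_prime, x):
    """Derivative of f(g(x)) at x: outer rate at g(x), times inner rate at x."""
    return f_prime(g(x)) * g_prime(x)

def central_difference(fn, x, h=1e-5):
    """Numerical slope of fn at x, used only as an independent check."""
    return (fn(x + h) - fn(x - h)) / (2 * h)

def g(x):            # inner function: a simple stretch
    return 3 * x

def g_prime(x):      # its rate is 3 everywhere
    return 3

def f(u):            # outer function: square
    return u ** 2

def f_prime(u):      # its rate is 2u
    return 2 * u

x = 0.5
analytic = chain_rule(f_prime, g, g_prime, x)           # 6x * 3 = 18x
numeric = central_difference(lambda t: f(g(t)), x)      # slope of 9x^2
print(analytic, numeric)
```

Both numbers should agree with $18x$ at the chosen point, the numeric one only up to floating-point error.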

The nickname chain rule (you multiply along a chain of rates) will haunt you. Get comfortable.

Your turn

Let $g(x) = 3x$ and $f(u) = u^2$.

Compute $(f \circ g)'(2)$.

Why this IS backpropagation

Here is the secret the rest of this course is organized around.

A neural network is a long function composition. The input goes through a layer, then another layer, then another, then a softmax, then a loss function. The output (the loss) is

$$L \;=\; \mathrm{loss}\big(\mathrm{softmax}\big(W_n \cdots \mathrm{relu}(W_2 \,\mathrm{relu}(W_1 x)) \cdots\big)\big).$$

To train it, we need the rate of the loss with respect to every single parameter $W_i$. And the chain rule says the way to find it is:

Walk the chain from output back to input, multiplying local rates as you go.

That walk is called backpropagation. Every deep learning framework is, under the hood, a really efficient implementation of the chain rule. loss.backward() in PyTorch is the chain rule applied across a graph of operations, a few million times over.
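To make the walk concrete, here is a toy sketch on a chain of three scalar functions, $x \to 3x \to \sin \to (\cdot)^2$. It is not how PyTorch is implemented, and it is not a real network: there are no weight matrices, and it computes the rate with respect to the input rather than the parameters. The `layers` list and the `forward` / `backward` helpers are invented for this illustration.

```python
import math

# Each entry is (function, derivative). A toy scalar chain, not a real network.
layers = [
    (lambda x: 3 * x, lambda x: 3.0),      # stretch
    (math.sin, math.cos),                  # sine
    (lambda u: u ** 2, lambda u: 2 * u),   # square
]

def forward(x):
    """Run the chain front to back, remembering the input each layer saw."""
    inputs = []
    for fn, _ in layers:
        inputs.append(x)
        x = fn(x)
    return x, inputs

def backward(inputs):
    """Walk from output back to input, multiplying local rates as we go."""
    rate = 1.0
    for (_, d_fn), layer_input in zip(reversed(layers), reversed(inputs)):
        rate *= d_fn(layer_input)   # local rate, evaluated where the layer ran
    return rate

output, inputs = forward(0.4)
grad = backward(inputs)
print(output, grad)
```

The forward pass stores each layer's input because every local rate has to be evaluated at the value that layer actually received. That is the same "outer rate at the inner function's output" idea from the chain rule, repeated once per layer.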

In Module 12 we’ll build this ourselves, one + node at a time. But the idea you just practised (rates multiply through compositions) is the whole thing. You now know how neural networks learn. The rest is plumbing.
