Backpropagation from Scratch · 16 min

Walk it backward

The backward pass is the forward pass in reverse. Seed the root with grad 1, walk the graph upstream, and use each edge's local derivative to distribute one number per node. The order matters more than anyone tells you.

0 / 0

Seed the root with grad 1

The forward pass filled in .data at every node. The backward pass fills in .grad at every node — where .grad at a node means the partial derivative of the final output with respect to that node.

Start at the root. The root’s gradient with respect to itself is, by definition, 11:

LL=1.\frac{\partial L}{\partial L} = 1.

That is the seed. Now walk the graph in reverse. At each node, you have an upstream gradient (the grad of the loss w.r.t. this node, which downstream nodes have already deposited). You use the local derivative on each incoming edge to push a contribution back to that parent:

  gradparent+=gradthisthisparent  \boxed{\;\text{grad}_{\text{parent}} \mathrel{+}= \text{grad}_{\text{this}} \cdot \frac{\partial \text{this}}{\partial \text{parent}}\;}

The += is not a typo. We will see why in two minutes.

Run it on the (a+b)·c graph

Below is the same widget from the previous lesson. Make sure you are on the (a + b) · c preset and click step forward four times until every node has a .data value. Then click start backward and step through.

a·data·gradb·data·gradc·data·grade = a + b·data·gradf = e · c·data·grad
phase: idle
edit leaf values

Click step forward to fill .data at each node from leaves to root. Then step backward to fill .grad from root to leaves. Click any edge to read its local derivative.

The backward sweep visits nodes in reverse topological order: ff first, then cc and ee, then aa and bb.

  • Seed: grad[f]=1\text{grad}[f] = 1.
  • Visit ff (which is ece \cdot c). Edge to ee has local derivative c=10c = 10, so grad[e]+=110=10\text{grad}[e] \mathrel{+}= 1 \cdot 10 = 10. Edge to cc has local derivative e=1e = -1, so grad[c]+=1(1)=1\text{grad}[c] \mathrel{+}= 1 \cdot (-1) = -1.
  • Visit ee (which is a+ba + b). Both incoming edges have local derivative 11. So grad[a]+=101=10\text{grad}[a] \mathrel{+}= 10 \cdot 1 = 10 and grad[b]+=101=10\text{grad}[b] \mathrel{+}= 10 \cdot 1 = 10.

Read the final grads off the widget. They are the partial derivatives of ff with respect to each input.

Pick a leaf

With a=2,b=3,c=10a = 2, b = -3, c = 10, run the backward pass.

What is f/a\partial f / \partial a?

And the sibling

What is f/b\partial f / \partial b?

Read it off the widget, or compute it the same way.

The pattern, in one phrase

Two patterns you should now feel in your fingers, because they will repeat at every node of every backward pass for the rest of this course:

  • An addition node copies the upstream gradient to all of its inputs unchanged.
  • A multiplication node swaps and scales: the gradient that flows back to xx is upstream times the value of the other parent. (Read that twice.)

That is half of all the cases in micrograd. The other half (tanh\text{tanh}, exp\exp, relu\text{relu}, power) all follow the same template: upstreamlocal\text{upstream} \cdot \text{local}, where the local is something you can compute from the parent’s data.

Run a neuron backward

Switch the widget to the neuron preset above. Step all the way forward, then all the way backward. At the end, every leaf carries its own gradient.

The interesting one is w1w_1. With x2=0x_2 = 0, the bottom branch contributes nothing, so the backward pass through the upper branch alone determines o/w1\partial o / \partial w_1. The math (which the widget agrees with):

on=1tanh2(n)=10.5=0.5.\frac{\partial o}{\partial n} = 1 - \tanh^2(n) = 1 - 0.5 = 0.5.

Then o/(x1w1)=0.5\partial o / \partial(x_1 w_1) = 0.5 (addition copies). Then o/w1=0.5x1=0.52=1.0\partial o / \partial w_1 = 0.5 \cdot x_1 = 0.5 \cdot 2 = 1.0 (multiplication swaps).

Karpathy chose the bias b6.8814b \approx 6.8814 specifically so that n0.8814n \approx 0.8814 and tanh(n)0.7071\tanh(n) \approx 0.7071, which makes 1tanh2(n)=0.51 - \tanh^2(n) = 0.5 exactly. Tidy numbers in service of clean intuition.

Read one neuron-leaf grad

For the one-neuron preset, what is o/w1\partial o / \partial w_1?

When one node is used twice: the diamond

Switch to the a · a preset and step forward. With a=3a = 3, the result is 99. Easy.

Now step backward. Seed grad[f]=1\text{grad}[f] = 1. Visit ff (which is aaa \cdot a). The edge from aa as “left parent” gets local derivative right parent’s data =a=3= a = 3. The edge from aa as “right parent” gets left parent’s data =a=3= a = 3. Both contributions go to the same node aa.

If we wrote a.grad = upstream * other.data (i.e., overwrote), the second contribution would clobber the first and we would get grad[a]=3\text{grad}[a] = 3. That is wrong: f(a)=a2f(a) = a^2, so f/a=2a=6\partial f / \partial a = 2a = 6.

The += saves the day. Both contributions add, and we get the correct grad[a]=3+3=6\text{grad}[a] = 3 + 3 = 6. This is the entire reason _backward is written with += and not = inside the engine. Every reused node, every weight-tied layer, every diamond graph: same fix.

When the visit order is wrong: the diamond, dramatized

The += only works if everyone gets a chance to contribute. That requires each node to be visited after every child that pushes gradient into it. The right order is the reverse topological order: root first, every parent after all its children.

The widget below has a richer diamond:

a=2,b=a+1,c=a3,d=bc.a = 2, \quad b = a + 1, \quad c = a \cdot 3, \quad d = b \cdot c.

So b=3b = 3, c=6c = 6, d=18d = 18. The correct d/a=15\partial d / \partial a = 15 (you can derive this in your head: d=(a+1)(3a)=3a2+3ad = (a+1)(3a) = 3a^2 + 3a, so d=6a+3=15d' = 6a + 3 = 15 at a=2a = 2).

a = 2grad 0.00b=a+10.00c=a·30.00d=b·c0.00
visit order (first → last):
click to place:

The graph: a=2, b=a+1, c=a·3, d=b·c. We seed grad[d]=1 and walk the order you place, calling grad[parent] += grad[child] · local at each visit. To get the right answer 15 for grad[a], each parent must be visited after its children — that is reverse-topological order.

Place the visit order in any order you like and click run all. If you place aa before bb or cc, watch what happens — the widget will tell you which path’s contribution got lost and why. Then try the right order (dd first, then bb and cc in either order, then aa last) and watch grad[a] hit 1515.

Topological sort

The fix is a single piece of bookkeeping: before you start the backward pass, sort the nodes so each node comes after all of its descendants. That ordering is called a topological sort, and it is what every autograd library — micrograd, PyTorch, JAX — runs in some form before reverse-mode autodiff.

The standard recipe is depth-first search from the root with a visited set:

def build_topo(root):
    order, visited = [], set()
    def visit(v):
        if v in visited: return
        visited.add(v)
        for p in v._prev: visit(p)
        order.append(v)
    visit(root)
    return order

Then backward() is:

def backward(root):
    order = build_topo(root)
    root.grad = 1.0
    for v in reversed(order):
        v._backward()  # pushes contributions to parents using +=

Two lines of machinery and the diamond gives 15 every time. The next lesson is writing exactly this.

Sanity-check the diamond

For the diamond graph above, in the correct visit order, what is d/a\partial d / \partial a?

What you now know how to do

You have walked three graphs backward by hand: a small product, a full neuron, and the diamond that exposes the += and topological-sort bookkeeping.

That walk has a name: reverse-mode automatic differentiation. The chain rule, factored across a graph, computed in one sweep, accumulating sums along shared subexpressions. It is what every deep learning library calls backward(). It is the algorithm that puts the rest of this course on the hot path.

The next lesson stops walking by hand. You write the code that does the walk — the Value class, the per-op _backward closures, the topo sort — and test it against PyTorch.

Lesson complete

Nice tinkering.