Seed the root with grad 1
The forward pass filled in .data at every node. The backward pass fills in .grad at every node — where .grad at a node means the partial derivative of the final output with respect to that node.
Start at the root. The root’s gradient with respect to itself is, by definition, 1, so root.grad = 1.
That is the seed. Now walk the graph in reverse. At each node, you have an upstream gradient (the grad of the loss w.r.t. this node, which downstream nodes have already deposited). You use the local derivative on each incoming edge to push a contribution back to that parent: parent.grad += local derivative × upstream gradient.
The += is not a typo. We will see why in two minutes.
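The seed-and-push rule can be sketched in a few lines of Python. The function name push, the dict-based nodes, and the local derivative of 4.0 are hypothetical, chosen only for illustration; this is not micrograd's API.

```python
# Hypothetical sketch: nodes are plain dicts, not micrograd Value objects.
def push(parent, local_derivative, upstream):
    """Accumulate one edge's contribution into the parent's grad."""
    parent["grad"] += local_derivative * upstream

root = {"grad": 0.0}
parent = {"grad": 0.0}

root["grad"] = 1.0  # seed: d(root)/d(root) = 1

# Assumed local derivative of 4.0 on the edge from root to parent.
push(parent, local_derivative=4.0, upstream=root["grad"])
print(parent["grad"])  # 4.0
```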
Run it on the (a+b)·c graph
Below is the same widget from the previous lesson. Make sure you are on the (a + b) · c preset and click step forward four times until every node has a .data value. Then click start backward and step through.
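The backward sweep visits nodes in reverse topological order: first the root (a + b) · c, then the sum node a + b and c, then a and b.
- Seed: the root’s grad is 1.
- Visit the root (which is (a + b) · c). The edge to the sum node has local derivative c, so the sum node’s grad is c × 1 = c. The edge to c has local derivative a + b, so c.grad = (a + b) × 1 = a + b.
- Visit the sum node (which is a + b). Both incoming edges have local derivative 1. So a.grad and b.grad both equal the sum node’s grad, which is the value of c.
Read the final grads off the widget. They are the partial derivatives of the output with respect to each input.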
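The same sweep can be written out by hand as a short Python sketch. The input values 2, 3 and 4 are assumptions for illustration, and the widget's preset may use different numbers; the name s for the sum node is also hypothetical.

```python
# Assumed inputs; the widget's preset may use different numbers.
a, b, c = 2.0, 3.0, 4.0

# Forward pass.
s = a + b      # sum node
out = s * c    # root

# Backward pass, in reverse topological order.
grad = {"out": 0.0, "s": 0.0, "c": 0.0, "a": 0.0, "b": 0.0}
grad["out"] = 1.0  # seed

# Visit the root (a product): each parent gets upstream times the other parent's value.
grad["s"] += c * grad["out"]
grad["c"] += s * grad["out"]

# Visit the sum node: local derivative 1 on both incoming edges.
grad["a"] += 1.0 * grad["s"]
grad["b"] += 1.0 * grad["s"]

print(grad["a"], grad["b"], grad["c"])  # 4.0 4.0 5.0
```

With these assumed inputs, a.grad and b.grad both equal the value of c, and c.grad equals a + b.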
Pick a leaf
With , run the backward pass.
What is ?
And the sibling
What is ?
Read it off the widget, or compute it the same way.
The pattern, in one phrase
Two patterns you should now feel in your fingers, because they will repeat at every node of every backward pass for the rest of this course:
- An addition node copies the upstream gradient to all of its inputs unchanged.
- A multiplication node swaps and scales: the gradient that flows back to one parent is upstream times the value of the other parent. (Read that twice.)
That is half of all the cases in micrograd. The other half (power and the remaining operations) all follow the same template: parent.grad += local × upstream, where the local is something you can compute from the parent’s data.
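As a rough sketch of that template, each operation only has to supply its local derivative. The function names below are hypothetical, not micrograd's. Each one returns the contributions that the caller would then accumulate into the parents with +=.

```python
def add_backward(upstream):
    # d(x + y)/dx = 1 and d(x + y)/dy = 1: the upstream gradient is copied.
    return upstream, upstream

def mul_backward(x, y, upstream):
    # d(x * y)/dx = y and d(x * y)/dy = x: swap, then scale by upstream.
    return y * upstream, x * upstream

def pow_backward(x, k, upstream):
    # d(x ** k)/dx = k * x ** (k - 1), for a constant exponent k.
    return k * x ** (k - 1) * upstream

print(add_backward(2.0))            # (2.0, 2.0)
print(mul_backward(3.0, 5.0, 2.0))  # (10.0, 6.0)
print(pow_backward(3.0, 2, 1.0))    # 6.0
```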
Run a neuron backward
Switch the widget to the neuron preset above. Step all the way forward, then all the way backward. At the end, every leaf carries its own gradient.
The interesting one is . With , the bottom branch contributes nothing, so the backward pass through the upper branch alone determines . The math (which the widget agrees with):
Then (addition copies). Then (multiplication swaps).
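Karpathy chose the bias specifically so that the pre-activation is about 0.8814 and the tanh output is about 0.7071, which makes the local derivative 1 − tanh² equal to 0.5 exactly. Tidy numbers in service of clean intuition.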
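The neuron's backward pass can be checked numerically with a short sketch. It assumes the preset matches the tanh neuron from Karpathy's micrograd lecture (x1 = 2, x2 = 0, w1 = -3, w2 = 1, b = 6.8813735870195432); the widget may use different values or names.

```python
import math

# Assumed values from Karpathy's micrograd lecture; the widget's preset may differ.
x1, x2 = 2.0, 0.0
w1, w2 = -3.0, 1.0
b = 6.8813735870195432

# Forward pass.
n = x1 * w1 + x2 * w2 + b  # pre-activation
o = math.tanh(n)           # neuron output

# Backward pass.
o_grad = 1.0                      # seed at the root
n_grad = (1.0 - o ** 2) * o_grad  # tanh: local derivative is 1 - o^2

x1w1_grad = n_grad  # addition copies the upstream gradient
x2w2_grad = n_grad
b_grad = n_grad

x1_grad = w1 * x1w1_grad  # multiplication swaps: other parent's value times upstream
w1_grad = x1 * x1w1_grad
x2_grad = w2 * x2w2_grad
w2_grad = x2 * x2w2_grad

print(round(n_grad, 4), round(x1_grad, 4), round(w1_grad, 4))
```

Under these assumptions the gradient at the pre-activation comes out at about 0.5, and w2's gradient is zero because x2 is zero.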
Read one neuron-leaf grad
For the one-neuron preset, what is ?
When one node is used twice: the diamond
Switch to the a · a preset and step forward. With , the result is . Easy.
Now step backward. Seed the root’s grad to 1. Visit the root (which is a · a). The edge to a as “left parent” gets local derivative = right parent’s data, which is the value of a. The edge to a as “right parent” gets left parent’s data, which is also the value of a. Both contributions go to the same node a.
If we wrote a.grad = upstream * other.data (i.e., overwrote), the second contribution would clobber the first and a.grad would end up equal to the value of a. That is wrong: the output is a², so its derivative with respect to a is 2a.
The += saves the day. Both contributions add, and we get the correct a.grad = 2a. This is the entire reason _backward is written with += and not = inside the engine. Every reused node, every weight-tied layer, every diamond graph: same fix.
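The difference between overwriting and accumulating is easy to see in a tiny sketch. The value a = 3 is an assumption for illustration; the preset may use a different number.

```python
a = 3.0         # assumed value; the preset may use a different number
upstream = 1.0  # seed at the root a * a

# Overwrite: the second assignment clobbers the first.
grad_overwrite = 0.0
grad_overwrite = upstream * a  # a as left parent: local derivative is the right parent's data
grad_overwrite = upstream * a  # a as right parent: local derivative is the left parent's data

# Accumulate: both contributions are summed.
grad_accumulate = 0.0
grad_accumulate += upstream * a
grad_accumulate += upstream * a

print(grad_overwrite)   # 3.0, which is a (wrong)
print(grad_accumulate)  # 6.0, which is 2a (correct)
```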
When the visit order is wrong: the diamond, dramatized
The += only works if everyone gets a chance to contribute. That requires each node to be visited after every child that pushes gradient into it. The right order is the reverse topological order: root first, every parent after all its children.
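A small sketch shows what a bad visit order costs. The graph here is hypothetical and is not necessarily the widget's: a leaf x feeds a shared node a = x + 1, which feeds b = a · 3 and c = a + 1, and the root is d = b · c. The function run and the per-node step functions are illustrative names only.

```python
# Hypothetical graph: x -> a = x + 1 -> (b = a * 3, c = a + 1) -> d = b * c
def run(order):
    x = 1.0
    a = x + 1.0
    b = a * 3.0
    c = a + 1.0
    # d = b * c is the root; only its grad is needed below.

    grad = {name: 0.0 for name in "xabcd"}
    grad["d"] = 1.0  # seed

    def back_d():  # product: swap and scale
        grad["b"] += c * grad["d"]
        grad["c"] += b * grad["d"]

    def back_b():  # b = a * 3: local derivative 3
        grad["a"] += 3.0 * grad["b"]

    def back_c():  # c = a + 1: local derivative 1
        grad["a"] += 1.0 * grad["c"]

    def back_a():  # a = x + 1: local derivative 1
        grad["x"] += 1.0 * grad["a"]

    steps = {"d": back_d, "b": back_b, "c": back_c, "a": back_a, "x": lambda: None}
    for name in order:
        steps[name]()
    return grad

good = run(["d", "b", "c", "a", "x"])  # reverse topological order
bad = run(["d", "b", "a", "c", "x"])   # a visited before c has contributed

print(good["x"])  # 15.0
print(bad["x"])   # 9.0, the path through c never reached x
```

In the bad order, a pushes its gradient to x while it holds only the contribution from b, so the 6 that arrives later from c stays stuck at a.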
The widget below has a richer diamond:
So , , . The correct (you can derive this in your head: , so at ).
Place the nodes in any visit order you like and click run all. If you place a before either of the two middle nodes, watch what happens: the widget will tell you which path’s contribution got lost and why. Then try the right order (the root first, then the two middle nodes in either order, then a last) and watch grad[a] hit 15.
Topological sort
The fix is a single piece of bookkeeping: before you start the backward pass, sort the nodes so each node comes after all of the nodes it depends on, then walk that list in reverse. That ordering is called a topological sort, and it is what every autograd library (micrograd, PyTorch, JAX) runs in some form before reverse-mode autodiff.
The standard recipe is depth-first search from the root with a visited set:
```python
def build_topo(root):
    order, visited = [], set()
    def visit(v):
        if v in visited:
            return
        visited.add(v)
        for p in v._prev:
            visit(p)
        order.append(v)
    visit(root)
    return order
```
Then backward() is:
```python
def backward(root):
    order = build_topo(root)
    root.grad = 1.0
    for v in reversed(order):
        v._backward()  # pushes contributions to parents using +=
```
Two lines of machinery and the diamond gives 15 every time. The next lesson is writing exactly this.
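To run build_topo and backward end to end, they need a node type with .data, .grad, ._prev and ._backward. The Value class below is a minimal stand-in that supports only + and ·, written as a sketch under those assumptions rather than as micrograd's actual implementation. The diamond it builds is hypothetical, chosen so that the correct gradient is 15; the widget's graph may differ.

```python
class Value:
    """Minimal stand-in for a graph node; supports + and * only."""

    def __init__(self, data, _prev=()):
        self.data = data
        self.grad = 0.0
        self._prev = _prev                # parents (inputs) of this node
        self._backward = lambda: None     # leaves push nothing

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad         # addition copies
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # multiplication swaps and scales
            other.grad += self.data * out.grad
        out._backward = _backward
        return out


def build_topo(root):
    order, visited = [], set()
    def visit(v):
        if v in visited:
            return
        visited.add(v)
        for p in v._prev:
            visit(p)
        order.append(v)
    visit(root)
    return order


def backward(root):
    order = build_topo(root)
    root.grad = 1.0
    for v in reversed(order):
        v._backward()


# Hypothetical diamond: d = (a * 3) * (a + 1), so dd/da = 6a + 3 = 15 at a = 2.
a = Value(2.0)
d = (a * Value(3.0)) * (a + Value(1.0))
backward(d)
print(a.grad)  # 15.0

# The a * a case: both edges land on the same node.
x = Value(3.0)
y = x * x
backward(y)
print(x.grad)  # 6.0
```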
Sanity-check the diamond
For the diamond graph above, in the correct visit order, what is grad[a]?
What you now know how to do
You have walked three graphs backward by hand: a small product, a full neuron, and the diamond that exposes the += and topological-sort bookkeeping.
That walk has a name: reverse-mode automatic differentiation. The chain rule, factored across a graph, computed in one sweep, accumulating sums along shared subexpressions. It is what every deep learning library calls backward(). It is the algorithm that puts the rest of this course on the hot path.
The next lesson stops walking by hand. You write the code that does the walk — the Value class, the per-op _backward closures, the topo sort — and test it against PyTorch.