Backpropagation from Scratch · 14 min

Draw the graph

Every formula you can write down is also a picture. Nodes hold values, edges hold operations, and every edge has a local derivative. This is the picture backprop walks.

0 / 0

A formula is also a picture

Take a small expression: f=(a+b)cf = (a + b) \cdot c. You evaluate it by combining smaller pieces. Add aa and bb first. Call that intermediate value ee. Then multiply ee by cc.

Draw that as a picture. Two leaves on the left (a,ba, b) feed an addition node ee. Another leaf (cc) and ee both feed a multiplication node ff. The lines connecting them are edges. The boxes are nodes. The whole structure is a directed acyclic graphdirected (the edges flow one way), acyclic (no loops).

Click start forward below. Watch the value at each node fill in, one at a time, in the order forced by the graph: leaves first, then ee, then ff at the very end. This is the forward pass: walk the graph from leaves to root, computing each node from its parents.

a·data·gradb·data·gradc·data·grade = a + b·data·gradf = e · c·data·grad
phase: idle
edit leaf values

Click step forward to fill .data at each node from leaves to root. Then step backward to fill .grad from root to leaves. Click any edge to read its local derivative.

Run it in your head

With a=2a = 2, b=3b = -3, c=10c = 10, walk the graph yourself.

What is ff?

Edges carry local derivatives

Every edge in the graph carries one more piece of information besides “this value flows here.” Each edge has a local derivative: the partial derivative of the child node’s output with respect to this particular parent, expressible from the operation alone.

For an addition node z=x+yz = x + y, the local derivative on each incoming edge is 11. (A small bump in either parent shows up as the same bump in the sum.)

For a multiplication node z=xyz = x \cdot y, the local derivative on the edge from xx is the other parent’s value, yy. And the local derivative on the edge from yy is xx. (A small bump in one parent is amplified by the other.)

That is the entire rule. Every elementary operation has a local derivative formula, expressible without needing to know what is downstream.

Click any edge in the widget above to see its local derivative. Try the edge from ee to ff — since f=ecf = e \cdot c, the local derivative there is cc.

Read one local derivative

For f=ecf = e \cdot c with c=10c = 10, what is f/e\partial f / \partial e?

(That is the local derivative on the edge from ee to ff.)

Why this picture matters

The chain rule, written in the abstract, says: if you change one input by a tiny amount, multiply the local derivatives along the path to the output to find out how much the output changes.

In a graph picture, “the path to the output” is something you can literally trace with your finger. And when there are multiple paths (because one node feeds into more than one place downstream), the chain rule says: sum the contributions over all paths.

For a small graph this is fine. But neural networks have graphs with millions of nodes and billions of paths. Computing the output’s sensitivity to every input by enumerating paths would take longer than the age of the universe.

The whole trick of backpropagation, which we will build in the next two lessons, is to compute every input’s sensitivity in one sweep across the graph — not one sweep per input. The graph is the data structure that makes that trick possible.

One more, by hand

Switch the widget to the neuron preset above. The graph is bigger now: five leaves (x1,w1,x2,w2,bx_1, w_1, x_2, w_2, b), three internal nodes (two multiplications and two additions), and a final tanh\tanh at the root.

Karpathy picked these specific leaf values so the math is friendly:

x1=2,  w1=3,  x2=0,  w2=1,  b6.8814.x_1 = 2,\; w_1 = -3,\; x_2 = 0,\; w_2 = 1,\; b \approx 6.8814.

Step forward through it. The two multiplications give 6-6 and 00. The sum is 6-6. Adding the bias gives 0.8814\approx 0.8814. And tanh(0.8814)0.7071\tanh(0.8814) \approx 0.7071. That is the output of one neuron, computed as a forward pass over its computational graph.

You just evaluated a neuron without anyone telling you the formula for “a neuron.” You walked a graph.

What you have, what is next

Three things from this lesson:

  1. Any expression you can write down can be drawn as a directed acyclic graph: nodes hold values, edges hold operations.
  2. The forward pass evaluates the graph in topological order, leaves to root.
  3. Every edge has a local derivative that depends only on the operation, not on what is downstream.

That is enough structure for the backward pass. In the next lesson you walk the same graph in reverse, distributing one number — the gradient of the loss with respect to each node — using only these local derivatives. That walk has a name: reverse-mode autodiff. Karpathy named his implementation of it after the small thing that does it. Micrograd.

Lesson complete

Nice tinkering.