Backpropagation from Scratch · 18 min

From scalars to tensors

Stack 64 scalars into a vector, stack 768 vectors into a matrix, and the same chain rule still works. The forward is faster. The backward gradients are now matrices. The bookkeeping (broadcasting, axis-sums, the famous softmax+CE collapse) is what this lesson teaches.

0 / 0

Why scalars do not scale

A linear layer that maps 768 inputs to 64 outputs holds 768×64=49,152768 \times 64 = 49{,}152 weights, plus 64 biases. With micrograd, that is roughly fifty thousand separate Value objects, each with their own _backward closure, evaluated one at a time.

That is fine for teaching. It is catastrophic for actually running anything. The fix is to repackage all of those scalars into one matrix and do the entire layer’s worth of arithmetic in a single matmul. The same chain rule applies — but now the gradients on the edges are matrices, not scalars, and the closures have to handle that.

The good news: the rules don’t change. Local derivative times upstream, summed across paths. Only the types of the things being multiplied changed.

Matmul backward, by shape-check

For a single linear layer, the forward is

Y=XW,XRN×D,  WRD×M,  YRN×M.Y = X W, \qquad X \in \mathbb{R}^{N \times D}, \; W \in \mathbb{R}^{D \times M}, \; Y \in \mathbb{R}^{N \times M}.

The chain rule says we need L/X\partial L / \partial X and L/W\partial L / \partial W in terms of L/Y\partial L / \partial Y (which we know, because it was deposited by everything downstream). You can derive these from the scalar rule plus index sums, but here is the cheap trick: shape-check.

We want L/X\partial L / \partial X to have the same shape as XX, which is (N,D)(N, D). We have L/Y\partial L / \partial Y with shape (N,M)(N, M) and WW with shape (D,M)(D, M). The only matmul that produces (N,D)(N, D) from those two is L/YW\partial L / \partial Y \cdot W^\top, since WW^\top has shape (M,D)(M, D).

  LX=LYW  \boxed{\;\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} \, W^\top\;}

The same argument forces L/W=XL/Y\partial L / \partial W = X^\top \cdot \partial L / \partial Y, which has the right shape (D,M)(D, M).

  LW=XLY  \boxed{\;\frac{\partial L}{\partial W} = X^\top \, \frac{\partial L}{\partial Y}\;}

You can derive these from first principles by writing Yij=kXikWkjY_{ij} = \sum_k X_{ik} W_{kj} and computing L/Xik\partial L / \partial X_{ik} via the chain rule. The shape-check argument arrives at the same answer in five seconds.

Shape of dW

With X.shape=(32,100)X.\text{shape} = (32, 100), W.shape=(100,10)W.\text{shape} = (100, 10), and dY.shape=(32,10)dY.\text{shape} = (32, 10):

What is the second dimension of dW=XdYdW = X^\top dY?

(The shape will be (100, ?). Type the second number.)

Elementwise nonlinearities, unchanged

For an elementwise operation like Y=ϕ(X)Y = \phi(X)relu, tanh, sigmoid applied to every entry — the backward is also elementwise. The chain rule per entry, written in tensor form:

LX=LYϕ(X),\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} \odot \phi'(X),

where \odot is the elementwise (Hadamard) product. No matmul, no shape acrobatics — just pointwise multiplication by the local derivative. tanh'(x) = 1 - tanh²(x), sigmoid'(x) = σ(x)(1 - σ(x)), relu'(x) = [x > 0]. The scalar formulas you already wrote in lesson 12.3, applied per entry.

Broadcasting: forward replicates, backward sums

Bias vectors expose the trickiest tensor case. The forward is straightforward:

Y=XW+b,bRM.Y = X W + b, \qquad b \in \mathbb{R}^M.

The bias has shape (M,)(M,) but XWXW has shape (N,M)(N, M). PyTorch “broadcasts” bb across the batch — it is conceptually replicated NN times so each row of XWXW gets the same bias.

What is L/b\partial L / \partial b? You have L/Y\partial L / \partial Y with shape (N,M)(N, M). You need something of shape (M,)(M,) to match bb.

The rule, which generalizes across every tensor library and every broadcast pattern:

Every axis that was broadcast in the forward must be reduce-summed in the backward.

For bias broadcast across the batch axis, the gradient is dY.sum(axis=0). Try it.

forward: Y = X · W + b

b has shape (M,) = (4,). Adding it to a batch matrix replicates it across 3 rows. That replication is a broadcast along axis 0.

backward: given dY of shape (3, 4), what is db?

dY
-1.000.450.620.47
-0.13-0.71-0.91-0.58
-0.60-0.840.86-0.58
(3, 4)
pick the axis to sum over:

Rule of thumb: every axis that was broadcast in the forward must be reduce-summed in the backward. Forward broadcast = backward sum.

Pick the wrong axis and the result has the wrong shape; the widget tells you. Pick axis 0 and you get the right (M,)(M,) shape.

The bias backward

Forward: Y=X+bY = X + b with XX shape (32,64)(32, 64) and bb shape (64,)(64,).

You have dYdY with shape (32,64)(32, 64). What is the only number in the shape of dbdb?

The softmax + cross-entropy collapse

The output layer of every classifier in deep learning looks the same: a linear layer produces logits zRKz \in \mathbb{R}^K, softmax turns those into probabilities pp, cross-entropy with a one-hot target yy produces a scalar loss LL:

pi=ezikezk,L=iyilogpi.p_i = \frac{e^{z_i}}{\sum_k e^{z_k}}, \qquad L = -\sum_i y_i \log p_i.

A naive autograd implementation runs the backward through both functions: first the gradient of LL w.r.t. pp (which is y/p-y/p component-wise), then the Jacobian of softmax (a full K×KK \times K matrix), multiply them.

The algebra collapses miraculously. Working through the chain rule:

Lzj=iyipipi(δijpj)=yj+pjiyi=pjyj,\frac{\partial L}{\partial z_j} = -\sum_i \frac{y_i}{p_i} \cdot p_i (\delta_{ij} - p_j) = -y_j + p_j \sum_i y_i = p_j - y_j,

using iyi=1\sum_i y_i = 1 for a one-hot target. The whole ugly Jacobian-times-gradient stack reduces to:

  Lz=py  \boxed{\;\frac{\partial L}{\partial z} = p - y\;}

The “gradient of softmax-cross-entropy with respect to the logits is the prediction error.” Every classifier in deep learning trains using this one identity. PyTorch’s nn.CrossEntropyLoss is a single fused op specifically so that this collapse is exploited and so numerical stability (log-sum-exp) is handled in one place.

One backward pass through the classifier head

Logits z=[2.0,1.0,0.1]z = [2.0, 1.0, 0.1], target y=[1,0,0]y = [1, 0, 0] (one-hot, class 0).

Compute p=softmax(z)p = \text{softmax}(z) and then pyp - y.

What is the middle entry of pyp - y, to three decimals?

Three identities, one engine

You now have the three tensor identities that show up in every neural network’s backward pass:

LX=LYW,LW=XLY,Lb=unbroadcast ⁣(LY).\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} W^\top, \qquad \frac{\partial L}{\partial W} = X^\top \frac{\partial L}{\partial Y}, \qquad \frac{\partial L}{\partial b} = \text{unbroadcast}\!\left(\frac{\partial L}{\partial Y}\right).

Plus the elementwise rule for nonlinearities, plus the softmax+CE collapse. That is the entire tensor backward toolkit for a modern feed-forward network. A transformer layer adds a few more shapes (multi-head reshape, layer norm) but every individual op uses one of these patterns.

PyTorch and JAX hide all of this behind a tape that records ops as you run them. But the tape’s contents — the rules for each op — are exactly what you just learned. There is no other secret.

One lesson left: the four classic backprop bugs and the test that catches all of them. After that, you have shipped m12.

Lesson complete

Nice tinkering.