Multivariable Calculus: Partial Derivatives, Gradients, Jacobians · 20 min

loss.backward() is a Vector–Jacobian Product

A neural network is a long function composition. Its derivative is a long product of Jacobians. We never build those Jacobians; we multiply them, one at a time, onto a row vector, starting at the loss and walking back toward the input. That walk is backprop.


Forward or reverse: pick a direction.

Below is a 2-to-2 function $\mathbf{F}(x, y) = (x^2 + y,\; xy)$. The Jacobian at the marked base point $\mathbf{p}$ is

$$J = \begin{bmatrix} 2x & 1 \\ y & x \end{bmatrix}.$$

[Interactive figure: F(x, y) = (x² + y, xy) at p = (1, 0.8). Left panel, input space: draggable tangent dv = (0.60, 0.40). Right panel, output space: computed w = J · dv = (1.60, 0.88).]

Forward mode: drag the coral arrow on the left, and the output tangent updates as J · dv. Reverse mode uses the same Jacobian in the opposite direction, and it is what loss.backward() does.

Two modes:

  • Forward (JVP): drag the input tangent on the left. The widget computes the output tangent as $J \cdot d\mathbf{v}$ on the right.
  • Reverse (VJP): drag the output cotangent on the right. The widget computes the input cotangent as $J^\top \cdot \mathbf{w}$ on the left.

Same Jacobian, opposite direction of action. The pedagogy of this whole lesson is in that toggle.
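The two modes can be sketched in a few lines of plain Python. This is only an illustration: the helper names `jacobian`, `jvp`, and `vjp` are made up for this sketch, and the output cotangent (1, 0) is an arbitrary choice, not a value taken from the widget. The base point and input tangent are the ones shown in the figure.

```python
def jacobian(x, y):
    """Jacobian of F(x, y) = (x^2 + y, x*y), written out by hand."""
    return [[2.0 * x, 1.0],
            [y, x]]

def jvp(J, dv):
    """Forward mode: J @ dv pushes an input tangent to the output side."""
    return [sum(J[i][j] * dv[j] for j in range(len(dv)))
            for i in range(len(J))]

def vjp(J, w):
    """Reverse mode: w @ J (the same numbers as J^T @ w) pulls an
    output cotangent back to the input side."""
    return [sum(w[i] * J[i][j] for i in range(len(w)))
            for j in range(len(J[0]))]

J = jacobian(1.0, 0.8)                # base point p = (1, 0.8)
out_tangent = jvp(J, [0.6, 0.4])      # approximately (1.60, 0.88), as in the figure
in_cotangent = vjp(J, [1.0, 0.0])     # assumed cotangent; picks out the first row of J

print(out_tangent)
print(in_cotangent)
```

Both functions read the same matrix `J`; the only difference is which side the vector sits on.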

The loss, as a function

A neural network has parameters: its weights and biases. Stack them all into $\mathbf{w} = (w_1, \dots, w_P) \in \mathbb{R}^P$: a single tuple with one entry per parameter.

For a given training example, the network produces a prediction, compares it to the target, and outputs a single number: the loss. So the loss is a function

$$L: \mathbb{R}^P \longrightarrow \mathbb{R}.$$

Many inputs (millions for a real model). One output (a scalar).

Training means: nudge $\mathbf{w}$ in the direction that reduces $L$. The direction that reduces $L$ fastest is $-\nabla L$. Training is: compute $\nabla L$, step against it, repeat.
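As a rough sketch of that loop, here is gradient descent on a toy loss. Everything specific is an assumption made for illustration: the bowl-shaped loss with its minimum at (3, -1), the learning rate 0.1, and the step count are not from the lesson, and the gradient is written by hand rather than computed by autodiff.

```python
def loss(w):
    # Hypothetical loss: a bowl with its minimum at (3, -1).
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2

def grad(w):
    # Gradient of the loss above, derived by hand.
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]

def gradient_descent(w, lr=0.1, steps=100):
    """Repeat: compute the gradient, step against it."""
    for _ in range(steps):
        g = grad(w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

w_start = [0.0, 0.0]
w_final = gradient_descent(w_start)
print(loss(w_start), loss(w_final))   # the loss shrinks toward 0
print(w_final)                        # close to (3, -1)
```

The loop never needs anything from the loss except its gradient, which is why the rest of the lesson is about computing that gradient cheaply.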

The only real question: how do you compute $\nabla L$ when $L$ is the composition of thirty operations, each itself a vector-valued function?

A net is a composition

The network as a chain:

$$L \;=\; \ell \,\circ\, f_k \,\circ\, f_{k-1} \,\circ\, \cdots \,\circ\, f_2 \,\circ\, f_1$$

where $f_i$ is the $i$-th layer (linear + nonlinearity) and $\ell$ is the final loss function. The input is the example’s features; the output is one number.

By the chain rule, the Jacobian of $L$ with respect to the parameters is the product of all those layer Jacobians:

$$J_L \;=\; J_\ell \cdot J_{f_k} \cdot J_{f_{k-1}} \cdots J_{f_2} \cdot J_{f_1}.$$

$L$ has one output, so $J_L$ is a $1 \times P$ row, which is exactly $\nabla L$, written sideways.

The trick: don't materialize the Jacobians

If each layer’s Jacobian were $10{,}000 \times 10{,}000$ and you multiplied them all together, you’d be moving giant matrices for nothing. You don’t need any of the intermediate matrix products. You only need the single row the whole product collapses to: the gradient.

Here’s the move. The leftmost factor, $J_\ell$, is a $1 \times n$ row (because $\ell$ is scalar-output). When you multiply a $1 \times n$ row by an $n \times m$ matrix, you get a $1 \times m$ row. That row is a vector–Jacobian product (VJP):

$$\text{row} \;\times\; J \;=\; \text{row}.$$

Instead of building each full layer Jacobian and multiplying the next one in, you keep a single row around and resize it by VJP-ing it with one layer’s Jacobian at a time. In the product above you work left to right, starting from $J_\ell$; in the computation graph that means visiting the layers in reverse: output first, input last.

That’s why it’s called the backward pass. At the end, the row you have is $\nabla L$. Every entry is a partial derivative $\partial L / \partial w_i$. loss.backward() walks exactly this path and writes those numbers into parameter.grad.
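A small sketch of that walk, in plain Python. The three Jacobians below are made-up numbers for a hypothetical two-layer chain, written as explicit matrices only so the result can be checked. A real framework would never store them; it would supply each layer’s VJP as a function instead.

```python
def matmul(A, B):
    """Plain matrix product, used only to build the reference answer."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def vjp(row, J):
    """row @ J: a 1 x n row times an n x m matrix gives a 1 x m row."""
    return [sum(row[i] * J[i][j] for i in range(len(row)))
            for j in range(len(J[0]))]

# Hypothetical Jacobians for the chain L = loss(f2(f1(x))).
J_loss = [1.0, -2.0]              # 1 x 2 row (the loss is scalar-output)
J_f2 = [[1.0, 0.0, 2.0],
        [0.0, 3.0, 1.0]]          # 2 x 3
J_f1 = [[2.0, 1.0],
        [0.0, 1.0],
        [1.0, 1.0]]               # 3 x 2

# Backward pass: carry one row from the output side to the input side.
row = J_loss
for J in (J_f2, J_f1):
    row = vjp(row, J)

# Reference: materialize the matrix product first, then apply the row.
reference = vjp(J_loss, matmul(J_f2, J_f1))

print(row)         # [2.0, -5.0]
print(reference)   # [2.0, -5.0]
```

Both routes give the same row, but the backward pass only ever holds a vector, while the reference route builds a full matrix product along the way.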

Forward mode vs reverse mode

Two ways to walk a chain of Jacobians. You played with both:

  • Forward (JVP): multiply from the right. Push a tangent vector $\mathbf{v}$ from the input side through each layer. Good when $n$ (inputs) is small.
  • Reverse (VJP): multiply from the left. Pull a cotangent from the output side back through each layer. Good when $m$ (outputs) is small. Perfect for scalar-output losses.

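A neural-network loss has $m = 1$ output and $P \approx$ millions of inputs. Forward mode would need one pass per input direction, $P$ passes in all, to fill in the gradient; reverse mode needs one. Reverse mode wins by a factor of roughly $P$. That’s why the major deep learning frameworks default to reverse-mode autodiff for training, and why .backward() exists on the scalar loss tensor, not on individual parameters.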
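The pass count can be made concrete with a toy function. This sketch assumes $f(\mathbf{x}) = \sum_i x_i^2$ with $P = 4$ inputs, and the JVP and VJP are hand-derived for that one function. The names `f_and_jvp` and `f_and_vjp` are hypothetical, not from any library.

```python
def f_and_jvp(x, v):
    """One forward-mode pass: f(x) and the directional derivative along v."""
    value = sum(xi * xi for xi in x)
    tangent = sum(2.0 * xi * vi for xi, vi in zip(x, v))
    return value, tangent

def f_and_vjp(x, cotangent=1.0):
    """One reverse-mode pass: f(x) and the whole gradient row at once."""
    value = sum(xi * xi for xi in x)
    grad = [cotangent * 2.0 * xi for xi in x]
    return value, grad

x = [1.0, -2.0, 0.5, 3.0]
P = len(x)

# Forward mode recovers one partial per pass, so it needs P passes.
forward_grad = []
forward_passes = 0
for i in range(P):
    one_hot = [1.0 if j == i else 0.0 for j in range(P)]
    _, partial = f_and_jvp(x, one_hot)
    forward_grad.append(partial)
    forward_passes += 1

# Reverse mode gets every partial from a single pass.
_, reverse_grad = f_and_vjp(x)
reverse_passes = 1

print(forward_grad, forward_passes)   # [2.0, -4.0, 1.0, 6.0] 4
print(reverse_grad, reverse_passes)   # [2.0, -4.0, 1.0, 6.0] 1
```

With four inputs the gap is 4 passes to 1; with millions of parameters it is millions to 1.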

Work a real example by hand

Let’s do one. Loss $L(w, b) = (wx + b - y)^2$, for a fixed training example $x = 2$, $y = 5$. Evaluate at $(w, b) = (1, 1)$.

Forward pass:

  • $\hat y = wx + b = 1 \cdot 2 + 1 = 3$.
  • Error $e = \hat y - y = 3 - 5 = -2$.
  • Loss $L = e^2 = 4$.

Backward pass (chain rule, walked from the loss outward):

  • $\dfrac{\partial L}{\partial e} = 2e = -4$.
  • $\dfrac{\partial L}{\partial \hat y} = -4 \cdot 1 = -4$.
  • $\dfrac{\partial L}{\partial w} = -4 \cdot x = -4 \cdot 2 = -8$.
  • $\dfrac{\partial L}{\partial b} = -4 \cdot 1 = -4$.

So $\nabla L = (-8, -4)$. Each coordinate was computed by multiplying “the gradient so far” (a single number at this scale, because every intermediate value is a scalar) by the local partial at one edge. At scale, those scalars become rows and the partials become layer Jacobians, but the algorithm is identical.
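The same arithmetic, written as a short Python sketch with a finite-difference check added. The function names and the step size `h` are choices made for this illustration; the numbers are the ones from the worked example.

```python
def forward(w, b, x=2.0, y=5.0):
    """Forward pass: prediction, error, loss."""
    y_hat = w * x + b
    e = y_hat - y
    L = e * e
    return y_hat, e, L

def backward(w, b, x=2.0, y=5.0):
    """Backward pass: multiply the gradient so far by each local partial."""
    _, e, _ = forward(w, b, x, y)
    dL_de = 2.0 * e               # L = e^2
    dL_dyhat = dL_de * 1.0        # e = y_hat - y
    dL_dw = dL_dyhat * x          # y_hat = w*x + b
    dL_db = dL_dyhat * 1.0
    return dL_dw, dL_db

def finite_difference(w, b, h=1e-6):
    """Numerical check: central differences on the loss itself."""
    dw = (forward(w + h, b)[2] - forward(w - h, b)[2]) / (2.0 * h)
    db = (forward(w, b + h)[2] - forward(w, b - h)[2]) / (2.0 * h)
    return dw, db

grad = backward(1.0, 1.0)
check = finite_difference(1.0, 1.0)

print(grad)    # (-8.0, -4.0)
print(check)   # agrees with grad up to rounding error
```

Comparing a hand-derived gradient against finite differences is a cheap way to catch a sign or factor mistake in a backward pass.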

This is, literally, what happens every time loss.backward() runs.

Compute ∂L/∂w

Same loss with $x = 2$, $y = 5$, $(w, b) = (1, 1)$. Confirm $\dfrac{\partial L}{\partial w}$.

And ∂L/∂b

Same setup. Confirm $\dfrac{\partial L}{\partial b}$.

Where this lands

You just did, by hand, what loss.backward() does for a network of two parameters. In Module 12 we’ll build micrograd, a tiny scalar autograd engine that walks a computation graph and fills in .grad fields using exactly this procedure, then vectorize it from scalars to tensors. Module 15 wires it through an attention layer, and Module 18 uses all of it to train a character-level transformer in your browser.
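As a preview of the idea, here is a very small scalar autograd sketch. It is not micrograd’s actual code: the `Value` class below is a hypothetical stand-in that supports only `+` and `*`, enough to replay the worked example and fill in `.grad` fields automatically.

```python
class Value:
    """Hypothetical scalar graph node: data, grad, and a local backward rule."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Order the graph so each node comes after its parents,
        # then apply the local rules from the output back to the inputs.
        order, seen = [], set()
        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()

w, b = Value(1.0), Value(1.0)
x, neg_y = Value(2.0), Value(-5.0)   # y = 5 enters as -5 since only + and * exist
e = w * x + b + neg_y
loss = e * e
loss.backward()

print(loss.data, w.grad, b.grad)     # 4.0 -8.0 -4.0
```

The gradients match the hand computation: each node only knows its own local partials, and the reversed walk does the rest.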

The algorithm is already known to you. Everything else is bookkeeping at scale.
