Functions that return vectors
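So far, $f$ has eaten tuples but returned a single number. Now make the output a tuple too.
A vector-valued function $f: \mathbb{R}^n \to \mathbb{R}^m$ eats an $n$-tuple and returns an $m$-tuple. You can write it as $m$ separate scalar functions stacked up:
$$f(x_1, \dots, x_n) = \big(f_1(x_1, \dots, x_n),\ \dots,\ f_m(x_1, \dots, x_n)\big)$$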
Concrete example: . Two inputs, two outputs.
A neural-network layer is exactly this: a vector-in, vector-out function. Every layer.
The Jacobian, as a stack of gradients
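Each of the $m$ output components $f_i$ has its own gradient: an $n$-tuple of partials. Stack those gradients into a grid:
$$J = \begin{bmatrix} \partial f_1/\partial x_1 & \cdots & \partial f_1/\partial x_n \\ \vdots & \ddots & \vdots \\ \partial f_m/\partial x_1 & \cdots & \partial f_m/\partial x_n \end{bmatrix}$$
This is the Jacobian matrix. It has $m$ rows (one per output) and $n$ columns (one per input). Row $i$ is the gradient of $f_i$, written sideways.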
Convention pledge (this matters and we’re picking a side): numerator layout, rows indexed by the output, columns indexed by the input. Some textbooks use the transpose. Don’t mix.
For $m = 1$ (scalar output), the Jacobian is a single row: $\begin{bmatrix} \partial f/\partial x_1 & \cdots & \partial f/\partial x_n \end{bmatrix}$. That row is the gradient, written sideways. Gradient and Jacobian aren’t different animals; the gradient is the Jacobian of a scalar-output function.
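To make the "stack of gradients" picture concrete, here is a minimal sketch that approximates a Jacobian by central differences. The function `F`, the helper `jacobian`, and the evaluation point are hypothetical choices for illustration, not the lesson's own example; the layout follows the numerator convention pledged above (rows by output, columns by input).

```python
import math

def F(x, y):
    # Hypothetical example function: two inputs, two outputs.
    return [x * x * y, x + math.sin(y)]

def f_scalar(x, y):
    # Hypothetical scalar-output function, wrapped in a list so it has one "row".
    return [x * x * y]

def jacobian(func, point, h=1e-6):
    """Approximate the Jacobian of func at point by central differences.

    Rows are indexed by output, columns by input (numerator layout).
    """
    m = len(func(*point))   # number of outputs
    n = len(point)          # number of inputs
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        plus, minus = list(point), list(point)
        plus[j] += h
        minus[j] -= h
        f_plus, f_minus = func(*plus), func(*minus)
        for i in range(m):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * h)
    return J

J = jacobian(F, (1.0, 2.0))
# Hand-computed Jacobian of F is [[2xy, x^2], [1, cos y]], evaluated at (1, 2).
expected = [[4.0, 1.0], [1.0, math.cos(2.0)]]

# Scalar output: the Jacobian is a single row, i.e. the gradient written sideways.
row = jacobian(f_scalar, (1.0, 2.0))
```

Each row of `J` is the gradient of one output component, and `row` has shape 1 by 2, matching the first row of `J` because `f_scalar` is the first component of `F`.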
Fill in a Jacobian
For , compute (row 1, column 1) at the point .
And another entry
Same . Now compute (row 2, column 2) at .
The multivariable chain rule: one output
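Take a composed function. Input $t$ goes through a curve $\mathbf{r}(t) = (x(t), y(t))$ in 2D, and the output of that curve gets fed into a scalar function $f(x, y)$. In symbols, $g(t) = f(\mathbf{r}(t))$.
We want $dg/dt$. Multivariable version of the chain rule: differentiate along the curve’s path.
$$\frac{dg}{dt} = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt} = \nabla f(\mathbf{r}(t)) \cdot \mathbf{r}'(t)$$
Read it: the rate at which the scalar changes along the curve equals the gradient of $f$ dotted with the curve’s velocity. The gradient says “which direction increases $f$ fastest, and how fast.” $\mathbf{r}'(t)$ says “which direction you’re actually going.” Dot product them to get “how fast $f$ actually changes given your actual motion.”
This is the chain rule that the lesson The Gradient leaned on for its perpendicularity proof: if $\mathbf{r}(t)$ stays on a level set, $f(\mathbf{r}(t))$ is constant, so $\frac{d}{dt} f(\mathbf{r}(t)) = 0$, which forces $\nabla f \cdot \mathbf{r}'(t) = 0$. Gradient perpendicular to tangent. Proven.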
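A quick numerical check of "gradient dotted with velocity" can be sketched as follows. The scalar function, the parabola, and the circle used here are hypothetical examples chosen for illustration (they are not the lesson's exercise), and the hand-written gradients are assumptions you can verify on paper.

```python
import math

def f(x, y):
    # Hypothetical scalar function.
    return x * x * y

def grad_f(x, y):
    # Gradient of f, computed by hand: (2xy, x^2).
    return (2 * x * y, x * x)

def r(t):
    # Hypothetical curve: a parabola in the plane.
    return (t, t * t)

def r_prime(t):
    # Velocity of the curve.
    return (1.0, 2 * t)

def chain_rule(t):
    # Gradient at the current point on the curve, dotted with the velocity.
    gx, gy = grad_f(*r(t))
    vx, vy = r_prime(t)
    return gx * vx + gy * vy

def numeric(t, h=1e-6):
    # Central difference of the composition t -> f(r(t)), for comparison.
    return (f(*r(t + h)) - f(*r(t - h))) / (2 * h)

# Here f(r(t)) = t^4, so the derivative at t = 1 is 4.
via_chain_rule = chain_rule(1.0)   # 4.0
via_numeric = numeric(1.0)         # approximately 4.0

def level_set_dot(t):
    # Level-set case: f2(x, y) = x^2 + y^2 is constant on the unit circle,
    # so its gradient (2x, 2y) should be perpendicular to the circle's velocity.
    x, y = math.cos(t), math.sin(t)
    vx, vy = -math.sin(t), math.cos(t)
    return 2 * x * vx + 2 * y * vy
```

The last function illustrates the perpendicularity argument: on a level set the dot product vanishes at every `t`.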
Move on a circle, watch the height
Let where and . Compute .
(No decimals. The answer is an integer.)
The chain rule in full generality, as a graph
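Compose two vector-valued functions: $h = f \circ g$, so that $h(x) = f(g(x))$. The chain rule says the Jacobian of the composition is the product of the Jacobians:
$$J_h(x) = J_f(g(x))\, J_g(x)$$
Entry-by-entry, writing $y = g(x)$ for the intermediate variables and $z = f(y)$ for the outputs:
$$\frac{\partial z_i}{\partial x_j} = \sum_k \frac{\partial z_i}{\partial y_k}\,\frac{\partial y_k}{\partial x_j}$$
For each input $x_j$, sum over all the intermediate variables $y_k$ the products of “how $z_i$ depends on $y_k$” times “how $y_k$ depends on $x_j$.” Sum over paths, multiply along edges.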
The widget below makes this literal. Click edges to highlight partials, or press a button to assemble the chain-rule sum for or as a sum over the two paths from input to output.
The shape of the formula is the shape of the graph. If you’ve ever drawn a computation graph, this is exactly what you were tracing.
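The "sum over paths, multiply along edges" reading can be sketched in plain Python. The inner function `g`, the outer function behind `jac_f`, and the evaluation point are hypothetical examples, and all Jacobians are written out by hand as assumptions you can check on paper. The `sum(... for k ...)` inside `matmul` is the sum over intermediate variables.

```python
def g(x1, x2):
    # Hypothetical inner function: inputs x -> intermediates y.
    return (x1 * x2, x1 + x2)

def jac_g(x1, x2):
    # Jacobian of g, by hand. Rows: y1, y2. Columns: x1, x2.
    return [[x2, x1],
            [1.0, 1.0]]

def jac_f(y1, y2):
    # Jacobian of the hypothetical outer function f(y1, y2) = (y1**2, y1*y2).
    # Rows: z1, z2. Columns: y1, y2.
    return [[2 * y1, 0.0],
            [y2, y1]]

def matmul(A, B):
    # Entry (i, j) sums over every intermediate k: one term per path
    # from input j to output i, each term a product along that path's edges.
    inner = len(B)
    return [[sum(A[i][k] * B[k][j] for k in range(inner))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def jac_h_direct(x1, x2):
    # Compose first, then differentiate by hand:
    # h(x) = ((x1*x2)**2, x1*x2*(x1 + x2)).
    return [[2 * x1 * x2 ** 2, 2 * x1 ** 2 * x2],
            [2 * x1 * x2 + x2 ** 2, x1 ** 2 + 2 * x1 * x2]]

x = (1.0, 2.0)
y = g(*x)                                  # intermediates: (2.0, 3.0)
J_chain = matmul(jac_f(*y), jac_g(*x))     # [[8.0, 4.0], [8.0, 5.0]]
J_direct = jac_h_direct(*x)                # same matrix, computed the other way
```

Note that `jac_f` is evaluated at the intermediate point `y`, not at `x`; forgetting that is an easy mistake when multiplying Jacobians by hand.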
Chain something small
Let and . Compute at .
Try it two ways: compose first and differentiate, or multiply the Jacobians. Same answer.
Shape hygiene
A tip that will save you weeks of debugging later: when you compute a Jacobian or a chain-rule product, look at the shape first. Does the result have the shape it should?
Rule of thumb: differentiating an $m$-output function with respect to $n$ inputs gives you an $m \times n$ grid. If your answer’s shape doesn’t match, you’ve got a bug before checking any numbers.
In ML frameworks, shape errors are among the most common bugs. Track the shapes, always.
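One way to build the habit is to check shapes before looking at any numbers. The sketch below uses hypothetical helpers (`shape`, `checked_matmul`) and placeholder Jacobians filled with zeros, assuming an outer function with 3 inputs and 2 outputs composed with an inner function with 4 inputs and 3 outputs.

```python
def shape(M):
    # (rows, columns) of a matrix stored as a list of rows.
    return (len(M), len(M[0]))

def checked_matmul(A, B):
    # Refuse to multiply when the inner dimensions disagree.
    (m, k1), (k2, n) = shape(A), shape(B)
    if k1 != k2:
        raise ValueError(f"shape mismatch: {m}x{k1} times {k2}x{n}")
    return [[sum(A[i][k] * B[k][j] for k in range(k1)) for j in range(n)]
            for i in range(m)]

# Placeholder Jacobians: only the shapes matter for this check.
J_f = [[0.0] * 3 for _ in range(2)]   # 2 outputs, 3 inputs -> 2 x 3
J_g = [[0.0] * 4 for _ in range(3)]   # 3 outputs, 4 inputs -> 3 x 4

J_h = checked_matmul(J_f, J_g)        # 2 x 4: outputs of f by inputs of g

# Multiplying in the wrong order is caught before any arithmetic happens.
try:
    checked_matmul(J_g, J_f)
    wrong_order_caught = False
except ValueError:
    wrong_order_caught = True
```

The result has one row per final output and one column per original input, which is the shape the rule of thumb predicts for the composition.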
Teaser: what this unlocks
Every neural network is a long composition: input → linear layer → nonlinearity → linear layer → … → loss. Dozens of functions in series. The derivative of the loss with respect to any parameter, by the chain rule, is a long product of Jacobians.
But here’s the twist we’ll develop in the next lesson: we never compute the full Jacobians. We multiply them onto a row vector (the gradient of the loss), working from the output backward to the input. That multiplication (row times matrix) is called a vector–Jacobian product, and it is exactly what loss.backward() computes.
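As a small preview, the row-times-matrix idea can be sketched in plain Python. This is only an illustration of the arithmetic, not a description of how any framework implements it; the two Jacobians and the row vector `v` (standing in for the gradient of the loss with respect to the final outputs) are hypothetical numbers.

```python
def vec_mat(v, M):
    # Row vector times matrix: returns the row vector v M.
    return [sum(v[i] * M[i][j] for i in range(len(v)))
            for j in range(len(M[0]))]

def matmul(A, B):
    # Full matrix product, used here only for comparison.
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

# Hypothetical Jacobians of an outer layer and an inner layer.
J_f = [[4.0, 0.0],
       [3.0, 2.0]]
J_g = [[2.0, 1.0],
       [1.0, 1.0]]

# Hypothetical gradient of the loss with respect to the outer layer's outputs.
v = [1.0, 1.0]

# Backward sweep: push the row vector through one Jacobian at a time,
# starting from the output side. No matrix-matrix product is ever formed.
step = vec_mat(v, J_f)        # [7.0, 2.0]
grad_x = vec_mat(step, J_g)   # [16.0, 9.0]

# Same answer as building the full Jacobian first, then multiplying.
grad_x_full = vec_mat(v, matmul(J_f, J_g))   # [16.0, 9.0]
```

Both routes agree, but the backward sweep only ever holds a row vector between steps, which is why it scales to long chains of layers.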
Next lesson pulls that thread.