Multivariable Calculus: Partial Derivatives, Gradients, Jacobians · 20 min

Jacobians and the Multivariable Chain Rule

When a function returns multiple numbers, stack the gradients of its outputs into a grid, and that grid is the Jacobian. Compose two functions and their Jacobians multiply. Neural networks are exactly this, about a hundred times in a row.


Functions that return vectors

So far, $f$ has eaten tuples but returned a single number. Now make the output a tuple too.

A vector-valued function $\mathbf{F}: \mathbb{R}^n \to \mathbb{R}^m$ eats an $n$-tuple and returns an $m$-tuple. You can write it as $m$ separate scalar functions stacked up:

$$\mathbf{F}(\mathbf{x}) \;=\; \bigl( F_1(\mathbf{x}),\; F_2(\mathbf{x}),\; \dots,\; F_m(\mathbf{x}) \bigr), \qquad \mathbf{x} = (x_1, \dots, x_n).$$

Concrete example: $\mathbf{F}(x, y) = (x^2 - y,\; x y)$. Two inputs, two outputs.

A neural-network layer is exactly this: a vector-in, vector-out function. Every layer.

The Jacobian, as a stack of gradients

Each of the $m$ output components $F_i$ has its own gradient: an $n$-tuple of partials. Stack those $m$ gradients into a grid:

$$J_{\mathbf{F}}(\mathbf{x}) \;=\; \begin{pmatrix} \dfrac{\partial F_1}{\partial x_1} & \cdots & \dfrac{\partial F_1}{\partial x_n} \\[0.4em] \vdots & \ddots & \vdots \\[0.4em] \dfrac{\partial F_m}{\partial x_1} & \cdots & \dfrac{\partial F_m}{\partial x_n} \end{pmatrix}.$$

This is the Jacobian matrix. It has $m$ rows (one per output) and $n$ columns (one per input). Row $i$ is the gradient of $F_i$, written sideways.

Convention pledge (this matters and we’re picking a side): numerator layout, rows indexed by the output, columns indexed by the input. Some textbooks use the transpose. Don’t mix.

For $f: \mathbb{R}^n \to \mathbb{R}$ (scalar output), the Jacobian is a single row: $1 \times n$. That row is the gradient, written sideways. Gradient and Jacobian aren’t different animals; the gradient is the Jacobian of a scalar-output function.
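One way to sanity-check a hand-computed Jacobian is to approximate it numerically. The sketch below is a minimal illustration in plain Python, not a library routine: the helper name `numerical_jacobian`, its step size `h`, and the evaluation point are choices made here for illustration. It applies central differences to the example $\mathbf{F}(x, y) = (x^2 - y,\; x y)$ and lays the result out with one row per output and one column per input, following the convention above.

```python
def F(x, y):
    """The lesson's example: two inputs, two outputs."""
    return (x**2 - y, x * y)


def numerical_jacobian(func, point, h=1e-6):
    """Approximate the Jacobian of func at point by central differences.

    Returns a list of m rows with n entries each:
    rows are indexed by output, columns by input.
    (Hypothetical helper written for this illustration.)
    """
    n = len(point)            # number of inputs
    m = len(func(*point))     # number of outputs
    jac = [[0.0] * n for _ in range(m)]
    for j in range(n):
        plus = list(point)
        minus = list(point)
        plus[j] += h          # nudge only input j up
        minus[j] -= h         # and down
        f_plus = func(*plus)
        f_minus = func(*minus)
        for i in range(m):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2 * h)
    return jac


# Arbitrary evaluation point, chosen only for illustration.
J = numerical_jacobian(F, (2.0, 3.0))
for row in J:
    print([round(entry, 4) for entry in row])
```

The result is approximate, so it is best used to catch sign and indexing mistakes in a hand-derived Jacobian rather than to replace the derivation.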

Fill in a Jacobian

For $\mathbf{F}(x, y) = (x^2 - y,\; x y)$, compute $J_{1,1}$ (row 1, column 1) at the point $(1, 2)$.

And another entry

Same $\mathbf{F}$. Now compute $J_{2,2}$ (row 2, column 2) at $(1, 2)$.

The multivariable chain rule: one output

Take a composed function. Input $t$ goes through a curve $\mathbf{r}(t)$ in 2D, and the output of that curve gets fed into a scalar function $f$. In symbols, $z(t) = f(\mathbf{r}(t))$.

We want $dz/dt$. Multivariable version of the chain rule: differentiate $f$ along the curve’s path.

$$\frac{d}{dt} f(\mathbf{r}(t)) \;=\; \nabla f(\mathbf{r}(t)) \cdot \mathbf{r}'(t).$$

Read it: the rate at which the scalar $f$ changes along the curve equals the gradient of $f$ dotted with the curve’s velocity. The gradient says “which direction increases $f$ fastest, and how fast.” $\mathbf{r}'(t)$ says “which direction you’re actually going.” Dot product them to get “how fast $f$ actually changes given your actual motion.”

This is the chain rule that The Gradient lesson leaned on for its perpendicularity proof: if $\mathbf{r}(t)$ stays on a level set, $f(\mathbf{r}(t))$ is constant, so $dz/dt = 0$, which forces $\nabla f \cdot \mathbf{r}'(t) = 0$. Gradient perpendicular to tangent. Proven.
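The formula can be checked numerically on a small case. The sketch below uses a made-up scalar function $f(x, y) = x y$ and a made-up curve $\mathbf{r}(t) = (t, t^2)$; neither comes from the lesson. It compares the gradient-dot-velocity value with a finite-difference estimate of $dz/dt$.

```python
def f(x, y):
    """Hypothetical scalar function, chosen for illustration."""
    return x * y


def grad_f(x, y):
    """Gradient of f, derived by hand: (df/dx, df/dy) = (y, x)."""
    return (y, x)


def r(t):
    """Hypothetical curve in 2D."""
    return (t, t**2)


def r_prime(t):
    """Velocity of the curve, derived by hand."""
    return (1.0, 2.0 * t)


def chain_rule_rate(t):
    """dz/dt as the gradient of f at r(t) dotted with r'(t)."""
    gx, gy = grad_f(*r(t))
    vx, vy = r_prime(t)
    return gx * vx + gy * vy


def finite_difference_rate(t, h=1e-6):
    """dz/dt estimated directly from z(t) = f(r(t))."""
    return (f(*r(t + h)) - f(*r(t - h))) / (2 * h)


t0 = 1.5
analytic = chain_rule_rate(t0)
numeric = finite_difference_rate(t0)
print(analytic, numeric)  # the two values should agree closely
```

Here $z(t) = t \cdot t^2 = t^3$, so both numbers should be close to $3 t_0^2$.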

Move on a circle, watch the height

Let $z = x^2 + y^2$ where $x = \cos t$ and $y = \sin t$. Compute $dz/dt$.

(No decimals. The answer is an integer.)

The chain rule in full generality, as a graph

Compose two functions, a vector-valued $\mathbf{g}$ followed by $f$: $\mathbf{x} \xrightarrow{\mathbf{g}} \mathbf{y} \xrightarrow{f} z$. The chain rule says the Jacobian of the composition is the product of the Jacobians:

$$J_{f \circ \mathbf{g}}(\mathbf{x}) \;=\; J_f(\mathbf{g}(\mathbf{x})) \cdot J_{\mathbf{g}}(\mathbf{x}).$$

Entry-by-entry:

$$\frac{\partial z}{\partial x_j} \;=\; \sum_{i} \frac{\partial z}{\partial y_i}\, \frac{\partial y_i}{\partial x_j}.$$

For each input $x_j$, sum over all the intermediate variables $y_i$ the products of “how $z$ depends on $y_i$” times “how $y_i$ depends on $x_j$.” Sum over paths, multiply along edges.

The widget below makes this literal. Click edges to highlight partials, or press a button to assemble the chain-rule sum for $\partial z / \partial x$ or $\partial z / \partial y$ as a sum over the two paths from input to output.

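[Interactive widget: a computation graph with inputs $x$ and $y$, intermediate nodes $u$ and $v$, and output $z$. The edges carry the partials $\partial u/\partial x$, $\partial u/\partial y$, $\partial v/\partial x$, $\partial v/\partial y$, $\partial z/\partial u$, and $\partial z/\partial v$.]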

The shape of the formula is the shape of the graph. If you’ve ever drawn a computation graph, this is exactly what you were tracing.
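The same graph shape, inputs $x, y$ feeding intermediates $u, v$ feeding output $z$, can be traced in a few lines of code. The sketch below picks hypothetical functions $u = x y$, $v = x + y$, $z = u v$ purely for illustration (they are not from the lesson), writes their Jacobians by hand, and checks that the matrix product and the explicit sum over the two paths give the same partial.

```python
def g(x, y):
    """Hypothetical inner function: (x, y) -> (u, v)."""
    return (x * y, x + y)


def f(u, v):
    """Hypothetical outer function: (u, v) -> z."""
    return u * v


def jacobian_g(x, y):
    """2x2 Jacobian of g: rows are u, v; columns are x, y."""
    return [[y, x],
            [1.0, 1.0]]


def jacobian_f(u, v):
    """1x2 Jacobian of f: the single row [dz/du, dz/dv]."""
    return [[v, u]]


def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    inner = len(B)
    return [[sum(A[i][k] * B[k][j] for k in range(inner))
             for j in range(len(B[0]))]
            for i in range(len(A))]


x0, y0 = 2.0, 3.0
u0, v0 = g(x0, y0)

J_f = jacobian_f(u0, v0)      # evaluated at g(x0, y0), as the formula requires
J_g = jacobian_g(x0, y0)
J_comp = matmul(J_f, J_g)     # 1x2 row: [dz/dx, dz/dy]

# The same dz/dx, written as a sum over the two paths x -> u -> z and x -> v -> z.
dz_dx_paths = J_f[0][0] * J_g[0][0] + J_f[0][1] * J_g[1][0]

print(J_comp, dz_dx_paths)
```

Each term in `dz_dx_paths` is one path through the graph: multiply along its edges, then add the paths together.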

Chain something small

Let $\mathbf{g}(x) = (x,\; x^2)$ and $f(u, v) = u + v$. Compute $\dfrac{d}{dx}(f \circ \mathbf{g})(x)$ at $x = 1$.

Try it two ways: compose first and differentiate, or multiply the Jacobians. Same answer.

Shape hygiene

A tip that will save you weeks of debugging later: when you compute a Jacobian or a chain-rule product, look at the shape first. Does the result have the shape it should?

Rule of thumb: differentiating an $m$-output function with respect to $n$ inputs gives you an $m \times n$ grid. If your answer’s shape doesn’t match, you know you have a bug before checking any numbers.

In ML frameworks, shape errors are among the most common bugs. Track the shapes, always.
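As a small illustration of checking shapes before numbers, the sketch below wraps a plain-Python matrix product in a shape check. The helper names `shape` and `checked_matmul` and the matrix entries are made up for this example; ML frameworks perform their own checks and report them differently.

```python
def shape(matrix):
    """Return (rows, columns) of a list-of-rows matrix."""
    return (len(matrix), len(matrix[0]))


def checked_matmul(A, B):
    """Multiply A by B, refusing if the inner dimensions disagree."""
    a_rows, a_cols = shape(A)
    b_rows, b_cols = shape(B)
    if a_cols != b_rows:
        raise ValueError(
            f"shape mismatch: {a_rows}x{a_cols} times {b_rows}x{b_cols}"
        )
    return [[sum(A[i][k] * B[k][j] for k in range(a_cols))
             for j in range(b_cols)]
            for i in range(a_rows)]


# Hypothetical Jacobians: f maps 3 numbers to 1, g maps 2 numbers to 3.
J_f = [[1.0, 2.0, 3.0]]                         # 1x3
J_g = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]      # 3x2

J_comp = checked_matmul(J_f, J_g)
print(shape(J_comp))  # 1 output, 2 inputs

try:
    checked_matmul(J_g, J_f)  # wrong order: 3x2 times 1x3
except ValueError as err:
    print(err)
```

The composition takes 2 inputs to 1 output, so a $1 \times 2$ result is the only shape that can be right; the reversed product fails before any arithmetic happens.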

Teaser: what this unlocks

Every neural network is a long composition: input → linear layer → nonlinearity → linear layer → … → loss. Dozens of functions in series. The derivative of the loss with respect to any parameter, by the chain rule, is a long product of Jacobians.

But here’s the twist we’ll develop in the next lesson: we never compute the full Jacobians. We multiply them onto a row vector (the gradient of the loss), working from the output backward to the input. That multiplication (row times matrix) is called a vector–Jacobian product, and it is exactly what loss.backward() computes.
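The arithmetic behind that idea can be sketched without any framework. The example below is an illustration only, not how `loss.backward()` is implemented: the Jacobians `J1` and `J2`, the row vector `grad_loss`, and the helpers `vjp` and `matmul` are all made up here. It carries a row vector backward through two layers and compares the result with forming the full Jacobian product first.

```python
def vjp(row, J):
    """Row vector (length m) times Jacobian J (m x n): a row of length n."""
    return [sum(row[i] * J[i][j] for i in range(len(row)))
            for j in range(len(J[0]))]


def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


# Hypothetical Jacobians of two layers, already evaluated at some point.
J1 = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.0]]   # 3x2: layer 1 maps 2 inputs to 3 outputs
J2 = [[1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]     # 2x3: layer 2 maps 3 inputs to 2 outputs
grad_loss = [1.0, -1.0]                      # row vector: d(loss)/d(layer-2 output)

# Backward sweep: multiply the row onto each Jacobian, output side first.
row = grad_loss
for J in (J2, J1):
    row = vjp(row, J)

# Reference: build the full 2x2 product J2 J1, then apply the row once.
reference = vjp(grad_loss, matmul(J2, J1))

print(row, reference)  # both are d(loss)/d(input)
```

The backward sweep only ever holds a row vector, while the reference path has to build a full matrix product first. With layers thousands of units wide, that difference is the reason to go backward.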

Next lesson pulls that thread.
