Multivariable Calculus: Partial Derivatives, Gradients, Jacobians · 16 min

Local Linear Models and Saddle Points

Some critical points are bowls. Some are caps. Some are neither (saddles, where the function goes up in one direction and down in another). In high dimensions, saddles are everywhere.


Pick a function shape. Fire a ray. What is this?

$f(x, y) = a\, x^2 + b\, y^2$. The signs of $a$ and $b$ decide what’s happening at the origin. The ray slider sweeps a direction $\theta$; the right panel plots $f$ along that ray.

[Interactive figure. Left panel: the surface f(x, y) = 1.0 x² − 1.0 y², labeled "saddle", with a ray at θ = 30°. Right panel: f along the ray (origin = 0).]

Three shapes worth finding:

  1. Both $a$ and $b$ positive. Every ray you fire shows the function climbing away from zero. Local minimum.
  2. Both negative. Every ray shows it falling. Local maximum.
  3. Opposite signs (one positive, one negative). Some rays climb, some fall. There’s no consistent “down” or “up.” Saddle.
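
If you can’t use the slider, the same experiment can be sketched in a few lines of Python. This is a minimal sketch, not the widget’s actual code: the helper names `along_ray` and `classify` are made up for illustration, and the ray is sampled at a fixed distance `t = 1` from the origin.

```python
import math

def f(x, y, a, b):
    """The lesson's quadratic: f(x, y) = a*x^2 + b*y^2."""
    return a * x**2 + b * y**2

def along_ray(a, b, theta_deg, t):
    """Value of f at distance t from the origin in direction theta (degrees)."""
    theta = math.radians(theta_deg)
    return f(t * math.cos(theta), t * math.sin(theta), a, b)

def classify(a, b):
    """Classify the origin from the signs of a and b."""
    if a > 0 and b > 0:
        return "local minimum"
    if a < 0 and b < 0:
        return "local maximum"
    if a * b < 0:
        return "saddle"
    return "degenerate (a or b is zero)"

# Fire a ray every 30 degrees for the saddle a = 1, b = -1.
for theta_deg in range(0, 180, 30):
    value = along_ray(1.0, -1.0, theta_deg, t=1.0)
    print(f"theta = {theta_deg:3d} deg: f = {value:+.3f}")
print(classify(1.0, -1.0))
```

For the saddle, some printed values are positive and some are negative. Change the coefficients to two positives or two negatives and every ray agrees.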

The first two cases have analogues you’ve seen in 1D (a parabola opens up or opens down). The saddle is new. It has no single-variable counterpart, and it’s the case that matters most for ML.

The 1D recall

Single-variable calculus told you: near a point $a$, a smooth function looks linear. Formally,

$$f(a + h) \;\approx\; f(a) + f'(a)\, h$$

when $h$ is small. The derivative $f'(a)$ is the slope of the line that best matches $f$ at $a$. Zoom in far enough and the curve becomes that line.
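
To see the “zoom in” claim numerically, here is a small sketch. The choice of $f(x) = x^3$ and the point $a = 2$ are arbitrary examples, not part of the lesson.

```python
def f(x):
    """Example function (an arbitrary choice for illustration)."""
    return x**3

def f_prime(x):
    """Its derivative, computed by hand."""
    return 3 * x**2

def linear_model(a, h):
    """The 1D local linear model: f(a) + f'(a) * h."""
    return f(a) + f_prime(a) * h

a = 2.0
for h in (0.1, 0.01, 0.001):
    error = abs(f(a + h) - linear_model(a, h))
    print(f"h = {h}: error = {error:.2e}")
```

Each time $h$ shrinks by a factor of 10, the error shrinks by roughly a factor of 100. The linear model gets better faster than you zoom.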

That’s the local linear model in 1D. Now we generalize.

The tangent plane (2D linearization)

For $f(x, y)$ near a point $(a, b)$:

$$f(x, y) \;\approx\; f(a, b) + f_x(a, b)\,(x - a) + f_y(a, b)\,(y - b).$$

Same shape, two variables instead of one. The right-hand side is a plane, tilted by the partial derivatives and anchored at the height $f(a, b)$. Close enough to $(a, b)$, the surface is nearly indistinguishable from that plane.

Using the gradient notation from the last lesson:

$$f(\mathbf{x}) \;\approx\; f(\mathbf{a}) + \nabla f(\mathbf{a}) \cdot (\mathbf{x} - \mathbf{a}).$$

This is the local linear model. Every step of gradient descent trusts this approximation for one tiny step of size $\epsilon$, then reapproximates at the new point. Unreasonably effective.
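
Here is the local linear model as a short sketch. The function $f(x, y) = x^2 + 3xy$ and the base point $(1, 1)$ are arbitrary examples chosen for illustration, and the gradient is worked out by hand rather than computed automatically.

```python
def f(x, y):
    """Example function (an arbitrary choice for illustration)."""
    return x**2 + 3 * x * y

def grad_f(x, y):
    """Gradient of f, by hand: (f_x, f_y) = (2x + 3y, 3x)."""
    return (2 * x + 3 * y, 3 * x)

def linearize(a, b, x, y):
    """Local linear model of f around (a, b), evaluated at (x, y)."""
    gx, gy = grad_f(a, b)
    return f(a, b) + gx * (x - a) + gy * (y - b)

approx = linearize(1.0, 1.0, 1.1, 0.9)
exact = f(1.1, 0.9)
print(f"linear model: {approx:.4f}")
print(f"exact value:  {exact:.4f}")
```

The two numbers agree to within a few hundredths. Move the target point further from $(1, 1)$ and the gap grows.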

Linearize something

For $f(x, y) = x y^2$, use linearization at $(1, 2)$ to approximate $f(1.03, 1.98)$.

(Two decimal places.)

Critical points

A critical point is a point where the gradient is zero: $\nabla f(\mathbf{p}) = \mathbf{0}$.

In 1D, $f'(a) = 0$ meant “flat spot,” a candidate for a local min or max. In 2D it means the same thing, except “flat” now means flat in every direction simultaneously. No direction is uphill; no direction is downhill.

Three shapes a 2D flat spot can take:

  • A local minimum (bowl).
  • A local maximum (cap).
  • A saddle (up in some directions, down in others).

You found all three in step 1. Now formalize.

The second-derivative test

Compute the Hessian discriminant at the critical point:

$$D \;=\; f_{xx}\, f_{yy} \,-\, f_{xy}^{\,2}$$

Then:

  • $D > 0$ and $f_{xx} > 0$: local minimum (both axes curve up).
  • $D > 0$ and $f_{xx} < 0$: local maximum (both curve down).
  • $D < 0$: saddle (directions disagree).
  • $D = 0$: the test is inconclusive. Probe by hand.

The matrix of second partials $\begin{pmatrix} f_{xx} & f_{xy} \\ f_{yx} & f_{yy} \end{pmatrix}$ is called the Hessian. We’ll meet it as a matrix in Module 7. For now: sign of $D$, sign of $f_{xx}$. That’s the test.
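
The test is mechanical enough to write down as a function. This is a minimal sketch: `second_derivative_test` is a made-up name, and it expects the three second partials already evaluated at a critical point. The example functions are arbitrary and their partials were worked out by hand.

```python
def second_derivative_test(fxx, fyy, fxy):
    """Classify a critical point from its second partials.

    The caller must have checked that the gradient is zero there.
    """
    D = fxx * fyy - fxy**2
    if D > 0 and fxx > 0:
        return "local minimum"
    if D > 0 and fxx < 0:
        return "local maximum"
    if D < 0:
        return "saddle"
    return "inconclusive"

# f = x^2 + x*y + y^2 at the origin: fxx = 2, fyy = 2, fxy = 1.
print(second_derivative_test(2.0, 2.0, 1.0))
# f = x*y at the origin: fxx = 0, fyy = 0, fxy = 1.
print(second_derivative_test(0.0, 0.0, 1.0))
# f = x^4 + y^4 at the origin: every second partial is 0.
print(second_derivative_test(0.0, 0.0, 0.0))
```

The last case shows why $D = 0$ needs a probe by hand: $x^4 + y^4$ really does have a minimum at the origin, but the second partials can’t see it.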

Classify by the numbers

For $f(x, y) = x^2 - 3 y^2$, compute the discriminant $D = f_{xx} f_{yy} - f_{xy}^{\,2}$ at the origin.

The sign of your answer tells you what kind of critical point it is.

Why saddles matter for ML

Intuition from one or two variables says most flat spots are minima or maxima: bowl, cap, occasional saddle. In high dimensions this stops being true. A minimum has to curve up in every direction at once, and a single downward-curving direction turns it into a saddle. In spaces with millions of dimensions (which is where neural-network loss functions live), saddles dominate. True local minima are rare; flat regions where the gradient is nearly zero but the point is actually a saddle are common.
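
A toy experiment makes the counting argument concrete. Take the step-1 function with $n$ variables instead of two, one squared term per variable, and give each coefficient a random sign. The independent coin-flip signs are an assumption made purely for illustration, not a model of real neural-network losses, and `saddle_fraction` is a made-up helper name.

```python
import random

def classify_signs(coeffs):
    """Classify the origin of f = sum(c_i * x_i^2) from the signs of c_i."""
    if all(c > 0 for c in coeffs):
        return "minimum"
    if all(c < 0 for c in coeffs):
        return "maximum"
    return "saddle"

def saddle_fraction(n, trials=10_000, seed=0):
    """Fraction of random-sign quadratics in n variables that are saddles.

    Assumption: each coefficient is +1 or -1 with equal probability,
    independently. Real loss surfaces are not this simple.
    """
    rng = random.Random(seed)
    saddles = 0
    for _ in range(trials):
        coeffs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        if classify_signs(coeffs) == "saddle":
            saddles += 1
    return saddles / trials

for n in (1, 2, 5, 10, 20):
    print(f"n = {n:2d}: saddle fraction = {saddle_fraction(n):.4f}")
```

Under this assumption the chance that all $n$ signs agree is $2^{1-n}$, so the saddle fraction climbs toward 1 quickly as $n$ grows.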

This is why plain gradient descent can feel slow: often it’s not stuck at a real minimum, it’s crawling through a high-dimensional saddle where the gradient is small in most directions and only a tiny subspace leads out. Module 10 covers the tricks for dealing with this: momentum, Adam, learning-rate schedules. Among other things, each of them helps the optimizer keep moving through flat saddle regions.
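
You can watch the crawl happen on the two-variable saddle from step 1. This is a rough sketch under stated assumptions: the function is $f(x, y) = x^2 - y^2$, the start point $(1, 10^{-6})$ sits almost exactly on the uphill axis, and the learning rate 0.1 and the 80 steps are arbitrary choices. This toy function has no minimum at all, so “escaping” here just means sliding off along $y$.

```python
def grad(x, y):
    """Gradient of f(x, y) = x^2 - y^2."""
    return (2 * x, -2 * y)

def descend(x, y, lr=0.1, steps=80):
    """Plain gradient descent; records (x, y, gradient norm) before each step."""
    history = []
    for _ in range(steps):
        gx, gy = grad(x, y)
        history.append((x, y, (gx**2 + gy**2) ** 0.5))
        x, y = x - lr * gx, y - lr * gy
    return history

history = descend(1.0, 1e-6)
for step in range(0, 80, 10):
    x, y, norm = history[step]
    print(f"step {step:2d}: x = {x:.2e}, y = {y:.2e}, |grad| = {norm:.2e}")
```

The gradient norm starts at 2, drops to about a thousandth of that as the point slides toward the origin, and only then grows again once the tiny $y$ component has been amplified enough to take over. For dozens of steps in the middle, it looks like convergence.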

For now the single takeaway: $\nabla f = \mathbf{0}$ doesn’t mean “you won.” It means “stop and look more carefully.”
