Single-variable Calculus: Derivatives & the Chain Rule · 12 min

Second derivatives and finding optima

f' tells you which way the curve is heading. f'' tells you whether that heading is itself speeding up or slowing down. Together they find every peak, every valley, and tee up the entire job of training a neural network.


The derivative of the derivative

$f'$ is a function. It has its own slope. The slope of $f'$ is called the second derivative of $f$, written $f''$ or $d^2 y / dx^2$.

$$f''(x) \;=\; \frac{d}{dx}\bigl[f'(x)\bigr].$$

The physical picture is the cleanest. If $f(t)$ is the position of a moving object at time $t$, then $f'(t)$ is its velocity (how fast position is changing), and $f''(t)$ is its acceleration (how fast velocity is changing). Position, velocity, acceleration is exactly function, first derivative, second derivative.
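To make the position, velocity, acceleration picture concrete, here is a minimal numerical sketch. The position function `position(t) = 5 * t**2`, the step size `h`, and the helper names are illustrative assumptions, not part of the lesson. The derivatives are approximated with central differences rather than computed symbolically.

```python
def derivative(f, x, h=1e-3):
    """Approximate f'(x) with a central difference."""
    return (f(x + h) - f(x - h)) / (2 * h)


def second_derivative(f, x, h=1e-3):
    """Approximate f''(x) with a central second difference."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)


def position(t):
    # Hypothetical position (in metres) after t seconds.
    return 5 * t**2


t = 2.0
velocity = derivative(position, t)             # exact derivative: 10 * t
acceleration = second_derivative(position, t)  # exact second derivative: 10
print(f"velocity ~ {velocity:.4f}, acceleration ~ {acceleration:.4f}")
```

The second difference is the first difference applied twice, which mirrors the idea that the second derivative is the slope of the slope.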

You compute $f''$ by differentiating twice. For $f(x) = x^3$:

$$f'(x) = 3x^2, \qquad f''(x) = 6x.$$
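The same "differentiate twice" step can be sketched in code. This example assumes a polynomial stored as a list of coefficients (a representation chosen here for illustration), and `differentiate` is a hypothetical helper that applies the power rule term by term.

```python
def differentiate(coeffs):
    """Differentiate c0 + c1*x + c2*x^2 + ... given as [c0, c1, c2, ...]."""
    # Power rule: the x^k term with coefficient c becomes k * c * x^(k-1).
    return [k * c for k, c in enumerate(coeffs)][1:]


f = [0, 0, 0, 1]                          # x^3
f_prime = differentiate(f)                # coefficients of 3x^2
f_double_prime = differentiate(f_prime)   # coefficients of 6x
print(f_prime, f_double_prime)
```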

One more, twice

With $f(x) = x^3$, what is $f''(2)$?

What f'' tells you that f' doesn't

$f'$ tells you which way the curve is going at $x$: positive slope means up, negative means down. $f''$ tells you which way that direction is bending:

  • $f'' > 0$: the slope is increasing. The curve bends upward. Concave up: shaped like a cup.
  • $f'' < 0$: the slope is decreasing. The curve bends downward. Concave down: shaped like a cap.
  • $f'' = 0$ and changes sign: the bend itself flips. That point is called an inflection point.

For $f(x) = x^2$, $f''(x) = 2 > 0$ everywhere: always concave up, always a cup. For $f(x) = -x^2$, $f''(x) = -2 < 0$ everywhere: always a cap. For $f(x) = x^3$, $f''(x) = 6x$ flips sign at $0$: concave down for $x < 0$, concave up for $x > 0$, with $0$ as the inflection point.
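A small sketch of these rules, assuming you already have $f''$ as a Python function. The helper names and the offset `eps` are hypothetical, and `is_inflection` only checks for a sign change of $f''$ across the point, which presumes no other sign change within `eps`. The $x^4$ case is an extra example added here to show why "changes sign" matters.

```python
def concavity(d2f, x):
    """Classify the bend of f at x from the sign of f''(x)."""
    value = d2f(x)
    if value > 0:
        return "concave up"
    if value < 0:
        return "concave down"
    return "f'' is zero"


def is_inflection(d2f, x, eps=1e-3):
    """True if f'' has opposite signs just left and just right of x."""
    return d2f(x - eps) * d2f(x + eps) < 0


cubic_d2 = lambda x: 6 * x          # f(x) = x^3
quartic_d2 = lambda x: 12 * x**2    # f(x) = x^4 (extra example)

print(concavity(cubic_d2, -1), "/", concavity(cubic_d2, 1))
print(is_inflection(cubic_d2, 0))    # f'' changes sign at 0
print(is_inflection(quartic_d2, 0))  # f''(0) = 0, but no sign change
```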

Critical points: where the slope is zero

A critical point of $f$ is a value $x$ where $f'(x) = 0$ (or where $f'$ does not exist). Geometrically, the points with $f'(x) = 0$ are where the tangent is horizontal. Critical points are the candidates for peaks and valleys.

Drag the marker below. The widget shows $f$ in red, $f'$ in faint teal, and the marker turns green when you land on a critical point. Hunt for all of them on each function.

[Interactive plot of $f$ and $f'$ with a draggable marker. Sample readout: $x = 0.30$, $f(x) = -0.87$, $f'(x) = -2.73$, $f''(x) = 1.80$, not a critical point; critical points found: 0 / 2.]

Drag the marker. Red = f(x), faint teal = f′(x). The marker glows green and the verdict snaps when f′ ≈ 0. Then read f″ to classify.

For $x^3 - 3x$: two critical points at $x = \pm 1$. For $x^4 - 4x^2$: three, at $x = 0, \pm\sqrt{2}$. For $x^3$: just one, at $0$, and the verdict will tell you something interesting.

That last one is the warning. A critical point is not the same as an extremum. $f(x) = x^3$ has $f'(0) = 0$, but $x = 0$ is neither a max nor a min: the curve is just briefly flat as it passes through. You need more than $f' = 0$ to classify what the point actually is.
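The hunt for critical points can also be done numerically. The sketch below is one possible approach, not the widget's method: it scans an interval for sign changes of $f'$ and refines each one by bisection. The function names, the interval $[-3, 3]$, and the grid size are assumptions made for illustration. A scan like this only catches a zero where $f'$ touches zero without crossing (as $3x^2$ does at $0$) if that zero happens to land exactly on a grid point.

```python
def bisect_root(g, lo, hi, tol=1e-12):
    """Find a root of g in [lo, hi], assuming g(lo) and g(hi) differ in sign."""
    g_lo = g(lo)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        g_mid = g(mid)
        if g_mid == 0:
            return mid
        if (g_lo < 0) != (g_mid < 0):
            hi = mid                  # sign change is in the left half
        else:
            lo, g_lo = mid, g_mid     # sign change is in the right half
    return (lo + hi) / 2


def critical_points(df, lo, hi, steps=1000):
    """Scan [lo, hi] for zeros of f' (exact grid hits or sign changes)."""
    xs = [lo + i * (hi - lo) / steps for i in range(steps + 1)]
    roots = []
    for a, b in zip(xs, xs[1:]):
        fa, fb = df(a), df(b)
        if fa == 0:
            roots.append(a)
        elif fa * fb < 0:
            roots.append(bisect_root(df, a, b))
    if df(xs[-1]) == 0:
        roots.append(xs[-1])
    return roots


cubic_df = lambda x: 3 * x**2 - 3        # derivative of x^3 - 3x
quartic_df = lambda x: 4 * x**3 - 8 * x  # derivative of x^4 - 4x^2

cubic_roots = critical_points(cubic_df, -3, 3)
quartic_roots = critical_points(quartic_df, -3, 3)
print([round(r, 6) for r in cubic_roots])
print([round(r, 6) for r in quartic_roots])
```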

Two tests, one job

Given a critical point $c$ (so $f'(c) = 0$), how do you tell whether it is a max, a min, or neither?

The first-derivative test. Look at $f'$ on either side of $c$.

  • $f'$ goes from $+$ to $-$ as $x$ crosses $c$ → local maximum (climbing, then descending).
  • $f'$ goes from $-$ to $+$ → local minimum (descending, then climbing).
  • $f'$ keeps the same sign on both sides → neither (the curve flattens briefly and continues).

The second-derivative test. Compute $f''(c)$.

  • $f''(c) > 0$ → concave up at $c$ → local minimum.
  • $f''(c) < 0$ → concave down at $c$ → local maximum.
  • $f''(c) = 0$ → inconclusive. Fall back to the first-derivative test.

The two agree when they both apply, and the second is usually faster. The widget runs the second-derivative test for you each time you land.
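Both tests fit in a few lines of code. This is a minimal sketch, assuming $f'$ and $f''$ are available as Python functions and that `eps` is small enough that no other critical point lies within `eps` of $c$. All names here are hypothetical, and the $x^4$ case is an extra example beyond the lesson's functions.

```python
def second_derivative_test(d2f, c):
    """Classify a critical point c from the sign of f''(c)."""
    value = d2f(c)
    if value > 0:
        return "local minimum"
    if value < 0:
        return "local maximum"
    return "inconclusive"


def first_derivative_test(df, c, eps=1e-3):
    """Classify a critical point c from the sign of f' on either side."""
    left, right = df(c - eps), df(c + eps)
    if left > 0 and right < 0:
        return "local maximum"
    if left < 0 and right > 0:
        return "local minimum"
    return "neither"


def classify(df, d2f, c):
    """Second-derivative test first; fall back to the first-derivative test."""
    verdict = second_derivative_test(d2f, c)
    if verdict == "inconclusive":
        verdict = first_derivative_test(df, c)
    return verdict


# f(x) = x^4 - 4x^2: f'(x) = 4x^3 - 8x, f''(x) = 12x^2 - 8
df = lambda x: 4 * x**3 - 8 * x
d2f = lambda x: 12 * x**2 - 8
print(classify(df, d2f, 0.0))
print(classify(df, d2f, 2 ** 0.5))

# f(x) = x^3: f''(0) = 0, so the fallback decides.
print(classify(lambda x: 3 * x**2, lambda x: 6 * x, 0.0))

# f(x) = x^4 (extra example): f''(0) = 0, yet 0 is a minimum.
print(classify(lambda x: 4 * x**3, lambda x: 12 * x**2, 0.0))
```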

Classify a critical point

For $f(x) = x^3 - 3x$, you already know that $x = 1$ is a critical point ($f'(1) = 0$).

Compute $f''(1)$. The sign of this number tells you whether $x = 1$ is a local max or a local min; a value of zero would leave the test inconclusive.

Read the verdict

You just computed $f''(1) = 6$ for $f(x) = x^3 - 3x$. What kind of critical point is $x = 1$?

Why any of this matters: minimization is training

Take a single sentence and read it slowly:

Training a neural network is finding the values of its weights that minimize a loss function.

That sentence is a calculus problem. The “loss function” is a function from millions of inputs (the weights) to a single output (a number measuring how badly the network is performing). “Minimize” means: find the critical point where the gradient (the multi-dimensional cousin of $f'$) is zero, and use the multi-dimensional cousin of $f''$ to confirm it is a minimum and not a saddle.
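With a single weight, the whole recipe fits in this lesson's tools. The sketch below is a toy illustration and not how real networks are trained: it assumes a made-up one-weight model $y = w \cdot x$, three invented data points, and a sum-of-squared-errors loss. It solves $L'(w) = 0$ by hand and checks the sign of $L''$.

```python
xs = [1.0, 2.0, 3.0]   # made-up inputs
ys = [2.0, 4.0, 6.0]   # made-up targets (exactly y = 2x)


def loss(w):
    """Sum of squared errors of the one-weight model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys))


def loss_prime(w):
    """dL/dw = sum of 2 * x * (w * x - y)."""
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys))


def loss_double_prime(w):
    """d^2L/dw^2 = sum of 2 * x^2; it does not depend on w."""
    return sum(2 * x * x for x in xs)


# Setting L'(w) = 0 and solving for w gives w = sum(x*y) / sum(x*x).
w_star = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

print("critical point:", w_star)
print("L'(w*):", loss_prime(w_star))
print("L''(w*) > 0:", loss_double_prime(w_star) > 0)
```

Because $L''$ is positive, the second-derivative test says the critical point is a minimum.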

Module 10 will make this picture quantitative. Module 12 will give you the algorithm for computing the gradient through any function you can write down — that is backpropagation. Both of them are this lesson, generalized to many variables.

One module, end to end

You started with secant slopes and ended with the test that classifies critical points of any function. Six lessons. Seven differentiation rules. Two more for products and quotients. One chain rule. And one job — minimizing a function — that this entire module was secretly setting up.

The chain rule, applied across a computational graph, is backpropagation. That is what makes module 12 the keystone of this course. You are now exactly one module away from running it backwards.
