Stack the layers, surely?
Last lesson ended with a plan: a hidden layer warps the space, an output layer finishes with a line. Two layers beat one. So depth is the lever. Add more layers, warp harder, solve more.
A single layer is the linear layer $y = Wx + b$: a matrix multiply plus a bias vector. It is the perceptron’s weighted sum, written for many outputs at once. Stacking two of them looks like the obvious move:

$$h = W_1 x + b_1, \qquad y = W_2 h + b_2 = W_2 (W_1 x + b_1) + b_2$$
It feels like it must be more powerful than one layer. Multiply the algebra out and see.
The algebra collapses
Expand that second expression:

$$y = W_2 (W_1 x + b_1) + b_2 = W_2 W_1 x + (W_2 b_1 + b_2)$$
Look at what you’re left with: a single matrix times $x$, plus a single vector. That is the exact shape of one linear layer $y = Wx + b$, with $W = W_2 W_1$ and $b = W_2 b_1 + b_2$.
Two layers did not build something new. They built a different spelling of one layer. And this telescopes: ten linear layers collapse into one matrix, a hundred collapse into one. Depth alone buys you nothing.
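The telescoping claim can be checked numerically. The sketch below uses plain Python lists to stay dependency-free; the helper names (`matmul`, `matvec`, `vadd`, `forward`, `collapse`) and the three example layers are made up for illustration, not taken from the lesson's widget.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(W, x):
    """Multiply matrix W by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def vadd(u, v):
    """Add two vectors entry by entry."""
    return [a + b for a, b in zip(u, v)]

def forward(layers, x):
    """Apply each linear layer (W, b) in order, with no activation in between."""
    for W, b in layers:
        x = vadd(matvec(W, x), b)
    return x

def collapse(layers):
    """Fold a stack of linear layers into one equivalent (W, b)."""
    W, b = layers[0]
    for W_next, b_next in layers[1:]:
        # W_next (W x + b) + b_next = (W_next W) x + (W_next b + b_next)
        b = vadd(matvec(W_next, b), b_next)
        W = matmul(W_next, W)
    return W, b

# Three hypothetical layers with integer weights, so the comparison is exact.
layers = [
    ([[1, 2], [0, 1]], [1, 0]),
    ([[2, 0], [1, 1]], [0, -1]),
    ([[0, 1], [1, 0]], [3, 3]),
]

W, b = collapse(layers)
x = [4, -5]
stacked = forward(layers, x)      # run all three layers one after another
single = vadd(matvec(W, x), b)    # run the one collapsed layer

print(W, b)             # [[1, 3], [2, 4]] [3, 5]
print(stacked, single)  # [-8, -7] [-8, -7]
```

Whatever input you feed in, the three-layer stack and the single collapsed layer agree, because `collapse` is just the expansion above applied once per extra layer.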
Collapse it yourself
A two-layer linear network (no biases, no activation) has
The network computes $y = W_2 W_1 x$.
What is $y$ for the input ?
Watch the grid refuse to bend
Here is the collapse made visual. The left panel is a square grid. The right panel is that grid after it passes through a stack of linear layers. The red line is a straight “decision boundary” carried along for reference.
Leave ReLU between layers unchecked and push the layer count up to 5. The output is squished, rotated, sheared, but every grid line stays dead straight and the red boundary stays a line. The widget even prints the single matrix all those layers collapsed into. Five layers, one matrix.
Bug, or expected?
A learner builds a 5-layer network of linear layers with no activation functions, trains it on a dataset that needs a curved boundary, and notices the decision boundary is stubbornly a straight line.
Is this a bug, or expected behavior?
Linear layers compose into a single linear layer. A map of that form sends straight lines to straight lines (or, in a degenerate case, squashes a line down to a point), never to curves. No bug: this is the math working exactly as proven above.
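A quick way to convince yourself: take three points on one line, push them through a linear layer, and test whether they are still on one line. This is a minimal sketch; the matrix, bias and points are arbitrary illustrative values, and `affine` and `is_collinear` are hypothetical helper names.

```python
def affine(W, b, p):
    """Apply one linear layer: W p + b."""
    return [sum(w * pi for w, pi in zip(row, p)) + bi for row, bi in zip(W, b)]

def is_collinear(p, q, r):
    """True when 2D points p, q, r lie on one straight line (zero cross product)."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0]) == 0

W, b = [[1, 3], [2, 4]], [3, 5]
points = [[0, 1], [1, 3], [2, 5]]            # three points on the line y = 2x + 1
mapped = [affine(W, b, p) for p in points]

print(mapped)                  # [[6, 9], [13, 19], [20, 29]]
print(is_collinear(*mapped))   # True
```

Since any stack of linear layers is equivalent to one such layer, the same check passes no matter how many layers the learner adds.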
The fix goes between the layers
Now check ReLU between layers and look again. The grid folds. The red boundary kinks. Suddenly depth is doing real work, and the deeper you go the more intricate the warp.
ReLU is the nonlinearity from the XOR fix, $\mathrm{ReLU}(z) = \max(0, z)$, applied to every entry of a layer’s output before it reaches the next layer. The collapse proof above relied on one quiet fact: you can pull matrices through each other because matrix multiplication is associative and distributes over addition. Slip a nonlinear function into the chain and that move is blocked. $W_2\,\mathrm{ReLU}(W_1 x + b_1) + b_2$ cannot be flattened, because $\mathrm{ReLU}$ is not a matrix.
That little function wedged between every pair of layers is called the activation function. It is not decoration. It is the one component that makes depth mean something.
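Here is a tiny case where the difference is easy to see by hand. The weights are chosen for illustration (they are not from the lesson): one input feeds two hidden units with weights 1 and -1, and the output adds the two hidden values. Without an activation the stack collapses to the zero matrix. With ReLU in between, the same weights compute the absolute value, which no single matrix can do.

```python
def matvec(W, x):
    """Multiply matrix W by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    """Apply max(0, z) to every entry of a vector."""
    return [max(0.0, z) for z in v]

W1 = [[1.0], [-1.0]]   # 1 input -> 2 hidden units
W2 = [[1.0, 1.0]]      # 2 hidden units -> 1 output

def linear_stack(x):
    """Two linear layers, no activation: W2 (W1 x) = (1 - 1) x = 0."""
    return matvec(W2, matvec(W1, [x]))[0]

def relu_stack(x):
    """Same weights with ReLU in between: max(0, x) + max(0, -x) = |x|."""
    return matvec(W2, relu(matvec(W1, [x])))[0]

for x in (-2.0, 0.0, 2.0):
    print(x, linear_stack(x), relu_stack(x))
```

A linear map must satisfy f(-x) = -f(x). Here `relu_stack(2.0)` and `relu_stack(-2.0)` both return 2.0, so the network with ReLU is no longer any single linear layer.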
What actually rescues depth?
The 5-layer network collapsed because every layer was linear. What is the smallest change that fixes it?
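The collapse happens whenever every map between layers is linear. What breaks it is inserting a nonlinear function between the layers. A polynomial would technically do, but a stack of polynomials can only ever compute another polynomial of bounded degree, so in practice the function is a non-polynomial one: ReLU, sigmoid, tanh and GELU all qualify. ReLU is a common default, not the only option.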
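One property every linear function has is additivity: f(a + b) = f(a) + f(b). The sketch below tests that property at a single pair of inputs for each activation named above, using only Python's `math` module. Failing at one pair is enough to show a function is not linear. Passing at one pair proves nothing by itself, so the identity function is included only as a reference. The helper name `is_additive` and the test inputs are illustrative choices, and `gelu` is written in its exact erf form.

```python
import math

def identity(z):
    return z

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gelu(z):
    # Exact form: z times the standard normal CDF at z.
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def is_additive(f, a=1.0, b=-2.0, tol=1e-9):
    """Check f(a + b) == f(a) + f(b) at one pair of inputs."""
    return abs(f(a + b) - (f(a) + f(b))) < tol

for name, f in [("identity", identity), ("relu", relu),
                ("sigmoid", sigmoid), ("tanh", math.tanh), ("gelu", gelu)]:
    print(name, is_additive(f))
```

Only the identity passes. Each of the four activations fails, which is exactly what stops the matrices on either side from merging.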
So that's why activations exist
You now know the real reason every neural network diagram has those little nonlinearity boxes between the matrix multiplies. Strip them out and the whole deep stack quietly folds into a single linear layer, no better than logistic regression.
The activation is what keeps depth honest. Which raises a fair question: there are several of these functions on offer, including ReLU, sigmoid, tanh and GELU. Are they interchangeable? Does the choice matter? Next lesson, you meet the zoo and find out why ReLU won.