Neural Network Fundamentals · 12 min

Why a stack of linear layers is just one linear layer

Without a nonlinearity between them, ten layers do exactly what one layer does. The algebra is short and brutal, and it is the reason activation functions exist.


Stack the layers, surely?

Last lesson ended with a plan: a hidden layer warps the space, an output layer finishes with a line. Two layers beat one. So depth is the lever. Add more layers, warp harder, solve more.

A single layer is the linear layer $y = Wx + b$: a matrix multiply plus a bias vector. It is the perceptron’s weighted sum, written for many outputs at once. Stacking two of them looks like the obvious move:

$$x \;\longrightarrow\; W^{(1)}x + b^{(1)} \;\longrightarrow\; W^{(2)}\bigl(W^{(1)}x + b^{(1)}\bigr) + b^{(2)}.$$

It feels like it must be more powerful than one layer. Multiply the algebra out and see.

The algebra collapses

Expand the two-layer expression:

$$W^{(2)}\bigl(W^{(1)}x + b^{(1)}\bigr) + b^{(2)} \;=\; \underbrace{\bigl(W^{(2)}W^{(1)}\bigr)}_{\text{one matrix}} x \;+\; \underbrace{\bigl(W^{(2)}b^{(1)} + b^{(2)}\bigr)}_{\text{one vector}}.$$

Look at what you’re left with: a single matrix times $x$, plus a single vector. That is the exact shape of one linear layer $y = Wx + b$, with $W = W^{(2)}W^{(1)}$ and $b = W^{(2)}b^{(1)} + b^{(2)}$.

Two layers did not build something new. They built a different spelling of one layer. And this telescopes: ten linear layers collapse into one matrix, a hundred collapse into one. Depth alone buys you nothing.
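You can check the collapse numerically. The following is a minimal NumPy sketch, not part of the lesson's widget: the layer sizes (3 inputs, 4 hidden units, 2 outputs), the random weights, and the names `two_layers` and `one_layer` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 3 inputs -> 4 hidden units -> 2 outputs.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

def two_layers(x):
    """Two stacked linear layers with no activation in between."""
    return W2 @ (W1 @ x + b1) + b2

# The collapsed layer from the algebra: W = W2 W1, b = W2 b1 + b2.
W = W2 @ W1
b = W2 @ b1 + b2

def one_layer(x):
    """A single linear layer built from the collapsed weights."""
    return W @ x + b

x = rng.normal(size=3)
print(W.shape)                                   # (2, 3): one matrix, input to output
print(np.allclose(two_layers(x), one_layer(x)))  # True
```

The hidden width of 4 leaves no trace in the result. The collapsed `W` maps 3 inputs straight to 2 outputs.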

Collapse it yourself

A two-layer linear network (no biases, no activation) has

$$W^{(1)} = \begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix}, \qquad W^{(2)} = \begin{pmatrix} 1 & 1 \end{pmatrix}.$$

The network computes $f(x) = W^{(2)}\bigl(W^{(1)}x\bigr)$.

What is $f(x)$ for the input $x = (1, 1)$?

Watch the grid refuse to bend

Here is the collapse made visual. The left panel is a square grid. The right panel is that grid after it passes through a stack of linear layers. The red line is a straight “decision boundary” carried along for reference.

[Interactive figure: left panel "input", a square grid; right panel "after 3 layers", the same grid after the linear stack. Layer count control set to 3.]

The grid stays a grid. Stack as many linear layers as you like; the output is still one linear warp. The red boundary stays dead straight. Those 3 layers collapse into a single matrix:

W = [ 1.01 0.20 ; 0.51 1.02 ]

Leave the "ReLU between layers" box unchecked and push the layer count up to 5. The output is squished, rotated, sheared, but every grid line stays dead straight and the red boundary stays a line. The widget even prints the single matrix all those layers collapsed into. Five layers, one matrix.
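The same experiment can be sketched outside the widget. This is a rough NumPy imitation under stated assumptions: five random bias-free 2×2 layers stand in for whatever weights the widget actually uses, and `forward`, `collapsed`, and the sample points are hypothetical names and values chosen for illustration.

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(1)

# Assumption: five random 2x2 linear layers, no biases, no activation.
layers = [rng.normal(size=(2, 2)) for _ in range(5)]

def forward(x):
    """Pass a 2-D point through every layer in order."""
    for W in layers:
        x = W @ x
    return x

# Later layers multiply on the left: W5 @ W4 @ W3 @ W2 @ W1.
collapsed = reduce(lambda acc, W: W @ acc, layers, np.eye(2))

# Two endpoints of a horizontal grid line, and the point halfway between.
p, q = np.array([-2.0, 1.0]), np.array([2.0, 1.0])
mid = 0.5 * (p + q)

# A straight line stays straight: the image of the midpoint is the
# midpoint of the images.
stays_straight = np.allclose(forward(mid), 0.5 * (forward(p) + forward(q)))

# The whole stack agrees with the single collapsed matrix.
matches = np.allclose(forward(p), collapsed @ p)

print(stays_straight, matches)  # True True
```

The midpoint check is just linearity in disguise, which is why no choice of weights can bend the grid.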

Bug, or expected?

A learner builds a 5-layer network of linear layers with no activation functions, trains it on a dataset that needs a curved boundary, and notices the decision boundary is stubbornly a straight line.

Is this a bug, or expected behavior?

The fix goes between the layers

Now check the "ReLU between layers" box and look again. The grid folds. The red boundary kinks. Suddenly depth is doing real work, and the deeper you go the more intricate the warp.

ReLU is the nonlinearity from the XOR fix, $\max(0, z)$, applied to every entry of a layer’s output before it reaches the next layer. The collapse proof above relied on one quiet fact: you can pull matrices through each other because matrix multiplication is associative and distributes over addition. Slip a nonlinear function into the chain and that move is blocked. $W^{(2)}\,\phi\bigl(W^{(1)}x\bigr)$ cannot be flattened, because $\phi$ is not a matrix.

That little function wedged between every pair of layers is called the activation function. It is not decoration. It is the one component that makes depth mean something.
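To see the blockage concretely, here is a small NumPy sketch that reuses the bias-free matrices from the earlier exercise, $W^{(1)} = \operatorname{diag}(2, 3)$ and $W^{(2)} = (1 \;\; 1)$. The function names and the test input $x = (1, -1)$ are assumptions picked for illustration. Any bias-free linear map must satisfy $f(-x) = -f(x)$, so a single counterexample is enough to show the ReLU version is no longer one matrix.

```python
import numpy as np

W1 = np.array([[2.0, 0.0],
               [0.0, 3.0]])
W2 = np.array([[1.0, 1.0]])

def relu(z):
    """Elementwise max(0, z)."""
    return np.maximum(0.0, z)

def linear_net(x):
    """Two linear layers, nothing in between."""
    return W2 @ (W1 @ x)

def relu_net(x):
    """The same two layers with ReLU wedged between them."""
    return W2 @ relu(W1 @ x)

x = np.array([1.0, -1.0])

print(linear_net(x), linear_net(-x))  # [-1.] [1.]  -> f(-x) == -f(x)
print(relu_net(x), relu_net(-x))      # [2.] [3.]   -> f(-x) != -f(x)
```

With ReLU in place, flipping the sign of the input no longer flips the sign of the output, so no single matrix can reproduce `relu_net`.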

What actually rescues depth?

The 5-layer network collapsed because every layer was linear. What is the smallest change that fixes it?

So that's why activations exist

You now know the real reason every neural network diagram has those little nonlinearity boxes between the matrix multiplies. Strip them out and the whole deep stack quietly folds into a single linear layer, no better than logistic regression.

The activation is what keeps depth honest. Which raises a fair question: there are several of these functions on offer, including ReLU, sigmoid, tanh, and GELU. Are they interchangeable? Does the choice matter? Next lesson, you meet the zoo and find out why ReLU won.
