Neural Network Fundamentals · 22 min

Forward pass, end to end

Assemble linear layers and activations into a full MLP, run it from input to output by hand, count its parameters, and take the universal approximation theorem exactly as seriously as it deserves.


The whole machine

You have every part now. A linear layer $Wx + b$ does a matmul and a shift. An activation $\phi$ bends the result so depth means something. A multilayer perceptron is just those two, alternating:

$$\hat{y} = \phi\bigl(W^{(L)} \cdots \phi\bigl(W^{(2)} \phi\bigl(W^{(1)}x + b^{(1)}\bigr) + b^{(2)}\bigr) \cdots + b^{(L)}\bigr).$$

That nested expression is intimidating only because it is written inside-out. Read it as a pipeline instead: input goes in, each layer does matmul, add bias, apply activation, and hands its output to the next. Running that pipeline once, input to prediction, is the forward pass.
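The pipeline reading can be written as a loop. This is a minimal NumPy sketch, not the lesson's own implementation: the names `mlp_forward` and `layers` and the 2-3-1 weights are hypothetical, chosen only for illustration. It applies the activation after every layer, including the last, exactly as the nested formula does.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, layers, phi=sigmoid):
    """Run x through a list of (W, b) pairs: matmul, add bias, apply phi."""
    h = x
    for W, b in layers:
        h = phi(W @ h + b)  # one layer of the pipeline
    return h

# Hypothetical weights for a 2-3-1 network (illustration only).
layers = [
    (np.array([[0.5, -0.5], [1.0, 0.0], [0.0, 1.0]]), np.zeros(3)),  # 2 -> 3
    (np.array([[1.0, -1.0, 0.5]]), np.zeros(1)),                     # 3 -> 1
]

x = np.array([1.0, 1.0])
y_hat = mlp_forward(x, layers)
print(y_hat)
```

The loop and the nested expression compute the same thing; the loop just evaluates it from the innermost parenthesis outward.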

Let’s run a real one.

Trace a forward pass

Here is a concrete 2-3-1 network: two inputs, a hidden layer of three sigmoid units, one linear output. The weights are fixed. Drag the input point and watch every intermediate vector update.

[Interactive widget: a draggable input point on axes running from -2 to 2, with a four-stage trace. Shown at stage 1 of 4, "the two numbers you feed in".]

input x = (1.000, 1.000)
z⁽¹⁾ = W⁽¹⁾x + b⁽¹⁾ = (0.300, 0.600, 0.100)
h⁽¹⁾ = σ(z⁽¹⁾) = (0.574, 0.646, 0.525)
ŷ = W⁽²⁾h⁽¹⁾ + b⁽²⁾ = 0.191

Step through the stages. The input is two numbers. The hidden pre-activation $z^{(1)}$ is one matmul plus a bias, three numbers out. The hidden activation $h^{(1)}$ is sigmoid applied to each. The output is a second matmul. That is the entire computation, and there is nothing in it but arithmetic.
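The sigmoid stage is easy to check by hand. The sketch below starts from the pre-activation the widget displays at x = (1, 1), which is (0.300, 0.600, 0.100), and applies sigmoid elementwise. The widget's weight matrices are not given in the text, so the matmul stages are not reproduced here.

```python
import numpy as np

# Hidden pre-activation shown by the widget for x = (1, 1).
z1 = np.array([0.3, 0.6, 0.1])

# Sigmoid, applied to each entry independently.
h1 = 1.0 / (1.0 + np.exp(-z1))

print(np.round(h1, 3))  # [0.574 0.646 0.525]
```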

Compute the output

Put the widget’s input at $x = (1, 1)$ and walk it all the way through to the output stage.

What value does the network output, $\hat{y}$, to two decimals?

Counting the cost

A network’s size is its parameter count, the total number of learnable scalars. Every layer-to-layer connection contributes a weight matrix and a bias vector. A layer mapping $n$ inputs to $m$ outputs holds

$$\#\text{params} = \underbrace{n \cdot m}_{\text{weights}} + \underbrace{m}_{\text{biases}}.$$

The widget below lets you resize an MLP’s layers and watch the total move. Drag the layer widths, add layers, and try the MNIST net preset.

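[Interactive widget: a resizable MLP with a per-layer parameter bar chart. Default state:]

input 4 → hidden 7 → hidden 5 → output 3
4→7: 35 parameters
7→5: 40 parameters
5→3: 18 parameters
Total: 93 learnable parameters (372 B at fp32)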

Each gap costs in · out + out; the + out is the bias.

Notice the bar chart: the parameters are not spread evenly. They pile up in the widest matmul. Make the first hidden layer huge and the count is almost entirely the two matrices that touch it, the one feeding in and the one reading out.
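The per-gap rule is short enough to write down directly. This is a small sketch (the helper name `count_params` is hypothetical) that applies in · out + out to each pair of adjacent widths and reproduces the widget's default 4-7-5-3 network.

```python
def count_params(widths):
    """Return (per-layer counts, total) for an MLP with a bias on every layer."""
    per_layer = [n * m + m for n, m in zip(widths, widths[1:])]
    return per_layer, sum(per_layer)

per_layer, total = count_params([4, 7, 5, 3])
print(per_layer, total)            # [35, 40, 18] 93
print(total * 4, "bytes at fp32")  # 372 bytes at fp32
```

Each fp32 parameter takes 4 bytes, which is where the 372 B figure comes from.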

Size a real network

Consider an MLP with layer sizes $784 \to 128 \to 64 \to 10$, every layer with a bias.

How many learnable parameters does it have in total?

Where the parameters live

That 784-128-64-10 network has about 109,000 parameters, and roughly 100,000 of them, over 90%, sit in the very first matrix. The reason is simple: $784 \times 128$ is by far the largest product of adjacent widths.
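Worked layer by layer with the in · out + out rule, the arithmetic behind those figures is:

$$
\begin{aligned}
784 \to 128:&\quad 784 \cdot 128 + 128 = 100{,}480 \\
128 \to 64:&\quad 128 \cdot 64 + 64 = 8{,}256 \\
64 \to 10:&\quad 64 \cdot 10 + 10 = 650 \\
\text{total}:&\quad 100{,}480 + 8{,}256 + 650 = 109{,}386
\end{aligned}
$$

The first layer's share is $100{,}480 / 109{,}386 \approx 0.92$.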

Hold onto this fact. It returns in Module 16, where you’ll see a transformer’s feed-forward block deliberately expand to four times its width, $d \to 4d \to d$. That 4× expansion is exactly why most of a transformer’s parameters live in its MLP blocks. Parameter budgeting is not bookkeeping; it is where the model’s capacity is.
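Applying the same counting rule to the $d \to 4d \to d$ block, and assuming both linear layers carry a bias as in this lesson, gives:

$$
\underbrace{(d \cdot 4d + 4d)}_{d \to 4d} + \underbrace{(4d \cdot d + d)}_{4d \to d} = 8d^2 + 5d.
$$

For an illustrative width of $d = 512$ (a value picked here for the example, not taken from the lesson), that is $8 \cdot 512^2 + 5 \cdot 512 = 2{,}099{,}712$ parameters in a single block.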

Universal approximation, taken at exactly face value

There is a famous theorem about MLPs, and it is famous partly for being oversold. The universal approximation theorem says: an MLP with a single hidden layer, wide enough, with a non-polynomial activation, can approximate any continuous function on a closed bounded region, to any accuracy you like.

That sounds like “MLPs can do anything.” It is worth knowing precisely what it does not say.

It is an existence theorem. It promises a network with the right weights exists. It gives you no method to find those weights, and no promise that gradient descent will. It says nothing about how wide the hidden layer must be, and for most useful functions that width grows explosively with the input dimension. And “continuous on a bounded region” is a real fine-print clause.

The theorem is true. It is also much weaker than its name.
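Written out, one common form of the statement (assuming a continuous activation and a linear output layer) reads as follows. Let $\phi : \mathbb{R} \to \mathbb{R}$ be continuous and not a polynomial. Then for every compact set $K \subset \mathbb{R}^n$, every continuous $f : K \to \mathbb{R}$, and every $\varepsilon > 0$, there exist a width $N$ and parameters $w_i \in \mathbb{R}^n$, $b_i, c_i \in \mathbb{R}$ such that

$$
\sup_{x \in K} \Bigl| f(x) - \sum_{i=1}^{N} c_i \, \phi\bigl(w_i^\top x + b_i\bigr) \Bigr| < \varepsilon.
$$

Every limitation above is visible in the quantifiers: $N$ and the weights merely "exist", nothing bounds $N$, and the guarantee holds only on $K$.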

What the theorem oversells

“Universal approximation proves a one-hidden-layer MLP can represent any continuous function to arbitrary precision.” Select every claim this oversells (more than one is correct).

You've built the transformer's MLP

Step back and see what you have. You built a neuron, hit the wall of XOR, learned why depth needs a nonlinearity between its layers, met the four activations, and ran a full network forward by hand. That is the multilayer perceptron, complete.

Here is the payoff. A transformer’s feed-forward block, one of the two halves of every transformer layer, is a 2-layer MLP: a linear layer, a GELU, a linear layer. Same matmul, same bias, same nonlinearity you just used. You are not “preparing” to understand the transformer’s MLP. You have already built it.
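As a rough sketch of that block in NumPy: a linear layer, a GELU, a linear layer. The width `d = 8`, the random weights, and the function name `feed_forward` are hypothetical, and the GELU here is the widely used tanh approximation rather than the exact erf-based definition.

```python
import numpy as np

def gelu(x):
    """GELU via the common tanh approximation (an assumption of this sketch)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, W1, b1, W2, b2):
    """Two-layer MLP: expand d -> 4d, apply GELU, project 4d -> d."""
    return W2 @ gelu(W1 @ x + b1) + b2

d = 8  # hypothetical model width, for illustration only
rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.1, size=(4 * d, d)), np.zeros(4 * d)
W2, b2 = rng.normal(scale=0.1, size=(d, 4 * d)), np.zeros(d)

x = rng.normal(size=d)
y = feed_forward(x, W1, b1, W2, b2)
print(y.shape)  # (8,)

# Parameter count matches the in * out + out rule applied to both layers.
n_params = W1.size + b1.size + W2.size + b2.size
```

Note that the output layer is linear: the activation sits between the two matmuls, not after the second.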

What you have not built is the part where the network finds its own weights, instead of you placing them by hand. That is Module 12: backpropagation. The forward pass you just traced is the structure. Next, you walk it backwards.
