Find a direction the matrix doesn't rotate.
The widget below has a fixed matrix . Drag the red probe vector around. The coral ghost arrow is .
For most directions, points somewhere different from : the transformation rotates you off your line.
But for a few special directions, stays on the same line as . The widget will cheer when you find one. There are two for this matrix. Find both.
Those directions are called eigenvectors. The stretch factor along each is its eigenvalue, . The widget reports it.
The equation, earned
An eigenvector (nonzero) and its eigenvalue (a scalar) satisfy:
Read that: ” applied to equals a scalar multiple of .” Which is exactly what your hands found: the transformation stretched without rotating it.
can be:
- : gets stretched longer.
- : gets shrunk.
- : gets sent to the origin (the eigenvector lies in the null space).
- : flips through the origin.
Eigenvectors are directions, not specific vectors. If is an eigenvector, so is any nonzero scalar multiple. The line is what matters.
Spot an eigenvector by inspection
For the diagonal matrix , one eigenvector has eigenvalue . Give the -component of the simplest such eigenvector.
(Any nonzero scalar multiple is also an eigenvector. Report the -component of the unit one.)
Only now, the formula
You understand what eigenvectors are. Here’s how to find them mechanically.
We want for some nonzero . Rewrite: . We want a nonzero in the null space of .
A matrix has a nonzero null vector if and only if its determinant is zero (you saw that two lessons ago). So:
This is the characteristic equation. It’s a polynomial in whose roots are exactly the eigenvalues. For a 2×2 matrix it’s a quadratic; for it’s degree-.
You didn’t memorize “solve .” You wanted an eigenvector, noticed that requires a null space, and knew that requires determinant zero. The equation fell out.
Eigenvalues by characteristic equation
Find the eigenvalues of (the same matrix as the widget).
Give the sum of the two eigenvalues.
Same arrow, different address.
Here’s the trick that becomes a hammer.
Drag the basis arrows below. The coral point stays put, but its coordinates in the basis change.
The arrow in space is unchanged. Only the address moves. That’s all change of basis is: a translation dictionary, not a change in meaning.
Why care? Suppose you have a hard matrix that has two linearly independent eigenvectors . Build a coordinate system using those as basis arrows. Let be the matrix of eigenvectors as columns. Then in this eigenbasis, the same transformation looks like:
A diagonal matrix. The weird transformation, in the right basis, is just two independent stretches: one along each eigenvector, by its eigenvalue.
Diagonalization pays for itself
Once you’ve diagonalized , you get things cheap.
Powers of : . And is just , which is trivial. Without diagonalization, would take nine matrix multiplications. With it, it’s one diagonal power and two sandwich multiplications.
Solving systems, simulating dynamics, computing the long-run behavior of iterated transformations (Markov chains, for example) are all tractable in the eigenbasis and miserable otherwise. Diagonalization is the move that transforms a hard problem into a trivial one by changing what coordinates you use.
Not every matrix is diagonalizable. Some are missing independent eigenvectors. There’s a generalization (Jordan form) that handles these, but we won’t need it. The next tool is much more general.
Every matrix is rotate-stretch-rotate.
The singular value decomposition is the universal generalization. Every real matrix , square or not, diagonalizable or not, factors as
where and are rotations (orthogonal matrices) and is diagonal with nonnegative entries , the singular values.
Drive the animation slider below from 0 to 3 to watch the unit circle pass through each stage: (rotate), (axis-aligned stretch), then (rotate again). The circle becomes an ellipse, then a rotated ellipse: the image of .
Drop to zero. The ellipse collapses to a line: that’s the best rank-1 approximation of . Every matrix is rotate-stretch-rotate. That’s the theorem.
Low-rank approximation: the practical payoff
You can form a rank- approximation of by keeping only the largest singular values (and their associated columns of and ) and throwing away the rest:
This is the best rank- approximation to in a precise sense: it minimizes the error over all possible rank- matrices. If your data has a handful of large singular values and a long tail of small ones, you can compress dramatically by keeping only the top handful and lose almost nothing.
That’s PCA, in one sentence. It’s also the engine behind low-rank adapters (LoRA) in ML, latent-semantic indexing, image compression, and anything else labeled “find the underlying structure.” You’ll see SVD again.
Where this shows up in the transformer
The final callback for the module, and we’ve earned it now.
An attention layer does three things:
- Projects token embeddings into query, key, and value spaces: , , . Each projection is matrix multiplication: a linear combination of the columns of a learned weight matrix.
- Computes pairwise dot products between queries and keys, scaled: . A similarity table, exactly the kind of object the dot product was built for.
- Uses a softmax-normalized attention matrix to produce weighted sums of value vectors.
Every step is linear algebra. The transformer is a stack of linear transformations separated by mild nonlinearities, and the mild nonlinearities are there specifically to prevent the whole thing from being equivalent to a single linear transformation (which, by the eigendecomposition reasoning you just did, would be a single rotate-stretch-rotate and therefore woefully underpowered).
You now understand 80% of a transformer. The remaining 20% (learned weights, backprop, scale) is in the coming modules. The linear algebra is yours.
Lesson complete
Nice tinkering.
Before you go