Linear Algebra · 22 min

Eigenvectors, Change of Basis, and a Glimpse of SVD

Every transformation has special directions it stretches without rotating. Working in those directions collapses complicated matrices to pure scalings. SVD extends the idea to every matrix there is.

0 / 0

Find a direction the matrix doesn't rotate.

The widget below has a fixed matrix AA. Drag the red probe vector v\mathbf{v} around. The coral ghost arrow is AvA\mathbf{v}.

A = [[2, 1], [0, 3]] · drag v

find a direction where Av stays on the same line as v

-3-2-1123-2-112
v
(1.40, 0.70)
Av
(3.50, 2.10)
eigenvector found
λ = 2.61

For most directions, AvA\mathbf{v} points somewhere different from v\mathbf{v}: the transformation rotates you off your line.

But for a few special directions, AvA\mathbf{v} stays on the same line as v\mathbf{v}. The widget will cheer when you find one. There are two for this matrix. Find both.

Those directions are called eigenvectors. The stretch factor along each is its eigenvalue, λ\lambda. The widget reports it.

The equation, earned

An eigenvector v\mathbf{v} (nonzero) and its eigenvalue λ\lambda (a scalar) satisfy:

Av  =  λv.A\mathbf{v} \;=\; \lambda\, \mathbf{v}.

Read that: ”AA applied to v\mathbf{v} equals a scalar multiple of v\mathbf{v}.” Which is exactly what your hands found: the transformation stretched v\mathbf{v} without rotating it.

λ\lambda can be:

  • λ>1\lambda > 1: v\mathbf{v} gets stretched longer.
  • 0<λ<10 < \lambda < 1: v\mathbf{v} gets shrunk.
  • λ=0\lambda = 0: v\mathbf{v} gets sent to the origin (the eigenvector lies in the null space).
  • λ<0\lambda < 0: v\mathbf{v} flips through the origin.

Eigenvectors are directions, not specific vectors. If v\mathbf{v} is an eigenvector, so is any nonzero scalar multiple. The line is what matters.

Spot an eigenvector by inspection

For the diagonal matrix A=[3002]A = \begin{bmatrix}3 & 0 \\ 0 & 2\end{bmatrix}, one eigenvector has eigenvalue λ=3\lambda = 3. Give the xx-component of the simplest such eigenvector.

(Any nonzero scalar multiple is also an eigenvector. Report the xx-component of the unit one.)

Only now, the formula

You understand what eigenvectors are. Here’s how to find them mechanically.

We want Av=λvA\mathbf{v} = \lambda\mathbf{v} for some nonzero v\mathbf{v}. Rewrite: (AλI)v=0(A - \lambda I)\mathbf{v} = \mathbf{0}. We want a nonzero v\mathbf{v} in the null space of AλIA - \lambda I.

A matrix has a nonzero null vector if and only if its determinant is zero (you saw that two lessons ago). So:

det(AλI)  =  0.\det(A - \lambda I) \;=\; 0.

This is the characteristic equation. It’s a polynomial in λ\lambda whose roots are exactly the eigenvalues. For a 2×2 matrix it’s a quadratic; for n×nn \times n it’s degree-nn.

You didn’t memorize “solve det(AλI)=0\det(A - \lambda I) = 0.” You wanted an eigenvector, noticed that requires a null space, and knew that requires determinant zero. The equation fell out.

Eigenvalues by characteristic equation

Find the eigenvalues of A=[2103]A = \begin{bmatrix}2 & 1 \\ 0 & 3\end{bmatrix} (the same matrix as the widget).

Give the sum of the two eigenvalues.

Same arrow, different address.

Here’s the trick that becomes a hammer.

Drag the basis arrows v1,v2\mathbf{v}_1, \mathbf{v}_2 below. The coral point PP stays put, but its coordinates in the basis {v1,v2}\{\mathbf{v}_1, \mathbf{v}_2\} change.

a fixed point, two coordinate systems

drag the basis arrows · drag P · same arrow, two addresses

-3-2-1123-2-112
P (standard) (2.00, 1.00)
P in basis {v₁, v₂} (2.00, 1.00)

P never moved. Only the address changed.

The arrow in space is unchanged. Only the address moves. That’s all change of basis is: a translation dictionary, not a change in meaning.

Why care? Suppose you have a hard matrix AA that has two linearly independent eigenvectors v1,v2\mathbf{v}_1, \mathbf{v}_2. Build a coordinate system using those as basis arrows. Let PP be the matrix of eigenvectors as columns. Then in this eigenbasis, the same transformation AA looks like:

P1AP  =  [λ100λ2]  =  D.P^{-1} A P \;=\; \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} \;=\; D.

A diagonal matrix. The weird transformation, in the right basis, is just two independent stretches: one along each eigenvector, by its eigenvalue.

Diagonalization pays for itself

Once you’ve diagonalized A=PDP1A = P D P^{-1}, you get things cheap.

Powers of AA: Ak=PDkP1A^k = P D^k P^{-1}. And DkD^k is just diag(λ1k,λ2k)\mathrm{diag}(\lambda_1^k, \lambda_2^k), which is trivial. Without diagonalization, A10A^{10} would take nine matrix multiplications. With it, it’s one diagonal power and two sandwich multiplications.

Solving systems, simulating dynamics, computing the long-run behavior of iterated transformations (Markov chains, for example) are all tractable in the eigenbasis and miserable otherwise. Diagonalization is the move that transforms a hard problem into a trivial one by changing what coordinates you use.

Not every matrix is diagonalizable. Some are missing independent eigenvectors. There’s a generalization (Jordan form) that handles these, but we won’t need it. The next tool is much more general.

Every matrix is rotate-stretch-rotate.

The singular value decomposition is the universal generalization. Every real matrix AA, square or not, diagonalizable or not, factors as

A  =  UΣVA \;=\; U \Sigma V^\top

where UU and VV are rotations (orthogonal matrices) and Σ\Sigma is diagonal with nonnegative entries σ1σ20\sigma_1 \ge \sigma_2 \ge \cdots \ge 0, the singular values.

Drive the animation slider below from 0 to 3 to watch the unit circle pass through each stage: VV^\top (rotate), Σ\Sigma (axis-aligned stretch), then UU (rotate again). The circle becomes an ellipse, then a rotated ellipse: the image of AA.

unit circle → A = UΣV

stage 3: U rotation
-3-2-1123-2-112

Push σ₂ to zero. The ellipse collapses to a line, which is the best rank-1 approximation of A. Every matrix factors this way: rotate, stretch, rotate.

Drop σ2\sigma_2 to zero. The ellipse collapses to a line: that’s the best rank-1 approximation of AA. Every matrix is rotate-stretch-rotate. That’s the theorem.

Low-rank approximation: the practical payoff

You can form a rank-kk approximation of AA by keeping only the kk largest singular values (and their associated columns of UU and VV) and throwing away the rest:

Ak  =  i=1kσiuivi.A_k \;=\; \sum_{i=1}^{k} \sigma_i \mathbf{u}_i \mathbf{v}_i^\top.

This is the best rank-kk approximation to AA in a precise sense: it minimizes the error over all possible rank-kk matrices. If your data has a handful of large singular values and a long tail of small ones, you can compress dramatically by keeping only the top handful and lose almost nothing.

That’s PCA, in one sentence. It’s also the engine behind low-rank adapters (LoRA) in ML, latent-semantic indexing, image compression, and anything else labeled “find the underlying structure.” You’ll see SVD again.

Where this shows up in the transformer

The final callback for the module, and we’ve earned it now.

An attention layer does three things:

  1. Projects token embeddings into query, key, and value spaces: Q=XWqQ = X W_q, K=XWkK = X W_k, V=XWvV = X W_v. Each projection is matrix multiplication: a linear combination of the columns of a learned weight matrix.
  2. Computes pairwise dot products between queries and keys, scaled: QK/dkQ K^\top / \sqrt{d_k}. A similarity table, exactly the kind of object the dot product was built for.
  3. Uses a softmax-normalized attention matrix to produce weighted sums of value vectors.

Every step is linear algebra. The transformer is a stack of linear transformations separated by mild nonlinearities, and the mild nonlinearities are there specifically to prevent the whole thing from being equivalent to a single linear transformation (which, by the eigendecomposition reasoning you just did, would be a single rotate-stretch-rotate and therefore woefully underpowered).

You now understand 80% of a transformer. The remaining 20% (learned weights, backprop, scale) is in the coming modules. The linear algebra is yours.

Lesson complete

Nice tinkering.