Trigonometry: compact · 24 min

Polar coordinates and the position fingerprint

Polar coordinates are the unit circle scaled up. Inverse trig works only once you restrict the domain. Then the payoff. Tag a position with sines and cosines at many frequencies and every position gets a unique fingerprint, which is exactly a transformer's positional encoding.


A second way to name a point

Cartesian coordinates locate a point by “go right $x$, go up $y$.” There is another way, and it is the unit circle scaled up.

Polar coordinates name a point by $(r, \theta)$: how far it is from the origin, and at what angle. To convert to the familiar $x$ and $y$, you walk distance $r$ in the direction $\theta$:

$$x = r\cos\theta, \qquad y = r\sin\theta$$

That is just the unit-circle definition, stretched by $r$. Cosine and sine point you in a direction; $r$ says how far to go.
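As a minimal sketch of the forward conversion, here it is in Python. The function name `polar_to_cartesian` is a hypothetical helper chosen for illustration, and the angle is assumed to be in radians.

```python
import math

def polar_to_cartesian(r, theta):
    """Walk distance r from the origin in direction theta (radians)."""
    return r * math.cos(theta), r * math.sin(theta)

# Example point: r = 2 at an angle of pi/3 (60 degrees).
x, y = polar_to_cartesian(2.0, math.pi / 3)
print(round(x, 3), round(y, 3))  # 1.0 1.732
```

Setting `r = 1` gives back the plain unit-circle definition of cosine and sine.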

[Interactive figure: a polar curve (cardioid shown) drawn on the left, and the same $r(\theta)$ plotted as a wave on the right. Readout at $\theta = 0.000$ rad: $r = 2.000$, $x = r\cos\theta = 2.000$, $y = r\sin\theta = 0.000$. Note: the same function is a loop on the left and a wave on the right; $x = r\cos\theta$, $y = r\sin\theta$ is the bridge.]

Pick a curve and scrub the angle dial. The left panel draws the path in polar; the right panel plots the same $r(\theta)$ as an ordinary wave. One motion, two pictures.

Converting both directions

Forward, polar to Cartesian, is the pair above. Backward, Cartesian to polar, uses Pythagoras and an angle:

$$r = \sqrt{x^2 + y^2}, \qquad \theta = \operatorname{atan2}(y, x)$$

The radius is easy. The angle needs care, and that is the next idea: getting an angle back out of a coordinate is what inverse trig is for, and it has a catch.
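The backward conversion can be sketched the same way. The helper name `cartesian_to_polar` is hypothetical; `math.hypot` and `math.atan2` are standard-library functions, and the returned angle is in radians.

```python
import math

def cartesian_to_polar(x, y):
    """Return (r, theta), with theta measured from the positive x-axis."""
    r = math.hypot(x, y)       # sqrt(x**2 + y**2)
    theta = math.atan2(y, x)   # argument order is y first, then x
    return r, theta

# A point in the second quadrant.
r, theta = cartesian_to_polar(-3.0, 3.0)
print(round(r, 3), round(math.degrees(theta), 1))  # 4.243 135.0
```

Feeding the result back through $x = r\cos\theta$, $y = r\sin\theta$ recovers the original point, up to floating-point rounding.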

Polar to Cartesian

Convert the polar point $(r, \theta) = (5, \pi/6)$ to Cartesian. Recall $\cos(\pi/6) = \sqrt{3}/2 \approx 0.866$.

What is the $x$-coordinate?

Inverse trig needs a restricted domain

You want to go from a sine value back to the angle. Define $\arcsin$: “the angle whose sine is this.”

But there is a problem. $\sin\theta = \tfrac12$ is true at $\theta = \pi/6$, and again at $5\pi/6$, and again every full turn after each of those. Infinitely many angles share a sine. So “the angle whose sine is $\tfrac12$” is not a function yet, because a function must return one answer.

The fix is the same one you saw in algebra with $f(x) = x^2$, where $\sqrt{\ }$ returns only the non-negative root. We restrict the domain to a stretch where sine hits each value exactly once, the interval $[-\pi/2, \pi/2]$, and define $\arcsin$ to return the angle from that stretch. That chosen stretch is called the principal branch.

$$\arcsin: [-1, 1] \to [-\tfrac{\pi}{2}, \tfrac{\pi}{2}]$$

Cosine and tangent get the same treatment: $\arccos$ returns angles in $[0, \pi]$, and $\arctan$ returns angles in $(-\pi/2, \pi/2)$.
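A short sketch makes the principal branch visible. It starts from an angle outside $[-\pi/2, \pi/2]$, takes its sine, and asks Python's `math.asin` for the angle back; the starting angle $3\pi/4$ is an arbitrary choice for illustration.

```python
import math

theta = 3 * math.pi / 4      # an angle outside [-pi/2, pi/2]
s = math.sin(theta)          # its sine, about 0.7071
recovered = math.asin(s)     # the principal-branch angle with that same sine

print(round(theta, 4), round(recovered, 4))  # 2.3562 0.7854
```

The recovered angle is $\pi/4$, not the $3\pi/4$ we started with. Both have the same sine, and the inverse function returns the one that lies on the principal branch.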

The reciprocal trap

One notation warning. You will see $\arcsin$ written as $\sin^{-1}$. The $-1$ there means “inverse function,” not “reciprocal.”

$$\sin^{-1}(x) \ \text{means}\ \arcsin(x), \qquad \text{NOT}\ \ \frac{1}{\sin x}$$

The reciprocal $1/\sin x$ has its own name, cosecant, and it is a different thing. When in doubt, write $\arcsin$. It cannot be misread.
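To see that the two readings really are different numbers, here is a quick check at an arbitrary input, $x = 0.3$. Python's `math` module names the inverse `asin`; it has no cosecant function, so the reciprocal is written out by hand.

```python
import math

x = 0.3
inverse = math.asin(x)          # arcsin(0.3): an angle, in radians
reciprocal = 1 / math.sin(x)    # csc(0.3): one over the sine of 0.3

print(round(inverse, 4), round(reciprocal, 4))  # 0.3047 3.3839
```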

An inverse sine

Evaluate $\arcsin(1/2)$, the principal angle whose sine is $\tfrac12$.

What is it, in radians?

An inverse tangent

Evaluate $\arctan(1)$, the principal angle whose tangent is $1$.

What is it, in radians?

Why atan2 exists

Back to converting a Cartesian point to its angle. Here is the catch in concrete form.

The point $(1, 1)$ and the point $(-1, -1)$ have the same ratio $y/x = 1$. Feed that ratio to $\arctan$ and it returns $\pi/4$ for both. But the two points are in opposite quadrants, $\pi$ apart. $\arctan$ alone cannot tell them apart, because it only ever saw the ratio.

That is why code has a two-argument version, atan2(y, x). It takes both coordinates separately, so it can see the signs, and it returns the angle in the correct quadrant across the full $[-\pi, \pi]$. If you have ever written Math.atan2 in a program without quite knowing why it took two arguments, that is the reason: one argument throws away the quadrant.
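The same two points, run through both functions, show the difference. This sketch uses Python's `math.atan` and `math.atan2`, which behave like their counterparts in most languages.

```python
import math

for x, y in [(1.0, 1.0), (-1.0, -1.0)]:
    one_arg = math.atan(y / x)    # sees only the ratio y/x, which is 1 both times
    two_arg = math.atan2(y, x)    # sees both signs, so it can pick the quadrant
    print((x, y), round(one_arg, 4), round(two_arg, 4))
# (1.0, 1.0) 0.7854 0.7854
# (-1.0, -1.0) 0.7854 -2.3562
```

The one-argument column is identical for both points. The two-argument column differs by exactly $\pi$, which is the half turn separating the first and third quadrants.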

Tagging a position with waves

Now the payoff the whole module was climbing toward.

Suppose you have a position, an integer $p$, and you want to hand a machine a set of numbers that uniquely identifies it, smoothly, so that nearby positions get nearby numbers. Here is the move. Sample a stack of sine and cosine waves at the input $p$, each wave at a different frequency:

$$\Big[\ \sin(p/w_1),\ \cos(p/w_1),\ \sin(p/w_2),\ \cos(p/w_2),\ \dots\ \Big]$$

where the wavelengths $w_1, w_2, \dots$ span a wide range. The fast waves change a lot between $p$ and $p+1$; the slow waves barely move. Together, the list of numbers is a fingerprint of $p$.
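A minimal sketch of that fingerprint, assuming the four wavelengths 1, 10, 100, and 1000 that this lesson's interactive readout uses. The function name `fingerprint` is a hypothetical helper, not a library call.

```python
import math

def fingerprint(p, wavelengths=(1, 10, 100, 1000)):
    """Return [sin(p/w1), cos(p/w1), sin(p/w2), cos(p/w2), ...]."""
    vec = []
    for w in wavelengths:
        vec.append(math.sin(p / w))
        vec.append(math.cos(p / w))
    return vec

print([round(v, 3) for v in fingerprint(7)])
# [0.657, 0.754, 0.644, 0.765, 0.07, 0.998, 0.007, 1.0]
```

Each wavelength contributes one sine and one cosine, so four wavelengths give a vector of eight numbers.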

[Interactive figure: a bar chart of the fingerprint vector at $p = 7$, above a heatmap of the same vector across positions 0 to 30 with the current $p$ highlighted. Readout at $p = 7$: sin p = 0.657 · cos p = 0.754 · sin p/10 = 0.644 · cos p/10 = 0.765 · sin p/100 = 0.070 · cos p/100 = 0.998 · sin p/1000 = 0.007 · cos p/1000 = 1.000]

Every position gets a unique vector of sines and cosines sampled at many frequencies. This is the positional encoding inside a transformer: $\mathrm{PE}(p, 2i) = \sin(p / 10000^{2i/d})$, $\mathrm{PE}(p, 2i+1) = \cos(p / 10000^{2i/d})$. You will build it again in the transformer module.

Drag the position. The bar chart is the fingerprint at the current $p$. The heatmap stacks the fingerprints of positions $0$ through $30$, one per row.

Why many frequencies

Look at the heatmap. The high-frequency columns on the left flicker rapidly as you move down the rows: those waves separate adjacent positions sharply. The low-frequency columns on the right change slowly: those waves distinguish far-apart positions from nearby ones.

Use only fast waves and distant positions start to look alike once the waves wrap around. Use only slow waves and neighbors are nearly identical. Use a spread of frequencies and you get both at once: every position gets a fingerprint that is unique, and that varies smoothly, so “close” and “far” are both visible in the numbers.
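The trade-off can be checked numerically. The sketch below measures the Euclidean distance between fingerprints for a pair of neighbors (positions 0 and 1) and a pair of distant positions (0 and 44), under three choices of wavelengths. The helper names, the specific wavelengths, and the choice of 44 (which lands close to seven full turns of the fastest wave) are all assumptions made for illustration.

```python
import math

def fingerprint(p, wavelengths):
    """Interleaved sin/cos samples of position p, one pair per wavelength."""
    return [f(p / w) for w in wavelengths for f in (math.sin, math.cos)]

def distance(p, q, wavelengths):
    """Euclidean distance between the fingerprints of positions p and q."""
    a, b = fingerprint(p, wavelengths), fingerprint(q, wavelengths)
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

choices = [
    ("fast only", (1,)),
    ("slow only", (1000,)),
    ("spread", (1, 10, 100, 1000)),
]

for name, ws in choices:
    near = distance(0, 1, ws)    # neighboring positions
    far = distance(0, 44, ws)    # 44 is close to 7 * 2*pi, about 43.98
    print(f"{name:10s} near={near:.3f} far={far:.3f}")
```

With only the fast wave, position 44 has wrapped around and sits almost on top of position 0. With only the slow wave, positions 0 and 1 are nearly indistinguishable. With the spread, both distances are clearly nonzero, and the far pair is farther apart than the near pair.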

The fingerprint at position zero

At $p = 0$, every $\sin$ term is $\sin(0) = 0$ and every $\cos$ term is $\cos(0) = 1$.

With $4$ frequency bands switched on, the fingerprint has $8$ components. How many of them equal $1$?

The endgame: where this whole module was going

Here is the artifact you have been building toward, named at last.

A transformer sees a bag of tokens with no order. “the cat sat” and “sat the cat” look identical to it unless position is added back in. To put position back, it tags each token with a vector of sines and cosines read off the unit circle at many frequencies, the exact fingerprint you just built. That is sinusoidal positional encoding, from the 2017 paper Attention Is All You Need:

$$\mathrm{PE}(p, 2i) = \sin\!\left(\frac{p}{10000^{2i/d}}\right), \qquad \mathrm{PE}(p, 2i+1) = \cos\!\left(\frac{p}{10000^{2i/d}}\right)$$

The denominator $10000^{2i/d}$ is just the wavelength schedule: fast waves for small $i$, slow waves for large $i$, the same spread you saw in the heatmap.
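Here is a minimal sketch of that formula in plain Python, assuming an even dimension $d$. The function name `positional_encoding` is hypothetical. Real implementations typically build the whole table at once with an array library, but the arithmetic is the same.

```python
import math

def positional_encoding(p, d):
    """Sinusoidal encoding of position p as a list of d numbers (d even).

    Index 2i holds sin(p / 10000**(2i/d)); index 2i+1 holds the matching cos.
    """
    pe = []
    for i in range(d // 2):
        wavelength = 10000 ** (2 * i / d)   # the denominator in the formula
        pe.append(math.sin(p / wavelength))
        pe.append(math.cos(p / wavelength))
    return pe

print([round(v, 3) for v in positional_encoding(7, 8)])
# [0.657, 0.754, 0.644, 0.765, 0.07, 0.998, 0.007, 1.0]
```

With $d = 8$ the schedule $10000^{2i/d}$ works out to $1, 10, 100, 1000$, so the output matches the lesson's fingerprint readout at $p = 7$.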

And inside LLaMA 2 and 3, Mistral, and Gemma, the machine goes one step further: it rotates the query and key vectors by an angle proportional to position, the exact 2D rotation you derived from the angle addition formulas in the last lesson. That is RoPE. You did not just preview these ideas. You derived both halves of them.
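The reason a position-proportional rotation is useful can be sketched on a single 2D pair. This is not a full RoPE implementation: the vectors `q` and `k`, the step size `theta`, and the helper names are hypothetical values chosen for illustration. It only checks one property, that the dot product of a rotated query and a rotated key depends on how far apart their positions are, not on where they sit.

```python
import math

def rotate(v, angle):
    """Rotate the 2D vector v = (x, y) counterclockwise by angle (radians)."""
    x, y = v
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

def dot(a, b):
    """Dot product of two 2D vectors."""
    return a[0] * b[0] + a[1] * b[1]

theta = 0.1                        # hypothetical rotation angle per position
q, k = (1.0, 2.0), (0.5, -1.0)     # hypothetical 2D query and key

# Query and key are 3 positions apart in both cases; only the absolute positions differ.
score_a = dot(rotate(q, 5 * theta), rotate(k, 2 * theta))
score_b = dot(rotate(q, 105 * theta), rotate(k, 102 * theta))
print(math.isclose(score_a, score_b, abs_tol=1e-9))  # True
```

Rotating both vectors by the same extra angle leaves the angle between them unchanged, so the score sees only the offset of 3.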

What to expect next

The trigonometry you needed is done, and you can see the far shore from here.

Module four sharpens function transformations. Module five, calculus, shows why sine and cosine are unusually well-behaved under the derivative. Module seven recasts your rotation rule as a $2\times 2$ matrix. Module fifteen uses that matrix as RoPE inside attention, and module sixteen uses the multi-frequency fingerprint you built in this lesson. Every one of those is a turn of the same circle you have been dragging since lesson one.
