Algebra I & II · 40 min

The one identity we live by

Exponents, the exponential function, and its inverse, the logarithm. The payoff is a single identity: log turns products into sums. It is the reason training a model is even possible.

Exponents, and the one rule that runs the show

$a^n$ for a counting number $n$ is $n$ copies of $a$ multiplied: $2^4 = 2 \cdot 2 \cdot 2 \cdot 2 = 16$. You met this in module 1.

From that, one rule follows immediately. Multiply $a^m$ by $a^n$ and you’ve lined up $m$ copies then $n$ more, so:

$$a^m \cdot a^n = a^{m+n}$$

Hold onto this rule. It is the load-bearing one. Everything strange-looking about exponents (the zero, the negatives, the fractions) is forced by the demand that this rule keeps working.

What zero, negative, and fraction exponents must be

“Three copies of $a$” makes no sense for $a^0$ or $a^{-2}$ or $a^{1/2}$. So we don’t define them by counting. We define them by insisting $a^m \cdot a^n = a^{m+n}$ stays true, and see what that forces.

Zero: $a^0 \cdot a^n = a^{0+n} = a^n$. For $a \neq 0$, the only number that leaves $a^n$ unchanged when multiplied in is $1$. So $a^0 = 1$. Forced.

Negative: $a^{-n} \cdot a^n = a^{0} = 1$. So $a^{-n}$ is whatever you multiply $a^n$ by to get $1$, namely $a^{-n} = 1/a^n$. Forced.

Fraction: $a^{1/2} \cdot a^{1/2} = a^{1} = a$. So $a^{1/2}$ is the non-negative number that squares to $a$: $a^{1/2} = \sqrt{a}$. Forced.

None of these is an arbitrary convention. Each is the only value that keeps the product rule intact.
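The three forced values can be checked numerically. This is a minimal Python sketch: the base `a = 5.0` and the exponent `n = 3` are arbitrary choices made for illustration, and `math.isclose` is used because floating-point powers are only approximately exact.

```python
import math

a, n = 5.0, 3  # arbitrary illustrative base and exponent (assumptions)

# Zero exponent: a^0 * a^n must equal a^n, so a^0 has to be 1.
zero_ok = math.isclose(a**0 * a**n, a**n)

# Negative exponent: a^-n * a^n must equal a^0 = 1, so a^-n = 1 / a^n.
negative_ok = math.isclose(a**-n * a**n, 1.0) and math.isclose(a**-n, 1 / a**n)

# Fractional exponent: a^(1/2) * a^(1/2) must equal a, so a^(1/2) = sqrt(a).
half_ok = math.isclose(a**0.5 * a**0.5, a) and math.isclose(a**0.5, math.sqrt(a))

print(zero_ok, negative_ok, half_ok)
```

Each check asks the same question the text asks: does the product rule still hold once the new kind of exponent is allowed?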

Exercise the laws

Simplify $\left(2^3 \cdot 2^{-1}\right)^2$ using the exponent laws. Combine the powers inside, then apply the outer exponent.

What is the value?

A fractional exponent

Evaluate $27^{2/3}$. The denominator is a root, the numerator is a power: take the cube root of $27$ first, then square it.

What is the value?

The exponential function

So far the exponent has been a fixed number. Now flip it: fix the base and let the exponent be the variable. That’s the exponential function:

$$f(x) = b^x \qquad (b > 0,\ b \neq 1)$$

A line adds a constant per step. An exponential multiplies by a constant per step: every time $x$ goes up by $1$, the output is multiplied by $b$. $b > 1$ gives growth; $0 < b < 1$ gives decay. Populations, compound interest, radioactive material, and the loss curves you’ll stare at later all live on this shape.
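The contrast between adding per step and multiplying per step can be tabulated directly. A minimal Python sketch follows; the helper name `exponential`, the bases `2` and `0.5`, and the line `3x + 1` are illustrative assumptions, not values from the lesson.

```python
def exponential(b, x):
    """Return b**x for a fixed base b > 0, b != 1."""
    return b ** x

growth = [exponential(2, x) for x in range(5)]    # base greater than 1
decay = [exponential(0.5, x) for x in range(5)]   # base between 0 and 1
line = [3 * x + 1 for x in range(5)]              # a line, for contrast

# Each step of an exponential multiplies by b; each step of a line adds a constant.
growth_ratios = [growth[i + 1] / growth[i] for i in range(4)]
decay_ratios = [decay[i + 1] / decay[i] for i in range(4)]
line_steps = [line[i + 1] - line[i] for i in range(4)]

print(growth, growth_ratios)
print(decay, decay_ratios)
print(line, line_steps)
```

The ratios come out constant for the two exponentials and the differences come out constant for the line, which is the whole distinction.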

The logarithm undoes the exponential

The exponential traps a variable up in the exponent. $2^x = 32$, what is $x$? You need the machine that runs $b^x$ backwards. That machine is the logarithm:

$$\log_b(x) = y \quad \Longleftrightarrow \quad b^y = x$$

In words: $\log_b(x)$ answers the question “to what power must I raise $b$ to get $x$?” So $\log_2(32) = 5$, because $2^5 = 32$.

That’s the entire definition. The logarithm is not a calculator button you press and trust. It is the inverse function of the exponential, exactly the inverse-machine idea from the functions lesson, applied to $b^x$. Its domain is $x > 0$, because $b^x$ never produces zero or a negative, so its inverse is never asked about them.
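The inverse relationship and the domain restriction can both be seen in a few lines. This is a hedged sketch using Python's standard `math.log(x, b)`; the specific inputs are illustrative, and results are compared with `math.isclose` because floating-point logs are only approximately exact.

```python
import math

# "To what power must 2 be raised to get 32?"
y = math.log(32, 2)
print(y)        # close to 5, up to floating-point rounding
print(2 ** y)   # running the exponential forwards recovers about 32

# The round trip works in both directions.
forward_then_back = math.log(2 ** 7, 2)   # exponentiate, then take the log
back_then_forward = 3 ** math.log(81, 3)  # take the log, then exponentiate

# The domain is x > 0: asking for the log of a negative number is an error.
try:
    math.log(-1, 2)
    domain_error = False
except ValueError:
    domain_error = True

print(forward_then_back, back_then_forward, domain_error)
```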

Ask the logarithm's question

Evaluate $\log_2(32)$ from the definition: to what power must you raise $2$ to get $32$?

What is $\log_2(32)$?

The log laws come from the exponent laws

Because the logarithm is the exponential’s inverse, each exponent law reflects into a log law. The product rule $b^u \cdot b^v = b^{u+v}$ is the important one. Reflected through the logarithm it becomes:

$$\log_b(xy) = \log_b(x) + \log_b(y)$$

A logarithm turns a product into a sum. Its siblings: $\log_b(x/y) = \log_b(x) - \log_b(y)$ turns a quotient into a difference, and $\log_b(x^k) = k\log_b(x)$ turns a power into a multiple.
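The reflection of the product rule can be spelled out in three lines. Here $u$ and $v$ are just labels for the two logarithms, and $x, y > 0$ is assumed so that both logarithms exist.

```latex
\begin{aligned}
&\text{Let } u = \log_b(x),\ v = \log_b(y), \text{ so that } b^u = x \text{ and } b^v = y. \\
&xy = b^u \cdot b^v = b^{u+v} \\
&\log_b(xy) = u + v = \log_b(x) + \log_b(y)
\end{aligned}
```

The middle line is the exponent product rule; the last line just reads the exponent back off with the definition of the logarithm.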

One caution, the single most common log error: this is the log of a product. The log of a sum is nothing nice. $\log(a + b)$ does not equal $\log a + \log b$. The sum of the logs is the log of the product, never the log of the sum.
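The three laws and the caution can be checked side by side. A minimal Python sketch, assuming base 2 via `math.log2` and the arbitrary test values `x = 8`, `y = 4`, `k = 3`:

```python
import math

x, y, k = 8.0, 4.0, 3  # arbitrary positive test values (assumptions)

product_law = math.isclose(math.log2(x * y), math.log2(x) + math.log2(y))
quotient_law = math.isclose(math.log2(x / y), math.log2(x) - math.log2(y))
power_law = math.isclose(math.log2(x ** k), k * math.log2(x))

# The caution: the log of a sum is NOT the sum of the logs.
log_of_sum = math.log2(x + y)               # log2(12), roughly 3.58
sum_of_logs = math.log2(x) + math.log2(y)   # 3 + 2 = 5, which is log2(8 * 4)
sum_mismatch = not math.isclose(log_of_sum, sum_of_logs)

print(product_law, quotient_law, power_law, sum_mismatch)
```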

e and ln, on credit

You’ll see one base constantly: $e \approx 2.71828\ldots$, an irrational constant, and its logarithm $\ln = \log_e$, the natural log.

Why $e$ and not $10$? Because $e$ is the base that makes calculus come out clean: $e^x$ is the one exponential $b^x$ that is exactly its own rate of change. That sentence can’t be cashed in yet; it needs module 5. For now, take it on credit: $e$ is just a base, a specific number near $2.718$, and $\ln$ is its logarithm. Module 5 will tell you why it’s the base. Until then, every log law above works with $\ln$ exactly as written.
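A numerical preview of the claim taken on credit, not a proof of it. The sketch below assumes Python's `math.e`, `math.exp`, and one-argument `math.log` (the natural log); the helper `rate_of_change` is a hypothetical name for a simple difference quotient, and the step size `h` is an arbitrary small number.

```python
import math

print(math.e)             # the constant e, about 2.718
print(math.log(math.e))   # ln(e), which is 1 up to rounding

# The log laws hold for ln exactly as for any other base.
ln_product_ok = math.isclose(math.log(6.0), math.log(2.0) + math.log(3.0))

def rate_of_change(f, x, h=1e-6):
    """Estimate how fast f changes at x with a symmetric difference quotient."""
    return (f(x + h) - f(x - h)) / (2 * h)

# For base e, the estimated rate of change matches the function's own value.
for x in (0.0, 1.0, 2.0):
    print(x, math.exp(x), rate_of_change(math.exp, x))

# For base 2, the two numbers differ.
rate_base_2 = rate_of_change(lambda t: 2.0 ** t, 1.0)
print(rate_base_2, 2.0 ** 1.0)
```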

The identity we live by

Here is the payoff the whole module has been walking toward.

A model that predicts a sequence assigns a probability to each piece, then multiplies them for the joint probability: $P = p_1 \cdot p_2 \cdots p_N$, a product of millions of numbers each between $0$ and $1$.

Multiplying millions of small numbers is a disaster on a real computer. The product shrinks past the smallest number the machine can represent and silently collapses to exactly $0$. That’s underflow, and once it happens every trace of the real value is gone.

The logarithm rescues this. Because $\log$ turns products into sums:

$$\log\!\left(\prod_{i=1}^N p_i\right) = \sum_{i=1}^N \log(p_i)$$

[Interactive demo: three draggable probabilities p₁, p₂, p₃ on a log scale running from 1 (large) down to 10^-300 (tiny). The multiply track shows p₁ × p₂ × p₃; the log-sum track shows log₁₀(p₁) + log₁₀(p₂) + log₁₀(p₃). At the default settings (p₁ = 0.100, p₂ = 0.0100, p₃ = 0.0501) both tracks agree at -4.3000, with no underflow.]

Drag the markers tiny. Watch the multiply track collapse to zero while the sum-of-logs track keeps the real number. That collapse is why every training loss is a sum of logs.

Drag the three probabilities tiny. The multiply track collapses to $0$ and dies. The sum-of-logs track keeps the real number, every time. That collapse, and that survival, is the demo of the lesson.
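The same collapse can be reproduced without the widget. This is a minimal Python sketch using ordinary 64-bit floats; the list of one thousand probabilities of `0.01` is a made-up input chosen so the true product, $10^{-2000}$, is far too small to represent.

```python
import math

probs = [0.01] * 1000  # hypothetical per-piece probabilities, chosen for illustration

# Multiply track: the running product shrinks until a 64-bit float cannot hold it.
product = 1.0
for p in probs:
    product *= p

# Log-sum track: add the logs instead of multiplying the probabilities.
log_sum = sum(math.log10(p) for p in probs)

print(product)   # 0.0: the true value 10^-2000 has underflowed
print(log_sum)   # about -2000, an ordinary representable number
```

Once `product` has reached `0.0` there is no way to recover the exponent from it, whereas `log_sum` still carries it.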

Why a sum, specifically

Switching from a product to a sum of logs buys two things, and a model is trained on both.

It survives the arithmetic. A sum of a million moderate negative numbers is an ordinary, representable number. The product those logs came from would have underflowed to zero long ago. The sum is the only form that physically fits in the machine.

It can be differentiated cheaply. Module 5 will show that the rate of change of a sum is just the sum of the rates of change of its parts. A product of a million terms has no such mercy. Training means nudging parameters using exactly those rates, so the loss has to be a sum. That is why every loss function you’ll meet is written $\mathcal{L} = -\sum_i \log p_i$, a sum of logs, not a product of probabilities.
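The loss formula translates almost word for word into code. A minimal sketch, assuming the natural log (the formula above does not fix a base); the function name `negative_log_likelihood` and the three probabilities are hypothetical, chosen only for illustration.

```python
import math

def negative_log_likelihood(probs):
    """Return -sum(log p_i) using the natural log, for probabilities in (0, 1]."""
    return -sum(math.log(p) for p in probs)

probs = [0.5, 0.25, 0.125]  # hypothetical probabilities assigned to the observed pieces

loss = negative_log_likelihood(probs)

# Same number via the product, which is safe here only because the list is short.
loss_from_product = -math.log(math.prod(probs))

print(loss, loss_from_product)
```

The two values agree because of the identity above; the sum form is the one that keeps working when the list grows to millions of entries.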

A negative log-likelihood

A model assigns probability $0.001$ to an event. Its negative log-likelihood, the loss contribution, is $-\log_{10}(0.001)$.

Find $\log_{10}(0.001)$ first, then negate it. What is the negative log-likelihood?

Where this goes next

That’s module 2. You can reshape expressions, graph lines, solve systems, wire functions into pipelines, bend parabolas, and turn a doomed product into a workable sum of logs.

g(f(x)) is one layer wired into the next, the forward pass you built in the functions lesson. log(∏) = ∑(log) is why a million tiny probabilities don’t sink training, the identity you just watched rescue a computation. Modules 8 and 9 cash in the logarithm as the loss function. Module 11 cashes in composition as the network. Everything that follows is these two facts at scale, and you now hold both.
