Lessons

Interactive lessons from "Machine Learning, Backpropagation, and AI: The Math." Pick any module to start; each lesson is one read-in-one-sitting page with gated steps and a widget for every key idea.

  1. Module 1 drafting

    Pre-algebra Refresh

    Number line intuition, arithmetic fluency, negatives, fractions as division, order of operations, variables as placeholders, one-variable equations.

    1. 01 Quantities live on a line Every number, including the negative ones, is a position on one straight line. Distance from zero has a name, and you'll use it for the rest of the course. 18 min
    2. 02 There are only two operations Subtraction is just adding a negative. Division is just multiplying by a reciprocal. Four operations collapse into two, each with an undo button. 22 min
    3. 03 A fraction is one number a/b means "a divided by b," one point on the line. Fractions, decimals, and percents are three spellings of that single point. 25 min
    4. 04 Expressions are trees Operator precedence isn't a slogan to memorize; it's a parser. Every expression has exactly one tree, and evaluating it is a climb up that tree. 20 min
    5. 05 Variables, expressions, equations A variable is a named box. An equation is a constraint. Solving is pressing undo buttons on the expression tree, from the outside in. 28 min
  2. Module 2 drafting

    Algebra I & II

    Linear equations & graphing, slope, systems, quadratics, functions as machines, function composition, polynomials, exponents, logarithms.

    1. 01 Same value, different shape The distributive law and combining like terms are rewrites. They change what an expression looks like without touching what it computes. Substitution is the check that proves it. 25 min
    2. 02 The only shape with a slope A line is the one curve whose steepness never changes. Slope is that single number. The plane, the three equation forms, and parallel-versus-perpendicular all fall out of it. 30 min
    3. 03 Two lines, three outcomes A system of two linear equations is two lines in one plane. Solving it is finding where they cross, and elimination is a chain of moves that never lets the crossing point escape. 25 min
    4. 04 Wiring machines together A function is a machine with one input and one output, every single time. Composition wires one machine's output into the next one's input, which is exactly what a neural network is. 35 min
    5. 05 Quadratics and the parabola A quadratic is one curve, the parabola, wearing three different equations. Vertex form, factored form, standard form. The quadratic formula is not magic; it is completing the square, done once. 30 min
    6. 06 The one identity we live by Exponents, the exponential function, and its inverse, the logarithm. The payoff is a single identity: log turns products into sums. It is the reason training a model is even possible. 40 min
  3. Module 3 drafting

    Trigonometry: compact

    Unit circle, sine and cosine, angle addition, rotations in 2D, Pythagoras, polar coordinates. Only what we'll actually use.

    1. 01 What a radian actually is An angle is a turn. The unit circle is its natural home. A radian is not a mysterious unit; it is just the length of the arc your turn carves out on a circle of radius one. 20 min
    2. 02 Sine and cosine are coordinates Cosine and sine are just the x and y of a point on the unit circle. Inside a right triangle the same two numbers reappear as ratios of sides. It is one definition wearing two outfits, not two topics to memorize. 24 min
    3. 03 The wave they make Stop thinking of sine and cosine as angle machines. Feed them any real number and their outputs trace a wave. You can stretch it, squeeze it, and slide it, and every transformer you ever meet is built from waves like these. 20 min
    4. 04 What rotation actually is The angle addition formulas are not party tricks to memorize. They are the algebra of stacking two turns, and out of them falls the rule for rotating any point in the plane. That rule is the one a transformer uses to track position. 26 min
    5. 05 Polar coordinates and the position fingerprint Polar coordinates are the unit circle scaled up. Inverse trig works only once you restrict the domain. Then the payoff. Tag a position with sines and cosines at many frequencies and every position gets a unique fingerprint, which is exactly a transformer's positional encoding. 24 min
  4. Module 4 drafting

    Pre-calculus: the limit intuition

    The ten parent functions and one grammar that bends them all, sequences and the first infinite process, informal limits by table and zoom, continuity, and the number e two ways.

    1. 01 The zoo of functions, and one grammar There are about ten function shapes worth knowing on sight. Then there is a single shift, scale, and reflect grammar that bends any of them into the graph you need. Learn the grammar once and every parent function obeys it. 30 min
    2. 02 Sequences, series, and the first infinite process A sequence is just a function whose inputs are 1, 2, 3, and so on. A series is its running total. Add up infinitely many shrinking terms and, against intuition, you can land on a finite number. That is the on-ramp to the limit. 30 min
    3. 03 What approaching actually means A limit asks where a function is headed near a point, not what it does at the point. Read it three ways, by graph, by table, by zoom. Meet the ways a limit can fail, and the property called continuity that makes most limits a plug-in. 30 min
    4. 04 The number that makes calculus clean Meet e two completely different ways and watch them land on the same number. Recap the natural log. Then zoom into a smooth curve until it becomes a straight line, the single picture the next module turns into the derivative. 30 min
  5. Module 5 drafting

    Single-variable Calculus: Derivatives & the Chain Rule

    The most important module in the first half of the course. The derivative. The chain rule. THIS is what runs when you call `loss.backward()`.

    1. 01 What is a derivative? A curve has a different slope at every point. Finding that slope means cheating with a secant, then taking the cheat to its limit. 12 min
    2. 02 The derivative as a function f'(a) is a number at one point. Let a slide, and the numbers join into a new function f'(x). The derivative is not a thing you compute once. It is a function in its own right. 10 min
    3. 03 Differentiation rules One rule for powers, one for sums, one for constants, plus the derivatives of sine, cosine, the exponential, and the logarithm. Memorize seven lines, differentiate ninety percent of the functions you will ever meet. 14 min
    4. 04 Product and quotient rules The derivative of a product is not the product of the derivatives. The geometric reason why, the rule that comes out of it, and one more for division. 8 min
    5. 05 The Chain Rule When functions nest, their rates multiply. That multiplication is what trains every neural network. 16 min
    6. 06 Second derivatives and finding optima f' tells you which way the curve is heading. f'' tells you whether that heading is itself speeding up or slowing down. Together they find every peak, every valley, and tee up the entire job of training a neural network. 12 min
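The chain-rule claim in lesson 05 ("when functions nest, their rates multiply") can be checked numerically. The sketch below is only an illustration, not course material: the functions `inner` and `outer`, the point `x0`, and the step size `h` are all arbitrary choices made for the example.

```python
import math

def numeric_derivative(f, x, h=1e-6):
    """Central-difference estimate of f'(x): the secant 'cheat' with a tiny step."""
    return (f(x + h) - f(x - h)) / (2 * h)

def inner(x):
    return x * x           # g(x) = x^2, so g'(x) = 2x

def outer(u):
    return math.sin(u)     # f(u) = sin(u), so f'(u) = cos(u)

def composed(x):
    return outer(inner(x)) # f(g(x)) = sin(x^2)

x0 = 1.3  # arbitrary test point

# Chain rule: outer rate (evaluated at the inner output) times inner rate.
chain_rule = math.cos(inner(x0)) * (2 * x0)

# Independent estimate that knows nothing about the chain rule.
estimate = numeric_derivative(composed, x0)

print(chain_rule, estimate)
```

The two printed numbers should agree to several decimal places; the small gap that remains is the error of the finite-difference estimate, not of the rule.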
  6. Module 6 drafting

    Multivariable Calculus: Partial Derivatives, Gradients, Jacobians

    Functions of many variables. The gradient as a vector pointing uphill. The Jacobian as a gradient for vector-valued functions.

    1. 01 Partial Derivatives: one knob at a time Two inputs instead of one. Freeze one, move the other, and the ordinary derivative still works. That's all a partial derivative is. 15 min
    2. 02 The Gradient: every partial, bundled into one arrow Pack all your partials into a tuple. That tuple is a direction, specifically the direction in which the function climbs fastest. For a loss, that means worse fastest; flip the sign and it is better fastest. 16 min
    3. 03 Local Linear Models and Saddle Points Some critical points are bowls. Some are caps. Some are neither (saddles, where the function goes up in one direction and down in another). In high dimensions, saddles are everywhere. 16 min
    4. 04 Jacobians and the Multivariable Chain Rule When a function returns multiple numbers, stack its gradients into a grid, and that grid is the Jacobian. Compose two functions and their Jacobians multiply. Neural networks are exactly this, about a hundred times in a row. 20 min
    5. 05 loss.backward() is a Vector–Jacobian Product A neural network is a long function composition. Its derivative is a long product of Jacobians. We never build those Jacobians; we push a row vector through them one at a time, starting at the output and ending at the input. That walk is backprop. 20 min
  7. Module 7 drafting

    Linear Algebra

    Vectors, matrices as linear transformations, dot products, matrix multiplication as composition, determinants, eigenvalues, SVD intuition.

    1. 01 What is a vector, really? Physicists see arrows. Programmers see lists. It's the same object wearing two costumes, and switching between them is a skill you build once. 14 min
    2. 02 Dot product, the alignment number A single number that tells you how much two arrows point the same way. Two formulas compute it. The ML world runs on it. 14 min
    3. 03 A Matrix Is a Transformation Two arrows are enough to record any linear move on 2D space. Drag them, and you've written every 2×2 matrix that exists. 18 min
    4. 04 Composition, Determinants, and Inverses Multiplying matrices = composing transformations. The determinant is the area factor. The inverse is the transformation that undoes another, and it exists exactly when no space has been crushed. 22 min
    5. 05 Eigenvectors, Change of Basis, and a Glimpse of SVD Many transformations have special directions they stretch without rotating. Working in those directions collapses complicated matrices to pure scalings. SVD extends the idea to every matrix there is. 22 min
  8. Module 8 drafting

    Probability & Statistics

    Sample spaces, conditional probability, Bayes, random variables, PMF/PDF/CDF, expectation, the Gaussian, the CLT, sampling.

    1. 01 What does probability measure? Probability is a rule that hands numbers in [0, 1] to sets of outcomes. Random variables, pmfs, and densities are how we drag that rule onto the real line so we can do math. 12 min
    2. 02 Joint, marginal, conditional When two random variables share a world, the joint pmf is the whole picture, marginals are projections, and conditioning is taking a slice and renormalizing it. Bayes' theorem is the punchline. 16 min
    3. 03 What you can say about a random variable in one number Mean, variance, covariance: three summary statistics that capture almost everything we need from a distribution. Weight initialization, BatchNorm, and the noise in stochastic gradient descent all live inside these three numbers. 18 min
    4. 04 Maximum likelihood: fit the data, with a formula Every loss function you'll meet for the rest of this course is the same move. Pick parameters that make the data as probable as possible. For Bernoulli, that's count-and-normalize. For Gaussian noise, that's least squares. 14 min
    5. 05 Drawing samples (and what could possibly go wrong) torch.multinomial isn't magic. It's six lines of JavaScript. Once you can sample from a categorical, you can sample from a trained language model, and controlling generation is just settings on top of those six lines. 14 min
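Lesson 05 describes categorical sampling as a handful of lines of JavaScript. The sketch below shows the same general idea in Python rather than JavaScript, using the common running-total (inverse-CDF) approach; it is an assumption that the lesson uses this exact method, and the name `sample_categorical` is made up for the example.

```python
import random

def sample_categorical(probs, rng=random):
    """Draw index i with probability probs[i] by walking the running total."""
    r = rng.random()           # uniform in [0, 1)
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1      # guard against floating-point round-off

# Draw many samples and compare empirical frequencies to the target.
rng = random.Random(0)         # fixed seed so the run is repeatable
probs = [0.1, 0.6, 0.3]
counts = [0, 0, 0]
for _ in range(10_000):
    counts[sample_categorical(probs, rng)] += 1

print([c / 10_000 for c in counts])
```

The printed frequencies should sit close to `[0.1, 0.6, 0.3]`. Swapping `probs` for a model's softmax output is what turns this into text generation.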
  9. Module 9 drafting

    Information Theory Basics

    Surprise = −log p. Entropy as average surprise. Cross-entropy. KL divergence. Why cross-entropy is the right classification loss.

    1. 01 How surprised should you be? Rare events carry more information than common ones. Turn that intuition into a single number, and then average that number over a distribution to get entropy, the quantity underneath the cross-entropy loss every language model in this course minimizes. 12 min
    2. 02 Two distributions in the same room When the truth is P but your model is Q, the average surprise you suffer is cross-entropy. The excess over H(P) is KL divergence, which is non-negative, asymmetric, and the thing label smoothing exists to fix. 15 min
    3. 03 Why classifiers have a softmax head For one-hot targets, cross-entropy collapses to negative log-likelihood of the true class. Softmax exists because that loss needs a valid Q, and softmax-plus-NLL produces the cleanest gradient in all of supervised learning. 12 min
    4. 04 Perplexity and the floor of English Perplexity is cross-entropy on a branching-factor scale. Shannon (1951) bounded English at about 1.0 to 1.3 bits per character, a hard floor the M18 capstone's training curve approaches but cannot break. 14 min
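The relationship in lesson 02 (cross-entropy is entropy plus the KL excess) is easy to verify on a toy example. This is a minimal sketch; the two distributions `P` and `Q` are invented for illustration, and everything is measured in bits (log base 2).

```python
import math

def entropy(p):
    """Average surprise, in bits, when outcomes follow p and you know p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Average surprise, in bits, when outcomes follow p but you believe q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """Extra bits paid for believing q when the truth is p."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.25, 0.25]      # made-up "truth"
Q = [0.125, 0.125, 0.75]   # made-up "model"

h_p = entropy(P)
h_pq = cross_entropy(P, Q)
kl_pq = kl_divergence(P, Q)
kl_qp = kl_divergence(Q, P)

print(h_p, h_pq, kl_pq, kl_qp)
```

Here `h_p` is exactly 1.5 bits, `h_pq` equals `h_p + kl_pq` up to rounding, and `kl_pq` differs from `kl_qp`, which is the asymmetry the lesson mentions.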
  10. Module 10 drafting

    Optimization

    Minimizing a scalar function. Gradient descent. Learning rate. SGD. Momentum. RMSProp. Adam. Loss-landscape pathologies.

    1. 01 What does it mean to minimize a function? Gradients point uphill. Walk against them and you go downhill. The only subtle part is how far to step, and it turns out that one dial has a lot of opinions. 16 min
    2. 02 Batches, Stochastic, and the Noise Ball Real datasets are too big to compute the full gradient at every step. Use a sample. You introduce noise, which turns out to help, not hurt, in ways that matter. 16 min
    3. 03 Momentum: Gradient Descent with a Memory Teach gradient descent to remember its past moves. Persistent directions amplify; oscillating ones cancel. Ravines stop being slow. 18 min
    4. 04 Per-parameter Steps: RMSProp, and then Adam Build Adam in pieces. Start with momentum. Add a per-parameter adaptive step via an EMA of squared gradients. Add bias correction for the cold start. Now you have the optimizer that trains transformers. 20 min
    5. 05 Schedules, Pathologies, and What nanoGPT Actually Does Warmup, cosine decay, gradient clipping, saddle points, and the overparameterized regime. By the end you'll read nanoGPT's training loop and understand every line. 18 min
  11. Module 11 drafting

    Neural Network Fundamentals

    Perceptron. Activations (ReLU, GELU, sigmoid, tanh). The XOR problem. Multilayer perceptrons. Forward pass as matrix multiplies and nonlinearities.

    1. 01 What's a perceptron, really? A single neuron is three numbers and a threshold. It computes a weighted sum, compares it to zero, and answers yes or no. Geometrically, it draws a line. 15 min
    2. 02 The XOR moment A single perceptron cannot compute XOR. That 1969 fact stalled the field for a decade, and it is the entire reason neural networks have hidden layers. 18 min
    3. 03 Why a stack of linear layers is just one linear layer Without a nonlinearity between them, ten layers do exactly what one layer does. The algebra is short and brutal, and it is the reason activation functions exist. 12 min
    4. 04 ReLU, and the activation zoo Four activation functions are worth knowing. ReLU is the modern default for hidden layers, and the reason why comes down to one number, the slope of its competitor. 18 min
    5. 05 Forward pass, end to end Assemble linear layers and activations into a full MLP, run it from input to output by hand, count its parameters, and take the universal approximation theorem exactly as seriously as it deserves. 22 min
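The claim in lesson 03, that stacked linear layers collapse into one, can be seen with two small matrices. The sketch below is illustrative only: the weights `W1` and `W2` and the input `x` are arbitrary, and biases are left out to keep the algebra visible (an assumption; with biases the stack is still a single affine map).

```python
def matmul(A, B):
    """Matrix product of two lists-of-rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, x):
    """Matrix times vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def relu(v):
    return [max(0.0, vi) for vi in v]

W1 = [[1.0, -2.0], [0.5, 1.0]]   # first "layer" (made-up weights)
W2 = [[2.0, 0.0], [-1.0, 3.0]]   # second "layer" (made-up weights)
x = [1.0, 1.0]

# Two linear layers in a row...
two_layers = matvec(W2, matvec(W1, x))
# ...are the same as one layer whose matrix is the product W2 @ W1.
one_layer = matvec(matmul(W2, W1), x)
# Put a nonlinearity between them and the collapse no longer happens.
with_relu = matvec(W2, relu(matvec(W1, x)))

print(two_layers, one_layer, with_relu)
```

Both linear versions give `[-2.0, 5.5]`, while the ReLU version gives `[0.0, 4.5]`: the activation is what stops the stack from being one matrix in disguise.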
  12. Module 12 drafting

    Backpropagation from Scratch

    The keystone module of the course. Build micrograd node by node. Computational graph editor. Watch gradients flow backward through a tanh.

    1. 01 Draw the graph Every formula you can write down is also a picture. Nodes hold values, edges hold operations, and every edge has a local derivative. This is the picture backprop walks. 14 min
    2. 02 Walk it backward The backward pass is the forward pass in reverse. Seed the root with grad 1, walk the graph upstream, and use each edge's local derivative to distribute one number per node. The order matters more than anyone tells you. 16 min
    3. 03 Build the Value class Stop walking by hand. Write the code that does the walk. Each closure you fill in becomes one line of the engine that trains every neural network on earth. 30 min
    4. 04 From scalars to tensors Stack 64 scalars into a vector, stack 768 vectors into a matrix, and the same chain rule still works. The forward is faster. The backward gradients are now matrices. The bookkeeping (broadcasting, axis-sums, the famous softmax+CE collapse) is what this lesson teaches. 18 min
    5. 05 Check it, break it, fix it Every autodiff bug is caught by a gradient check or diagnosed by recognizing one of the four classic footguns. And one slide on why deep learning picked reverse-mode in the first place. 18 min
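To make lessons 02, 03, and 05 concrete, here is a stripped-down scalar autodiff node in the spirit of micrograd. It is a sketch, not micrograd's actual source and not the course's engine: the attribute names `_parents` and `_backward`, the choice of supported operations, and the test values are all assumptions made for the example.

```python
import math

class Value:
    """A scalar that remembers which nodes produced it (hypothetical minimal node)."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None   # leaf nodes have nothing to push upstream

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad       # local derivative of a sum is 1
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            self.grad += (1 - t * t) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topological order guarantees a node's grad is complete before it is used.
        order, seen = [], set()
        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0                 # seed the root
        for node in reversed(order):
            node._backward()

x, w, b = Value(0.5), Value(-1.5), Value(0.25)
y = (x * w + b).tanh()
y.backward()

# Gradient check: compare against a central finite difference in w.
h = 1e-6
numeric_dw = (math.tanh(0.5 * (-1.5 + h) + 0.25)
              - math.tanh(0.5 * (-1.5 - h) + 0.25)) / (2 * h)
print(w.grad, numeric_dw)
```

The `+=` in each closure matters: a node used twice (as in `a * a`) must accumulate both contributions, and overwriting instead of accumulating is one of the easier mistakes to make in an engine like this.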
  13. Module 13 drafting

    Training Dynamics & Modern Tricks

    Over/underfitting. Train/val/test splits. L2. Dropout. Weight init. BatchNorm. LayerNorm. Residual connections. LR warmup.

    1. 01 When your model memorizes the training set Train and validation curves are diagnostic instruments. Learn to read them, and learn why touching the test set is a sin. 12 min
    2. 02 Two ways to keep weights small (and what they really mean) L2, L1, and dropout, framed as priors and as expected-value contracts. Two completely different stories that solve overlapping problems. 14 min
    3. 03 Why your gradients explode Initialization is not folklore. It falls out of one line of variance algebra, and it's the difference between a network that trains and a network that sits dead at step zero. 15 min
    4. 04 Normalize what, exactly? BatchNorm and LayerNorm are the same operation parameterized by which axes you reduce over. Pick the axis whose statistics are stable. 15 min
    5. 05 The tricks that make depth possible Residual connections, learning-rate warmup, gradient clipping. The three things that turn a 12-layer transformer from "diverges in 200 steps" into "converges cleanly." 16 min
  14. Module 14 shipped

    Sequence Modeling: Bigrams to RNNs

    From a bigram count table to an RNN. Tokens, the chain rule of probability, perplexity, sampling, fixed-context MLPs, and the recurrent hidden state, built so that attention next module lands as a fix to a specific, named failure mode.

    1. 01 Predicting the next character A language model is just a table, one row per "what came before" and one column per "what could come next." Learn the table by counting; sample from a row to generate text. 14 min
    2. 02 Two paths to the same model Counting bigrams and training a one-layer neural network find the exact same probabilities. Watch them converge in real time, then learn why "smoothing" and "L2 regularization" are the same trick from two angles. 18 min
    3. 03 Loss, perplexity, and how to sample NLL is the loss; perplexity is its branching-factor twin; temperature reshapes a distribution without changing what's most likely; top-k censors the long tail. Four ideas, one widget, every later language model. 16 min
    4. 04 Why bigrams aren't enough A bigram remembers one character. A trigram needs |V|² rows. Bengio's 2003 fix replaces the exploding lookup table with a small MLP fed concatenated embeddings, and the model learns to generalize across similar contexts. 16 min
    5. 05 Carrying memory forward, the RNN A vector that gets overwritten each timestep replaces the fixed window. The same cell (same W_x, same W_h, same b) runs at every step. Weight sharing is the entire trick. 16 min
    6. 06 Training the RNN, and where it breaks BPTT is just backprop on the unrolled graph, with the shared weight's gradient summed across timesteps. Vanishing and exploding gradients are properties of repeated multiplication by the same recurrent Jacobian, contraction in one case and expansion in the other, and the fixed-size hidden-state bottleneck is the failure mode attention is built to fix. 16 min
  15. Module 15 shipped

    Attention

    Build attention from a soft dictionary lookup. Scaled dot-product. Q, K, V as projections of the same X. Causal masking. Permutation-equivariance and three flavors of positional encoding. Multi-head as parallel subspaces. The T² cost and the KV-cache that tames it.

    1. 01 Soft dictionary lookup A hash table maps a query to one entry. A soft hash table returns every entry, weighted by similarity. Replace argmax with softmax and the lookup becomes differentiable. That's attention; the rest of this module is plumbing. 13 min
    2. 02 Why divide by √dₖ At high dimension, the dot product q·k has a large variance. Without correction, softmax saturates toward one-hot and its gradients vanish. Divide by √dₖ and the score variance returns to 1, and the model can keep learning. 12 min
    3. 03 Three projections of the same X Self-attention is the same operation as last lesson, applied in parallel. Every token is both a querier and a query target. Q, K, V are three linear projections of the same X. Causal masking and the permutation-equivariance bug come for free. 18 min
    4. 04 Position, three ways Self-attention is blind to order. Patch it. Sinusoidal encodings drop a multi-scale clock onto the input. Learned absolute encodings buy positions from a lookup table. RoPE rotates queries and keys inside the attention score so only relative position survives. 14 min
    5. 05 Multi-head: parallel subspaces, same budget A single attention head averages; that's its job. Several heads, run in parallel on disjoint subspaces of the same model dimension, give the layer a way to attend in several different patterns at once. Same total parameters, same total FLOPs, structural diversity for free. 10 min
    6. 06 The cost and the cache Attention is O(T²·dₖ), the number that defines modern LLM economics. At inference, only the new token issues a query, so K and V for past tokens can be cached. That single trick converts streaming generation from cubic to quadratic in sequence length. 10 min
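The "cubic to quadratic" claim in lesson 06 comes from a simple count. The sketch below tallies query-key score computations while generating `T` tokens one at a time. It is a simplified model under stated assumptions: it ignores the dₖ factor, ignores any savings from causal masking, and counts attention scores only, not the rest of the network.

```python
def score_count_without_cache(T):
    """Each step t re-runs full self-attention over t tokens: t queries x t keys."""
    return sum(t * t for t in range(1, T + 1))

def score_count_with_cache(T):
    """Each step t issues one new query against t cached keys."""
    return sum(t for t in range(1, T + 1))

for T in (64, 128, 256):
    no_cache = score_count_without_cache(T)
    cached = score_count_with_cache(T)
    print(T, no_cache, cached, round(no_cache / cached, 1))
```

For `T = 64` the counts are 89,440 without a cache and 2,080 with one. Doubling `T` multiplies the first number by roughly eight and the second by roughly four, which is the cubic-versus-quadratic gap.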
  16. Module 16 shipped

    The Transformer Block

    Compose attention with a position-wise FFN, wrap each in residual + layer norm, stack N times, top with a final LN and a tied unembedding. Pre-LN vs post-LN. The residual stream as the noun the model operates on. Where the parameters and FLOPs actually go.

    1. 01 One block, top to bottom Build the canonical transformer block from already-known parts. Two sub-layers (attention then a position-wise MLP), each wrapped in residual + layer norm. The block doesn't transform the residual stream; it adds a delta to it. 15 min
    2. 02 Why pre-LN Two transformer-block architectures differ by one wire, whether LayerNorm sits before each sub-layer or after the residual add. One trains without warmup at any depth. The other doesn't. Find out why. 14 min
    3. 03 The residual stream as the object Stop thinking of the residual stream as plumbing and start thinking of it as the noun the model is operating on. Every block reads from it, computes a delta, writes the delta back. The forward pass is a sum of corrections layered onto a bigram floor. 18 min
    4. 04 Stacking N and the full forward pass Stack the block N times. Add a final LayerNorm. Project back to vocabulary with a tied unembedding. That's GPT: the entire architecture, top to bottom. 16 min
    5. 05 Counting the cost Per-block, per-token, per-step. Where the parameters live (FFN, mostly), where the FLOPs live (FFN, also mostly), and the rule of thumb that lets you predict training compute from parameter count alone. 12 min
  17. Module 17 shipped

    Tokenization, Training & Sampling

    How text becomes integer ids (BPE). How the M16 forward pass becomes a working model (the training loop with AdamW, warmup, cosine decay, gradient clipping). How a trained model becomes text again (autoregressive sampling with temperature, top-k, top-p, and the KV cache that brings the attention cost of each new token down to O(T)).

    1. 01 Text becomes integers The model never sees characters or bytes. It sees integer ids into a fixed vocabulary, picked once when the tokenizer was trained. Three flavors of tokenizer (character, word, subword) and the costs of each. 14 min
    2. 02 How a tokenizer is built Byte-pair encoding is a deterministic, frequency-greedy merge loop. Train once, ship the merge table, replay it at inference. Byte-level BPE makes the tokenizer total, meaning every string is representable with no UNK token. The "strawberry" failure is a downstream consequence, not a bug. 18 min
    3. 03 Wrapping the training loop around the transformer From a 1-D stream of token ids to a working data loader, batched NLL across all positions, and AdamW with warmup, cosine decay, and gradient clipping. The training loop is the outside of the transformer; the forward pass from M16 sits inside it, untouched. 18 min
    4. 04 What the loss curve is telling you Loss has shape, not just direction. The expected trajectory on tiny-shakespeare-char. How to recognize overfitting from the train/val gap. Why "monotone down" is not the same as "working." The five canonical pathologies. 12 min
    5. 05 How distributions become text The autoregressive loop wrapped around the M16 forward pass. Why argmax fails for open-ended LMs. Temperature, top-k, top-p as logit-space transforms in a fixed pipeline. Beam search and why we don't use it. The KV cache as the reason each new token costs O(T) in attention instead of O(T²). 22 min
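Lesson 05 treats temperature and top-k as transforms applied to logits before softmax. A minimal sketch of that idea follows; the function names, the example logits, and the order of the two transforms are assumptions for illustration, and top-p is left out for brevity.

```python
import math

def softmax(logits):
    """Numerically stable softmax; a logit of -inf gets probability 0."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, temperature):
    """Divide logits by T: T < 1 sharpens the distribution, T > 1 flattens it."""
    return [z / temperature for z in logits]

def apply_top_k(logits, k):
    """Keep the k largest logits and mask the rest (ties at the cutoff are kept)."""
    cutoff = sorted(logits, reverse=True)[k - 1]
    return [z if z >= cutoff else float("-inf") for z in logits]

logits = [2.0, 1.0, 0.5, -1.0]   # made-up scores for a 4-token vocabulary

base = softmax(logits)
hot = softmax(apply_temperature(logits, 2.0))     # flatter
cold = softmax(apply_temperature(logits, 0.5))    # sharper
top2 = softmax(apply_top_k(logits, 2))            # long tail removed

print(base, hot, cold, top2)
```

In every case the most likely token stays the same; temperature only changes how much probability it holds, and top-k sets the probability of everything outside the top `k` to exactly zero.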
  18. Module 18 shipped

    Capstone: Train a Tiny Transformer in Your Browser

    4 layers, 4 heads, 64-embed, 64-context. ~209k parameters. Trains in roughly 5 minutes on WebGPU. Produces Shakespeare-flavored nonsense. Yours to keep.
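The ~209k figure can be sanity-checked with a back-of-the-envelope count. The sketch below assumes details this page does not state: a 65-symbol character vocabulary, learned position embeddings, a 4x FFN expansion, biases on every linear layer, and a tied unembedding that adds no parameters of its own. The function name `count_parameters` is made up for the example.

```python
def count_parameters(n_layer=4, d_model=64, context=64, vocab_size=65, ffn_mult=4):
    """Rough GPT-style parameter count under the assumptions listed above.

    The number of heads does not appear: splitting d_model across heads
    changes the shapes, not the total number of weights.
    """
    token_embedding = vocab_size * d_model       # also reused as the tied unembedding
    position_embedding = context * d_model       # assumed learned, one row per position
    layer_norm = 2 * d_model                     # scale and shift

    # Attention: fused Q, K, V projection plus the output projection, with biases.
    attention = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)

    # Position-wise FFN: expand to d_ff, then project back, with biases.
    d_ff = ffn_mult * d_model
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)

    block = 2 * layer_norm + attention + ffn
    final_layer_norm = layer_norm
    return token_embedding + position_embedding + n_layer * block + final_layer_norm

total = count_parameters()
print(total)
```

Under these assumptions the count is 208,320, in the neighborhood of the quoted ~209k; the remaining gap would come from choices the page does not spell out. Roughly two thirds of each block's weights sit in the FFN.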

    1. 01 Press start Boot the engine. Watch the first iteration take three seconds because every WGSL shader compiles on first dispatch. Then watch the rest take eighty milliseconds. The training loop you wrote in M17 is now running on the GPU you own. 15 min
    2. 02 Watch it learn Same model, longer run. Live samples appear every two hundred iterations so you can watch gibberish turn into bigrams turn into Shakespeare-flavored prose. Then six preset buttons that retrain under known pathologies so you recognize what broken training looks like in your own engine. 25 min
    3. 03 Your checkpoint A seed string is the whole identity of a training run. Type the same seed twice, you get byte-identical weights. Save the weights. They land on your disk as one tiny binary file. That file is yours. 15 min
    4. 04 Now make it talk The same trained model, with three knobs. Temperature reshapes the distribution. Top-k throws away the long tail. Top-p chooses how much probability mass to keep. Watch the bars vanish in real time as you slide. 30 min
    5. 05 The credits roll Six files. About four hundred lines of code. Every line is something you wrote, and something you understand. The model is still talking above the scroll. Read the code. 18 min