Grokking the Algorithm: How a Transformer Implements Modular Addition


The part that really confused me was this expression:

$$\text{logit}(c) = \cos\big(w(a + b - c)\big)$$

Why are we comparing against all possible values of $c$? And where do trigonometric identities even come into play?

Let me walk through how I finally understood it.


The Setup

We’re working modulo some number $p$. So the task is:

$$(a + b) \bmod p$$

For example, if $p = 10$, then all values live in:

$$\{0, 1, 2, \ldots, 9\}$$

At first, I thought maybe $a$ and $b$ could be arbitrary numbers like 24 or 67, but in the model they’re actually elements of this finite set.
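
Stated as code, the whole task is one line (the specific values here are just the running example):

```python
p = 10
a, b = 7, 8
print((a + b) % p)  # 5: the function the network has to learn
```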


Step 1 — Numbers Become Angles

The model doesn’t treat numbers as integers. Instead, it maps them to the unit circle:

$$a \mapsto (\cos(wa),\ \sin(wa))$$

where:

$$w = \frac{2\pi k}{p}$$

for some frequency $k$ the model learns. So each number corresponds to a point on a circle. This already hints at modular behavior, since angles naturally wrap around.
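
Here’s a sketch of this embedding in numpy. The function name `embed` and the single frequency $k = 1$ are my simplifications; a trained model stores these values as rows of its embedding matrix, spread across several frequencies:

```python
import numpy as np

p = 10  # modulus from the running example
k = 1   # one learned frequency; trained models typically use several
w = 2 * np.pi * k / p

def embed(a: int) -> np.ndarray:
    """Map a number to its point on the unit circle."""
    return np.array([np.cos(w * a), np.sin(w * a)])

# Angles wrap around, so numbers that differ by p land on the same point:
assert np.allclose(embed(3), embed(3 + p))
```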


Where I Got Stuck

I saw this part:

“compare each candidate $c$ using $\cos\big(w(a + b - c)\big)$”

And I thought: is the model really going to check every single possible answer, one by one?

The answer is: yes, it is.

But that’s not inefficient—it’s exactly how transformers work.


Step 2 — The Model Doesn’t Output a Number

This was the key realization.

The model does not directly compute:

$$(a + b) \bmod p$$

Instead, it produces a score (a logit) for every possible output:

$$\text{logit}(c) \quad \text{for each } c \in \{0, 1, \ldots, p-1\}$$

Then it picks the largest one.

So naturally, it must compare against every candidate $c$.
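
In code, the decision rule is nothing more than an argmax over $p$ scores. The scores below are made-up placeholders; Steps 3 and 4 show what actually produces them:

```python
import numpy as np

p = 10

def predict(logits: np.ndarray) -> int:
    """The model's final step: one score per candidate c, take the max."""
    return int(np.argmax(logits))

# Toy illustration: if candidate 5 happens to get the largest score,
# the model answers 5.
toy_logits = np.full(p, -1.0)
toy_logits[5] = 1.0
assert predict(toy_logits) == 5
```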


Step 3 — Where Trigonometric Identities Show Up

This was the second big confusion.

We start with:

$$(\cos(wa),\ \sin(wa)) \quad \text{and} \quad (\cos(wb),\ \sin(wb))$$

Now the model needs to combine them into something representing $a + b$.

But it doesn’t know $a + b$ explicitly.

So how does it do it?

This is where the identities come in:

$$\cos(w(a+b)) = \cos(wa)\cos(wb) - \sin(wa)\sin(wb)$$
$$\sin(w(a+b)) = \sin(wa)\cos(wb) + \cos(wa)\sin(wb)$$

Using these, the model can construct a representation of $a + b$ in angle space:

$$(\cos(w(a+b)),\ \sin(w(a+b)))$$

This isn’t the model adding numbers; it’s constructing a new point on the circle that encodes their sum. The MLP weights effectively learn to implement these identities: given the embeddings of $a$ and $b$ separately, the nonlinear layers compute the products and linear combinations that produce this joint representation.
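
Here is that construction as runnable code. To be clear about what’s mine: `combine` is a hand-written stand-in for what the trained MLP computes; the real network learns an approximation of these products rather than executing them symbolically:

```python
import numpy as np

p, k = 10, 1
w = 2 * np.pi * k / p

def embed(x: int) -> np.ndarray:
    return np.array([np.cos(w * x), np.sin(w * x)])

def combine(ea: np.ndarray, eb: np.ndarray) -> np.ndarray:
    """The angle-addition identities, written as the products and sums
    an MLP would have to learn. No integer a + b appears anywhere."""
    cos_a, sin_a = ea
    cos_b, sin_b = eb
    return np.array([
        cos_a * cos_b - sin_a * sin_b,  # cos(w(a+b))
        sin_a * cos_b + cos_a * sin_b,  # sin(w(a+b))
    ])

a, b = 7, 8
# The combined point is exactly the embedding of (a + b) mod p:
assert np.allclose(combine(embed(a), embed(b)), embed((a + b) % p))
```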


Step 4 — Comparing Against Each Candidate

Now suppose the model has computed a vector representing $a + b$:

$$\mathbf{v} = (\cos(w(a+b)),\ \sin(w(a+b)))$$

Each candidate $c$ also has a representation. These come from the unembedding matrix: the same weight matrix that projects the residual stream back into vocabulary space provides a geometric vector for each possible output $c$:

$$\mathbf{u}_c = (\cos(wc),\ \sin(wc))$$

The model compares them using a dot product:

$$\text{logit}(c) = \mathbf{v} \cdot \mathbf{u}_c$$

Expanding this:

$$\cos(w(a+b))\cos(wc) + \sin(w(a+b))\sin(wc)$$

Using another identity:

$$\cos(x - y) = \cos x \cos y + \sin x \sin y$$

this becomes:

$$\cos\big(w(a + b - c)\big)$$

So that’s where the cosine comes from.
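
The whole chain (dot product, expansion, identity) can be checked numerically in a few lines, using the same toy setup as before:

```python
import numpy as np

p, k = 10, 1
w = 2 * np.pi * k / p
a, b = 7, 8

# The computed representation of a + b:
v = np.array([np.cos(w * (a + b)), np.sin(w * (a + b))])

for c in range(p):
    u_c = np.array([np.cos(w * c), np.sin(w * c)])  # candidate c's direction
    # dot product == cos(w(a + b - c)), by the identity above:
    assert np.isclose(v @ u_c, np.cos(w * (a + b - c)))
```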


Step 5 — Why This Picks the Correct Answer

Cosine is maximized when its argument is zero (mod $2\pi$):

$$\cos(\theta) = 1 \iff \theta \equiv 0 \pmod{2\pi}$$

So:

$$\cos\big(w(a + b - c)\big)$$

is maximized when:

$$w(a + b - c) \equiv 0 \pmod{2\pi}$$

Since:

$$w = \frac{2\pi k}{p}$$

this becomes:

$$k(a + b - c) \equiv 0 \pmod{p}$$

which means (as long as $k$ shares no factor with $p$, so it can be cancelled):

$$c \equiv a + b \pmod{p}$$

So the correct $c$ gets the highest score.
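
A brute-force check that this really does pick the right answer for every pair. I use $k = 3$ here to show that any frequency coprime with $p$ works, not just $k = 1$:

```python
import numpy as np

p, k = 10, 3  # gcd(3, 10) = 1, so k can be cancelled mod p
w = 2 * np.pi * k / p

for a in range(p):
    for b in range(p):
        logits = np.cos(w * (a + b - np.arange(p)))
        assert int(np.argmax(logits)) == (a + b) % p
```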


A Concrete Example

Let’s take:

$$p = 10, \quad a = 7, \quad b = 8$$

Then:

$$a + b = 15, \qquad 15 \bmod 10 = 5$$

The model computes scores:

$$\text{logit}(c) = \cos\!\left(\frac{2\pi}{10}(15 - c)\right), \quad c = 0, 1, \ldots, 9$$

The argument hits a multiple of $2\pi$ exactly when $c = 5$, so that score is $1$ and every other score is strictly smaller.

So it picks $c = 5$.
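
The same example in code, printing every candidate’s score ($k = 1$ for readability):

```python
import numpy as np

p = 10
a, b = 7, 8
w = 2 * np.pi / p

scores = np.cos(w * (a + b - np.arange(p)))
print(np.round(scores, 2))
# approximately: [-1. -0.81 -0.31  0.31  0.81  1.  0.81  0.31 -0.31 -0.81]
print(int(np.argmax(scores)))  # 5
```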


The Final Picture

What’s actually happening is:

$$\text{logit}(c) = \cos\big(w(a + b - c)\big)$$

The model:

  1. maps numbers to points on a circle via the embedding matrix
  2. uses the MLP to compute a representation of $a + b$ in that same space
  3. compares the result against all possible outputs $c$ via the unembedding matrix
  4. picks the closest match (the sketch after this list puts all four steps together)
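
Here is a minimal end-to-end sketch of those four steps, with a single hand-picked frequency standing in for the several a trained network learns, and hand-coded weights standing in for learned ones:

```python
import numpy as np

p, k = 10, 1
w = 2 * np.pi * k / p

def forward(a: int, b: int) -> int:
    # 1. embedding matrix: numbers -> points on the circle
    ea = np.array([np.cos(w * a), np.sin(w * a)])
    eb = np.array([np.cos(w * b), np.sin(w * b)])
    # 2. "MLP": the trig identities build the point for a + b
    v = np.array([ea[0] * eb[0] - ea[1] * eb[1],
                  ea[1] * eb[0] + ea[0] * eb[1]])
    # 3. unembedding matrix: dot v against every candidate's direction
    U = np.array([[np.cos(w * c), np.sin(w * c)] for c in range(p)])
    logits = U @ v
    # 4. pick the closest match
    return int(np.argmax(logits))

# Works for every pair of residues:
assert all(forward(a, b) == (a + b) % p for a in range(p) for b in range(p))
```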

Why This Is So Interesting

The model didn’t memorize addition.

It discovered a representation where:

$$\text{adding numbers} \;=\; \text{rotating around a circle}$$

In other words, it learned structure, not just answers.

This post covers the algorithm itself. The deeper question — why the model discovers this structure through training, and what the grokking phase transition has to do with it — is where things get even more interesting. That’s coming next.


The Moment It Clicked

The key realization for me was:

The model isn’t solving for $(a + b) \bmod p$.
It’s asking: which $c$ looks most like my computed result?

Once I saw that, everything else fell into place.