Grokking the Algorithm: How a Transformer Implements Modular Addition
The part that really confused me was this expression:

$$\text{logit}(c) \;\propto\; \cos\big(w(a + b - c)\big)$$

Why are we comparing against all possible values of $c$?
Let me walk through how I finally understood it.
The Setup
We’re working modulo some number $p$: given two inputs $a$ and $b$, the model must output $c = (a + b) \bmod p$.
For example, if $p = 7$, then $5 + 4 = 2$, because $9 \bmod 7 = 2$.
At first, I thought maybe the model would just memorize a giant lookup table of sums.
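To pin down the task, here is a tiny sketch of the function the network has to learn (the function name is mine, and I reuse the $p = 7$ example from above purely for illustration):

```python
def modular_addition(a: int, b: int, p: int = 7) -> int:
    """The target function: add two residues and wrap around modulo p."""
    return (a + b) % p

print(modular_addition(5, 4))  # 2, because 9 mod 7 = 2
```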
Step 1 — Numbers Become Angles
The model doesn’t treat numbers as integers. Instead, it maps them to the unit circle:

$$a \;\mapsto\; \big(\cos(wa),\; \sin(wa)\big)$$

where:

$$w = \frac{2\pi}{p}$$
So each number corresponds to a point on a circle. This already hints at modular behavior, since angles naturally wrap around.
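Here is a small numpy sketch of that mapping, using the $p = 7$ toy setup from above (my own illustration, not the model's actual embedding weights):

```python
import numpy as np

p = 7
w = 2 * np.pi / p

def embed(a: int) -> np.ndarray:
    """Map a residue to a point on the unit circle."""
    return np.array([np.cos(w * a), np.sin(w * a)])

# Adding p to a number lands on the same point: the wrap-around is built in.
print(np.allclose(embed(3), embed(3 + p)))  # True
```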
Where I Got Stuck
I saw this part:

“compare each candidate $c$ against the computed result, using $\cos\big(w(a + b - c)\big)$”
And I thought:
- Why are we looping over all $c$?
- Where did this cosine come from?
- Is the model somehow testing every possible answer?
The answer is: yes—it is.
But that’s not inefficient—it’s exactly how transformers work.
Step 2 — The Model Doesn’t Output a Number
This was the key realization.
The model does not directly compute:

$$c = (a + b) \bmod p$$

Instead, it produces a score (a logit) for every possible output:

$$\text{logit}(0),\; \text{logit}(1),\; \ldots,\; \text{logit}(p - 1)$$
Then it picks the largest one.
So naturally, it must compare against every candidate $c$.
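In code, this final step looks like any other classifier head (a generic sketch with made-up logit values, not the real model's numbers):

```python
import numpy as np

def predict(logits: np.ndarray) -> int:
    """Pick the highest-scoring candidate answer."""
    return int(np.argmax(logits))

# One made-up logit per residue class 0..6; the prediction is whichever scores highest.
logits = np.array([-0.2, 3.1, 0.4, -1.0, 0.0, 0.7, -0.5])
print(predict(logits))  # 1
```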
Step 3 — Where Trigonometric Identities Show Up
This was the second big confusion.
We start with:

$$\big(\cos(wa),\; \sin(wa)\big) \quad\text{and}\quad \big(\cos(wb),\; \sin(wb)\big)$$

Now the model needs to combine them into something representing $a + b$.
But it doesn't know $a + b$ as a number; all it has are these two points on the circle.
So how does it do it?
This is where the identities come in:

$$\cos\big(w(a+b)\big) = \cos(wa)\cos(wb) - \sin(wa)\sin(wb)$$
$$\sin\big(w(a+b)\big) = \sin(wa)\cos(wb) + \cos(wa)\sin(wb)$$

Using these, the model can construct a representation of $a + b$: the point $\big(\cos(w(a+b)),\; \sin(w(a+b))\big)$.
This isn't the model adding numbers; it's constructing a new point on the circle that encodes their sum. The MLP weights effectively learn to implement these identities: given the embeddings of $a$ and $b$, the MLP produces (approximately) $\cos(w(a+b))$ and $\sin(w(a+b))$.
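A quick numerical sketch of what that computation amounts to, with the identities written out by hand (the trained MLP only approximates this with learned weights; the function name and inputs are my choices):

```python
import numpy as np

p = 7
w = 2 * np.pi / p

def combine(a: int, b: int) -> np.ndarray:
    """Build the point for a+b from the two embedded points, via the angle-addition identities."""
    ca, sa = np.cos(w * a), np.sin(w * a)
    cb, sb = np.cos(w * b), np.sin(w * b)
    return np.array([ca * cb - sa * sb,   # cos(w(a+b))
                     sa * cb + ca * sb])  # sin(w(a+b))

a, b = 3, 5
print(np.allclose(combine(a, b), [np.cos(w * (a + b)), np.sin(w * (a + b))]))  # True
```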
Step 4 — Comparing Against Each Candidate
Now suppose the model has computed a vector representing $a + b$:

$$\big(\cos(w(a+b)),\; \sin(w(a+b))\big)$$

Each candidate answer $c$ also corresponds to a point on the circle, via the unembedding:

$$\big(\cos(wc),\; \sin(wc)\big)$$
The model compares them using a dot product:

$$\text{logit}(c) = \big(\cos(w(a+b)),\, \sin(w(a+b))\big) \cdot \big(\cos(wc),\, \sin(wc)\big)$$

Expanding this:

$$\text{logit}(c) = \cos\big(w(a+b)\big)\cos(wc) + \sin\big(w(a+b)\big)\sin(wc)$$

Using another identity:

$$\cos(x - y) = \cos x \cos y + \sin x \sin y$$

this becomes:

$$\text{logit}(c) = \cos\big(w(a+b) - wc\big) = \cos\big(w(a + b - c)\big)$$
So that’s where the cosine comes from.
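Here is a small numerical check of that equivalence in the toy setup (my own script, standing in for what the unembedding matrix does):

```python
import numpy as np

p = 7
w = 2 * np.pi / p
a, b = 3, 5

sum_vec = np.array([np.cos(w * (a + b)), np.sin(w * (a + b))])
candidates = np.arange(p)

# Dot product of the computed sum-vector with each candidate's direction...
logits = np.array([sum_vec @ np.array([np.cos(w * c), np.sin(w * c)]) for c in candidates])

# ...equals cos(w * (a + b - c)) for every c.
print(np.allclose(logits, np.cos(w * (a + b - candidates))))  # True
```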
Step 5 — Why This Picks the Correct Answer
Cosine is maximized when its argument is zero (mod $2\pi$).

So:

$$\text{logit}(c) = \cos\big(w(a + b - c)\big)$$

is maximized when:

$$w(a + b - c) \equiv 0 \pmod{2\pi}$$

Since:

$$w = \frac{2\pi}{p}$$

this becomes:

$$\frac{2\pi}{p}\,(a + b - c) \equiv 0 \pmod{2\pi}$$

which means:

$$a + b \equiv c \pmod{p}$$

So the correct answer $c = (a + b) \bmod p$ receives the highest score.
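To convince myself, I checked this numerically over every input pair (a throwaway verification script, not part of the derivation):

```python
import numpy as np

p = 7
w = 2 * np.pi / p
candidates = np.arange(p)

# For every pair (a, b), the argmax of cos(w(a+b-c)) over c is exactly (a+b) mod p.
ok = all(int(np.argmax(np.cos(w * (a + b - candidates)))) == (a + b) % p
         for a in range(p) for b in range(p))
print(ok)  # True
```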
A Concrete Example
Let’s take:

$$p = 7, \quad a = 3, \quad b = 5, \quad \text{so the correct answer is } (3 + 5) \bmod 7 = 1$$

Then:

$$w = \frac{2\pi}{7}$$

The model computes scores $\cos\big(w(a + b - c)\big)$ for each candidate $c$, for example:

- $c = 1$: $\cos(0) = 1$ ← maximum
- $c = 2$: $\cos(2\pi/7) \approx 0.62$
- $c = 5$: $\cos(6\pi/7) \approx -0.90$ (negative)

So it picks $c = 1$.
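The same example as a short script that prints the whole score table (my own check, using the values chosen above):

```python
import numpy as np

p = 7
w = 2 * np.pi / p
a, b = 3, 5

for c in range(p):
    score = np.cos(w * (a + b - c))
    marker = " <- maximum" if c == (a + b) % p else ""
    print(f"c = {c}: {score:+.2f}{marker}")
```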
The Final Picture
What’s actually happening is:
The model:
- maps numbers to points on a circle via the embedding matrix
- uses the MLP to compute a representation of $a + b$ in that same space
- compares the result against all possible outputs via the unembedding matrix
- picks the closest match (the full pipeline is sketched in code below)
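Putting the steps together, here is a minimal end-to-end sketch of the algorithm. To be clear, this is an idealized hand-written version of what the trained network approximates with learned weights, not the network itself:

```python
import numpy as np

def grokked_mod_add(a: int, b: int, p: int = 7) -> int:
    w = 2 * np.pi / p

    # Step 1: embed each input as a point on the unit circle.
    ca, sa = np.cos(w * a), np.sin(w * a)
    cb, sb = np.cos(w * b), np.sin(w * b)

    # Step 3: combine them via the angle-addition identities,
    # producing the point for a+b without ever forming the integer a+b.
    cos_sum = ca * cb - sa * sb
    sin_sum = sa * cb + ca * sb

    # Steps 2 and 4: score every candidate c by a dot product with its direction.
    logits = np.array([cos_sum * np.cos(w * c) + sin_sum * np.sin(w * c)
                       for c in range(p)])

    # Step 5: the highest score is the candidate best aligned with the computed point.
    return int(np.argmax(logits))

print(all(grokked_mod_add(a, b) == (a + b) % 7 for a in range(7) for b in range(7)))  # True
```

Each stage of this script corresponds to one of the bullet points above: the embedding supplies the cosines and sines, the middle lines play the role of the MLP, and the scoring loop plays the role of the unembedding.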
Why This Is So Interesting
The model didn’t memorize addition.
It discovered a representation where:
- modular arithmetic becomes geometry
- addition becomes rotation
- equality becomes alignment
In other words, it learned structure—not just answers.
This post covers the algorithm itself. The deeper question — why the model discovers this structure through training, and what the grokking phase transition has to do with it — is where things get even more interesting. That’s coming next.
The Moment It Clicked
The key realization for me was:
The model isn’t solving for $c$.
It’s asking: which $c$ looks most like my computed result?
Once I saw that, everything else fell into place.