I trained a transformer in an email

Tiny PS: If you love solving similarly fun (but massively more useful) problems, consider joining us in Nevis, by sending me an email at adamoshadjivasiliou@gmail.com (no LLM needed inside it) or here.

Turns out you can kind of train a transformer and kind of perform inference on it inside an email. If you are wondering why anyone would do this, there is no answer. I was just interested to see if it was possible, because it would be funny.

So what are the minimal requirements? For something to count as a transformer, let's say it has to contain: token and position embeddings, Q/K/V attention with softmax, a residual stream, an MLP, output logits, and gradient updates across all of those when it trains. Easy.

And for the email part, let's say all compute has to happen locally, purely from the content of a valid email... Easy?

1. Can an email compute?

Can you have PyTorch inside an email? Of course not. But can you have Python? Also no. But at least you can have some JavaScript with some tricks, right? No.

Fortunately, we just need a way to evaluate an expression and save the result somewhere. CSS is technically Turing complete and can do arithmetic, mainly through calc(), but only inside CSS property values, not really over numbers we put somewhere ourselves.

What we can get is AMP for Email, a stripped-down version of Google's mobile-page framework that runs inside Gmail, Yahoo, and Fastmail. Pretty cool stuff; you can even call an API this way. The relevant piece for us, though, is amp-bind: on a click, it evaluates an expression over values in state, writes the result somewhere, and re-renders anything bound to that state. The expression language is small: arithmetic, comparisons, ternaries, array indexing, string slicing, and a handful of math functions (pow, min, max, abs).

Some brutal constraints:

No loops, so anything repeated has to be unrolled into explicit steps.
Each expression caps at 250 characters, which is tough: longer computations have to be split into several expressions, and each piece needs its own button.

The good news is that this is still enough to add two numbers, compare them, pick a result, and write it into state. So we are in business.

For example, this button computes Q.

<button class="button"
  [hidden]="!(s.inferStage=='encoded')"
  on="tap:AMP.setState({s:{
    stage:'query',
    q0:(s.x20*s.wq00+s.x21*s.wq01),
    q1:(s.x20*s.wq10+s.x21*s.wq11),
    inferStage:'query'
  }})">Compute Q</button>
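With no loops, every dot product has to be written out term by term, and a 2x2 projection is about the last size anyone would want to type by hand. Here is a minimal Python sketch of generating those expressions, assuming the wq{row}{col} and x{position}{dim} naming visible in the button above. The helper names (unrolled_dot, q_expressions, MAX_EXPR_LEN) are hypothetical and not taken from the actual email.

```python
MAX_EXPR_LEN = 250  # per-expression cap quoted in the text above


def unrolled_dot(inputs, weights):
    """Return an amp-bind style expression for sum(inputs[i] * weights[i])."""
    terms = [f"s.{x}*s.{w}" for x, w in zip(inputs, weights)]
    expr = "(" + "+".join(terms) + ")"
    if len(expr) > MAX_EXPR_LEN:
        # Too long for one expression: it would need to be split across buttons.
        raise ValueError(f"expression is {len(expr)} chars, over the cap")
    return expr


def q_expressions(pos=2, dim=2):
    """Unroll q_r = sum_c x[pos][c] * wq[r][c] into one expression per row."""
    xs = [f"x{pos}{c}" for c in range(dim)]
    return {
        f"q{r}": unrolled_dot(xs, [f"wq{r}{c}" for c in range(dim)])
        for r in range(dim)
    }


if __name__ == "__main__":
    for name, expr in q_expressions().items():
        print(f"{name}:{expr},")
```

With the defaults, this reproduces the two q0 and q1 expressions from the Compute Q button.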

2. Can an email store the model?

We need somewhere to keep weights, activations, gradients, the current context, and the generated string between clicks.

So we put the whole model into an amp-state blob, which is basically JSON: token embeddings, position embeddings, Q/K/V projection matrices, MLP weights and biases, output weights and biases, loss, gradients, and every intermediate activation from the last forward pass. Buttons read from it and write back.

The whole model, in one JSON blob.

<amp-state id="s"><script type="application/json">{
  "vocab":"abcde",
  "context":"abc",
  "generated":"",
  "trainSteps":0,
  "lr":0.045,
  "emb0":[0.12,-0.18,0.24,-0.08,0.16],
  "emb1":[-0.05,0.09,-0.14,0.19,0.04],
  "pos0":[-0.09,0.04,0.11],
  "pos1":[0.03,-0.06,0.08],
  "wq00":0.41,"wq01":-0.22,"wq10":0.18,"wq11":0.33,
  "wk00":-0.37, ... ,
  "wv00":0.53, ... ,
  "w100":0.72, ... , "b10":0.02, "b11":-0.01,
  "wout":[[-0.31,0.22], ... ,[-0.06,-0.25]],
  "x00":0,"x01":0, ... ,"q0":0,"q1":0,
  "aw0":1,"aw1":1,"aw2":1, ... ,
  "loss":1.609438
}</script></amp-state>
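To make the read-and-write-back loop concrete, here is a rough Python stand-in for one button acting on that blob. It assumes AMP.setState merges the given object into existing state rather than replacing it, and it mirrors the [hidden] gate of the Compute Q button. The wq values come from the blob above; the x20/x21 activations are made up for illustration, and set_state and compute_q_button are hypothetical names.

```python
def set_state(state, patch):
    """Merge patch into state, recursing into nested dicts (a deep merge)."""
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(state.get(key), dict):
            set_state(state[key], value)
        else:
            state[key] = value
    return state


def compute_q_button(state):
    """Stand-in for the 'Compute Q' button: gated on stage, reads s, writes s."""
    s = state["s"]
    if s["inferStage"] != "encoded":  # mirrors the [hidden] binding
        return state  # the button is hidden, so it cannot be tapped
    return set_state(state, {"s": {
        "stage": "query",
        "q0": s["x20"] * s["wq00"] + s["x21"] * s["wq01"],
        "q1": s["x20"] * s["wq10"] + s["x21"] * s["wq11"],
        "inferStage": "query",
    }})


if __name__ == "__main__":
    # wq values are from the blob above; x20 and x21 are made-up activations.
    state = {"s": {"inferStage": "encoded", "x20": 1.0, "x21": 2.0,
                   "wq00": 0.41, "wq01": -0.22, "wq10": 0.18, "wq11": 0.33,
                   "q0": 0, "q1": 0}}
    compute_q_button(state)
    print(state["s"]["q0"], state["s"]["q1"], state["s"]["inferStage"])
```

Everything the patch does not mention (the weights, for instance) survives the click untouched, which is what lets a chain of buttons pass values to each other.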

3. Can an email do softmax or similar?

Attention needs softmax. AMP has pow, max, min, and abs, but no exp.

So we cannot do a real softmax, but any monotonic exponential would kind of do. Instead of exp(score) we use pow(2, score). It of course has a lot of drawbacks, but we move on.

Softmax with pow(2, x) instead of exp, clipped so it doesn't explode.

aw0: pow(2, max(-8, min(8, (s.q0*s.k00+s.q1*s.k01)/s.temp))),
aw1: pow(2, max(-8, min(8, (s.q0*s.k10+s.q1*s.k11)/s.temp))),
aw2: pow(2, max(-8, min(8, (s.q0*s.k20+s.q1*s.k21)/s.temp))),

aden: (s.aw0 + s.aw1 + s.aw2),
a0:   s.aw0 / s.aden,
a1:   s.aw1 / s.aden,
a2:   s.aw2 / s.aden
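For comparison, the same computation in a few lines of Python. Since pow(2, x) is exp(x * ln 2), the base-2 version matches an ordinary softmax whose temperature is divided by ln 2, as long as the clip is not active. This is a sketch of the idea, not the email's code; the function names and the example scores are made up.

```python
import math


def softmax_base2(scores, temp=1.0, clip=8.0):
    """Softmax with pow(2, x) in place of exp(x), clipped to [-clip, clip]."""
    weights = [2.0 ** max(-clip, min(clip, s / temp)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def softmax_exp(scores, temp=1.0):
    """Ordinary softmax with exp, for comparison."""
    weights = [math.exp(s / temp) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    scores = [0.3, -1.2, 2.0]  # made-up attention scores
    print(softmax_base2(scores))
    # Same distribution from exp once the temperature is divided by ln 2.
    print(softmax_exp(scores, temp=1.0 / math.log(2)))
```

The clip bounds each unnormalized weight to the range 2^-8 to 2^8, so the denominator can never underflow to zero or blow up.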

4. Can it train the model?

With no autodiff in AMP, the chain rule has to be spelled out by hand, one gradient term per button, each computing the next term from values already in state. And every operation is basically its own button as well. This is not a way I would recommend learning backprop. It is, however, very effective at making sure you remember it.

Chain rule, written out.

e0: (s.p0 - (s.target==0?1:0)) * 0.69314718056 / s.outTemp,
...

gh0: s.e0*s.wout0[0] + s.e1*s.wout0[1] + ... + s.e4*s.wout0[4],
gh1: s.e0*s.wout1[0] + s.e1*s.wout1[1] + ... + s.e4*s.wout1[4],

relu0: (s.fpre0 > 0 ? 1 : 0),
relu1: (s.fpre1 > 0 ? 1 : 0),

gp0: (s.gh0*s.w200 + s.gh1*s.w210) * s.relu0,
gp1: (s.gh0*s.w201 + s.gh1*s.w211) * s.relu1,

wout0: [s.wout0[0] - s.lr*s.e0*s.h0,
        s.wout0[1] - s.lr*s.e1*s.h0, ... ]
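The 0.69314718056 in the e0 expression is ln 2, which is what falls out of differentiating pow(2, x) rather than exp(x). The finite-difference check below illustrates this, assuming the loss is natural-log cross-entropy over a base-2 softmax of the logits divided by outTemp. That assumption is consistent with the initial loss of 1.609438, which is ln 5 for a uniform guess over five tokens. Clipping is left out, and all function names and example logits are hypothetical.

```python
import math

LN2 = 0.69314718056  # constant used in the e0 expression above


def probs_base2(logits, out_temp):
    """Base-2 softmax over logits / out_temp (no clipping in this sketch)."""
    weights = [2.0 ** (z / out_temp) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


def loss(logits, target, out_temp):
    """Natural-log cross-entropy of the target token."""
    return -math.log(probs_base2(logits, out_temp)[target])


def analytic_grad(logits, target, out_temp):
    """e_i = (p_i - onehot_i) * ln2 / outTemp, as in the e0 expression."""
    p = probs_base2(logits, out_temp)
    return [(p[i] - (1 if i == target else 0)) * LN2 / out_temp
            for i in range(len(logits))]


def numeric_grad(logits, target, out_temp, eps=1e-6):
    """Central finite differences of the loss with respect to each logit."""
    grads = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += eps
        down[i] -= eps
        diff = loss(up, target, out_temp) - loss(down, target, out_temp)
        grads.append(diff / (2 * eps))
    return grads


if __name__ == "__main__":
    logits = [0.5, -0.3, 1.2, 0.0, -1.0]  # made-up logits for five tokens
    print(analytic_grad(logits, 2, 0.7))
    print(numeric_grad(logits, 2, 0.7))
```

The two printed gradients should agree to several decimal places. Forty-four hand-written buttons get no such safety net unless you build one.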

5. Can all of this fit inside an email?

AMP caps the total document at 200KB and custom CSS at 75KB. Since we have to unroll pretty much everything, this is also brutal. So for now we will run with a five-token vocabulary and a three-token context (we could improve a bit on this, but why would anyone need more than 3?), one attention head with two-dimensional Q/K/V, a small MLP, and five output logits. It comes to around 45KB. The corpus is abcdeabcdeabcde, because we are not exactly doing any fancy post-training here (or pre, or mid).
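To get a feel for how little data that is, here is a sketch of turning the corpus into (context, target) pairs with a three-token window. The post does not say how training examples are actually drawn, so the sliding window is an assumption, and training_pairs is a hypothetical helper.

```python
VOCAB = "abcde"
CORPUS = "abcdeabcdeabcde"


def training_pairs(corpus, context_len=3):
    """Slide a window over the corpus: context_len chars in, next char out."""
    return [
        (corpus[i:i + context_len], corpus[i + context_len])
        for i in range(len(corpus) - context_len)
    ]


if __name__ == "__main__":
    pairs = training_pairs(CORPUS)
    for context, target in sorted(set(pairs)):
        # The target index is what an expression like s.target==0 compares to.
        print(context, "->", target, "(index", VOCAB.index(target), ")")
```

Under that assumption the whole dataset is five distinct examples, each context mapping to exactly one next character.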

No loops, so each loss bar is its own span with a ternary cascade.

<span class="loss-bar loss-bar-empty"
  [class]="s.lossCount>5
    ? s.loss0<0.2 ? 'loss-bar loss-bar-1'
    : s.loss0<0.4 ? 'loss-bar loss-bar-2'
    : s.loss0<0.6 ? 'loss-bar loss-bar-3'
    : ...
    : 'loss-bar loss-bar-12'
    : 'loss-bar loss-bar-empty'"></span>
<!-- ...repeated for loss1..loss5 -->
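A cascade like this is tedious to write six times, so it is a natural thing to generate. The sketch below builds the inner ternary chain from a list of thresholds and class names. It uses only the three thresholds visible above plus the final fallback; the elided middle buckets are not guessed, so the output is illustrative and not the email's real markup. ternary_cascade is a hypothetical helper.

```python
def ternary_cascade(var, thresholds, classes, fallback):
    """Build "var<t0 ? 'c0' : var<t1 ? 'c1' : ... : 'fallback'"."""
    if len(thresholds) != len(classes):
        raise ValueError("need exactly one class per threshold")
    parts = [f"{var}<{t} ? '{c}'" for t, c in zip(thresholds, classes)]
    return " : ".join(parts + [f"'{fallback}'"])


if __name__ == "__main__":
    thresholds = [0.2, 0.4, 0.6]  # only the thresholds shown in the snippet
    classes = [f"loss-bar loss-bar-{i}" for i in (1, 2, 3)]
    for n in range(6):  # loss0..loss5, one span each
        print(ternary_cascade(f"s.loss{n}", thresholds, classes,
                              "loss-bar loss-bar-12"))
```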

6. So can it predict the next character?

Yes, kind of. Well, predict is a strong word. I would not say it's the future of local models, but we can definitely do a forward pass and get a result. Here it is.

It's 44 clicks for one training pass and 18 clicks to predict one token. AGI, literally. So if you were wondering about the speed, tokens/s depends on your clicking skills, which kind of makes it the first model whose performance depends on you and not your machine's specs.

Download

You can test the email in https://amp.gmail.dev/playground/ and send it to yourself.