Rotary embeddings and shift both improve convergence #350
rom1504 started this conversation in Show and tell
Replies: 1 comment
@lucidrains implemented both the rotary embeddings and shift tokens techniques in recent commits.
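For readers unfamiliar with rotary embeddings: they rotate each query/key position by a position-dependent angle, so attention scores end up depending on relative offsets between tokens. A minimal sketch of the idea (illustrative only, not the exact DALLE-pytorch code; all function names here are made up):

```python
import torch

def rotary_freqs(seq_len, dim_head, base=10000):
    # one angular frequency per pair of feature dims
    inv_freq = 1.0 / (base ** (torch.arange(0, dim_head, 2).float() / dim_head))
    t = torch.arange(seq_len).float()
    freqs = torch.einsum('i,j->ij', t, inv_freq)   # (seq_len, dim_head // 2)
    return torch.cat((freqs, freqs), dim=-1)       # (seq_len, dim_head)

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary(x, freqs):
    # x: (batch, heads, seq_len, dim_head); applied to queries and keys
    # before the attention dot product, so scores depend on relative position
    return x * freqs.cos() + rotate_half(x) * freqs.sin()
```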
I ran 4 experiments on https://github.com/rom1504/kaggle-fashion-dalle (a small, high-quality 50k-sample dataset, easy to experiment with for dalle): a baseline, a run with shift tokens, a run with rotary embeddings, and a run with both.
Here is the wandb report https://wandb.ai/rom1504/good_kaggle/reports/Dalle-pytorch-adding-rotary-and-shift--Vmlldzo5MzkwNzk
Settings:
- depth 16
- dim head 64
- heads 8
- batch size 80
- 40 epochs (20k steps)
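For anyone who wants to reproduce a comparable setup, both options can be toggled on the DALLE constructor. A hedged sketch (the `rotary_emb` and `shift_tokens` kwargs match recent DALLE-pytorch versions, but check yours; `vae`, `dim`, `num_text_tokens`, and `text_seq_len` below are placeholder values, not the values used in these runs):

```python
from dalle_pytorch import DALLE

dalle = DALLE(
    dim = 512,                # model width (illustrative; not reported above)
    vae = vae,                # assumes a trained DiscreteVAE instance
    num_text_tokens = 10000,  # illustrative vocabulary size
    text_seq_len = 256,       # illustrative text sequence length
    depth = 16,
    heads = 8,
    dim_head = 64,
    rotary_emb = True,        # turn on rotary embeddings
    shift_tokens = True       # turn on shift tokens
)
```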
The result: shift tokens and rotary embeddings each improve convergence by about the same amount, and the effects are additive: using both gives roughly twice the improvement over the baseline.
Samples/s per run: [throughput chart in the wandb report]
Loss at step 20k: [loss curves in the wandb report]
So in general I advise turning these two options on.
I hope this is useful to DALLE-pytorch trainers :)
-
why do you think shift tokens improve performance?
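For context on that question: the shift tokens trick lets each token cheaply see part of the previous token's features before attention, which is a plausible source of the convergence gain. A minimal sketch of the operation (illustrative, assuming the half-split variant used in lucidrains' implementations, not necessarily the exact DALLE-pytorch code):

```python
import torch
import torch.nn.functional as F

def shift_tokens(x, amount=1):
    # x: (batch, seq_len, dim)
    # half the feature dims are shifted forward by `amount` positions along
    # the sequence, so each token mixes in features from its predecessor;
    # front-padding keeps the operation causal (no information from the future)
    x_shift, x_pass = x.chunk(2, dim=-1)
    x_shift = F.pad(x_shift, (0, 0, amount, -amount), value=0.)
    return torch.cat((x_shift, x_pass), dim=-1)
```

The shift is typically applied to the input of each attention and feedforward block, so it adds essentially no parameters or compute.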