Rotary embeddings and shift both improve convergence #350
rom1504 started this conversation in Show and tell
Replies: 1 comment
@lucidrains implemented both the rotary embeddings and shift tokens techniques in recent commits.
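For readers unfamiliar with rotary embeddings: they rotate each query/key position by a position-dependent angle, so attention scores end up depending on relative offsets between tokens. A minimal sketch of the idea (illustrative only, not the exact DALLE-pytorch code; all function names here are made up):

```python
import torch

def rotary_freqs(seq_len, dim_head, base=10000):
    # one angular frequency per pair of feature dims
    inv_freq = 1.0 / (base ** (torch.arange(0, dim_head, 2).float() / dim_head))
    t = torch.arange(seq_len).float()
    freqs = torch.einsum('i,j->ij', t, inv_freq)   # (seq_len, dim_head // 2)
    return torch.cat((freqs, freqs), dim=-1)       # (seq_len, dim_head)

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary(x, freqs):
    # x: (batch, heads, seq_len, dim_head); applied to queries and keys
    # before the attention dot product, so scores depend on relative position
    return x * freqs.cos() + rotate_half(x) * freqs.sin()
```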
I ran 4 experiments on https://github.com/rom1504/kaggle-fashion-dalle (a small, high-quality 50k-sample dataset, easy to experiment with for dalle): a baseline, a run with shift tokens, a run with rotary embeddings, and a run with both.
Here is the wandb report https://wandb.ai/rom1504/good_kaggle/reports/Dalle-pytorch-adding-rotary-and-shift--Vmlldzo5MzkwNzk
Settings:
- depth 16
- dim head 64
- heads 8
- batch size 80
- 40 epochs (20k steps)
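For anyone who wants to reproduce a comparable setup, both options can be toggled on the DALLE constructor. A hedged sketch (the `rotary_emb` and `shift_tokens` kwargs match recent DALLE-pytorch versions, but check yours; `vae`, `dim`, `num_text_tokens`, and `text_seq_len` below are placeholder values, not the values used in these runs):

```python
from dalle_pytorch import DALLE

dalle = DALLE(
    dim = 512,                # model width (illustrative; not reported above)
    vae = vae,                # assumes a trained DiscreteVAE instance
    num_text_tokens = 10000,  # illustrative vocabulary size
    text_seq_len = 256,       # illustrative text sequence length
    depth = 16,
    heads = 8,
    dim_head = 64,
    rotary_emb = True,        # turn on rotary embeddings
    shift_tokens = True       # turn on shift tokens
)
```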
The result: shift tokens and rotary embeddings each improve convergence by about the same amount, and the effects are additive: using both gives roughly twice the improvement over the baseline.
Samples/s per run: [throughput chart in the wandb report]
Loss at step 20k: [loss curves in the wandb report]
So in general I advise turning these two options on.
I hope this is useful to DALLE-pytorch trainers :)
-
why do you think shift tokens improve performance?
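For context on that question: the shift tokens trick lets each token cheaply see part of the previous token's features before attention, which is a plausible source of the convergence gain. A minimal sketch of the operation (illustrative, assuming the half-split variant used in lucidrains' implementations, not necessarily the exact DALLE-pytorch code):

```python
import torch
import torch.nn.functional as F

def shift_tokens(x, amount=1):
    # x: (batch, seq_len, dim)
    # half the feature dims are shifted forward by `amount` positions along
    # the sequence, so each token mixes in features from its predecessor;
    # front-padding keeps the operation causal (no information from the future)
    x_shift, x_pass = x.chunk(2, dim=-1)
    x_shift = F.pad(x_shift, (0, 0, amount, -amount), value=0.)
    return torch.cat((x_shift, x_pass), dim=-1)
```

The shift is typically applied to the input of each attention and feedforward block, so it adds essentially no parameters or compute.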