How best to implement a differential transformer? #567
Comments
I've done a similar experiment too. First of all, I recommend looking at the non-flash implementation from Diff Attn, "multihead_diffattn.py", and not "multihead_flashdiff_1.py". Secondly, you're dividing n_heads and head_dim twice; that's an issue, and it does not appear in the original code. Lastly, even though your standard GPT-2 with RoPE run went well, I recommend starting with the non-RoPE version, since it's easier to begin with. Also, F.scaled_dot_product_attention is probably a bit different from flash attention internally.
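Concretely, the shape convention in the non-flash reference is roughly the following (a sketch with made-up config numbers, from my reading of the code): you halve the number of heads, not the per-head dim as well.

```python
# Sketch of the diff-attn shape bookkeeping (illustrative numbers only).
n_embd    = 768
n_head    = 12                        # head count of the baseline GPT-2 model
num_heads = n_head // 2               # diff attention uses half as many heads...
head_dim  = n_embd // num_heads // 2  # ...so this equals the usual n_embd // n_head

# q, k are reshaped to (B, T, 2 * num_heads, head_dim)   -> pairs of query/key heads
# v    is reshaped to  (B, T,     num_heads, 2 * head_dim)
assert 2 * num_heads * head_dim == n_embd
```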
my implementation:
I'm not sure Issues is the best place to post this, but I just wanted to see if anyone else has been trying this idea:
There was a paper that came out recently that proposed a new head architecture, and I wanted to see if I could replicate the results (according to the paper they are very promising). It didn't seem too hard given what I knew from messing around with this repo. The authors provide three versions of the code here, and to keep things simple I tried to use this implementation here. I added rotary positional encoding separately and tested that; it worked well. Then I added the differential mechanism, and my code looks like this:
When I try to train this model it understandably runs at fewer iterations/sec, but looking at the loss per iteration, it seems to be getting stuck. (I kept the total batch size per iteration the same as in the gpt2-124M-RoPE run.)
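For reference, the mechanism I'm trying to add looks roughly like this (a simplified, non-flash sketch based on my reading of the paper and its reference code, not my exact file; n_embd, n_head, block_size are the usual nanoGPT config names):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffAttention(nn.Module):
    """Sketch of differential attention for a nanoGPT-style block.
    No RoPE, plain causal masking, explicit (non-flash) softmax so the
    head split is visible."""

    def __init__(self, n_embd, n_head, block_size, depth):
        super().__init__()
        assert n_embd % n_head == 0 and n_head % 2 == 0
        self.num_heads = n_head // 2        # half the usual head count
        self.head_dim = n_embd // n_head    # per-head dim is unchanged
        self.c_attn = nn.Linear(n_embd, 3 * n_embd, bias=False)
        self.c_proj = nn.Linear(n_embd, n_embd, bias=False)
        # lambda re-parameterization from the paper:
        # lambda = exp(lq1 . lk1) - exp(lq2 . lk2) + lambda_init
        self.lambda_init = 0.8 - 0.6 * math.exp(-0.3 * depth)  # depth = layer index
        self.lambda_q1 = nn.Parameter(torch.randn(self.head_dim) * 0.1)
        self.lambda_k1 = nn.Parameter(torch.randn(self.head_dim) * 0.1)
        self.lambda_q2 = nn.Parameter(torch.randn(self.head_dim) * 0.1)
        self.lambda_k2 = nn.Parameter(torch.randn(self.head_dim) * 0.1)
        # per-head norm over the subtracted output (PyTorch >= 2.4;
        # otherwise swap in a manual RMSNorm)
        self.subln = nn.RMSNorm(2 * self.head_dim, eps=1e-5)
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(block_size, block_size)).view(1, 1, block_size, block_size),
        )

    def forward(self, x):
        B, T, C = x.size()
        q, k, v = self.c_attn(x).split(C, dim=2)
        # paired q/k heads, double-width v heads
        q = q.view(B, T, 2 * self.num_heads, self.head_dim).transpose(1, 2)  # (B, 2h, T, d)
        k = k.view(B, T, 2 * self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.num_heads, 2 * self.head_dim).transpose(1, 2)  # (B, h, T, 2d)

        att = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)           # (B, 2h, T, T)
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)

        lam1 = torch.exp(torch.sum(self.lambda_q1 * self.lambda_k1))
        lam2 = torch.exp(torch.sum(self.lambda_q2 * self.lambda_k2))
        lam = lam1 - lam2 + self.lambda_init

        att = att.view(B, self.num_heads, 2, T, T)
        att = att[:, :, 0] - lam * att[:, :, 1]                              # the "differential" step
        y = att @ v                                                          # (B, h, T, 2d)
        y = self.subln(y) * (1.0 - self.lambda_init)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)
```

The parts I'm least sure I got right are the per-head RMSNorm on the subtracted output and the (1 − λ_init) scaling; both are easy to drop by accident when porting into the nanoGPT attention class.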
Any ideas on what I've gotten wrong? I'm no ML expert.
@karpathy, on the off chance that you see this: have you read the Diff Transformer paper, and if so, what do you think about it?