Lion 8 bit #188
Conversation
autoregressive enwik8 with lion8bit converged successfully (sans weight decay), even though there are some inaccuracies in kernel.cu
resolves #150
@christallire you're welcome to build from source and import the optimizer. seems to be working great, thanks to Tim's surrounding scaffold
…the python class (so that the state dict still makes sense)
ok, will let this PR sit for a while to gather feedback. probably will revisit it at the end of this month and see if we can get it merged
i think there may still be an issue with the blockwise version, where the momentum update with beta2 is occurring before the actual parameter update
Thank you for your work on this! This looks almost good to me. The only thing that is off is the order of the updates that you point out. We can get around this without rewriting the CUDA code by using the gradient variable as a temporary variable to store the current momentum value plus the update value for Lion (see other comments). Please add a comment to those lines to help people understand the use of the gradient as a temporary variable (more comments in general are appreciated; sorry for the mess with the undocumented code).
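For context, a minimal sketch of the ordering issue and the workaround, assuming the kernel's two-phase structure (state update first, parameter update in a later phase); the names here are illustrative, not the actual kernel code:

```python
import torch

def lion_step_sketch(p, g, m, lr=1e-4, beta1=0.9, beta2=0.99):
    # Correct Lion ordering (weight decay omitted): the update direction
    # must be computed from the *current* momentum, before the beta2
    # momentum update overwrites it.
    c = beta1 * m + (1 - beta1) * g         # current momentum + update value
    m.mul_(beta2).add_(g, alpha=1 - beta2)  # momentum update consumes the original grad
    g.copy_(c)                              # reuse the grad buffer as a temporary
    # ...later, in the parameter-update phase of the kernel:
    p.add_(torch.sign(g), alpha=-lr)
```

Since the momentum update reads the original gradient before the buffer is overwritten, the kernel's existing phase order can stay unchanged.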
The enwik8 test runs are great! I think this shows that it works. It would be great to add a test in test_optim.py for Lion. You need three things for that: (1) define a Lion str2optimizers dictionary entry, (2) define a str2statenames dictionary entry, (3) add the optimizer name to the optimizer_names list for the test_optimizer8bit function. That should be all that is needed. The only question is which 32-bit baseline to use. You might want to use your own Lion 32-bit repo. I think a fair test could also be to compare against the 32-bit bnb optimizer (since you have shown that 8-bit already replicates enwik8 performance). This might be simpler and quicker to write up.
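A rough sketch of that wiring, assuming the existing dictionary structure of tests/test_optim.py; the entry name, the choice of lion-pytorch as the 32-bit baseline, and the exact state-name tuple are assumptions to verify against the current test file:

```python
import bitsandbytes as bnb
from lion_pytorch import Lion  # assumed 32-bit baseline from the lion-pytorch repo

# (1) map a test name to (32-bit baseline ctor, 8-bit optimizer under test)
str2optimizers["lion8bit_blockwise"] = (
    Lion,
    lambda p: bnb.optim.Lion8bit(p, block_wise=True),
)

# (2) map the baseline's state name to the 8-bit state / quantization buffers
#     (tuple layout assumed to mirror the existing 1-state entries)
str2statenames["lion8bit_blockwise"] = [("exp_avg", "state1", "qmap1", "absmax1")]

# (3) include the new name in the list parametrizing test_optimizer8bit
optimizer_names = ["adam8bit_blockwise", "lion8bit_blockwise"]
```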
@TimDettmers thank you Tim for the feedback! 🙏 will address everything you brought up and poke you when it is ready later this week
Thank you, this looks good to me. The issue with the test is expected. The error from Lion is expected to be higher at times due to its noisy update. I will merge and will have a look at the test. This is an excellent PR, thank you for all the work. I think it will be invaluable to the community!
@TimDettmers oh hey Tim! glad to see this merged! thank you for reviewing it and getting it out there! totally forgot about it, my bad
per advice from Tim, the plan will be to closely follow the 1-state logic of RMSProp and conditionally branch out for the lion logic
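A minimal Python-level sketch of that plan, assuming a single persistent state buffer shared between the two code paths; the function and argument names are illustrative, not the actual kernel interface:

```python
import torch

def update_1state(optimizer_name, p, g, state1, lr, beta1, beta2, eps=1e-8):
    # 1-state path shared with RMSProp: one persistent buffer (state1)
    if optimizer_name == "RMSPROP":
        # running average of squared gradients, then a scaled update
        state1.mul_(beta1).addcmul_(g, g, value=1 - beta1)
        p.addcdiv_(g, state1.sqrt().add_(eps), value=-lr)
    elif optimizer_name == "LION":
        # conditional branch: state1 holds the Lion momentum instead
        update = torch.sign(beta1 * state1 + (1 - beta1) * g)
        state1.mul_(beta2).add_(g, alpha=1 - beta2)
        p.add_(update, alpha=-lr)
```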
status - Lion8bit is successfully training a small autoregressive transformer for character level enwik8 on my machine

some remaining todos before merging:
- … in pytorch
- … to test