Feature Request: Support MS-AMP #2143
Might be good to have this as an alternative choice. From their docs: "MS-AMP has the following benefits compared with Transformer Engine."
Will work on this next week :)
+++ would love to see MS-AMP supported. Currently, H100s are on par with A100s cost-wise even with the current FP8 implementation, but if MS-AMP FP8 can be implemented, it would likely yield anywhere from a 50-100% boost in training speed. We still need Flash Attention with FP8, but MS-AMP is a great first step towards faster training.
@muellerzr is this branch in a state to be tested? https://github.com/huggingface/accelerate/tree/ms-amp thanks!
@winglian not quite yet! But I'll let you know when it's ready for you to test :) (should be by end of this week!)
@winglian go ahead and try the branch out :) Note that it only works on a single GPU for now (will look at DeepSpeed tomorrow), and I don't think you'll see a time decrease yet. What you should see, though, is a memory decrease for NLP-based models. For example, I ran bert-base-cased (the NLP example) and saw:
But time increased by almost 2x 😱
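For anyone wanting to try this out, here is a minimal sketch of how the MS-AMP backend would be exercised through Accelerate's FP8 support. The `FP8RecipeKwargs` handler, the `"MSAMP"` backend name, and the `opt_level` values are assumptions based on the API this work introduced; verify against the branch before relying on them:

```python
# Minimal single-GPU sketch: FP8 training with the MS-AMP backend via Accelerate.
# FP8RecipeKwargs, backend="MSAMP", and opt_level are assumed from this work's
# API and should be checked against the branch.
import torch
from accelerate import Accelerator
from accelerate.utils import FP8RecipeKwargs

accelerator = Accelerator(
    mixed_precision="fp8",
    kwargs_handlers=[FP8RecipeKwargs(backend="MSAMP", opt_level="O2")],
)

model = torch.nn.Linear(64, 64)
optimizer = torch.optim.AdamW(model.parameters())
model, optimizer = accelerator.prepare(model, optimizer)

# Training proceeds as usual; the FP8 casting happens inside the wrapped objects.
inputs = torch.randn(8, 64, device=accelerator.device)
loss = model(inputs).sum()
accelerator.backward(loss)
optimizer.step()
optimizer.zero_grad()
```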
Shouldn't the FLOPs increase, thereby reducing training time? The effect may not show up on small models, but if you take a 30B model, I would be surprised if you didn't see a difference.
Correct. I only tested on a tiny model just to get the API stable 😉
Now that it's a bit more stable, I saw both memory decreases and speed increases when combining MS-AMP and TransformerEngine. More details are in the PR (so overall, purely positive).
@muellerzr Accelerate FP8 with the MS-AMP backend doesn't seem to work with DeepSpeed. However, MS-AMP itself supports DeepSpeed (ZeRO): https://azure.github.io/MS-AMP/docs/user-tutorial/usage/#usage-in-deepspeed
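For reference, a hedged sketch of what MS-AMP's native DeepSpeed integration looks like based on the linked tutorial: MS-AMP is switched on through an `"msamp"` section of the DeepSpeed config. The exact keys are taken from their docs and worth double-checking:

```python
# Sketch of MS-AMP's native DeepSpeed (ZeRO) integration, per the linked
# tutorial: an "msamp" section in the DeepSpeed config enables it.
# Config keys follow their docs; verify before use.
import torch
import deepspeed

model = torch.nn.Linear(64, 64)

ds_config = {
    "train_batch_size": 32,
    "zero_optimization": {"stage": 2},
    "msamp": {
        "enabled": True,
        "opt_level": "O3",  # O3 is MS-AMP's ZeRO-aware optimization level
    },
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```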
Correct, I'm looking into that this week.
MS-AMP would also allow us to store the weights in FP8, enabling larger models to be trained on smaller hardware; right now the weights are still stored on device as fp16/bf16.
The implementation example they provide seems similar to accelerate.prepare(...). A rough illustration of that similarity follows below.
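As a sketch of the point above, here is MS-AMP's documented entry point next to Accelerate's. `msamp.initialize` is taken from MS-AMP's tutorial; the analogy to `Accelerator.prepare` is illustrative, not an exact equivalence:

```python
# Contrast of the two entry points. msamp.initialize is from MS-AMP's
# tutorial; the Accelerate analogy is the observation made above.
import torch
import msamp

model = torch.nn.Linear(64, 64).cuda()
optimizer = torch.optim.AdamW(model.parameters())

# MS-AMP style: one call wraps model + optimizer for FP8; at "O2" the
# master weights and optimizer state are also kept in low precision.
model, optimizer = msamp.initialize(model, optimizer, opt_level="O2")

# Accelerate style would be the analogous one-call wrapping:
#   model, optimizer = accelerator.prepare(model, optimizer)
```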