Fix glu activation #148
Conversation
I will try to read the paper soon, but until then I defer to Jake on the proposed functionality change.
I made one proposal on the code changes; the rest looks good.
Thank you, Thomas!
Thank you for fixing this! I realize that GLU unit tests only perform checks on individual tensors and their computed values, not the whole training pipeline, so this fix was necessary. Looks good to me!
Hmm, I think we got a new PyTorch, so apex needs a manual rebuild and a new image, but I pushed a fix and re-ran the test suite. Good to merge now.
The test concerning GLU was being bypassed because "--no-bias-gelu-fusion" was missing; with fusion enabled, the whole activation path is skipped. See https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/megatron/model/transformer.py#L101
The reason I came up with that is that GLU activations divide the last dimension by two, which is not possible in the current MLP mechanism. In fact, https://arxiv.org/abs/2002.05202 changes the hidden size: instead of two weight matrices you have three (two of which are merged together in the implementation). A sketch of this structure is below.
cc @jaketae
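For illustration only, here is a minimal, hypothetical PyTorch sketch of the structure described above; it is not the Megatron-DeepSpeed MLP, and the layer names (`dense_h_to_4h`, `dense_4h_to_h`) and the choice of GELU as the gating nonlinearity are assumptions. The first projection outputs twice the feed-forward width so the activation can split the last dimension in half and gate one half with the other, as in https://arxiv.org/abs/2002.05202.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GLUMLP(nn.Module):
    """Sketch of a GLU-style MLP: two of the three GLU matrices are merged
    into a single projection, so the activation must halve the last dim."""

    def __init__(self, hidden_size: int, ffn_hidden_size: int):
        super().__init__()
        # One matrix holding both the "value" and "gate" projections.
        self.dense_h_to_4h = nn.Linear(hidden_size, 2 * ffn_hidden_size)
        self.dense_4h_to_h = nn.Linear(ffn_hidden_size, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        projected = self.dense_h_to_4h(x)
        # GLU-style activation: split the last dimension in two and gate
        # one half with the activated other half.
        value, gate = projected.chunk(2, dim=-1)
        return self.dense_4h_to_h(value * F.gelu(gate))


# Usage example with arbitrary sizes.
mlp = GLUMLP(hidden_size=1024, ffn_hidden_size=4096)
out = mlp(torch.randn(2, 16, 1024))
assert out.shape == (2, 16, 1024)
```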