Expose GLU activations as arguments #69
Conversation
@thomasw21 Some additional notes:
Thanks for the review!
Thanks! Small nit, let me know what you think:
- Also, could you add a test? Maybe one that just runs a model with a GLU config, so we know it doesn't throw (a rough sketch of such a test follows below).

About your comments:
- I dislike getattr, as I generally consider it bad practice, but it really depends on the crowd I guess. I've seen some occurrences of it in the codebase, so why not?
- Concerning that ... I don't know, maybe you can test it out?
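A minimal sketch of what such a test could look like, assuming the GLU variant is exposed as a plain callable; the geglu below is illustrative, not the repository's actual implementation:

import torch
import torch.nn.functional as F

def geglu(x):
    # GLU variant: split the last dimension in half and gate one half with GELU
    a, b = x.chunk(2, dim=-1)
    return a * F.gelu(b)

def test_geglu_forward_does_not_throw():
    x = torch.randn(2, 4, 16)            # (batch, seq, 2 * ffn_hidden)
    out = geglu(x)
    assert out.shape == (2, 4, 8)         # output width is half the input width
    assert torch.isfinite(out).all()

A full config-level test would instead launch a small model with the GLU flag set; the shape/finiteness check above is only the minimal "doesn't throw" version.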
@thomasw21 Thanks for the review! The activation functions themselves are already tested for in ... Are we running experiments in bf16? Locally I get:

>>> import torch
>>> x = torch.randn(8, 100, 768).to(torch.bfloat16)
>>> x.dtype
torch.bfloat16
>>> torch.nn.functional.gelu(x)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/jaketae/Documents/Dev/GitHub/Megatron-DeepSpeed/venv/lib/python3.7/site-packages/torch/nn/functional.py", line 1555, in gelu
return torch._C._nn.gelu(input)
RuntimeError: "GeluKernelImpl" not implemented for 'BFloat16'
>>> torch.__version__
'1.9.0'

The trace seems to suggest that there's no gelu kernel for bfloat16.
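One way to handle this at test time, sketched here as a suggestion rather than what the PR does, is to probe for the kernel at runtime:

import torch

def bf16_gelu_available() -> bool:
    # Probe for a BFloat16 gelu kernel; torch 1.9 on CPU raises
    # RuntimeError: "GeluKernelImpl" not implemented for 'BFloat16'
    try:
        torch.nn.functional.gelu(torch.zeros(1, dtype=torch.bfloat16))
        return True
    except RuntimeError:
        return False

A test can then skip (or a code path can fall back to fp32) when this returns False.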
So I'm not too familiar with bf16. From what I understand it's only available on A100 (CPU support seems to be very basic). Concerning our current experiments, we're running fp16 everywhere right now. JZ has very few A100s and they are more experimental, so there's probably no chance we're running bf16 there. However, that doesn't mean we never will, so let's make a best effort to support it. Also, concerning tests, maybe we should start discussing the general testing strategy. I usually prefer general tests that exercise an entire pipeline: it forces us to think of the lib as an API and of the launch scripts as a CLI, so we'd want the expected commands to work as expected. But I'm open to unit tests. Let's discuss.
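For the hardware side, a rough check of whether the current GPU can run bf16 at all (Ampere-class, compute capability 8.0+, e.g. A100 or RTX 30xx); this is just a sketch, not something the PR adds:

import torch

def gpu_supports_bf16() -> bool:
    if not torch.cuda.is_available():
        return False
    major, _ = torch.cuda.get_device_capability()
    return major >= 8   # Ampere (A100, RTX 30xx) and newer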
@thomasw21 I brushed up on the things that were mentioned (e.g. ...). I agree that an exhaustive high-level test that goes through the entire pipeline would be helpful. I also think @stas00 might have some ideas in mind. Maybe we can open a separate issue and continue the discussion there.
Looks good, @jaketae.
Probably add --glu-activation=geglu to the main training test - at the moment it just tests that it can run and validates logs where possible - i.e. no need for a new training test, as you're already testing the functionality elsewhere.
So here it'd be something like this: 8088c0f
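The change to the main training test is roughly one extra pair of tokens in the argument list it already builds; the surrounding flags below are made up for illustration and don't reflect what 8088c0f actually touches:

# hypothetical fragment of the arg list built by the main training test
training_args = [
    "--num-layers", "2",
    "--hidden-size", "64",
    "--seq-length", "128",
    "--glu-activation", "geglu",   # new: exercise the GLU path in the same end-to-end run
]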
Indeed, we don't need to worry about bf16 for JZ, but we could use it on A100s if we get access. To check for bf16 support, please see: ... I have an rtx-3090, so if you want me to test something with bf16 please let me know - I only have one card though. The jit+bf16 check can be done in a test.
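A sketch of what that jit+bf16 check could look like as a test; as the rest of the thread shows, the missing kernel (not JIT) turned out to be the actual blocker, so on torch 1.9 this would still need a skip guard:

import pytest
import torch
import torch.nn.functional as F

@pytest.mark.skipif(not torch.cuda.is_available(), reason="bf16 needs a GPU here")
def test_jit_geglu_bf16():
    def geglu(x):
        # split the last dim in half and gate one half with GELU
        a, b = torch.chunk(x, 2, dim=-1)
        return a * F.gelu(b)

    scripted = torch.jit.script(geglu)
    x = torch.randn(2, 4, 16, device="cuda", dtype=torch.bfloat16)
    out = scripted(x)
    assert out.dtype == torch.bfloat16
    assert out.shape == (2, 4, 8)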
@stas00 Thanks for the review.
Let me know if there's anything that needs to be fixed!
Co-authored-by: Stas Bekman <[email protected]>
Here you go - on rtx-3090:
I see you got the same error as the one I got locally. The trace seems to suggest that we can't use BF16 with JIT due to some missing kernel implementations. Here are my thoughts: ...
What do you think?
This is unrelated to JIT - it's the same error if you remove jit:
It says it has no such kernel implemented for this dtype.
We definitely don't have to do anything about it for now - we'll cross that bridge when we come to it. Until then, probably comment out the whole bf16 test, since we might want it down the road when torch implements bf16 support for those kernels. And it's good to have it there as an example for other bf16 tests.
I see, thanks for the clarification - so it's just a BF16 issue. I'll comment out the test for now. Thanks!
uncomment in the future when torch supports gelu kernels for bf16
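If commenting the test out ever gets awkward, an equivalent option (sketched as an alternative, not what was done here) is to keep it but skip on the missing kernel, so it un-skips itself once torch gains bf16 gelu support:

import pytest
import torch
import torch.nn.functional as F

def _bf16_gelu_available() -> bool:
    # same runtime probe as sketched earlier in the thread
    try:
        F.gelu(torch.zeros(1, dtype=torch.bfloat16))
        return True
    except RuntimeError:
        return False

@pytest.mark.skipif(not _bf16_gelu_available(), reason="torch has no BFloat16 gelu kernel yet")
def test_geglu_bf16():
    x = torch.randn(2, 4, 16, dtype=torch.bfloat16)
    a, b = x.chunk(2, dim=-1)
    assert (a * F.gelu(b)).dtype == torch.bfloat16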
This is good to go now
@stas00 I shouldn't have merged this so quickly - thanks for the quick fix!
All is good. It was my unvalidated suggestion that broke it. Once we get a CI, it'll be much easier.
* add checkpoint measurement
* Update CODEOWNERS
* add TFLOP per sec support
* remove duplicate tflops calculation
* remove unnecessary comment
* remove comments
* remove comment
This PR exposes the improved GLU activation functions from #47 as arguments to the main script. The added flag is --glu-activation.
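For context, a self-contained sketch of how a flag like this can map onto the GLU variants from #47; the dispatch dict and argparse wiring below are illustrative, not the repository's actual code:

import argparse
import torch
import torch.nn.functional as F

# GLU variants (Shazeer, "GLU Variants Improve Transformer", 2020):
# split the doubled-width projection in half and gate one half with an activation.
def geglu(x):
    a, b = x.chunk(2, dim=-1)
    return a * F.gelu(b)

def reglu(x):
    a, b = x.chunk(2, dim=-1)
    return a * F.relu(b)

def swiglu(x):
    a, b = x.chunk(2, dim=-1)
    return a * F.silu(b)

GLU_ACTIVATIONS = {"geglu": geglu, "reglu": reglu, "swiglu": swiglu}

parser = argparse.ArgumentParser()
parser.add_argument("--glu-activation", choices=sorted(GLU_ACTIVATIONS), default=None,
                    help="replace the MLP activation with a GLU variant")
args = parser.parse_args(["--glu-activation", "geglu"])

activation = GLU_ACTIVATIONS[args.glu_activation]
print(activation(torch.randn(2, 8)).shape)   # torch.Size([2, 4])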