Megatron fused CUDA kernels to improve Hugging Face model classes' scalability #11368

Open
g-karthik opened this issue Apr 22, 2021 · 1 comment
Labels: Performance, WIP


g-karthik commented Apr 22, 2021

🚀 Feature request

Support for custom fused CUDA kernels with HF model classes.

Motivation

It appears that Hugging Face model classes do not scale as well out of the box as Megatron-LM, even when the latter is configured with a model-parallelism degree of 1 for a "fair" performance comparison.

One of the presumed reasons for this is that Megatron-LM leverages custom fused CUDA kernels written by NVIDIA, specifically these.

Could we get variants of existing HF classes (perhaps GPT2Model, GPT2LMHeadModel, etc.) that leverage some or all of these fused CUDA kernels, while still ensuring that the original pre-trained weights can be loaded into the variant classes?
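For illustration only, here is a rough sketch of what such a variant might look like. `fused_bias_gelu` is a hypothetical placeholder standing in for one of the real fused kernels, and the GPT2MLP internals referenced (`c_fc`, `c_proj`, `dropout`) are assumed from the current transformers implementation; the point is only that parameter names stay unchanged, so `from_pretrained` still loads the original checkpoints:

```python
import math
import torch
from transformers import GPT2LMHeadModel


def fused_bias_gelu(x, bias):
    # Hypothetical stand-in for a fused CUDA kernel. This reference fallback
    # uses GPT-2's tanh-approximate GELU ("gelu_new") applied after the bias add.
    x = x + bias
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))


class FusedGPT2LMHeadModel(GPT2LMHeadModel):
    """GPT2LMHeadModel whose MLP activation is routed through a (placeholder) fused op.

    Parameter names and shapes are unchanged, so `from_pretrained("gpt2")`
    loads the original pre-trained weights without any conversion step.
    """

    def __init__(self, config):
        super().__init__(config)
        for block in self.transformer.h:
            block.mlp.forward = self._make_fused_mlp_forward(block.mlp)

    @staticmethod
    def _make_fused_mlp_forward(mlp):
        def forward(hidden_states):
            # c_fc is a Conv1D (y = x @ weight + bias); defer the bias add so
            # it can be fused with the activation in a single kernel.
            h = hidden_states @ mlp.c_fc.weight
            h = fused_bias_gelu(h, mlp.c_fc.bias)
            return mlp.dropout(mlp.c_proj(h))
        return forward


model = FusedGPT2LMHeadModel.from_pretrained("gpt2")
```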

Any guidance or low-level thoughts on making this happen would also be greatly appreciated!

@thomwolf @patrickvonplaten @LysandreJik @stas00

stas00 added the WIP and Performance labels May 22, 2021
stas00 (Contributor) commented May 23, 2021

I think the biggest barrier to using custom CUDA kernels is that it would require transformers to move from a Python-only package to a compilation-required package (even if the compilation is JIT), and in my experience that kind of package is far from trivial to use and often raises the barrier to entry.
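To make that concrete, this is roughly the kind of JIT build step that would be introduced (a sketch modeled on the Megatron-style approach; the directory and file names here are illustrative, not actual transformers sources):

```python
import pathlib
from torch.utils.cpp_extension import load

# JIT-compiling a custom CUDA extension at import time: this is what pulls in
# a hard dependency on nvcc, a matching CUDA toolkit and a C++ toolchain on
# every user's machine -- the barrier described above.
srcpath = pathlib.Path(__file__).parent / "fused_kernels"  # illustrative path

scaled_masked_softmax_cuda = load(
    name="scaled_masked_softmax_cuda",
    sources=[
        str(srcpath / "scaled_masked_softmax.cpp"),   # illustrative file names
        str(srcpath / "scaled_masked_softmax_cuda.cu"),
    ],
    extra_cuda_cflags=["-O3", "--use_fast_math"],
    verbose=True,
)
```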

If I'm not mistaken, some fused kernels have been pushed upstream into pytorch core, so if you know of any that we could receive precompiled via pytorch, then we can definitely use those.
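As a rough example of that path (a sketch, not a claim about which kernels are actually available upstream), elementwise patterns like bias + GELU can be fused at runtime by TorchScript's fuser using nothing beyond the stock PyTorch wheel, i.e. with no local compilation step:

```python
import torch


@torch.jit.script
def bias_gelu(bias, y):
    # TorchScript's fuser can fold this chain of elementwise ops into a single
    # kernel at runtime; everything needed ships precompiled with the PyTorch wheel.
    x = bias + y
    return x * 0.5 * (1.0 + torch.erf(x / 1.41421356237))


device = "cuda" if torch.cuda.is_available() else "cpu"
y = torch.randn(8, 1024, 3072, device=device)
bias = torch.randn(3072, device=device)
out = bias_gelu(bias, y)  # the first calls trigger fusion of the scripted graph
```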

And if the kernels you have in mind aren't upstream, and you have some resources to initiate the conversation, it would definitely help to request that such kernels be added to pytorch core. Definitely tag me if you do start such a thread in the pytorch issues.


I love your spirit of proposing various performance optimizations, @g-karthik, and I'd love to work on all of the ones you have been proposing here and in the DeepSpeed issues, but so far I have no free resources to do so and all my time is spent on making things work.
