[Inference] Support GPT-J-6B #1332
Comments
Hi @oborchers Thanks for your request.
I think after resolving these issues, we can get this model running through DeepSpeed-Inference. |
Much appreciated, thanks for coming back on the request! 👍 In the meantime I already took some time to understand the model's behavior and the policy, and came up with something that runs. It doesn't produce anything useful (obviously), because I didn't consider 2 and 3, but it runs to some degree and can realize similar gains in inference speed compared to Neo on a V100. Regarding 1: I came up with the following
This, however, has multiple caveats:
Regarding 2: Yes. This is also reflected in the missing second layer norm, so just altering the config with something like … Regarding 3: I hadn't even thought about that! Thanks for the analysis and the support of the request 👍 All the best and a nice evening, |
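For readers who want to attempt something similar, below is a minimal sketch of manual tensor-parallel injection for GPT-J via DeepSpeed's `injection_policy` argument. The `GPTJBlock` class and submodule names come from the transformers GPT-J implementation; this is an assumption-laden illustration, not the policy described in the comment above.

```python
# Hypothetical sketch: point DeepSpeed's injection_policy at the output
# projections of GPT-J's attention and MLP blocks so it knows where to
# place the tensor-parallel all-reduce.
import torch
import deepspeed
from transformers import AutoModelForCausalLM
from transformers.models.gptj.modeling_gptj import GPTJBlock

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B",
                                             torch_dtype=torch.float16)

ds_engine = deepspeed.init_inference(
    model,
    mp_size=1,                     # tensor-parallel degree
    dtype=torch.float16,
    injection_policy={GPTJBlock: ("attn.out_proj", "mlp.fc_out")},
)
model = ds_engine.module           # ready for model.generate(...)
```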
Any updates on this issue? @oborchers @RezaYazdaniAminabadi |
I'm also very interested in this one. |
Hi @yovizzle @zgerrard and @oborchers Thanks for your interest. Sorry for the delay in getting back to this thread. |
Can you please try this PR to see if it works for this model? |
@RezaYazdaniAminabadi: Thank you for working on the issue! Much appreciated 👍 Without: (single-GPU online inference) I'm assuming, based on the PR description, that this is mostly targeted at multi-GPU inference due to the tensor slicing, right? |
Hi @oborchers |
Thanks for this, @RezaYazdaniAminabadi! Do you have an ETA on the inference kernels for GPT-J? Even a very rough ETA would be helpful. |
Hi @joehoover, I am going to be more focused on this through next week. I would say it will be ready by early December. |
@RezaYazdaniAminabadi excellent! Thank you for that time estimate and the work on it 👍🏻 |
@RezaYazdaniAminabadi I tried this PR but got a strange outcome. It worked with 1 or 2 GPUs, but crashed with 3 GPUs. Here is my code; the input file has 10 prompts.
When running with --include localhost:0 or --include localhost:0,1, it worked fine. It crashed with 3 GPUs.
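(The script itself is not reproduced above; for reference, here is a minimal sketch of this kind of multi-GPU generation setup, assuming a hypothetical prompts.txt input file and the `.module`/`generate` flow rather than the commenter's exact code:)

```python
# Sketch of a DeepSpeed text-generation run, launched e.g. with:
#   deepspeed --include localhost:0,1 run_gptj.py
import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B",
                                             torch_dtype=torch.float16)

# mp_size follows the launcher's world size; it must divide the model's
# head count, which is why 3 GPUs fail here.
ds_engine = deepspeed.init_inference(model, mp_size=world_size,
                                     dtype=torch.float16,
                                     replace_method="auto")
model = ds_engine.module

with open("prompts.txt") as f:      # e.g. the 10-prompt input file
    prompts = [line.strip() for line in f if line.strip()]

for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(f"cuda:{local_rank}")
    outputs = model.generate(**inputs, max_new_tokens=50)
    if local_rank == 0:
        print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```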
|
Hi @dunalduck0 Thanks for trying this. Best, |
Thank you @RezaYazdaniAminabadi. For the text-generation task, input lengths normally vary. Does that mean we need to pad the input so that the dimensions are divisible? If so, how do I do that? |
Sorry, I meant the model dimensions, such as the hidden size and the number of attention heads. This is due to partitioning the weights across GPUs. The input, however, will not be partitioned but broadcast to the GPUs. |
Well, then the number of GPUs is constrained by so many dimensions. Maybe 2, 4 and 8 are the only possible choices that work with most models?
Mei
|
Yes, this is the case. For some models, such as GPT2 with 25 heads, the lowest mp_size we can set after 1 is 5. So we have this constraint based on the model structure. |
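In other words, the valid tensor-parallel degrees are just the divisors of the attention-head count (the hidden size must also divide evenly). A tiny illustration:

```python
# Valid mp_size values are divisors of the number of attention heads.
def valid_mp_sizes(num_heads):
    return [n for n in range(1, num_heads + 1) if num_heads % n == 0]

print(valid_mp_sizes(25))  # a 25-head GPT2 variant -> [1, 5, 25]
print(valid_mp_sizes(16))  # GPT-J-6B's 16 heads -> [1, 2, 4, 8, 16], so 3 GPUs cannot work
```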
It sounds like you want to split the model into equal sizes across the GPUs. But why can't you split it into different sizes? Say the first two GPUs get 9 heads each and the third gets 7, for a total of 25 heads? Sorry, I actually don't know much about DL model structures; apologies in advance if my suggestion is wildly stupid :P. |
Any update on the GPT-J-6B Inference kernel? |
Hi, I also found that when using DeepSpeed to speed up GPT-J inference, only 1 or 2 GPUs work, but it crashes with 3 GPUs. Besides, I found that using 2 GPUs is not faster than using 1, except that a larger model can be fit across the 2 GPUs. Thanks! |
Hi @Leezekun, A large decrease in latency is not an expected result of mere model parallelism. That's what the kernels are for. Also, MP for GPT-J won't work with 3 devices. The degree of MP is constrained by the dimensions of the model. |
Hi @joehoover, Thanks for the clarification. Do you know any updates on the GPT-J-6B Inference kernel? |
Hi @Leezekun It is true that parallelism alone may not improve performance, and I agree with @joehoover that it requires kernels to get higher performance. Thanks, |
Thanks a lot! Look forward to the update. |
Hi everyone, I have added this PR to run the GPT-J model through DeepSpeed. Can you please try it and see if it works on your side?
Best, |
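A rough sketch of invoking the new kernel path for GPT-J follows; the argument names match DeepSpeed's inference API of that period (`replace_with_kernel_inject`), but treat it as an approximation, not the test script attached to the PR.

```python
# Approximate usage of DeepSpeed's kernel injection for GPT-J.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B",
                                             torch_dtype=torch.float16)

ds_engine = deepspeed.init_inference(model,
                                     mp_size=1,
                                     dtype=torch.float16,
                                     replace_method="auto",
                                     replace_with_kernel_inject=True)
model = ds_engine.module

inputs = tokenizer("DeepSpeed is", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```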
Thanks for the PR. But when I tried it using the script you provided, I got the following errors:
|
Awesome @RezaYazdaniAminabadi! Highly appreciated and thanks for tackling the issue 💯
Steps to replicate:
The same happens when I do:
Upgrading ninja also did not work, and when running:
Did I miss something? |
It also does not work in a clean environment:
Tested on |
Thanks for trying this. Yes, there is some issue with the half-precision kernels. I am creating another PR to fix this. |
@RezaYazdaniAminabadi: This is working! Great job! 💯 In terms of performance: PyTorch (1.9 + cu111) + transformers 4.15 + 1x V100 and
Without kernels:
With kernels:
-> Decrease: ~43%
Without kernels (fp16):
With kernels (fp16):
-> Decrease: ~63% 🚀
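(For anyone reproducing such numbers, a simple wall-clock harness along these lines is usually enough; this is an illustrative sketch, not the benchmark script used for the figures above.)

```python
# Average per-call latency of generate(), with CUDA sync for honest timings.
import time
import torch

def time_generate(model, inputs, n_iters=10, **gen_kwargs):
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_iters):
        model.generate(**inputs, **gen_kwargs)
    torch.cuda.synchronize()
    return (time.time() - start) / n_iters
```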
But some issues still remain, though they may not necessarily relate to this very issue, if I am correct. Shall I open a new issue for this?
Results in:
|
Hi @joehoover Thanks for trying this out. Great performance results; I am happy to see such a good improvement. |
@RezaYazdaniAminabadi, thanks so much for putting this together! Quick question: should I expect DeepSpeed inference to add memory overhead? I've been using a 16GB T4 for inference dev and I can fit the FP16 GPT-J weights on that device with room to spare. However, when I initialize DeepSpeed inference, I run out of VRAM. Just want to make sure I'm not making a mistake somewhere. |
@joehoover yes. You need to load on CPU and then let deepspeed do the conversion, which moves it to the GPU. @RezaYazdaniAminabadi do you think there's a way to limit memory consumption during injection? |
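A sketch of that flow: load the fp16 weights on CPU first and let deepspeed.init_inference move the converted model to the GPU. (The `low_cpu_mem_usage` flag is an optional transformers feature, not something prescribed in this thread.)

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM

# Load on CPU so the 16GB card never has to hold both the original weights
# and the injected copy at the same time.
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B",
                                             torch_dtype=torch.float16,
                                             low_cpu_mem_usage=True)

# DeepSpeed performs the injection and places the result on the GPU.
ds_engine = deepspeed.init_inference(model, mp_size=1, dtype=torch.float16,
                                     replace_method="auto",
                                     replace_with_kernel_inject=True)
model = ds_engine.module
```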
Same problem, getting OOM using a single 16GB T4 for inference. Is there a sample script for this? |
Has anyone here been able to get the GPT-J inference kernel to work on more than one GPU? There might be a bug in the code causing it to fail with more than one GPU, see issue: #1719 |
@Kkkassini, can you try creating the pipeline and setting the device after deepspeed.init_inference? Please use this script as an example. |
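The suggested ordering looks roughly like this (a sketch modeled on DeepSpeed's example generation scripts, not the exact script linked above): build the pipeline on CPU, inject, and only then point the pipeline at this rank's GPU.

```python
import os
import torch
import deepspeed
from transformers import pipeline

local_rank = int(os.getenv("LOCAL_RANK", "0"))

# 1) Create the pipeline on CPU (device=-1).
pipe = pipeline("text-generation", model="EleutherAI/gpt-j-6B", device=-1)

# 2) Inject DeepSpeed's inference engine into the underlying model.
pipe.model = deepspeed.init_inference(pipe.model,
                                      mp_size=int(os.getenv("WORLD_SIZE", "1")),
                                      dtype=torch.float16,
                                      replace_method="auto",
                                      replace_with_kernel_inject=True)

# 3) Only now set the pipeline's device to this rank's GPU.
pipe.device = torch.device(f"cuda:{local_rank}")

print(pipe("DeepSpeed is", max_new_tokens=32))
```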
@TiesdeKok, there have been some changes to the injection that might have caused this issue. I will look into it and try to fix it soon. |
Is your feature request related to a problem? Please describe.
With the new release of transformers, the gpt-j-6b model will be available to the public: huggingface/transformers#13022
Currently, … will only return …
DeepSpeed already supports the smaller gpt-neo variants, so the addition of gpt-j-6b would make sense.
Additional context
If there is anything I could do (e.g., create a PR) with some guidance, I'd be happy to work on the issue and contribute as well.