[BUG] GPT-J + init_inference + replace_with_kernel_inject returns copy error with multiple GPUs #1719

Closed
TiesdeKok opened this issue Jan 23, 2022 · 12 comments · Fixed by #1724
Labels
bug Something isn't working

Comments

@TiesdeKok

Describe the bug

Using the replace_with_kernel_inject option in init_inference returns an error when using multiple GPUs (with a GPT-J model).

To Reproduce
Steps to reproduce the behavior:

  1. Create an inference script using HF Transformers and GPT-J
  2. Run the deepspeed command with multiple GPUs
import os
local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import deepspeed
from transformers import pipeline as t_pipeline

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
generator = t_pipeline('text-generation', model=model, tokenizer=tokenizer, eos_token_id=50256,  device=local_rank)

generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.float16,
                                           replace_method='auto',
                                           replace_with_kernel_inject=True)

input_list = ["This is the input "]

res_ds = generator(input_list, do_sample=True, max_length = 1000, eos_token_id=50256, temperature=0.25, pad_token_id=50257)

Expected behavior
No error.

ds_report output
Unavailable; I am not currently on the compute node.

Screenshots
[screenshot of the copy error traceback]

System info (please complete the following information):

  • OS: Linux - Ubuntu
  • One machine with 8x A100 40GB PCIe
  • Python 3.8
  • Using the following docker image: pytorch/pytorch:1.9.1-cuda11.1-cudnn8-devel

Launcher context
Deepspeed command line

Docker context
Base image is: pytorch/pytorch:1.9.1-cuda11.1-cudnn8-devel

Additional context

@TiesdeKok TiesdeKok added the bug Something isn't working label Jan 23, 2022
@TiesdeKok TiesdeKok changed the title [BUG] init_inference + replace_with_kernel_inject returns copy error with multiple GPUs [BUG] GPT-J + init_inference + replace_with_kernel_inject returns copy error with multiple GPUs Jan 23, 2022
@TiesdeKok
Author

In addition, I was able to replicate the issue on a different box running Fedora with 8x A6000 GPUs.

@RezaYazdaniAminabadi
Contributor

Hi @TiesdeKok,

I will take a look at this.

Thanks,
Reza

@TiesdeKok
Author

Thanks a lot!

@RezaYazdaniAminabadi
Contributor

Hi @TiesdeKok
Can you please try this PR and see if this is fixed?

Thanks.

@TiesdeKok
Author

Appreciate the quick turnaround here @RezaYazdaniAminabadi!

The copy error is gone and the inference starts now, so that appears resolved. 🥳

However, I am running into another problem: everything works great with one GPU, but with multiple GPUs the inference hangs indefinitely. I can open a separate issue if you prefer, but let me describe what I am observing:

  • "EleutherAI/gpt-j-6B" with float16 with one GPU without kernel inject --> works
  • "EleutherAI/gpt-j-6B" with float16 with one GPU with kernel inject --> works
  • "EleutherAI/gpt-j-6B" with float16 with 2+ GPU without kernel inject --> hangs indefinitely
  • "EleutherAI/gpt-j-6B" with float16 with 2+ GPU with kernel inject --> hangs indefinitely

No errors are shown; it just pins the GPUs at 100% and nothing happens. I have tried this on two different machines and the behavior is the same. I noticed the same issue yesterday without the kernel inject, and letting it run for hours (on one prompt) makes it clear that things are stuck.

To dig into this further, I have also tried the distilgpt2 model, and the same issue pops up:

  • "distilgpt2" with float 16 with one GPU --> works
  • "distilgpt2" with float 16 with 2+ GPU --> hangs indefinitely
  • "distilgpt2" with float 32 with one GPU --> works
  • "distilgpt2" with float 32 with 2+ GPU --> hangs indefinitely

I am a little lost here; the code I am running is essentially the same as:
https://github.com/microsoft/DeepSpeedExamples/blob/fix-inferen-test/inference/huggingface/gpt-neo.py

which I run with deepspeed --num_gpus X gpt-neo.py (pseudo-code).

I tried looking for a verbose option to see if I could get better logging once things are on the GPUs, but I could not find one. Any ideas on what might be happening here? 😕

@RezaYazdaniAminabadi
Contributor

@TiesdeKok,

This is a known issue that we have with the integration of DeepSpeed Inference and HF. It happens because one GPU finishes its generation while the other is still waiting to generate the next token. Would you mind setting min_length and max_length to the same number and seeing if that resolves the issue?
Thanks
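
For reference, a minimal sketch of that suggestion applied to the repro script above (the value 200 is only an illustrative placeholder; the point is that min_length == max_length makes every rank generate the same number of tokens, so no GPU finishes early):

res_ds = generator(input_list,
                   do_sample=True,
                   min_length=200,   # placeholder length
                   max_length=200,   # equal to min_length so all ranks stay in sync
                   eos_token_id=50256,
                   pad_token_id=50257)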

@TiesdeKok
Author

After reading your description, it immediately hit me that the hanging issue is caused by a random.shuffle() line in my code, which created a different input for every GPU and caused everything to hang. 🤦🏻‍♂️
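
For anyone who hits the same symptom, a minimal sketch of one way to avoid that divergence (assuming input_list is the prompt list from the repro script above; a fixed seed is just one option, shuffling on rank 0 and broadcasting the result would also work):

import random

# Seed the RNG identically on every rank so random.shuffle() produces the same
# order everywhere; otherwise the ranks generate different sequences and the
# collective operations never line up.
random.seed(1234)
random.shuffle(input_list)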

With that out of the way, I am now seeing weird behavior with the kernel inject:

  • multiple GPUs without kernel inject works great.
  • multiple GPUs with kernel inject and do_sample=False completes without errors but it generates garbage output. The output looks like this (\n####\n is where my prompt ends):

\n####\n!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"

  • multiple GPUs with kernel inject and do_sample=True throws an error:

[screenshot of the error, raised at the torch.multinomial() sampling step]

I added a quick print statement right before that torch.multinomial() step and it shows:

print(probs)
#tensor([[nan, nan, nan,  ..., nan, nan, nan]], device='cuda:1')
#tensor([[nan, nan, nan,  ..., nan, nan, nan]], device='cuda:2')
#tensor([[nan, nan, nan,  ..., nan, nan, nan]], device='cuda:0')
#tensor([[nan, nan, nan,  ..., nan, nan, nan]], device='cuda:3')
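
For completeness, a hypothetical sketch of the kind of guard that could catch this earlier (not part of transformers or the script above; probs stands for the tensor handed to torch.multinomial in the sampling loop, and local_rank for the env-derived rank from the repro script):

import torch

# Flag invalid sampling probabilities before torch.multinomial() is called.
if torch.isnan(probs).any() or torch.isinf(probs).any():
    print(f"rank {local_rank}: invalid probabilities, "
          f"min={probs.min().item()}, max={probs.max().item()}")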

The above issue and error also occur when setting max_length and min_length to the same value.

Any thoughts on what might be the issue here? Thanks again for your help!

P.S. My torch version is 1.10.0 and my transformers version is 4.16.0.dev0.

@RezaYazdaniAminabadi
Contributor

I did test this on the same versions you mentioned, except that I am using PyTorch 1.9. The code snippet I am using is as follows:

import os
import torch
import transformers

import deepspeed  # needed for deepspeed.init_inference below
from deepspeed import module_inject
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

generator = pipeline('text-generation',
                     model='EleutherAI/gpt-j-6B',
                     device=local_rank,
                     )
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.half,
                                           replace_method='auto',
                                           replace_with_kernel_inject=True)
string = generator("DeepSpeed is ", do_sample=True, min_length=50)
print(string)

@RezaYazdaniAminabadi
Contributor

Here is part of the result I am seeing for this example:

[screenshot of the generated text]

@TiesdeKok
Author

That little code snippet was very helpful for debugging what is happening here. My observations:

I was using the float16 revision, so I had to download the float32 version; I figured that might be it, but it didn't change anything. I got the same error as before when running the exact code you provided (I only fixed the deepspeed import):

[screenshot of the same error as before]

When turning off sampling I also saw the same weird behavior with the exclamation marks:

[screenshot of the exclamation-mark output]

However, given that it worked for you, there had to be something about my setup causing it, so I started turning dials:

  • Changing to transformers==4.15 --> no change
  • Changing to 2 GPUs --> no change

But then I tried deepspeed==0.5.10 and it all works again! Both your code snippet and my own code started working. This suggests to me that something introduced after 0.5.10 is causing things to break.

[screenshot of the correct output with deepspeed==0.5.10]

@tomerip

tomerip commented Feb 27, 2022

Hi @TiesdeKok,
I think this issue I opened might be relevant to your use case and worth a look:
#1797
It at least explains why you got the exclamation-mark outputs, and it should also draw your attention to the outputs you are getting in case you pad some of your inputs.

@lanking520

Hi @TiesdeKok, I am also facing the garbage output issue. Not sure whether it is related to the issue you were having previously: #2113
