[BUG] High VRAM Usage For Inference, Torch Dtype Doesn't Matter #2227

Closed · mallorbc opened this issue Aug 17, 2022 · 7 comments
Labels: bug (Something isn't working), inference

mallorbc commented Aug 17, 2022

Describe the bug

My understanding of model parallelism is that the model is split over multiple GPUs to lower the memory usage per GPU, allowing larger models and speeding up inference. Thus, for GPT Neo 2.7B on two 3090s, I would expect the VRAM usage to be roughly 5.5GB per GPU, versus roughly 11GB if the whole model were placed on one GPU.

The issue is that the VRAM usage is roughly 11GB on each GPU. Additionally, when selecting torch.half for the dtype, the VRAM usage stays high and does not change. Due to this problem, I also hit OOM for GPTJ on one or both GPUs.
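
For reference, a rough back-of-the-envelope sketch of the numbers I am expecting, counting weights only (activations and kernel workspace ignored):

# Weight memory only for GPT Neo 2.7B; activations and CUDA workspace are not included.
params = 2.7e9
fp32_total = params * 4 / 1e9   # ~10.8 GB of weights in fp32
fp16_total = params * 2 / 1e9   # ~5.4 GB of weights in fp16
print(f"fp32 on one GPU:        {fp32_total:.1f} GB")
print(f"fp32 split over 2 GPUs: {fp32_total / 2:.1f} GB per GPU")
print(f"fp16 split over 2 GPUs: {fp16_total / 2:.1f} GB per GPU")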

To Reproduce

Steps to reproduce the behavior:

  1. Install DeepSpeed from source
  2. Install Transformers from pip
  3. Follow the inference tutorial
  4. Watch VRAM usage or experience OOM for larger models

Expected behavior

I would expect VRAM usage to decrease as you use multiple GPUs. I would also expect the VRAM usage to decrease when using lower precision data types.

ds_report output

ds_report

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
async_io ............... [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/root/anaconda3/envs/gpt/lib/python3.9/site-packages/torch']
torch version .................... 1.12.0
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed install path ........... ['/root/anaconda3/envs/gpt/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.7.1+7d8ad45, 7d8ad45, master
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3

System info (please complete the following information):

  • OS: Ubuntu 20.04
  • GPU count and types: 2x RTX 3090
  • Interconnects: single system, 2x RTX 3090
  • Python version: 3.9.13

I am using a Docker container with NVIDIA CUDA already set up as the base image.

Launcher context

deepspeed --num_gpus 2 infer.py
deepspeed --num_gpus 1 infer.py
Both lead to the same VRAM usage per GPU.

Docker context

Are you using a specific docker image that you can share?
nvidia/cuda:11.3.1-devel-ubuntu20.04
I then build the Python packages into the container.

Additional context

When setting up the pipeline as in the tutorial, if I load the model first with the data type I want and then pass the model and tokenizer to the pipeline, it seems to lower the VRAM usage on one of the GPUs. Doing this and running with only one GPU rather than two has allowed me to use GPTJ, but I want to use two GPUs for inference.

# this works for one GPU
import os

import torch
import deepspeed
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

# local_rank / world_size are set by the deepspeed launcher
local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

model_name = 'EleutherAI/gpt-j-6B'
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
generator = pipeline('text-generation', model=model, tokenizer=tokenizer, device=local_rank)

generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.half,
                                           replace_method='auto',
                                           replace_with_kernel_inject=True)


# this leads to OOM for 2 3090s, with one GPU's usage being normal and the other's not
model_name = 'EleutherAI/gpt-j-6B'
model = AutoModelForCausalLM.from_pretrained(model_name)  # no torch_dtype, so loads in fp32
tokenizer = AutoTokenizer.from_pretrained(model_name)
generator = pipeline('text-generation', model=model, tokenizer=tokenizer, device=local_rank)

generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.half,
                                           replace_method='auto',
                                           replace_with_kernel_inject=True)


# this also leads to OOM for 2 3090s
model_name = 'EleutherAI/gpt-j-6B'
generator = pipeline('text-generation', model=model_name, device=local_rank)

generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.half,
                                           replace_method='auto',
                                           replace_with_kernel_inject=True)


@mallorbc mallorbc added the bug Something isn't working label Aug 17, 2022
@mallorbc mallorbc changed the title [BUG] High VRAM Usage For Inference, torch dtype doesn't matter [BUG] High VRAM Usage For Inference, Torch Dtype Doesn't Matter Aug 17, 2022
@mallorbc (Author) commented:

By changing the pipeline to the following, I now get VRAM usage of roughly 12GB per GPU. However, shouldn't the model be split over both GPUs and thus be roughly 6GB each? Perhaps I am misunderstanding model parallelism. In either case, an issue similar to #2113 still occurs for GPTJ but not GPT Neo: I get junk output using fp16 on two 3090s.

generator = pipeline('text-generation', model=model_name, device=local_rank, torch_dtype=torch.float16)

GPTJ output:

[{'generated_text': 'DeepSpeed is,: to,,/ &.. by and.. a\n.. and- and.. the,,\n of\n [.,.\n:, &-. and a- the,\n\n). the'}]

GPT Neo 2.7B output:

[{'generated_text': 'DeepSpeed is a speedup technique for Java programs.\nSee  http://www.cs.indiana.edu/~mh/classes/java/cs405/labs/gong/\nHere are some examples.\n\n'}]

@RezaYazdaniAminabadi (Contributor) commented:

Hi @mallorbc ,

The problem is that the model selected from HF is fp32, and the checkpoint is loaded before it reaches the model partitioning on the DeepSpeed-Inference side. To reduce the memory usage on GPU, you can remove the device at pipeline creation and pass the model to deepspeed.init_inference to do the model partitioning, then set the device for the pipeline after the deepspeed.init_inference call. Does this make sense?
Thanks,
Reza
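
A minimal sketch of that flow, assuming the same GPT-J pipeline as in the snippets above (assigning generator.device directly after init_inference is one way to bind the pipeline to the local GPU, though the exact handling may depend on the transformers version):

# Sketch: create the pipeline without a device, partition with DeepSpeed, then move it to the local GPU.
import os

import torch
import deepspeed
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

# No device here, so the fp32 checkpoint stays on the CPU while loading.
generator = pipeline('text-generation', model='EleutherAI/gpt-j-6B')

# Let DeepSpeed cast to fp16 and partition the model across the GPUs.
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.half,
                                           replace_method='auto',
                                           replace_with_kernel_inject=True)

# Only now point the pipeline at the local GPU.
generator.device = torch.device(f'cuda:{local_rank}')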

@RezaYazdaniAminabadi (Contributor) commented:

> By changing the pipeline to the following, I now get VRAM usage of roughly 12GB per GPU. However, shouldn't the model be split over both GPUs and thus be roughly 6GB each? […]

This is related to the other issue you opened regarding MP for GPTJ. Please let me know if this PR solves the issue.
Thanks,
Reza

@mallorbc (Author) commented:

> The problem is that the model selected from HF is fp32, and the checkpoint is loaded before it reaches the model partitioning on the DeepSpeed-Inference side. […]

@RezaYazdaniAminabadi,
I did figure out how to reduce the memory usage from full fp32 down to the memory expected for fp16, but that usage appears on EACH GPU. So for GPTJ, roughly 12GB of memory is used on each GPU. Shouldn't it be roughly 12GB in total, i.e. about 6GB on each GPU?
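
The rough fp16 arithmetic behind that expectation, counting weights only:

# Weight memory only for GPT-J (~6B parameters) in fp16; activations and kernel workspace excluded.
params = 6e9
fp16_total = params * 2 / 1e9   # ~12 GB of fp16 weights in total
print(f"{fp16_total:.0f} GB total, {fp16_total / 2:.0f} GB per GPU with 2-way model parallelism")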
Thanks for your help.

@RezaYazdaniAminabadi (Contributor) commented:

Hi @mallorbc,

We have added a test-suite here that measures the memory consumption after init_inference and also after the pipeline creation. Can you please try it to see if the memory is as expected?
You can follow the instructions here to call the inference_test.
Thanks,
Reza

@mallorbc (Author) commented:

> We have added a test-suite here that measures the memory consumption after init_inference and also after the pipeline creation. […]

I will try this. Thanks!

@jeffra (Collaborator) commented Dec 12, 2022

@mallorbc please re-open if you're still having issues

@jeffra jeffra closed this as completed Dec 12, 2022