[BUG] High VRAM Usage For Inference, Torch Dtype Doesn't Matter #2227
Comments
By changing the pipeline to the following, I now get VRAM usage of roughly 12GB per GPU. However, shouldn't the model be split over both GPUs, giving roughly 6GB each? Perhaps I am misunderstanding model parallelism. In either case, an issue similar to #2113 still occurs for GPTJ but not for GPT Neo: I get junk output using fp16 on two 3090s.
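For reference, a minimal sketch of the kind of pipeline change being described here (loading the checkpoint in fp16 first and handing the model and tokenizer objects to the pipeline); the model ID, prompt, and generation arguments are placeholders, not the exact script from this thread:

```python
# Hedged sketch: load the checkpoint in fp16 up front, then pass the model and
# tokenizer objects (rather than a model name) to the pipeline before calling
# deepspeed.init_inference. Model ID and prompt are placeholders.
import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

model_name = "EleutherAI/gpt-neo-2.7B"  # or "EleutherAI/gpt-j-6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.half)

generator = pipeline("text-generation", model=model, tokenizer=tokenizer,
                     device=local_rank)

generator.model = deepspeed.init_inference(
    generator.model,
    mp_size=world_size,                  # tensor-parallel degree
    dtype=torch.half,
    replace_with_kernel_inject=True,
)

print(generator("DeepSpeed is", max_new_tokens=50))
```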
GPTJ output:
GPT Neo 2.7B output:
Hi @mallorbc, the problem is that the model selected from HF is fp32, and the checkpoint is loaded before the model partitioning happens on the DeepSpeed-Inference side. To reduce the memory usage on GPU, you can remove the device at pipeline creation and pass the model to
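The comment trails off here; as I read the suggestion, the idea is to create the pipeline without a device argument, so the fp32 checkpoint stays on the CPU, and to let deepspeed.init_inference cast and partition it onto the GPUs. A hedged sketch of that pattern, with placeholder names:

```python
# Hedged sketch of the suggested pattern: no `device=` at pipeline creation, so the
# fp32 weights stay on CPU until DeepSpeed casts and partitions them across GPUs.
import os
import torch
import deepspeed
from transformers import pipeline

local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

generator = pipeline("text-generation", model="EleutherAI/gpt-neo-2.7B")

generator.model = deepspeed.init_inference(
    generator.model,
    mp_size=world_size,                  # split the layers across the available GPUs
    dtype=torch.half,                    # cast to fp16 during kernel injection
    replace_with_kernel_inject=True,
)
generator.device = torch.device(f"cuda:{local_rank}")  # assumed: point the pipeline at this rank's GPU

print(generator("DeepSpeed is", max_new_tokens=50))
```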
This issue is related to another issue that you opened regarding the MP issue for GPTJ. Please let me know if this PR solves the issue.
@RezaYazdaniAminabadi,
@mallorbc please re-open if you're still having issues.
Describe the bug
My understanding of model parallelism is that the model is split over multiple GPUs, lowering the memory usage per GPU, allowing larger models to fit, and speeding up inference. Thus, for GPT Neo 2.7B on two 3090s, I would expect the VRAM usage to be roughly 5.5GB per GPU, or 11GB in total if placed on one GPU.
The issue is that the VRAM usage is roughly 11GB on each GPU. Additionally, when selecting torch.half for the dtype, the VRAM usage stays high and does not change. Due to this problem, I also hit OOM errors for GPTJ on one or both GPUs.
To Reproduce
Steps to reproduce the behavior:
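The original reproduction snippet is not preserved here; below is a minimal sketch of the tutorial-style script implied by the report (model ID, prompt, and arguments are assumptions), launched with the deepspeed commands listed under "Launcher context" further down:

```python
# infer.py -- hedged reconstruction of the tutorial-style setup described in the report.
# Creating the pipeline with `device=local_rank` pulls the full fp32 checkpoint onto
# every rank before DeepSpeed gets a chance to partition it.
import os
import torch
import deepspeed
from transformers import pipeline

local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

generator = pipeline("text-generation", model="EleutherAI/gpt-neo-2.7B",
                     device=local_rank)

generator.model = deepspeed.init_inference(
    generator.model,
    mp_size=world_size,
    dtype=torch.half,        # requesting fp16 here did not change VRAM usage in this report
    replace_with_kernel_inject=True,
)

print(generator("DeepSpeed is", max_new_tokens=50))
```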
Expected behavior
I would expect VRAM usage to decrease as you use multiple GPUs. I would also expect the VRAM usage to decrease when using lower precision data types.
ds_report output
ds_report
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
async_io ............... [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/root/anaconda3/envs/gpt/lib/python3.9/site-packages/torch']
torch version .................... 1.12.0
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed install path ........... ['/root/anaconda3/envs/gpt/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.7.1+7d8ad45, 7d8ad45, master
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3
System info (please complete the following information):
I am using a Docker container with NVIDIA CUDA already set up as the base image.
Launcher context
deepspeed --num_gpus 2 infer.py
deepspeed --num_gpus 1 infer.py
Both lead to the same VRAM usage per GPU.
Docker context
Are you using a specific docker image that you can share?
nvidia/cuda:11.3.1-devel-ubuntu20.04
I then build the Python packages into the container.
Additional context
When setting up the pipeline like the tutorial, if I load the model first with the data type I want and then pass the model and tokenizer to the pipeline, it seems to lower the VRAM usage on one of the GPUs. Doing this and running with only one GPU rather than two has let me use GPTJ, but I want to use two GPUs for inference.
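A compact sketch of that single-GPU GPTJ workaround, following the same pattern as the earlier snippet (model ID and arguments are placeholders):

```python
# Hedged sketch: load GPT-J in fp16 before building the pipeline, then run on a
# single GPU (mp_size=1). Model ID and arguments are placeholders.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name = "EleutherAI/gpt-j-6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.half)

generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)
generator.model = deepspeed.init_inference(generator.model, mp_size=1,
                                           dtype=torch.half,
                                           replace_with_kernel_inject=True)

print(generator("DeepSpeed is", max_new_tokens=50))
```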