
[BUG] deepspeed-inference seems not working correctly with torch.half on Pascal GPU #2194

Closed
wkkautas opened this issue Aug 8, 2022 · 8 comments
Labels: bug, inference

wkkautas commented Aug 8, 2022

Describe the bug

Thanks for releasing deepspeed-inference.
I'm following the tutorial at https://www.deepspeed.ai/tutorials/inference-tutorial/#end-to-end-gpt-neo-27b-inference and want to run inference in half precision by setting dtype=torch.half.
However, on a Tesla P40 it does not seem to work correctly, generating meaningless text such as
[{'generated_text': 'DeepSpeed is that one one S\'s of more it his B in B I it a I and an- two The an high B it all.. or old in a D of B T the,\n F and the " S S The'}].
As a side note, when I switched to a Tesla T4 with the same environment and script, the issue was not observed (log attached under Additional context).
Are Pascal GPUs not supported by deepspeed-inference?
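For reference, the GPU compute capability can be checked from torch (a minimal check, nothing DeepSpeed-specific assumed):

import torch

# Reports (6, 1) on a Pascal P40 and (7, 5) on a Turing T4.
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")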

To Reproduce

# Filename: gpt-neo-2.7b-generation-float16.py
import os
import deepspeed
import torch
from transformers import pipeline

# Rank and world size are set by the deepspeed launcher.
local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B',
                     device=local_rank)

# Replace the model's transformer layers with DeepSpeed's fused
# inference kernels, running in half precision.
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.half,
                                           replace_method='auto',
                                           replace_with_kernel_inject=True)

string = generator("DeepSpeed is", do_sample=True, min_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)
$ deepspeed gpt-neo-2.7b-generation-float16.py
[2022-08-08 13:35:04,866] [WARNING] [runner.py:178:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2022-08-08 13:35:04,866] [INFO] [runner.py:504:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 gpt-neo-2.7b-generation-float16.py
[2022-08-08 13:35:06,221] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.8.4-1+cuda11.2
[2022-08-08 13:35:06,221] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.8.4-1
[2022-08-08 13:35:06,221] [INFO] [launch.py:129:main] 0 NCCL_VERSION=2.8.4-1
[2022-08-08 13:35:06,221] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2022-08-08 13:35:06,221] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.8.4-1+cuda11.2
[2022-08-08 13:35:06,221] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2022-08-08 13:35:06,221] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.8.4-1
[2022-08-08 13:35:06,221] [INFO] [launch.py:136:main] WORLD INFO DICT: {'localhost': [0]}
[2022-08-08 13:35:06,221] [INFO] [launch.py:142:main] nnodes=1, num_local_procs=1, node_rank=0
[2022-08-08 13:35:06,221] [INFO] [launch.py:155:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2022-08-08 13:35:06,221] [INFO] [launch.py:156:main] dist_world_size=1
[2022-08-08 13:35:06,221] [INFO] [launch.py:158:main] Setting CUDA_VISIBLE_DEVICES=0
vocab_file vocab.json
merges_file merges.txt
tokenizer_file tokenizer.json
added_tokens_file added_tokens.json
special_tokens_map_file special_tokens_map.json
tokenizer_config_file tokenizer_config.json
[2022-08-08 13:35:59,433] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed info: version=0.7.0, git-hash=unknown, git-branch=unknown
[2022-08-08 13:35:59,434] [INFO] [logging.py:68:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Installed CUDA version 11.2 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Using /tmp/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /tmp/.cache/torch_extensions/py38_cu113/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.25583362579345703 seconds
[2022-08-08 13:36:00,342] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 2560, 'intermediate_size': 10240, 'heads': 20, 'num_hidden_layers': -1, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False}
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
/transformers/src/transformers/generation_utils.py:1202: UserWarning: Neither `max_length` nor `max_new_tokens` have been set, `max_length` will default to 50 (`self.config.max_length`). Controlling `max_length` via the config is deprecated and `max_length` will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
[{'generated_text': 'DeepSpeed is that one one S\'s of more it his B in B I it a I and an- two The an high B it all.. or old in a D of B T the,\n F and the " S S The'}]
[2022-08-08 13:36:14,299] [INFO] [launch.py:318:main] Process 33 exits successfully.

Expected behavior
generated_text should contain meaningful text.

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.12.0+cu113
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.2
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.7.0, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3

System info (please complete the following information):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P40           On   | 00000001:00:00.0 Off |                  Off |
| N/A   22C    P8     9W / 250W |      0MiB / 24451MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Additional context
When I switched to a Tesla T4, the issue was not observed.

$ deepspeed gpt-neo-2.7b-generation-float16.py
[2022-08-08 13:40:14,922] [WARNING] [runner.py:178:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2022-08-08 13:40:14,923] [INFO] [runner.py:504:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 gpt-neo-2.7b-generation-float16.py
[2022-08-08 13:40:16,024] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.8.4-1+cuda11.2
[2022-08-08 13:40:16,024] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.8.4-1
[2022-08-08 13:40:16,024] [INFO] [launch.py:129:main] 0 NCCL_VERSION=2.8.4-1
[2022-08-08 13:40:16,025] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2022-08-08 13:40:16,025] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.8.4-1+cuda11.2
[2022-08-08 13:40:16,025] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2022-08-08 13:40:16,025] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.8.4-1
[2022-08-08 13:40:16,025] [INFO] [launch.py:136:main] WORLD INFO DICT: {'localhost': [0]}
[2022-08-08 13:40:16,025] [INFO] [launch.py:142:main] nnodes=1, num_local_procs=1, node_rank=0
[2022-08-08 13:40:16,025] [INFO] [launch.py:155:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2022-08-08 13:40:16,025] [INFO] [launch.py:156:main] dist_world_size=1
[2022-08-08 13:40:16,025] [INFO] [launch.py:158:main] Setting CUDA_VISIBLE_DEVICES=0
vocab_file vocab.json
merges_file merges.txt
tokenizer_file tokenizer.json
added_tokens_file added_tokens.json
special_tokens_map_file special_tokens_map.json
tokenizer_config_file tokenizer_config.json
[2022-08-08 13:42:26,151] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed info: version=0.7.0, git-hash=unknown, git-branch=unknown
[2022-08-08 13:42:26,152] [INFO] [logging.py:68:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Installed CUDA version 11.2 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Using /tmp/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /tmp/.cache/torch_extensions/py38_cu113/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.25422143936157227 seconds
[2022-08-08 13:42:26,994] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 2560, 'intermediate_size': 10240, 'heads': 20, 'num_hidden_layers': -1, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False}
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
/transformers/src/transformers/generation_utils.py:1202: UserWarning: Neither `max_length` nor `max_new_tokens` have been set, `max_length` will default to 50 (`self.config.max_length`). Controlling `max_length` via the config is deprecated and `max_length` will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
[{'generated_text': 'DeepSpeed is the result of his experiences with the U.S. Army. He served from 2000 to 2004 as a Combat Medic in Special Forces with the 2nd Platoon, 1st Sustainment Brigade. He also has served as a Fire'}]
[2022-08-08 13:42:39,178] [INFO] [launch.py:318:main] Process 29 exits successfully.
wkkautas added the bug label on Aug 8, 2022

zcrypt0 commented Aug 10, 2022

This might be related to the weird output I see with bigscience/bloom-350m, since I am using a 1080 Ti in those tests.

mrwyattii (Contributor) commented

@zcrypt0 the 350m model is one you shouldn't use. I believe it's not fully trained and will therefore output garbage. The model page was recently updated and HF suggests you use the 560m version:
[screenshot of the updated model page recommending bloom-560m]

mrwyattii (Contributor) commented

@wkkautas Thanks for reporting your issue. I'll try to reproduce what you're seeing and report back.

mrwyattii self-assigned this on Aug 11, 2022

zcrypt0 commented Aug 12, 2022

@mrwyattii I saw weird output with other models too, including the inference tutorial. I switched over to Volta now and everything is working well.

mrwyattii (Contributor) commented

@wkkautas I am able to reproduce this error on a P40. I also noticed that MP>1 with FP16 breaks here (though FP32 seems to work). We are working on a solution.
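In the meantime, a possible workaround is to pick the dtype from the device's compute capability so pre-Volta GPUs fall back to FP32. A sketch, assuming only FP16 is affected on pre-Volta parts (not an official fix); everything but the dtype selection matches the repro script:

import os
import torch
import deepspeed
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

# FP16 kernels appear to misbehave below compute capability 7.0 (pre-Volta),
# so fall back to FP32 there.
major, _ = torch.cuda.get_device_capability(local_rank)
dtype = torch.half if major >= 7 else torch.float

generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B',
                     device=local_rank)
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=dtype,
                                           replace_method='auto',
                                           replace_with_kernel_inject=True)
print(generator("DeepSpeed is", do_sample=True, min_length=50))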

wkkautas (Author) commented Nov 9, 2022

Just removing these #if guards seems to work.

Is there any reason to require a CUDA arch of at least 700? (A P40 would be 610.)
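For reference, __CUDA_ARCH__ encodes the compute capability as major*100 + minor*10, which is where the 700 and 610 values come from (a minimal illustration):

import torch

# __CUDA_ARCH__ = major*100 + minor*10: 610 on a P40 (6.1), 750 on a T4 (7.5).
major, minor = torch.cuda.get_device_capability(0)
arch = major * 100 + minor * 10
print(arch, arch >= 700)  # P40: "610 False" -- code behind an `#if __CUDA_ARCH__ >= 700` guard compiles out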
Thank you,

mrwyattii removed their assignment on Dec 2, 2022
cmikeh2 (Contributor) commented Dec 5, 2022

Hi @wkkautas,

PR #2574 should fix the issue you are seeing. If you have time, please try it on your end to confirm it works as expected. Thanks!

cmikeh2 (Contributor) commented Dec 9, 2022

If you are still seeing this issue, please reopen.

cmikeh2 closed this as completed on Dec 9, 2022