
[BUG] deepspeed-inference seems not working correctly with torch.half on Pascal GPU #2194

Closed
wkkautas opened this issue Aug 8, 2022 · 8 comments
Labels: bug, inference

wkkautas commented Aug 8, 2022

Describe the bug

Thanks for releasing deepspeed-inference.
I'm following the tutorial at https://www.deepspeed.ai/tutorials/inference-tutorial/#end-to-end-gpt-neo-27b-inference and want to run inference in half precision by setting dtype=torch.half.
However, on a Tesla P40 it does not seem to work correctly, generating meaningless text such as
[{'generated_text': 'DeepSpeed is that one one S\'s of more it his B in B I it a I and an- two The an high B it all.. or old in a D of B T the,\n F and the " S S The'}].
As a side note, when I switched to a Tesla T4 with the same environment and script, the issue was not observed (log attached under Additional context).
Are Pascal GPUs not supported by deepspeed-inference?
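For reference, the GPU compute capability can be checked from torch (a minimal check, nothing DeepSpeed-specific assumed):

import torch

# Reports (6, 1) on a Pascal P40 and (7, 5) on a Turing T4.
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")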

To Reproduce

# Filename: gpt-neo-2.7b-generation-float16.py
import os
import deepspeed
import torch
from transformers import pipeline

# Rank and world size are set by the deepspeed launcher.
local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B',
                     device=local_rank)

# Replace the model's transformer layers with DeepSpeed's fused
# inference kernels, running in half precision.
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.half,
                                           replace_method='auto',
                                           replace_with_kernel_inject=True)

string = generator("DeepSpeed is", do_sample=True, min_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)
$ deepspeed gpt-neo-2.7b-generation-float16.py
[2022-08-08 13:35:04,866] [WARNING] [runner.py:178:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2022-08-08 13:35:04,866] [INFO] [runner.py:504:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 gpt-neo-2.7b-generation-float16.py
[2022-08-08 13:35:06,221] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.8.4-1+cuda11.2
[2022-08-08 13:35:06,221] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.8.4-1
[2022-08-08 13:35:06,221] [INFO] [launch.py:129:main] 0 NCCL_VERSION=2.8.4-1
[2022-08-08 13:35:06,221] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2022-08-08 13:35:06,221] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.8.4-1+cuda11.2
[2022-08-08 13:35:06,221] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2022-08-08 13:35:06,221] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.8.4-1
[2022-08-08 13:35:06,221] [INFO] [launch.py:136:main] WORLD INFO DICT: {'localhost': [0]}
[2022-08-08 13:35:06,221] [INFO] [launch.py:142:main] nnodes=1, num_local_procs=1, node_rank=0
[2022-08-08 13:35:06,221] [INFO] [launch.py:155:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2022-08-08 13:35:06,221] [INFO] [launch.py:156:main] dist_world_size=1
[2022-08-08 13:35:06,221] [INFO] [launch.py:158:main] Setting CUDA_VISIBLE_DEVICES=0
vocab_file vocab.json
merges_file merges.txt
tokenizer_file tokenizer.json
added_tokens_file added_tokens.json
special_tokens_map_file special_tokens_map.json
tokenizer_config_file tokenizer_config.json
[2022-08-08 13:35:59,433] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed info: version=0.7.0, git-hash=unknown, git-branch=unknown
[2022-08-08 13:35:59,434] [INFO] [logging.py:68:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Installed CUDA version 11.2 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Using /tmp/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /tmp/.cache/torch_extensions/py38_cu113/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.25583362579345703 seconds
[2022-08-08 13:36:00,342] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 2560, 'intermediate_size': 10240, 'heads': 20, 'num_hidden_layers': -1, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False}
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
/transformers/src/transformers/generation_utils.py:1202: UserWarning: Neither `max_length` nor `max_new_tokens` have been set, `max_length` will default to 50 (`self.config.max_length`). Controlling `max_length` via the config is deprecated and `max_length` will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
[{'generated_text': 'DeepSpeed is that one one S\'s of more it his B in B I it a I and an- two The an high B it all.. or old in a D of B T the,\n F and the " S S The'}]
[2022-08-08 13:36:14,299] [INFO] [launch.py:318:main] Process 33 exits successfully.

Expected behavior
generated_text should contain meaningful text.

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.12.0+cu113
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.2
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.7.0, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3

System info (please complete the following information):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P40           On   | 00000001:00:00.0 Off |                  Off |
| N/A   22C    P8     9W / 250W |      0MiB / 24451MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Additional context
When I switched to a Tesla T4, the issue was not observed.

$ deepspeed gpt-neo-2.7b-generation-float16.py
[2022-08-08 13:40:14,922] [WARNING] [runner.py:178:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2022-08-08 13:40:14,923] [INFO] [runner.py:504:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 gpt-neo-2.7b-generation-float16.py
[2022-08-08 13:40:16,024] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.8.4-1+cuda11.2
[2022-08-08 13:40:16,024] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.8.4-1
[2022-08-08 13:40:16,024] [INFO] [launch.py:129:main] 0 NCCL_VERSION=2.8.4-1
[2022-08-08 13:40:16,025] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2022-08-08 13:40:16,025] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.8.4-1+cuda11.2
[2022-08-08 13:40:16,025] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2022-08-08 13:40:16,025] [INFO] [launch.py:129:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.8.4-1
[2022-08-08 13:40:16,025] [INFO] [launch.py:136:main] WORLD INFO DICT: {'localhost': [0]}
[2022-08-08 13:40:16,025] [INFO] [launch.py:142:main] nnodes=1, num_local_procs=1, node_rank=0
[2022-08-08 13:40:16,025] [INFO] [launch.py:155:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2022-08-08 13:40:16,025] [INFO] [launch.py:156:main] dist_world_size=1
[2022-08-08 13:40:16,025] [INFO] [launch.py:158:main] Setting CUDA_VISIBLE_DEVICES=0
vocab_file vocab.json
merges_file merges.txt
tokenizer_file tokenizer.json
added_tokens_file added_tokens.json
special_tokens_map_file special_tokens_map.json
tokenizer_config_file tokenizer_config.json
[2022-08-08 13:42:26,151] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed info: version=0.7.0, git-hash=unknown, git-branch=unknown
[2022-08-08 13:42:26,152] [INFO] [logging.py:68:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Installed CUDA version 11.2 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Using /tmp/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /tmp/.cache/torch_extensions/py38_cu113/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.25422143936157227 seconds
[2022-08-08 13:42:26,994] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 2560, 'intermediate_size': 10240, 'heads': 20, 'num_hidden_layers': -1, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False}
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
/transformers/src/transformers/generation_utils.py:1202: UserWarning: Neither `max_length` nor `max_new_tokens` have been set, `max_length` will default to 50 (`self.config.max_length`). Controlling `max_length` via the config is deprecated and `max_length` will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
[{'generated_text': 'DeepSpeed is the result of his experiences with the U.S. Army. He served from 2000 to 2004 as a Combat Medic in Special Forces with the 2nd Platoon, 1st Sustainment Brigade. He also has served as a Fire'}]
[2022-08-08 13:42:39,178] [INFO] [launch.py:318:main] Process 29 exits successfully.
wkkautas added the bug label on Aug 8, 2022

zcrypt0 commented Aug 10, 2022

This might be related to the weird output I see with bigscience/bloom-350m, since I am using a 1080 Ti in those tests.

mrwyattii (Contributor) commented

@zcrypt0 the 350m model is one you shouldn't use. I believe it's not fully trained and will therefore output garbage. The model page was recently updated and HF suggests you use the 560m version:
[screenshot of the updated model page recommending bloom-560m]

mrwyattii (Contributor) commented

@wkkautas Thanks for reporting your issue. I'll try to reproduce what you're seeing and report back.

mrwyattii self-assigned this on Aug 11, 2022

zcrypt0 commented Aug 12, 2022

@mrwyattii I saw weird output with other models too, including the inference tutorial. I switched over to Volta now and everything is working well.

mrwyattii (Contributor) commented

@wkkautas I am able to reproduce this error on a P40. I also noticed that MP>1 with FP16 breaks here (though FP32 seems to work). We are working on a solution.
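In the meantime, a possible workaround is to pick the dtype from the device's compute capability so pre-Volta GPUs fall back to FP32. A sketch, assuming only FP16 is affected on pre-Volta parts (not an official fix); everything but the dtype selection matches the repro script:

import os
import torch
import deepspeed
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

# FP16 kernels appear to misbehave below compute capability 7.0 (pre-Volta),
# so fall back to FP32 there.
major, _ = torch.cuda.get_device_capability(local_rank)
dtype = torch.half if major >= 7 else torch.float

generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B',
                     device=local_rank)
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=dtype,
                                           replace_method='auto',
                                           replace_with_kernel_inject=True)
print(generator("DeepSpeed is", do_sample=True, min_length=50))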

wkkautas (Author) commented Nov 9, 2022

Just removing these #if guards seems to work.

Is there any reason to require a CUDA arch of at least 700? (A P40 would be 610.)
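For reference, __CUDA_ARCH__ encodes the compute capability as major*100 + minor*10, which is where the 700 and 610 values come from (a minimal illustration):

import torch

# __CUDA_ARCH__ = major*100 + minor*10: 610 on a P40 (6.1), 750 on a T4 (7.5).
major, minor = torch.cuda.get_device_capability(0)
arch = major * 100 + minor * 10
print(arch, arch >= 700)  # P40: "610 False" -- code behind an `#if __CUDA_ARCH__ >= 700` guard compiles out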
Thank you,

mrwyattii removed their assignment on Dec 2, 2022
cmikeh2 (Contributor) commented Dec 5, 2022

Hi @wkkautas,

PR #2574 should fix the issue you are seeing. If you have time, please try it on your end to confirm it works as expected. Thanks!

cmikeh2 (Contributor) commented Dec 9, 2022

If you are still seeing this issue, please reopen.

cmikeh2 closed this as completed on Dec 9, 2022