GPT-NEO model parallelism not working #1209

Closed
switiz opened this issue Jul 8, 2021 · 3 comments · Fixed by #1244

Comments


switiz commented Jul 8, 2021

@RezaYazdaniAminabadi
When I run inference with the GPT-Neo 1.3B model using the tutorial code, model parallelism does not seem to work. I tested several times on multiple A100 GPUs. I would expect per-GPU memory usage to decrease as I add GPUs, but it actually increases.

In addition, host RAM consumption grows in proportion to the number of processes (one per GPU), so if I keep increasing the number of GPUs I eventually hit a host OOM.

What should I do?
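For reference, the host-RAM growth is consistent with every launched process deserializing its own full fp32 copy of the checkpoint before DeepSpeed gets a chance to shard it. A rough back-of-the-envelope sketch (the parameter count and the assumption of fp32 weights only, ignoring tokenizer and activation buffers, are mine):

```python
# Rough estimate of host RAM consumed while loading the checkpoint:
# each launcher process materializes a full fp32 copy of the weights
# before deepspeed.init_inference() can partition them.
PARAMS = 1.3e9        # GPT-Neo 1.3B parameter count (approximate)
BYTES_PER_PARAM = 4   # fp32

def host_ram_gb(num_gpus: int) -> float:
    """Approximate combined host RAM (GB) across all launched processes."""
    return num_gpus * PARAMS * BYTES_PER_PARAM / 1e9

for n in (1, 2, 4, 8):
    print(f"{n} GPUs -> ~{host_ram_gb(n):.1f} GB host RAM just for weights")
```

This matches the reported behavior: host usage scales as roughly (weights size) × num_gpus.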

code

import os
import time

import torch
import deepspeed
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
model = AutoModelForCausalLM.from_pretrained('/workspace/home/hugging_gpt/gpt_torch/')
generator = pipeline('text-generation', model=model, tokenizer=tokenizer, device=local_rank)

# Wrap the model with DeepSpeed inference; mp_size should shard it across ranks.
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.float,
                                           replace_method='auto')

prompt = "Q: why not work? ? A:"
start_time = time.time()
sample_outputs = generator(prompt, do_sample=True, top_k=50,
                           max_length=300, top_p=0.95, temperature=1.5,
                           num_return_sequences=1)

# Print only on rank 0 (world_size is always >= 1, so check the rank directly).
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(f'spend:{time.time()-start_time}')
    for i, sample_output in enumerate(sample_outputs):
        print("[index]: {} \n[decode]: {}".format(i, sample_output))
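One way to check whether the transformer weights were actually partitioned is to count the parameters each rank holds after `init_inference`; under tensor parallelism the per-rank total should shrink to roughly total/mp_size (embeddings may stay replicated). A minimal sketch; the helper name and the assumption that the engine exposes the injected model via `.module` are mine:

```python
# Hypothetical helper: count the parameters resident on this rank.
# With tensor (model) parallelism at mp_size ranks, attention and MLP
# weights should be split, so this total should drop roughly 1/mp_size.
def count_params(module) -> int:
    return sum(p.numel() for p in module.parameters())

# After init_inference, something like:
# local = count_params(generator.model.module)
# print(f"rank {local_rank}: {local / 1e9:.2f}B parameters resident")
```

If the count stays at the full model size on every rank, the layers were not replaced/sharded.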

test result

1gpu
[0] A100-SXM4-40GB   | 38'C,   0 % |  7966 / 40537 MB |

2gpu
[0] A100-SXM4-40GB   | 40'C,  33 % |  8704 / 40537 MB |
[1] A100-SXM4-40GB   | 38'C,  87 % |  8706 / 40537 MB |
 
4 gpu
[0] A100-SXM4-40GB   | 39'C,  50 % |  8318 / 40537 MB |
[1] A100-SXM4-40GB   | 37'C,  45 % |  8654 / 40537 MB |
[2] A100-SXM4-40GB   | 39'C,  79 % |  8654 / 40537 MB |
[3] A100-SXM4-40GB   | 40'C,  34 % |  8352 / 40537 MB |
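If the ~5.2 GB of fp32 weights were really sharded, the per-GPU figure should fall noticeably as mp_size grows; instead it hovers around 8 GB and even rises slightly, which suggests each rank still holds a full copy. A rough expectation, with a hypothetical constant per-rank overhead (CUDA context, activations) that I chose to roughly match the 1-GPU reading:

```python
# Expected per-GPU memory (MB): sharded weights vs. a full replica per rank.
WEIGHTS_MB = 1.3e9 * 4 / 1e6   # ~5200 MB of fp32 weights
OVERHEAD_MB = 2800             # hypothetical CUDA context + activation buffers

def per_gpu_mb(mp_size: int, sharded: bool) -> float:
    weights = WEIGHTS_MB / mp_size if sharded else WEIGHTS_MB
    return weights + OVERHEAD_MB

for n in (1, 2, 4):
    print(f"mp_size={n}: sharded ~{per_gpu_mb(n, True):.0f} MB, "
          f"replicated ~{per_gpu_mb(n, False):.0f} MB")
```

The measured numbers track the "replicated" column, not the "sharded" one.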

logs

root@bf1d5639fc32:/workspace/home/gpt-neo# deepspeed --num_gpus 4 docker_gpt.py 
[2021-07-08 12:32:40,447] [WARNING] [runner.py:122:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.

[2021-07-08 12:32:46,130] [INFO] [runner.py:360:main] cmd = /opt/conda/bin/python3.8 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 docker_gpt.py
[2021-07-08 12:32:47,027] [INFO] [launch.py:73:main] 0 NCCL_VERSION 2.9.9
[2021-07-08 12:32:47,027] [INFO] [launch.py:80:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2021-07-08 12:32:47,027] [INFO] [launch.py:86:main] nnodes=1, num_local_procs=4, node_rank=0
[2021-07-08 12:32:47,027] [INFO] [launch.py:101:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2021-07-08 12:32:47,027] [INFO] [launch.py:102:main] dist_world_size=4
[2021-07-08 12:32:47,027] [INFO] [launch.py:104:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2021-07-08 12:33:20,227] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed info: version=0.4.2, git-hash=unknown, git-branch=unknown
[2021-07-08 12:33:20,228] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
[2021-07-08 12:33:22,391] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed info: version=0.4.2, git-hash=unknown, git-branch=unknown
[2021-07-08 12:33:22,392] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
[2021-07-08 12:33:24,349] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed info: version=0.4.2, git-hash=unknown, git-branch=unknown
[2021-07-08 12:33:24,350] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
[2021-07-08 12:33:28,081] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed info: version=0.4.2, git-hash=unknown, git-branch=unknown
[2021-07-08 12:33:28,082] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
Using /root/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Using /root/.cache/torch_extensions as PyTorch extensions root...
Using /root/.cache/torch_extensions as PyTorch extensions root...
ninja: no work to do.
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.2734529972076416 seconds
DeepSpeed Transformer Inference config is  {'layer_id': 0, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': False, 'window_size': 256}
DeepSpeed Transformer Inference config is  {'layer_id': 1, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': True, 'window_size': 256}
DeepSpeed Transformer Inference config is  {'layer_id': 2, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': False, 'window_size': 256}
DeepSpeed Transformer Inference config is  {'layer_id': 3, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': True, 'window_size': 256}
DeepSpeed Transformer Inference config is  {'layer_id': 4, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': False, 'window_size': 256}
DeepSpeed Transformer Inference config is  {'layer_id': 5, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': True, 'window_size': 256}
DeepSpeed Transformer Inference config is  {'layer_id': 6, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': False, 'window_size': 256}
DeepSpeed Transformer Inference config is  {'layer_id': 7, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': True, 'window_size': 256}
DeepSpeed Transformer Inference config is  {'layer_id': 8, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': False, 'window_size': 256}
DeepSpeed Transformer Inference config is  {'layer_id': 9, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': True, 'window_size': 256}
DeepSpeed Transformer Inference config is  {'layer_id': 10, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': False, 'window_size': 256}
DeepSpeed Transformer Inference config is  {'layer_id': 11, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': True, 'window_size': 256}
DeepSpeed Transformer Inference config is  {'layer_id': 12, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': False, 'window_size': 256}
DeepSpeed Transformer Inference config is  {'layer_id': 13, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': True, 'window_size': 256}
Using /root/.cache/torch_extensions as PyTorch extensions root...
DeepSpeed Transformer Inference config is  {'layer_id': 14, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': False, 'window_size': 256}
DeepSpeed Transformer Inference config is  {'layer_id': 15, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': True, 'window_size': 256}
DeepSpeed Transformer Inference config is  {'layer_id': 16, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': False, 'window_size': 256}
DeepSpeed Transformer Inference config is  {'layer_id': 17, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': True, 'window_size': 256}
DeepSpeed Transformer Inference config is  {'layer_id': 18, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': False, 'window_size': 256}
DeepSpeed Transformer Inference config is  {'layer_id': 19, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': True, 'window_size': 256}
DeepSpeed Transformer Inference config is  {'layer_id': 20, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': False, 'window_size': 256}
DeepSpeed Transformer Inference config is  {'layer_id': 21, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': True, 'window_size': 256}
DeepSpeed Transformer Inference config is  {'layer_id': 22, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': False, 'window_size': 256}
DeepSpeed Transformer Inference config is  {'layer_id': 23, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': True, 'window_size': 256}
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.27196311950683594 seconds
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.40374255180358887 seconds
Loading extension module transformer_inference...
DeepSpeed Transformer Inference config is  {'layer_id': 17, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': True, 'window_size': 256}
Time to load transformer_inference op: 0.40460848808288574 seconds
DeepSpeed Transformer Inference config is  {'layer_id': 0, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': False, 'window_size': 256}
DeepSpeed Transformer Inference config is  {'layer_id': 18, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': False, 'window_size': 256}
DeepSpeed Transformer Inference config is  {'layer_id': 19, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': True, 'window_size': 256}
DeepSpeed Transformer Inference config is  {'layer_id': 20, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': False, 'window_size': 256}
DeepSpeed Transformer Inference config is  {'layer_id': 21, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': True, 'window_size': 256}
DeepSpeed Transformer Inference config is  {'layer_id': 22, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': False, 'window_size': 256}
DeepSpeed Transformer Inference config is  {'layer_id': 1, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': True, 'window_size': 256}
DeepSpeed Transformer Inference config is  {'layer_id': 23, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': True, 'window_size': 256}
DeepSpeed Transformer Inference config is  {'layer_id': 2, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': False, 'window_size': 256}
DeepSpeed Transformer Inference config is  {'layer_id': 3, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': True, 'window_size': 256}
DeepSpeed Transformer Inference config is  {'layer_id': 4, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': False, 'window_size': 256}
DeepSpeed Transformer Inference config is  {'layer_id': 5, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': True, 'window_size': 256}
DeepSpeed Transformer Inference config is  {'layer_id': 6, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': False, 'window_size': 256}
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
DeepSpeed Transformer Inference config is  {'layer_id': 7, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 16, 'num_hidden_layers': -1, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 4, 'q_int8': False, 'encoder_decoder': False, 'scale_attention': True, 'specialized_mode': False, 'triangular_masking': True, 'local_attention': True, 'window_size': 256}

@switiz
Author

switiz commented Jul 12, 2021

Dear @RezaYazdaniAminabadi
I spent a lot of time on this and finally tracked it down. The root cause is twofold, similar to #1161:

  1. CUDA cache: fixed by calling torch.cuda.empty_cache()
  2. Hugging Face model loaded onto the GPU first:
    fixed by loading the model on CPU, applying the DeepSpeed slicing, and only then moving it to the GPU (patched in the DeepSpeed lib)

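For reference, the revised loading order can be sketched roughly like this (a minimal sketch based on the script in the original post, assuming a multi-GPU launch via the deepspeed launcher; not a tested DeepSpeed recipe):

```python
import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")

# 1) Load on CPU first (no device argument, no .to('cuda')), so each
#    process does not stage a full copy of the model in GPU memory.
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")

# 2) Let DeepSpeed slice the weights for tensor parallelism; the
#    partitioned parameters end up on the local GPU.
model = deepspeed.init_inference(model,
                                 mp_size=world_size,
                                 dtype=torch.float,
                                 replace_method='auto')

# 3) Release cached blocks left over from the staging copies.
torch.cuda.empty_cache()

generator = pipeline('text-generation', model=model.module,
                     tokenizer=tokenizer, device=local_rank)
```

With this ordering, per-GPU memory should scale down roughly as 1/mp_size for the transformer weights, instead of growing with the number of GPUs.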
If possible, I'd like to request a library-level feature that first loads the model on the CPU, applies model parallelism, and then sends each partition to its GPU. This would be very helpful for many DeepSpeed users doing massive-model inference.

thank you
BR

@RezaYazdaniAminabadi
Contributor

Hi @switiz

Sorry for the late reply.
Yes, the reason for the increasing memory was the cached memory.
I agree that the model needs to be created on the CPU and then partitioned across the GPUs. I will work on a PR to fix this.
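The create-on-CPU-then-partition flow described above can be illustrated with a toy, framework-free sketch (pure Python; `partition_rows` is a hypothetical helper for illustration, not a DeepSpeed API):

```python
def partition_rows(matrix, mp_size):
    """Sketch of tensor-parallel weight slicing: split a 2-D weight
    (a list of rows) into mp_size contiguous row shards, one per rank.
    The split happens on CPU before any shard is moved to its GPU."""
    rows = len(matrix)
    assert rows % mp_size == 0, "rows must divide evenly across ranks"
    shard = rows // mp_size
    return [matrix[r * shard:(r + 1) * shard] for r in range(mp_size)]

# Toy 8x4 "weight" split across 4 ranks -> each rank holds 2 rows.
weight = [[float(r * 4 + c) for c in range(4)] for r in range(8)]
shards = partition_rows(weight, mp_size=4)
print([len(s) for s in shards])  # [2, 2, 2, 2]
```

Each rank only ever materializes its own shard on the GPU, which is why per-GPU weight memory drops as mp_size grows.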
Thanks for investigating the problem.
Reza

@switiz
Author

switiz commented Jul 20, 2021

@RezaYazdaniAminabadi
Wow, that's great news.
I'll wait for the commit to land.
Thank you
BR
