[BUG] GPT models fail for long inputs and/or outputs during inference #2300
Comments
@andrewchernyh I will check the PR as soon as I can. Thanks!
Hi @andrewchernyh and @mallorbc, thanks for adding this PR. It solves this problem somewhat; however, it still adds many new lines at the end of the text, and that is not really an issue with the solution you provided. There is another problem with some other kernels where we see this behaviour, and it generates:
@RezaYazdaniAminabadi Also I want to note that I use
```python
result = model.forward(prompt1)
result2 = model.forward(prompt2, result.past_key_values)
```
to cache past_key_values and get results for a base prompt with different additional prompts faster.
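For reference, a minimal runnable sketch of that caching pattern using the plain Hugging Face transformers API; the model name and prompts are placeholders, and this only illustrates the idea, not DeepSpeed's injected-kernel path.

```python
# Sketch of reusing past_key_values so a shared base prompt is processed only once.
# Model name and prompts are placeholders; plain Hugging Face forward, no DeepSpeed injection.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-2.7B"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

base = tokenizer("DeepSpeed is", return_tensors="pt")
with torch.no_grad():
    base_out = model(**base, use_cache=True)  # cache key/values for the base prompt

extra = tokenizer(" a deep learning optimization library that", return_tensors="pt")
with torch.no_grad():
    # Feed only the new tokens and reuse the cached key/values from the base prompt.
    out = model(input_ids=extra.input_ids,
                past_key_values=base_out.past_key_values,
                use_cache=True)
next_token = out.logits[:, -1].argmax(dim=-1)
```

As the follow-up below notes, DeepSpeed's kernel-injected path does not consume past_key_values passed in from outside this way, which is what the rest of the discussion is about.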
@RezaYazdaniAminabadi I reopened PR #2344 as PR #2359 after checking the current master. It still has memory corruption.
Hi @andrewchernyh, thanks for bringing this up and showing how this can be problematic. On the other hand, I feel that getting the current sentence length from the past-key-value or attn_mask may not be the perfect solution either, since there are cases where attn_mask is not passed and we perform triangular masking by default, and the caching mechanism can be handled internally rather than passed from outside. Also, I would say deepspeed-inference does not work properly in the case you just mentioned, because we are not consuming the content of the past-key-value that is sent from outside; it is all managed internally. I would be happy to help add this feature if you want to work on it :-)
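For context on the default triangular masking mentioned above, here is a small standalone illustration of causal masking in PyTorch; it is not DeepSpeed's actual kernel code, just the general pattern.

```python
# Illustration of triangular (causal) masking: position i may only attend to positions <= i.
# General PyTorch pattern, not the DeepSpeed kernel implementation.
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)                               # raw attention scores
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # lower-triangular mask
masked = scores.masked_fill(~causal, float("-inf"))                  # hide future positions
probs = torch.softmax(masked, dim=-1)
```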
Thanks a lot for your contribution; it certainly points us in the direction of solving this in a more definitive way. I have added some comments in your PR. Please take a look, and after making some changes we can merge it. Thanks again.
Hi @RezaYazdaniAminabadi,
I agree, but there are cases where there is no padding and we get a ragged batch of input; in this case, there won't be any mask passed. Or the masking could even be sparse, and we have to deal with a predefined mask that doesn't show how many tokens have been generated so far. Anyway, I still think bringing in this feature as you suggested is helpful, but I wanted to mention that there are cases where this assumption might not be true.
Is this issue resolved @RezaYazdaniAminabadi @andrewchernyh? If yes, kindly close the issue.
Fixed, and so closing. Please (re)open if needed.
Describe the bug
When using GPT-J or GPT-Neo 2.7B with DeepSpeed inference, if you give it the short, simple prompt "DeepSpeed is" as the tutorial shows and generate only 50 tokens or so, everything works.
However, when you give the model a long input, such as 1000 tokens or so, and/or when you give a small input and want to generate many tokens, the system breaks.
Through my many attempts to fix the issue, I have gotten errors similar to those in #2062, where illegal memory is accessed. I have gotten errors with regard to nan/inf. Sometimes the model does not error out but instead gives garbage output once a certain length is reached, similar to #2233.
To Reproduce
Steps to reproduce the behavior:
Note that when not specifying the min length, what sometimes happens is that the model generates a few tokens and then stops. Specifying a long min length guarantees issues. A rough sketch of the kind of script that triggers this is shown below.
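The sketch loosely follows the DeepSpeed text-generation tutorial; the exact model, dtype, and token counts are assumptions, not the precise original repro.

```python
# Hypothetical reproduction sketch based on the DeepSpeed text-generation tutorial.
# The model choice, dtype, and token counts are assumptions, not the exact original repro.
import torch
import deepspeed
from transformers import pipeline

generator = pipeline("text-generation", model="EleutherAI/gpt-neo-2.7B", device=0)
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=1,
                                           dtype=torch.half,
                                           replace_with_kernel_inject=True)

# A short prompt with ~50 generated tokens works; forcing ~1000 tokens (or using a
# ~1000-token prompt) produces illegal memory access, nan/inf errors, or garbage output.
result = generator("DeepSpeed is", do_sample=True, min_length=1000, max_length=1024)
print(result[0]["generated_text"])
```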
Expected behavior
I would expect that, given one or multiple GPUs, one could use DeepSpeed inference on these GPT models with an input of any length, generate up to the maximum number of tokens, and get valid results.
ds_report output
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
async_io ............... [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/root/anaconda3/envs/gpt/lib/python3.9/site-packages/torch']
torch version .................... 1.12.0
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed install path ........... ['/root/anaconda3/envs/gpt/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.7.3+89f2dedf, 89f2ded, cholmes/fix-long-seq-len-inference
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3
Screenshots
NA
System info (please complete the following information):
Launcher context
Docker context
Using an NVIDIA CUDA container with conda installed
Additional context
I believe related issues could be #2062 and #2212, and related PRs could be #2212 and #2280. For the PRs, I have tried building from source and it did not resolve the issue. One of them led to fewer errors but tended to produce just poor results (I believe it is the one specified in the ds_report).
I also tried rolling back to before 0.6.6, as I read that someone had success doing so. I also tried building from master, without success.