
[BUG]GPT Models fail for long inputs and or outputs during inference #2300

Closed
mallorbc opened this issue Sep 7, 2022 · 11 comments
Labels: bug (Something isn't working), inference

Comments

@mallorbc

mallorbc commented Sep 7, 2022

Describe the bug

When using GPT-J or GPT-Neo 2.7B with DeepSpeed inference, if you give it the short, simple prompt "DeepSpeed is" like the tutorial shows and generate only 50 tokens or so, then everything works.

However, when you give the model a long input, such as 1000 tokens or so, and/or when you give a small input and want to generate many tokens, the system breaks.

Through my many attempts to fix the issue, I have gotten errors similar to those in #2062, where illegal memory is accessed. I have also gotten errors regarding nan/inf values. Sometimes the model does not error out but rather gives garbage output once a certain length is reached, similar to #2233.

To Reproduce
Steps to reproduce the behavior:

  1. Install torch, transformers, etc.
  2. Install DeepSpeed, either from source, from the latest tag, or from one of the unmerged PRs I reference
  3. Have a long text input in "input_data_long.txt"
  4. Run the code below
  5. Notice bad results in one of the forms described above

Note that when the min length is not specified, the model sometimes generates a few tokens and then stops. Specifying a long min length guarantees issues.

import os
import argparse

import torch
import deepspeed
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForCausalLM

# os.environ["CUDA_LAUNCH_BLOCKING"] = '1'

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('-m', "--model_name", type=str, default='EleutherAI/gpt-j-6B')
    parser.add_argument('--local_rank', type=int, default=-1,
                        help='local rank passed from distributed launcher')
    parser = deepspeed.add_config_arguments(parser)
    args = parser.parse_args()
    model_name = args.model_name

    with open('input_data_long.txt', 'r') as f:
        input_text = f.read()

    local_rank = int(os.getenv('LOCAL_RANK', '0'))
    world_size = int(os.getenv('WORLD_SIZE', '1'))

    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    generator = pipeline('text-generation', model=model, tokenizer=tokenizer,
                         device=local_rank, torch_dtype=torch.float16)

    generator.model = deepspeed.init_inference(generator.model,
                                               mp_size=world_size,
                                               dtype=torch.half,
                                               replace_method='auto',
                                               replace_with_kernel_inject=True)
    # torch.cuda.synchronize()

    string = generator(input_text, do_sample=True, max_length=2047)
    # torch.cuda.synchronize()

    if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
        print(string)

Expected behavior

I would expect that, given one or multiple GPUs, one could use DeepSpeed inference on these GPT models with any length of input, generate up to the maximum number of tokens, and get valid results.

ds_report output


DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
async_io ............... [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/root/anaconda3/envs/gpt/lib/python3.9/site-packages/torch']
torch version .................... 1.12.0
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed install path ........... ['/root/anaconda3/envs/gpt/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.7.3+89f2dedf, 89f2ded, cholmes/fix-long-seq-len-inference
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3

Screenshots
NA

System info (please complete the following information):

  • OS: Ubuntu 20.04
  • GPU count and types: 2 3090s
  • Interconnects: 1 system, 2 3090s
  • Python version: 3.9.13

Launcher context

deepspeed --num_gpus 1 infer.py

deepspeed --num_gpus 2 infer.py

Docker context

Using an NVIDIA CUDA container with conda installed

Additional context

I believe related issues could be #2062 and #2212, and related PRs could be #2212 and #2280. For the PRs, I have tried building from source, and it did not resolve the issue. One of them led to fewer errors but tended to produce just poor results (I believe it is the one specified in the ds_report).

I also tried rolling back to before 0.6.6 as I read someone had success doing so. I also tried building from master without success.

@mallorbc mallorbc added the bug Something isn't working label Sep 7, 2022
@mallorbc mallorbc changed the title from "[BUG]GPT Models fail for long inputs and or outputs" to "[BUG]GPT Models fail for long inputs and or outputs during inference" Sep 7, 2022
@andrewchernyh
Contributor

@mallorbc - could you try this PR: #2344?

@mallorbc
Author

@andrewchernyh I will check the PR as soon as I can. Thanks!

@RezaYazdaniAminabadi
Contributor

Hi @andrewchernyh and @mallorbc,

Thanks for adding this PR. It somewhat solves this problem; however, it still adds many new lines at the end of the text, and that is not really an issue with the solution you provided. There is another problem with some other kernels where we see such behaviour.
Here is the input that I tested with:
DeepSpeed is a cool project that uses machine learning to automate a web service! We are a bunch of nerdy geeks who love to work and learn on cutting edge technology! Deep Speed is looking to hire a full stack Developer! DeepSpeed is a really cool project and we’re looking to bring on a full stack web developer to work on the core technology and bring it to the cutting edge! I’ll be here answering any questions you may have during this interview. Feel free to call me or email me any time, with any question you have! What’s your background? I’ve always been a computer nerd. I’ve been working in front end with React, Angular and all the things. However, back in 2012, I was really interested in learning about machine learning and working on it. I built a few prototypes and really got into it. I decided I wanted to build more of a platform than a single application. I began working on DeepSpeed in 2015 and it has continued to grow. DeepSpeed today looks at a broader web of APIs than any other single service, we’re currently looking to build that out. It takes about 100 requests to use the service. Why work on DeepSpeed? It’s really cool. It has real world applications in sports and stock picking with automated data science. It’s a really, truly innovative product! It’s something that will help small businesses all around the world to gain a competitive edge over their competitors. They can create software that goes into automation and is going to save them lots of money. They can use it to create software that will help people in their business or at home. I’m really excited about DeepSpeed, but I’m actually also really interested in the other projects you work on. With me being a full stack developer and all the different technologies we use – how have you grown DeepSpeed in the recent weeks? So, to start with, we’ve just made our product really accessible for anyone! You can go and sign-up for DeepSpeed through our website or simply through the app. When you come through the website, it shows you what APIs we’re going to use. It doesn’t matter if you sign up through either site, you’re going to see the same things. We use a front, backend and a mobile project, through React and Angular. Do you have previous experience in web development? If not, what’s the best way for you to understand it? If you can, please, come to an office with me. Yes, I do have some experience,

and it generates:
[{'generated_text': "DeepSpeed is a cool project that\nuses machine learning to\nautomate a web service!\nWe are a bunch of nerdy geeks who\nlove to work and learn on\ncutting edge technology!\nDeep Speed is looking to hire\na full stack Developer!\nDeepSpeed is a really cool project and we’re looking\nto bring on a full stack web developer to work on the\ncore technology and bring it to the cutting\nedge!\nI’ll be here answering any questions you may have\nduring this interview. Feel free to call me or email\nme any time, with any question you have!\nWhat’s your background?\nI’ve always been a computer nerd. I’ve been\nworking in front end with React, Angular and all the\nthings. However, back in 2012, I was really interested in\nlearning about machine learning and working on it. I\nbuilt a few prototypes and really got into it. I decided\nI wanted to build more of a platform than a single\napplication. I began working on DeepSpeed in 2015 and\nit has continued to grow. DeepSpeed today looks at a\nbroader web of APIs than any other single service,\nwe’re currently looking to build that out. It takes\nabout 100 requests to use the service.\nWhy work on DeepSpeed?\nIt’s really cool. It has real world applications\nin sports and stock picking with automated data\nscience. It’s a really, truly innovative product! It’s\nsomething that will help small businesses all around\nthe world to gain a competitive edge over their\ncompetitors. They can create software that goes into\nautomation and is going to save them lots of money.\nThey can use it to create software that will help\npeople in their business or at home.\nI’m really excited about DeepSpeed, but I’m\nactually also really interested in the other projects\nyou work on. With me being a full stack developer and\nall the different technologies we use – how have you\ngrown DeepSpeed in the recent weeks?\nSo, to start with, we’ve just made our product really\naccessible for anyone! You can go and sign-up for\nDeepSpeed through our website or simply through the\napp. When you come through the website, it shows you\nwhat APIs we’re going to use. It doesn’t matter if\nyou sign up through either site, you’re going to see\nthe same things. We use a front, backend and a mobile\nproject, through React and Angular.\nDo you have previous experience in web development? If\nnot, what’s the best way for you to understand\nit?\nIf you can, please, come to an office with me. Yes, I\ndo have some experience, \nBut I think my strength is understanding. I’m a\nsmart person…I know about programming…and\nthat’s why I’m doing these interviews.\nWhat can we find in your Github?\nI really like DeepSpeed, so I’m happy to give back\nto the community with that software. As for my\nother projects, we have a private GitHub that’s\nmore or less just for testing. We try to be as\nproud and up front as possible and when it’s ready\nto be shared, we’ll just send a pull request to the\nrespective repositories.\nYou mentioned you’re really interested in the\nother projects we work on. Could you briefly explain\nthem?\nOur project is called, “DeepSpeed.io” and that’s\nour platform. It’s the foundation of DeepSpeed.io.\nDeepSpeed makes your API service easy and makes it\nall come together. So the API service needs to tie\nin with our backend. The backend needs to be in a\nseparate project. 
All of the technologies that we’re\nusing are really cool to use and we’re all open to\nsharing, so you’ll also hear about other projects and\ntechnologies on our private GitHub.\nWhat are the core components inside of DeepSpeed?\nWe’ve got the API backend service where we’re\nprocessing the data.\nI understand, but what does that actually mean\nand how does it work?\nWe’re looking at over 2000 different APIs. The\nbackend will be running different APIs and we will\nhave people working on that on a daily basis. We\nuse a service called “Gigantic” to get the data out\nin the API service. We use Elasticsearch for our\nsearch and then we have a lot of other neat\ntechnologies that allow us to pull in any\nspecific information.\nSo if we’re looking at the API service, what\ndata will be stored and how much data is there?\nWe don’t have an unlimited or “huge”\ndata-intensive project. However, we do have a lot of information\nif you take a look at that.\nCould your website link me an image?\nI’ve got nothing to link you?\nLet me take a look at what you’re talking about!\nI’m not so sure.\nWhat’s the best-known person to person to use this?\nI want to try it. No, I’m not.\nOK. Who wrote the\nfamous that was the best-known person to person\nto use?\n\n(This is a a really easy one)\n\nI think you should\n\nWe are\n\nThe best-known.\n\nThis\n\nWhy I like what\n\nSo you'll make things or do what?\n\nOf course to the question I like who\nto see\n\nThe rest of this\n\n\nOf these\n\nA, yes. We'll be seeing this and we'd love to your great to\nyou\n\nA full\n\nI this the\n\nThe\n\nto\n\n\nDo\n\n# what\n\n\nWhy not\n\nThe\n\nd like to\n\nand\n\nWhat\n\nWhat to!\n\nThe full to \nlike\n\nWhat\n\nI do\n\nYou\n\nA to?\n\nof\n\nfull\nfor full\n\nin the full\n\nceiling on\n\nIf of full\n\nto\n\nfull\n\nWhat\n\nI was\n\nand\n\nd\n\nThe full\n\ne\n\nto on\n\n\nWhat\n\nfull\n\n!\n\nfull\n\n\nfull-\n\n\nto\n\nfull to\n\n\nWhat\n\n\nThe full to\n\n\nin the\n\n\n\n\non\n\nfull\n\n\nCi\nabout\nfull\n\nfull\n\ncou\n\nWhat\n\n\n-full\n\nto\n\n\n\nwhat\n\n\n\n-\n\n \n\nto full\n\n\n\n\n\nto\n\n\n\nTo\n\nAnd\n\n\n\n\n\n\n\na\nto see\n\n\n\n\n\n\n\nwanting\n\non\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nfull\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nto\na lot of\nto do\ntoo\n\n\n\nto\na\nwhat\n\n\n-in real in\n\n\n\nfull\n\n\n\nA full and\nfull of a\n to\n\ndub in a to iced\non\n \n\n\n\n\n\n\nto do\n\nto\n\n-a\n\n\non\na box full \n\n\nThe\n\nto\n\n\n\nto\ns\n \n\nStoring\n\n\n\nmore-what\n\nto the\n\n\nWhat\n\n\n\nto\n\n\n\nA\n\n\na\n\n\n\nas\n\n\n\n\n\n\n\nto\n\nto\n\nr\nas\nWhat�\n\n\nd\nto\n\n\n\n\n\nbe\n\n\n\n�\n\n\n\n\n\n\n\n\n\n cut\n\n\n\n\n\n\n to\n to\n\n\n\n\n\n\nA\n\n\n\n\n\n\n\n\n\n\nabout\n\n\n\nfull\nfull and\n\n\nand\n of\n\n\n\n\n\n\nup\n\n\n\n\na\n\n\n\n\nv\n high-w\n\n\n\n\n up to a\nd a\n\n\n of\n\n\n\n what their and\n to to a full\n for a\n a.\n what\n\n\na\n\n\n\n\n\n\n\n\nagen\n\n\n\nout\n\n\nsuss\n\n\n to\n\n\n\n to\n cut\n\n\n exp\n\n\n\n\n\n\n\n to\n\n\n\n\n\n\n\n to getts a pana\n\n\n to an\n\n\n to get\n and\n\n to full\n as\nfull and \n to \n, and\n\n to makes\n is being\n a\n to\n what\n and\n\n a \\\n\n the a and a cut to a\n a\n\n a full\n and a\n\n\n\n\n to have to\n - iza\n a major- ‹ a new a and\n\n\n\n\n\n and out of who a target-a to some to see-a d to\n a-\n\n a different\n a bit a little\n to"}]
We have also updated this PR, which solves the issue in a more general sense: we remove the MAX_OUT_TOKENS macro and instead decide on the maximum token length based on the available memory.
Please give that a try, too.
Thanks,
Reza
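
For reference, here is a minimal sketch of the memory-based sizing idea described above. It is an illustration rather than the PR's actual code: the layer/head numbers are GPT-J-6B shapes, and the simplified per-token KV-cache formula and the 0.8 safety factor are assumptions.

import torch

# Illustrative only: estimate how many tokens the KV cache can hold from the
# free GPU memory, instead of using a fixed MAX_OUT_TOKENS-style constant.
# Defaults assume GPT-J-6B shapes (28 layers, 16 heads, head_dim 256, fp16).
def estimate_max_cached_tokens(num_layers=28, num_heads=16, head_dim=256,
                               dtype_bytes=2, batch_size=1, safety=0.8):
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    # The cache stores one key and one value vector per layer per token.
    bytes_per_token = 2 * num_layers * num_heads * head_dim * dtype_bytes * batch_size
    return int(free_bytes * safety) // bytes_per_token

print("max cached tokens ~", estimate_max_cached_tokens())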

@andrewchernyh
Contributor

@RezaYazdaniAminabadi I also want to note that if (is_prompt) Context::Instance().reset_tokens(seq_len); is not a perfect solution.
With the original Hugging Face model I can do the following trick (simplified code):

result = model.forward(prompt1)
result2 = model.forward(prompt2, result.past_key_values)

This caches the past_key_values and gets results faster for a base prompt combined with different additional prompts.
I think it is better to calculate the current sentence length using the attention_mask or the past_key_values size.
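
For reference, a minimal sketch of this past_key_values trick, and of reading the current sequence length from the attention mask or from the cache, using a stock Hugging Face model without DeepSpeed; "gpt2" and the prompts are placeholders.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

base = tok("DeepSpeed is", return_tensors="pt")
with torch.no_grad():
    out1 = model(**base, use_cache=True)                 # run and cache the base prompt

extra = tok(" a library that", return_tensors="pt")
with torch.no_grad():
    out2 = model(extra.input_ids,                        # only the new tokens
                 past_key_values=out1.past_key_values,   # reuse the cached keys/values
                 use_cache=True)

# Current sequence length, either from the masks or from the cache size.
seq_len_from_mask = (base.attention_mask.sum(-1) + extra.attention_mask.sum(-1)).item()
seq_len_from_cache = out2.past_key_values[0][0].shape[-2]   # layer 0 key tensor
print(seq_len_from_mask, seq_len_from_cache)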

@andrewchernyh
Contributor

@RezaYazdaniAminabadi I reopened PR #2344 as PR #2359 after checking the current master. It still has memory corruption.

@RezaYazdaniAminabadi
Contributor

I also want to note that if (is_prompt) Context::Instance().reset_tokens(seq_len); is not a perfect solution.

Hi @andrewchernyh,

Thanks for bringing this up and showing how it can be problematic. On the other hand, I feel that getting the current sentence length from the past_key_values or the attention mask may not be the perfect solution either, since there are cases where the attention mask is not passed and we perform triangular masking by default, and the caching mechanism can be handled internally rather than passed from outside. Also, I would say deepspeed-inference does not work properly in the case you just mentioned, because we do not consume the past_key_values sent from outside; it is all managed internally. I would be happy to help add this feature if you want to work on it :-)

@RezaYazdaniAminabadi
Contributor

@RezaYazdaniAminabadi I reopened PR #2344 as PR #2359 after checking the current master. It still has memory corruption.

Thanks a lot for your contribution; it certainly points us in the direction of solving this in a more definitive way. I have added some comments on your PR. Please take a look, and after making some changes we can merge it. Thanks again.
Best,
Reza

@andrewchernyh
Contributor

Hi @RezaYazdaniAminabadi,
I think that in the real world the attention mask should always be passed, because batching will not work without it; for GPT it requires left padding and a correct position_ids calculation. FasterTransformer has such a feature, called interactive generation, controlled by a boolean flag, and I think it would be good to have something similar in DeepSpeed.
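
For context, a minimal sketch of the left padding and position_ids handling mentioned above for batched GPT inputs; "gpt2" and the prompts are placeholders.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.padding_side = "left"            # GPT-style models should be padded on the left
tok.pad_token = tok.eos_token

batch = tok(["DeepSpeed is", "A much longer prompt about DeepSpeed inference"],
            return_tensors="pt", padding=True)

# Positions must skip the left padding so that real tokens start at position 0.
position_ids = batch.attention_mask.cumsum(dim=-1) - 1
position_ids.masked_fill_(batch.attention_mask == 0, 1)

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
with torch.no_grad():
    out = model(input_ids=batch.input_ids,
                attention_mask=batch.attention_mask,
                position_ids=position_ids)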

@RezaYazdaniAminabadi
Contributor

I agree, but there are cases where there is no padding and we get a ragged batch of inputs; in that case, no mask is passed. The masking can even be sparse, and then we have to deal with a predefined mask which does not show how many tokens have been generated so far. Anyway, I still think adding this feature as you suggested would be helpful, but I wanted to mention that there are cases where this assumption might not hold.

@awan-10
Contributor

awan-10 commented Oct 28, 2022

Is this issue resolved @RezaYazdaniAminabadi @andrewchernyh? If yes, kindly close the issue.

@tjruwase
Contributor

tjruwase commented Nov 4, 2022

Fixed, and so closing. Please (re)open if needed.
