
[BUG]GPT Models fail for long inputs and or outputs during inference #2300

Closed
mallorbc opened this issue Sep 7, 2022 · 11 comments
Labels: bug (Something isn't working), inference

Comments

@mallorbc

mallorbc commented Sep 7, 2022

Describe the bug

When using GPT-J or GPT-Neo 2.7B with DeepSpeed inference, if you give it the short, simple prompt "DeepSpeed is" like the tutorial shows and generate only 50 tokens or so, then everything works.

However, when you give the model a long input, such as 1000 tokens or so, and/or when you give a small input and want to generate many tokens, the system breaks.

Through my many attempts to fix the issue, I have gotten errors similar to those in #2062, where illegal memory is accessed. I have also gotten errors regarding nan/inf values. Sometimes the model does not error out but rather gives garbage output once a certain length is reached, similar to #2233.

To Reproduce
Steps to reproduce the behavior:

  1. Install torch, transformers, etc.
  2. Install DeepSpeed, either from source, from the latest tag, or from one of the unmerged PRs I reference
  3. Have a long text input in "input_data_long.txt"
  4. Run the code below
  5. Notice bad results in one of the forms described above

Note that when the min length is not specified, the model sometimes generates a few tokens and then stops. Specifying a long min length guarantees issues.

import os
import argparse

import torch
import deepspeed
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForCausalLM

# os.environ["CUDA_LAUNCH_BLOCKING"] = '1'

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('-m', "--model_name", type=str, default='EleutherAI/gpt-j-6B')
    parser.add_argument('--local_rank', type=int, default=-1,
                        help='local rank passed from distributed launcher')
    parser = deepspeed.add_config_arguments(parser)
    args = parser.parse_args()
    model_name = args.model_name

    with open('input_data_long.txt', 'r') as f:
        input_text = f.read()

    local_rank = int(os.getenv('LOCAL_RANK', '0'))
    world_size = int(os.getenv('WORLD_SIZE', '1'))

    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    generator = pipeline('text-generation', model=model, tokenizer=tokenizer,
                         device=local_rank, torch_dtype=torch.float16)

    generator.model = deepspeed.init_inference(generator.model,
                                               mp_size=world_size,
                                               dtype=torch.half,
                                               replace_method='auto',
                                               replace_with_kernel_inject=True)
    # torch.cuda.synchronize()

    string = generator(input_text, do_sample=True, max_length=2047)
    # torch.cuda.synchronize()

    if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
        print(string)

Expected behavior

I would expect that, given one or multiple GPUs, one could use DeepSpeed inference on these GPT models with any length of input, generate up to the maximum number of tokens, and get valid results.

ds_report output


DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
async_io ............... [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/root/anaconda3/envs/gpt/lib/python3.9/site-packages/torch']
torch version .................... 1.12.0
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed install path ........... ['/root/anaconda3/envs/gpt/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.7.3+89f2dedf, 89f2ded, cholmes/fix-long-seq-len-inference
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3

Screenshots
NA

System info (please complete the following information):

  • OS: Ubuntu 20.04
  • GPU count and types: 2 3090s
  • Interconnects: 1 system, 2 3090s
  • Python version: 3.9.13

Launcher context

deepspeed --num_gpus 1 infer.py

deepspeed --num_gpus 2 infer.py

Docker context

Using an NVIDIA CUDA container with conda installed

Additional context

I believe related issues could be #2062 and #2212, and related PRs could be #2212 and #2280. For the PRs, I have tried building from source, and it did not resolve the issue. One of them led to fewer errors but tended to produce just poor results (I believe it is the one specified in the ds_report).

I also tried rolling back to before 0.6.6 as I read someone had success doing so. I also tried building from master without success.

@mallorbc mallorbc added the bug Something isn't working label Sep 7, 2022
@mallorbc mallorbc changed the title from "[BUG]GPT Models fail for long inputs and or outputs" to "[BUG]GPT Models fail for long inputs and or outputs during inference" Sep 7, 2022
@andrewchernyh
Contributor

@mallorbc - could you try this PR: #2344?

@mallorbc
Author

@andrewchernyh I will check the PR as soon as I can. Thanks!

@RezaYazdaniAminabadi
Contributor

Hi @andrewchernyh and @mallorbc,

Thanks for adding this PR. It somewhat solves this problem; however, it still adds many new lines at the end of the text, and that is not really an issue with the solution you provided. There is another problem with some other kernels where we see such behaviour.
Here is the input that I tested with:
DeepSpeed is a cool project that uses machine learning to automate a web service! We are a bunch of nerdy geeks who love to work and learn on cutting edge technology! Deep Speed is looking to hire a full stack Developer! DeepSpeed is a really cool project and we’re looking to bring on a full stack web developer to work on the core technology and bring it to the cutting edge! I’ll be here answering any questions you may have during this interview. Feel free to call me or email me any time, with any question you have! What’s your background? I’ve always been a computer nerd. I’ve been working in front end with React, Angular and all the things. However, back in 2012, I was really interested in learning about machine learning and working on it. I built a few prototypes and really got into it. I decided I wanted to build more of a platform than a single application. I began working on DeepSpeed in 2015 and it has continued to grow. DeepSpeed today looks at a broader web of APIs than any other single service, we’re currently looking to build that out. It takes about 100 requests to use the service. Why work on DeepSpeed? It’s really cool. It has real world applications in sports and stock picking with automated data science. It’s a really, truly innovative product! It’s something that will help small businesses all around the world to gain a competitive edge over their competitors. They can create software that goes into automation and is going to save them lots of money. They can use it to create software that will help people in their business or at home. I’m really excited about DeepSpeed, but I’m actually also really interested in the other projects you work on. With me being a full stack developer and all the different technologies we use – how have you grown DeepSpeed in the recent weeks? So, to start with, we’ve just made our product really accessible for anyone! You can go and sign-up for DeepSpeed through our website or simply through the app. When you come through the website, it shows you what APIs we’re going to use. It doesn’t matter if you sign up through either site, you’re going to see the same things. We use a front, backend and a mobile project, through React and Angular. Do you have previous experience in web development? If not, what’s the best way for you to understand it? If you can, please, come to an office with me. Yes, I do have some experience,

and it generates:
[{'generated_text': "DeepSpeed is a cool project that\nuses machine learning to\nautomate a web service!\nWe are a bunch of nerdy geeks who\nlove to work and learn on\ncutting edge technology!\nDeep Speed is looking to hire\na full stack Developer!\nDeepSpeed is a really cool project and we’re looking\nto bring on a full stack web developer to work on the\ncore technology and bring it to the cutting\nedge!\nI’ll be here answering any questions you may have\nduring this interview. Feel free to call me or email\nme any time, with any question you have!\nWhat’s your background?\nI’ve always been a computer nerd. I’ve been\nworking in front end with React, Angular and all the\nthings. However, back in 2012, I was really interested in\nlearning about machine learning and working on it. I\nbuilt a few prototypes and really got into it. I decided\nI wanted to build more of a platform than a single\napplication. I began working on DeepSpeed in 2015 and\nit has continued to grow. DeepSpeed today looks at a\nbroader web of APIs than any other single service,\nwe’re currently looking to build that out. It takes\nabout 100 requests to use the service.\nWhy work on DeepSpeed?\nIt’s really cool. It has real world applications\nin sports and stock picking with automated data\nscience. It’s a really, truly innovative product! It’s\nsomething that will help small businesses all around\nthe world to gain a competitive edge over their\ncompetitors. They can create software that goes into\nautomation and is going to save them lots of money.\nThey can use it to create software that will help\npeople in their business or at home.\nI’m really excited about DeepSpeed, but I’m\nactually also really interested in the other projects\nyou work on. With me being a full stack developer and\nall the different technologies we use – how have you\ngrown DeepSpeed in the recent weeks?\nSo, to start with, we’ve just made our product really\naccessible for anyone! You can go and sign-up for\nDeepSpeed through our website or simply through the\napp. When you come through the website, it shows you\nwhat APIs we’re going to use. It doesn’t matter if\nyou sign up through either site, you’re going to see\nthe same things. We use a front, backend and a mobile\nproject, through React and Angular.\nDo you have previous experience in web development? If\nnot, what’s the best way for you to understand\nit?\nIf you can, please, come to an office with me. Yes, I\ndo have some experience, \nBut I think my strength is understanding. I’m a\nsmart person…I know about programming…and\nthat’s why I’m doing these interviews.\nWhat can we find in your Github?\nI really like DeepSpeed, so I’m happy to give back\nto the community with that software. As for my\nother projects, we have a private GitHub that’s\nmore or less just for testing. We try to be as\nproud and up front as possible and when it’s ready\nto be shared, we’ll just send a pull request to the\nrespective repositories.\nYou mentioned you’re really interested in the\nother projects we work on. Could you briefly explain\nthem?\nOur project is called, “DeepSpeed.io” and that’s\nour platform. It’s the foundation of DeepSpeed.io.\nDeepSpeed makes your API service easy and makes it\nall come together. So the API service needs to tie\nin with our backend. The backend needs to be in a\nseparate project. 
All of the technologies that we’re\nusing are really cool to use and we’re all open to\nsharing, so you’ll also hear about other projects and\ntechnologies on our private GitHub.\nWhat are the core components inside of DeepSpeed?\nWe’ve got the API backend service where we’re\nprocessing the data.\nI understand, but what does that actually mean\nand how does it work?\nWe’re looking at over 2000 different APIs. The\nbackend will be running different APIs and we will\nhave people working on that on a daily basis. We\nuse a service called “Gigantic” to get the data out\nin the API service. We use Elasticsearch for our\nsearch and then we have a lot of other neat\ntechnologies that allow us to pull in any\nspecific information.\nSo if we’re looking at the API service, what\ndata will be stored and how much data is there?\nWe don’t have an unlimited or “huge”\ndata-intensive project. However, we do have a lot of information\nif you take a look at that.\nCould your website link me an image?\nI’ve got nothing to link you?\nLet me take a look at what you’re talking about!\nI’m not so sure.\nWhat’s the best-known person to person to use this?\nI want to try it. No, I’m not.\nOK. Who wrote the\nfamous that was the best-known person to person\nto use?\n\n(This is a a really easy one)\n\nI think you should\n\nWe are\n\nThe best-known.\n\nThis\n\nWhy I like what\n\nSo you'll make things or do what?\n\nOf course to the question I like who\nto see\n\nThe rest of this\n\n\nOf these\n\nA, yes. We'll be seeing this and we'd love to your great to\nyou\n\nA full\n\nI this the\n\nThe\n\nto\n\n\nDo\n\n# what\n\n\nWhy not\n\nThe\n\nd like to\n\nand\n\nWhat\n\nWhat to!\n\nThe full to \nlike\n\nWhat\n\nI do\n\nYou\n\nA to?\n\nof\n\nfull\nfor full\n\nin the full\n\nceiling on\n\nIf of full\n\nto\n\nfull\n\nWhat\n\nI was\n\nand\n\nd\n\nThe full\n\ne\n\nto on\n\n\nWhat\n\nfull\n\n!\n\nfull\n\n\nfull-\n\n\nto\n\nfull to\n\n\nWhat\n\n\nThe full to\n\n\nin the\n\n\n\n\non\n\nfull\n\n\nCi\nabout\nfull\n\nfull\n\ncou\n\nWhat\n\n\n-full\n\nto\n\n\n\nwhat\n\n\n\n-\n\n \n\nto full\n\n\n\n\n\nto\n\n\n\nTo\n\nAnd\n\n\n\n\n\n\n\na\nto see\n\n\n\n\n\n\n\nwanting\n\non\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nfull\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nto\na lot of\nto do\ntoo\n\n\n\nto\na\nwhat\n\n\n-in real in\n\n\n\nfull\n\n\n\nA full and\nfull of a\n to\n\ndub in a to iced\non\n \n\n\n\n\n\n\nto do\n\nto\n\n-a\n\n\non\na box full \n\n\nThe\n\nto\n\n\n\nto\ns\n \n\nStoring\n\n\n\nmore-what\n\nto the\n\n\nWhat\n\n\n\nto\n\n\n\nA\n\n\na\n\n\n\nas\n\n\n\n\n\n\n\nto\n\nto\n\nr\nas\nWhat�\n\n\nd\nto\n\n\n\n\n\nbe\n\n\n\n�\n\n\n\n\n\n\n\n\n\n cut\n\n\n\n\n\n\n to\n to\n\n\n\n\n\n\nA\n\n\n\n\n\n\n\n\n\n\nabout\n\n\n\nfull\nfull and\n\n\nand\n of\n\n\n\n\n\n\nup\n\n\n\n\na\n\n\n\n\nv\n high-w\n\n\n\n\n up to a\nd a\n\n\n of\n\n\n\n what their and\n to to a full\n for a\n a.\n what\n\n\na\n\n\n\n\n\n\n\n\nagen\n\n\n\nout\n\n\nsuss\n\n\n to\n\n\n\n to\n cut\n\n\n exp\n\n\n\n\n\n\n\n to\n\n\n\n\n\n\n\n to getts a pana\n\n\n to an\n\n\n to get\n and\n\n to full\n as\nfull and \n to \n, and\n\n to makes\n is being\n a\n to\n what\n and\n\n a \\\n\n the a and a cut to a\n a\n\n a full\n and a\n\n\n\n\n to have to\n - iza\n a major- ‹ a new a and\n\n\n\n\n\n and out of who a target-a to some to see-a d to\n a-\n\n a different\n a bit a little\n to"}]
We have also updated this PR, which solves the issue in a more general sense: we remove the MAX_OUT_TOKENS macro and instead decide on the maximum token length based on the available memory.
Please give that a try, too.
Thanks,
Reza
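
For reference, here is a minimal sketch of the memory-based sizing idea described above. It is an illustration rather than the PR's actual code: the layer/head numbers are GPT-J-6B shapes, and the simplified per-token KV-cache formula and the 0.8 safety factor are assumptions.

import torch

# Illustrative only: estimate how many tokens the KV cache can hold from the
# free GPU memory, instead of using a fixed MAX_OUT_TOKENS-style constant.
# Defaults assume GPT-J-6B shapes (28 layers, 16 heads, head_dim 256, fp16).
def estimate_max_cached_tokens(num_layers=28, num_heads=16, head_dim=256,
                               dtype_bytes=2, batch_size=1, safety=0.8):
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    # The cache stores one key and one value vector per layer per token.
    bytes_per_token = 2 * num_layers * num_heads * head_dim * dtype_bytes * batch_size
    return int(free_bytes * safety) // bytes_per_token

print("max cached tokens ~", estimate_max_cached_tokens())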

@andrewchernyh
Contributor

@RezaYazdaniAminabadi I also want to note that if (is_prompt) Context::Instance().reset_tokens(seq_len); is not a perfect solution.
With the original Hugging Face model I can do the following trick (simplified code):

result = model.forward(prompt1)
result2 = model.forward(prompt2, result.past_key_values)

This caches the past_key_values and gets results faster for a base prompt combined with different additional prompts.
I think it is better to calculate the current sentence length using the attention_mask or the past_key_values size.
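
For reference, a minimal sketch of this past_key_values trick, and of reading the current sequence length from the attention mask or from the cache, using a stock Hugging Face model without DeepSpeed; "gpt2" and the prompts are placeholders.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

base = tok("DeepSpeed is", return_tensors="pt")
with torch.no_grad():
    out1 = model(**base, use_cache=True)                 # run and cache the base prompt

extra = tok(" a library that", return_tensors="pt")
with torch.no_grad():
    out2 = model(extra.input_ids,                        # only the new tokens
                 past_key_values=out1.past_key_values,   # reuse the cached keys/values
                 use_cache=True)

# Current sequence length, either from the masks or from the cache size.
seq_len_from_mask = (base.attention_mask.sum(-1) + extra.attention_mask.sum(-1)).item()
seq_len_from_cache = out2.past_key_values[0][0].shape[-2]   # layer 0 key tensor
print(seq_len_from_mask, seq_len_from_cache)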

@andrewchernyh
Contributor

@RezaYazdaniAminabadi I reopened PR #2344 as PR #2359 after checking the current master. It still has memory corruption.

@RezaYazdaniAminabadi
Contributor

I also want to note that if (is_prompt) Context::Instance().reset_tokens(seq_len); is not a perfect solution.

Hi @andrewchernyh,

Thanks for bringing this up and showing how it can be problematic. On the other hand, I feel that getting the current sentence length from the past_key_values or the attention mask may not be the perfect solution either, since there are cases where the attention mask is not passed and we perform triangular masking by default, and the caching mechanism can be handled internally rather than passed from outside. Also, I would say deepspeed-inference does not work properly in the case you just mentioned, because we do not consume the past_key_values sent from outside; it is all managed internally. I would be happy to help add this feature if you want to work on it :-)

@RezaYazdaniAminabadi
Contributor

@RezaYazdaniAminabadi I reopened PR #2344 as PR #2359 after checking the current master. It still has memory corruption.

Thanks a lot for your contribution; it certainly points us in the direction of solving this in a more definitive way. I have added some comments on your PR. Please take a look, and after making some changes we can merge it. Thanks again.
Best,
Reza

@andrewchernyh
Contributor

Hi @RezaYazdaniAminabadi,
I think that in the real world the attention mask should always be passed, because batching will not work without it; for GPT it requires left padding and a correct position_ids calculation. FasterTransformer has such a feature, called interactive generation, controlled by a boolean flag, and I think it would be good to have something similar in DeepSpeed.
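
For context, a minimal sketch of the left padding and position_ids handling mentioned above for batched GPT inputs; "gpt2" and the prompts are placeholders.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.padding_side = "left"            # GPT-style models should be padded on the left
tok.pad_token = tok.eos_token

batch = tok(["DeepSpeed is", "A much longer prompt about DeepSpeed inference"],
            return_tensors="pt", padding=True)

# Positions must skip the left padding so that real tokens start at position 0.
position_ids = batch.attention_mask.cumsum(dim=-1) - 1
position_ids.masked_fill_(batch.attention_mask == 0, 1)

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
with torch.no_grad():
    out = model(input_ids=batch.input_ids,
                attention_mask=batch.attention_mask,
                position_ids=position_ids)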

@RezaYazdaniAminabadi
Contributor

I agree, but there are cases where there is no padding and we get a ragged batch of inputs; in that case, no mask is passed. The masking can even be sparse, and then we have to deal with a predefined mask which does not show how many tokens have been generated so far. Anyway, I still think adding this feature as you suggested would be helpful, but I wanted to mention that there are cases where this assumption might not hold.

@awan-10
Contributor

awan-10 commented Oct 28, 2022

Is this issue resolved @RezaYazdaniAminabadi @andrewchernyh? If yes, kindly close the issue.

@tjruwase
Contributor

tjruwase commented Nov 4, 2022

Fixed, and so closing. Please (re)open if needed.
