
Memory leak from long-duration inference #801

Closed
ghost opened this issue Aug 10, 2023 · 20 comments

@ghost

ghost commented Aug 10, 2023

System Info

Hi there,

I am using this image ghcr.io/huggingface/text-generation-inference:0.9 for inference on a single A100-40GB GPU on EKS with the following arguments:

- '--model-id'
- bigcode/starcoderbase
- '--num-shard'
- '1'
- '--max-batch-total-tokens'
- '6000'
- '--max-concurrent-requests'
- '2000'
- '--max-input-length'
- '2048'
- '--max-total-tokens'
- '3072'

I noticed max_batch_total_tokens was overridden to an inferred value, and this seems to be a change from v0.9.4 (a related issue)

WARN text_generation_router: router/src/main.rs:232: Inferred max batch total tokens: 367184
INFO text_generation_router: router/src/main.rs:239: Setting max batch total tokens to 367184

But more importantly, I noticed the memory usage kept increasing throughout a 5-hour inference session. Below is a chart of DCGM_FI_DEV_FB_FREE (free framebuffer memory, in MB) for the container, from NVIDIA's DCGM exporter; it dropped from 1517 to 337 before I killed the deployment out of concern about a CUDA OOM:

[Screenshot: DCGM_FI_DEV_FB_FREE chart, taken 2023-08-09]

I also verified with nvidia-smi at the end of the session to confirm that the free memory had indeed gone down to 339 MB:

[Screenshot: nvidia-smi output, taken 2023-08-09]

Is this expected? Why would the memory usage keep going up? Is there a possible memory leak and how could we prevent it from happening?

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

As described above.

Expected behavior

I expected the free GPU memory to not change throughout the inference session.

@Ichigo3766

Yup, @Narsil, same problem here. I'd suggest just adding the torch cache clear in the stop path inside the causal LM file to fix it, to be honest. It works great and is all you need to add; I've been using it with no issues. I just wish there were a better way to handle max batch tokens, because since that update I've been restricted in running multiple models at the same time. Before, I could run an HF 13B (20k batch tokens) and then a 15B GPTQ (30k batch tokens), but now I have to cap the memory and tweak a lot of settings. It also uses a bit more VRAM for fewer batch tokens than before.

@Narsil
Collaborator

Narsil commented Aug 11, 2023

@bloodsucker99 do you mind opening a PR for it? I'm not sure where the clear should be added.

@Ichigo3766

Ichigo3766 commented Aug 11, 2023

Hey @Narsil, if you look inside causal_lm.py at line 660:

    if stopped:
        # release blocks held by the CUDA caching allocator once the whole batch has stopped
        torch.cuda.empty_cache()
        return generations, None

That's where you can add the statement, and it will fix the memory issue. :)

@Narsil
Collaborator

Narsil commented Aug 12, 2023

@Rogerwyf I made the PR for it: #829

Thank you @bloodsucker99.

However, if that fixes it, it looks like it might not be an actual leak, just the torch allocator releasing less and less because it actually needs the memory.
That, or fragmentation, which could be a real issue.

Unless we actually run into OOMs in production, I would tend to consider this a false positive.
OOMs would prove the leak and/or the fragmentation (either way, a bug).
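
A quick way to tell those two cases apart (just a sketch using the standard torch.cuda statistics, not something specific to TGI): if torch.cuda.memory_allocated() keeps growing across requests, live tensors are being retained (a real leak); if only torch.cuda.memory_reserved() grows while allocated stays flat, it is the caching allocator holding on to blocks and/or fragmentation.

    import torch

    def log_cuda_memory(tag: str, device: int = 0) -> None:
        # memory_allocated: bytes held by live tensors
        # memory_reserved: bytes held by the CUDA caching allocator (includes cached, unused blocks)
        allocated_mib = torch.cuda.memory_allocated(device) / 2**20
        reserved_mib = torch.cuda.memory_reserved(device) / 2**20
        print(f"[{tag}] allocated={allocated_mib:.0f} MiB reserved={reserved_mib:.0f} MiB")

    # e.g. call after every completed batch: steadily growing "allocated" suggests a true leak,
    # while growing "reserved" with flat "allocated" points at caching or fragmentation.
    log_cuda_memory("after batch")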

@Ichigo3766

@Narsil it actually OOMs eventually because we just run out of memory. It's impossible to run inference with the model for more than about 10 queries. There is also another issue: if something else is already running on the GPU and you then load the model, it seems to require higher settings to get it working.

@Narsil
Collaborator

Narsil commented Aug 12, 2023

Have you taken the latest image for a spin?

@Ichigo3766

Yeah, I am on the latest.

@ghost
Author

ghost commented Aug 13, 2023

@Rogerwyf I made the PR for it: #829

Thank you @bloodsucker99.

However, if that fixes it, it looks like it might not be an actual leak, just the torch allocator releasing less and less because it actually needs the memory. That, or fragmentation, which could be a real issue.

Unless we actually run into OOMs in production, I would tend to consider this a false positive. OOMs would prove the leak and/or the fragmentation (either way, a bug).

Thank you @Narsil for the quick PR! I'm wondering if you have an answer to my other question/comment regarding max_batch_total_tokens in the original post. Is the parameter override an expected behaviour?

Unless we actually run into OOMs in production, I would tend to consider this a false positive.
OOMs would prove the leak and/or the fragmentation (either way, a bug).

I can try running a stress test of some sort with v0.9.4 and see if it does indeed run into an OOM, as confirmation/proof.

@Ichigo3766

Ichigo3766 commented Aug 14, 2023

@Rogerwyf yes, this is the new behaviour since automatic max batch total tokens was introduced. TGI will automatically calculate it based on the settings used and the hardware available.

@ZeroYuJie

I encountered a similar issue while using the NousResearch/Redmond-Puffin-13B model on version v1.0.1. During testing with actual concurrent generation, GPU memory usage gradually increases until it reaches a point where inference can no longer be handled.
The error is:

ERROR batch{batch_size=1}:prefill:prefill{id=22 size=1}:prefill{id=22 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: CUDA out of memory. Tried to allocate 60.00 MiB (GPU 0; 31.74 GiB total capacity; 30.31 GiB already allocated; 8.38 MiB free; 31.34 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I am utilizing a V100 32GB GPU, and the deployment parameters are as follows:

docker run --rm --name tgi --gpus all --shm-size 1g -v /home/webserver/models:/models ghcr.io/huggingface/text-generation-inference:1.0.1 --model-id /models/Redmond-Puffin-13B --dtype float16 --max-input-length 2048 --max-total-tokens 3000
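
As a side note (untested in this setup, value purely illustrative): the error message itself suggests experimenting with max_split_size_mb when reserved memory is much larger than allocated memory. With the container above, that would just mean passing the allocator setting as an environment variable:

    docker run --rm --name tgi --gpus all --shm-size 1g \
      -e PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512 \
      -v /home/webserver/models:/models \
      ghcr.io/huggingface/text-generation-inference:1.0.1 \
      --model-id /models/Redmond-Puffin-13B --dtype float16 \
      --max-input-length 2048 --max-total-tokens 3000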

@Narsil
Collaborator

Narsil commented Aug 22, 2023

@ZeroYuJie What hardware + CUDA version + environment?

@ZeroYuJie

@Narsil CUDA 12.2 + CentOS 7, running in Docker.

@0xymoro

0xymoro commented Sep 1, 2023

@Narsil it seems like this is the thread on the memory leak.

For others: I don't know if you've been running it for a long time, but eventually it fails. I've currently only tested quantized NF4 and GPTQ, and it is reproducible on both; for NF4 the memory usage seems to increase more slowly, for GPTQ faster. I'm thinking of doing periodic pod restarts in a K8s environment as a temporary workaround while this is being investigated.

#931

@ghost
Author

ghost commented Sep 1, 2023

@jerryMeng100 I was able to use --cuda-memory-fraction to get around this issue. If I understand correctly, this hard-caps how much memory is accessible to PyTorch, to prevent CUDA OOM.
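
For example, in the same argument-list style as my original deployment (the 0.8 here is illustrative, not necessarily the value you want):

    - '--cuda-memory-fraction'
    - '0.8'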

@Narsil Given how many people have reported this issue, it does seem to be a memory leak, but I haven't had a chance to test the fix you added. I'm wondering if you have observed the same memory leak in production as well.

@0xymoro

0xymoro commented Sep 3, 2023

This does not fix it in my case. I don't get a CUDA OOM; rather, the pod runs out of RAM, which shows up as a "transport error". I tried --cuda-memory-fraction and it didn't do anything for this case.

@OlivierDehaene
Member

@Ichigo3766, what type of hardware do you run TGI with, and are you on > v1?

@Ichigo3766

Ichigo3766 commented Sep 7, 2023

@OlivierDehaene Yes, I'm on > v1, and I'm on 4 A10Gs (96 GB VRAM). Using it for a personal use case with some friends, pretty much.

By the way, I have added that cache clear in every causal file and have had no issues ever since; I've been using it for a while. It's the easiest fix :)

@LarsHill

I ran into the same issue...
I deployed Llama-2 via docker-compose using the latest Docker image.
After running the endpoint for many hours, with different users sending multiple requests, the GPU memory eventually fills up completely and a CUDA OOM error is thrown for every subsequent request.

This is the specific Llama-2 service and its configuration.

services:
  llama2-70b:
    image: ghcr.io/huggingface/text-generation-inference:latest
    restart: always
    shm_size: 1gb
    env_file:
      - variables.env
    volumes:
      - $VOLUME:/data
    environment:
      - HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN
    container_name: llama2-70b
    ports:
      - "3070:80"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: [ '1' ]
              capabilities: [ gpu ]
    command: >-  # treat below as one long string
      --model-id meta-llama/Llama-2-70b-chat-hf 
      --quantize bitsandbytes-nf4 
      --num-shard 1 
      --max-input-length 3072 
      --max-total-tokens 4096

The model is deployed on an A100 80GB GPU.

@Narsil
Collaborator

Narsil commented Sep 18, 2023

--quantize bitsandbytes-nf4

This seems to come up every time I see this issue; it looks like bitsandbytes (bnb) is leaking.
We happen not to use it in production ourselves, which would explain why we haven't seen it.

Could you try creating an issue upstream and pinging me so I can follow the discussion?
It should be relatively easy to reproduce the leak in isolation.
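
A minimal isolated repro could look roughly like this (my sketch; the model id and generation settings are illustrative): load any causal LM in NF4 via transformers + bitsandbytes and watch allocated memory across repeated generate calls.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "meta-llama/Llama-2-7b-hf"  # illustrative; any causal LM that fits on the GPU
    bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )

    inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
    for i in range(100):
        with torch.no_grad():
            model.generate(**inputs, max_new_tokens=64)
        # allocated memory growing monotonically across iterations would point at a leak in the 4-bit path
        print(i, torch.cuda.memory_allocated() // 2**20, "MiB allocated")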


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label on Apr 12, 2024
github-actions bot closed this as not planned on Apr 17, 2024