Memory leak from long-duration inference #801
Comments
Yup, @Narsil, same problem here. I'd suggest just adding the torch cache clear in the stop function inside the causal LM file to fix it, tbh. It works great and is all you would need to add. I have been using it and have had no issues. I just wish there were a better way to handle max batch tokens, because it seems like after that update I am being restricted when running multiple models at the same time. Before, I could run an HF 13B (20k batch tokens) and then a 15B GPTQ (30k batch tokens), but now I'm having to cap the memory and tweak a lot of settings. It also uses a bit more VRAM for fewer batch tokens than before.
@bloodsucker99 do you mind opening a PR for it? I'm not sure where the clear should be added.
Hey @Narsil, if you look inside causal_lm.py at line 660,
that's where you can add the statement, and it will fix the memory issue. :)
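For readers landing here later, a minimal sketch of the kind of change being described, assuming the cleanup hook in causal_lm.py looks roughly like this (the class and method structure is illustrative, not the actual TGI source):

```python
import torch


class CausalLM:
    # ... existing generation / batch handling ...

    def stop(self) -> None:
        # Once the finished batch's tensors are no longer referenced,
        # ask PyTorch's caching allocator to hand unused blocks back to
        # the driver so the memory reported free by nvidia-smi recovers.
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
```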
@Rogerwyf I made the PR for it: #829. Thank you, @bloodsucker99. However, if that fixes it, it looks like it might not be an actual leak, just the torch allocator releasing less and less because it actually needs the memory. Unless we actually run into OOMs in production, I would tend to consider this a false positive.
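On the false-positive point: PyTorch's caching allocator keeps freed blocks reserved, so nvidia-smi and DCGM count them as used even though they can be reused. A small, hedged way to tell allocator caching apart from a genuine leak from inside the Python process (plain torch APIs, nothing TGI-specific):

```python
import torch


def report_cuda_memory(tag: str, device: int = 0) -> None:
    # memory_allocated: bytes currently held by live tensors.
    # memory_reserved: bytes the caching allocator has claimed from the driver.
    # If allocated stays flat while reserved grows, the allocator is just
    # caching; if allocated itself grows under a steady workload, something
    # is holding on to tensors, i.e. a real leak.
    allocated_mib = torch.cuda.memory_allocated(device) / 1024**2
    reserved_mib = torch.cuda.memory_reserved(device) / 1024**2
    print(f"[{tag}] allocated={allocated_mib:.0f} MiB reserved={reserved_mib:.0f} MiB")
```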
@Narsil it actually OOMs eventually because we just run out of memory. It's impossible to run inference with the model for more than about 10 queries. There is also another issue: if something is already running on the GPU and you then load the model, it seems to require higher settings to get it working.
Have you tried taking the latest image for a spin?
Yeah, I am on the latest.
Thank you @Narsil for the quick PR! I'm wondering if you have an answer to my other question/comment regarding max_batch_total_tokens.
I can try running a stress test of some sort with
@Rogerwyf yes, this is the new change from when auto max_batch_total_tokens was introduced. TGI will automatically calculate it based on the settings used and the hardware available.
I encountered a similar issue while using the
I am utilizing a V100 32GB GPU, and the deployment parameters are as follows:
@ZeroYuJie What hardware + CUDA version + environment?
@Narsil CUDA Version: 12.2 + CentOS 7, running in Docker.
@Narsil it seems like this is the thread on the memory leak. For others: I don't know if you've been running it for a long time, but eventually it fails. I have currently only tested quantized NF4 and GPTQ, but it is reproducible on both. For NF4 the memory seems to increase more slowly; for GPTQ it increases faster. I'm thinking of doing periodic pod restarts in a K8s environment as a temporary workaround while this is being investigated.
@jerryMeng100 I was able to use
@Narsil It seems that, given how many people have reported this issue, it is indeed a memory leak, but I haven't gotten a chance to test out the fix you added. I'm wondering if you have observed the same memory leak in production as well.
This does not fix it for my case. I don't get a CUDA OOM; it's rather the pod's RAM OOM, in the form of a "transport error". I tried the CUDA memory fraction setting, but it didn't do anything for this case.
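For context on why a GPU memory cap would not help with a pod-RAM OOM: as far as I understand, TGI's CUDA memory fraction setting only limits the per-process CUDA allocator, roughly along the lines of the PyTorch call sketched below, and leaves host memory untouched (the 0.8 value is just an example):

```python
import torch

# Illustrative only: cap this process to ~80% of GPU 0's memory.
# The cap applies to the CUDA caching allocator, so it can contain
# GPU allocator growth but does nothing for host (pod) RAM usage.
if torch.cuda.is_available():
    torch.cuda.set_per_process_memory_fraction(0.8, device=0)
```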
@Ichigo3766, what type of hardware do you run TGI with? And are you on > v1?
@OlivierDehaene Yes, > v1, and I'm on 4 A10Gs (96GB VRAM). Using it for a personal use case with some friends, pretty much. I added that cache clear in every causal file and have had no issues ever since; I've been using it for a while now. It's the easiest fix :)
I ran into the same issue... This is the specific Llama-2 service and its configuration:

```yaml
services:
  llama2-70b:
    image: ghcr.io/huggingface/text-generation-inference:latest
    restart: always
    shm_size: 1gb
    env_file:
      - variables.env
    volumes:
      - $VOLUME:/data
    environment:
      - HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN
    container_name: llama2-70b
    ports:
      - "3070:80"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: [ '1' ]
              capabilities: [ gpu ]
    command: >- # treat below as one long string
      --model-id meta-llama/Llama-2-70b-chat-hf
      --quantize bitsandbytes-nf4
      --num-shard 1
      --max-input-length 3072
      --max-total-tokens 4096
```

The model is deployed on an A100 80GB GPU.
This seems to come up every time I see this issue; it looks like bitsandbytes (bnb) is leaking. Could you try creating an issue upstream and pinging me so I can follow the discussion?
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
System Info
Hi there,
I am using the image ghcr.io/huggingface/text-generation-inference:0.9 for inference on a single A100-40GB GPU on EKS with the following arguments:

I noticed max_batch_total_tokens was overridden to an inferred value, and this seems to be a change from v0.9.4 (a related issue).

But more importantly, I noticed the memory usage kept increasing throughout a 5-hour inference session. Below is a chart of DCGM_FI_DEV_FB_FREE (free frame buffer in MB) for the container from NVIDIA's DCGM exporter, going from 1517 to 337, before I killed the deployment out of concern about a CUDA OOM:

I also verified with nvidia-smi at the end of the session to confirm that the free memory had indeed gone down to 339MB:

Is this expected? Why would the memory usage keep going up? Is there a possible memory leak, and how could we prevent it from happening?
Information
Tasks
Reproduction
As described above.
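If it helps anyone reproduce the measurement without the DCGM exporter, here is a rough sketch that polls free GPU memory (the same quantity as DCGM_FI_DEV_FB_FREE, or the free column in nvidia-smi) via the NVML Python bindings while traffic is sent to the server; the one-minute interval is arbitrary:

```python
import time

import pynvml  # provided by the nvidia-ml-py package

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU index 0, adjust as needed

try:
    # Log free frame-buffer memory once a minute; a steady downward trend
    # over a long inference session is the symptom described above.
    while True:
        free_mib = pynvml.nvmlDeviceGetMemoryInfo(handle).free / 1024**2
        print(f"{time.strftime('%H:%M:%S')} free={free_mib:.0f} MiB")
        time.sleep(60)
finally:
    pynvml.nvmlShutdown()
```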
Expected behavior
I expected the free GPU memory to not change throughout the inference session.