Memory leak from long-duration inference #801
Comments
Yup, @Narsil, same problem here. I'd suggest just adding the torch cache clear in the stop function inside the causal LM file to fix it, tbh. It works great and is all you would need to add. I have been using it and have had no issues. I just wish there were a better way to handle max batch tokens, because it seems like after that update I am being restricted when running multiple models at the same time. Before, I could run an HF 13B (20k batch tokens) and then a 15B GPTQ (30k batch tokens), but now I'm having to cap the memory and tweak a lot of settings. It also uses a bit more VRAM for fewer batch tokens than before.
@bloodsucker99 do you mind opening a PR for it? I'm not sure where the clear should be added.
Hey @Narsil, if you look inside causal_lm.py at line 660,
that's where you can add the statement, and it will fix the memory issue. :)
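For readers landing here later, a minimal sketch of the kind of change being described, assuming the cleanup hook in causal_lm.py looks roughly like this (the class and method structure is illustrative, not the actual TGI source):

```python
import torch


class CausalLM:
    # ... existing generation / batch handling ...

    def stop(self) -> None:
        # Once the finished batch's tensors are no longer referenced,
        # ask PyTorch's caching allocator to hand unused blocks back to
        # the driver so the memory reported free by nvidia-smi recovers.
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
```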
@Rogerwyf I made the PR for it: #829. Thank you, @bloodsucker99. However, if that fixes it, it looks like it might not be an actual leak, just the torch allocator releasing less and less because it actually needs the memory. Unless we actually run into OOMs in production, I would tend to consider this a false positive.
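On the false-positive point: PyTorch's caching allocator keeps freed blocks reserved, so nvidia-smi and DCGM count them as used even though they can be reused. A small, hedged way to tell allocator caching apart from a genuine leak from inside the Python process (plain torch APIs, nothing TGI-specific):

```python
import torch


def report_cuda_memory(tag: str, device: int = 0) -> None:
    # memory_allocated: bytes currently held by live tensors.
    # memory_reserved: bytes the caching allocator has claimed from the driver.
    # If allocated stays flat while reserved grows, the allocator is just
    # caching; if allocated itself grows under a steady workload, something
    # is holding on to tensors, i.e. a real leak.
    allocated_mib = torch.cuda.memory_allocated(device) / 1024**2
    reserved_mib = torch.cuda.memory_reserved(device) / 1024**2
    print(f"[{tag}] allocated={allocated_mib:.0f} MiB reserved={reserved_mib:.0f} MiB")
```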
@Narsil it actually OOMs eventually because we just run out of memory. It's impossible to run inference with the model for more than about 10 queries. There is also another issue: if something is already running on the GPU and you then load the model, it seems to require higher settings to get it working.
Have you tried taking the latest image for a spin?
Yeah, I am on the latest.
Thank you @Narsil for the quick PR! I'm wondering if you have an answer to my other question/comment regarding max_batch_total_tokens.
I can try running a stress test of some sort with
@Rogerwyf yes, this is the new change from when auto max_batch_total_tokens was introduced. TGI will automatically calculate it based on the settings used and the hardware available.
I encountered a similar issue while using the
I am utilizing a V100 32GB GPU, and the deployment parameters are as follows:
@ZeroYuJie What hardware + CUDA version + environment?
@Narsil CUDA Version: 12.2 + CentOS 7, running in Docker.
@Narsil it seems like this is the thread on the memory leak. For others: I don't know if you've been running it for a long time, but eventually it fails. I have currently only tested quantized NF4 and GPTQ, but it is reproducible on both. For NF4 the memory seems to increase more slowly; for GPTQ it increases faster. I'm thinking of doing periodic pod restarts in a K8s environment as a temporary workaround while this is being investigated.
@jerryMeng100 I was able to use
@Narsil It seems that, given how many people have reported this issue, it is indeed a memory leak, but I haven't gotten a chance to test out the fix you added. I'm wondering if you have observed the same memory leak in production as well.
This does not fix it for my case. I don't get a CUDA OOM; it's rather the pod's RAM OOM, in the form of a "transport error". I tried the CUDA memory fraction setting, but it didn't do anything for this case.
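For context on why a GPU memory cap would not help with a pod-RAM OOM: as far as I understand, TGI's CUDA memory fraction setting only limits the per-process CUDA allocator, roughly along the lines of the PyTorch call sketched below, and leaves host memory untouched (the 0.8 value is just an example):

```python
import torch

# Illustrative only: cap this process to ~80% of GPU 0's memory.
# The cap applies to the CUDA caching allocator, so it can contain
# GPU allocator growth but does nothing for host (pod) RAM usage.
if torch.cuda.is_available():
    torch.cuda.set_per_process_memory_fraction(0.8, device=0)
```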
@Ichigo3766, what type of hardware do you run TGI with? And are you on > v1?
@OlivierDehaene Yes, > v1, and I'm on 4 A10Gs (96GB VRAM). Using it for a personal use case with some friends, pretty much. I added that cache clear in every causal file and have had no issues ever since; I've been using it for a while now. It's the easiest fix :)
I ran into the same issue... This is the specific Llama-2 service and its configuration:

```yaml
services:
  llama2-70b:
    image: ghcr.io/huggingface/text-generation-inference:latest
    restart: always
    shm_size: 1gb
    env_file:
      - variables.env
    volumes:
      - $VOLUME:/data
    environment:
      - HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN
    container_name: llama2-70b
    ports:
      - "3070:80"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: [ '1' ]
              capabilities: [ gpu ]
    command: >- # treat below as one long string
      --model-id meta-llama/Llama-2-70b-chat-hf
      --quantize bitsandbytes-nf4
      --num-shard 1
      --max-input-length 3072
      --max-total-tokens 4096
```

The model is deployed on an A100 80GB GPU.
This seems to come up every time I see this issue; it looks like bitsandbytes (bnb) is leaking. Could you try creating an issue upstream and pinging me so I can follow the discussion?
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
System Info
Hi there,
I am using the image ghcr.io/huggingface/text-generation-inference:0.9 for inference on a single A100-40GB GPU on EKS with the following arguments:

I noticed max_batch_total_tokens was overridden to an inferred value, and this seems to be a change from v0.9.4 (a related issue).

But more importantly, I noticed the memory usage kept increasing throughout a 5-hour inference session. Below is a chart of DCGM_FI_DEV_FB_FREE (free frame buffer in MB) for the container from NVIDIA's DCGM exporter, going from 1517 to 337, before I killed the deployment out of concern about a CUDA OOM:

I also verified with nvidia-smi at the end of the session to confirm that the free memory had indeed gone down to 339MB:

Is this expected? Why would the memory usage keep going up? Is there a possible memory leak, and how could we prevent it from happening?
Information
Tasks
Reproduction
As described above.
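If it helps anyone reproduce the measurement without the DCGM exporter, here is a rough sketch that polls free GPU memory (the same quantity as DCGM_FI_DEV_FB_FREE, or the free column in nvidia-smi) via the NVML Python bindings while traffic is sent to the server; the one-minute interval is arbitrary:

```python
import time

import pynvml  # provided by the nvidia-ml-py package

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU index 0, adjust as needed

try:
    # Log free frame-buffer memory once a minute; a steady downward trend
    # over a long inference session is the symptom described above.
    while True:
        free_mib = pynvml.nvmlDeviceGetMemoryInfo(handle).free / 1024**2
        print(f"{time.strftime('%H:%M:%S')} free={free_mib:.0f} MiB")
        time.sleep(60)
finally:
    pynvml.nvmlShutdown()
```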
Expected behavior
I expected the free GPU memory to not change throughout the inference session.