Is there a way to terminate vllm.LLM and release the GPU memory #1908
Please check the code below. It works.

import gc
import torch
from vllm import LLM, SamplingParams
from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel

# Load the model via vLLM
llm = LLM(model=model_name, download_dir=saver_dir, tensor_parallel_size=num_gpus, gpu_memory_utilization=0.70)

# Delete the llm object and free the memory
destroy_model_parallel()
del llm
gc.collect()
torch.cuda.empty_cache()
torch.distributed.destroy_process_group()
print("Successfully delete the llm pipeline and free the GPU memory!")

Best regards, Shuyue |
mark |
Even after executing the code above, the GPU memory is not freed with the latest vllm built from source. Any recommendations? |
Are there any updates on this? The above code does not work for me either. |
+1 |
I find that we need to explicitly run "del llm.llm_engine.driver_worker" to release it when using a single worker. |
+1 |
I tried the above code block and also the line "del llm.llm_engine.driver_worker". Both failed for me. But I managed, with the following code, to terminate vllm.LLM(), release the GPU memory, and shut down Ray so that vllm.LLM() can be used for the next model. After this, I succeeded in using vllm.LLM() again for the next model.
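(The original snippet was not preserved in this copy of the thread; the following is only a sketch of that approach, under the assumption that it combined the cleanup calls shown above with an explicit ray.shutdown(), on an older vLLM where these import and attribute paths still exist. The model name is just a placeholder.)

import gc
import ray
import torch
from vllm import LLM
from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel

llm = LLM(model="facebook/opt-125m")  # placeholder model
# ... run generation with the first model ...

destroy_model_parallel()
del llm.llm_engine.driver_worker  # release the worker holding the model weights
del llm
gc.collect()
torch.cuda.empty_cache()
torch.distributed.destroy_process_group()
ray.shutdown()  # shut down Ray so the next vllm.LLM() can initialize cleanly

# A new vllm.LLM(...) can now be created for the next model.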
Anyway, even if it works, it is just a temporary solution and this issue still needs fixing. |
update:
|
In the latest version of vLLM:

import gc
import torch
from vllm.distributed.parallel_state import destroy_model_parallel
...
destroy_model_parallel()
del llm.llm_engine.model_executor.driver_worker
del llm  # Isn't necessary for releasing memory, but why not
gc.collect()
torch.cuda.empty_cache() |
thx a lot |
vLLM seems to hang on to the first allocated LLM() instance. It does not hang on to later instances. Maybe that helps with diagnosing the issue?

from vllm import LLM

def show_memory_usage():
    import torch.cuda
    import torch.distributed
    import gc

    print(f"cuda memory: {torch.cuda.memory_allocated()//1024//1024}MB")
    gc.collect()
    # torch.distributed.destroy_process_group()
    torch.cuda.empty_cache()
    print(f" --> after gc: {torch.cuda.memory_allocated()//1024//1024}MB")

def gc_problem():
    show_memory_usage()

    print("loading llm0")
    llm0 = LLM(model="facebook/opt-125m", num_gpu_blocks_override=180)
    del llm0
    show_memory_usage()

    print("loading llm1")
    llm1 = LLM(model="facebook/opt-125m", num_gpu_blocks_override=500)
    del llm1
    show_memory_usage()

    print("loading llm2")
    llm2 = LLM(model="facebook/opt-125m", num_gpu_blocks_override=600)
    del llm2
    show_memory_usage()

gc_problem()
The |
Tried this including |
Could try "del llm.llm_engine.model_executor" in the following code instead:
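(The referenced code block is missing here; presumably it was the cleanup sequence from above with the whole model_executor deleted rather than only the driver worker. A sketch under that assumption:)

import gc
import torch
from vllm.distributed.parallel_state import destroy_model_parallel

destroy_model_parallel()
del llm.llm_engine.model_executor  # drop the executor (and its workers) instead of just driver_worker
del llm
gc.collect()
torch.cuda.empty_cache()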
|
Did that as well; still no change in GPU memory allocation. Not sure how to go further. |
We tried this in version 0.4.2, but the GPU memory was not released. |
Then I do not have a clue either. Meanwhile, I should add one piece of information: the vLLM version with which I succeeded with the above code was 0.4.0.post1. |
@zheyang0825 does adding these lines at the end make it work?
|
Tried on 0.4.0.post1 and the method worked; not sure what changed in the latest version that's not releasing the memory. Possible bug? |
Hello! So if I'm not wrong, no one has managed to release memory on vLLM 0.4.2 yet? |
A new bug was introduced in 0.4.2 but fixed in #4737. Please try with that PR, or as a workaround you can also install ... This should resolve such errors at least for TP=1. For TP > 1, there may be other issues with creating a new LLM instance after deleting one in the same process. |
I updated vLLM yesterday and still have the problem. I'm using these lines:
|
This code worked for me on vllm==0.4.0.post1:
|
There should be a built-in way! We cannot keep writing code that breaks on the next minor release :( |
In general, it is very difficult to clean up all resources correctly, especially when we use multiple GPUs, and it might be prone to deadlocks. I would say the most stable way to terminate vLLM is to shut down the process. |
I encountered this issue with TP = 8. I'm doing this in an iterative manner since I need to run the embedding model after the generative model, so there is some loading/offloading. The first iteration is fine, but in the second iteration the instantiation of the vLLM Ray server hangs. |
I understand your point. However, this feature is extremely useful for situations where you need to switch between models, for instance reinforcement learning loops. I am writing an off-policy RL loop, requiring me to train one model (the target policy) while its previous version performs inference (the behavior policy). As a result, I frequently load and unload models. While I know vLLM is not intended for training, using ... Let me know if this is a feature that's wanted and that the team would be interested in maintaining. I can open a separate issue and start working on it. |
I don't know if anyone can currently clear memory correctly, but in version 0.4.2 the above code failed to clear memory for me. I can only use the slightly extreme workaround of creating a new process before the call and closing the process after the call to roughly solve the problem:
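(The original code is not preserved here; below is a minimal sketch of that per-call subprocess pattern, with made-up function names, using multiprocessing with the "spawn" start method so that all GPU memory is returned to the OS when the child process exits.)

import multiprocessing as mp

def _generate_in_subprocess(model_name, prompts, result_queue):
    # Import and construct vLLM inside the child process: everything it
    # allocates on the GPU dies with this process.
    from vllm import LLM, SamplingParams
    llm = LLM(model=model_name)
    outputs = llm.generate(prompts, SamplingParams(temperature=0.8, top_p=0.95))
    result_queue.put([o.outputs[0].text for o in outputs])

def generate_in_fresh_process(model_name, prompts):
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    proc = ctx.Process(target=_generate_in_subprocess, args=(model_name, prompts, queue))
    proc.start()
    results = queue.get()  # fetch results before joining to avoid blocking on a full queue
    proc.join()            # once the child exits, its GPU memory is fully released
    return results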
I still hope there will be a way in the future to correctly and completely clear the memory. |
Glad to see you here @cassanof and to hear that you have been using vLLM in this kind of workflow! Given how much this feature seems to be wanted, I will bring this back to the team to discuss! If multi-GPU instances are prone to deadlocks, then perhaps we can at least start with single-GPU instances. Everyone on the maintainer team has limited bandwidth and we have a lot of things to work on, so contributions are very welcome as always! |
I tried inferring multiple models consecutively with vLLM v0.5.2. As mentioned above, the behavior differs depending on the value of TP.
I use this function in a pipeline where I describe images with a VLM and then summarize them with an LLM. I hope that this kind of processing will be officially provided and become common.

from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import destroy_model_parallel, destroy_distributed_environment
import torch
import gc
import os
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
def main():
    prompts = ["Hello, my name is"]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(model="facebook/opt-125m", tensor_parallel_size=1)
    outputs = llm.generate(prompts, sampling_params)
    print(outputs)

    destroy_model_parallel()
    destroy_distributed_environment()
    del llm.llm_engine.model_executor
    del llm
    gc.collect()
    torch.cuda.empty_cache()

    llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)
    outputs = llm.generate(prompts, sampling_params)
    print(outputs)

    destroy_model_parallel()
    destroy_distributed_environment()
    del llm.llm_engine.model_executor
    del llm
    gc.collect()
    torch.cuda.empty_cache()

if __name__ == "__main__":
    main() |
It works for me. |
With TP=1, I am able to unload the model without difficulty with the method described above, but the re-loading fails with an esoteric error like:

File "/home/bhavnick/fd/workspace/vllm-api/modules/llm/generator.py", line 139, in load
engine = AsyncLLMEngine.from_engine_args(args)
File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
engine = cls(
File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 380, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 547, in _init_engine
return engine_class(*args, **kwargs)
File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 251, in __init__
self.model_executor = executor_class(
File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in __init__
self._init_executor()
File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 35, in _init_executor
self.driver_worker.init_device()
File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/worker/worker.py", line 132, in init_device
init_worker_distributed_environment(self.parallel_config, self.rank,
File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/worker/worker.py", line 346, in init_worker_distributed_environment
ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 920, in ensure_model_parallel_initialized
backend = backend or torch.distributed.get_backend(
File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1074, in get_backend
return Backend(not_none(pg_store)[0])
File "/home/bhavnick/fd/workspace/pyenv/vllm-api/lib/python3.10/site-packages/torch/utils/_typing_utils.py", line 12, in not_none
raise TypeError("Invariant encountered: value was None when it should not be")
TypeError: Invariant encountered: value was None when it should not be

Some differences I can think of are that I am using the |
UPDATE: Works with |
@bhavnicksm I can reproduce the same error you have. |
When debugging, I found that when using 'spawn' the main GPU used (if using PCI_BUS ordering) would still keep some small amount of memory allocated, indicating that the cleanup is unsuccessful. When I then check all available GPUs that have zero memory allocated and export
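(Guessing at the truncated part: presumably exporting CUDA_VISIBLE_DEVICES restricted to the GPUs that still show zero allocation, so the next spawn-based run avoids the device holding leftover memory. A rough sketch of that idea; note that torch.cuda.memory_allocated() only sees this process's PyTorch allocations, and the environment variable only affects processes started afterwards.)

import os
import torch

# Keep only the GPUs on which this process has nothing allocated.
free_gpus = [
    str(i) for i in range(torch.cuda.device_count())
    if torch.cuda.memory_allocated(i) == 0
]
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(free_gpus)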
|
Another update: The problem is the global
|
When using TP>1 for the first model, it seems there's no working method that can successfully release the GPU memory. I've tried all scripts in this thread and none worked. |
@hammer-wang what version of vLLM are you using, which distributed backend (ray or multiprocessing) and how are you running the server? |
Got it working on vllm 0.5.x
and
using this method after llm.generate works for me |
This setup works perfectly on my end when using multiple GPUs! FYI, I am using ...

import gc
import contextlib
import ray
import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (
    destroy_model_parallel,
    destroy_distributed_environment,
)

llm = LLM(
    model=save_dir,
    tokenizer=model_name,
    dtype='bfloat16',
    # Acknowledgement: Benjamin Kitor
    # https://github.com/vllm-project/vllm/issues/2794
    distributed_executor_backend="mp",
    tensor_parallel_size=num_gpus_vllm,
    gpu_memory_utilization=gpu_utilization_vllm,
    # Note: We add this only to save the GPU Memories!
    max_model_len=max_model_len_vllm,
    disable_custom_all_reduce=True,
    enable_lora=False,
)

# Delete the llm object and free the memory
destroy_model_parallel()
destroy_distributed_environment()
del llm.llm_engine.model_executor
del llm
with contextlib.suppress(AssertionError):
    torch.distributed.destroy_process_group()
gc.collect()
torch.cuda.empty_cache()
ray.shutdown()
print("Successfully delete the llm pipeline and free the GPU memory.") |
After the below code, is there an API (maybe like llm.terminate) to kill the llm and release the GPU memory?
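(There is no such built-in API at the moment, as far as this thread shows. A hypothetical helper assembled from the cleanup steps collected above could look like the sketch below; cleanup_llm is not part of vLLM, and the import paths assume a 0.4.x/0.5.x-era release.)

import contextlib
import gc
import torch
from vllm.distributed.parallel_state import (
    destroy_model_parallel,
    destroy_distributed_environment,
)

def cleanup_llm(llm):
    # Rough equivalent of a hypothetical llm.terminate().
    destroy_model_parallel()
    destroy_distributed_environment()
    del llm.llm_engine.model_executor
    gc.collect()
    torch.cuda.empty_cache()
    with contextlib.suppress(AssertionError, ValueError):
        torch.distributed.destroy_process_group()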