Your current environment
vllm 0.6.0
lm_eval 0.4.5
torch 2.4
A100 + CUDA 12.3
Model Input Dumps
No response
🐛 Describe the bug
Description:
When using lm_eval for MMLU accuracy evaluation, I frequently run into OOM errors. The issue appears to depend on the model, and many models are prone to it. For example, OOM errors still occur even when running Meta-Llama-3-8B-Instruct (https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) on an A100 GPU.
I have verified that offline inference with this model works fine, so both my software and hardware are capable of running it. However, when the same model is driven through lm_eval, it runs out of memory. Specifically, when the batch size is set to auto, the workload behaves much like the benchmark_throughput scenario: all requests are placed in the pool, vLLM continuously fetches requests for inference, and the results are analyzed afterward.
Upon further investigation, I found that the gap between the peak memory reported by the profile_run function and the actual peak memory comes from the sampling parameters lm_eval uses. Specifically, lm_eval sets prompt_logprobs=1, which significantly increases memory consumption. For example, with max-num-seqs=256 and max-num-batched-tokens=8096, the default configuration reports a peak memory usage of 10 GB, but with prompt_logprobs=1 the actual peak reaches 50 GB. vLLM reserves memory based on the profile_run peak, so this underestimate leads to OOM errors during actual execution.
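For reference, here is a minimal offline sketch (not part of my original runs) that compares the peak CUDA memory of the same batch with and without prompt_logprobs=1. The absolute numbers will differ from the 10 GB / 50 GB figures above, but the gap between the two passes shows the extra allocation that profile_run does not account for.

```python
import torch
from vllm import LLM, SamplingParams

# Use a lower gpu_memory_utilization than 0.9 so this repro itself has
# headroom; with 0.9 the prompt_logprobs pass can OOM, which is exactly the
# failure described above.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct",
          gpu_memory_utilization=0.7)

# Longer prompts and larger batches amplify the effect, since prompt logprobs
# require logits for every prompt token in the batch.
prompts = ["The quick brown fox jumps over the lazy dog. " * 20] * 64

for params in (SamplingParams(max_tokens=1),
               SamplingParams(max_tokens=1, prompt_logprobs=1)):
    torch.cuda.reset_peak_memory_stats()
    llm.generate(prompts, params)
    # Both peaks include the weights and the preallocated KV cache; what
    # matters is the difference between the two printed values.
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    print(f"prompt_logprobs={params.prompt_logprobs}: peak {peak_gib:.2f} GiB")
```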
The command line I used to run lm_eval is as follows.
lm_eval --model vllm --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.9 --tasks mmlu --batch_size auto
When I manually set prompt_logprobs=1 in the sampling parameters that vLLM's profile_run uses, lm_eval runs successfully.
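For concreteness, this is roughly what that change looks like. It is a hedged sketch of vLLM internals, assuming the vllm 0.6.0 layout where ModelRunner.profile_run in vllm/worker/model_runner.py builds its own SamplingParams for the dummy profiling batch (other versions may differ), not a patch to apply verbatim.

```python
# Inside ModelRunner.profile_run (vllm/worker/model_runner.py, paraphrased).
# Original: profile with top-k sampling only.
sampling_params = SamplingParams(top_p=0.99, top_k=self.vocab_size - 1)

# Workaround: profile with the same prompt_logprobs setting that lm_eval's
# real requests use, so the measured peak (and therefore the reserved memory)
# already accounts for the per-prompt-token logprob tensors.
sampling_params = SamplingParams(top_p=0.99, top_k=self.vocab_size - 1,
                                 prompt_logprobs=1)
```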
Suggested Improvement:
It would be helpful to introduce a mechanism that lets third-party users describe their use case (for example, that requests will ask for prompt logprobs), so vLLM can estimate the required peak memory more accurately.
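As a purely hypothetical illustration of what such a mechanism could look like (no profile_sampling_params argument exists in vllm 0.6.0; the name is made up for this sketch):

```python
from vllm import LLM, SamplingParams

# Hypothetical: let the caller describe the sampling parameters its workload
# will actually use, and have profile_run measure peak memory with them.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    gpu_memory_utilization=0.9,
    profile_sampling_params=SamplingParams(prompt_logprobs=1),  # made-up name
)
```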
Before submitting a new issue...
Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.