[Bug]: Out of Memory (OOM) Issues During MMLU Evaluation with lm_eval #10325

wchen61 opened this issue Nov 14, 2024 · 0 comments
Labels
bug Something isn't working

Your current environment

vllm 0.6.0
lm_eval 0.4.5
torch 2.4
A100 + CUDA 12.3

Model Input Dumps

No response

🐛 Describe the bug

Description:
When using lm_eval for MMLU accuracy evaluation, I frequently encounter OOM errors. The problem is not tied to a single model; many models are prone to it. For example, even Meta-Llama-3-8B-Instruct (https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) hits OOM on an A100 GPU.

I have verified that offline inference with this model works fine, so both my software and hardware are capable of running it. However, when using lm_eval, the system runs out of memory. Specifically, when the batch size is set to auto, the workload behaves much like the benchmark_throughput scenario: all requests are placed in the pool, and vLLM continuously pulls requests for inference, followed by result analysis.
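
For reference, the offline check was along these lines (a minimal sketch using vLLM's offline LLM API; the prompt and generation settings are placeholders, not the exact script I ran):

from vllm import LLM, SamplingParams

# Plain offline inference with the same model and the same memory settings:
# this completes without OOM, so the stack itself can run the model.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=1,
    dtype="auto",
    gpu_memory_utilization=0.9,
)
outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)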

Upon further investigation, I found that the discrepancy between the peak memory reported by the profile_run function and the actual peak memory is caused by the different sampling parameters lm_eval uses. Specifically, lm_eval sets prompt_logprobs=1, which significantly increases memory consumption: with max-num-seqs=256 and max-num-batched-tokens=8096, the default configuration reports a peak of 10GB, but with prompt_logprobs=1 the peak reaches 50GB. vLLM sizes its memory reservation (and the KV cache) from the profile_run peak, so the extra activation memory that was never profiled leads to OOM errors during actual execution.
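
The gap can be reproduced outside lm_eval by comparing peak GPU memory for a prefill with and without prompt_logprobs. A rough sketch (the prompt length, the lowered gpu_memory_utilization, and using torch.cuda.max_memory_allocated as the measurement are choices made for this experiment, not the exact methodology above):

import torch
from vllm import LLM, SamplingParams

# Leave headroom so the prompt_logprobs run itself does not OOM while measuring.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    gpu_memory_utilization=0.5,
    max_num_seqs=256,
)
long_prompt = "word " * 2000  # long prefill so prompt logprobs dominate

def peak_gb(sampling_params):
    torch.cuda.reset_peak_memory_stats()
    llm.generate([long_prompt] * 8, sampling_params)
    # Includes weights + KV cache; only the difference between runs matters.
    return torch.cuda.max_memory_allocated() / 1024**3

print("default        :", peak_gb(SamplingParams(max_tokens=1)))
print("prompt_logprobs:", peak_gb(SamplingParams(max_tokens=1, prompt_logprobs=1)))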

The command line I used to run lm_eval is as follows:

lm_eval --model vllm --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.9 --tasks mmlu --batch_size auto

When I manually set prompt_logprobs=1 in the sampling parameters that vLLM uses for profile_run, the profiled peak matches the real workload and lm_eval runs successfully.

Suggested Improvement:

It would be helpful to introduce a mechanism that lets third-party users describe their use case (for example, the sampling parameters the workload will actually use), so that vLLM can estimate the required peak memory more accurately during profiling.
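
Purely as an illustration of what such a mechanism could look like (nothing below exists in vLLM today; profile_sampling_params is a made-up argument name):

from vllm import LLM, SamplingParams

# Hypothetical knob: tell vLLM which sampling features the real workload will
# use, so profile_run estimates peak memory with them enabled and the KV cache
# is sized against a realistic activation peak.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    gpu_memory_utilization=0.9,
    profile_sampling_params=SamplingParams(prompt_logprobs=1),  # not a real argument
)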

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.