Your current environment
vllm 0.6.0
lm_eval 0.4.5
torch 2.4
A100 + CUDA 12.3
Model Input Dumps
No response
🐛 Describe the bug
Description:
When using lm_eval for MMLU accuracy evaluation, I frequently run into OOM errors. The issue appears to depend on the model, and many models are prone to it. For example, OOM errors still occur even when running Meta-Llama-3-8B-Instruct (https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) on an A100 GPU.
I have verified that offline inference with this model works fine, so both my software and hardware are capable of running it. However, when the same model is driven through lm_eval, it runs out of memory. Specifically, when the batch size is set to auto, the workload behaves much like the benchmark_throughput scenario: all requests are placed in the pool, vLLM continuously fetches requests for inference, and the results are analyzed afterward.
Upon further investigation, I found that the gap between the peak memory reported by the profile_run function and the actual peak memory comes from the sampling parameters lm_eval uses. Specifically, lm_eval sets prompt_logprobs=1, which significantly increases memory consumption. For example, with max-num-seqs=256 and max-num-batched-tokens=8096, the default configuration reports a peak memory usage of 10 GB, but with prompt_logprobs=1 the actual peak reaches 50 GB. vLLM reserves memory based on the profile_run peak, so this underestimate leads to OOM errors during actual execution.
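For reference, here is a minimal offline sketch (not part of my original runs) that compares the peak CUDA memory of the same batch with and without prompt_logprobs=1. The absolute numbers will differ from the 10 GB / 50 GB figures above, but the gap between the two passes shows the extra allocation that profile_run does not account for.

```python
import torch
from vllm import LLM, SamplingParams

# Use a lower gpu_memory_utilization than 0.9 so this repro itself has
# headroom; with 0.9 the prompt_logprobs pass can OOM, which is exactly the
# failure described above.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct",
          gpu_memory_utilization=0.7)

# Longer prompts and larger batches amplify the effect, since prompt logprobs
# require logits for every prompt token in the batch.
prompts = ["The quick brown fox jumps over the lazy dog. " * 20] * 64

for params in (SamplingParams(max_tokens=1),
               SamplingParams(max_tokens=1, prompt_logprobs=1)):
    torch.cuda.reset_peak_memory_stats()
    llm.generate(prompts, params)
    # Both peaks include the weights and the preallocated KV cache; what
    # matters is the difference between the two printed values.
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    print(f"prompt_logprobs={params.prompt_logprobs}: peak {peak_gib:.2f} GiB")
```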
The command line I used to run lm_eval is as follows.
lm_eval --model vllm --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.9 --tasks mmlu --batch_size auto
When I manually set prompt_logprobs=1 in the sampling parameters that vLLM's profile_run uses, lm_eval runs successfully.
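For concreteness, this is roughly what that change looks like. It is a hedged sketch of vLLM internals, assuming the vllm 0.6.0 layout where ModelRunner.profile_run in vllm/worker/model_runner.py builds its own SamplingParams for the dummy profiling batch (other versions may differ), not a patch to apply verbatim.

```python
# Inside ModelRunner.profile_run (vllm/worker/model_runner.py, paraphrased).
# Original: profile with top-k sampling only.
sampling_params = SamplingParams(top_p=0.99, top_k=self.vocab_size - 1)

# Workaround: profile with the same prompt_logprobs setting that lm_eval's
# real requests use, so the measured peak (and therefore the reserved memory)
# already accounts for the per-prompt-token logprob tensors.
sampling_params = SamplingParams(top_p=0.99, top_k=self.vocab_size - 1,
                                 prompt_logprobs=1)
```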
Suggested Improvement:
It would be helpful to introduce a mechanism that lets third-party users describe their use case (for example, that requests will ask for prompt logprobs), so vLLM can estimate the required peak memory more accurately.
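As a purely hypothetical illustration of what such a mechanism could look like (no profile_sampling_params argument exists in vllm 0.6.0; the name is made up for this sketch):

```python
from vllm import LLM, SamplingParams

# Hypothetical: let the caller describe the sampling parameters its workload
# will actually use, and have profile_run measure peak memory with them.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    gpu_memory_utilization=0.9,
    profile_sampling_params=SamplingParams(prompt_logprobs=1),  # made-up name
)
```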
Before submitting a new issue...
Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.