[Performance]: The impact of CPU on vLLM performance is significant. #8147

skylee-01 · 2024-09-04T06:33:15Z

Proposal to improve performance

We ran the same GPU on two machines with different CPUs and drew the following conclusions.
Experimental results: the GPU is an RTX 3090, and the CPU was upgraded from a Xeon Gold 6240 to an i9-12900K. The impact is as follows:
a. vLLM achieved a 3.8x speedup in the agent scenario.
b. TGI achieved a 1.23x speedup in the agent scenario.
c. vLLM still has latency issues, but the latency dropped from 300 ms to 100 ms.
d. GPU utilization increased from 70% to 90%.

The stress test data show that vLLM relies heavily on CPU performance.
What are the main sources of this CPU overhead, and how can they be optimized?

skylee-01 · 2024-09-04T06:35:40Z

Related experiments: #7540

skylee-01 · 2024-09-04T06:38:03Z

@WoosukKwon @youkaichao Please provide some assistance.

youkaichao · 2024-09-04T06:39:04Z

what is the vllm version you use?

skylee-01 · 2024-09-04T06:40:23Z

> what is the vllm version you use?

0.5.5

youkaichao · 2024-09-04T06:41:47Z

we are optimizing the cpu time, please stay tuned. it should not be so dependent on CPU performance in the future.

skylee-01 · 2024-09-04T06:47:26Z

> we are optimizing the cpu time, please stay tuned. it should not be so dependent on CPU performance in the future.

Why does vLLM currently depend so heavily on the CPU, and what are the directions for optimization?
Our team would also like to take part in vLLM development and contribute code to the vLLM community.

youkaichao · 2024-09-04T06:54:55Z

cpu needs to serve http requests, and also prepare lots of input data for the GPU, which changes for every step (because of continuous batching and auto-regressive LLM decoding).

for some examples on this line of optimization, see #7000 and #8092

contributions are definitely welcome!
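
To make the per-step CPU work described above concrete, here is a toy sketch of a continuous-batching decode loop. It is not vLLM code: `Request`, `prepare_step_inputs`, `fake_gpu_forward`, and `serve` are hypothetical names invented for illustration, the "model" is a stand-in, and prefill is omitted for brevity. The point is only that batch membership and sequence lengths change at every step, so the CPU has to rebuild the inputs before each GPU launch.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """A hypothetical in-flight request (illustrative, not a vLLM class)."""
    req_id: int
    prompt_tokens: list
    max_new_tokens: int
    generated: list = field(default_factory=list)

    @property
    def finished(self) -> bool:
        return len(self.generated) >= self.max_new_tokens


def prepare_step_inputs(running):
    """CPU-side work: rebuild the batch inputs for one decode step.

    Batch membership and each sequence's length change every step,
    so these lists cannot simply be reused from the previous step.
    """
    input_ids, positions, seq_lens = [], [], []
    for req in running:
        tokens = req.prompt_tokens + req.generated
        input_ids.append(tokens[-1])       # only the newest token is fed in
        positions.append(len(tokens) - 1)  # its position in the sequence
        seq_lens.append(len(tokens))       # a real engine uses this for the KV cache
    return input_ids, positions, seq_lens


def fake_gpu_forward(input_ids):
    """Stand-in for the model: deterministically emit one token per sequence."""
    return [(tok + 1) % 100 for tok in input_ids]


def serve(waiting, max_batch_size=2):
    """Toy continuous-batching loop; returns the batch size used at each step."""
    waiting = list(waiting)
    running, batch_sizes = [], []
    while waiting or running:
        # Admit new requests as soon as a slot frees up (continuous batching).
        while waiting and len(running) < max_batch_size:
            running.append(waiting.pop(0))
        input_ids, positions, seq_lens = prepare_step_inputs(running)
        new_tokens = fake_gpu_forward(input_ids)
        for req, tok in zip(running, new_tokens):
            req.generated.append(tok)
        batch_sizes.append(len(running))
        # Drop finished requests so their slots can be reused next step.
        running = [r for r in running if not r.finished]
    return batch_sizes
```

In this sketch, `prepare_step_inputs` runs once per generated token for the whole batch. When a single GPU step is short, that CPU-side loop (plus HTTP handling) can take a noticeable share of the step time, which is consistent with the faster CPU raising GPU utilization in the report above.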

skylee-01 · 2024-09-04T07:12:25Z

> cpu needs to serve http requests, and also prepare lots of input data for the GPU, which changes for every step (because of continuous batching and auto-regressive LLM decoding).
>
> for some examples on this line of optimization, see #7000 and #8092
>
> contributions are definitely welcome!

Our team has developed some speculative decoding features on top of vLLM. They are in use internally and have delivered good performance gains. How can we join the vLLM project, and where would be a good place to start?

youkaichao · 2024-09-04T07:27:03Z

You are welcome to send an email to [email protected]

robertgshaw2-neuralmagic · 2024-09-04T12:16:27Z

Really interesting. Thanks for reporting. The GPUs are getting fast :)

WoosukKwon · 2024-09-05T06:06:51Z

Hi @skylee-01, thanks for reporting this! We also recently discovered the same problem. We plan to do more optimizations to mitigate the CPU effect.

vLLM is a fully open, community-driven project, so we'd appreciate any contributions, including submitting or reviewing PRs, answering questions, and helping with documentation.

skylee-01 added the performance label (Performance-related issues) on Sep 4, 2024

joerunde mentioned this issue Sep 6, 2024

[CI/Build] Use python 3.12 in cuda image #8133

Merged
