[V1] Integrate Piecewise CUDA graphs #10058

WoosukKwon · 2024-11-05T23:18:29Z

This PR integrates the piecewise CUDA graphs into the V1 model runner.

Set `VLLM_TORCH_COMPILE_LEVEL=3` to enable the feature.
Compilation plus graph capture takes at most ~17 seconds.
Piecewise CUDA graphs are currently incompatible with custom ops, so we rely on Torch Inductor to optimize the model.
Consequently, FP8 and other quantization methods are not supported with piecewise CUDA graphs.
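
As a rough illustration of the first point, the feature could be switched on from a launcher script along these lines. This is only a sketch: `VLLM_TORCH_COMPILE_LEVEL=3` comes from the description above, while the helper name `piecewise_env`, the use of `VLLM_USE_V1=1` to select the V1 engine, and the benchmark path and model in the commented-out command are assumptions, not something this PR states.

```python
import os


def piecewise_env(base=None):
    """Return a copy of an environment mapping with piecewise CUDA graphs enabled.

    `piecewise_env` is a hypothetical helper, not part of vLLM.
    """
    env = dict(os.environ if base is None else base)
    # From the PR description: level 3 enables piecewise CUDA graphs.
    env["VLLM_TORCH_COMPILE_LEVEL"] = "3"
    # Assumption: the V1 engine is selected with this variable.
    env["VLLM_USE_V1"] = "1"
    return env


# Build the environment from an empty base so the result is easy to inspect.
env = piecewise_env({})

# Assumed invocation of the latency benchmark; the path and model are illustrative.
cmd = ["python", "benchmarks/benchmark_latency.py", "--model", "facebook/opt-125m"]
# To launch it for real (requires a GPU and a vLLM checkout):
#   import subprocess
#   subprocess.run(cmd, env=piecewise_env(), check=True)
```

The variables are set on a copy of the environment rather than on `os.environ`, so the setting only affects the child process that is launched with it.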

Benchmarks (`benchmark_latency.py`)

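| Model | Version | CUDA Graphs | Time (s) |
| --- | --- | --- | --- |
| OPT-125M (fp16) | V1 | w/o Piecewise CUDA Graphs | 0.41 |
| OPT-125M (fp16) | V1 | w/ Piecewise CUDA Graphs | 0.21 |
| OPT-125M (fp16) | V0 | w/o CUDA Graphs | 0.47 |
| OPT-125M (fp16) | V0 | w/ CUDA Graphs | 0.21 |
| Llama 3.1 8B (bf16) | V1 | w/o Piecewise CUDA Graphs | 1.26 |
| Llama 3.1 8B (bf16) | V1 | w/ Piecewise CUDA Graphs | 1.04 |
| Llama 3.1 8B (bf16) | V0 | w/o CUDA Graphs | 1.10 |
| Llama 3.1 8B (bf16) | V0 | w/ CUDA Graphs | 0.99 |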
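
To make the benchmark numbers easier to compare, the speedup from enabling CUDA graphs can be computed as the ratio of the two timings for each model and version. The sketch below uses only the figures reported in the table; the variable and function names are illustrative.

```python
# (model, version) -> (time without CUDA graphs, time with CUDA graphs), in seconds.
# All values are copied from the benchmark table in the PR description.
timings = {
    ("OPT-125M (fp16)", "V1"): (0.41, 0.21),
    ("OPT-125M (fp16)", "V0"): (0.47, 0.21),
    ("Llama 3.1 8B (bf16)", "V1"): (1.26, 1.04),
    ("Llama 3.1 8B (bf16)", "V0"): (1.10, 0.99),
}


def speedup(without_graphs, with_graphs):
    """Latency ratio: values above 1.0 mean CUDA graphs made the run faster."""
    return without_graphs / with_graphs


speedups = {key: round(speedup(*pair), 2) for key, pair in timings.items()}

for (model, version), ratio in speedups.items():
    print(f"{model} {version}: {ratio:.2f}x")
```

On these numbers the gain is much larger for the small model (roughly 2x for OPT-125M) than for Llama 3.1 8B (roughly 1.1x to 1.2x).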

github-actions · 2024-11-05T23:18:40Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

mergify · 2024-11-05T23:19:12Z

This pull request has merge conflicts that must be resolved before it can be
merged. @WoosukKwon please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2024-11-06T02:20:30Z

This pull request has merge conflicts that must be resolved before it can be
merged. @WoosukKwon please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


youkaichao

glad to see the integration working!


mergify bot added needs-rebase and removed needs-rebase labels Nov 5, 2024

mergify bot added needs-rebase and removed needs-rebase labels Nov 6, 2024

WoosukKwon changed the title ~~[WIP][V1] Integrate Piecewise CUDA graphs~~ [V1] Integrate Piecewise CUDA graphs Nov 6, 2024

WoosukKwon marked this pull request as ready for review November 6, 2024 02:26

WoosukKwon requested a review from youkaichao November 6, 2024 03:43

WoosukKwon force-pushed the v1-piecewise branch from e1414df to b593aa6 Compare November 6, 2024 03:51

youkaichao reviewed Nov 6, 2024

vllm/v1/attention/backends/flash_attn.py (outdated)

youkaichao reviewed Nov 6, 2024

vllm/v1/attention/backends/flash_attn.py (outdated)

youkaichao reviewed Nov 6, 2024

vllm/v1/worker/gpu_model_runner.py (outdated)

youkaichao reviewed Nov 6, 2024

vllm/v1/worker/gpu_model_runner.py (outdated)

WoosukKwon requested a review from youkaichao November 6, 2024 05:46

WoosukKwon added 2 commits November 5, 2024 21:48

Reset

e662e6d

Signed-off-by: Woosuk Kwon <[email protected]>

Address review

7801bde

Signed-off-by: Woosuk Kwon <[email protected]>

WoosukKwon force-pushed the v1-piecewise branch from c7d0b7a to 7801bde Compare November 6, 2024 05:48

youkaichao approved these changes Nov 6, 2024

WoosukKwon merged commit 4089985 into main Nov 6, 2024
14 of 22 checks passed

WoosukKwon deleted the v1-piecewise branch November 6, 2024 06:16

flaviabeo added a commit to flaviabeo/vllm that referenced this pull request Nov 6, 2024

Revert "[V1] Integrate Piecewise CUDA graphs (vllm-project#10058)"

55c206b

This reverts commit 4089985.

JC1DA pushed a commit to JC1DA/vllm that referenced this pull request Nov 11, 2024

[V1] Integrate Piecewise CUDA graphs (vllm-project#10058)

537c9a7

Signed-off-by: Woosuk Kwon <[email protected]> Signed-off-by: Loc Huynh <[email protected]>

sumitd2 pushed a commit to sumitd2/vllm that referenced this pull request Nov 14, 2024

[V1] Integrate Piecewise CUDA graphs (vllm-project#10058)

f619c6a

Signed-off-by: Woosuk Kwon <[email protected]> Signed-off-by: Sumit Dubey <[email protected]>
