
[Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step #6338

Merged (32 commits) on Jul 17, 2024

Conversation

alexm-redhat (Collaborator) commented Jul 11, 2024:

This PR moves the implementation of prepare_inputs(..) for advance_step(..) (inside draft_model_runner.py) to the GPU. It adds a new GPU kernel that modifies the prepare_inputs tensors directly on the GPU, and it also introduces some improvements to the sampler inside draft_model_runner to reduce unnecessary GPU<=>CPU transfer overheads. Here is a quick performance check on an A100 GPU: for 50th-percentile latency, current main gets 500ms; with the sampler improvements of this PR it gets to 477ms; and with the sampler improvements plus prepare_inputs lowered to the GPU it gets to 329ms, a 65% improvement vs. main.
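As a rough illustration of the idea (not the actual CUDA kernel added in this PR), advancing the inputs in place on the GPU can be sketched at the PyTorch level: the tokens sampled in the previous step are written back into the existing device tensors, and the positions, sequence lengths, and slot mapping are recomputed on device, avoiding a CPU round-trip on every speculative step. The tensor names mirror those discussed in the PR; the block size and function signature are assumptions made for the sketch.

```python
import torch

BLOCK_SIZE = 16  # assumed paged-KV block size, for illustration only

def advance_step_sketch(input_tokens: torch.Tensor,      # [num_seqs], on GPU
                        input_positions: torch.Tensor,   # [num_seqs], on GPU
                        seq_lens: torch.Tensor,          # [num_seqs], on GPU
                        slot_mapping: torch.Tensor,      # [num_seqs], on GPU
                        block_tables: torch.Tensor,      # [num_seqs, max_blocks], on GPU
                        sampled_token_ids: torch.Tensor  # [num_seqs], on GPU
                        ) -> None:
    """Advance one speculative decode step without any device-to-host copies."""
    # The tokens sampled in the previous step become the next step's inputs.
    input_tokens.copy_(sampled_token_ids)
    # The new token's position equals the old sequence length; each sequence grows by one.
    input_positions.copy_(seq_lens)
    seq_lens.add_(1)
    # Recompute the KV-cache slot of the new position from the block table.
    block_idx = (input_positions // BLOCK_SIZE).long()
    block_off = input_positions % BLOCK_SIZE
    block_num = block_tables.gather(1, block_idx.unsqueeze(1)).squeeze(1)
    slot_mapping.copy_(block_num * BLOCK_SIZE + block_off)
```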

TODOs

  1. Add support for more attn backends
  2. Clean up debug output and code

draft_model_runner: new - prepare_inputs on GPU + sampler improvements

python benchmark_latency.py --model JackFram/llama-160m --input-len 256 --output-len 128 --batch-size 1 --use-v2-block-manager --speculative-model JackFram/llama-68m --num-speculative-tokens 5 --num-iters-warmup 5 --num-iters 10
  
10% percentile latency: 0.3126291814725846 seconds
25% percentile latency: 0.32072093698661774 seconds
50% percentile latency: 0.32989367097616196 seconds
75% percentile latency: 0.38147771276999265 seconds
90% percentile latency: 0.40094744893722234 seconds
99% percentile latency: 0.44348421014379713 seconds

draft_model_runner: new - prepare_inputs on CPU + sampler improvements

python benchmark_latency.py --model JackFram/llama-160m --input-len 256 --output-len 128 --batch-size 1 --use-v2-block-manager --speculative-model JackFram/llama-68m --num-speculative-tokens 5 --num-iters-warmup 5 --num-iters 10
10% percentile latency: 0.46580099742859604 seconds
25% percentile latency: 0.4689853365998715 seconds
50% percentile latency: 0.47728427220135927 seconds
75% percentile latency: 0.49721202708315104 seconds
90% percentile latency: 0.5107997851446271 seconds
99% percentile latency: 0.5294410690665246 seconds

draft_model_runner: main - prepare_inputs on CPU

python benchmark_latency.py --model JackFram/llama-160m --input-len 256 --output-len 128 --batch-size 1 --use-v2-block-manager --speculative-model JackFram/llama-68m --num-speculative-tokens 5 --num-iters-warmup 5 --num-iters 10

10% percentile latency: 0.4636695141904056 seconds
25% percentile latency: 0.4841108355903998 seconds
50% percentile latency: 0.5005039046518505 seconds
75% percentile latency: 0.5458661855664104 seconds
90% percentile latency: 0.5690251242369413 seconds
99% percentile latency: 0.5881028406135738 seconds
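For context on how the summaries above are produced: they are plain percentiles over the per-iteration end-to-end latencies. A minimal reproduction of that summary from raw timings (assuming benchmark_latency.py aggregates them this way; the sample values below are made up) looks like:

```python
import numpy as np

# Hypothetical per-iteration latencies in seconds, one per --num-iters run.
latencies = np.array([0.33, 0.31, 0.35, 0.32, 0.40, 0.44, 0.38, 0.32, 0.31, 0.33])

for p in (10, 25, 50, 75, 90, 99):
    print(f"{p}% percentile latency: {np.percentile(latencies, p)} seconds")
```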

alexm-redhat force-pushed the prepare_inputs_on_gpu branch from 9711114 to bfa1647 on July 11, 2024 15:20
Review threads: csrc/prepare_inputs/advance_step.cu, vllm/model_executor/layers/sampler.py, vllm/model_executor/sampling_metadata.py, vllm/spec_decode/draft_model_runner.py (several threads).
alexm-redhat (Collaborator, Author) commented Jul 12, 2024:

@comaniac addressed review comments. Looks much better now.

comaniac (Collaborator) left a review comment:

It's much better now!
Left some minor comments. The major question I have is whether it's beneficial to also include attention metadata updating in the CUDA kernel.

Review threads: vllm/spec_decode/multi_step_worker.py, vllm/spec_decode/draft_model_runner.py (several threads).
alexm-redhat (Collaborator, Author) commented:

@comaniac addressed the second set of review comments, ready for final pass. Now working on getting all tests green.

comaniac (Collaborator) left a review comment:

LGTM. Thanks! Last batch of minor comments. If possible, it'd be great to have some e2e benchmarks (I could help with that if needed; just let me know).

I'll leave it to @cadedaniel to take a pass.

Review threads: vllm/_custom_ops.py, vllm/model_executor/layers/sampler.py, vllm/model_executor/sampling_metadata.py, vllm/spec_decode/draft_model_runner.py, vllm/worker/model_runner.py.
cadedaniel (Collaborator) commented:

I will take a look at this tomorrow. One thing that needs validation is that the draft acceptance rate is still 100% when using the same draft and target model. I ran this PR locally and got <100%, which indicates some correctness issue.
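A hedged sketch of that validation idea (not the exact test later added in this PR): measuring the acceptance rate directly goes through vLLM's speculative-decoding metrics, but a simple proxy using only the public API is to run greedy decoding with the draft model set to the same checkpoint as the target and check that the outputs match a non-speculative run. The kwargs mirror the CLI flags used in the benchmarks above and should be treated as assumptions against the vLLM version in this PR.

```python
from vllm import LLM, SamplingParams

prompts = ["The capital of France is"]
params = SamplingParams(temperature=0.0, max_tokens=64)  # greedy, so outputs are deterministic

# Baseline: plain decoding with the small model.
baseline = LLM(model="JackFram/llama-68m")
ref = [o.outputs[0].text for o in baseline.generate(prompts, params)]

# Speculative decoding with draft == target: every proposal should be accepted,
# so greedy outputs must be identical to the baseline.
spec = LLM(
    model="JackFram/llama-68m",
    speculative_model="JackFram/llama-68m",
    num_speculative_tokens=5,
    use_v2_block_manager=True,
)
out = [o.outputs[0].text for o in spec.generate(prompts, params)]

assert out == ref, "speculative decoding changed greedy outputs"
```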

comaniac (Collaborator) commented:

That's a good way to evaluate. We should add this test to the CI in this PR.

alexm-redhat (Collaborator, Author) commented:

@cadedaniel @comaniac addressed the acceptance-rate issue; it is now 100% for both the GPU and non-GPU prepare_inputs paths.

alexm-redhat (Collaborator, Author) commented:

Addressing other review comments now

alexm-redhat (Collaborator, Author) commented:

Addressed review comments

cadedaniel (Collaborator) left a review comment:

Looks good; all comments are nits/docs. The missing piece is a testing story: can we guarantee that this codepath runs in CI, and fail a test if it doesn't? E.g., in the test for this codepath, assert that the GPU backend is used at least once. This is important because otherwise we could silently lose coverage of this codepath if there is a dependency change in CI.

It would also be great to have an explicit case for when the GPU path is disabled, so both paths are tested regardless of CI conditions.

If these are too hard, let's add a big docstring about it and revisit.
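One way to make that guarantee concrete is sketched below (illustrative only, not the mock test that was eventually added): wrap the GPU advance-step entry point with a spy so behavior is unchanged while calls are counted, then assert it was hit. The module/function name (vllm._custom_ops.advance_step) and the workload helper are assumptions.

```python
from unittest import mock

import vllm._custom_ops as custom_ops  # entry point name is an assumption


def run_short_spec_decode_job():
    """Placeholder: run a small speculative-decoding workload, e.g. the
    benchmark_latency.py command above with --num-iters 1."""
    raise NotImplementedError


def test_gpu_advance_step_is_exercised():
    # Wrap the real call so behavior is unchanged while invocations are counted.
    with mock.patch.object(custom_ops, "advance_step",
                           wraps=custom_ops.advance_step) as spy:
        run_short_spec_decode_job()
    assert spy.call_count > 0, "GPU prepare_inputs path was never exercised in CI"
```

The disabled-path case could mirror this sketch with the fallback forced on and an assertion that the spy was never called.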

Review threads: csrc/prepare_inputs/advance_step.cu, vllm/_custom_ops.py, vllm/model_executor/layers/sampler.py.
Comment on lines +96 to +97
# Update query lengths. Note that we update only queries and not seqs,
# since tensors may be padded due to captured cuda graph batch size
Review comment (Collaborator):
FYI @LiuXiaoxuanPKU, want to make sure you're aware of this for the CUDA graph <> MQA integration.
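A toy illustration of the padding concern behind the quoted comment (all sizes made up): with captured CUDA graphs, the per-step tensors are sized to the captured graph batch size, which can be larger than the number of live queries, so only the first num_queries entries may be advanced while the padded tail is left alone.

```python
import torch

num_queries = 3        # live sequences this step (assumed)
graph_batch_size = 8   # hypothetical captured CUDA graph batch size

query_lens = [1] * num_queries                               # exactly num_queries entries
seq_lens = torch.zeros(graph_batch_size, dtype=torch.int32)  # padded to the graph batch size

# Advance only the live entries; the padded tail must stay untouched.
seq_lens[:num_queries] += 1
```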

Review threads: vllm/spec_decode/draft_model_runner.py (several threads).
comaniac added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Jul 17, 2024
alexm-redhat (Collaborator, Author) commented:

Added a test and fixed the fallback

Review threads: tests/spec_decode/e2e/conftest.py, vllm/spec_decode/spec_decode_worker.py.
comaniac (Collaborator) commented:

Btw, can you rebase or merge the latest main to make sure everything is up to date?

alexm-redhat force-pushed the prepare_inputs_on_gpu branch from bb9c4d8 to e5f4265 on July 17, 2024 18:49
alexm-redhat (Collaborator, Author) commented:

Rebased.

alexm-redhat (Collaborator, Author) commented:

Added Cody's mock test

cadedaniel (Collaborator) left a review comment:

Great contribution @alexm-neuralmagic! Thanks also to @comaniac for all the help.

Review thread: vllm/spec_decode/draft_model_runner.py.
comaniac (Collaborator) commented:

I'll merge this first and make a small patch for the variable naming.
Thanks @alexm-neuralmagic @cadedaniel!

comaniac merged commit e76466d into vllm-project:main on Jul 17, 2024 (72 checks passed).
fialhocoelho pushed a commit to opendatahub-io/vllm that referenced this pull request Jul 19, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jul 24, 2024
gnpinkert pushed a commit to gnpinkert/vllm that referenced this pull request Jul 26, 2024
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024