
[Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step #6338

Merged (32 commits) on Jul 17, 2024

Conversation

alexm-redhat (Collaborator) commented Jul 11, 2024:

This PR moves the implementation of prepare_inputs(..) for advance_step(..) (inside draft_model_runner.py) to the GPU. It adds a new GPU kernel that modifies the prepare_inputs tensors directly on the GPU, and it also introduces some improvements to the sampler inside draft_model_runner to reduce unnecessary GPU<=>CPU transfer overheads. Here is a quick performance check on an A100 GPU: for 50th-percentile latency, current main gets 500ms; with the sampler improvements of this PR it gets to 477ms; and with the sampler improvements plus prepare_inputs lowered to the GPU it gets to 329ms, a 65% improvement vs. main.
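As a rough illustration of the idea (not the actual CUDA kernel added in this PR), advancing the inputs in place on the GPU can be sketched at the PyTorch level: the tokens sampled in the previous step are written back into the existing device tensors, and the positions, sequence lengths, and slot mapping are recomputed on device, avoiding a CPU round-trip on every speculative step. The tensor names mirror those discussed in the PR; the block size and function signature are assumptions made for the sketch.

```python
import torch

BLOCK_SIZE = 16  # assumed paged-KV block size, for illustration only

def advance_step_sketch(input_tokens: torch.Tensor,      # [num_seqs], on GPU
                        input_positions: torch.Tensor,   # [num_seqs], on GPU
                        seq_lens: torch.Tensor,          # [num_seqs], on GPU
                        slot_mapping: torch.Tensor,      # [num_seqs], on GPU
                        block_tables: torch.Tensor,      # [num_seqs, max_blocks], on GPU
                        sampled_token_ids: torch.Tensor  # [num_seqs], on GPU
                        ) -> None:
    """Advance one speculative decode step without any device-to-host copies."""
    # The tokens sampled in the previous step become the next step's inputs.
    input_tokens.copy_(sampled_token_ids)
    # The new token's position equals the old sequence length; each sequence grows by one.
    input_positions.copy_(seq_lens)
    seq_lens.add_(1)
    # Recompute the KV-cache slot of the new position from the block table.
    block_idx = (input_positions // BLOCK_SIZE).long()
    block_off = input_positions % BLOCK_SIZE
    block_num = block_tables.gather(1, block_idx.unsqueeze(1)).squeeze(1)
    slot_mapping.copy_(block_num * BLOCK_SIZE + block_off)
```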

TODOs

  1. Add support for more attn backends
  2. Clean up debug output and code

draft_model_runner: new - prepare_inputs on GPU + sampler improvements

python benchmark_latency.py --model JackFram/llama-160m --input-len 256 --output-len 128 --batch-size 1 --use-v2-block-manager --speculative-model JackFram/llama-68m --num-speculative-tokens 5 --num-iters-warmup 5 --num-iters 10
  
10% percentile latency: 0.3126291814725846 seconds
25% percentile latency: 0.32072093698661774 seconds
50% percentile latency: 0.32989367097616196 seconds
75% percentile latency: 0.38147771276999265 seconds
90% percentile latency: 0.40094744893722234 seconds
99% percentile latency: 0.44348421014379713 seconds

draft_model_runner: new - prepare_inputs on CPU + sampler improvements

python benchmark_latency.py --model JackFram/llama-160m --input-len 256 --output-len 128 --batch-size 1 --use-v2-block-manager --speculative-model JackFram/llama-68m --num-speculative-tokens 5 --num-iters-warmup 5 --num-iters 10
10% percentile latency: 0.46580099742859604 seconds
25% percentile latency: 0.4689853365998715 seconds
50% percentile latency: 0.47728427220135927 seconds
75% percentile latency: 0.49721202708315104 seconds
90% percentile latency: 0.5107997851446271 seconds
99% percentile latency: 0.5294410690665246 seconds

draft_model_runner: main - prepare_inputs on CPU

python benchmark_latency.py --model JackFram/llama-160m --input-len 256 --output-len 128 --batch-size 1 --use-v2-block-manager --speculative-model JackFram/llama-68m --num-speculative-tokens 5 --num-iters-warmup 5 --num-iters 10

10% percentile latency: 0.4636695141904056 seconds
25% percentile latency: 0.4841108355903998 seconds
50% percentile latency: 0.5005039046518505 seconds
75% percentile latency: 0.5458661855664104 seconds
90% percentile latency: 0.5690251242369413 seconds
99% percentile latency: 0.5881028406135738 seconds
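For context on how the summaries above are produced: they are plain percentiles over the per-iteration end-to-end latencies. A minimal reproduction of that summary from raw timings (assuming benchmark_latency.py aggregates them this way; the sample values below are made up) looks like:

```python
import numpy as np

# Hypothetical per-iteration latencies in seconds, one per --num-iters run.
latencies = np.array([0.33, 0.31, 0.35, 0.32, 0.40, 0.44, 0.38, 0.32, 0.31, 0.33])

for p in (10, 25, 50, 75, 90, 99):
    print(f"{p}% percentile latency: {np.percentile(latencies, p)} seconds")
```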

alexm-redhat force-pushed the prepare_inputs_on_gpu branch from 9711114 to bfa1647 on July 11, 2024 15:20
Review threads: csrc/prepare_inputs/advance_step.cu, vllm/model_executor/layers/sampler.py, vllm/model_executor/sampling_metadata.py, vllm/spec_decode/draft_model_runner.py (several threads).
alexm-redhat (Collaborator, Author) commented Jul 12, 2024:

@comaniac addressed review comments. Looks much better now.

comaniac (Collaborator) left a review comment:

It's much better now!
Left some minor comments. The major question I have is whether it's beneficial to also include attention metadata updating in the CUDA kernel.

Review threads: vllm/spec_decode/multi_step_worker.py, vllm/spec_decode/draft_model_runner.py (several threads).
alexm-redhat (Collaborator, Author) commented:

@comaniac addressed the second set of review comments, ready for final pass. Now working on getting all tests green.

comaniac (Collaborator) left a review comment:

LGTM. Thanks! Last batch of minor comments. If possible, it'd be great to have some e2e benchmarks (I could help with that if needed; just let me know).

I'll leave it to @cadedaniel to take a pass.

Review threads: vllm/_custom_ops.py, vllm/model_executor/layers/sampler.py, vllm/model_executor/sampling_metadata.py, vllm/spec_decode/draft_model_runner.py, vllm/worker/model_runner.py.
cadedaniel (Collaborator) commented:

I will take a look at this tomorrow. One thing that needs validation is that the draft acceptance rate is still 100% when using the same draft and target model. I ran this PR locally and got <100%, which indicates some correctness issue.
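A hedged sketch of that validation idea (not the exact test later added in this PR): measuring the acceptance rate directly goes through vLLM's speculative-decoding metrics, but a simple proxy using only the public API is to run greedy decoding with the draft model set to the same checkpoint as the target and check that the outputs match a non-speculative run. The kwargs mirror the CLI flags used in the benchmarks above and should be treated as assumptions against the vLLM version in this PR.

```python
from vllm import LLM, SamplingParams

prompts = ["The capital of France is"]
params = SamplingParams(temperature=0.0, max_tokens=64)  # greedy, so outputs are deterministic

# Baseline: plain decoding with the small model.
baseline = LLM(model="JackFram/llama-68m")
ref = [o.outputs[0].text for o in baseline.generate(prompts, params)]

# Speculative decoding with draft == target: every proposal should be accepted,
# so greedy outputs must be identical to the baseline.
spec = LLM(
    model="JackFram/llama-68m",
    speculative_model="JackFram/llama-68m",
    num_speculative_tokens=5,
    use_v2_block_manager=True,
)
out = [o.outputs[0].text for o in spec.generate(prompts, params)]

assert out == ref, "speculative decoding changed greedy outputs"
```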

comaniac (Collaborator) commented:

That's a good way to evaluate. We should add this test to the CI in this PR.

alexm-redhat (Collaborator, Author) commented:

@cadedaniel @comaniac addressed the acceptance-rate issue; it is now 100% for both the GPU and non-GPU prepare_inputs paths.

alexm-redhat (Collaborator, Author) commented:

Addressing other review comments now

alexm-redhat (Collaborator, Author) commented:

Addressed review comments

cadedaniel (Collaborator) left a review comment:

Looks good; all comments are nits/docs. The missing piece is a testing story: can we guarantee that this codepath runs in CI, and fail a test if it doesn't? E.g., in the test for this codepath, assert that the GPU backend is used at least once. This is important because otherwise we could silently lose coverage of this codepath if there is a dependency change in CI.

It would also be great to have an explicit case for when the GPU path is disabled, so both paths are tested regardless of CI conditions.

If these are too hard, let's add a big docstring about it and revisit.
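One way to make that guarantee concrete is sketched below (illustrative only, not the mock test that was eventually added): wrap the GPU advance-step entry point with a spy so behavior is unchanged while calls are counted, then assert it was hit. The module/function name (vllm._custom_ops.advance_step) and the workload helper are assumptions.

```python
from unittest import mock

import vllm._custom_ops as custom_ops  # entry point name is an assumption


def run_short_spec_decode_job():
    """Placeholder: run a small speculative-decoding workload, e.g. the
    benchmark_latency.py command above with --num-iters 1."""
    raise NotImplementedError


def test_gpu_advance_step_is_exercised():
    # Wrap the real call so behavior is unchanged while invocations are counted.
    with mock.patch.object(custom_ops, "advance_step",
                           wraps=custom_ops.advance_step) as spy:
        run_short_spec_decode_job()
    assert spy.call_count > 0, "GPU prepare_inputs path was never exercised in CI"
```

The disabled-path case could mirror this sketch with the fallback forced on and an assertion that the spy was never called.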

Review threads: csrc/prepare_inputs/advance_step.cu, vllm/_custom_ops.py, vllm/model_executor/layers/sampler.py.
Comment on lines +96 to +97
# Update query lengths. Note that we update only queries and not seqs,
# since tensors may be padded due to captured cuda graph batch size
Review comment (Collaborator):
FYI @LiuXiaoxuanPKU, want to make sure you're aware of this for the CUDA graph <> MQA integration.
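A toy illustration of the padding concern behind the quoted comment (all sizes made up): with captured CUDA graphs, the per-step tensors are sized to the captured graph batch size, which can be larger than the number of live queries, so only the first num_queries entries may be advanced while the padded tail is left alone.

```python
import torch

num_queries = 3        # live sequences this step (assumed)
graph_batch_size = 8   # hypothetical captured CUDA graph batch size

query_lens = [1] * num_queries                               # exactly num_queries entries
seq_lens = torch.zeros(graph_batch_size, dtype=torch.int32)  # padded to the graph batch size

# Advance only the live entries; the padded tail must stay untouched.
seq_lens[:num_queries] += 1
```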

Review threads: vllm/spec_decode/draft_model_runner.py (several threads).
comaniac added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Jul 17, 2024
alexm-redhat (Collaborator, Author) commented:

Added a test and fixed the fallback

Review threads: tests/spec_decode/e2e/conftest.py, vllm/spec_decode/spec_decode_worker.py.
comaniac (Collaborator) commented:

Btw, can you rebase or merge the latest main to make sure everything is up to date?

alexm-redhat force-pushed the prepare_inputs_on_gpu branch from bb9c4d8 to e5f4265 on July 17, 2024 18:49
alexm-redhat (Collaborator, Author) commented:

Rebased.

alexm-redhat (Collaborator, Author) commented:

Added Cody's mock test

cadedaniel (Collaborator) left a review comment:

Great contribution @alexm-neuralmagic! Thanks also to @comaniac for all the help.

Review thread: vllm/spec_decode/draft_model_runner.py.
comaniac (Collaborator) commented:

I'll merge this first and make a small patch for the variable naming.
Thanks @alexm-neuralmagic @cadedaniel!

comaniac merged commit e76466d into vllm-project:main on Jul 17, 2024 (72 checks passed).
fialhocoelho pushed a commit to opendatahub-io/vllm that referenced this pull request Jul 19, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jul 24, 2024
gnpinkert pushed a commit to gnpinkert/vllm that referenced this pull request Jul 26, 2024
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024