[Speculative decoding] Add ngram prompt lookup decoding #4237
Conversation
Implementation of the ngram prompt lookup feature mentioned in #2188.
@cadedaniel
Thanks for the PR! I'll review it
Another batch of comments. Also we would need unit tests for this PR.
I currently have one e2e unit test in tests/spec_decode/e2e/test_correctness.py. Do you mean we may need some other unit tests?
@cadedaniel @comaniac,
Sorry, let me make it clearer. The unit tests should cover as many cases as possible: for example, batch size > 1 where some seqs find a match but others don't; n-gram sizes from 1 to 3; long/short speculative sizes, etc.
You're free to use whatever strategy you prefer to resolve the merge conflicts. I often merge main into my dev branch, but there are tradeoffs. I'd keep this PR open; you can push/force-push as necessary! And sorry for the conflicts -- PR 7/9 was the last large one to the subsystem, so I expect fewer merge conflicts going forward.
+1. We'll want the following tests:
Most of these can be copied from the existing tests and refitted for ngram speculation; I suggest making a new file for the ngram spec correctness tests (so the current one can stay draft-model only), as sketched below.
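
For illustration, the new file might start like this -- a hedged sketch only: the model name, fixtures, and the run_greedy_equality_correctness_test helper are assumed to mirror the existing draft-model test file:

# tests/spec_decode/e2e/test_ngram_correctness.py (hypothetical sketch)
import pytest

@pytest.mark.parametrize(
    "common_llm_kwargs",
    [{
        "model": "JackFram/llama-68m",  # placeholder tiny model
        "speculative_model": "[ngram]",
        "num_speculative_tokens": 5,
        "ngram_prompt_lookup_max": 3,  # vary 1..3 across cases
    }])
@pytest.mark.parametrize("batch_size", [1, 64])  # bs>1 mixes hit/miss seqs
@pytest.mark.parametrize("seed", [1])
def test_ngram_e2e_greedy_correctness(baseline_llm_generator,
                                      test_llm_generator, batch_size):
    # Assumed shared helper that compares greedy outputs token-by-token.
    run_greedy_equality_correctness_test(baseline_llm_generator,
                                         test_llm_generator,
                                         batch_size,
                                         max_output_len=32,
                                         force_output_len=True)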
Force-pushed from 44a1530 to e870757.
@cadedaniel @comaniac Thanks!
Otherwise LGTM (the n-gram worker). I'll leave the rest to @cadedaniel
will take another pass today
Just minor comments, otherwise LGTM! Can you address/respond and then we'll merge.
@@ -0,0 +1,241 @@
"""The tests in this file verify end-to-end speculative decoding correctness. |
Remove this note
])
@pytest.mark.parametrize("batch_size", [1, 64])
@pytest.mark.parametrize("seed", [1])
def test_spec_decode_e2e_greedy_correctness_tiny_model(baseline_llm_generator,
rename test_ngram_e2e_greedy_correctness
])
@pytest.mark.parametrize("batch_size", [4])
@pytest.mark.parametrize("seed", [1])
def test_spec_decode_e2e_greedy_correctness_with_preemption(
test_ngram_e2e_greedy_correctness_with_preemption
"""Verify greedy equality on a tiny model with batch size of one. | ||
|
||
Since this test is cheaper than other e2e correctness tests, we generate | ||
with a higher output_len. | ||
""" |
Update comment -- this does more than bs=1
    {
        "speculative_model": "[ngram]",
        "num_speculative_tokens": k,
        "ngram_prompt_lookup_max": 3,
    }
    # Try a range of common k, as well as large speculation.
    for k in [1, 3, 5, 7, 10, 63]
] + [
    {
        "speculative_model": "[ngram]",
        "num_speculative_tokens": k,
        "ngram_prompt_lookup_max": 1,
    }
    # Try a range of common k, as well as large speculation.
    for k in [1, 3, 5, 7, 10, 63]
To improve test time, we can reduce the space covered -- I suggest k=[1, 3, 5] x ngram_prompt_lookup_max=[1, 3].
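
That suggestion could be written as a single cross product (a sketch reusing the config keys from the snippet above):

[
    {
        "speculative_model": "[ngram]",
        "num_speculative_tokens": k,
        "ngram_prompt_lookup_max": n,
    }
    # 6 configs instead of 12: common k values crossed with both windows.
    for k in [1, 3, 5]
    for n in [1, 3]
]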
vllm/engine/arg_utils.py
parser.add_argument(
    '--ngram-prompt-lookup-max',
    type=int,
    default=None,
nit: default=EngineArgs.ngram_prompt_lookup_max
vllm/engine/arg_utils.py
parser.add_argument(
    '--ngram-prompt-lookup-min',
    type=int,
    default=None,
nit: default=EngineArgs.ngram_prompt_lookup_min
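
Both nits follow the same pattern; a sketch of the suggested form, assuming EngineArgs is a dataclass whose class attributes carry the defaults (help text is illustrative):

parser.add_argument(
    '--ngram-prompt-lookup-max',
    type=int,
    default=EngineArgs.ngram_prompt_lookup_max,
    help='Max window size for ngram prompt lookup in speculative decoding.')
parser.add_argument(
    '--ngram-prompt-lookup-min',
    type=int,
    default=EngineArgs.ngram_prompt_lookup_min,
    help='Min window size for ngram prompt lookup in speculative decoding.')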
draft_model_config = target_model_config
draft_parallel_config = target_parallel_config
Can you add a TODO here to set these to None?
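
Something like the following would capture it (comment wording is illustrative):

# TODO: the ngram case loads no separate draft model; set these to None
# once downstream code tolerates a missing draft model/parallel config.
draft_model_config = target_model_config
draft_parallel_config = target_parallel_config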
) -> Tuple[List[SamplerOutput], bool]:
    """Run the model forward pass sample_len times. Returns the list of
Add docs on the new return value?
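
A possible wording (a sketch; the meaning of the bool is inferred from the sampler_transposed parameter that appears below, so treat it as an assumption):

    """Run the model forward pass sample_len times.

    Returns:
        A tuple (sampler_output_list, transposed): one SamplerOutput per
        step, plus a bool indicating whether the outputs are laid out
        per-sequence rather than per-step.
    """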
-    sampler_output_list: List[SamplerOutput],
-) -> Tuple[torch.Tensor, torch.Tensor]:
+    sampler_output_list: List[SamplerOutput],
+    sampler_transposed: bool) -> Tuple[torch.Tensor, torch.Tensor]:
     """Utility function which converts a list of SamplerOutput to tensors.
Add a docstring with new arg
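
For example (a sketch; the description of sampler_transposed is an assumption based on its name):

    """Utility function which converts a list of SamplerOutput to tensors.

    Args:
        sampler_output_list: One SamplerOutput per speculative step.
        sampler_transposed: Whether the outputs are already transposed,
            i.e. laid out per-sequence rather than per-step.
    """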
Algorithm details can be found in this blog post: https://huggingface.co/blog/assisted-generation. The code directly follows transformers' current implementation (huggingface/transformers#27775). Since we get the draft directly from the prompt, there is no need for another model or a modified model to produce the proposal, making this the most convenient way to enjoy the speedup of speculation.
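
For readers unfamiliar with the technique, here is a minimal self-contained sketch of prompt lookup decoding -- not this PR's actual code; the function name and the greedy match-from-the-right policy are illustrative:

from typing import List, Optional


def ngram_prompt_lookup(token_ids: List[int],
                        ngram_max: int,
                        ngram_min: int,
                        num_speculative_tokens: int) -> Optional[List[int]]:
    """Propose draft tokens by matching the most recent n-gram against
    earlier context and copying the tokens that followed the match."""
    for n in range(ngram_max, ngram_min - 1, -1):  # prefer longer n-grams
        if len(token_ids) <= n:
            continue
        suffix = token_ids[-n:]
        # Scan backwards (most recent match first), excluding the suffix
        # itself, for an earlier occurrence of the same n-gram.
        for start in range(len(token_ids) - n - 1, -1, -1):
            if token_ids[start:start + n] == suffix:
                draft = token_ids[start + n:start + n +
                                  num_speculative_tokens]
                if draft:
                    return draft
    return None  # No match: skip speculation for this step.


# Example: the n-gram [1, 2] last occurred at position 0, followed by [3, 4].
assert ngram_prompt_lookup([1, 2, 3, 4, 1, 2], 2, 1, 2) == [3, 4]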
Force-pushed from 748c687 to 8adaf38.
@cadedaniel all notes have been addressed, and the branch has been rebased against the latest code. Could you take another look?
Looks good, thanks @leiwen83! Thanks for contributing to vLLM 😃
…#4237) Co-authored-by: Lei Wen <[email protected]>