
[Speculative decoding 6/9] Integrate speculative decoding with LLMEngine #3894

Merged: 120 commits into vllm-project:main, Apr 16, 2024

Conversation

@cadedaniel (Collaborator) commented Apr 7, 2024

This PR integrates the existing speculative decoding components with the LLMEngine. With this PR, one can run speculative decoding with a draft model / target model end-to-end. Note, however, that the output is still garbage: the e2e correctness work remains (PR 7/9 in the open sourcing plan).

Notes:

  • Testing: This PR relies on e2e tests with speculative decoding enabled. We verify that the number of emitted tokens is correct and that the detokenized string matches what HF transformers produces.
  • Testing: This PR includes some unit tests for the new detokenization/eos management in the LLMEngine.
  • LLMEngine refactor: Because it is complicated to combine beam search logic with lookahead scheduling, we introduce a new interface inside the LLMEngine that lets us process new token ids separately for beam search and for lookahead scheduling.
  • Spec decode support: Currently the e2e test only passes for TP=1 on GPU. Adding TP>1 support is trivial, but the work would be wasted since we need to modify the interfaces for TP>1 anyway; it is left out of this PR.
  • This PR modifies the WorkerBase interface slightly, so that all workers return List[SamplerOutput] from execute_model instead of a single SamplerOutput (a hedged sketch of the new contract follows this list).
  • This PR is stacked on top of [Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" #3837. A mirror PR has been created for easy viewing until that PR is merged.
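
For illustration, here is a minimal sketch of the modified execute_model contract mentioned in the notes above. SamplerOutput and SequenceGroupMetadata are stand-in placeholders for the real vLLM types, and WorkerSketch is a hypothetical simplification rather than the actual WorkerBase; treat this as a sketch of the idea, not the implementation.

from abc import ABC, abstractmethod
from typing import List


class SamplerOutput:
    """Placeholder standing in for vLLM's SamplerOutput."""


class SequenceGroupMetadata:
    """Placeholder standing in for vLLM's SequenceGroupMetadata."""


class WorkerSketch(ABC):
    """Hypothetical simplification of the worker contract (not the real WorkerBase)."""

    @abstractmethod
    def execute_model(
        self,
        seq_group_metadata_list: List[SequenceGroupMetadata],
        num_lookahead_slots: int = 0,
    ) -> List[SamplerOutput]:
        """Run one model step. A regular worker returns a single-element
        list; a speculative worker may produce one SamplerOutput per
        lookahead step, which is why the return type is a list."""
        raise NotImplementedError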

@LiuXiaoxuanPKU (Collaborator) left a comment

Thanks for the PR! Just some minor comments.

# Remove the parent sequence if it is not selected for next
# iteration
seq_group.remove(seq.seq_id)
self.scheduler.free_seq(seq)
@LiuXiaoxuanPKU (Collaborator) commented Apr 10, 2024:

Currently, the output processor owns the scheduler and calls the scheduler's functions while processing the output. This is a bit weird. Is it possible to move the scheduler part out of the output processor? Or is it too much work?

@cadedaniel (Collaborator, Author) replied:

Agree! Note this weirdness is not introduced in this PR; it already existed in the LLMEngine. If it had unit tests we could easily refactor it, but without good unit tests I'm afraid it's risky to refactor without a strong reason.
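
To make the coupling being discussed concrete, below is a rough illustrative sketch (not the actual vLLM code; the class and method names here are invented) of why an output processor ends up holding a scheduler: sequences dropped from the beam have to be freed through it.

class OutputProcessorSketch:
    """Invented example of the scheduler coupling discussed above."""

    def __init__(self, scheduler):
        # The dependency in question: the processor needs the scheduler
        # to release sequences it decides to drop.
        self.scheduler = scheduler

    def process_beam_outputs(self, seq_group, selected_seqs):
        for seq in list(seq_group.get_seqs()):
            if seq not in selected_seqs:
                # Remove the parent sequence if it is not selected for the
                # next iteration, then free its blocks via the scheduler.
                seq_group.remove(seq.seq_id)
                self.scheduler.free_seq(seq)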

"""

def __init__(self, scheduler, scheduler_config, get_tokenizer_for_seq):
    self.scheduler = scheduler
@LiuXiaoxuanPKU (Collaborator) commented:

  1. scheduler is not used in this class, so why do we need it here?
  2. We need to read max_model_len from the scheduler config; can we pass max_model_len directly?
    In general, we might want to avoid the dependency on scheduler/scheduler_config in this class.

@cadedaniel (Collaborator, Author) replied:

  1. Removed the scheduler dependency and added type annotations (good catch!)
  2. Passed max_model_len directly (a small before/after sketch follows).
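
For context, a minimal before/after sketch of that change, with illustrative class names (the real class in this PR may be named differently and have more parameters):

# Before: takes a scheduler it never uses and digs max_model_len out of
# the scheduler config.
class StopCheckerBefore:
    def __init__(self, scheduler, scheduler_config, get_tokenizer_for_seq):
        self.scheduler = scheduler  # unused
        self.max_model_len = scheduler_config.max_model_len
        self.get_tokenizer_for_seq = get_tokenizer_for_seq


# After: drop the scheduler dependency and pass max_model_len directly.
class StopCheckerAfter:
    def __init__(self, max_model_len: int, get_tokenizer_for_seq):
        self.max_model_len = max_model_len
        self.get_tokenizer_for_seq = get_tokenizer_for_seq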

# If no spec tokens, call the proposer and scorer workers normally.
# Used for prefill.
- if num_spec_tokens == 0 or len(seq_group_metadata_list) == 0:
+ if num_lookahead_slots == 0 or len(seq_group_metadata_list) == 0:
@LiuXiaoxuanPKU (Collaborator) commented:

When can len(seq_group_metadata_list) be 0?

@cadedaniel (Collaborator, Author) replied:

I don't have a use case in mind for it, but I want to maintain compatibility with the other workers; see the snippets below (and the short sketch after them):

cpu worker:

# If there is no input, we don't need to execute the model.
if num_seq_groups == 0:
    return {}

gpu worker (vllm/vllm/worker/worker.py, lines 231 to 233 in 92cd2e2):

# If there is no input, we don't need to execute the model.
if num_seq_groups == 0:
    return {}

neuron worker:

# If there is no input, we don't need to execute the model.
if num_seq_groups == 0:
    return {}
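
A small sketch of the same empty-input convention adapted to the new List[SamplerOutput] return type (simplified; not the actual SpecDecodeWorker code):

from typing import List


def execute_model_sketch(seq_group_metadata_list, num_lookahead_slots: int) -> List:
    """Illustration only: mirror the other workers' early return."""
    # If there is no input, we don't need to execute the model; return an
    # empty list to satisfy the List[SamplerOutput] contract.
    if len(seq_group_metadata_list) == 0:
        return []
    # ... otherwise run the proposer and scorer as usual (omitted) ...
    raise NotImplementedError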

@@ -0,0 +1,117 @@
from typing import Callable, Iterable, List
@LiuXiaoxuanPKU (Collaborator) commented:

Another question:
From my understanding, the multi_step processor is used for the target model, which is not a multi_step_worker, and the single_step processor is used for the draft model, which is a multi_step_worker.
Correct me if I'm wrong. If this is the case, can we have a better name for the multi_step output processor?

@cadedaniel (Collaborator, Author) replied:

The LLMEngine only interacts with the SpecDecodeWorker, which internally manages workers for the draft and target models. See this diagram from the meetup below:

[Diagram from the meetup: the LLMEngine talks to a single SpecDecodeWorker, which internally contains the draft-model and target-model workers.]

This means that from the LLMEngine's perspective, there is only one worker type, and only one output processor type at a time.
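
To illustrate that single-worker view, here is a much-simplified sketch; the real SpecDecodeWorker is more involved, and helper names like get_proposals and score_proposals are invented for this example.

from typing import List


class SpecDecodeWorkerSketch:
    """Simplified illustration: one engine-facing worker wrapping both models."""

    def __init__(self, proposer_worker, scorer_worker):
        # Draft model (proposer) and target model (scorer) both live behind
        # this object, so the LLMEngine only ever sees a single worker.
        self.proposer_worker = proposer_worker
        self.scorer_worker = scorer_worker

    def execute_model(self, seq_group_metadata_list,
                      num_lookahead_slots: int) -> List:
        if num_lookahead_slots == 0 or len(seq_group_metadata_list) == 0:
            # Prefill / no speculation: run the target model normally.
            return self.scorer_worker.execute_model(seq_group_metadata_list)
        # Decode with speculation: propose k draft tokens, then score them
        # with the target model and accept/reject (details omitted).
        proposals = self.proposer_worker.get_proposals(
            seq_group_metadata_list, num_lookahead_slots)
        return self.scorer_worker.score_proposals(
            seq_group_metadata_list, proposals)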

@cadedaniel cadedaniel enabled auto-merge (squash) April 11, 2024 01:35
@cadedaniel (Collaborator, Author) commented:

Update: it seems there is some conflict with #3884 and block manager v2 that only shows up in the spec decode e2e test added in this PR. Will investigate.

@rkooo567 (Collaborator) commented:

^ sorry for the inconvenience @cadedaniel!

@cadedaniel (Collaborator, Author) replied:

np, glad chunked prefill got in.

@leiwen83 (Contributor) commented Apr 11, 2024:

Hi @cadedaniel,

Great job enabling this feature!
I locally pulled your draft patch #3951 and tested it with the drafter and verifier using the same model. The correctness of the output is confirmed!

However, I noticed that in https://github.com/vllm-project/vllm/blob/main/vllm/spec_decode/multi_step_worker.py#L347 you append -1 to the proposal to pad it to num_lookahead_slots. Isn't that a bit wasteful for the verification stage, since the proposal may not always be num_lookahead_slots long? Is there room for improvement here?

Thx~

@cadedaniel (Collaborator, Author) replied:

> Hi @cadedaniel,
>
> Great job enabling this feature! I locally pulled your draft patch #3951 and tested it with the drafter and verifier using the same model. The correctness of the output is confirmed!

Thanks for testing it out.

> However, I noticed that in https://github.com/vllm-project/vllm/blob/main/vllm/spec_decode/multi_step_worker.py#L347 you append -1 to the proposal to pad it to num_lookahead_slots. Isn't that a bit wasteful for the verification stage, since the proposal may not always be num_lookahead_slots long? Is there room for improvement here?
>
> Thx~

Yes, it would be wasteful if we verified these padding token ids. We avoid this in two ways:

  • num_lookahead_slots is always k during speculation.
  • A proposal is either of length k or length 0; we either speculate k tokens ahead or do not speculate at all.

This means we never have any -1 proposal tokens to verify.

In the future, we can support per-sequence proposal lengths.
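
For readers following this thread, a small sketch of the all-or-nothing invariant described above (illustrative only; the real logic lives in vllm/spec_decode/multi_step_worker.py and operates on tensors rather than Python lists):

from typing import List


def pad_proposal_sketch(proposal_token_ids: List[int], k: int) -> List[int]:
    """A proposal is either length k or length 0. Empty proposals are padded
    with -1 so batched tensors stay rectangular; because no sequence has a
    partial proposal, the -1 entries are never verified."""
    if len(proposal_token_ids) == 0:
        # Not speculating for this sequence: fill the k slots with -1.
        return [-1] * k
    assert len(proposal_token_ids) == k, "proposals are all-or-nothing"
    return proposal_token_ids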

@cadedaniel cadedaniel merged commit e95cd87 into vllm-project:main Apr 16, 2024
46 checks passed
z103cb pushed a commit to z103cb/opendatahub_vllm that referenced this pull request Apr 22, 2024
Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024