[V1] Support VLMs with fine-grained scheduling #9871
Conversation
The new code looks great, and the performance should be better as well. Some nit comments below.
    continue
if not self.encoder_cache_manager.can_allocate(request, i):
    # Cannot schedule because the encoder cache is full.
    num_new_tokens = start_pos - num_computed_tokens
What's the meaning of this `num_new_tokens` update here? For example, `start_pos` can be < `num_computed_tokens`, in which case the result could be negative?
I don't think it's possible to have `start_pos < num_computed_tokens` here: `num_computed_tokens` counts tokens that have already been processed, so if there were an image with `start_pos < num_computed_tokens`, it would already have been processed in a previous iteration (either stored in the KV cache or cached in the encoder cache).

If I understand correctly, the point of this update is that if we cannot run the encoder here, we want to stop exactly before the first encoder position and run decoder-only processing for the current iteration. However, I think it is possible to have `start_pos == num_computed_tokens` for a running request? (e.g., the first image token in a placeholder is exactly the first scheduled token, but the cache cannot allocate.)
It's possible when prefix caching is enabled (though we currently don't support prefix caching for VLMs).
> we want to stop exactly before the first encoder position and run decoder-only processing for the current iteration.

Exactly.
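For readers following the thread, a minimal sketch of the truncation logic under discussion (a paraphrase, not the exact PR code; `mm_positions` and the `can_schedule` callback stand in for the scheduler's real state and the encoder cache/budget check):

```python
from typing import Callable, List, Tuple


def truncate_before_unschedulable_encoder_input(
    mm_positions: List[Tuple[int, int]],  # (start_pos, num_encoder_tokens) per input
    num_computed_tokens: int,             # tokens already processed for this request
    num_new_tokens: int,                  # tokens tentatively scheduled in this step
    can_schedule: Callable[[int], bool],  # encoder cache/budget check for input i
) -> int:
    """Shrink num_new_tokens so the step stops right before the first
    encoder input that cannot be scheduled (decoder-only step)."""
    for i, (start_pos, num_encoder_tokens) in enumerate(mm_positions):
        if start_pos >= num_computed_tokens + num_new_tokens:
            # This input lies beyond the tokens scheduled in this step.
            break
        if start_pos + num_encoder_tokens <= num_computed_tokens:
            # Already fully processed in a previous step.
            continue
        if not can_schedule(i):
            # Stop right before the placeholder. Note this becomes 0 when
            # start_pos == num_computed_tokens (the case debated above).
            num_new_tokens = start_pos - num_computed_tokens
            break
    return num_new_tokens


# e.g. one image at position 40 with an exhausted encoder budget:
assert truncate_before_unschedulable_encoder_input([(40, 576)], 0, 100,
                                                   lambda i: False) == 40
```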
    request.num_tokens)

# Encoder-related.
if encoder_inputs_to_schedule:
Seems like code duplication with the running case. Maybe the duplication can be avoided somehow.
+1
The code was simplified a bit. I found it difficult to refactor further, since it's only 5 lines of code and it involves updating local variables like `scheduled_encoder_inputs` and `encoder_budget`. The code looks OK to me. WDYT?
num_computed_tokens = req_state.num_computed_tokens
mm_positions = req_state.mm_positions
for i, (start_pos, num_encoder_tokens) in enumerate(mm_positions):
    start_idx = max(num_computed_tokens - start_pos, 0)
nit: A quick doc for this start/end indices computation would be helpful here.
Added some comments above to help understand the logic.
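For context, a hedged sketch of what the start/end index computation does, reusing the variable names from the snippet above (the `encoder_output_slice` helper is illustrative only; the actual code keeps this computation inline):

```python
from typing import Tuple


def encoder_output_slice(start_pos: int, num_encoder_tokens: int,
                         num_computed_tokens: int,
                         num_scheduled_tokens: int) -> Tuple[int, int]:
    """Return the [start_idx, end_idx) slice of one multi-modal input's
    encoder output that overlaps the tokens scheduled in this step.

    The encoder output covers prompt positions
    [start_pos, start_pos + num_encoder_tokens), while this step computes
    positions [num_computed_tokens, num_computed_tokens + num_scheduled_tokens).
    """
    start_idx = max(num_computed_tokens - start_pos, 0)
    end_idx = min(num_computed_tokens + num_scheduled_tokens - start_pos,
                  num_encoder_tokens)
    return start_idx, end_idx


# Example: an image whose 576 encoder tokens start at prompt position 40,
# with 512 tokens already computed and 200 scheduled in this step.
assert encoder_output_slice(40, 576, 512, 200) == (472, 576)
```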
This pull request has merge conflicts that must be resolved before it can be merged.
Sorry for the very delayed review - left some comments!

FWIW - I did a mini benchmark on this branch vs V0 on 1 x A100-80G.

Command: python vllm/examples/offline_inference_vision_language.py --num-prompts 1000

V0:
1000/1000 [01:33<00:00, 10.69it/s, est. speed input: 6369.87 toks/s, output: 682.44 toks/s]

V1 with this PR and default budget & cache:
1000/1000 [01:01<00:00, 16.13it/s, est. speed input: 9614.21 toks/s, output: 1029.49 toks/s]

V1 with encoder budget and cache size = 576 (this should be more or less equivalent to V1 with the previous VLM design):
1000/1000 [01:15<00:00, 13.18it/s, est. speed input: 7856.67 toks/s, output: 841.03 toks/s]
self._schedule_encoder_inputs(request,
                              request.num_computed_tokens,
                              num_new_tokens, encoder_budget))
assert num_new_tokens > 0
See my other comment on when `num_new_tokens` can be 0 for a running sequence.
I also fixed this (for the above decoder tokens) in the prefix caching PR. Also, to clarify the semantics of `num_new_tokens`:

- Before calling `_schedule_encoder_inputs`, `num_new_tokens` covers the text tokens as well as the image (placeholder) tokens.
- After calling `_schedule_encoder_inputs`, `num_new_tokens` may be the same as before if the encoder budget allows; otherwise it is reduced to include only the text tokens.

Is this understanding correct?
@comaniac Yes, correct. When the encoder cache or budget is insufficient, `num_new_tokens` can decrease up to the point just before the encoder input (e.g., image placeholder).
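To make the agreed semantics concrete, a tiny illustrative example (numbers are made up, and `encoder_can_be_scheduled` stands in for the real cache/budget check):

```python
# Suppose this step tentatively schedules 100 new tokens, and an image
# placeholder starts at position 40 of the prompt.
num_computed_tokens = 0
num_new_tokens = 100           # text tokens + image placeholder tokens
image_start_pos = 40

encoder_can_be_scheduled = False  # pretend the encoder cache/budget is exhausted

if encoder_can_be_scheduled:
    # num_new_tokens stays 100 and the image is scheduled for encoding.
    scheduled_encoder_inputs = [0]
else:
    # Reduced to stop just before the encoder input: a decoder-only step
    # over the leading text tokens.
    scheduled_encoder_inputs = []
    num_new_tokens = image_start_pos - num_computed_tokens

assert num_new_tokens == 40
```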
@ywang96 Thanks for the review! QQ: How did you measure the perf of V1 without this PR?
I have updated my original review comment - PTAL!
This pull request has merge conflicts that must be resolved before it can be merged.
FYI, I did a quick performance benchmark for microsoft/Phi-3.5-vision-instruct with a separate process for the mm_mapper (on the old version of this PR) and without one. The results show that the separate process adds a large TTFT overhead, even when the RPS goes up (which is a bit surprising) - I think it is most likely related to pickle/socket overheads. I did some manual timing specifically on the round-trip times to the separate process and saw that the mm_mapper is 5x slower with a separate process than when run directly.

When there is no separate process, the performance looks much better.

The commands used are:
server: vllm serve microsoft/Phi-3.5-vision-instruct --trust-remote-code --max-model-len 4096 --enforce-eager --disable-async-output-proc
client: python benchmarks/benchmark_serving.py --backend openai-chat --base-url http://0.0.0.0:8000/v1 --endpoint /chat/completions --model microsoft/Phi-3.5-vision-instruct --dataset-path lmms-lab/LLaVA-OneVision-Data --dataset-name hf --hf-subset "chart2text(cauldron)" --hf-split train --num_prompts=100 --request-rate 5
Thanks for the benchmarking. Could you also benchmark throughput? I suppose the benefit of separate processes should be more obvious in throughput than in latency, as long as we pipeline the mm_mapper well?
# in the "partial" state, where the request has some tokens computed | ||
# but not all. The constraint is due to the persistent batch in the | ||
# V1 model runner. | ||
# TODO(woosuk): Remove this constraint after refactoring model runner. |
In what situation would this limitation hurt performance?
vllm/v1/core/scheduler.py (Outdated)

def _schedule_encoder_inputs(
    self,
    request: Request,
    num_computed_tokens: int,
    num_new_tokens: int,
    encoder_budget: int,
) -> Tuple[List[int], int]:
Please add a docstring to this function for readability.
Added. Thanks for the suggestion.
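For reference, a hedged sketch of the kind of docstring such a function could carry, written against the (outdated) signature shown above; it is not the docstring that was actually added:

```python
def _schedule_encoder_inputs(
    self,
    request: "Request",
    num_computed_tokens: int,
    num_new_tokens: int,
    encoder_budget: int,
) -> "Tuple[List[int], int]":
    """Pick which of the request's encoder inputs to run in this step.

    Args:
        request: The request being scheduled.
        num_computed_tokens: Number of tokens of the request that have
            already been computed.
        num_new_tokens: Number of decoder tokens tentatively scheduled in
            this step (the caller may shrink it if an encoder input cannot
            be scheduled).
        encoder_budget: Remaining encoder compute budget for this step.

    Returns:
        The indices of the encoder inputs to schedule in this step and the
        encoder budget left after scheduling them.
    """
    ...
```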
vllm/model_executor/models/llava.py (Outdated)

elif inputs_embeds is None:
    vision_embeddings = self.process_mm_inputs(**kwargs)
    # always pass the input via `inputs_embeds`
    # to make sure the computation graph is consistent
    inputs_embeds = self.get_inputs_embeds(input_ids,
                                           vision_embeddings)
    input_ids = None
If we're putting the encoder forward pass and embedding merge at the model_runner level, then I don't think the code here is needed? (Is it possible for `inputs_embeds` to be `None` here when there's multimodal data in the request? If not, we just need to call `embed_tokens` here to get the text embeddings.)
nvm - I see that it's needed here to be compatible with V0. I will add a note accordingly in my PR to indicate that this needs to be cleaned up after we fully deprecate V0.
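For context, a rough sketch of the "merge at the model runner level" idea mentioned above (the `build_inputs_embeds` helper, its tensor shapes, and the mask-based scatter are assumptions for illustration, not vLLM's actual API):

```python
import torch


def build_inputs_embeds(
    embed_tokens,                    # text embedding module, e.g. nn.Embedding
    input_ids: torch.Tensor,         # [num_tokens] token ids, placeholders included
    vision_embeds: torch.Tensor,     # [num_placeholder_tokens, hidden_size]
    placeholder_mask: torch.Tensor,  # [num_tokens] bool, True at placeholder slots
) -> torch.Tensor:
    # Embed every token id, then overwrite the placeholder positions with the
    # (possibly cached) vision embeddings before calling the language model.
    inputs_embeds = embed_tokens(input_ids)
    inputs_embeds[placeholder_mask] = vision_embeds.to(inputs_embeds.dtype)
    return inputs_embeds
```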
This pull request has merge conflicts that must be resolved before it can be merged.
@ywang96 Addressed comments. PTAL.
@ywang96 This PR actually requires adding If we don't want to add this method to the text models, we can use
@WoosukKwon Overall looks good to me! I left a few more comments mainly around code clarifications so please take a look.
@WoosukKwon Everything looks good to me now - can you merge with main after #10272 is merged for the test fix? After that we can merge this.
# FIXME(woosuk): The input mapping (e.g., PIL images to tensors) may
# take 10-50 ms, which can cause a spike in the latency. We should
# consider moving this to a separate thread.
if req.mm_data:
    req.mm_inputs = self.mm_input_mapper.process_inputs(
        req.mm_data, req.mm_processor_kwargs)
One very nice property of V0 + #8348 is that the input mapper can be skipped entirely if the multimodal item is covered by the prefix cache (in our use case with Ultravox we can have many audio chunks in each inference). Not sure if that's practical to preserve in V1?
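Not something this PR implements, but to illustrate the idea being raised: a hypothetical memoizing wrapper around the input mapper (the `CachingMMInputMapper` class, its hashing key, and the skip condition are all assumptions; a real V1 design tied to the prefix cache would likely look different).

```python
import hashlib
import pickle


class CachingMMInputMapper:
    """Hypothetical wrapper that skips the multi-modal input mapper when the
    same item (e.g. an audio chunk) has been mapped before."""

    def __init__(self, mm_input_mapper):
        self._mapper = mm_input_mapper
        self._cache = {}

    def process_inputs(self, mm_data, mm_processor_kwargs):
        # Key on the raw multi-modal data + processor kwargs. A production
        # version would use a cheaper, more robust key (e.g. a content hash
        # shared with the prefix cache).
        key = hashlib.sha256(
            pickle.dumps((mm_data, mm_processor_kwargs))).hexdigest()
        if key not in self._cache:
            self._cache[key] = self._mapper.process_inputs(
                mm_data, mm_processor_kwargs)
        return self._cache[key]
```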
This PR implements the basic vision language model support in V1.
Motivation
Multi-modal inputs are difficult to deal with because they often have complex (or non-trivial) dependencies. For example, the model can take a prompt with interleaved text and images (shown as a color-coded figure in the PR, omitted here), where different colors represent different types of dependencies.
In V0, we didn't consider those dependencies; V0 circumvented them by always processing the entire prompt (all images and text) at once. However, this is not desirable, since it doesn't fit well with other optimizations such as chunked prefill and prefix caching.
Proposal
To address this limitation, this PR proposes to make the V1 scheduler consider and track these dependencies explicitly, and to do flexible, fine-grained scheduling based on them. One example could look like the following:
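The original description illustrates this with a figure; below is a rough, made-up sketch of the kind of schedule this enables (all token counts and budgets are invented for illustration):

```python
# Rough illustration only; token counts and budgets are made up.
prompt = [
    ("text", 30),
    ("image_A_placeholder", 576),
    ("text", 20),
    ("image_B_placeholder", 576),
    ("text", 10),
]
token_budget_per_step = 512

# One possible fine-grained schedule (decoder ranges are prompt offsets):
schedule = [
    # step 1: run the encoder for image A, and prefill positions [0, 512)
    #         (30 text tokens + the first 482 placeholder tokens of A).
    {"encoder": ["image_A"], "decoder": (0, 512)},
    # step 2: prefill the rest of A's placeholders and the next text chunk;
    #         run the encoder for image B only if the encoder budget/cache
    #         allows, otherwise stop right before B's placeholders.
    {"encoder": ["image_B"], "decoder": (512, 1024)},
    # step 3: finish B's remaining placeholders (its encoder output is now
    #         cached) and the trailing text, then start generating outputs.
    {"encoder": [], "decoder": (1024, 1212)},
]
```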
Implementation
Limitations
Misc
To reduce conflicts, I reverted the changes in the detokenizer. Also, the MM input mapper will run in the same process as the engine (scheduler) for now; we will move it to a separate process later.