
[V1] Support VLMs with fine-grained scheduling #9871

Merged: 3 commits merged into main on Nov 13, 2024

Conversation

@WoosukKwon (Collaborator) commented Oct 31, 2024

This PR implements the basic vision language model support in V1.

Motivation

Multi-modal inputs are difficult to deal with because they often have complex (non-trivial) dependencies. For example, the model can take a prompt with interleaved text and images, like the following:

[Screenshot: a prompt with interleaved text and image placeholders]

Here, different colors represent different types of dependencies:

  • Red: Can be computed independently of each other
  • Yellow: Depends on Image Embedding 0
  • Green: Depends on Image Embedding 1

In V0, we didn't consider these dependencies; V0 circumvented them by always processing the entire prompt (all images and text) at once. However, this is not desirable, since it doesn't fit well with other optimizations such as chunked prefills and prefix caching.

Proposal

To address this limitation, this PR proposes to make the V1 scheduler consider and track these dependencies explicitly, and to perform flexible, fine-grained scheduling based on them. One example looks like the following:
[Screenshot: an example of the proposed fine-grained schedule]

  1. The scheduler leverages chunked prefills for the decoder inputs, so that TPOT (time per output token) stays under control.
  2. Furthermore, the scheduler ensures that not too many images are processed by the vision encoder in the same step, since this can cause a spike in TTFT (time to first token) and TPOT.
  3. This fine-grained scheduling will also enable prefix caching for VLMs, although that is not implemented in this PR.

Implementation

  • The scheduler has an “encoder budget” (e.g., the number of input image tokens to the ViT) and a “decoder budget” (the number of input tokens to the decoder).
  • The scheduler explicitly schedules the encoder and decoder inputs, considering the input dependencies; see the sketch after this list.
    • The vision encoder and the LLM decoder live on the same GPU.
    • In every step, the model runner first (optionally) runs the vision encoder, and then runs the LLM decoder, possibly with the encoder's output.
  • The model runner caches the encoder outputs (e.g., image embeddings) in an encoder cache on the GPU until the entire tensor is consumed by the decoder.
    • We should limit the maximum size of this cache, since the encoder outputs can be large. This limit acts as a scheduling constraint in the scheduler.
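
Below is a minimal sketch of the per-request encoder-scheduling idea, under stated assumptions: the function name, parameters, and return shape are hypothetical, and the real V1 scheduler additionally consults the encoder cache manager and the running/waiting queues.

```python
from typing import List, Tuple

def schedule_encoder_inputs(
    mm_positions: List[Tuple[int, int]],  # (start_pos, num_encoder_tokens) per image
    num_computed_tokens: int,             # decoder tokens already computed for this request
    num_new_tokens: int,                  # decoder tokens proposed for this step
    encoder_budget: int,                  # encoder tokens still allowed in this step
) -> Tuple[List[int], int, int]:
    """Pick the encoder inputs that must run now; shrink the decoder chunk if needed."""
    encoder_inputs_to_schedule: List[int] = []
    for i, (start_pos, num_encoder_tokens) in enumerate(mm_positions):
        if start_pos + num_encoder_tokens <= num_computed_tokens:
            continue  # this image's embedding was fully consumed in earlier steps
        if start_pos >= num_computed_tokens + num_new_tokens:
            break     # the placeholder lies beyond this step's decoder chunk
        if num_encoder_tokens > encoder_budget:
            # Not enough encoder budget (or cache space): stop the decoder chunk
            # right before the image placeholder and retry in a later step.
            num_new_tokens = start_pos - num_computed_tokens
            break
        encoder_inputs_to_schedule.append(i)
        encoder_budget -= num_encoder_tokens
    return encoder_inputs_to_schedule, num_new_tokens, encoder_budget

# One 576-token image whose placeholder starts at position 10, scheduled with a
# 512-token decoder chunk and a 576-token encoder budget.
print(schedule_encoder_inputs([(10, 576)], 0, 512, 576))  # ([0], 512, 0)
```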

Limitations

  • Currently, the design only considers Llava-style model architectures (e.g., Pixtral, Molmo). It does not cover other model architectures such as multi-modal Llama.
  • Currently, the implementation in this PR only supports Llava v1.5 and Phi3v, because of the changes required in each model's input processor. Support for other models will be added in a followup PR.
  • Currently, the encoder cache is just a pool of tensors. For more precise memory management, we should store it in paged memory, just like the paged KV cache. I leave this as future work.
  • Currently, the scheduling logic for encoder inputs is a bit hacky because of some limitations in the V1 model runner. This needs to be further refined in the next PR.

Misc

To reduce merge conflicts, I reverted the changes in the detokenizer. Also, the MM input mapper will run in the same process as the engine (scheduler) for now; we will move it to a separate process later.


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@ywang96 ywang96 self-assigned this Oct 31, 2024
@alexm-neuralmagic (Collaborator) left a comment

The new code looks great, and the performance should be better as well. A few nit comments below.

continue
if not self.encoder_cache_manager.can_allocate(request, i):
# Cannot schedule because the encoder cache is full.
num_new_tokens = start_pos - num_computed_tokens
Collaborator
What's the meaning of this num_new_tokens update here? For example, start_pos can be < num_computed_tokens, and then the result could be negative?

@ywang96 (Member) Nov 5, 2024

I don't think it's possible to have start_pos < num_computed_tokens here: num_computed_tokens counts tokens that have already been processed, which means that if there were an image with start_pos < num_computed_tokens, it would have been processed in a previous iteration already (either stored in the KV cache or cached in the encoder cache).

If I understand correctly, the point of this update is that if we cannot run the encoder here, then we want to stop exactly before the first encoder position and run decoder-only processing for the current iteration. However, I think it is possible to have start_pos == num_computed_tokens for a running request (e.g., the first image token of a placeholder is exactly the first scheduled token, but the cache cannot allocate)?

@WoosukKwon (Author)

It's possible when prefix caching is enabled (although we currently don't support prefix caching for VLMs).

@WoosukKwon (Author)

> we want to stop exactly before the first encoder position and run decoder-only processing for the current iteration.

Exactly.
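
A tiny numeric illustration of this truncation, with made-up values:

```python
# The request has 100 tokens computed; the next image placeholder starts at
# position 150, but the encoder cache/budget cannot take the image this step.
num_computed_tokens = 100
start_pos = 150
num_new_tokens = start_pos - num_computed_tokens
print(num_new_tokens)  # 50: schedule only the text tokens before the placeholder
# In the edge case raised above (start_pos == num_computed_tokens), this
# would yield 0 new tokens for the step.
```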

Resolved review thread: vllm/v1/core/scheduler.py
request.num_tokens)

# Encoder-related.
if encoder_inputs_to_schedule:
Collaborator

Seems like a code duplication with the running case. Maybe the duplication can be avoided somehow.

Member

+1

@WoosukKwon (Author)

The code was simplified a bit. I found it difficult to refactor further, since it's only 5 lines of code and it involves updating local variables like scheduled_encoder_inputs and encoder_budget. The code looks OK to me. WDYT?

Resolved review thread: vllm/v1/request.py
Resolved review thread: vllm/v1/worker/gpu_model_runner.py
num_computed_tokens = req_state.num_computed_tokens
mm_positions = req_state.mm_positions
for i, (start_pos, num_encoder_tokens) in enumerate(mm_positions):
start_idx = max(num_computed_tokens - start_pos, 0)
Collaborator

nit: A quick doc comment for this start/end index computation would be helpful here.

@WoosukKwon (Author)

Added some comments above to help understand the logic.
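
For readers of this thread, here is a rough, hypothetical illustration of the index math being discussed (names and shapes are illustrative, not the actual model-runner code): it selects the slice of a cached image embedding that overlaps the decoder tokens scheduled in this step.

```python
def encoder_output_slice(start_pos: int, num_encoder_tokens: int,
                         num_computed_tokens: int, num_scheduled_tokens: int):
    # First embedding row of this image that has not been consumed yet.
    start_idx = max(num_computed_tokens - start_pos, 0)
    # One past the last row covered by this step's decoder chunk.
    end_idx = min(num_computed_tokens + num_scheduled_tokens - start_pos,
                  num_encoder_tokens)
    return start_idx, end_idx

# Image placeholder spans positions [100, 676); 300 tokens are computed and 200
# more are scheduled, so rows [200, 400) of the embedding are consumed now.
print(encoder_output_slice(100, 576, 300, 200))  # (200, 400)
```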

Resolved review thread: vllm/v1/worker/gpu_model_runner.py
mergify bot commented Nov 1, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @WoosukKwon please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@ywang96 (Member) left a comment

Sorry for the very delayed review - left some comments!

FWIW, I ran a mini benchmark of this branch vs. V0 on 1x A100-80G.

Command: python vllm/examples/offline_inference_vision_language.py --num-prompts 1000

V0:

1000/1000 [01:33<00:00, 10.69it/s, est. speed input: 6369.87 toks/s, output: 682.44 toks/s]

V1 with this PR and default budget & cache:

1000/1000 [01:01<00:00, 16.13it/s, est. speed input: 9614.21 toks/s, output: 1029.49 toks/s]

V1 with encoder budget and cache size = 576 (this should be more or less equivalent to V1 with the previous VLM design):

1000/1000 [01:15<00:00, 13.18it/s, est. speed input: 7856.67 toks/s, output: 841.03 toks/s]


self._schedule_encoder_inputs(request,
request.num_computed_tokens,
num_new_tokens, encoder_budget))
assert num_new_tokens > 0
Member

See my other comment on when num_new_tokens can be 0 for a running sequence.

@comaniac (Collaborator)

I also fixed this (for the decoder tokens above) in the prefix caching PR. Also, to clarify the semantics of num_new_tokens:

  • Before calling _schedule_encoder_inputs, num_new_tokens includes both the text tokens and the image (placeholder) tokens.
  • After calling _schedule_encoder_inputs, num_new_tokens may be the same as before if the encoder budget allows; otherwise it is reduced to include only the text tokens.

Is this understanding correct?

@WoosukKwon (Author)

@comaniac Yes, correct. When the encoder cache or budget is insufficient, num_new_tokens can decrease up to the point just before the encoder input (e.g., image placeholder).
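
Using the hypothetical schedule_encoder_inputs sketch from the Implementation section above (not the actual PR code), the two cases look like this:

```python
# Encoder budget allows the image: num_new_tokens is unchanged.
print(schedule_encoder_inputs([(150, 576)], 0, 700, 576))  # ([0], 700, 0)
# Encoder budget (or cache) does not allow it: num_new_tokens shrinks to the
# 150 text tokens that precede the image placeholder.
print(schedule_encoder_inputs([(150, 576)], 0, 700, 0))    # ([], 150, 0)
```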


Resolved review threads: vllm/v1/core/scheduler.py (×2)
@WoosukKwon (Author)

@ywang96 Thanks for the review!

QQ: How did you measure the perf of V1 without this PR?

@ywang96 (Member) commented Nov 5, 2024

> @ywang96 Thanks for the review!
>
> QQ: How did you measure the perf of V1 without this PR?

I have updated my original review comment - PTAL!

mergify bot commented Nov 6, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @WoosukKwon please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 6, 2024
@alexm-neuralmagic (Collaborator)

FYI,

I did a quick performance benchmark for microsoft/Phi-3.5-vision-instruct with a separate process for the mm_mapper (on the old version of this PR) and without one. The results below show that the separate process adds a large TTFT overhead, even when the RPS goes up (which is a bit surprising); I think it is most likely related to pickle/socket overheads. I also did some manual timing of the round trips to the separate process and saw that the mm_mapper is 5x slower with a separate process than when simply run directly.

RPS | V0 TTFT | V1 TTFT (separate-process mm_mapper)
--- | --- | ---
1 | 67.05 | 127.99
5 | 73.23 | 143.1
10 | 84.28 | 190.66

RPS | V0 TPOT | V1 TPOT (separate-process mm_mapper)
--- | --- | ---
1 | 14.27 | 14.44
5 | 17.59 | 18.89
10 | 25.47 | 27.8

When there is no separate process, the performance looks much better:

RPS | V0 TTFT | V1 TTFT (direct mm_mapper)
--- | --- | ---
1 | 67.05 | 69.91
5 | 73.23 | 78.47
10 | 84.28 | 89.23

RPS | V0 TPOT | V1 TPOT (direct mm_mapper)
--- | --- | ---
1 | 14.27 | 13.10
5 | 17.59 | 14.19
10 | 25.47 | 16.17

The commands used are:

server: vllm serve microsoft/Phi-3.5-vision-instruct --trust-remote-code --max-model-len 4096 --enforce-eager --disable-async-output-proc

client: python benchmarks/benchmark_serving.py --backend openai-chat --base-url http://0.0.0.0:8000/v1 --endpoint /chat/completions --model microsoft/Phi-3.5-vision-instruct --dataset-path lmms-lab/LLaVA-OneVision-Data --dataset-name hf --hf-subset "chart2text(cauldron)" --hf-split train --num_prompts=100 --request-rate 5

@comaniac (Collaborator) commented Nov 6, 2024

Thanks for the benchmarking. Could you also benchmark throughput? I suppose the benefit of a separate process should be more obvious in throughput than in latency, as long as we pipeline the mm_mapper well.

Resolved review thread: vllm/v1/core/encoder_cache_manager.py
# in the "partial" state, where the request has some tokens computed
# but not all. The constraint is due to the persistent batch in the
# V1 model runner.
# TODO(woosuk): Remove this constraint after refactoring model runner.
Collaborator

In what situations would this limitation hurt performance?


Comment on lines 297 to 303
def _schedule_encoder_inputs(
self,
request: Request,
num_computed_tokens: int,
num_new_tokens: int,
encoder_budget: int,
) -> Tuple[List[int], int]:
Collaborator

Please add a docstring to this function for readability.

@WoosukKwon (Author)

Added. Thanks for the suggestion.
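
One possible shape for such a docstring, sketched here for illustration (the wording actually added in the PR may differ); the stub mirrors the signature quoted above.

```python
from typing import List, Tuple

def _schedule_encoder_inputs(
    self,
    request,
    num_computed_tokens: int,
    num_new_tokens: int,
    encoder_budget: int,
) -> Tuple[List[int], int]:
    """Decide which encoder inputs of `request` must run in this step.

    An encoder input (e.g., an image) has to be scheduled if its placeholder
    range overlaps [num_computed_tokens, num_computed_tokens + num_new_tokens)
    and its output is not already available in the encoder cache. If the
    encoder budget or the encoder cache cannot accommodate it, num_new_tokens
    is reduced so that the decoder chunk stops right before the placeholder.

    Returns:
        The indices of the encoder inputs to schedule and the (possibly
        reduced) number of new decoder tokens.
    """
    raise NotImplementedError  # docstring sketch only
```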

Comment on lines 520 to 524
elif inputs_embeds is None:
vision_embeddings = self.process_mm_inputs(**kwargs)
# always pass the input via `inputs_embeds`
# to make sure the computation graph is consistent
inputs_embeds = self.get_inputs_embeds(input_ids,
vision_embeddings)
input_ids = None
@ywang96 (Member) Nov 6, 2024

If we're putting the encoder forward pass and the embedding merge at the model_runner level, then I don't think the code here is needed? (Is it possible for inputs_embeds to be None here when there's multimodal data in the request? If not, we just need to call embed_tokens here to get the text embeddings.)

Never mind - I see that it's needed here for compatibility with V0. I will add a note in my PR indicating that this needs to be cleaned up after we fully deprecate V0.
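
A rough sketch of the model-runner-level flow being discussed, under stated assumptions (hypothetical helper, not the actual vLLM code): compute the text embeddings, overwrite the image-placeholder positions with the cached vision embeddings, and always feed the decoder through inputs_embeds so the computation graph stays consistent.

```python
import torch

@torch.inference_mode()
def merge_multimodal_embeddings(
    input_ids: torch.Tensor,           # [num_tokens]; placeholder slots hold image_token_id
    embed_tokens: torch.nn.Embedding,  # the decoder's text embedding layer
    vision_embeds: torch.Tensor,       # [num_placeholder_tokens, hidden_size]
    image_token_id: int,
) -> torch.Tensor:
    inputs_embeds = embed_tokens(input_ids)       # text embeddings for every position
    is_image = input_ids == image_token_id
    # Assumes vision_embeds has exactly one row per placeholder position.
    inputs_embeds[is_image] = vision_embeds.to(inputs_embeds.dtype)
    return inputs_embeds                          # pass as inputs_embeds with input_ids=None
```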

@mergify mergify bot removed the needs-rebase label Nov 8, 2024
@WoosukKwon WoosukKwon marked this pull request as ready for review November 8, 2024 06:27
@WoosukKwon WoosukKwon changed the title [V1] Support VLMs [V1] Support VLMs with fine-grained scheduling Nov 8, 2024
mergify bot commented Nov 11, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @WoosukKwon.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


@WoosukKwon (Author)

@ywang96 Addressed comments. PTAL.

@WoosukKwon (Author)

@ywang96 This PR actually requires adding a get_input_embeddings method to all models (while I only added it to llama, opt, llava, and phi3v in this PR), because V1 now executes the model's embedding layer and the rest of the model separately.

If we don't want to add this method to the text models, we can use self.model.model.get_input_embeddings instead, although it looks a bit hacky.

@ywang96 (Member) left a comment

@WoosukKwon Overall looks good to me! I left a few more comments, mainly around code clarification, so please take a look.

Resolved review threads on: vllm/model_executor/models/llava.py, vllm/v1/core/scheduler.py (×3), vllm/v1/worker/gpu_model_runner.py (×4)
@ywang96 (Member) commented Nov 12, 2024

@WoosukKwon Everything looks good to me now - can you merge with main after #10272 is merged for the test fix? After that we can merge this.

Signed-off-by: Woosuk Kwon <[email protected]>
@WoosukKwon WoosukKwon added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 13, 2024
@WoosukKwon WoosukKwon enabled auto-merge (squash) November 13, 2024 00:32
Signed-off-by: Woosuk Kwon <[email protected]>
Signed-off-by: Woosuk Kwon <[email protected]>
@WoosukKwon WoosukKwon merged commit bbd3e86 into main Nov 13, 2024
50 checks passed
@WoosukKwon WoosukKwon deleted the v1-vlm-sched branch November 13, 2024 04:53
rickyyx pushed a commit to rickyyx/vllm that referenced this pull request Nov 13, 2024
omer-dayan pushed a commit to omer-dayan/vllm that referenced this pull request Nov 14, 2024
Signed-off-by: Woosuk Kwon <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Signed-off-by: OmerD <[email protected]>
sumitd2 pushed a commit to sumitd2/vllm that referenced this pull request Nov 14, 2024
Signed-off-by: Woosuk Kwon <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Signed-off-by: Sumit Dubey <[email protected]>
Comment on lines +100 to +105
# FIXME(woosuk): The input mapping (e.g., PIL images to tensors) may
# take 10-50 ms, which can cause a spike in the latency. We should
# consider moving this to a separate thread.
if req.mm_data:
req.mm_inputs = self.mm_input_mapper.process_inputs(
req.mm_data, req.mm_processor_kwargs)
Contributor

One very nice property of V0 + #8348 is that the input mapper can be skipped entirely if the multimodal item is covered by the prefix cache (in our use case with Ultravox we can have many audio chunks in each inference). Not sure if that's practical to preserve in V1?
