[model] Support for Llava-Next-Video model #7559

TKONIY · 2024-08-15T15:57:19Z

Roadmap

Add VideoPlugin to MultiModalPlugin.

LLM.generate API for a single video input.

LLM.generate({
    "prompt": "<video> please summarize this video",
    "multi_modal_data": {
        "video": video # currently only support type of np.ndarry
    }
})

Support LlavaNextVideoForConditionalGeneration model with single video input.
[15 Aug. Update] Waiting for the configuration file in hugging-face to be fixed. https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-hf/discussions/4
Add example for llava-next-video.

Support all kinds of video input type like transformers

VideoInput = Union[
    List["PIL.Image.Image"], # Supported
    "np.ndarray",
    "torch.Tensor",
    List["np.ndarray"],
    List["torch.Tensor"],
    List[List["PIL.Image.Image"]],
    List[List["np.ndarrray"]],
    List[List["torch.Tensor"]],
]

Support multiple image-video-mixed input.
Support Siglip.
Support Chat Completion APIs.
Support prefix caching

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which consists a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of default ones by unblocking the steps in your fast-check build on Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

Comment /ready on the PR
Add ready label to the PR
Enable auto-merge.

🚀

DarkLight1337 · 2024-08-15T16:46:30Z

This is an exciting development! This should be the last of the common modality types to cover for now.

TKONIY · 2024-08-19T19:02:32Z

@DarkLight1337 @ywang96 The initial support for Llava-Next-Video is done, which is also the first support for video in vLLM. Could you help review this PR?

It now supports integrating a single video into a prompt with the "<video>" symbol. I am happy to tell you that compared to SGLang which fixes the number of input frames when launching the model, vLLM now has a stronger implementation that supports any number of input frames per video.

ywang96 · 2024-08-19T20:30:11Z

@DarkLight1337 @ywang96 The initial support for Llava-Next-Video is done, which is also the first support for video in vLLM. Could you help review this PR?

It now supports integrating a single video into a prompt with the "

This is great and thank you very much for making this contribution! @TKONIY

I will review this PR either today or tomorrow!

TKONIY · 2024-08-20T02:44:48Z

Thank you!

TKONIY · 2024-08-21T09:42:21Z

@DarkLight1337 @ywang96 The initial support for Llava-Next-Video is done, which is also the first support for video in vLLM. Could you help review this PR?
It now supports integrating a single video into a prompt with the "

This is great and thank you very much for making this contribution! @TKONIY

I will review this PR either today or tomorrow!

Dear @ywang96, are you available today to review this PR?

ywang96 · 2024-08-21T15:06:25Z

@TKONIY Reviewing it now!

ywang96

Thank you for the PR @TKONIY! It seems that this PR only focuses on the offline inference, is that correct? OpenAI Vision API does support video frames as input, so we should make changes to our frontend to support the end-to-end online inference. (This can be in a later PR if you plan to work on it as well)

Overall I think PR is in the right track! I left a few comments as the first round of the review, mostly regarding the code cleanups. Please take a look, thanks!

examples/offline_inference_vision_language.py

vllm/model_executor/models/llava_next_video.py

vllm/multimodal/video.py

vllm/model_executor/models/llava_next_video.py

vllm/transformers_utils/image_processor.py

litianjian · 2024-08-22T13:12:35Z

vllm/model_executor/models/llava_next_video.py

+    tokens_per_frame = get_llava_next_video_frame_feature_size(hf_config)
+    video_feature_size = frames_per_video * tokens_per_frame
+
+    if isinstance(vision_config, CLIPVisionConfig):


I think LLaVA models should enable the Siglip vision encoder.

You are right! It seems not complicated, but I haven't had time to verify the Siglip implementation. I would like to leave this to other PRs.

I am excited about it. Many of our models use the siglip. BTW, I am currently using your PR to verify our video models.

Feel free to take a look at our llava-next implementation, where we allow both CLIP and SIGLIP to be the vision tower

Thanks! I saw it in Llava Next implementation. I'll try.

@litianjian @ywang96 I tried to integrate SIGLIP like llava-next did. But I am not confident about its correctness. If would be nice if you could have a check on my latest commits.

TKONIY · 2024-08-22T18:45:07Z

Thank you for the PR @TKONIY! It seems that this PR only focuses on the offline inference, is that correct? OpenAI Vision API does support video frames as input, so we should make changes to our frontend to support the end-to-end online inference. (This can be in a later PR if you plan to work on it as well)

Overall I think PR is in the right track! I left a few comments as the first round of the review, mostly regarding the code cleanups. Please take a look, thanks!

Thanks for your review. It would be exciting if video input is supported by chat completion APIs. I have read the code of our front end and I think it requires changes to some interfaces, so I did not implement it in this PR. I may open an RFC to discuss it if I got time.

TKONIY · 2024-08-23T01:22:56Z

@ywang96 Thanks for your review! I have resolved most of the comments except max_llm_tokens. Please have a check. Thank you!

ywang96 · 2024-08-23T15:54:05Z

@ywang96 Thanks for your review! I have resolved most of the comments except max_llm_tokens. Please have a check. Thank you!

Thank you for addressing my comments! I have replied regarding your question as well!

Could you add a correctness test for this model in tests/models, like for all other models we have?

TKONIY · 2024-08-26T15:31:49Z

@ywang96 Thanks for your review! I have resolved most of the comments except max_llm_tokens. Please have a check. Thank you!

Thank you for addressing my comments! I have replied regarding your question as well!

Could you add a correctness test for this model in tests/models, like for all other models we have?

I am working on it. But I am busy with a conference this week. So the progress would not be fast. Sorry.

ywang96 · 2024-08-26T16:12:26Z

@ywang96 Thanks for your review! I have resolved most of the comments except max_llm_tokens. Please have a check. Thank you!

Thank you for addressing my comments! I have replied regarding your question as well!
Could you add a correctness test for this model in tests/models, like for all other models we have?

I am working on it. But I am busy with a conference this week. So the progress would not be fast. Sorry.

@TKONIY No worries at all and I can work on adding the test too if you don't mind. Thank you for the contribution!

TKONIY · 2024-08-31T05:43:32Z

@ywang96 Thanks for your review! I have resolved most of the comments except max_llm_tokens. Please have a check. Thank you!

Thank you for addressing my comments! I have replied regarding your question as well!
Could you add a correctness test for this model in tests/models, like for all other models we have?

I am working on it. But I am busy with a conference this week. So the progress would not be fast. Sorry.

@TKONIY No worries at all and I can work on adding the test too if you don't mind. Thank you for the contribution!

I am back and working on it now.

ywang96 · 2024-08-31T06:11:48Z

@ywang96 Thanks for your review! I have resolved most of the comments except max_llm_tokens. Please have a check. Thank you!

Thank you for addressing my comments! I have replied regarding your question as well!
Could you add a correctness test for this model in tests/models, like for all other models we have?

I am working on it. But I am busy with a conference this week. So the progress would not be fast. Sorry.

@TKONIY No worries at all and I can work on adding the test too if you don't mind. Thank you for the contribution!

I am back and working on it now.

Hey @TKONIY! Sounds good - I was actually just working on this branch but still run into the issue of ValueError: The checkpoint you are trying to load has model type `llava_next_video` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.. Are you also seeing this in your dev environment?

Also FYI, #8049 should be merged pretty soon in case you also want to add the frontend support.

TKONIY · 2024-08-31T06:29:53Z

@ywang96 Thanks for your review! I have resolved most of the comments except max_llm_tokens. Please have a check. Thank you!

Thank you for addressing my comments! I have replied regarding your question as well!
Could you add a correctness test for this model in tests/models, like for all other models we have?

I am working on it. But I am busy with a conference this week. So the progress would not be fast. Sorry.

@TKONIY No worries at all and I can work on adding the test too if you don't mind. Thank you for the contribution!

I am back and working on it now.

Hey @TKONIY! Sounds good - I was actually just working on this branch but still run into the issue of ValueError: The checkpoint you are trying to load has model type `llava_next_video` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.. Are you also seeing this in your dev environment?

Also FYI, #8049 should be merged pretty soon in case you also want to add the frontend support.

There is an error in the newest release version of the Transformers library. I fixed them with this commit huggingface/transformers@a27182b so that vLLM can correctly load the model config. Therefore, vLLM could only run with the Transformers library after this commit, which has not been released.

ywang96 · 2024-08-31T08:09:25Z

vllm/model_executor/models/llava_next_video.py

+        video_pixels = inputs["data"]
+
+        if isinstance(video_pixels, torch.Tensor):
+            b, num_frames, c, h, w = video_pixels.shape


For some reason the shape of video_pixels has an additional dimension between batch size and num_frames so I had to change this line to b, _, num_frames, c, h, w = video_pixels.shape to make this PR work with the main branch of transformers

I think this is due to the recent update to batched inputs, which introduced an additional level to the tensor shape where the second dimension now refers to the number of multimodal inputs (here, the number of videos).

I think it is because of the multi-multimodal input supports. My PR doesn't support multiple video and images per input yet. So I will just leave it as b, _, num_frames, c, h, w yet.

DarkLight1337 · 2024-09-10T10:35:50Z

Can you check the build failure for AMD? Other PRs don't have this problem so it is likely caused by your changes.

TKONIY · 2024-09-10T12:49:24Z

Can you check the build failure for AMD? Other PRs don't have this problem so it is likely caused by your changes.

I will check, thank you.

Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Cyrus Leung <[email protected]>

PancakeAwesome · 2024-09-14T14:43:33Z

So we can now use the main branch to implement qwenvl2, internvl2 and other multi-modal models of openai vllm server video inference?

DarkLight1337 · 2024-09-14T14:49:44Z

No, OpenAI API support is not out yet (unless you consider multi-image input).

PancakeAwesome · 2024-09-14T14:52:23Z

No, OpenAI API support is not out yet (unless you consider multi-image input).不，OpenAI API目前还不支持这种功能（除非你把多张图片作为输入）。

So you mean that the openai api must support video input?

DarkLight1337 · 2024-09-14T14:55:19Z

I mean that for now, you can pass in multiple images as a "video" to the model. Explicit video support will be added later as mentioned in #7558.

Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Cyrus Leung <[email protected]>

[Bugfix] Streamed tool calls now more strictly follow OpenAI's format; ensures Vercel AI SDK compatibility (vllm-project#8272) [Frontend] Add progress reporting to run_batch.py (vllm-project#8060) Co-authored-by: Adam Lugowski <[email protected]> [Bugfix] Correct adapter usage for cohere and jamba (vllm-project#8292) [Misc] GPTQ Activation Ordering (vllm-project#8135) [Misc] Fused MoE Marlin support for GPTQ (vllm-project#8217) Add NVIDIA Meetup slides, announce AMD meetup, and add contact info (vllm-project#8319) [Bugfix] Fix missing `post_layernorm` in CLIP (vllm-project#8155) [CI/Build] enable ccache/scccache for HIP builds (vllm-project#8327) [Frontend] Clean up type annotations for mistral tokenizer (vllm-project#8314) [CI/Build] Enabling kernels tests for AMD, ignoring some of then that fail (vllm-project#8130) Fix ppc64le buildkite job (vllm-project#8309) [Spec Decode] Move ops.advance_step to flash attn advance_step (vllm-project#8224) [Misc] remove peft as dependency for prompt models (vllm-project#8162) [MISC] Keep chunked prefill enabled by default with long context when prefix caching is enabled (vllm-project#8342) [Bugfix] lookahead block table with cuda graph max capture (vllm-project#8340) [Bugfix] Ensure multistep lookahead allocation is compatible with cuda graph max capture (vllm-project#8340) [Core/Bugfix] pass VLLM_ATTENTION_BACKEND to ray workers (vllm-project#8172) [CI/Build][Kernel] Update CUTLASS to 3.5.1 tag (vllm-project#8043) [Misc] Skip loading extra bias for Qwen2-MOE GPTQ models (vllm-project#8329) [Bugfix] Fix InternVL2 vision embeddings process with pipeline parallel (vllm-project#8299) [Hardware][NV] Add support for ModelOpt static scaling checkpoints. (vllm-project#6112) [model] Support for Llava-Next-Video model (vllm-project#7559) Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> [Frontend] Create ErrorResponse instead of raising exceptions in run_batch (vllm-project#8347) [Model][VLM] Add Qwen2-VL model support (vllm-project#7905) Co-authored-by: Roger Wang <[email protected]> Co-authored-by: DarkLight1337 <[email protected]> [Hardware][Intel] Support compressed-tensor W8A8 for CPU backend (vllm-project#7257) [CI/Build] Excluding test_moe.py from AMD Kernels tests for investigation (vllm-project#8373) [Bugfix] Add missing attributes in mistral tokenizer (vllm-project#8364) [Kernel][Misc] register ops to prevent graph breaks (vllm-project#6917) Co-authored-by: Sage Moore <[email protected]> [Misc] Move device options to a single place (vllm-project#8322) [Speculative Decoding] Test refactor (vllm-project#8317) Co-authored-by: youkaichao <[email protected]> Pixtral (vllm-project#8377) Co-authored-by: Roger Wang <[email protected]> Bump version to v0.6.1 (vllm-project#8379) [MISC] Dump model runner inputs when crashing (vllm-project#8305) [misc] remove engine_use_ray (vllm-project#8126) [TPU] Use Ray for default distributed backend (vllm-project#8389) Fix the AMD weight loading tests (vllm-project#8390) [Bugfix]: Fix the logic for deciding if tool parsing is used (vllm-project#8366) [Gemma2] add bitsandbytes support for Gemma2 (vllm-project#8338) [Misc] Raise error when using encoder/decoder model with cpu backend (vllm-project#8355) [Misc] Use RoPE cache for MRoPE (vllm-project#8396) [torch.compile] hide slicing under custom op for inductor (vllm-project#8384) [Hotfix][VLM] Fixing max position embeddings for Pixtral (vllm-project#8399) [Bugfix] Fix InternVL2 inference with various num_patches (vllm-project#8375) Co-authored-by: DarkLight1337 <[email protected]> [Model] Support multiple images for qwen-vl (vllm-project#8247) Signed-off-by: Alex-Brooks <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: DarkLight1337 <[email protected]> [BugFix] lazy init _copy_stream to avoid torch init wrong gpu instance (vllm-project#8403) [BugFix] Fix Duplicate Assignment in Hermes2ProToolParser (vllm-project#8423) [Bugfix] Offline mode fix (vllm-project#8376) Signed-off-by: Joe Runde <[email protected]> [multi-step] add flashinfer backend (vllm-project#7928) [Core] Add engine option to return only deltas or final output (vllm-project#7381) [Bugfix] multi-step + flashinfer: ensure cuda graph compatible (vllm-project#8427) [Hotfix][Core][VLM] Disable chunked prefill by default and prefix caching for multimodal models (vllm-project#8425) [CI/Build] Disable multi-node test for InternVL2 (vllm-project#8428) [Hotfix][Pixtral] Fix multiple images bugs (vllm-project#8415) [Bugfix] Fix weight loading issue by rename variable. (vllm-project#8293) [Misc] Update Pixtral example (vllm-project#8431) [BugFix] fix group_topk (vllm-project#8430) [Core] Factor out input preprocessing to a separate class (vllm-project#7329) [Bugfix] Mapping physical device indices for e2e test utils (vllm-project#8290) [Bugfix] Bump fastapi and pydantic version (vllm-project#8435) [CI/Build] Update pixtral tests to use JSON (vllm-project#8436) [Bugfix] Fix async log stats (vllm-project#8417) [bugfix] torch profiler bug for single gpu with GPUExecutor (vllm-project#8354) bump version to v0.6.1.post1 (vllm-project#8440) [CI/Build] Enable InternVL2 PP test only on single node (vllm-project#8437) [doc] recommend pip instead of conda (vllm-project#8446) [Misc] Skip loading extra bias for Qwen2-VL GPTQ-Int8 (vllm-project#8442) [misc][ci] fix quant test (vllm-project#8449) [Installation] Gate FastAPI version for Python 3.8 (vllm-project#8456) [plugin][torch.compile] allow to add custom compile backend (vllm-project#8445) [CI/Build] Reorganize models tests (vllm-project#7820) [Doc] Add oneDNN installation to CPU backend documentation (vllm-project#8467) [HotFix] Fix final output truncation with stop string + streaming (vllm-project#8468) bump version to v0.6.1.post2 (vllm-project#8473) [Hardware][intel GPU] bump up ipex version to 2.3 (vllm-project#8365) Co-authored-by: Yan Ma <[email protected]> [Kernel][Hardware][Amd]Custom paged attention kernel for rocm (vllm-project#8310) [Model] support minicpm3 (vllm-project#8297) Co-authored-by: DarkLight1337 <[email protected]> [torch.compile] fix functionalization (vllm-project#8480) [torch.compile] add a flag to disable custom op (vllm-project#8488) [TPU] Implement multi-step scheduling (vllm-project#8489) [Bugfix][Model] Fix Python 3.8 compatibility in Pixtral model by updating type annotations (vllm-project#8490) [Bugfix][Kernel] Add `IQ1_M` quantization implementation to GGUF kernel (vllm-project#8357) [Kernel] Enable 8-bit weights in Fused Marlin MoE (vllm-project#8032) Co-authored-by: Dipika <[email protected]> [Frontend] Expose revision arg in OpenAI server (vllm-project#8501) [BugFix] Fix clean shutdown issues (vllm-project#8492) [Bugfix][Kernel] Fix build for sm_60 in GGUF kernel (vllm-project#8506) [Kernel] AQ AZP 3/4: Asymmetric quantization kernels (vllm-project#7270) [doc] update doc on testing and debugging (vllm-project#8514) [Bugfix] Bind api server port before starting engine (vllm-project#8491) [perf bench] set timeout to debug hanging (vllm-project#8516) [misc] small qol fixes for release process (vllm-project#8517) [Bugfix] Fix 3.12 builds on main (vllm-project#8510) Signed-off-by: Joe Runde <[email protected]> [refactor] remove triton based sampler (vllm-project#8524) [Frontend] Improve Nullable kv Arg Parsing (vllm-project#8525) Signed-off-by: Alex-Brooks <[email protected]> [Misc][Bugfix] Disable guided decoding for mistral tokenizer (vllm-project#8521) [torch.compile] register allreduce operations as custom ops (vllm-project#8526) [Misc] Limit to ray[adag] 2.35 to avoid backward incompatible change (vllm-project#8509) Signed-off-by: Rui Qiao <[email protected]> [Benchmark] Support sample from HF datasets and image input for benchmark_serving (vllm-project#8495) [Encoder decoder] Add cuda graph support during decoding for encoder-decoder models (vllm-project#7631) [Feature][kernel] tensor parallelism with bitsandbytes quantization (vllm-project#8434) [Model] Add mistral function calling format to all models loaded with "mistral" format (vllm-project#8515) Co-authored-by: Cyrus Leung <[email protected]> [Misc] Don't dump contents of kvcache tensors on errors (vllm-project#8527) [Bugfix] Fix TP > 1 for new granite (vllm-project#8544) Signed-off-by: Joe Runde <[email protected]> [doc] improve installation doc (vllm-project#8550) Co-authored-by: Andy Dai <[email protected]> [CI/Build] Excluding kernels/test_gguf.py from ROCm (vllm-project#8520) [Kernel] Change interface to Mamba causal_conv1d_update for continuous batching (vllm-project#8012) [CI/Build] fix Dockerfile.cpu on podman (vllm-project#8540) [Misc] Add argument to disable FastAPI docs (vllm-project#8554) [CI/Build] Avoid CUDA initialization (vllm-project#8534) [CI/Build] Update Ruff version (vllm-project#8469) Signed-off-by: Aaron Pham <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> [Core][Bugfix][Perf] Introduce `MQLLMEngine` to avoid `asyncio` OH (vllm-project#8157) Co-authored-by: Nick Hill <[email protected]> Co-authored-by: [email protected] <[email protected]> Co-authored-by: Robert Shaw <[email protected]> Co-authored-by: Simon Mo <[email protected]> [Core] *Prompt* logprobs support in Multi-step (vllm-project#8199) [Core] zmq: bind only to 127.0.0.1 for local-only usage (vllm-project#8543) Signed-off-by: Russell Bryant <[email protected]> [Model] Support Solar Model (vllm-project#8386) Co-authored-by: Michael Goin <[email protected]> [AMD][ROCm]Quantization methods on ROCm; Fix _scaled_mm call (vllm-project#8380) Co-authored-by: Alexei-V-Ivanov-AMD <[email protected]> Co-authored-by: Michael Goin <[email protected]> [Kernel] Change interface to Mamba selective_state_update for continuous batching (vllm-project#8039) [BugFix] Nonzero exit code if MQLLMEngine startup fails (vllm-project#8572) [Bugfix] add `dead_error` property to engine client (vllm-project#8574) Signed-off-by: Joe Runde <[email protected]> [Kernel] Remove marlin moe templating on thread_m_blocks (vllm-project#8573) Co-authored-by: [email protected] [Bugfix] [Encoder-Decoder] Bugfix for encoder specific metadata construction during decode of encoder-decoder models. (vllm-project#8545) Revert "[Misc][Bugfix] Disable guided decoding for mistral tokenizer" (vllm-project#8593) [Bugfix] fixing sonnet benchmark bug in benchmark_serving.py (vllm-project#8616) [MISC] remove engine_use_ray in benchmark_throughput.py (vllm-project#8615) [Frontend] Use MQLLMEngine for embeddings models too (vllm-project#8584) [Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention (vllm-project#8577) [Core] simplify logits resort in _apply_top_k_top_p (vllm-project#8619) [Doc] Add documentation for GGUF quantization (vllm-project#8618) Create SECURITY.md (vllm-project#8642) [CI/Build] Re-enabling Entrypoints tests on ROCm, excluding ones that fail (vllm-project#8551) [Misc] guard against change in cuda library name (vllm-project#8609) [Bugfix] Fix Phi3.5 mini and MoE LoRA inference (vllm-project#8571) [bugfix] [AMD] add multi-step advance_step to ROCmFlashAttentionMetadata (vllm-project#8474) [Core] Support Lora lineage and base model metadata management (vllm-project#6315) [Model] Add OLMoE (vllm-project#7922) [CI/Build] Removing entrypoints/openai/test_embedding.py test from ROCm build (vllm-project#8670) [Bugfix] Validate SamplingParam n is an int (vllm-project#8548) [Misc] Show AMD GPU topology in `collect_env.py` (vllm-project#8649) [Bugfix] Config got an unexpected keyword argument 'engine' (vllm-project#8556) [Bugfix][Core] Fix tekken edge case for mistral tokenizer (vllm-project#8640) [Doc] neuron documentation update (vllm-project#8671) Signed-off-by: omrishiv <[email protected]> [Hardware][AWS] update neuron to 2.20 (vllm-project#8676) Signed-off-by: omrishiv <[email protected]> [Bugfix] Fix incorrect llava next feature size calculation (vllm-project#8496) [Core] Rename `PromptInputs` and `inputs`(vllm-project#8673) [MISC] add support custom_op check (vllm-project#8557) Co-authored-by: youkaichao <[email protected]> [Core] Factor out common code in `SequenceData` and `Sequence` (vllm-project#8675) [beam search] add output for manually checking the correctness (vllm-project#8684) [Kernel] Build flash-attn from source (vllm-project#8245) [VLM] Use `SequenceData.from_token_counts` to create dummy data (vllm-project#8687) [Doc] Fix typo in AMD installation guide (vllm-project#8689) [Kernel][Triton][AMD] Remove tl.atomic_add from awq_gemm_kernel, 2-5x speedup MI300, minor improvement for MI250 (vllm-project#8646) [dbrx] refactor dbrx experts to extend FusedMoe class (vllm-project#8518) [Kernel][Bugfix] Delete some more useless code in marlin_moe_ops.cu (vllm-project#8643) [Bugfix] Refactor composite weight loading logic (vllm-project#8656) [ci][build] fix vllm-flash-attn (vllm-project#8699) [Model] Refactor BLIP/BLIP-2 to support composite model loading (vllm-project#8407) [Misc] Use NamedTuple in Multi-image example (vllm-project#8705) Signed-off-by: Alex-Brooks <[email protected]> [MISC] rename CudaMemoryProfiler to DeviceMemoryProfiler (vllm-project#8703) [Model][VLM] Add LLaVA-Onevision model support (vllm-project#8486) Co-authored-by: litianjian <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Roger Wang <[email protected]> Co-authored-by: DarkLight1337 <[email protected]> [SpecDec][Misc] Cleanup, remove bonus token logic. (vllm-project#8701) [build] enable existing pytorch (for GH200, aarch64, nightly) (vllm-project#8713) [misc] upgrade mistral-common (vllm-project#8715) [Bugfix] Avoid some bogus messages RE CUTLASS's revision when building (vllm-project#8702) [Bugfix] Fix CPU CMake build (vllm-project#8723) Co-authored-by: Yuan <[email protected]> [Bugfix] fix docker build for xpu (vllm-project#8652) [Core][Frontend] Support Passing Multimodal Processor Kwargs (vllm-project#8657) Signed-off-by: Alex-Brooks <[email protected]> [Hardware][CPU] Refactor CPU model runner (vllm-project#8729) [Bugfix][CPU] fix missing input intermediate_tensors in the cpu_model_runner (vllm-project#8733) [Model] Support pp for qwen2-vl (vllm-project#8696) [VLM] Fix paligemma, fuyu and persimmon with transformers 4.45 : use config.text_config.vocab_size (vllm-project#8707) [CI/Build] use setuptools-scm to set __version__ (vllm-project#4738) Co-authored-by: youkaichao <[email protected]> [Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin (vllm-project#7701) Co-authored-by: mgoin <[email protected]> Co-authored-by: Divakar Verma <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> [Kernel][LoRA] Add assertion for punica sgmv kernels (vllm-project#7585) [Core] Allow IPv6 in VLLM_HOST_IP with zmq (vllm-project#8575) Signed-off-by: Russell Bryant <[email protected]> Fix typical acceptance sampler with correct recovered token ids (vllm-project#8562) Add output streaming support to multi-step + async while ensuring RequestOutput obj reuse (vllm-project#8335) [Hardware][AMD] ROCm6.2 upgrade (vllm-project#8674) Fix tests in test_scheduler.py that fail with BlockManager V2 (vllm-project#8728) re-implement beam search on top of vllm core (vllm-project#8726) Co-authored-by: Brendan Wong <[email protected]> Revert "[Core] Rename `PromptInputs` to `PromptType`, and `inputs` to `prompt`" (vllm-project#8750) [MISC] Skip dumping inputs when unpicklable (vllm-project#8744) [Core][Model] Support loading weights by ID within models (vllm-project#7931) [Model] Expose Phi3v num_crops as a mm_processor_kwarg (vllm-project#8658) Signed-off-by: Alex-Brooks <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: DarkLight1337 <[email protected]> [Bugfix] Fix potentially unsafe custom allreduce synchronization (vllm-project#8558) [Kernel] Split Marlin MoE kernels into multiple files (vllm-project#8661) Co-authored-by: mgoin <[email protected]> [Frontend] Batch inference for llm.chat() API (vllm-project#8648) Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Roger Wang <[email protected]> [Bugfix] Fix torch dynamo fixes caused by `replace_parameters` (vllm-project#8748) [CI/Build] fix setuptools-scm usage (vllm-project#8771) [misc] soft drop beam search (vllm-project#8763) [[Misc]Upgrade bitsandbytes to the latest version 0.44.0 (vllm-project#8768) [Core][Bugfix] Support prompt_logprobs returned with speculative decoding (vllm-project#8047) Signed-off-by: Travis Johnson <[email protected]> [Core] Adding Priority Scheduling (vllm-project#5958) [Bugfix] Use heartbeats instead of health checks (vllm-project#8583) Fix test_schedule_swapped_simple in test_scheduler.py (vllm-project#8780) [Bugfix][Kernel] Implement acquire/release polyfill for Pascal (vllm-project#8776) Fix tests in test_chunked_prefill_scheduler which fail with BlockManager V2 (vllm-project#8752) [BugFix] Propagate 'trust_remote_code' setting in internvl and minicpmv (vllm-project#8250) [Hardware][CPU] Enable mrope and support Qwen2-VL on CPU backend (vllm-project#8770) [Bugfix] load fc bias from config for eagle (vllm-project#8790) [Frontend] OpenAI server: propagate usage accounting to FastAPI middleware layer (vllm-project#8672) [Bugfix] Ray 2.9.x doesn't expose available_resources_per_node (vllm-project#8767) Signed-off-by: darthhexx <[email protected]> [Misc] Fix minor typo in scheduler (vllm-project#8765) [CI/Build][Bugfix][Doc][ROCm] CI fix and doc update after ROCm 6.2 upgrade (vllm-project#8777) [Kernel] Fullgraph and opcheck tests (vllm-project#8479) [[Misc]] Add extra deps for openai server image (vllm-project#8792) [VLM][Bugfix] internvl with num_scheduler_steps > 1 (vllm-project#8614) rename PromptInputs and inputs with backward compatibility (vllm-project#8760) [Frontend] MQLLMEngine supports profiling. (vllm-project#8761) [Misc] Support FP8 MoE for compressed-tensors (vllm-project#8588) Revert "rename PromptInputs and inputs with backward compatibility (vllm-project#8760) (vllm-project#8810) [Model] Add support for the multi-modal Llama 3.2 model (vllm-project#8811) Co-authored-by: simon-mo <[email protected]> Co-authored-by: Chang Su <[email protected]> Co-authored-by: Simon Mo <[email protected]> Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Roger Wang <[email protected]> [Doc] Update doc for Transformers 4.45 (vllm-project#8817) [Misc] Support quantization of MllamaForCausalLM (vllm-project#8822) [Misc] Update config loading for Qwen2-VL and remove Granite (vllm-project#8837) [Build/CI] Upgrade to gcc 10 in the base build Docker image (vllm-project#8814) [Docs] Add README to the build docker image (vllm-project#8825) [CI/Build] Fix missing ci dependencies (vllm-project#8834) [misc][installation] build from source without compilation (vllm-project#8818) [ci] Soft fail Entrypoints, Samplers, LoRA, Decoder-only VLM (vllm-project#8872) Signed-off-by: kevin <[email protected]> [Bugfix] Include encoder prompts len to non-stream api usage response (vllm-project#8861) [Misc] Change dummy profiling and BOS fallback warns to log once (vllm-project#8820) [Bugfix] Fix print_warning_once's line info (vllm-project#8867) fix validation: Only set tool_choice `auto` if at least one tool is provided (vllm-project#8568) [Bugfix] Fixup advance_step.cu warning (vllm-project#8815) [BugFix] Fix test breakages from transformers 4.45 upgrade (vllm-project#8829) [Installation] Allow lower versions of FastAPI to maintain Ray 2.9 compatibility (vllm-project#8764) [Feature] Add support for Llama 3.1 and 3.2 tool use (vllm-project#8343) Signed-off-by: Max de Bayser <[email protected]> [Core] rename`PromptInputs` and `inputs` (vllm-project#8876) [misc] fix collect env (vllm-project#8894) [MISC] Fix invalid escape sequence '\' (vllm-project#8830) Signed-off-by: Peter Pan <[email protected]> [Bugfix][VLM] Fix Fuyu batching inference with `max_num_seqs>1` (vllm-project#8892) [TPU] Update pallas.py to support trillium (vllm-project#8871) [torch.compile] use empty tensor instead of None for profiling (vllm-project#8875) [Kernel] AQ AZP 4/4: Integrate asymmetric quantization to linear method (vllm-project#7271) [Bugfix] fix for deepseek w4a16 (vllm-project#8906) Co-authored-by: mgoin <[email protected]> [Core] Multi-Step + Single Step Prefills via Chunked Prefill code path (vllm-project#8378) Co-authored-by: Varun Sundar Rabindranath <[email protected]> [misc][distributed] add VLLM_SKIP_P2P_CHECK flag (vllm-project#8911) [Core] Priority-based scheduling in async engine (vllm-project#8850) [misc] fix wheel name (vllm-project#8919) [Bugfix][Intel] Fix XPU Dockerfile Build (vllm-project#7824) Signed-off-by: tylertitsworth <[email protected]> Co-authored-by: youkaichao <[email protected]> [Misc] Remove vLLM patch of `BaichuanTokenizer` (vllm-project#8921) [Bugfix] Fix code for downloading models from modelscope (vllm-project#8443) [Bugfix] Fix PP for Multi-Step (vllm-project#8887) [CI/Build] Update models tests & examples (vllm-project#8874) Co-authored-by: Roger Wang <[email protected]> [Frontend] Make beam search emulator temperature modifiable (vllm-project#8928) Co-authored-by: Eduard Balzin <[email protected]> [Bugfix] Support testing prefill throughput with benchmark_serving.py --hf-output-len 1 (vllm-project#8891) [doc] organize installation doc and expose per-commit docker (vllm-project#8931) [Core] Improve choice of Python multiprocessing method (vllm-project#8823) Signed-off-by: Russell Bryant <[email protected]> Co-authored-by: youkaichao <[email protected]> [Bugfix] Block manager v2 with preemption and lookahead slots (vllm-project#8824) [Bugfix] Fix Marlin MoE act order when is_k_full == False (vllm-project#8741) Co-authored-by: Tyler Michael Smith <[email protected]> [CI/Build] Add test decorator for minimum GPU memory (vllm-project#8925) [Build/CI] Set FETCHCONTENT_BASE_DIR to one location for better caching (vllm-project#8930) [Model] Support Qwen2.5-Math-RM-72B (vllm-project#8896) [Model][LoRA]LoRA support added for MiniCPMV2.5 (vllm-project#7199) [BugFix] Fix seeded random sampling with encoder-decoder models (vllm-project#8870) Co-authored-by: Roger Wang <[email protected]> [Misc] Fix typo in BlockSpaceManagerV1 (vllm-project#8944) [Frontend] Added support for HF's new `continue_final_message` parameter (vllm-project#8942) [Kernel][Model] Varlen prefill + Prefill chunking support for mamba kernels and Jamba model (vllm-project#8533)

Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Signed-off-by: Alvant <[email protected]>

Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Signed-off-by: Amit Garg <[email protected]>

TKONIY mentioned this pull request Aug 15, 2024

[RFC]: Support for video input #7558

Closed

TKONIY force-pushed the llava-next-video branch from c27fd58 to d95701b Compare August 15, 2024 16:21

DarkLight1337 mentioned this pull request Aug 15, 2024

[RFC]: Multi-modality Support Refactoring #4194

Open

DarkLight1337 self-assigned this Aug 15, 2024

TKONIY force-pushed the llava-next-video branch 2 times, most recently from 2beafbe to d6c783b Compare August 19, 2024 18:53

TKONIY marked this pull request as ready for review August 19, 2024 18:54

TKONIY force-pushed the llava-next-video branch from 393f206 to 3179393 Compare August 19, 2024 18:56

TKONIY force-pushed the llava-next-video branch from 12f4d45 to d4a0112 Compare August 20, 2024 12:19

ywang96 reviewed Aug 21, 2024

View reviewed changes

TKONIY force-pushed the llava-next-video branch from e28f0b6 to 65dd96a Compare August 22, 2024 08:01

litianjian reviewed Aug 22, 2024

View reviewed changes

DarkLight1337 mentioned this pull request Aug 29, 2024

[Model][VLM] Add Qwen2-VL model support #7905

Merged

ywang96 reviewed Aug 31, 2024

View reviewed changes

TKONIY force-pushed the llava-next-video branch from 46960dc to 925c49e Compare August 31, 2024 16:01

TKONIY added 5 commits September 10, 2024 09:48

formatting

4888357

Fix docker file

37371a4

add notes for llava-next-video about upstream transformers library

fe6e1d3

update notes

37f0209

formatting

6d481dd

TKONIY force-pushed the llava-next-video branch from 44e239b to 6d481dd Compare September 10, 2024 09:48

Fix notes

8bd43fc

Fix dockerfile rocm

66bf496

DarkLight1337 added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 10, 2024

TKONIY and others added 5 commits September 10, 2024 18:17

Add open-cv for tests

808dcae

Merge branch 'main' into llava-next-video

bdf429d

Skip LLaVA-NeXT-Video tests for now

ff16521

Fix missing reason

ec6d340

Reword

f36b6ca

DarkLight1337 enabled auto-merge (squash) September 11, 2024 03:35

youkaichao disabled auto-merge September 11, 2024 05:21

youkaichao merged commit 6a512a0 into vllm-project:main Sep 11, 2024
70 of 72 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[model] Support for Llava-Next-Video model #7559

[model] Support for Llava-Next-Video model #7559

TKONIY commented Aug 15, 2024 •

edited

Loading

github-actions bot commented Aug 15, 2024

DarkLight1337 commented Aug 15, 2024

TKONIY commented Aug 19, 2024 •

edited

Loading

ywang96 commented Aug 19, 2024

TKONIY commented Aug 20, 2024

TKONIY commented Aug 21, 2024

ywang96 commented Aug 21, 2024

ywang96 left a comment •

edited

Loading

litianjian Aug 22, 2024

TKONIY Aug 22, 2024

litianjian Aug 22, 2024

ywang96 Aug 22, 2024

TKONIY Aug 22, 2024

TKONIY Aug 22, 2024

TKONIY commented Aug 22, 2024

TKONIY commented Aug 23, 2024

ywang96 commented Aug 23, 2024

TKONIY commented Aug 26, 2024

ywang96 commented Aug 26, 2024

TKONIY commented Aug 31, 2024

ywang96 commented Aug 31, 2024

TKONIY commented Aug 31, 2024

ywang96 Aug 31, 2024 •

edited

Loading

DarkLight1337 Aug 31, 2024

TKONIY Aug 31, 2024

DarkLight1337 commented Sep 10, 2024 •

edited

Loading

TKONIY commented Sep 10, 2024

PancakeAwesome commented Sep 14, 2024

DarkLight1337 commented Sep 14, 2024

PancakeAwesome commented Sep 14, 2024

DarkLight1337 commented Sep 14, 2024

[model] Support for Llava-Next-Video model #7559

[model] Support for Llava-Next-Video model #7559

Conversation

TKONIY commented Aug 15, 2024 • edited Loading

Roadmap

Related

github-actions bot commented Aug 15, 2024

DarkLight1337 commented Aug 15, 2024

TKONIY commented Aug 19, 2024 • edited Loading

ywang96 commented Aug 19, 2024

TKONIY commented Aug 20, 2024

TKONIY commented Aug 21, 2024

ywang96 commented Aug 21, 2024

ywang96 left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TKONIY commented Aug 22, 2024

TKONIY commented Aug 23, 2024

ywang96 commented Aug 23, 2024

TKONIY commented Aug 26, 2024

ywang96 commented Aug 26, 2024

TKONIY commented Aug 31, 2024

ywang96 commented Aug 31, 2024

TKONIY commented Aug 31, 2024

ywang96 Aug 31, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

DarkLight1337 commented Sep 10, 2024 • edited Loading

TKONIY commented Sep 10, 2024

PancakeAwesome commented Sep 14, 2024

DarkLight1337 commented Sep 14, 2024

PancakeAwesome commented Sep 14, 2024

DarkLight1337 commented Sep 14, 2024

TKONIY commented Aug 15, 2024 •

edited

Loading

TKONIY commented Aug 19, 2024 •

edited

Loading

ywang96 left a comment •

edited

Loading

ywang96 Aug 31, 2024 •

edited

Loading

DarkLight1337 commented Sep 10, 2024 •

edited

Loading