[RFC]: Support for video input #7558

Closed
TKONIY opened this issue Aug 15, 2024 · 17 comments

@TKONIY (Contributor) commented Aug 15, 2024

Motivation.

Currently, models like llava-hf/llava-next-video* recognize image and video inputs via different tokens and perform different computations on them. Therefore, vLLM should provide new APIs and inference support for video input.

Proposed Change.

API

Roadmap

  • Add a VideoPlugin for MultiModalPlugin
  • [model] Support for Llava-Next-Video model #7559
    • Add initial support for replacing a <video> token with a single video (a usage sketch follows this list).
    • Add support for replacing all <video> and <image> tokens with multiple multi-modal inputs.
  • Support prefix caching for repeated videos.
  • Support the OpenAI chat completion API.
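
For illustration, here is a minimal sketch of what single-video offline inference could look like under this proposal. The `multi_modal_data` key and the frame-array format are assumptions modeled on vLLM's existing image-input interface, not a finalized design:

```python
import numpy as np
from vllm import LLM, SamplingParams

# Assumption: a video is passed as a (num_frames, height, width, 3) uint8
# frame array, mirroring how vLLM passes images via multi_modal_data.
video = np.random.randint(0, 256, size=(16, 336, 336, 3), dtype=np.uint8)

llm = LLM(model="llava-hf/LLaVA-NeXT-Video-7B-hf", max_model_len=8192)

# The <video> placeholder in the prompt would be expanded into the
# video's frame embeddings by the proposed VideoPlugin.
outputs = llm.generate(
    {
        "prompt": "USER: <video>\nDescribe this video. ASSISTANT:",
        "multi_modal_data": {"video": video},
    },
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```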

Feedback Period.

A week

CC List.

@DarkLight1337
@zifeitong
@ywang96

Any Other Things.

No response

@PancakeAwesome commented

Is there a roadmap for video support in the OpenAI-compatible vLLM server? Thank you.

@TKONIY (Contributor, Author) commented Sep 12, 2024

> Is there a roadmap for video support in the OpenAI-compatible vLLM server? Thank you.

Thank you. It is a critical feature, but the APIs need careful discussion. I do not currently have plans to work on it.

CC @DarkLight1337 @ywang96

@ywang96 (Member) commented Sep 14, 2024

> > Is there a roadmap for video support in the OpenAI-compatible vLLM server? Thank you.
>
> Thank you. It is a critical feature, but the APIs need careful discussion. I do not currently have plans to work on it.
>
> CC @DarkLight1337 @ywang96

@TKONIY Sounds good - we will take it over from here. Thank you for contributing the model!

@sayakpaul (Contributor) commented

@ywang96 I am quite interested in using vLLM for high-performance video captioning. This will be tremendously helpful for furthering research on video generation from language.

I was able to use Qwen for this: #9128 (comment).

I want to know how to extend this to multi-video captioning. If you have some pointers for me, that would be helpful!

@sayakpaul (Contributor) commented

Cc @TKONIY if #7558 (comment) sounds interesting.

@TKONIY (Contributor, Author) commented Oct 14, 2024

@sayakpaul Sorry, I have not been following vLLM development closely recently. If you want to implement online serving support for video / multi-video input, I think you can start with an RFC to define the HTTP API (which the official OpenAI API does not yet support), then modify some frontend code in entrypoints/ to connect it to the LLM engine.
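
For illustration, such an extension could follow the shape of the existing `image_url` content part. Here is a sketch of a client call, where the `video_url` content type is purely hypothetical, since the official OpenAI API does not define one:

```python
from openai import OpenAI

# Assumes a vLLM OpenAI-compatible server is running locally with a
# video-capable model loaded.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="llava-hf/LLaVA-NeXT-Video-7B-hf",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this video."},
                # Hypothetical content part mirroring the official image_url type.
                {
                    "type": "video_url",
                    "video_url": {"url": "https://example.com/clip.mp4"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```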

@sayakpaul (Contributor) commented

Currently, I am doing this:
#9128 (comment)

@ywang96 (Member) commented Oct 15, 2024

> @ywang96 I am quite interested in using vLLM for high-performance video captioning. This will be tremendously helpful for furthering research on video generation from language.
>
> I was able to use Qwen for this: #9128 (comment).
>
> I want to know how to extend this to multi-video captioning. If you have some pointers for me, that would be helpful!

Hey @sayakpaul! Sorry for the late reply; this is definitely interesting.

AFAIK, only LLaVA-OneVision will support multi-video captioning once #8905 is merged, and this is more a question of model capability than of inference-infrastructure capability. (To my knowledge, video-language inference is more or less the same as multi-image inference, so it comes down to whether the model itself was trained for this task.)
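
To make the multi-image analogy concrete, here is a minimal sketch of what multi-video inference could look like once #8905 lands. The `limit_mm_per_prompt` option and the list-valued `multi_modal_data` entry are assumptions based on vLLM's multi-image interface, and the prompt template is only indicative:

```python
import numpy as np
from vllm import LLM, SamplingParams

# Two dummy videos as (num_frames, height, width, 3) uint8 frame arrays.
videos = [
    np.random.randint(0, 256, size=(8, 384, 384, 3), dtype=np.uint8)
    for _ in range(2)
]

# Assumption: limit_mm_per_prompt raises the per-prompt video cap,
# mirroring how vLLM handles multiple images per prompt.
llm = LLM(
    model="llava-hf/llava-onevision-qwen2-7b-ov-hf",
    limit_mm_per_prompt={"video": 2},
)

# One <video> placeholder per video; the list order matches the
# order of the placeholders in the prompt.
prompt = (
    "<|im_start|>user <video><video>\n"
    "Compare these two videos.<|im_end|>"
    "<|im_start|>assistant\n"
)

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"video": videos}},
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```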

@sayakpaul (Contributor) commented

> AFAIK, only LLaVA-OneVision will support multi-video captioning once #8905 is merged, and this is more a question of model capability than of inference-infrastructure capability.

Agreed. Thanks for your input.

Would it be possible to update this thread with an example of multi-video captioning?

@ywang96 (Member) commented Oct 15, 2024

> > AFAIK, only LLaVA-OneVision will support multi-video captioning once #8905 is merged, and this is more a question of model capability than of inference-infrastructure capability.
>
> Agreed. Thanks for your input.
>
> Would it be possible to update this thread with an example of multi-video captioning?

Yea - I'll keep that in mind and update this issue once we have some bandwidth to get #8905 merged. Will probably update our example scripts too!

@DarkLight1337 (Member) commented

Closing as completed since #8905 and #9842 have both been resolved.

@sayakpaul (Contributor) commented

@DarkLight1337 which part of the docs should we refer to for this?

@DarkLight1337 (Member) commented

Please see #9842 on how to use it.

@DarkLight1337 (Member) commented

@litianjian can you add some docs/examples for this?

@sayakpaul (Contributor) commented

@DarkLight1337 thanks!

Is #10020 (comment) a sufficiently good example for me to use as a reference?

@DarkLight1337 (Member) commented

> @DarkLight1337 thanks!
>
> Is #10020 (comment) a sufficiently good example for me to use as a reference?

Yes, that should be clear enough.

@litianjian (Contributor) commented

> @litianjian can you add some docs/examples for this?

Sure.
