[RFC]: Support for video input #7558

Closed
TKONIY opened this issue Aug 15, 2024 · 17 comments

@TKONIY (Contributor) commented Aug 15, 2024

Motivation.

Currently, models like llava-hf/llava-next-video* recognize image and video inputs via different tokens and perform different computations on them. Therefore, vLLM should provide new APIs and inference support for video input.

Proposed Change.

API

Roadmap

  • Add a VideoPlugin for MultiModalPlugin
  • [model] Support for Llava-Next-Video model #7559
    • Add initial support for replacing a <video> token with a single video (a usage sketch follows this list).
    • Add support for replacing all <video> and <image> tokens with multiple multi-modal inputs.
  • Support prefix caching for repeated videos.
  • Support the OpenAI chat completion API.
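
For illustration, here is a minimal sketch of what single-video offline inference could look like under this proposal. The `multi_modal_data` key and the frame-array format are assumptions modeled on vLLM's existing image-input interface, not a finalized design:

```python
import numpy as np
from vllm import LLM, SamplingParams

# Assumption: a video is passed as a (num_frames, height, width, 3) uint8
# frame array, mirroring how vLLM passes images via multi_modal_data.
video = np.random.randint(0, 256, size=(16, 336, 336, 3), dtype=np.uint8)

llm = LLM(model="llava-hf/LLaVA-NeXT-Video-7B-hf", max_model_len=8192)

# The <video> placeholder in the prompt would be expanded into the
# video's frame embeddings by the proposed VideoPlugin.
outputs = llm.generate(
    {
        "prompt": "USER: <video>\nDescribe this video. ASSISTANT:",
        "multi_modal_data": {"video": video},
    },
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```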

Feedback Period.

A week

CC List.

@DarkLight1337
@zifeitong
@ywang96

Any Other Things.

No response

@PancakeAwesome commented

Is there a roadmap for video support in the OpenAI-compatible vLLM server? Thank you.

@TKONIY (Contributor, Author) commented Sep 12, 2024

> Is there a roadmap for video support in the OpenAI-compatible vLLM server? Thank you.

Thank you. It is a critical feature, but the APIs need careful discussion. I do not currently have plans to work on it.

CC @DarkLight1337 @ywang96

@ywang96 (Member) commented Sep 14, 2024

> > Is there a roadmap for video support in the OpenAI-compatible vLLM server? Thank you.
>
> Thank you. It is a critical feature, but the APIs need careful discussion. I do not currently have plans to work on it.
>
> CC @DarkLight1337 @ywang96

@TKONIY Sounds good - we will take it over from here. Thank you for contributing the model!

@sayakpaul (Contributor) commented

@ywang96 I am quite interested in using vLLM for high-performance video captioning. This will be tremendously helpful for furthering research on video generation from language.

I was able to use Qwen for this: #9128 (comment).

I want to know how to extend this to multi-video captioning. If you have some pointers for me, that would be helpful!

@sayakpaul (Contributor) commented

Cc @TKONIY if #7558 (comment) sounds interesting.

@TKONIY (Contributor, Author) commented Oct 14, 2024

@sayakpaul Sorry, I have not been following vLLM development closely recently. If you want to implement online serving support for video / multi-video input, I think you can start with an RFC to define the HTTP API (which the official OpenAI API does not yet support), then modify some frontend code in entrypoints/ to connect it to the LLM engine.
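
For illustration, such an extension could follow the shape of the existing `image_url` content part. Here is a sketch of a client call, where the `video_url` content type is purely hypothetical, since the official OpenAI API does not define one:

```python
from openai import OpenAI

# Assumes a vLLM OpenAI-compatible server is running locally with a
# video-capable model loaded.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="llava-hf/LLaVA-NeXT-Video-7B-hf",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this video."},
                # Hypothetical content part mirroring the official image_url type.
                {
                    "type": "video_url",
                    "video_url": {"url": "https://example.com/clip.mp4"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```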

@sayakpaul (Contributor) commented

Currently, I am doing this:
#9128 (comment)

@ywang96 (Member) commented Oct 15, 2024

> @ywang96 I am quite interested in using vLLM for high-performance video captioning. This will be tremendously helpful for furthering research on video generation from language.
>
> I was able to use Qwen for this: #9128 (comment).
>
> I want to know how to extend this to multi-video captioning. If you have some pointers for me, that would be helpful!

Hey @sayakpaul! Sorry for the late reply; this is definitely interesting.

AFAIK, only LLaVA-OneVision will support multi-video captioning once #8905 is merged, and this is more a question of model capability than of inference-infrastructure capability. (To my knowledge, video-language inference is more or less the same as multi-image inference, so it comes down to whether the model itself was trained for this task.)
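
To make the multi-image analogy concrete, here is a minimal sketch of what multi-video inference could look like once #8905 lands. The `limit_mm_per_prompt` option and the list-valued `multi_modal_data` entry are assumptions based on vLLM's multi-image interface, and the prompt template is only indicative:

```python
import numpy as np
from vllm import LLM, SamplingParams

# Two dummy videos as (num_frames, height, width, 3) uint8 frame arrays.
videos = [
    np.random.randint(0, 256, size=(8, 384, 384, 3), dtype=np.uint8)
    for _ in range(2)
]

# Assumption: limit_mm_per_prompt raises the per-prompt video cap,
# mirroring how vLLM handles multiple images per prompt.
llm = LLM(
    model="llava-hf/llava-onevision-qwen2-7b-ov-hf",
    limit_mm_per_prompt={"video": 2},
)

# One <video> placeholder per video; the list order matches the
# order of the placeholders in the prompt.
prompt = (
    "<|im_start|>user <video><video>\n"
    "Compare these two videos.<|im_end|>"
    "<|im_start|>assistant\n"
)

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"video": videos}},
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```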

@sayakpaul (Contributor) commented

> AFAIK, only LLaVA-OneVision will support multi-video captioning once #8905 is merged, and this is more a question of model capability than of inference-infrastructure capability.

Agreed. Thanks for your input.

Would it be possible to update this thread with an example of multi-video captioning?

@ywang96 (Member) commented Oct 15, 2024

> > AFAIK, only LLaVA-OneVision will support multi-video captioning once #8905 is merged, and this is more a question of model capability than of inference-infrastructure capability.
>
> Agreed. Thanks for your input.
>
> Would it be possible to update this thread with an example of multi-video captioning?

Yea - I'll keep that in mind and update this issue once we have some bandwidth to get #8905 merged. Will probably update our example scripts too!

@DarkLight1337 (Member) commented

Closing as completed since #8905 and #9842 have both been resolved.

@sayakpaul (Contributor) commented

@DarkLight1337 which part of the docs should we refer to for this?

@DarkLight1337 (Member) commented

Please see #9842 on how to use it.

@DarkLight1337 (Member) commented

@litianjian can you add some docs/examples for this?

@sayakpaul (Contributor) commented

@DarkLight1337 thanks!

Is #10020 (comment) a sufficiently good example for me to use as a reference?

@DarkLight1337 (Member) commented

> @DarkLight1337 thanks!
>
> Is #10020 (comment) a sufficiently good example for me to use as a reference?

Yes, that should be clear enough.

@litianjian (Contributor) commented

> @litianjian can you add some docs/examples for this?

Sure.
