[RFC]: Support for video input #7558
Is there a roadmap for video input support in the OpenAI-compatible vLLM server? Thank you
Thank you. It is a very critical feature, but the APIs need careful discussion. I do not currently have a plan to work on it.
@TKONIY Sounds good - we will take it over from here. Thank you for the model contribution!
@ywang96 I am quite interested in using this. I was able to use Qwen for this: #9128 (comment). I want to know how to extend it to multi-video captioning. If you have some pointers for me, that would be helpful!
Cc @TKONIY if #7558 (comment) sounds interesting.
@sayakpaul Sorry, I have not quite followed vLLM development recently. If you want to implement online serving support for video / multi-video, I think you can start from an RFC to define the HTTP API (video input is not yet supported by the official OpenAI API), then modify the corresponding frontend code.
Currently, I am doing this:
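(A minimal sketch of that kind of single-video call with Qwen2-VL, assuming vLLM's offline `LLM.generate()` API; the prompt format follows vLLM's example scripts, and the frame array here is a stand-in for real decoded video.)

```python
import numpy as np
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2-VL-7B-Instruct", max_model_len=8192)

# Qwen2-VL marks the video position with a <|video_pad|> placeholder
# wrapped in vision-start/end tokens.
prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n<|vision_start|><|video_pad|><|vision_end|>"
    "Describe this video.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

# Decoded frames as a (num_frames, height, width, 3) uint8 array; in
# practice these come from a real decoder (e.g. decord or OpenCV).
video = np.zeros((16, 448, 448, 3), dtype=np.uint8)

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"video": video}},
    SamplingParams(temperature=0.2, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```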
Hey @sayakpaul! Sorry for the late reply - this is definitely interesting. AFAIK, only LLaVA-OneVision will support multi-video captioning once #8905 is merged, and this is more a matter of model capability than inference infrastructure capability (to my knowledge, video-language inference is more or less the same as multi-image inference, so it's a question of whether the model itself was trained to do this task).
Agreed. Thanks for your input. Would it be possible to update this thread with an example of multi-video captioning?
Yea - I'll keep that in mind and update this issue once we have some bandwidth to get #8905 merged. Will probably update our example scripts too!
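(In the meantime, here is a hedged sketch of what multi-video captioning could look like once #8905 lands, assuming the multi-image conventions - a list of frame arrays plus `limit_mm_per_prompt` - carry over to video, and that LLaVA-OneVision takes one `<video>` placeholder per clip. The model name and prompt format are taken from vLLM's example scripts; the rest is an assumption, not a confirmed API.)

```python
import numpy as np
from vllm import LLM, SamplingParams

# Assumption: multi-video mirrors vLLM's multi-image interface, where
# limit_mm_per_prompt raises the per-prompt cap and the data is a list.
llm = LLM(
    model="llava-hf/llava-onevision-qwen2-7b-ov-hf",
    limit_mm_per_prompt={"video": 2},
)

# Assumption: one <video> placeholder per clip, in the same order as the
# list passed below.
prompt = (
    "<|im_start|>user <video><video>\n"
    "Compare the two videos.<|im_end|>"
    "<|im_start|>assistant\n"
)

videos = [
    np.zeros((8, 384, 384, 3), dtype=np.uint8),  # stand-in for clip 1
    np.zeros((8, 384, 384, 3), dtype=np.uint8),  # stand-in for clip 2
]

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"video": videos}},
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```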
@DarkLight1337 which part of the docs should we refer to for this?
Please see #9842 on how to use it.
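(For reference, a hedged sketch of the online-serving path with the OpenAI Python client, assuming vLLM's `video_url` content part - an extension beyond the official OpenAI API - and a server already launched with a video-capable model. The model name and URL below are illustrative.)

```python
from openai import OpenAI

# Points at a locally running vLLM OpenAI-compatible server, e.g. started
# with: vllm serve llava-hf/LLaVA-NeXT-Video-7B-hf
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

chat = client.chat.completions.create(
    model="llava-hf/LLaVA-NeXT-Video-7B-hf",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this video."},
            # "video_url" is a vLLM extension for video input; it is not
            # part of the official OpenAI chat completions API.
            {"type": "video_url",
             "video_url": {"url": "http://example.com/sample.mp4"}},
        ],
    }],
    max_tokens=128,
)
print(chat.choices[0].message.content)
```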
@litianjian can you add some docs/examples for this?
@DarkLight1337 thanks! Is #10020 (comment) a good enough example for me to use as a reference?
Yes, that should be clear enough.
|
Motivation.
Currently, models like `llava-hf/llava-next-video*` recognize image and video inputs with different tokens and perform different computations on them. Therefore, vLLM should provide new APIs and inference support for video input.

Proposed Change.
API
- `LLM.generate()` API for video
- OpenAI-compatible chat completion APIs

Roadmap
- `VideoPlugin` for `MultiModalPlugin` (a rough sketch follows below)
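(A rough sketch of what such a plugin could look like, assuming a `MultiModalPlugin` base class with a data-key accessor and a default input mapper. The import path, method names, and processor call below are assumptions for illustration, not the final design.)

```python
import numpy as np
from transformers import AutoProcessor

from vllm.multimodal.base import MultiModalPlugin  # assumed import path


class VideoPlugin(MultiModalPlugin):
    """Sketch: routes "video" entries in multi_modal_data to a mapper."""

    def get_data_key(self) -> str:
        # The key under which callers pass videos in multi_modal_data.
        return "video"

    def _default_input_mapper(self, ctx, data):
        # Hypothetical default mapper: run decoded frames through the
        # model's HuggingFace processor to obtain pixel values.
        if isinstance(data, np.ndarray):
            processor = AutoProcessor.from_pretrained(ctx.model_config.model)
            return processor(videos=data, return_tensors="pt")
        raise TypeError(f"Unsupported video input type: {type(data)}")
```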
Feedback Period.
A week
CC List.
@DarkLight1337
@zifeitong
@ywang96
Any Other Things.
No response