
[Feature]: Inquiry about Multi-modal Support in VLLM for MiniCPM-V2.6 #7546

Closed
Dong148 opened this issue Aug 15, 2024 · 4 comments

Dong148 commented Aug 15, 2024

🚀 The feature, motivation and pitch

I am currently exploring the capabilities of the VLLM library and am interested in understanding its support for multi-modal inputs, particularly for models like MiniCPM-V2.6. I would like to know if VLLM is designed to handle multi-image and video inputs for such models.

Alternatives

  1. Model of Interest: MiniCPM-V2.6
  2. Types of Input: Multi-image and video
  3. Current Understanding:
    • I have reviewed the documentation and initial examples provided with VLLM.
    • It seems that neither multiple 'image_url' entries nor a list value for 'image_url' is currently supported.
    • However, I am not sure whether it supports processing multiple images or videos as input to a model like MiniCPM-V2.6.

Questions

  1. Does VLLM support the integration of MiniCPM-V2.6 for processing multi-image and video inputs?
  2. If yes, could you provide an example or a guide on how to set up and use this feature?
  3. If not, are there any plans to extend VLLM's capabilities to support such inputs in the future?

Additional context

[screenshot attached]

DarkLight1337 (Member) commented Aug 15, 2024

Multi-image input is currently supported for MiniCPM-V specifically (#7122), with some caveats:

  • It only works in offline inference, not the OpenAI API-compatible server.
  • Until the next release, you have to build from source (main branch) to use it.

We are actively working on extending the support for multi-image input - please refer to #4194 for details.
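
For illustration, a minimal offline-inference sketch along the lines of the vLLM multi-image examples might look like the following. The image paths, the `(<image>./</image>)` placeholder format (taken from the model card), and the `limit_mm_per_prompt` setting are assumptions and may differ between versions; see #7122 and the official examples for the exact usage.

```python
# Minimal sketch: offline multi-image inference with MiniCPM-V-2.6 in vLLM.
# Assumptions: local image paths, the "(<image>./</image>)" placeholder format
# from the model card, and the limit_mm_per_prompt argument; check PR #7122
# and the vLLM examples for the exact, version-specific usage.
from PIL import Image
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "openbmb/MiniCPM-V-2_6"

llm = LLM(
    model=model_name,
    trust_remote_code=True,
    limit_mm_per_prompt={"image": 2},  # allow up to two images per prompt
)

images = [Image.open("photo_1.jpg"), Image.open("photo_2.jpg")]

# Build a prompt with one image placeholder per image, then apply the
# model's chat template.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
placeholders = "\n".join("(<image>./</image>)" for _ in images)
messages = [{"role": "user", "content": f"{placeholders}\nDescribe the two images."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

outputs = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"image": images},  # list of images = multi-image input
    },
    sampling_params=SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```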

Dong148 (Author) commented Aug 15, 2024

> Multi-image input is currently supported for MiniCPM-V specifically (#7122), with some caveats:
>
>   • It only works in offline inference, not the OpenAI API-compatible server.
>   • Until the next release, you have to build from source (main branch) to use it.
>
> We are actively working on extending the support for multi-image input - please refer to #4194 for details.

Thank you for your assistance and for taking the time to help me out. I look forward to exploring more features of VLLM and potentially contributing to its development in the future.

Dong148 closed this as completed Aug 15, 2024
Patrick10203 commented

>   • It only works in offline inference, not the OpenAI API-compatible server.
>   • Until the next release, you have to build from source (main branch) to use it.

Are you sure that building the main branch supports multi-image input over the OpenAI API? The check at https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/chat_utils.py#L179 is still in the main branch.

DarkLight1337 (Member) commented Aug 23, 2024

> >   • It only works in offline inference, not the OpenAI API-compatible server.
> >   • Until the next release, you have to build from source (main branch) to use it.
>
> Are you sure that building the main branch supports multi-image input over the OpenAI API? The check at https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/chat_utils.py#L179 is still in the main branch.

I was referring to multi-modal support for MiniCPM-V specifically, not for multi-modal models (+OpenAI server) in general.
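
For reference, the kind of multi-image request being discussed for the OpenAI-compatible server would look roughly like the sketch below (the host, port, and image URLs are placeholders). As clarified above, such requests were not yet accepted at the time of this thread because of the single-image check in chat_utils.py.

```python
# Sketch of a multi-image chat completion request against a vLLM
# OpenAI-compatible server (placeholder host/port and image URLs).
# At the time of this thread, the server rejects prompts with more than
# one image (see the chat_utils.py check referenced above).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openbmb/MiniCPM-V-2_6",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Compare these two images."},
                {"type": "image_url", "image_url": {"url": "https://example.com/a.jpg"}},
                {"type": "image_url", "image_url": {"url": "https://example.com/b.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```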
