Track progress for VLMs refactoring #33374

Open
13 of 16 tasks
zucchini-nlp opened this issue Sep 8, 2024 · 1 comment
Labels: Generation, Multimodal, Vision, WIP

zucchini-nlp (Member) commented Sep 8, 2024

This issue tracks progress on improving the handling and testing of Vision-Language Models (VLMs). The main goals are to enable and extend generation tests, support other generation techniques such as assisted decoding, and ensure all models pass CI checks.

I have already started working on this and have merged/opened some PRs. This issue should help us track how much is left until VLMs are standardized from a modeling code perspective.

  • Enable Generation Tests for VLMs

    • Merged a PR to calculate and expand text with "image" tokens in processing. VLMs currently add only one placeholder per visual. During the modeling phase, we expand the inputs to match the actual length of image embeddings. This approach limits the functionality of generate(), especially in enabling other cache formats and torch.compile, and introduces hidden bugs. (Expand inputs in processors for VLMs #30962)
    • Verify that the addition of processor_config.json on the hub does not break existing functionality. Related discussion on Slack: https://huggingface.slack.com/archives/C01N44FJDHT/p171957701917237. TL;DR: we can't avoid breaking backward compatibility (BC), but we still want the feature because it has so many benefits. So we'll just try again and hope that users no longer use the old version.
  • Fix Failing Edge Cases in Current VLMs

  • Add Generation Tests to VLM Classes
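
To make the first item concrete, here is a minimal, library-free sketch of what expanding placeholders at processing time could look like. It is not the code from #30962: the function names, the `<image>` placeholder string, and the patch-count formula (a ViT-style encoder with square patches and no extra tokens such as CLS) are all assumptions made for illustration.

```python
from typing import List


def num_image_tokens(image_size: int, patch_size: int) -> int:
    """Hypothetical helper: patch count for a square image.

    Assumes a ViT-style encoder with no CLS or other extra tokens.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    return (image_size // patch_size) ** 2


def expand_image_placeholders(
    text: str, tokens_per_image: List[int], image_token: str = "<image>"
) -> str:
    """Repeat the i-th placeholder in `text` tokens_per_image[i] times."""
    parts = text.split(image_token)
    if len(parts) - 1 != len(tokens_per_image):
        raise ValueError("number of placeholders does not match number of images")
    expanded = [parts[0]]
    for count, tail in zip(tokens_per_image, parts[1:]):
        expanded.append(image_token * count)  # one placeholder per image embedding
        expanded.append(tail)
    return "".join(expanded)


# The prompt format below is illustrative only.
prompt = "USER: <image>\nWhat is shown? ASSISTANT:"
expanded = expand_image_placeholders(prompt, [num_image_tokens(336, 14)])
print(expanded.count("<image>"))  # 576
```

With the text already expanded, the tokenized sequence length matches the merged text-plus-image embedding length before the model is called, so the modeling code would not need to resize its inputs in the forward pass.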


zucchini-nlp added the WIP, Vision, Generation, and Multimodal labels Sep 8, 2024
zucchini-nlp changed the title from "Progress Tracking for VLMs Refactoring" to "Track progress for VLMs refactoring" Sep 8, 2024
zucchini-nlp (Member, Author) commented

cc @gante 😄
