Track progress for VLMs refactoring #33374
Labels: Generation, Multimodal, Vision, WIP
This issue tracks progress on improving the handling and testing of Vision-Language Models (VLMs). The main goals are to enhance and enable generation tests, support other generation techniques such as assisted decoding, and ensure all models pass CI checks.

I have already started working on this and have merged or opened some PRs. This issue should help us track how much is left until VLMs are standardized from the modeling-code perspective.
## Enable Generation Tests for VLMs

- [ ] Expand multimodal inputs in the processors rather than in `generate()`: the current in-`generate()` expansion gets in the way of enabling other cache formats and `torch.compile`, and introduces hidden bugs. (Expand inputs in processors for VLMs #30962)
- [ ] Verify that adding `processor_config.json` on the Hub does not break existing functionality. Related discussion on Slack: https://huggingface.slack.com/archives/C01N44FJDHT/p171957701917237. TL;DR: we can't avoid breaking backward compatibility, but we still want the feature because it has many benefits, so we'll try again and hope that users no longer rely on the old version.

## Fix Failing Edge Cases in Current VLMs
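For context, the kind of expansion that moves out of `generate()` and into the processors can be sketched roughly like this. This is a minimal illustration only; `expand_image_tokens`, the prompt format, and the token count are hypothetical, not the actual transformers code:

```python
# Hypothetical sketch of processor-side input expansion: a single "<image>"
# placeholder in the prompt is expanded to `num_image_tokens` copies before
# tokenization, so generate() never has to resize the sequence mid-call.
# All names here are illustrative, not the real transformers API.

def expand_image_tokens(text: str, num_image_tokens: int, image_token: str = "<image>") -> str:
    """Replace each image placeholder with the full run of image tokens."""
    return text.replace(image_token, image_token * num_image_tokens)

prompt = "USER: <image>\nWhat is shown here? ASSISTANT:"
expanded = expand_image_tokens(prompt, num_image_tokens=4)
print(expanded.count("<image>"))  # 4: one placeholder per image-embedding slot
```

With expansion done up front, the input sequence already has the final length when it reaches the model, which is what makes static caches and `torch.compile` feasible.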
- [ ] Add a `num_image_tokens` attribute for specifying the image sequence length. It ensures the text is expanded to the correct length for the given image backbone; without it, we currently cannot reuse the same processing class across different image backbones. (VLMs: `patch_size` -> `num_image_tokens` in processing #33424)

## Add Generation Tests to VLM Classes
- [x] Already added in LLaVA-Onevision and Qwen2-VL (Llava Onevision: add model #32673, Qwen2-VL: clean-up and add more tests #33354)
- [ ] Implement `GenerationTesterMixin` tests that use both image and text inputs; the current tests accept only text as input. Enable this for all models except BLIP (draft available locally).
- [ ] Add tests for the Idefics models and fix the Mllama tests, which are a bit different from the llava-style ones. (Idefics: enable generation tests #34062)
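A rough sketch of what preparing multimodal generation-test inputs might look like, in the spirit of extending the tester mixin. All names below are illustrative stand-ins, not the actual helpers from the transformers test suite:

```python
# Hypothetical input preparation for a generation test that carries both
# text token ids and image tensors, llava-style: a run of image placeholder
# ids is spliced in front of the text ids, and dummy pixel values ride along.

import random

def prepare_multimodal_inputs(batch_size=2, seq_len=7, num_image_tokens=4,
                              vocab_size=100, image_token_id=99):
    """Build dummy input_ids containing image placeholder ids, plus fake pixels."""
    input_ids = []
    for _ in range(batch_size):
        text_ids = [random.randrange(vocab_size - 1) for _ in range(seq_len)]
        # image placeholders go first, as llava-style processors arrange them
        input_ids.append([image_token_id] * num_image_tokens + text_ids)
    pixel_values = [[0.0] * 16 for _ in range(batch_size)]  # stand-in for real tensors
    return {"input_ids": input_ids, "pixel_values": pixel_values}

batch = prepare_multimodal_inputs()
print(len(batch["input_ids"][0]))  # 11 = 4 image tokens + 7 text tokens
```

The point of such a helper is that the same test body can then exercise `generate()` with or without the image inputs, instead of the text-only paths the current tests cover.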
## Special Case for BLIP
- [ ] BLIP's `main_input_name` is `pixel_values`, not `input_ids` as in other models. Check that relying on the model's `main_input_name` in tests does not cause red CI. (related to or fixed by Generate tests: modality-agnostic input preparation #33685)
- [ ] `BOS` token handling in the modeling code. (BLIP: enable generation tests #34174)

## Finalizing CI for VLMs
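One recurring source of CI failures in composite models is the attention backend: setting it on the top-level config has to reach every sub-model. A minimal sketch of that propagation, with illustrative class and attribute names rather than the actual transformers implementation:

```python
# Hypothetical composite config: one requested attention backend should be
# propagated to all sub-configs (vision tower, language model, ...), so the
# whole model runs a consistent implementation.

class SubConfig:
    def __init__(self):
        self._attn_implementation = "eager"  # default backend

class CompositeConfig:
    def __init__(self):
        self.vision_config = SubConfig()
        self.text_config = SubConfig()

    def set_attn_implementation(self, impl: str) -> None:
        """Propagate the requested attention backend to every sub-config."""
        for sub in (self.vision_config, self.text_config):
            sub._attn_implementation = impl

cfg = CompositeConfig()
cfg.set_attn_implementation("sdpa")
print(cfg.text_config._attn_implementation)  # sdpa
```

Without this kind of propagation, a sub-model can silently keep its default backend, which is exactly the class of mismatch the CI failures surface.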
- [ ] Fix `attn_implementation`-related failures to make CI fully happy for VLMs. (Attn implementation for composite models #32238)