[Llama3.2-11b-vision] Add support for text-only inference through generator api #17105
Ticket
tenstorrent/vllm#53
Problem description
What's changed
- Added `text_only_inference` to the cross attention transformer to skip the cross attention layers entirely (possible since prefill is done with 1 user at a time); a simplified sketch of this control flow follows the list
- Used `full_text_mask` in `TtLlamaCrossAttentionTransformerBlock::forward`, propagating the `full_text_mask_expand_11SD` input
- Updated `simple_vision_demo.py` for processing a batch with mixed text-only and text-image prompts (see the batching sketch below)
- Updated `llama_vision_model.py`
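For context, here is a minimal PyTorch-style sketch of the control flow described above: a `text_only_inference` flag that bypasses the cross attention path, and a full-text mask that zeroes the cross attention contribution for text-only rows. The class, tensor shapes, gate, and mask convention are illustrative assumptions, not the actual `TtLlamaCrossAttentionTransformerBlock` / ttnn implementation.

```python
# Simplified stand-in (plain PyTorch, not ttnn) showing how a text_only_inference
# flag can skip cross attention and how a full-text mask can zero its output.
from typing import Optional

import torch
import torch.nn as nn


class CrossAttentionBlockSketch(nn.Module):
    """Illustrative cross attention transformer block (hypothetical)."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # learned gate on the cross-attn path

    def forward(
        self,
        x: torch.Tensor,                         # [batch, seq, dim] text hidden states
        vision_tokens: Optional[torch.Tensor],   # [batch, vis_seq, dim] or None
        full_text_mask: Optional[torch.Tensor],  # [batch, seq, 1]; 0.0 at text-only rows (assumed convention)
        text_only_inference: bool = False,
    ) -> torch.Tensor:
        # Self attention always runs.
        attn_out, _ = self.self_attn(x, x, x)
        x = x + attn_out

        # With text_only_inference set (e.g. prefilling a single text-only user),
        # the cross attention path is skipped entirely.
        if text_only_inference or vision_tokens is None:
            return x

        # Otherwise run cross attention against the vision tokens, zeroing its
        # contribution wherever the mask marks rows as text-only.
        xattn_out, _ = self.cross_attn(x, vision_tokens, vision_tokens)
        if full_text_mask is not None:
            xattn_out = xattn_out * full_text_mask
        return x + self.gate.tanh() * xattn_out
```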
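Similarly, a rough sketch of how a demo loop could prefill a mixed batch one user at a time, passing `text_only_inference` for users without images. The `Prompt` container and the `encode_image` / `tokenize` / `prefill_forward` helpers are hypothetical stand-ins, not the actual `simple_vision_demo.py` interface.

```python
# Hypothetical sketch of handling a mixed batch of text-only and text+image
# prompts during per-user prefill (names below are illustrative).
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Prompt:
    text: str
    image: Optional[Any] = None  # None marks a text-only prompt


def prefill_mixed_batch(model, prompts: list[Prompt]):
    """Prefill each user separately, skipping vision preprocessing and
    cross attention for text-only prompts."""
    outputs = []
    for user_id, prompt in enumerate(prompts):
        text_only = prompt.image is None
        vision_tokens = None if text_only else model.encode_image(prompt.image)
        tokens = model.tokenize(prompt.text)
        # text_only_inference tells the transformer to bypass cross attention.
        logits = model.prefill_forward(
            tokens,
            vision_tokens=vision_tokens,
            user_id=user_id,
            text_only_inference=text_only,
        )
        outputs.append(logits)
    return outputs
```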
Checklist