[FIX] MM Eval Mask Sizes #1920
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1920
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit e988093 with merge base d338066. This comment was automatically generated by Dr. CI and updates every 15 minutes.
@@ -426,7 +426,8 @@ def padded_collate_tiled_images_and_mask(
     if pad_max_images is not None:
         _, _, img_seq = concat_masks.shape
         concat_masks = F.pad(
-            concat_masks, (0, pad_max_images * image_seq_len - img_seq)
+            concat_masks,
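For context, here is a minimal standalone sketch (not the torchtune source) of what this F.pad call does: it extends the last, image-sequence dimension of the concatenated cross-attention masks on the right so that every sample looks as if it contains pad_max_images images. All sizes below are illustrative.

```python
import torch
import torch.nn.functional as F

text_seq_len = 10   # number of text tokens (illustrative)
image_seq_len = 6   # mask width contributed by a single image (illustrative)
pad_max_images = 4  # pad masks as if every sample contained 4 images

# Mask for a sample that actually contains 2 images.
concat_masks = torch.ones(1, text_seq_len, 2 * image_seq_len)

# F.pad takes (left, right) padding for the last dimension; padding only on
# the right grows the image axis to pad_max_images * image_seq_len columns.
_, _, img_seq = concat_masks.shape
concat_masks = F.pad(concat_masks, (0, pad_max_images * image_seq_len - img_seq))

print(concat_masks.shape)  # torch.Size([1, 10, 24])
```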
Where does the pad to max images happen? This is just padding the masks to the max number of images? And would pad direction affect image padding at all?
You don’t actually need to add them or you’d waste compute on them. If you have a KV cache for 7 images, then you want to mask out the additional images. It’s similar to how you mask extra token positions during inference but don’t add padding tokens.
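A toy illustration of this point (generic code, not torchtune's implementation): the cross-attention mask is sized for the maximum number of images, and the columns belonging to images that are not actually present simply stay masked out, so no padding images are ever encoded.

```python
import torch

max_images = 4        # mask/cache is sized for up to 4 images (illustrative)
tokens_per_image = 6  # mask width per image (illustrative)
text_len = 10
num_real_images = 2   # this sample only contains 2 real images

# Start with nothing attended to (False everywhere).
mask = torch.zeros(text_len, max_images * tokens_per_image, dtype=torch.bool)

# Text tokens may attend only to positions belonging to the real images; the
# columns for images 3 and 4 stay False, analogous to masking padded token
# positions at inference instead of appending pad tokens.
mask[:, : num_real_images * tokens_per_image] = True
```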
Context
What is the purpose of this PR?
The issue, first reported in #1874, was that eval fails for Llama 3.2 Vision 11B. The root cause was that the VisionCrossAttentionMask transform was padding the masks to 4 tiles during inference, while the padded_collate_tiled_images_and_mask function assumed the masks weren't padded and therefore inferred incorrect shape information.
The solution is to remove any inference-time padding logic from the mask transform and to pass pad_max_tiles=4 to the collate function during inference and eval, letting the collate function handle all the padding; a sketch of this wiring follows.
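A hedged sketch of that wiring (not the exact recipe code, and it assumes the collate function accepts the pad_max_tiles argument described above):

```python
from functools import partial

from torchtune.data import padded_collate_tiled_images_and_mask

# Assumed wiring for inference/eval: all tile padding happens in the collate
# function rather than in the mask transform, by fixing pad_max_tiles=4
# (Llama 3.2 Vision uses at most 4 tiles per image).
collate_fn = partial(padded_collate_tiled_images_and_mask, pad_max_tiles=4)

# collate_fn can then be passed to the eval/generation DataLoader as usual.
```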
During this investigation I also found that padded_collate_tiled_images_and_mask was using the image_seq_len variable left over from the last iteration of a loop, so if a batch contained multiple images with different sizes the value was wrong once the loop finished. This is updated as well; the bug pattern is illustrated below.
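A generic illustration of that bug pattern (hypothetical values, not the original function body):

```python
image_seq_lens = [24, 36, 12]  # hypothetical per-image mask widths

# image_seq_len is (re)assigned on every iteration, so after the loop it only
# reflects the *last* image.
for image_seq_len in image_seq_lens:
    ...  # build the per-image mask using image_seq_len

# image_seq_len == 12 here even though an earlier image needed 36 columns, so
# any padding computed from it is too small whenever images differ in size.
# One safe alternative (illustrative) is to track the maximum explicitly:
max_image_seq_len = max(image_seq_lens)
```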
Changelog
What are the changes made in this PR?
Test plan
Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.
- run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
- run unit tests via pytest tests
- run recipe tests via pytest tests -m integration_test
Ran as usual:
tune run dev/generate_v2 --config llama3_2_vision/generation_v2
Ran as usual:
tune run full_finetune_single_device --config llama3_2_vision/11B_full_single_device
Fixed:
tune run eleuther_eval --config llama3_2_vision/evaluation