
[FIX] MM Eval Mask Sizes #1920

Merged: 3 commits merged on Oct 30, 2024

Conversation

@pbontrager (Contributor) commented on Oct 29, 2024

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

The issue, first reported in #1874, is that eval fails for Llama 3.2 Vision 11B. The root cause is that the VisionCrossAttentionMask transform was padding the masks to 4 tiles during inference, while the padded_collate_tiled_images_and_mask function assumed the masks weren't padded and therefore inferred incorrect shape information.

The solution is to remove all inference-time padding logic from the mask transform and instead pass pad_max_tiles=4 to the collate function during inference and eval, letting the collate function handle all of the padding.
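
For illustration, a minimal sketch of what that wiring might look like on the inference/eval side. It assumes padded_collate_tiled_images_and_mask accepts the pad_max_tiles and pad_direction keyword arguments discussed in this PR; the exact call sites and argument values in generate_v2 and eleuther_eval may differ.

```python
from functools import partial

from torchtune.data import padded_collate_tiled_images_and_mask

# Inference/eval: the mask transform no longer pads to 4 tiles, so the collate
# function is asked to pad the cross-attention masks (and tiles) itself.
collate_fn = partial(
    padded_collate_tiled_images_and_mask,
    pad_direction="left",  # left padding is typical for generation; training pads right
    pad_max_tiles=4,       # pad masks as if every image had 4 tiles
)

# collate_fn is then used wherever the recipe batches samples, e.g. as a
# DataLoader collate_fn or called directly on a list of tokenized samples.
```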

During this investigation I also found that padded_collate_tiled_images_and_mask was reading the image_seq_len variable from the last iteration of a loop, so with multiple images of different sizes the variable could hold the wrong value by the end. This is fixed here as well.
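
To make the second bug concrete, here is a simplified, self-contained sketch (the tile/token numbers are illustrative and the loop body is condensed, not the collate function's actual code): the padded mask width is derived from tokens_per_tile * max_num_tiles, which are constant across the batch, rather than from an image_seq_len left over from the last loop iteration.

```python
import torch
import torch.nn.functional as F

tokens_per_tile = 1601   # e.g. (560 // 14) ** 2 patches per tile, plus CLS
max_num_tiles = 4
pad_max_images = 4

# One sample with two images of different tile counts: each image's mask is
# (text_len, n_tiles * tokens_per_tile).
per_image_masks = [
    torch.ones(10, 2 * tokens_per_tile),
    torch.ones(10, 3 * tokens_per_tile),
]

# Pad every image's mask to the max tile count, then concatenate along the image axis.
padded = [
    F.pad(m, (0, max_num_tiles * tokens_per_tile - m.shape[-1])) for m in per_image_masks
]
concat_masks = torch.cat(padded, dim=-1).unsqueeze(0)  # (bsz=1, text_len, n_imgs * 4 * 1601)

# Buggy version: image_seq_len was whatever the loop left behind (here the last
# image's 3 * tokens_per_tile), so the pad width below could come out wrong.
# Fixed version: compute it from batch-constant quantities.
image_seq_len = tokens_per_tile * max_num_tiles

_, _, img_seq = concat_masks.shape
concat_masks = F.pad(concat_masks, (0, pad_max_images * image_seq_len - img_seq))
print(concat_masks.shape)  # torch.Size([1, 10, 25616]) == (1, 10, 4 * 4 * 1601)
```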

Changelog

What are the changes made in this PR?

  • Set pad_max_tiles in generation_v2 and eval
  • Removed pad-to-max-tiles from the vision encoder mask transform
  • Replaced image_seq_len with tokens_per_tile * max_num_tiles in the collate function
  • Updated the collate tests to cover the extra tiling

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

Ran as usual

tune run dev/generate_v2 --config llama3_2_vision/generation_v2

Ran as usual

tune run full_finetune_single_device --config llama3_2_vision/11B_full_single_device

Fixed

tune run eleuther_eval --config llama3_2_vision/evaluation

[Screenshot: eval output, 2024-10-29 4:11 PM]


pytorch-bot bot commented Oct 29, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1920

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit e988093 with merge base d338066:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Oct 29, 2024
@pbontrager changed the title from [FIX] to [FIX] MM Eval Mask Sizes Oct 29, 2024
```diff
@@ -426,7 +426,8 @@ def padded_collate_tiled_images_and_mask(
     if pad_max_images is not None:
         _, _, img_seq = concat_masks.shape
         concat_masks = F.pad(
-            concat_masks, (0, pad_max_images * image_seq_len - img_seq)
+            concat_masks,
```
@RdoubleA (Contributor) commented on Oct 29, 2024:

where does the pad to max images happen? this is just padding the masks to max num images? and would pad direction affect image padding at all?

@pbontrager (Contributor, Author) replied:

You don't actually need to add them or you'd waste compute on them. If you have a KV cache sized for 7 images, then you want to mask out the additional images. It's similar to how you mask extra token positions during inference but don't add padding tokens.
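
To illustrate the point (shapes are made up for the sketch, not the recipe's actual tensors): only the real images are encoded, while the mask is widened to cover the cache slots reserved for pad_max_images images, so those positions are simply never attended to.

```python
import torch
import torch.nn.functional as F

tokens_per_tile, max_num_tiles, pad_max_images = 1601, 4, 4
image_seq_len = tokens_per_tile * max_num_tiles

# Only the 2 real images are encoded; no padding images are created.
images = torch.randn(1, 2, max_num_tiles, 3, 560, 560)  # (bsz, n_imgs, tiles, c, h, w)

# The cross-attention mask, however, is padded out to the cache size, so the
# positions reserved for images 3 and 4 are masked out (pad value 0).
mask = torch.ones(1, 10, 2 * image_seq_len)
mask = F.pad(mask, (0, pad_max_images * image_seq_len - mask.shape[-1]))

print(images.shape)  # unchanged: torch.Size([1, 2, 4, 3, 560, 560])
print(mask.shape)    # torch.Size([1, 10, 25616]) -> covers 4 images' worth of positions
```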
