Remove pad_max_tiles in CLIP #1836

pbontrager · 2024-10-15T18:05:15Z

Context

What is the purpose of this PR? Is it to

add a new feature
fix a bug
update tests and/or documentation
other (please add here)

Currently the CLIP transform has a pad_max_tiles option that pads every image to 4 tiles. The CLIP model doesn't mask out the padding tiles, so by default we always added them in case the model relied on these extra tokens. In testing, however, I found that the model completely ignores the padding tiles, so they don't need to be included unless the batch requires them. In addition, doing the pad_max_tiles padding in the CLIP transform instead of in padded_collate_tiled_images_and_mask leads to a subtle bug: downstream, the cross attention mask doesn't know which tiles are padding and should be masked and which are not.

Changelog

Remove pad_max_tiles from CLIPTransform
Add pad_max_tiles to padded_collate_tiled_images_and_mask
- Padding during collation (as we do for all other padding) ensures the mask transforms are accurate
Set pad_max_tiles to None by default
- Not padding more than necessary enables much faster training
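
To illustrate why padding at collation time keeps the masks accurate, here is a minimal pure-Python sketch, not the torchtune implementation. The helper names `resolve_num_tiles` and `tile_padding_mask` are hypothetical, and raising an error when `pad_max_tiles` is smaller than the largest image is an assumption about the intended behavior. The point is that the collate step knows each image's real tile count, so it can record exactly which tiles are padding.

```python
from typing import List, Optional


def resolve_num_tiles(tile_counts: List[int], pad_max_tiles: Optional[int] = None) -> int:
    """Pick the tile count that every image in the batch is padded to."""
    largest = max(tile_counts)
    if pad_max_tiles is None:
        # Default described in the changelog: pad only to the largest image in the batch.
        return largest
    if pad_max_tiles < largest:
        # Assumed behavior: a cap smaller than a real image is treated as an error.
        raise ValueError(
            f"pad_max_tiles={pad_max_tiles} is smaller than the largest image ({largest} tiles)"
        )
    return pad_max_tiles


def tile_padding_mask(
    tile_counts: List[int], pad_max_tiles: Optional[int] = None
) -> List[List[bool]]:
    """Return one row per image: True marks a real tile, False marks a padding tile."""
    target = resolve_num_tiles(tile_counts, pad_max_tiles)
    return [[i < n for i in range(target)] for n in tile_counts]


# Two images with 1 and 2 real tiles.
print(tile_padding_mask([1, 2]))
# [[True, False], [True, True]]
print(tile_padding_mask([1, 2], pad_max_tiles=4))
# [[True, False, False, False], [True, True, False, False]]
```

If the padding happened earlier, inside the image transform, every image would arrive at collation with the same number of tiles and this information would no longer be recoverable.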

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.

run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
add unit tests for any new functionality
update docstrings for any new or updated methods or classes
run unit tests via pytest tests
run recipe tests via pytest tests -m integration_test
manually run any new or modified recipes with sufficient proof of correctness
include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

This shows three runs of tune run full_finetune_single_device --config llama3_2_vision/11B_full_single_device. The "original" run is what's currently in main, "new_pad_to_4" uses the fixed padding but still adds the extra padding tiles, and "new_pad_to_batch" pads only to the max number of tiles in the batch. All three runs have the same loss, suggesting that the padding isn't important for model quality, but pad_to_batch gets almost double the qps on the OCR dataset since the cross attention sequence lengths can be much shorter.
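
As a rough back-of-the-envelope sketch of why fewer tiles shortens the cross attention sequence: the tile size of 560 comes from the config's image_size, while the patch size of 14 and the one extra token per tile are assumptions made for illustration, and the helper names are hypothetical.

```python
def tokens_per_tile(tile_size: int, patch_size: int, extra_tokens: int = 1) -> int:
    """Number of encoder tokens produced for one square tile."""
    patches_per_side = tile_size // patch_size
    return patches_per_side * patches_per_side + extra_tokens


def encoder_seq_len(num_tiles: int, tile_size: int = 560, patch_size: int = 14) -> int:
    """Encoder tokens the decoder cross-attends to for one image."""
    return num_tiles * tokens_per_tile(tile_size, patch_size)


for tiles in (1, 2, 4):
    print(tiles, encoder_seq_len(tiles))
# 1 1601
# 2 3202
# 4 6404
```

Under these assumptions, an image that needs only one tile contributes a quarter of the encoder tokens it would when always padded to 4 tiles.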

pytorch-bot · 2024-10-15T18:05:19Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1836

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 6f555db with merge base 4107cc4:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

joecummings

Two comments but otherwise fine.

joecummings · 2024-10-16T15:01:38Z

recipes/configs/llama3_2_vision/11B_lora_single_device.yaml

@@ -32,6 +32,7 @@ tokenizer:
  _component_: torchtune.models.llama3_2_vision.llama3_2_vision_transform
  path: /tmp/Llama-3.2-11B-Vision-Instruct/original/tokenizer.model
  image_size: 560
+  max_seq_len: 8192


Why are you limiting to 8k?

This isn't for the model but for this dataset. A few rows are very long and cause a huge memory spike, so this limits those rows. Most rows are under 8k.
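
A minimal sketch of what a max_seq_len cap does to an unusually long row, assuming plain right truncation. The helper name `truncate_row` is hypothetical, and the actual tokenizer may treat end-of-sequence tokens differently.

```python
from typing import List


def truncate_row(tokens: List[int], max_seq_len: int) -> List[int]:
    """Keep at most max_seq_len tokens; shorter rows pass through unchanged."""
    return tokens[:max_seq_len]


short_row = list(range(100))
long_row = list(range(20_000))

print(len(truncate_row(short_row, 8192)))  # 100
print(len(truncate_row(long_row, 8192)))   # 8192
```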

joecummings · 2024-10-16T15:01:57Z

torchtune/data/_collate.py

@@ -355,6 +359,13 @@ def padded_collate_tiled_images_and_mask(
        for sample in batch
        for image in sample["encoder_input"]["images"]
    )
+    if pad_max_tiles is not None:
+        if pad_max_tiles < max_num_tiles:


max_num_tiles is such a misleading name.

Summary:
Following changes in torchtune:
- pytorch/torchtune#1836
- pytorch/torchtune#1853

Update ET downstream and remove pad-max-tiles from preprocess.

Pull Request resolved: #6295

Test Plan:
With AOTI tests commented out (not working atm):
```
python -m unittest examples/models/llama3_2_vision/preprocess/test_preprocess.py
...
----------------------------------------------------------------------
Ran 4 tests in 21.129s

OK
```

Reviewed By: larryliu0820

Differential Revision: D64481012

Pulled By: lucylq

fbshipit-source-id: e822c235c5555e0682d181c4c482dec7c170c96e

removed pad_max_tiles

7fb6c3e

facebook-github-bot added the CLA Signed label Oct 15, 2024

pbontrager requested a review from ebsmothers October 15, 2024 18:05

pbontrager added 2 commits October 15, 2024 11:32

update unit test

98e8eb8

Added max seq length for ocrvqa recipe

6f555db

joecummings approved these changes Oct 16, 2024

pbontrager merged commit 6a8a027 into pytorch:main Oct 16, 2024
17 checks passed

pbontrager deleted the remove_clip_pad branch October 16, 2024 15:32

This was referenced Oct 16, 2024

Remove pad_max_tiles in CLIP inference #1853

Merged

Remove pad_max_tiles from preprocess pytorch/executorch#6295

Closed
