Llama 3.2 Vision - 90B #1880

Merged
merged 16 commits into pytorch:main from 90b_llamav on Oct 29, 2024

Conversation

felipemello1 (Contributor) commented on Oct 22, 2024

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

This PR adds Llama 3.2 Vision 90B to torchtune. A few issues were found:

  1. [MITIGATED - NEEDS FIX] 11B was using the Meta checkpointer. Switching to HF raised an error regarding the PEFT adapter. For now, we just save the adapter in Meta format.
  2. [MITIGATED - NEEDS ASYNC CKPT] Saving a checkpoint was timing out because rank 0 would stop working. Adding breakpoints solved it.
  3. [FIXED] Optimizer-in-backward doesn't work with MM (multimodal). Removed the option from the configs.
  4. [NEEDS FIX] QLoRA errors with nproc=8 but works with nproc=2. Set the config to 2.
  5. [NEEDS FIX] Saving a checkpoint in the recipe takes a long time (200-400 s). We should make it optional and add a save frequency. cc: @joecummings
Repro for issue 4 (QLoRA with nproc_per_node=8); it fails with the traceback below:

tune run --nproc_per_node 8 lora_finetune_distributed --config llama3_2_vision/90B_qlora  metric_logger=torchtune.training.metric_logging.WandBLogger compile=True log_peak_memory_stats=True enable_activation_checkpointing=True max_steps_per_epoch=25 gradient_accumulation_steps=1 epochs=1 tokenizer.max_seq_len=2048 batch_size=6 num_warmup_steps=0 save_adapter_weights_only=True
[rank4]:     self._model = self._setup_model(
[rank4]:                   ^^^^^^^^^^^^^^^^^^
[rank4]:   File "/data/users/felipemello/torchtune/recipes/lora_finetune_distributed.py", line 456, in _setup_model
[rank4]:     training.shard_model(
[rank4]:   File "/data/users/felipemello/torchtune/torchtune/training/_distributed.py", line 674, in shard_model
[rank4]:     fully_shard(m, **fsdp_kwargs)
[rank4]:   File "/home/felipemello/.conda/envs/torchtune/lib/python3.11/site-packages/torch/distributed/_composable/contract.py", line 125, in wrapper
[rank4]:     updated = func(inp_module, *args, **kwargs)
[rank4]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/home/felipemello/.conda/envs/torchtune/lib/python3.11/site-packages/torch/distributed/_composable/fsdp/fully_shard.py", line 132, in fully_shard
[rank4]:     state._fsdp_param_group = FSDPParamGroup(
[rank4]:                               ^^^^^^^^^^^^^^^
[rank4]:   File "/home/felipemello/.conda/envs/torchtune/lib/python3.11/site-packages/torch/distributed/_composable/fsdp/_fsdp_param_group.py", line 114, in __init__
[rank4]:     self.fsdp_params = [
[rank4]:                        ^
[rank4]:   File "/home/felipemello/.conda/envs/torchtune/lib/python3.11/site-packages/torch/distributed/_composable/fsdp/_fsdp_param_group.py", line 115, in <listcomp>
[rank4]:     FSDPParam(
[rank4]:   File "/home/felipemello/.conda/envs/torchtune/lib/python3.11/site-packages/torch/distributed/_composable/fsdp/_fsdp_param.py", line 231, in __init__
[rank4]:     self._init_sharded_param(param, device)
[rank4]:   File "/home/felipemello/.conda/envs/torchtune/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank4]:     return func(*args, **kwargs)
[rank4]:            ^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/home/felipemello/.conda/envs/torchtune/lib/python3.11/site-packages/torch/distributed/_composable/fsdp/_fsdp_param.py", line 335, in _init_sharded_param
[rank4]:     chunks = _chunk_with_empty(param_data, shard_world_size, dim=0)
[rank4]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/home/felipemello/.conda/envs/torchtune/lib/python3.11/site-packages/torch/distributed/_composable/fsdp/_fsdp_common.py", line 95, in _chunk_with_empty
[rank4]:     chunks = list(torch.chunk(tensor, num_chunks, dim=dim))
[rank4]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/home/felipemello/.conda/envs/torchtune/lib/python3.11/site-packages/torchao/dtypes/nf4tensor.py", line 850, in __torch_function__
[rank4]:     return func(*args, **kwargs)
[rank4]:            ^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/home/felipemello/.conda/envs/torchtune/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
[rank4]:     return fn(*args, **kwargs)
[rank4]:            ^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/home/felipemello/.conda/envs/torchtune/lib/python3.11/site-packages/torchao/dtypes/nf4tensor.py", line 831, in __torch_dispatch__
[rank4]:     return NF4_OPS_TABLE[func](func, args, kwargs)
[rank4]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/home/felipemello/.conda/envs/torchtune/lib/python3.11/site-packages/torchao/dtypes/nf4tensor.py", line 195, in nf4_split
[rank4]:     inner_tensor.numel() % num_chunks == 0
[rank4]: AssertionError: quantization_factor.numel() not divisible by 8
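
For context, the failing check in torchao's nf4_split is a divisibility constraint: FSDP2's fully_shard chunks each parameter along dim 0 into world_size pieces, and every inner tensor of an NF4 tensor (including quantization_factor) must split evenly into that many chunks. A toy illustration of the constraint (the element count below is made up, not taken from the real model):

def nf4_inner_tensor_chunkable(inner_numel: int, world_size: int) -> bool:
    # Mirrors the assertion above: inner_tensor.numel() % num_chunks == 0,
    # where num_chunks is the FSDP shard world size (nproc_per_node here).
    return inner_numel % world_size == 0

# A hypothetical inner-tensor size that shards cleanly across 2 ranks but not 8,
# matching the observed "works with nproc=2, fails with nproc=8" behavior.
print(nf4_inner_tensor_chunkable(1_250, 2))  # True
print(nf4_inner_tensor_chunkable(1_250, 8))  # False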

Changelog

What are the changes made in this PR?
  • Add 90B model builders (llama3_2_vision_90b, lora_llama3_2_vision_90b), copied from the 11B builders with updated decoder parameters, plus the corresponding 90B configs.
  • Only register the optimizer-in-backward hook on parameters with requires_grad=True.
  • Add torch.distributed.barrier() calls and timing logs around checkpoint saving to avoid hangs with 70B+ models.

Checkpoint-saving logs from a 90B run:

INFO:torchtune.utils._logging:Saving checkpoint. This may take some time. Retrieving full model state dict...
INFO:torchtune.utils._logging:Getting full model state dict took 1.25 secs
INFO:torchtune.utils._logging:Retrieving optimizer state dict...
INFO:torchtune.utils._logging:Getting optimizer state dict took 1.86 secs
INFO:torchtune.utils._logging:Adapter checkpoint of size 0.04 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/adapter_-1.pt
WARNING:torchtune.utils._logging:Saving Llama3.2 Vision adapter weights to PEFT format is not supported, saving to torchtune format instead
WARNING:torchtune.utils._logging:PEFT integration for Llama3.2 Vision is not supported, skipping adapter config save
INFO:torchtune.utils._logging:Recipe checkpoint of size 0.09 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/recipe_state.pt
INFO:torchtune.utils._logging:Saving checkpoint took 0.37 secs
INFO:torchtune.utils._logging:Saving checkpoint. This may take some time, depending on the size of your model. Getting full model state dict...
INFO:torchtune.utils._logging:Getting full model state dict took 142.20 secs
INFO:torchtune.utils._logging:Model checkpoint of size 4.60 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0001_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.66 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0002_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 5.00 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0003_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.97 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0004_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.66 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0005_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.66 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0006_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.66 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0007_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 5.00 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0008_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.97 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0009_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.66 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0010_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.66 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0011_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.66 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0012_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 5.00 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0013_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.97 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0014_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.66 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0015_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.66 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0016_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.66 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0017_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 5.00 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0018_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.97 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0019_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.66 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0020_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.66 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0021_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.66 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0022_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 5.00 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0023_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.97 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0024_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.66 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0025_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.66 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0026_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.66 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0027_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 5.00 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0028_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.97 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0029_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.66 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0030_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.66 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0031_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.66 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0032_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 5.00 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0033_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.97 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0034_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.66 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0035_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.66 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0036_0.pt
INFO:torchtune.utils._logging:Model checkpoint of size 4.88 GB saved to /tmp/Llama-3.2-90B-Vision-Instruct/hf_model_0037_0.pt
INFO:torchtune.utils._logging:Saving final epoch checkpoint.
INFO:torchtune.utils._logging:The full model checkpoint, including all weights and configurations, has been saved successfully. You can now use this checkpoint for further training or inference.
INFO:torchtune.utils._logging:Saving checkpoint took 412.63 secs

Test plan

tune run --nproc_per_node 8 lora_finetune_distributed --config llama3_2_vision/90B_lora  metric_logger=torchtune.training.metric_logging.WandBLogger compile=True log_peak_memory_stats=True enable_activation_checkpointing=True max_steps_per_epoch=25 gradient_accumulation_steps=1 epochs=1 tokenizer.max_seq_len=2048 batch_size=2 num_warmup_steps=0 save_adapter_weights_only=True


pytorch-bot bot commented Oct 22, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1880

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 612dd63 with merge base d3039da:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed label on Oct 22, 2024
from torchtune.modules.tokenizers import parse_hf_tokenizer_json


def llama3_2_vision_transform(
felipemello1 (Contributor Author):

No changes here, just moved to the top.

image_size: int = 560
) -> DeepFusionModel:
""" Llama 3.2 Vision 11B model
image_size: int = 560,
felipemello1 (Contributor Author):

No changes, just the pre-commit hook reformatting.

def llama3_2_vision_transform(
path: str, max_seq_len: int = 8192, image_size: int = 560, special_tokens_path: Optional[str] = None, prompt_template: Optional[_TemplateType] = None
) -> Llama3VisionTransform:
def lora_llama3_2_vision_11b(
felipemello1 (Contributor Author):

No changes, just reordering of functions. Git thinks I am rewriting it; llama3_2_vision_transform is now at the top.

)


def llama3_2_vision_90b(
felipemello1 (Contributor Author):

Copied from 11B; updated the docstring and a couple of parameters.

The 11B decoder is:

decoder = llama3_2_vision_decoder(
    vocab_size=128_256,
    num_layers=32,
    fusion_interval=4,
    num_special_tokens=8,
    num_heads=32,
    num_kv_heads=8,
    embed_dim=4096,
    max_seq_len=131_072,
    encoder_max_seq_len=128_080,  # 20*6404
    rope_base=500000.0,
    intermediate_dim=14336,
)

The 90B decoder is:

decoder = llama3_2_vision_decoder(
    vocab_size=128_256,
    num_layers=100,
    fusion_interval=4,
    num_special_tokens=8,
    num_heads=64,
    num_kv_heads=8,
    embed_dim=8192,
    max_seq_len=131_072,
    encoder_max_seq_len=128_080,  # 20*6404
    rope_base=500000.0,
    intermediate_dim=28672,
)

The encoder is the same, except for decoder_embed_dim, which is 8192 instead of 4096.

Values taken from here: https://huggingface.co/meta-llama/Llama-3.2-90B-Vision/blob/main/config.json
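
As a quick sanity check (a hypothetical snippet, not part of the PR, and assuming the new builder is exported from torchtune.models.llama3_2_vision like the 11B one), the model can be instantiated on the meta device to confirm the parameter count lands near 90B without allocating real memory:

import torch

from torchtune.models.llama3_2_vision import llama3_2_vision_90b

# Meta-device init builds the module structure without allocating weights.
with torch.device("meta"):
    model = llama3_2_vision_90b()

num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e9:.1f}B parameters")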

)


def lora_llama3_2_vision_11b(
def lora_llama3_2_vision_90b(
felipemello1 (Contributor Author):

Git is confused :(

lora_llama3_2_vision_11b is still at the top and was not replaced.

This function is a copy of lora_llama3_2_vision_11b.

@felipemello1 felipemello1 marked this pull request as ready for review October 26, 2024 03:27
@felipemello1 felipemello1 changed the title from "[WIP] Llama 3.2 Vision - 90B" to "Llama 3.2 Vision - 90B" on Oct 26, 2024
@@ -623,6 +644,9 @@ def save_checkpoint(
epoch=epoch,
intermediate_checkpoint=intermediate_checkpoint,
)
log.info(f"Saving checkpoint took {time.perf_counter() - start:.2f} secs")

torch.distributed.barrier()
Contributor:

nit: Can we add a comment here for why this is necessary? Context will be lost to the ether soon.

self._checkpointer.save_checkpoint(
checkpoint_dict,
epoch=epoch,
intermediate_checkpoint=intermediate_checkpoint,
adapter_only=self._save_adapter_weights_only,
)
log.info(f"Saving checkpoint took {time.perf_counter() - start:.2f} secs")

torch.distributed.barrier()
Contributor:

Same here

@@ -446,6 +448,8 @@ def get_full_optimizer_state_dict(
for group_id, sharded_group in sharded_state.items():
group_state = {}
for attr, sharded_tensor in sharded_group.items():
# without this, it may hang forever for +70B models.
torch.distributed.barrier()
Contributor:

If you have this here, why is it needed in the finetuning script?
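
For reference, the recipe-level barrier after save_checkpoint follows a common pattern: rank 0 does the slow write to disk while the other ranks wait, so nobody races ahead into a later collective and times out. A standalone sketch of that pattern (save_then_sync is a hypothetical helper, assuming an already-initialized process group; it is not the recipe code):

import torch
import torch.distributed as dist


def save_then_sync(state_dict: dict, path: str) -> None:
    # Rank 0 does the slow serialization; the barrier keeps the other ranks
    # from moving on until the checkpoint is fully written.
    if dist.get_rank() == 0:
        torch.save(state_dict, path)
    dist.barrier()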

@@ -238,7 +238,8 @@ def optim_step(param) -> None:
            optim_dict[param].zero_grad()

        for p in model.parameters():
-           p.register_post_accumulate_grad_hook(optim_step)
+           if p.requires_grad:
Contributor:

🫡
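
For context on the hunk above: with optimizer-in-backward, each trainable parameter gets its own optimizer that is stepped from a post-accumulate-grad hook, and the new requires_grad guard makes sure hooks are only registered for trainable parameters (under LoRA most of the model is frozen and never receives gradients). A minimal self-contained sketch of the pattern (the toy model and SGD settings are stand-ins, not the recipe's setup):

import torch
from torch import nn

model = nn.Linear(16, 16)
optim_dict = {
    p: torch.optim.SGD([p], lr=1e-2)
    for p in model.parameters()
    if p.requires_grad
}


def optim_step(param: torch.Tensor) -> None:
    # Step and clear this parameter's optimizer as soon as its grad is ready,
    # so the gradient does not have to be kept around after the update.
    optim_dict[param].step()
    optim_dict[param].zero_grad()


for p in model.parameters():
    if p.requires_grad:  # frozen params never get grads, so don't hook them
        p.register_post_accumulate_grad_hook(optim_step)

# A normal backward pass now also applies the optimizer updates.
loss = model(torch.randn(4, 16)).sum()
loss.backward()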

ebsmothers (Contributor) left a comment:

Two main questions from me:

(1) Why do we have 90B full finetune for 4 devices? That should OOM, no?
(2) Bumping @joecummings's comments about use of torch.distributed.barrier(). It's not clear to me which of those we actually need (and why).

Stamping to unblock

felipemello1 (Contributor Author):

Fixed (1) here on GitHub. Will do (2) later. My /fwdproxy_client stopped working :S

@felipemello1 felipemello1 merged commit 1f5e21d into pytorch:main Oct 29, 2024
14 checks passed
@felipemello1 felipemello1 deleted the 90b_llamav branch October 29, 2024 22:17
Labels: CLA Signed (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed)