Universal Checkpoint for Sequence Parallelism #305

samadejacobs · 2023-11-29T21:35:40Z

This PR extends the universal checkpoint to support DS sequence parallelism and training scenarios where pipeline parallelism is not enabled.

The attached TensorBoard chart shows a training scenario (validation curve) in which a GPT model is pre-trained with data parallelism (4 GPUs), with checkpoints saved at the 100th and 200th iterations. The checkpoint from the 100th iteration is later loaded for continual pre-training under a different configuration that uses more GPU resources (data parallelism = 4 GPUs, sequence parallelism = 2 GPUs).
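To make the "more GPU resources" point concrete, here is a minimal sketch of the GPU arithmetic for the two configurations. It assumes the total GPU count is the product of the data-parallel and sequence-parallel degrees, with no tensor or pipeline parallelism; `ParallelConfig` is a hypothetical helper for illustration, not a Megatron-DeepSpeed type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelConfig:
    """Hypothetical container for parallel degrees (illustration only)."""
    data_parallel: int
    sequence_parallel: int = 1

    def world_size(self) -> int:
        # Assumption: total GPUs = DP degree * SP degree,
        # with tensor and pipeline parallelism both disabled.
        return self.data_parallel * self.sequence_parallel


# Configuration used when the checkpoints were saved (DP only).
save_cfg = ParallelConfig(data_parallel=4)

# Configuration used when resuming from the iteration-100 checkpoint (DP + SP).
resume_cfg = ParallelConfig(data_parallel=4, sequence_parallel=2)

print(save_cfg.world_size(), resume_cfg.world_size())
```

Under that assumption the resumed run spans twice as many GPUs as the run that wrote the checkpoint, which is the kind of topology change the universal checkpoint is meant to absorb.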


megatron/model/gpt_model.py

examples_deepspeed/universal_checkpointing/run_universal_fp16.sh

examples_deepspeed/universal_checkpointing/run_fp16.sh

* Extend universal checkpoint support for ds sequence parallelism (SP) and ZeRO stage 2
* Extend README and batch scripts discussion
* Remove debug statement
* Script no pipeline parallel for ZeRO stage 2

samadejacobs added 2 commits November 29, 2023 10:52

Extend universal checkpoint support for ds sequence parallelism (SP) and ZeRO stage 2

68f781a

Extend universal checkpoint support for ds sequence parallelism (SP) and ZeRO stage 2

3905068

samadejacobs self-assigned this Nov 29, 2023

samadejacobs requested review from jeffra, tjruwase, ShadenSmith, conglongli, awan-10, eltonzheng, minjiaz, RezaYazdaniAminabadi, duli2012, mrwyattii, arashb, xiaoxiawu-microsoft and GuanhuaWang as code owners November 29, 2023 21:35

samadejacobs added 3 commits December 11, 2023 21:20

Extend README and batch scripts discussion

23cef48

Extend README and batch scripts discussion

b94d104

Extend README and batch scripts discussion

3329a6d

tjruwase removed request for arashb, ShadenSmith, jeffra, duli2012, conglongli, awan-10, GuanhuaWang, mrwyattii, eltonzheng, minjiaz and RezaYazdaniAminabadi December 11, 2023 21:42

tjruwase requested review from lekurile and removed request for xiaoxiawu-microsoft December 11, 2023 21:42

tjruwase reviewed Dec 11, 2023

megatron/model/gpt_model.py (outdated, resolved)

tjruwase reviewed Dec 11, 2023

examples_deepspeed/universal_checkpointing/run_universal_fp16.sh (outdated, resolved)

tjruwase reviewed Dec 11, 2023

examples_deepspeed/universal_checkpointing/run_fp16.sh (outdated, resolved)

samadejacobs added 3 commits December 12, 2023 02:49

Remove debug statement

f730564

Script no pipeline parallel for ZeRO stage 2

19ab4cf

Merge branch 'main' into universal_ckpt_sp

ddfa096

samadejacobs merged commit 71e8407 into main Dec 14, 2023
1 check passed
