
Universal Checkpoint for Sequence Parallelism #305

Merged: 8 commits merged into main on Dec 14, 2023

Conversation

samadejacobs commented:

This PR extends universal checkpoint support to DeepSpeed (DS) sequence parallelism and to training scenarios where pipeline parallelism is not enabled.

The attached TensorBoard chart shows a training scenario (validation curve) in which a GPT model is pre-trained with data parallelism (4 GPUs) and checkpoints are saved at the 100th and 200th iterations. The checkpoint from the 100th iteration is later loaded for continual pre-training with a different configuration (more GPUs in total: data parallelism = 4 GPUs, sequence parallelism = 2 GPUs).

[Screenshot: TensorBoard validation curve, 2023-11-28]
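
For anyone who wants to reproduce a similar run, below is a minimal sketch of the workflow this PR enables: convert a checkpoint saved under one parallel layout into the universal format, then resume training under a different layout. The script path, directory names, and flags shown (ds_to_universal.py, --ds-sequence-parallel-size, --load, --universal-checkpoint) are assumptions based on the DeepSpeed / Megatron-DeepSpeed universal-checkpointing examples, not taken verbatim from this PR; adjust them to your own setup.

```bash
# Sketch only; paths and flag names are assumptions, not quoted from this PR.

# 1) Convert the checkpoint saved at iteration 100 of the DP=4 run into the
#    universal format using DeepSpeed's conversion script.
python DeepSpeed/deepspeed/checkpoint/ds_to_universal.py \
  --input_folder  checkpoints/gpt_dp4/global_step100 \
  --output_folder checkpoints/gpt_dp4/global_step100_universal

# 2) Resume pre-training with the new layout (DP=4, SP=2) and tell
#    Megatron-DeepSpeed to load the universal checkpoint.
deepspeed pretrain_gpt.py \
  --ds-sequence-parallel-size 2 \
  --load checkpoints/gpt_dp4 \
  --universal-checkpoint \
  ...  # remaining model, data, and DeepSpeed config arguments
```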

@tjruwase tjruwase requested review from lekurile and removed request for xiaoxiawu-microsoft December 11, 2023 21:42
@samadejacobs samadejacobs merged commit 71e8407 into main Dec 14, 2023
1 check passed
zdaiot pushed a commit to zdaiot/Megatron-DeepSpeed that referenced this pull request on Jan 20, 2024
* Extend universal checkpoint support for ds sequence parallelism (SP) and ZeRO stage 2

* Extend universal checkpoint support for ds sequence parallelism (SP) and ZeRO stage 2

* Extend README and batch scripts discussion

* Extend README and batch scripts discussion

* Extend README and batch scripts discussion

* Remove debug statement

* Script no pipeline parallel for ZeRO stage 2