Adding the new feature of FPDT (microsoft#441) #72
from upstream:

(Meta-) Review

Caution: the Copilot summary below fails to capture the changes in `megatron/model/transformer.py`, which unfortunately break `--use-flash-attn-builder` on Intel XPU.

Copilot Summary
This pull request includes several updates to improve the fine-tuning process and configuration for DeepSpeed and Megatron-LM models. The most important changes include adding new conversion commands, updating configuration files, and introducing new arguments and logic for sequence parallelism with FPDT.
Updates to fine-tuning process and configuration:

- `examples_deepspeed/finetune_hf_llama/README.md`: Updated the conversion command to include `convert_hf2mds` and added information about `convert_mds2hf` for converting models between Hugging Face and Megatron-DeepSpeed formats.
- `examples_deepspeed/finetune_hf_llama/ds_config.json`: Modified the configuration to include `zero_optimization` and `bf16` settings, and updated `steps_per_print` to 100 (see the config sketch after this list).
- `examples_deepspeed/finetune_hf_llama/finetune_llama.sh`: Added logic to select the appropriate configuration file based on the conversion command and updated the fine-tuning arguments.
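The summary names the config fields but not their values. As a hedged illustration only, here is a minimal Python sketch that writes a `ds_config.json` carrying `zero_optimization`, `bf16`, and `steps_per_print` entries; the ZeRO stage and batch size below are assumptions, not values taken from this PR.

```python
import json

# Hedged sketch of a DeepSpeed config with the fields the summary
# mentions; the ZeRO stage and batch size are illustrative
# assumptions, not values from this PR.
ds_config = {
    "train_batch_size": 32,    # assumed value
    "steps_per_print": 100,    # per the summary
    "zero_optimization": {
        "stage": 2,            # assumed stage
    },
    "bf16": {
        "enabled": True,       # per the summary
    },
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```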
Introduction of new sequence parallelism with FPDT:

- `megatron/arguments.py`: Added new arguments for DeepSpeed sequence parallelism with FPDT, including `--ds-sequence-parallel-fpdt`, `--ds-sequence-parallel-fpdt-chunk-size`, and `--ds-sequence-parallel-fpdt-offloading` (a sketch of how such flags are registered follows this list).
- `megatron/initialize.py`: Updated the warmup function to handle the FPDT sequence length and avoid out-of-memory (OOM) issues.
- `megatron/model/gpt_model.py`: Integrated the FPDT logits loss in the post-language-model processing function.
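Since only the flag names appear in the summary, the following is a minimal sketch of how such flags are typically registered in a Megatron-style `arguments.py`; the types, defaults, and help strings are assumptions, not taken from the PR.

```python
import argparse

def _add_fpdt_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    """Sketch of registering the FPDT flags named in the summary.

    Only the flag names come from this PR's summary; the defaults,
    types, and help strings below are illustrative assumptions.
    """
    group = parser.add_argument_group(title="DeepSpeed FPDT sequence parallelism")
    group.add_argument("--ds-sequence-parallel-fpdt", action="store_true",
                       help="Enable DeepSpeed sequence parallelism with FPDT.")
    group.add_argument("--ds-sequence-parallel-fpdt-chunk-size", type=int,
                       default=65536,  # assumed default
                       help="Chunk size (in tokens) used by FPDT.")
    group.add_argument("--ds-sequence-parallel-fpdt-offloading", action="store_true",
                       help="Offload FPDT chunks to host memory.")
    return parser

parser = _add_fpdt_args(argparse.ArgumentParser())
args = parser.parse_args(["--ds-sequence-parallel-fpdt"])
```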
Enhancements to rotary position embeddings:

- `megatron/model/rotary_pos_embedding.py`: Modified the `RotaryEmbedding` class to use the current device and updated the forward method to return the cosine and sine components separately (see the sketch after this list).
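To make "return the cosine and sine components separately" concrete, here is a self-contained PyTorch sketch of a rotary-embedding module with that interface. It is not the PR's actual `RotaryEmbedding` code; the `base=10000` constant is the usual convention, assumed here.

```python
import torch

class RotaryEmbeddingSketch(torch.nn.Module):
    """Illustrative rotary embedding returning cos and sin separately.

    A sketch of the behavior the summary describes, not the PR's
    actual `RotaryEmbedding` class; base=10000 is the usual
    convention, assumed here.
    """

    def __init__(self, dim: int, base: int = 10000):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)

    def forward(self, seq_len: int):
        # Build frequencies on the buffer's current device, mirroring
        # the summary's note about using the current device.
        t = torch.arange(seq_len, device=self.inv_freq.device,
                         dtype=self.inv_freq.dtype)
        freqs = torch.outer(t, self.inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        # Return the two components separately rather than one tensor.
        return emb.cos(), emb.sin()

cos, sin = RotaryEmbeddingSketch(dim=128)(seq_len=4096)
```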
Miscellaneous changes:

- `pretrain_gpt.py`: Added support for FPDT input construction in the `get_batch` function (an illustrative chunking sketch follows).
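The summary does not show how the FPDT inputs are built. Purely as an illustration of the general idea, chunked sequence parallelism splits each batch along the sequence dimension into fixed-size pieces; the helper below is hypothetical and is not the PR's `get_batch` logic.

```python
import torch

def split_into_fpdt_chunks(tokens: torch.Tensor, chunk_size: int):
    """Hypothetical helper: split a [batch, seq_len] token tensor along
    the sequence dimension into FPDT-sized chunks. Sketches the general
    idea of chunked inputs only; not the PR's get_batch logic."""
    seq_len = tokens.size(1)
    return [tokens[:, i:i + chunk_size] for i in range(0, seq_len, chunk_size)]

# Example: 4 chunks of 2048 tokens from an 8192-token batch (assumed sizes).
chunks = split_into_fpdt_chunks(torch.zeros(1, 8192, dtype=torch.long), 2048)
```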