Adding the new feature of FPDT (microsoft#441) #72
from upstream:

(Meta-) Review

Caution: the Copilot summary below fails to capture the changes in `megatron/model/transformer.py`, which unfortunately break `--use-flash-attn-builder` on Intel XPU.

Copilot Summary
This pull request includes several updates to improve the fine-tuning process and configuration for DeepSpeed and Megatron-LM models. The most important changes include adding new conversion commands, updating configuration files, and introducing new arguments and logic for sequence parallelism with FPDT.
Updates to fine-tuning process and configuration:

- `examples_deepspeed/finetune_hf_llama/README.md`: Updated the conversion command to include `convert_hf2mds` and added information about `convert_mds2hf` for converting models between Hugging Face and Megatron-DeepSpeed formats.
- `examples_deepspeed/finetune_hf_llama/ds_config.json`: Modified the configuration to include `zero_optimization` and `bf16` settings, and updated `steps_per_print` to 100 (see the config sketch after this list).
- `examples_deepspeed/finetune_hf_llama/finetune_llama.sh`: Added logic to select the appropriate configuration file based on the conversion command and updated the fine-tuning arguments.
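The summary names the config fields but not their values. As a hedged illustration only, here is a minimal Python sketch that writes a `ds_config.json` carrying `zero_optimization`, `bf16`, and `steps_per_print` entries; the ZeRO stage and batch size below are assumptions, not values taken from this PR.

```python
import json

# Hedged sketch of a DeepSpeed config with the fields the summary
# mentions; the ZeRO stage and batch size are illustrative
# assumptions, not values from this PR.
ds_config = {
    "train_batch_size": 32,    # assumed value
    "steps_per_print": 100,    # per the summary
    "zero_optimization": {
        "stage": 2,            # assumed stage
    },
    "bf16": {
        "enabled": True,       # per the summary
    },
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```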
Introduction of new sequence parallelism with FPDT:

- `megatron/arguments.py`: Added new arguments for DeepSpeed sequence parallelism with FPDT, including `--ds-sequence-parallel-fpdt`, `--ds-sequence-parallel-fpdt-chunk-size`, and `--ds-sequence-parallel-fpdt-offloading` (a sketch of how such flags are registered follows this list).
- `megatron/initialize.py`: Updated the warmup function to handle the FPDT sequence length and avoid out-of-memory (OOM) issues.
- `megatron/model/gpt_model.py`: Integrated the FPDT logits loss in the post-language-model processing function.
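Since only the flag names appear in the summary, the following is a minimal sketch of how such flags are typically registered in a Megatron-style `arguments.py`; the types, defaults, and help strings are assumptions, not taken from the PR.

```python
import argparse

def _add_fpdt_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    """Sketch of registering the FPDT flags named in the summary.

    Only the flag names come from this PR's summary; the defaults,
    types, and help strings below are illustrative assumptions.
    """
    group = parser.add_argument_group(title="DeepSpeed FPDT sequence parallelism")
    group.add_argument("--ds-sequence-parallel-fpdt", action="store_true",
                       help="Enable DeepSpeed sequence parallelism with FPDT.")
    group.add_argument("--ds-sequence-parallel-fpdt-chunk-size", type=int,
                       default=65536,  # assumed default
                       help="Chunk size (in tokens) used by FPDT.")
    group.add_argument("--ds-sequence-parallel-fpdt-offloading", action="store_true",
                       help="Offload FPDT chunks to host memory.")
    return parser

parser = _add_fpdt_args(argparse.ArgumentParser())
args = parser.parse_args(["--ds-sequence-parallel-fpdt"])
```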
Enhancements to rotary position embeddings:

- `megatron/model/rotary_pos_embedding.py`: Modified the `RotaryEmbedding` class to use the current device and updated the forward method to return the cosine and sine components separately (see the sketch after this list).
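To make "return the cosine and sine components separately" concrete, here is a self-contained PyTorch sketch of a rotary-embedding module with that interface. It is not the PR's actual `RotaryEmbedding` code; the `base=10000` constant is the usual convention, assumed here.

```python
import torch

class RotaryEmbeddingSketch(torch.nn.Module):
    """Illustrative rotary embedding returning cos and sin separately.

    A sketch of the behavior the summary describes, not the PR's
    actual `RotaryEmbedding` class; base=10000 is the usual
    convention, assumed here.
    """

    def __init__(self, dim: int, base: int = 10000):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)

    def forward(self, seq_len: int):
        # Build frequencies on the buffer's current device, mirroring
        # the summary's note about using the current device.
        t = torch.arange(seq_len, device=self.inv_freq.device,
                         dtype=self.inv_freq.dtype)
        freqs = torch.outer(t, self.inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        # Return the two components separately rather than one tensor.
        return emb.cos(), emb.sin()

cos, sin = RotaryEmbeddingSketch(dim=128)(seq_len=4096)
```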
Miscellaneous changes:

- `pretrain_gpt.py`: Added support for FPDT input construction in the `get_batch` function (an illustrative chunking sketch follows).
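The summary does not show how the FPDT inputs are built. Purely as an illustration of the general idea, chunked sequence parallelism splits each batch along the sequence dimension into fixed-size pieces; the helper below is hypothetical and is not the PR's `get_batch` logic.

```python
import torch

def split_into_fpdt_chunks(tokens: torch.Tensor, chunk_size: int):
    """Hypothetical helper: split a [batch, seq_len] token tensor along
    the sequence dimension into FPDT-sized chunks. Sketches the general
    idea of chunked inputs only; not the PR's get_batch logic."""
    seq_len = tokens.size(1)
    return [tokens[:, i:i + chunk_size] for i in range(0, seq_len, chunk_size)]

# Example: 4 chunks of 2048 tokens from an 8192-token batch (assumed sizes).
chunks = split_into_fpdt_chunks(torch.zeros(1, 8192, dtype=torch.long), 2048)
```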