Merge with latest upstream #5

Open
wants to merge 78 commits into base: main
Conversation

Quentin-Anthony
Collaborator

No description provided.

zhangsheng377 and others added 30 commits November 8, 2023 18:12
* universal-ckp: fix gpt model param names

Signed-off-by: Moshe Island <[email protected]>

* universal-ckp: reconfigure model parameter rng tracker

When loading from universal checkpoint with a different model parameter
configuration, the loaded tensor parallel RNG tracker states are incorrect.
In this case, we reconfigure the tensor parallel RNG tracker states with new
seed values (each tp rank with a unique seed).
We add an offset equal to the iteration to the base seed. This ensures that when
we load multiple times from a universal checkpoint, each run uses a different
random sequence.

This commit requires a counterpart change in the DeepSpeed repo.

Signed-off-by: Moshe Island <[email protected]>
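
A minimal sketch of the reseeding idea described above, assuming a Megatron-style RNG tracker with reset()/add(); the function name and the exact offset arithmetic are illustrative, not the repository's definitive implementation:

```python
# Hedged sketch (illustrative names): reseed the tensor-parallel RNG tracker
# after loading a universal checkpoint with a different TP configuration.
def reconfigure_tp_rng_tracker(tracker, base_seed, iteration, tp_rank):
    # Offset the base seed by the iteration so that repeated loads from the
    # same universal checkpoint draw different random sequences, and by the
    # TP rank so that each tensor-parallel rank gets a unique seed.
    seed = base_seed + iteration + tp_rank
    tracker.reset()
    tracker.add('model-parallel-rng', seed)
    return seed
```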

* universal-ckp: remove embedding norm patterns

Embedding norm patterns originate from BLOOM but are not present in vanilla GPT.
Therefore, remove them.

Signed-off-by: Moshe Island <[email protected]>

---------

Signed-off-by: Moshe Island <[email protected]>
Co-authored-by: Moshe Island <[email protected]>
…ackage (#288)

When installing with "pip install .", megatron/model/vision does not appear
under python/dist-packages/megatron. To fix this, add an __init__.py.

Signed-off-by: yuanwu <[email protected]>
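
A small illustration of why the missing __init__.py matters, assuming setup.py uses setuptools' find_packages(), which only collects directories that contain an __init__.py:

```python
# Illustrative check, run from the repository root; not part of the repo itself.
from setuptools import find_packages

packages = find_packages()
# Without megatron/model/vision/__init__.py, "megatron.model.vision" is absent
# from this list and is therefore not installed by "pip install .".
print("megatron.model.vision" in packages)
```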
Modify universal checkpoint parameter patterns based on the specific model
configuration. This commit adds support for the Llama family of models.

Signed-off-by: Moshe Island <[email protected]>
Co-authored-by: Moshe Island <[email protected]>
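
A hedged sketch of configuration-dependent pattern selection; the helper name and the pattern strings below are illustrative only, not the repository's actual list:

```python
# Illustrative only: choose universal-checkpoint parameter patterns based on
# the model configuration instead of hard-coding a single GPT layout.
def select_param_patterns(args):
    patterns = {
        # Tensors partitioned along the TP dimension in a GPT-style model.
        "column_parallel": [r".*self_attention\.query_key_value\..*",
                            r".*mlp\.dense_h_to_4h\..*"],
        "row_parallel": [r".*self_attention\.dense\..*",
                         r".*mlp\.dense_4h_to_h\..*"],
    }
    if getattr(args, "swiglu", False):
        # Llama-family (gated MLP): the fused h->4h weight holds two halves
        # that must be split before reshaping and re-concatenated afterwards.
        patterns["gated"] = [r".*mlp\.dense_h_to_4h\..*"]
    return patterns
```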
* Extend universal checkpoint support for ds sequence parallelism (SP) and ZeRO stage 2

* Extend universal checkpoint support for ds sequence parallelism (SP) and ZeRO stage 2

* Extend README and batch scripts discussion

* Extend README and batch scripts discussion

* Extend README and batch scripts discussion

* Remove debug statement

* Script no pipeline parallel for ZeRO stage 2
* Revert "Modify the args_default usage (#250)"

This reverts commit 3095a51.

* Add the external arguments

Add external_arguments for passing arguments from a function call.

Signed-off-by: yuanwu <[email protected]>

---------

Signed-off-by: yuanwu <[email protected]>
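
A hedged sketch of the external-arguments idea: let a caller pass arguments programmatically instead of relying only on sys.argv. The function and parameter names are illustrative:

```python
import argparse

def parse_args(external_arguments=None):
    parser = argparse.ArgumentParser(description="Megatron-DeepSpeed arguments")
    parser.add_argument("--micro-batch-size", type=int, default=1)
    # When called from another program, external_arguments (a list of strings)
    # is parsed instead of the process command line.
    return parser.parse_args(args=external_arguments)

# Usage from a function call rather than the CLI:
args = parse_args(external_arguments=["--micro-batch-size", "4"])
```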
* alpaca hf weight finetune

clean up

update

update

update

update

update

update

update

arg fix

update

clean up

update

update

update

refine weight converter

don't cat when dim=0

format

update

update

update

* add finetune script

* add condition for no padded token case

* add reference

---------

Co-authored-by: Conglong Li <[email protected]>
* Fixed incorrect argument supplied to deepspeed init

* Added suggestion to make fix backwards compatible
This PR updates the Universal Checkpointing README with instructions on how to download the GPT dataset and cleans up a few nits in the corresponding bash scripts.

* Clean up UC scripts and update UC README

* Revert LOAD_TP change

* Update parallelism degrees

* UC Matplotlib generation script

* Add matplotlib code

* Script rename

* Source label names using regex

* Update plot gen script

* Revert 3D parallelism change

* regex matches to py variables

* Move location of script

* Update regex to search for multi-digit parallelism degrees

* Create ABC class for analyzer and remove UC specific analysis elements

* Move args to separate folder, add sns switch

* add bash script for UC analysis

* Change name of script

* Move UC specific label name to class

* Rename script

* clean up script

* Update analyzer return

* Update bash script

* remove log_dir

* Address PR comments
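
A hedged sketch of the regex-based label sourcing mentioned in the script commits above, assuming run labels of the form "uc_tp2_pp2_dp16" (the naming scheme is an assumption) and supporting multi-digit parallelism degrees:

```python
import re

def parse_parallelism(label):
    # \d+ matches multi-digit degrees such as dp16 or tp32.
    match = re.search(r"tp(\d+)_pp(\d+)_dp(\d+)", label)
    if match is None:
        return None
    tp, pp, dp = (int(g) for g in match.groups())
    return {"tp": tp, "pp": pp, "dp": dp}

print(parse_parallelism("uc_tp2_pp2_dp16"))  # {'tp': 2, 'pp': 2, 'dp': 16}
```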
This PR updates the Megatron type check to check against the accelerator specific dtype instead of the class. The change is necessary to account for warning fixes in microsoft/DeepSpeed#5018.
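
A hedged sketch of the dtype-based check (the helper name is illustrative): comparing tensor.dtype avoids the legacy torch.cuda.*Tensor class checks that trigger the warnings addressed in microsoft/DeepSpeed#5018.

```python
import torch

def is_reduced_precision(tensor):
    # Old style, class-based and tied to a specific device:
    #   isinstance(tensor, torch.cuda.HalfTensor)
    # New style, dtype-based and accelerator-agnostic:
    return tensor.dtype in (torch.float16, torch.bfloat16)
```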
polisettyvarma and others added 30 commits July 8, 2024 15:58
…urther on device (#411)

* improve performance by keeping attention_mask on device and running ops further on device

* add copyrights
* improve RoPE perf by using cached sin/cos tensors

* add copyrights
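
A hedged sketch of the cached sin/cos idea, not the repository's exact implementation: compute the rotary tables once per sequence length and reuse them on subsequent forward passes.

```python
import torch

class CachedRotaryEmbedding(torch.nn.Module):
    def __init__(self, dim, base=10000):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)
        self._seq_len_cached = None
        self._cos_cached = None
        self._sin_cached = None

    def forward(self, seq_len, device, dtype):
        # Recompute only when the cache is empty or the shape/device changed.
        if (self._cos_cached is None or self._seq_len_cached != seq_len
                or self._cos_cached.device != device):
            t = torch.arange(seq_len, device=device, dtype=self.inv_freq.dtype)
            freqs = torch.outer(t, self.inv_freq.to(device))
            emb = torch.cat((freqs, freqs), dim=-1)
            self._cos_cached = emb.cos().to(dtype)
            self._sin_cached = emb.sin().to(dtype)
            self._seq_len_cached = seq_len
        return self._cos_cached, self._sin_cached
```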
* Extend test utilities to support more accelerators

* Add Intel Copyright
* Update arguments.py

* Update training.py

* Create profiler.py

* add copyrights

* Update profiler.py

* add copyrights

* Update help

* add copyrights
* Refine wandb logging function

* Address comments

* enable user to specify wandb local save dir

* Update and fix comments

* Update
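
A hedged sketch of letting the user choose the wandb local save directory; the argument name and helper are illustrative, not the repo's exact logging code.

```python
import os
import wandb

def init_wandb(project, exp_name, save_dir=None):
    # If the user did not specify a directory, fall back to the working dir.
    save_dir = save_dir or os.path.join(os.getcwd(), "wandb_logs")
    os.makedirs(save_dir, exist_ok=True)
    # wandb writes its local run files under <save_dir>/wandb.
    wandb.init(project=project, name=exp_name, dir=save_dir)
```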
…412)

* Update arguments.py

* Update training.py

* Update utils.py

* add copyrights

* add copyrights

* add copyrights

* Update arguments.py help

* Update arguments.py

* Update training.py

* Update utils.py

* Update arguments.py
…rocessing (#421)

* Update arguments.py

* Update tokenizer.py

* Update preprocess_data.py
* Update module.py

* Update preprocess_data.py

* add copyrights

* add copyrights

* Update tokenizer.py

* add copyrights
This PR adds a Llama universal checkpointing example to examples_deepspeed/universal_checkpointing.

It also includes changes to the README, some minor changes, and an update to the TensorBoard analysis script.
…sing flash_attn_cuda in sequence parallel (#406)

Co-authored-by: Jinghan Yao <[email protected]>
…on for supporting batch size larger than 1 (#433)

Co-authored-by: Jinghan Yao <[email protected]>
* add support converting checkpoint from hf to mds

* Fix PP issue

* update
* fix TFLOPs calculation

When GQA is used, we observe correct TFLOPs after this fix.
When GQA is not used, the large difference in TFLOPs is resolved with
selective recompute.
Some other minor differences will also be observed because logits MACs are now included.

* add copyrights
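
A hedged sketch of a per-iteration FLOPs estimate that accounts for GQA (fewer key/value heads) and includes the logits matmul; the coefficients follow the usual 2x-MACs convention and are illustrative, not the repository's exact formula.

```python
def estimate_flops(batch, seq, layers, hidden, heads, kv_heads, vocab):
    head_dim = hidden // heads
    kv_hidden = kv_heads * head_dim                    # smaller K/V projections under GQA
    qkv = 2 * batch * seq * hidden * (hidden + 2 * kv_hidden)
    attn = 2 * 2 * batch * heads * seq * seq * head_dim  # QK^T and attention*V
    out_proj = 2 * batch * seq * hidden * hidden
    mlp = 2 * batch * seq * 2 * hidden * (4 * hidden)     # two h<->4h matmuls
    logits = 2 * batch * seq * hidden * vocab              # logits MACs, now included
    return layers * (qkv + attn + out_proj + mlp) + logits
```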
* pass batch_dim_idx to deepspeed sequence parallel distributed attention for supporting batch size larger than 1

* add FPDT support; add Ulysses rotary position embedding support

* add FPDT support; add Ulysses rotary position embedding support

* add FPDT support; add Ulysses rotary position embedding support

* add FPDT support; add Ulysses rotary position embedding support

* remove unnecessary files

* set the warmup length to be FPDT chunk size if enabled

---------

Co-authored-by: Jinghan Yao <[email protected]>
Co-authored-by: Jinghan Yao <[email protected]>