Fix quantized-inference & Add generic support of checkpoint loading #2547

Merged
merged 13 commits into from
Dec 6, 2022

Conversation

@RezaYazdaniAminabadi RezaYazdaniAminabadi commented Nov 24, 2022

This PR generalizes checkpoint loading and saving in DeepSpeed-Inference to make model initialization faster across different tensor-parallelism (TP) degrees. It also allows the model to be stored in different TP-sharded configurations, as requested by several users (#2442 and #2379). It has already been tested with several model architectures: GPT-J, GPT-NeoX, OPT, and BLOOM.
Moreover, it enables users to quantize the model using the saved checkpoint.
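
To make the idea of storing a model in different TP-sharded configurations concrete, here is a minimal sketch. It is not DeepSpeed's implementation: it assumes a weight matrix held as a plain list of rows and split evenly by rows, and the names `shard_rows`, `merge_shards`, and `reshard` are hypothetical helpers introduced only for illustration.

```python
from typing import List

Matrix = List[List[float]]


def shard_rows(weight: Matrix, tp_size: int) -> List[Matrix]:
    """Split a weight matrix into tp_size equal row blocks, one per rank."""
    if tp_size <= 0 or len(weight) % tp_size != 0:
        raise ValueError("row count must be divisible by tp_size")
    rows_per_rank = len(weight) // tp_size
    return [weight[r * rows_per_rank:(r + 1) * rows_per_rank]
            for r in range(tp_size)]


def merge_shards(shards: List[Matrix]) -> Matrix:
    """Concatenate per-rank row blocks back into the full matrix."""
    return [row for shard in shards for row in shard]


def reshard(shards: List[Matrix], new_tp_size: int) -> List[Matrix]:
    """Convert shards saved at one TP degree into shards for another degree."""
    return shard_rows(merge_shards(shards), new_tp_size)


# Toy 8x4 weight: save it with TP=4, then re-shard it for TP=2.
full = [[float(4 * r + c) for c in range(4)] for r in range(8)]
tp4 = shard_rows(full, 4)
tp2 = reshard(tp4, 2)
```

In this toy setup each of the four shards holds two rows, and re-sharding to two ranks yields four rows per rank without changing the underlying values. Real layers may be split along different dimensions depending on the layer type.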

To run inference on these models with the new model-checkpoint loading, you can use this PR and, more specifically, this test suite. Here, I show some examples of running inference with GPTJ-6B and OPT-30B.

GPTJ-6B

Let's assume you have downloaded the checkpoint somewhere under the '/tmp' directory and want to use DeepSpeed with meta tensors to load it. To do so, you can use the following command:

deepspeed --num_gpus 1 inference-test.py --name EleutherAI/gpt-j-6B --ds_inference --use_kernel --use_meta_tensor --checkpoint_path '/tmp/

To save the new checkpoint with any TP size, pass the directory in which you want to store it through the save_mp_checkpoint_path argument. To run the model in INT8-quantized mode, pass --dtype int8 to convert the model to INT8 format; if you also pass a new path to save the model, you get the quantized checkpoint. Here is the output of this model in INT8-quantized mode:

[2022-11-29 00:11:48,186] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 4096, 'intermediate_size': 16384, 'heads': 16, 'num_hidden_layers': -1, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'q_int8': True, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': 64, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': False, 'mlp_act_func_type': <ActivationFuncType.GELU: 1>, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False, 'max_out_tokens': 1024, 'scale_attn_by_inverse_layer_idx': False}
Loading 8 checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:17<00:00,  2.19s/it]
checkpoint loading time at rank 0: 17.803102493286133 sec                                                                                                               | 0/1 [00:00<?, ?it/s]
Loading 1 checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  3.64it/s]
[2022-11-29 00:12:06,689] [INFO] [utils.py:827:see_memory_usage] after init_inference
[2022-11-29 00:12:06,690] [INFO] [utils.py:828:see_memory_usage] MA 6.03 GB         Max_MA 6.4 GB         CA 7.0 GB         Max_CA 7 GB
[2022-11-29 00:12:06,690] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 21.48 GB, percent = 1.2%
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
------------------------------------------------------
Free memory : 70.691223 (GigaBytes)
Total memory: 79.169678 (GigaBytes)
Requested memory: 0.625000 (GigaBytes)
Setting maximum total tokens (input + output) to 1024
------------------------------------------------------
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
generation time is 1.0636017322540283 sec

in=DeepSpeed is a machine learning framework
out=DeepSpeed is a machine learning framework available for Python and R users to build accurate deep learning engines, for example for the training of deep residual network models. DeepSpeed comes with pre-trained models for tasks like image classification, image segmentation, super-resolution, semantic segmentation,

As expected, memory usage drops from about 12 GB for the fp16 run to about 6 GB for INT8, and the generated text is still of good quality.
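
The halving follows from the bytes stored per weight. The sketch below estimates a weights-only footprint and shows a simple symmetric INT8 round trip for a single group of values. It is an illustration under stated assumptions, not DeepSpeed's quantization kernel: the parameter count is rounded to 6e9, activations and caches are ignored, and `weight_memory_gb`, `quantize_int8`, and `dequantize_int8` are hypothetical helper names.

```python
from typing import List, Tuple

BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Weights-only footprint in GB (1e9 bytes); ignores activations and caches."""
    return num_params * BYTES_PER_ELEMENT[dtype] / 1e9


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric quantization of one group: map [-max_abs, max_abs] to [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(v / scale) for v in values], scale


def dequantize_int8(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the integers and the group scale."""
    return [q * scale for q in quantized]


# Assumption: GPT-J-6B rounded to 6e9 parameters.
fp16_gb = weight_memory_gb(6e9, "fp16")
int8_gb = weight_memory_gb(6e9, "int8")

weights = [0.3, -1.0, 0.25, 0.0]
quantized, scale = quantize_int8(weights)
restored = dequantize_int8(quantized, scale)
```

With these assumptions the estimate is 12 GB for fp16 and 6 GB for INT8, and each restored value differs from the original by at most half a quantization step.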

OPT-30B

Here, I compare the initialization cost (model creation plus checkpoint loading) on two A100 GPUs between the HF baseline and the meta-tensor path with the DeepSpeed-Inference load-checkpoint module:

Baseline

We use the following command:

deepspeed --num_gpus 2 inference-test.py --name facebook/opt-30b --batch_size 1 --ds_inference --use_kernel

With this approach, initialization takes about 2.7 minutes (164.7 sec), and we need 188 GB of CPU RAM to create and load the model on the CPU before sharding it across the 2 GPUs.

[2022-11-29 00:47:13,191] [INFO] [utils.py:827:see_memory_usage] before init
[2022-11-29 00:47:13,192] [INFO] [utils.py:828:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB
[2022-11-29 00:47:13,192] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 20.45 GB, percent = 1.2%
[2022-11-29 00:49:55,007] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed info: version=0.7.6+1e4c3ea8, git-hash=1e4c3ea8, git-branch=fix-opt-injection
[2022-11-29 00:49:55,008] [WARNING] [config_utils.py:67:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2022-11-29 00:49:55,009] [INFO] [logging.py:68:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
**initialization time: 164.7 sec**
[2022-11-29 00:49:59,638] [INFO] [utils.py:827:see_memory_usage] after init
[2022-11-29 00:49:59,639] [INFO] [utils.py:828:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB
[2022-11-29 00:49:59,639] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 188.38 GB, percent = 10.6%
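
The CPU memory growth during baseline initialization can be read off the before/after lines above. The following sketch parses the two `see_memory_usage` CPU lines copied from that log and computes the difference; the regular expression and the helper name `cpu_used_gb` are assumptions made for this example, not part of DeepSpeed.

```python
import re
from typing import List

# Matches the "used = X GB" figure in see_memory_usage CPU lines.
CPU_MEM_RE = re.compile(r"CPU Virtual Memory:\s+used = ([\d.]+) GB")


def cpu_used_gb(log_text: str) -> List[float]:
    """Return every reported CPU 'used' value, in log order."""
    return [float(value) for value in CPU_MEM_RE.findall(log_text)]


baseline_log = """\
[2022-11-29 00:47:13,192] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 20.45 GB, percent = 1.2%
[2022-11-29 00:49:59,639] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 188.38 GB, percent = 10.6%
"""

before, after = cpu_used_gb(baseline_log)
growth_gb = after - before  # memory added while the model is built on CPU
```

For the baseline run this gives a growth of roughly 168 GB between the "before init" and "after init" readings.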

Use Meta-tensor to load the model on GPU directly

To remedy this problem, we use meta tensors to create the model, then inject the DeepSpeed-Inference transformer kernels, and finally load the checkpoint directly into real tensors on the GPU. Here are the changes you need to make to the inference command to use this feature:

deepspeed --num_gpus 2 inference-test.py --name facebook/opt-30b --batch_size 1 --ds_inference --use_kernel --use_meta_tensor --checkpoint_path '/tmp/'

As we can see, the initialization latency drops to about 22 sec, which is roughly 7.5x faster than the baseline (164.7 sec), while using very little CPU memory:

[2022-11-29 01:02:28,876] [INFO] [utils.py:827:see_memory_usage] before init
[2022-11-29 01:02:28,877] [INFO] [utils.py:828:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB
[2022-11-29 01:02:28,877] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 20.38 GB, percent = 1.2%
[2022-11-29 01:02:29,750] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed info: version=0.7.6+1e4c3ea8, git-hash=1e4c3ea8, git-branch=fix-opt-injection
[2022-11-29 01:02:29,751] [WARNING] [config_utils.py:67:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2022-11-29 01:02:29,751] [INFO] [logging.py:68:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
initialization time: 896.4962959289551ms
[2022-11-29 01:02:31,599] [INFO] [utils.py:827:see_memory_usage] after init
[2022-11-29 01:02:31,600] [INFO] [utils.py:828:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB
[2022-11-29 01:02:31,600] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 20.47 GB, percent = 1.2%
[2022-11-29 01:02:33,141] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 7168, 'intermediate_size': 28672, 'heads': 56, 'num_hidden_layers': -1, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-12, 'mp_size': 2, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'mlp_act_func_type': <ActivationFuncType.ReLU: 2>, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False, 'max_out_tokens': 1024, 'scale_attn_by_inverse_layer_idx': False}
Loading 7 checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:21<00:00,  2.31s/it]
checkpoint loading time at rank 0: 21.937642335891724 sec
Loading 7 checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:21<00:00,  3.13s/it]
Loading 7 checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:22<00:00,  2.32s/it]
checkpoint loading time at rank 1: 22.054820775985718 sec
Loading 7 checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:22<00:00,  3.15s/it]
[2022-11-29 01:02:59,037] [INFO] [utils.py:827:see_memory_usage] after init_inference
[2022-11-29 01:02:59,038] [INFO] [utils.py:828:see_memory_usage] MA 28.36 GB         Max_MA 29.13 GB         CA 29.71 GB         Max_CA 30 GB
[2022-11-29 01:02:59,038] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 22.99 GB, percent = 1.3%
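
The speedup can be checked directly from the logs. The sketch below extracts the per-rank checkpoint loading times from the meta-tensor run and compares the slowest rank against the 164.7 sec baseline initialization. The regular expression and the helper name `load_times` are assumptions for this example, and the comparison ignores the sub-second model-creation step.

```python
import re
from typing import Dict

LOAD_RE = re.compile(r"checkpoint loading time at rank (\d+): ([\d.]+) sec")


def load_times(log_text: str) -> Dict[int, float]:
    """Map each rank to its reported checkpoint loading time in seconds."""
    return {int(rank): float(sec) for rank, sec in LOAD_RE.findall(log_text)}


meta_log = """\
checkpoint loading time at rank 0: 21.937642335891724 sec
checkpoint loading time at rank 1: 22.054820775985718 sec
"""

BASELINE_INIT_SEC = 164.7  # "initialization time" from the baseline log

times = load_times(meta_log)
slowest = max(times.values())          # all ranks must finish loading
speedup = BASELINE_INIT_SEC / slowest  # baseline time relative to the slowest rank
```

Taking the slowest rank is the conservative choice, since initialization is only complete once every rank has loaded its shards.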

@mrwyattii mrwyattii left a comment

Tested with new unit tests in #2561 and working. Code changes look good, but the maybe_copy and maybe_copy1 functions in load_checkpoint.py are confusing. @RezaYazdaniAminabadi could we be more descriptive with the names or add a comment to distinguish how they are different?

@RezaYazdaniAminabadi

Tested with new unit tests in #2561 and working. Code changes look good, but the maybe_copy and maybe_copy1 functions in load_checkpoint.py are confusing. @RezaYazdaniAminabadi could we be more descriptive with the names or add a comment to distinguish how they are different?

will do!

chenwuperth commented Jan 8, 2023

Hi, when I tried the same test suite with the latest versions (>= 0.7.7) on GPT-NeoX, it loaded the model directly from the checkpoints to GPU (with a small CPU memory footprint), but it consistently generated gibberish output:
deepspeed --num_gpus 4 inference-test.py --name EleutherAI/gpt-neox-20b --ds_inference --use_kernel --use_meta_tensor --checkpoint_path /tmp
where /tmp is the directory the HF model (https://huggingface.co/EleutherAI/gpt-neox-20b) was downloaded to.

The output is:
2023-01-08 16:12:33,147 [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 8.54 GB, percent = 3.6%
Setting pad_token_id to eos_token_id:0 for open-end generation.

Free memory : 3.142761 (GigaBytes)
Total memory: 15.781921 (GigaBytes)
Requested memory: 0.843750 (GigaBytes)
Setting maximum total tokens (input + output) to 1024

Setting pad_token_id to eos_token_id:0 for open-end generation.
Setting pad_token_id to eos_token_id:0 for open-end generation.
Setting pad_token_id to eos_token_id:0 for open-end generation.
Setting pad_token_id to eos_token_id:0 for open-end generation.
generation time is 1.3323054313659668 sec
generation time is 1.3322982788085938 sec
generation time is 1.332315444946289 sec

in=DeepSpeed is a machine learning framework
out=DeepSpeed is a machine learning framework##
101ISS101"["[ -#ISOVCl #[[ISInd[IS #
IS#Ty"OM##ISOMEUkOVOISIndMovIS101UnsAgainBUkIS�comp
Hub

------------------------------------------------------------
generation time is 1.3323240280151367 sec

[2023-01-08 16:12:37,474] [INFO] [launch.py:350:main] Process 4969 exits successfully.
[2023-01-08 16:12:38,476] [INFO] [launch.py:350:main] Process 4970 exits successfully.
[2023-01-08 16:12:38,476] [INFO] [launch.py:350:main] Process 4971 exits successfully.
[2023-01-08 16:12:38,476] [INFO] [launch.py:350:main] Process 4972 exits successfully.
Is this somehow related to #2401, or have you seen this before? I am running on a single node with 4 V100 GPUs. The unit tests (#2561) seem to skip content generation. Should I report a new issue? Thanks.

Quentin-Anthony added a commit to EleutherAI/DeeperSpeed that referenced this pull request Mar 9, 2023
* refactor to use mem_access (#2317)

* add quant unit test (#2315)

* add quant unit test

* add codeowner

* format fix

* fix undefined symbol: curandSetPseudoRandomGeneratorSeed

* modify ref fn name and add comment

* add comments

* add 4bit quant 16groups

* fix

* modify groups in ref code

* parameterize tensor shape

* single param

* detach tensor

* remove -lcurand flag

* add back -lcurand flag

Co-authored-by: Ammar Ahmad Awan <[email protected]>

* only override forward if using cuda-graph (#2291)

* Add more options to inference benchmark (#2325)

* bump to 0.7.4

* MOE residual matmult unit test (#2323)

MOE residual matmul unit tests

Co-authored-by: Michael Wyatt <[email protected]>
Co-authored-by: Ammar Ahmad Awan <[email protected]>

* MOE matmult with memaccess (#2336)

* Fix formatting

* Remove redundant variable

* Refactor residual add kernels (#2333)

Co-authored-by: Ammar Ahmad Awan <[email protected]>

* mem access for quantize kernel (#2331)

* mem access for quantize kernel

* format

* format fp32

* modify quant kernel

* modify quant kernel2

* modify format

* format

* fix comments in pytest

* fix comments in pytest

* format

* rerun

Co-authored-by: Ammar Ahmad Awan <[email protected]>
Co-authored-by: Connor Holmes <[email protected]>

* increase min pre-commit versions (#2346)

* Extend scratch buffer for long prompts (#2212)

Co-authored-by: Reza Yazdani <[email protected]>
Co-authored-by: Reza Yazdani <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>

* fix zero docs (#2350)

* Inference profiling updates/fixes (#2348) (#2349)

Co-authored-by: Jeff Rasley <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>

* Kernel Data Conversion Utility (#2327)

* Unify macro definitions and constants in a single file

* Conversion utility implementation.

* Fix reversion from formatting

* Bugfixes after testing with correct DeepSpeed

* Inline markers are available on both HIP + CUDA

* Add Onebit Optimzers in __init__ (#2340)

Co-authored-by: Saeyeol Lee <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>

* docs(mixture-of-experts-inference): fix typo in tuto (#2345)

Co-authored-by: Olatunji Ruwase <[email protected]>

* download cifar to blob storage (#2342)

Co-authored-by: Olatunji Ruwase <[email protected]>

* Refactor gptj_residual_add kernels for better readability (#2358)

Co-authored-by: Reza Yazdani <[email protected]>

* Updated issue templates (#2363)

* Update issue templates

* fix cuda invalid config error in dequant kernel (#2362)

* format

* remove round fn

* Add missing pytest fixture scope (#2353)

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>

* Extend residual_add kernel tests to conver pre_attn_norm (#2354)

Co-authored-by: Jeff Rasley <[email protected]>

* Refactor fused_bias_residual kernels for better readability (#2356)

Co-authored-by: Olatunji Ruwase <[email protected]>

* Capture error message during sweep tests (#2351)

* Collect error messages in results.csv

Co-authored-by: Michael Wyatt <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>

* fix an exception when recursively casting dicts to fp16 (#2370)

* Refactor remaining distributed tests (#2216)

* batch of refactored tests

* more test refactoring

* fp16 test refactor

* more refactors

* added DistributedFixture class

* applied DistributedFixture to first batch of tests as a trial

* added DistributedFixture test and documentation

* last tests

* fixes for refactored tests

* remove subdirs in workflow files

* fix pytest syntax error

* fix another syntax error

* update imports

* use DistFixture with elastic checkpoint test

* missing import

* update to shared class tmpdir for elastic test

* moved test files

* avoid duplicate test file name

* last refactor and moving test files

* formatting

* fix broken import

* testing forked AMD tests

* update abstract method

* use blob storage for accelerate and transformers tests

* upgrade torch for acclerate CI

Co-authored-by: Olatunji Ruwase <[email protected]>

* Fix the MLP output tensor's shape (#2380)

* allow building with latest CUDA (11.8), it is backwards compatible (#2390)

* pin transformers version for unit tests (#2402)

* Change type to tuple in replace_wo_policy isinstance check (#2387)

Update the isinstance check inside the `replace_wo_policy` function to `tuple` and `str` instead of `dict`, since the layers are provided as a `tuple` type.

Co-authored-by: Lev Kurilenko <[email protected]>
Co-authored-by: Molly Smith <[email protected]>
Co-authored-by: Lok Chand Koppaka <[email protected]>
Co-authored-by: Ammar Ahmad Awan <[email protected]>

* Checkpoint backwards-compatbility workaround (#2384)

* Add predicated global load (#2373)

Co-authored-by: Reza Yazdani <[email protected]>

* MII blog post (#2418)

Co-authored-by: Samyam Rajbhandari <[email protected]>
Co-authored-by: Ammar Ahmad Awan <[email protected]>

* Fix figure reference (#2419)

* [docs] update news items

* [docs] add mii repo link

* Add SLURM Multinode Runner (#2404)

Signed-off-by: Dashiell Stander <[email protected]>
Co-authored-by: Dashiell Stander <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>

* Fix issue with corrupted output on long generation for GPT (#2359)

Co-authored-by: Reza Yazdani <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>

* MII blog title update on Readme

* DeepSpeed-MII title change in website

* Fix GPT Neo-X multi-gpu inference (#2401)

Co-authored-by: Reza Yazdani <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>

* MII-Public and MII-Azure subheading in mii post

* CI fixes related to triton (#2422)

* [docs] update mii blog title (#2423)

* add SD injection policy (#2381)

Co-authored-by: Reza Yazdani <[email protected]>
Co-authored-by: Reza Yazdani <[email protected]>

* fix checkpoint loading when it is a dictionary (#2425)

* Make error regex more generic in collect_results.py (#2415)

Co-authored-by: Jeff Rasley <[email protected]>

* fixes #2389 (#2411)

truncating expert param storage for checkpointing

Co-authored-by: Alexander Jipa <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>

* Fix for inference gpt-j test (#2430)

* fix for gpt-j failing due to tokenizer error

* limit number of gpt-j tokens generated due to low memory

* Fixing bug 2361 (#2410)

* fixing bug 2361

* adding pytest for config initialization

* chaning expected output to FusedAdam

* remove print statement

* running yapf on modified files

* running pre-commit formatting

Co-authored-by: Olatunji Ruwase <[email protected]>

* Universal checkpoint for zero stage 1 (#2284)

* Refactor universal checkpointing and tensor fragments

* Formatting

* Support zero stage1; Expand TP dim

* Remove debug prints

* Detect sharded optimizer state

* Format fixes

* Encode reshaping guide

* More symbolic constants

Co-authored-by: Michael Wyatt <[email protected]>

* only add deps if extra is explictly called (#2432)

* Add TestInjectionPolicy inference unittest class for testing custom injection policies (#2426)

This PR adds a TestInjectionPolicy inference unittest class for testing custom injection policies.

This test differs from the existing tests in that the injection_policy dictionary is explicitly specified when calling the DeepSpeed init_inference API.

The google/t5-v1_1-small text2text-generation model and the roberta-large fill-mask model are added as tests with the injection policy explicitly specified.

This is done to expand our unittest coverage to test the path where the replace_wo_policy function is invoked (see GH-2387).

Co-authored-by: Lev Kurilenko <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>

* [memory estimators] new config args sync (#2431)

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>

* parallelize writing of layer checkpoint files across data parallel instances (#1419)

* parallelize layer checkpoints across data parallel groups

* use partition_uniform to determine start/end index values

* formatting fix

* config: add option for parallel write of layer checkpoints in pipeline stage

* yapf fixes

* enable parallel layer write according to config param

* avoid extraneous makedir when rank 0 writes all layers

Co-authored-by: Olatunji Ruwase <[email protected]>

* Fix broken link to DeepSpeed Megatron fork (#2440)

Co-authored-by: Lev Kurilenko <[email protected]>

* bump to 0.7.5

* Fix Bug #2319 (#2438)

Co-authored-by: Jeff Rasley <[email protected]>

* update pytorch pool operator function signiture (#2443)

* update pytorch pool operator function signiture

* fix the case where kwargs is None

* Fix build issues on Windows (#2428)

* Fix build issues on Windows

* small fix to complie with new version of Microsoft C++ Build Tools

Co-authored-by: Reza Yazdani <[email protected]>
Co-authored-by: Reza Yazdani <[email protected]>

* rollback ds config changes (#2395)

* rollback ds config changes

* fix format

* Fix error when output_file is a relative path without a prefix (#2397)

Co-authored-by: Benjamin Steenhoek <[email protected]>

* fix restuls and exprs path to use absolute path

* write out optimial config after tuning

* fix format

* assert tuning result dir creation

Co-authored-by: Benjamin Steenhoek <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>

* Use CUDA events for inference model profiling (#2371)

* use cuda event timers for model profiling

* Fixing a mismatch in basic adam test. (#2447)

* Reduction Kernel Utility (#2436)

* Initial reduction_utils.h implementation

* Add initialization helper, ensures correct min/max behavior

* Remove unnecessary warp sync

* deepspeed/launcher/launch.py: add option '--enable_each_rank_log logdir' (#2409)

* Fixes for various CI problems (#2457)

* check only major CUDA version in CI

* update expected torch latest version

* pin torch latest to 1.12 until issues with 1.13 are resolve

* wrong expected torch version

* Update nv-torch18-v100.yml

* remove forked from pytest option due to cuda re-initialization errors

* removed expected torch version from inference tests, causing errors currently

* fix various bugs that popped up

* move all tests over to cu111 runners, cu113 runners having problems

* Cache Allocation and Softmax Fixes (#2433)

Co-authored-by: Reza Yazdani <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>

* fixing the checkpoint loading at inference-engine (#2429)

Co-authored-by: Ammar Ahmad Awan <[email protected]>

* Create a new folder structure to isolate model-specific code in DS (#2464)

* don't gather partitioned activations for mp size 1 (#2454)

* don't gather partitioned activations for mp size 1

* add inline comment for the change

Co-authored-by: Olatunji Ruwase <[email protected]>

* Updating autotune json default in docs. (#2476)

* Updating autotune default in docs.

* Running pre-commit.

* Added MLFLOW environment variables for logging metrics within trainig… (#2477)

* Added MLFLOW environment variables for logging metrics within trainign script

* exporting MLFlow env variables from AML env

Co-authored-by: Cheng Li <[email protected]>

* fix accelerate link (#2481)

* Add correct memory-allocation at DeepSpeed-Attention (#2474)

Co-authored-by: Jeff Rasley <[email protected]>
Co-authored-by: Connor Holmes <[email protected]>

* Fix CI issues related to cupy install (#2483)

* remove any cupy install when setting up environments

* revert previous changes to run on cu111 runners

* fix for when no cupy is installed

* remove cupy uninstall for workflows not using latest torch version

* update to cu116 for inference tests

* fix pip uninstall line

* move python environment list to after DS install

* remove cupy uninstall

* re-add --forked

* fix how we get cupy version (should be based on nvcc version)

* [docs] add SD tutorial to news

* [docs] add SD tutorial to deepspeed.ai news

* Add `scale_attn_by_inverse_layer_idx` feature (#2486)

* Add scale_attn_by_inverse_layer_idx feature

* Fix layer_id bug

* Fix scaling value

Co-authored-by: Connor Holmes <[email protected]>
Co-authored-by: Reza Yazdani <[email protected]>

* Stable Diffusion Enhancements (#2491)

Co-authored-by: cmikeh2 <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
Co-authored-by: Reza Yazdani <[email protected]>

* stage_1_and_2.py: no allreduce needed when mp size is 1 (#2494)

* Make bf16_optimizer work for non pipeline (#2470)

* Fix nightly CI tests (#2493)

* fix for lm-eval nightly tests and add gpt-j to MPtest because OOM on single GPU

* add nv-nightly badge

* Make data contiguous before the inplace reshape-copy_ function (#2489)

Co-authored-by: Michael Wyatt <[email protected]>

* Fix typos: deepseed -> deepspeed (#2499)

* bump to 0.7.6

* DeepSpeed inference config. (#2459) (#2472)

Changes to inference API to use accept a config dict and cleaning up Inference Engine to utilize the newly added inference config.

Co-authored-by: Michael Wyatt <[email protected]>

* Update docs to autogenerate pydantic config model docs (#2509)

* update zero config docs
* add autogenerated docs for pydantic models used in ZeRO and Inference configs

* Add max_tokens alias to max_out_tokens arg to maintain backwards compatibility (#2508)

This PR adds a max_tokens alias to the max_out_tokens argument in the init_inference API to support backwards compatibility after the config refactor PR https://github.com/microsoft/DeepSpeed/pull/2472.

Thanks @molly-smith and @mrwyattii.

* Deepspeed quantization library v0.1 (#2450)

* Initial commit Deepspeed quantization library

* Match function signatures

* Add Quantization Kernel

* adding offset comparision and precommit changes

* format fixes

* FIle name changes

* pt_binding_changes

* test name change

* Integer quantization, minor refactors

* Add directed test_case

* format fixes

* Move param calculation to constructor of params class

* Use local function and add elemsPerBlock

* change function to be specalized

* sub block reduce

* add new schedule

* Add new schedule test case

* fix illegal writes in sch1

* Style fixes in comments

Co-authored-by: Connor Holmes <[email protected]>

* Fix backward compatibility for InferenceConfig (#2516)

* Make new InferenceConfig backwards compatible with previous init_inference API

Co-authored-by: Jeff Rasley <[email protected]>

* Add missing Inference sub-configs (#2518)

* Add note about nvcc/hipcc requirement (#2519)

* Update codeowners (#2525)

* Initial dequant library implementation (#2521)

* Fixes for torch 1.14 due to new torch.numel return type (#2522)

* fixes for new torch.numel return type

* address comment

* Ensure  is initialized for SD (#2534)

* Make DS-Inference config readable from JSON (#2537)

* Add MII tests (#2533)

Adding MII tests to ensure changes to DS-Inference do not break MII

* Remove mutable default parameter in init_inference() (#2540)

A mutable default value is dangerous because editing it will change the
value for all future calls to the function. The value is itself edited
later in the function, so this problem will likely be encountered sooner
or later.

Co-authored-by: Michael Wyatt <[email protected]>

* Change Where DS/Triton is Used in Stable Diffusion (#2536)

* Change utilization of DS/Triton kernels

* add config at Clip-encoder

Co-authored-by: Reza Yazdani <[email protected]>
Co-authored-by: Reza Yazdani <[email protected]>

* Pass down the new DS inference config to replace_transformer_layer. (#2539)

* pass down the new DS inference config to replace_transformer_layer.

* remove quantize_settings and rename the ep_mp_group.

* Fix model_config passing. Fixes gptj issue with wrong output.

* fix small bug in gpt-neo.

Co-authored-by: Reza Yazdani and Michael Wyatt

* Adding Gradient Accumulation Data Type Config (#2512)

* Adding gradient accumulation dtype config.

* Switching to new DtypeEnum

* Adding standalone check function, and unit tests

* Variable disambiguation

* Adding checks for unsupported states.

* Updating for PR comments.

* Reorganizing unit test.

Co-authored-by: Olatunji Ruwase <[email protected]>

* Report progress at gradient accumulation boundary (#2553)

* report progress at gradient accumulation boundary

* format

* format

* encoded ds config into command line argument when launching child processes in autotuning (#2524)

* rollback ds config changes

* fix format

* Fix error when output_file is a relative path without a prefix (#2397)

Co-authored-by: Benjamin Steenhoek <[email protected]>

* fix restuls and exprs path to use absolute path

* use base64 encoded ds config as cmd arg

* fix format

* remove assert

* write out optimial config after tuning

* fix format

* no need to update ds config path when encoding ds config

* udpate

* do not use abs path for result and expr dir

* fix conflicts

* fix run mode

* fix format

* fix format

Co-authored-by: Benjamin Steenhoek <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>

* add missing moe deprecated fields to inference config (#2556)

* Abstract accelerator (step 1) (#2504)

* Establish building block of abstract accelerator

* Change .*Tensor variable to @property

* [op builder] add op builder reflection to allow enumerate of builders in all_ops.py and builder_names.py

* change @abstractproperty to @property @abstractmethod

Co-authored-by: Olatunji Ruwase <[email protected]>

* Fix invalid check of recorded parameter orders in zero stage3. (#2550)

Co-authored-by: Olatunji Ruwase <[email protected]>

* bump to 0.7.7

* docs: Update the recent url for Megatron-LM (#2564)

* use get_global_rank if available (#2567)

* Add Determined to open-source DL frameworks (#2573)

* Support fp32 gradaccum for bf16 model (#2566)

* allow bf16 model with fp32 gradient accumulation datatype

* allow fp32 gradient accumulation and bfloat16 model in amp mode

* alternative fix for grad accumulation type mismatch.  In the case of zero optimizer we should have grad accum type == model data type

Co-authored-by: Olatunji Ruwase <[email protected]>

* Drop Maxwell Support (#2574)

* Officially drop Maxwell support

* Formatting

* Comparison mismatch fix

* Fix quantized-inference & Add generic support of checkpoint loading (#2547)

* fix checkpoint loading when it is a dictionary

* fix some issues with saving ckpt & int8 inference

* fix quantized-inference & add generic support of checkpoint loading

* remove int8 hard-coded flag

* fix mlp return tensors

* fix several issue to load checkpoints of GPT-J, GPT-NEOX, and OPT with different TP-size

* add more comments & description for checkpoint-loading module

Co-authored-by: Michael Wyatt <[email protected]>

* Fix MegatronLayerPolicy to have megatron_v2=True (#2579)

This PR updates the MegatronLayerPolicy to set megatron_v2=True, which is required in order to properly transpose in the replace_with_policy() function.

After the change in this PR, in conjunction with PR #99 in the Megatron-DeepSpeed fork, the Megatron text-generation example works with DS inference.

* Update barrier and reduce_scatter_base to conform to PyTorch signatures (#2570)

Co-authored-by: Jeff Rasley <[email protected]>

* Support N-dimension input in quantization kernel (#2575)

* Add support for inputs > 2D

* use vec

* Add N-Dim support to Dequant kernel

* merge master and fix format

* Bug Fix

* fix num_bits

* Fix dequant

Co-authored-by: Connor Holmes <[email protected]>

* Add checkpoint sharding unit tests (#2561)

* added checkpoint sharding tests

* Updating docs README (#2587)

* Updating docs README with API update procedure.

* Addressing comments.

Co-authored-by: Jeff Rasley <[email protected]>

* Updating API docs (#2586)

Co-authored-by: Jeff Rasley <[email protected]>

* Fix issues w. python 3.6 + add py-version checks to CI (#2589)

* get mask token from tokenizer (#2592)

* bump to 0.7.8

* DeepSpeed Data Efficiency Library (#2585)

Co-authored-by: Jeff Rasley <[email protected]>

* fix blog link (#2600)

* Migrate ops tests to new inference_ops marker (#2599)

* Migrate ops tests to new inference_ops marker

* Disable by default

* Add missing test cases

* Reorder such that inference_ops will run[fail] first

* Move layer norm to new schedule (#2590)

* Move layer norm to new schedule

* Pre-commit fixes

* fix comments

* format fixes

* Merge unrolls

* format fixes

* camelCase

* format fixes

* revert unwanted file

* move pow2 function

* format fixes

Co-authored-by: Connor Holmes <[email protected]>

* [deepspeed/autotuner] Bug fix for binary search for batch size (#2162)

* bug fix for binary search for batch size

* fix binary search termination condition

* add fix for older pydantic versions (#2611)

* Use rocm/pytorch:latest (#2613)

* skip torch.zeros and tensor.copy_ when model parallel is not used (#2479)

Co-authored-by: Olatunji Ruwase <[email protected]>

* call empty_cache to really free up GPU memory as described in comment (#2620)

Co-authored-by: Olatunji Ruwase <[email protected]>

* Remove GatheredParameters context from replace_with_policy (#2591)

This PR removes the zero-inference GatheredParameters context from replace_with_policy, because zero-inference is no longer needed after the introduction of meta tensor support for BLOOM.

* fixes #2498 (#2603)

taking gradient accumulation steps into account for throughput calculation

Co-authored-by: Alexander Jipa <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
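
One way to read the throughput fix in #2603 is sketched below: the samples consumed per optimizer step scale with the number of gradient accumulation steps, so leaving that factor out under-reports throughput. The function and parameter names are hypothetical and are not taken from the DeepSpeed source.

```python
def samples_per_second(micro_batch_size, grad_accum_steps, world_size, step_time_s):
    """Throughput over one optimizer step (hypothetical helper, not DeepSpeed's API)."""
    if step_time_s <= 0:
        raise ValueError("step_time_s must be positive")
    # Each optimizer step consumes grad_accum_steps micro-batches on every rank.
    samples_per_step = micro_batch_size * grad_accum_steps * world_size
    return samples_per_step / step_time_s
```

With a micro-batch of 4, 8 accumulation steps, 2 ranks, and a 2-second step, this counts 64 samples per step rather than 8.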

* Update AVX512 Detection (#2621)

* Update cpuinfo AVX512 detection

* Missing conversion from `_mm256` to `_mm256i`

Co-authored-by: Olatunji Ruwase <[email protected]>

* Add Megatron CI workflow (#2614)

* added megatron unit test

* Update nv-megatron.yml

Co-authored-by: Olatunji Ruwase <[email protected]>

* [inference] check for unsupported model generate args (#2627)

* [launcher] parse hostfile via regex and added error checks (#2626)

* Unit tests setup own venv (#2628)

add reusable workflow that sets up fresh venv for each test and prints relevant environment info

* add enable_each_rank_log to deepspeed/launcher/runner.py (#2571)

* Fix typo in autotuner.py (#2639)

* [zero-3] Handle forward parameter return correctly in nested cases (#2642)

Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>

* [inference] ds-attention refactor w.r.t. ops (#2623)

* Fix issue w. bloom when changing tp size (#2645)

* fix assertion error in zero stage 3 (#2647)

* tweaks to ds-attn, distilbert policy, and mup (#2649)

* [doc] fix `min_loss_scale` default (#2660)

* [doc] fix `min_loss_scale` default

* align

* [launcher] fail gracefully if hostname -i doesn't work as expected (#2631)

Co-authored-by: Olatunji Ruwase <[email protected]>

* Fix Opt injection (#2541)

* fix Opt injection & add injection verification check at inference test

* fix several issues

* remove fixture

* remove check_injection when no kernel is injected

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>

* Abstract accelerator (step 2) (#2560)

* Abstract accelerator (step 2)

* more flex op_builder path for both installation and runtime

* add SpatialInferenceBuilder into cuda_accelerator.py

* use reflection to make cuda_accelerator adapt to CUDA op builder change automatically

* clean up deepspeed/__init__.py

* add comments in cuda_accelerator for no torch path

* Update deepspeed/env_report.py

Change env_report.py according to suggestion

Co-authored-by: Michael Wyatt <[email protected]>

* reduce the range of try...except for better code clarity

* Add porting for deepspeed/ops/random_ltd/dropping_utils.py

* move accelerator to top directory and create symlink under deepspeed

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>

* Remove unnecessary device synchronization for stage 2 (#2500)

* Remove unnecessary device synchronization for stage 2

* Remove unnecessary device synchronization for stage 2

Co-authored-by: liyidong.lyd <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Joe Mayer <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>

* [Bug Fixed] use torch.cuda.is_available() (#2661)

Co-authored-by: Olatunji Ruwase <[email protected]>

* [fp16] lower initial_scale_power (#2663)

Co-authored-by: Olatunji Ruwase <[email protected]>

* fix Tensor contiguous bug in model_compression (#2671)

double check the unit tests

* [inference] ds-mlp refactor w.r.t. ops (#2668)

* real_accelerator validation check for both accelerator and deepspeed.accelerator path (#2685)

* remove duplicated code in ZeRO (#2655)

Co-authored-by: Olatunji Ruwase <[email protected]>

* Add mlflow logging for aml (#2495)

* add logging changes

* try w/out abspath

* undo last change

* start mlflow debug

* remove mlflow from export_envs

* add mlflow logging for reversed

* remove mlflow.start_run

* add back start run

* don't clean cmd

* print os environment variables

* remove first start run

* add run_id to mlflow star

* remove context managers

* move last end run

* add extra parent start_runs

* add run id logging

* add logging to run_ds_config

* change run_id to run_name

* add back context managers and run_id logs

* remove context mng

* debug environment variable

* reset environment variables

* add env variable deletion

* clean up

* remove unused import

* fix yapf/whitespace errors

Co-authored-by: Cheng Li <[email protected]>

* fix import path to op_builder (#2687)

Co-authored-by: Olatunji Ruwase <[email protected]>

* Pass training flag to forward call from Eval (#2604)

Co-authored-by: Jeff Rasley <[email protected]>
Co-authored-by: Reza Yazdani <[email protected]>

* Extend quantization utils features (#2683)

* Extend quantization utils features

* remove unwanted files

* fix cache setting

Co-authored-by: Connor Holmes <[email protected]>

* [GatheredParameters] add support for any iterator (#2664)

Co-authored-by: Olatunji Ruwase <[email protected]>

* fix for latest diffusers (#2699)

Co-authored-by: Jeff Rasley <[email protected]>

* exclude benchmarks during install (#2698)

* using correct loss scale in zero step (#2695)

Co-authored-by: Olatunji Ruwase <[email protected]>

* non-MoE stage 1 requires CG disabled (#2703)

Co-authored-by: Olatunji Ruwase <[email protected]>

* remove print side effect from importing deepspeed (#2704)

* ZeRO3 handling frozen weights (#2653)

* bump to 0.8.1

* CUDA optional deepspeed ops (#2507)

* CPU-Adam: add compile-flag to enable param-copy from CPU to GPU

* guard the CUDA-related include files and variables

* remove CUDA dependency from op_builder when building against CPU

* fixing the builder issues

* fix formatting

* return true when there is no mismatch on the cuda version

* guard for when cuda is not available & test with cpu-only environment

* Update cpu_adam and cpu_adagrad

* Format fixes

* Add configurable half precision type; Build/run in CUDA environment

* Run cpu_adam and cpu_adagrad in cpu only environment

* Mark CUDA only unit tests

* CPU environment CI

* Format fixes

* Remove --forked

* Add --forked

* CPU only CI should pass

* Format fixes

* Format fixes

* Remove scattered pytest.skip

* Fix cpu_adam unit test

* Update .github/workflows/nv-torch-latest-cpu.yml

Co-authored-by: Michael Wyatt <[email protected]>

* Update .github/workflows/nv-torch-latest-cpu.yml

Co-authored-by: Michael Wyatt <[email protected]>

* Address PR feedback

* OpenMP linking

* Fix unit tests

Co-authored-by: Reza Yazdani <[email protected]>
Co-authored-by: Reza Yazdani <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>

* remove master branch from CI triggers (#2712)

* [install] only add deepspeed pkg at install (#2714)

Co-authored-by: Olatunji Ruwase <[email protected]>

* update for lm-eval==0.3.0 (#2713)

Co-authored-by: Jeff Rasley <[email protected]>

* BF16 optimizer for BF16+ZeRO Stage 1 (#2706)

* BF16 optimizer only with ZeRO stage 1.

* Updating to grad accum of fp32 for BF16 ZeRO1 case.

Co-authored-by: Olatunji Ruwase <[email protected]>

* fix typo (#2718)

Co-authored-by: Jeff Rasley <[email protected]>

* Inference Refactor (replace_with_policy, model_implementations) (#2554)

Co-authored-by: Lev Kurilenko <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>

* Change zero_grad() argument to match pytorch (#2741)

* Automatic tensor parallelism v2 (#2670)

* loop through pipe.model

* tp_parser first draft

* client_module must be type object

* Simplify layernorm tracking. Add unittest.

* cleanup

* Add more models to unittest

* cleanup inference pytest for merging

* Add unittest

* cleanup

* pre-commit

* unittest id and pytest marker

* try marian for unittest

* precommit

* Move tp code to separate file

* Add new auto tp file

* pre-commit and type

* Update deepspeed/module_inject/auto_tp.py

Co-authored-by: Michael Wyatt <[email protected]>

* Update deepspeed/module_inject/auto_tp.py

Co-authored-by: Michael Wyatt <[email protected]>

* Update tests/unit/inference/test_inference.py

Co-authored-by: Michael Wyatt <[email protected]>

* remove unused fillmask function

Co-authored-by: Michael Wyatt <[email protected]>

* fixing optimizer sanity check (#2742)

Co-authored-by: Olatunji Ruwase <[email protected]>

* [GatheredParameters] fix memory leak (#2665)

* [GatheredParameters] fix memory leak

* simplify

* cleanup and move

* style

* Formatting

* fix test

* fix test

* fix test take 2

* Trigger CI

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Joe Mayer <[email protected]>

* Abstract accelerator (step 3) (#2677)

* Integrate accelerator abstraction interface into deepspeed/

* Fix error message in fp16/fused_optimizer

* fix error message in fp16/unfused_optimizer.py

* assign get_accelerator().pin_memory() result to input Tensor name

* no need to check cuda and whether nvtx supported

* move try-except into inner most block

* call Event() and Stream() in get_accelerator() for data type

* Make Stream and Event as properties of abstract interface so they can be used as data type in deepspeed

* Apply op_builder backend api change from #2705 from @jeffra

* fix tests where Builder NAME is used

* keep original ...Builder.NAME interface instead of ...Builder().NAME interface

* fix builder closure for installation

* fix randomltd builder

* add comments to clarify create_op_builder and get_op_builder

* fix compatibility with pip install -e

Co-authored-by: Cheng Li <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>

* Fix autotuning so that it records Floating Point Operations per second, not microsecond (#2711)

* Fix how autotuning reports TFLOPS so that they are reported in FLOPS per second, not millisecond

Co-authored-by:  Nick Sarkauskas <[email protected]>
Co-authored-by: Quentin Anthony <[email protected]>
Signed-off-by: Dashiell Stander <[email protected]>

* Actually it is microseconds -> seconds

Signed-off-by: Dashiell Stander <[email protected]>

* Actually it is microseconds -> seconds

Signed-off-by: Dashiell Stander <[email protected]>

Signed-off-by: Dashiell Stander <[email protected]>
Co-authored-by: Nick Sarkauskas <[email protected]>
Co-authored-by: Quentin Anthony <[email protected]>
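
The unit correction in #2711 (elapsed time measured in microseconds, rate reported per second) amounts to the arithmetic sketched below. The helper name and signature are assumptions made for illustration; they do not reproduce the autotuner code.

```python
def tflops(flop_count, elapsed_us):
    """Convert a FLOP count and an elapsed time in microseconds to TFLOPS.

    Hypothetical helper for illustration only.
    """
    if elapsed_us <= 0:
        raise ValueError("elapsed_us must be positive")
    elapsed_s = elapsed_us / 1e6          # microseconds -> seconds
    return flop_count / elapsed_s / 1e12  # FLOP/s -> TFLOP/s
```

Dividing by the raw microsecond count instead of `elapsed_s` would understate the rate by a factor of one million.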

* fix a misspelled attribute (#2750)

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>

* [zero] remove misleading dtype log (#2732)

Co-authored-by: Olatunji Ruwase <[email protected]>

* Fix softmax backward (#2709)

* Reset KV-cache at the beginning of text-generation

* Add new backward kernel to handle large softmax-length

* remove unrelated changes

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Connor Holmes <[email protected]>

* Skip test_bias_gelu unit test if torch < 1.12 (#2754)

This PR adds a torch version check in the test_bias_gelu unit test to skip if the torch version < 1.12. This is due to gelu implementation differences in versions prior to 1.12.
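
A version gate of this kind can be sketched as follows. The helper below is an assumption for illustration, not the code added in #2754; it only compares the major and minor components of a version string.

```python
import re


def torch_older_than(version_string, minimum=(1, 12)):
    """Return True if a version string such as '1.11.0+cu113' is below `minimum`."""
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_string!r}")
    # Compare (major, minor) as integers so that 1.9 sorts below 1.12.
    return (int(match.group(1)), int(match.group(2))) < minimum
```

In a test, the result of `torch_older_than(torch.__version__)` could be used to decide whether to call `pytest.skip(...)` before comparing against the reference gelu.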

* Add environment variable to make nvcc compilation more verbose (#2759)

* Bing/formatting correction (#2764)

* modify engine.py for formatting

* commit formatting changes on engine.py

* Add links to new azureML examples (#2756)

Co-authored-by: Jeff Rasley <[email protected]>

* Fix hardcoded instances to fp16 in optimizer creation log messages to the correct dtype. (#2743)

* Remove hardcoded instances to fp16 in log messages.

* Add model_dtype to print the correct format

* Respond to PR feedback

---------

Co-authored-by: Olatunji Ruwase <[email protected]>

* Refactor/Pydantify monitoring config (#2640)

* pydantify monitoring configs

---------

Co-authored-by: Olatunji Ruwase <[email protected]>

* Pin minimum `packaging` requirement (#2771)

Co-authored-by: Jeff Rasley <[email protected]>

* Fix for diffusers v0.12.0 (#2753)

Co-authored-by: Jeff Rasley <[email protected]>

* some fix in flops_profiler (#2068)

* bugs in profiler:
1. Tensor.bmm is missing from the _patch_tensor_methods function
2. functions are missing from the _reload_functionals and _reload_tensor_methods functions
3. torch.mm and torch.Tensor.mm have the same __name__ in wrapFunc; my suggestion is to use __str__ instead.

* formatting

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Cheng Li <[email protected]>

* fix upsample flops compute by skipping unused kargs (#2773)

* fix upsample flops compute by skipping unused kargs

* fix format

* Fix broken kernel inject bug (#2776)

* Fix Checkpoint-loading with Meta-tensor (#2781)

* Reset KV-cache at the beginning of text-generation

* Pass the ckpt-loading arguments to work with meta-tensor

* remove unrelated changes

* add support for hjson config files (#2783)

Co-authored-by: Olatunji Ruwase <[email protected]>

* Reset KV-cache at the beginning of text-generation (#2669)

Co-authored-by: Martin Cai <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>

* Container param cleanup + remove qkv_merging (#2780)

This PR cleans up some container items and removes an unused qkv_merging parameter:

- Remove qkv_merging=True from BERT containers
- Change containers config object to ds_model_config
- Remove qkv_merging param

* Common location to install libaio-dev (#2779)

* Common location to install libaio-dev

* Update .github/workflows/setup-venv/action.yml

Co-authored-by: Michael Wyatt <[email protected]>

---------

Co-authored-by: Michael Wyatt <[email protected]>

* Fixing broken link to azureml-examples recipes (#2795)

* remove outdated comment (#2786)

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>

* Enable page-locked tensors without CUDA (#2775)

* Enable page-locked memory in cpu only env

* Enable page-locked memory in cpu only env

* Formatting

* Add TODOs; Release page-locked memory

* Update perf microbenchmark; Reduce unit test memory

* Reduce CI mem usage

* Add container load checkpoint error reporting + refactor (#2792)

This PR refactors the organization of meta tensor checkpoint loading as follows:

- Move get_param_names() abstract method definition from TransformerPolicy into MetaTensorContainer
- Model-specific get_param_names() definitions moved from policy into model-specific container
- selected_policy_g, megatron_v2_g, and transformer_config_g globals replaced with a single container_g global, since the container will contain all of the information those globals previously captured
- ckpt_load_enabled flag added to containers that's set to False by default in the base.py container and gets set to True when the MetaTensorContainer feature is inherited
- Assertion added to replace_transformer_layer before performing checkpoint loading to check that ckpt_load_enabled == True; otherwise an error message is printed saying that the container does not support meta tensor checkpoint loading.

The aim of these changes is to more closely couple meta tensor checkpoint loading code to the MetaTensorContainer and to allow for better error reporting of load checkpoint use on model types that don't support this feature.
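
The flag-plus-assertion pattern described in the list above can be sketched as below. All class and function names here are hypothetical stand-ins chosen for illustration; they are not the actual DeepSpeed container classes.

```python
class BaseContainer:
    # Checkpoint loading is disabled unless a feature mixin enables it.
    ckpt_load_enabled = False


class MetaTensorMixin:
    # Inheriting this feature turns checkpoint loading on.
    ckpt_load_enabled = True

    def get_param_names(self):
        raise NotImplementedError("model-specific containers define this")


class SupportedContainer(MetaTensorMixin, BaseContainer):
    def get_param_names(self):
        # Placeholder parameter names, for illustration only.
        return ("attn.weight", "mlp.weight")


class UnsupportedContainer(BaseContainer):
    pass


def check_ckpt_load(container):
    # Fail early with a clear message instead of partway through loading.
    assert container.ckpt_load_enabled, (
        f"{type(container).__name__} does not support "
        "meta tensor checkpoint loading"
    )
```

Listing the mixin before the base class means its `ckpt_load_enabled = True` takes precedence in the method resolution order, so a container opts in simply by inheriting the feature.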

* Add user defined launcher args for PDSH launcher (#2804)

* Add user defined launcher args for PDSH launcher

* Formatting fixes

* Fix Slurm launcher user args (#2806)

Fix missing connections from --launcher_args to Slurm srun command.

* Handle hanged tests in CI (#2808)

* Fix inference CI device error (#2824)

* Fix permissions issue with pip upgrade (#2823)

* fix permissions issue with pip upgrade

* install to .local instead of use sudo

* upgrade pip in venv

* Update action.yml

* fix typos

* Fix cpu-only CI hangs (#2825)

* don't run tests in parallel

* make AsyncIO test sequential

* Fix Pipeline Parallel resize unit test (#2833)

* fix overlapping checkpoint names in unit tests

* remove running cpu-only on master merge

* Fix auto TP for duplicate modules with different gems (#2784)

* Fix auto TP for duplicate modules with different gems

* precommit and comments

* Comment

* Combine gem list of same named modules

* remove duplicates from gem_list before updating policy

* Add module attribute with name variation for ProphetNet

---------

Co-authored-by: Jeff Rasley <[email protected]>

* Refactor DS inference API. No longer need replace_method. (#2831)

Co-authored-by: Michael Wyatt <[email protected]>

* Port Reza's INT8-quantization fix to container architecture (#2725)

Co-authored-by: Reza Yazdani <[email protected]>
Co-authored-by: Reza Yazdani <[email protected]>
Co-authored-by: Heyang Qin <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>

* Fix gpt-Neox rotary embedding implementation (#2782)

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>

* fix for cpu-only tests (#2849)

* bump to 0.8.2

* add auto-generated PR workflow (#2822)

* add auto-generated PR for private repo

* change variable names

* fix typo in autosync workflow (#2850)

* Fix example command when building wheel with dev version specified (#2815)

* Create tensor parallelism blog/tutorial (#2766)

Co-authored-by: Michael Wyatt <[email protected]>

* Data efficiency library update (#2866)

* data efficiency library update

* data efficiency library update

* data efficiency update

* data efficiency update

* Make z3 respect comm dtype (#2807)

* Make z3 respect comm dtype

* Support fp32 comm dtype

* Remove obsolete assert

* Code cleanup

* Automatic Tensor Parallelism Blog Links (#2877)

* Modify table for compatible web format

* Add tutorial links to navigation

* Add news bit to main readme

* Update docs/_tutorials/automatic-tensor-parallelism.md

Co-authored-by: Michael Wyatt <[email protected]>

---------

Co-authored-by: Michael Wyatt <[email protected]>

* Check device count before running dist tests (#2799)

* Check device count before running dist tests

* fixing format for "Check device count before running dist tests"

* Check device count against max world size

* Check GPU count before launching dist tests

* double-check GPU actually exists

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>

* AutoTP tutorial web formatting and news (#2883)

Co-authored-by: Jeff Rasley <[email protected]>

* Remove deprecated `torch._six` imports (#2863)

* Remove deprecated `torch._six` imports

Closes #2845.

* Support older versions of PyTorch as well.

---------

Co-authored-by: Jeff Rasley <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>

* Reduce I/O size (#2814)

* add missing license info to top of all source code (#2889)

Co-authored-by: Michael Wyatt <[email protected]>
Co-authored-by: Conglong Li <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>

* Enable tensor fragments for zero 2 & 3 (#2727)

* Enable tensor fragments for zero 2

* Update deepspeed/utils/tensor_fragment.py

Co-authored-by: Stas Bekman <[email protected]>

* Update deepspeed/utils/tensor_fragment.py

Co-authored-by: Stas Bekman <[email protected]>

* Support offload

* Support multi-gpu

* Cleanup

* WIP

* Update deepspeed/runtime/zero/stage3.py

Co-authored-by: Stas Bekman <[email protected]>

* Support padding

* Update deepspeed/runtime/zero/stage3.py

Co-authored-by: Stas Bekman <[email protected]>

* z3 optimizer state support; aligned api

* Support frozen z3 params

* Unit tests

* Check NVMe offload capability

* Formatting

* Docs

* More docs

* More docs

* Update docs/code-docs/source/zero3.rst

Co-authored-by: Stas Bekman <[email protected]>

* More docs

* Update docs/code-docs/source/zero3.rst

Co-authored-by: Stas Bekman <[email protected]>

* More docs

* More docs

* Update docs/code-docs/source/zero3.rst

Co-authored-by: Stas Bekman <[email protected]>

* Update deepspeed/utils/tensor_fragment.py

Co-authored-by: Stas Bekman <[email protected]>

* More docs

* Support unsharded fp32 grad

* Remove debug prints

* Fix off-by-one detection of empty grads

* Update deepspeed/utils/tensor_fragment.py

Co-authored-by: Stas Bekman <[email protected]>

* Update deepspeed/utils/tensor_fragment.py

Co-authored-by: Stas Bekman <[email protected]>

* Update deepspeed/utils/tensor_fragment.py

Co-authored-by: Stas Bekman <[email protected]>

* Update deepspeed/runtime/zero/stage3.py

Co-authored-by: Stas Bekman <[email protected]>

* Fix off-by-one error

* Skip ranks with no gradient data

* Formatting

* Add license

* Fix license

---------

Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>

* better eval sampler (#2907)

Co-authored-by: Olatunji Ruwase <[email protected]>

* using container when loading inference checkpoints (#2875)

This PR updates the replace_fn function when loading inference checkpoints. The container will now be passed to the load_model_with_checkpoint() so we can call load_params() from there. load_params() is also updated to access the variables in the policy.

* Fix CPUAdam for when `vendor_id_raw` is not provided (#2836)

* #1213: Fix CPUAdam for when `vendor_id_raw` is not provided

* formatting (yapf) fix

---------

Co-authored-by: Olatunji Ruwase <[email protected]>

* Always convert input mask to half (#2851)

* Fixes `AttributeError` in #2853 (#2854)

Updates `deepspeed/monitor/monitor.py`
to instantiate objects with correct configs

Relevant issue:
https://github.com/microsoft/DeepSpeed/issues/2853

Co-authored-by: Olatunji Ruwase <[email protected]>

* Add MPICH Multinode Runner (#2839)

* MPICH support

* MPICH changes

* MPICH changes

* MPICH changes

* MPICH changes

* accelerator runtime modifications

* Accelerator runtime changes

* Accelerator runtime modifications

* Remove redundant print from single node

* Move hostfile to tmp

* Code cleanup for MPICH class

* Code cleanup, rm whitespace

* Removing mpiexec environment check details

* Not needed tmp hostfile as pass directly

* Remove debugging comments

* rm print statement

* Revert comm changes as WA not needed

* Use MPICHRunner name for class

* Use MPICHRunner as class name

* No need to use args.force_multi and args.launcher.

This should be set in deepspeedexamples gpt-3.6b .sh script as:
launcher=MPICH
run_cmd=" deepspeed  --hostfile=${hostfile_ds}  --num_nodes ${NUM_WORKERS} --num_gpus ${NUM_GPUS_PER_WORKER} --launcher=${launcher} --force_multi pretrain_gpt2.py $@ ${gpt_options}"

* Adhere to code pattern

* Rm empty lines in MPICHRunner class

* Uncomment check for num nodes and workers when used hostfile_deepspeed in gpt-3.6b.sh

* pass MPICH hostfile through launcher_args in gpt-3.6b.sh

* Clean code and remove args hostfile

* fix merge

* fix merge

---------

Co-authored-by: Abhilash Majumder <[email protected]>

* clean up and fix format

* add ut

---------

Co-authored-by: Abhilash Majumder <[email protected]>
Co-authored-by: Ammar Ahmad Awan <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>

* TP unsupported models and assertions (#2810)

Co-authored-by: Jeff Rasley <[email protected]>

* AutoTP Assert Kernel Injection Support (#2939)

* check kernel injection supported models

* Clarify why user should use kernel injection

* Check for local CUDA graphs when enable_cuda_graph=True (#2941)

* Improve overflow handling (#2944)

Co-authored-by: Jeff Rasley <[email protected]>

* [RFC] add device abstraction to allow other device than CUDA be used (#2221)

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>

* deepspeed.init_distributed() support for TCP protocols (#2905)

Co-authored-by: Jeff Rasley <[email protected]>

* bump to 0.8.3

* bug fix for skipping mbs (#2171)

Co-authored-by: Rajhans Samdani <[email protected]>

* Fix issue between our abstract accelerator and colossalai's version of op_builder (#2963)

Co-authored-by: Logan Adams <[email protected]>

* [zero] prevent poor configs from running w. zero-offload (#2971)

---------

Signed-off-by: Dashiell Stander <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>
Co-authored-by: Guanhua Wang <[email protected]>
Co-authored-by: Ammar Ahmad Awan <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
Co-authored-by: Sam Ade Jacobs <[email protected]>
Co-authored-by: Arash Bakhtiari <[email protected]>
Co-authored-by: Connor Holmes <[email protected]>
Co-authored-by: Reza Yazdani <[email protected]>
Co-authored-by: Reza Yazdani <[email protected]>
Co-authored-by: Saeyeol Lee <[email protected]>
Co-authored-by: Saeyeol Lee <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Jean-Louis Queguiner <[email protected]>
Co-authored-by: Molly Smith <[email protected]>
Co-authored-by: Matt Smith <[email protected]>
Co-authored-by: Thomas-MMJ <[email protected]>
Co-authored-by: lekurile <[email protected]>
Co-authored-by: Lev Kurilenko <[email protected]>
Co-authored-by: Molly Smith <[email protected]>
Co-authored-by: Lok Chand Koppaka <[email protected]>
Co-authored-by: Samyam Rajbhandari <[email protected]>
Co-authored-by: Dashiell Stander <[email protected]>
Co-authored-by: Dashiell Stander <[email protected]>
Co-authored-by: Andrey Chernykh <[email protected]>
Co-authored-by: Alexander Jipa <[email protected]>
Co-authored-by: Alexander Jipa <[email protected]>
Co-authored-by: Joe Mayer <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Adam Moody <[email protected]>
Co-authored-by: Cheng Li <[email protected]>
Co-authored-by: eltonzheng <[email protected]>
Co-authored-by: Benjamin Steenhoek <[email protected]>
Co-authored-by: Guo Yejun <[email protected]>
Co-authored-by: savitamittal1 <[email protected]>
Co-authored-by: kyoto7250 <[email protected]>
Co-authored-by: Kevin Ko <[email protected]>
Co-authored-by: lokoppakmsft <[email protected]>
Co-authored-by: iLeGend <[email protected]>
Co-authored-by: Alex Hedges <[email protected]>
Co-authored-by: ShijieZZZZ <[email protected]>
Co-authored-by: Ma, Guokai <[email protected]>
Co-authored-by: AGUL <[email protected]>
Co-authored-by: Jeongseok Kang <[email protected]>
Co-authored-by: Hayden <[email protected]>
Co-authored-by: Conglong Li <[email protected]>
Co-authored-by: Rahil Bathwal <[email protected]>
Co-authored-by: Jithun Nair <[email protected]>
Co-authored-by: Ikko Ashimine <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>
Co-authored-by: li-yi-dong <[email protected]>
Co-authored-by: liyidong.lyd <[email protected]>
Co-authored-by: JackieWu <[email protected]>
Co-authored-by: Xiaoxia (Shirley) Wu <[email protected]>
Co-authored-by: cassieesvelt <[email protected]>
Co-authored-by: Masahiro Tanaka <[email protected]>
Co-authored-by: loadams <[email protected]>
Co-authored-by: Nick Sarkauskas <[email protected]>
Co-authored-by: Bing Xie <[email protected]>
Co-authored-by: Carlos Mocholí <[email protected]>
Co-authored-by: swli <[email protected]>
Co-authored-by: Martin Cai <[email protected]>
Co-authored-by: Razvan Tanase <[email protected]>
Co-authored-by: Heyang Qin <[email protected]>
Co-authored-by: Yasyf Mohamedali <[email protected]>
Co-authored-by: Mayank Mishra <[email protected]>
Co-authored-by: Farzan Taj <[email protected]>
Co-authored-by: Sam Foreman <[email protected]>
Co-authored-by: Abhilash Majumder <[email protected]>
Co-authored-by: noabauma <[email protected]>
Co-authored-by: Rajhans Samdani <[email protected]>