Crash with cpu offload #707

Closed
pedrocolon93 opened this issue Jan 28, 2021 · 11 comments

pedrocolon93 commented Jan 28, 2021

Hi there! I have been using this configuration:

{
    "zero_allow_untested_optimizer": true,
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e6,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e6,
        "overlap_comm": false,
        "contiguous_gradients": true,
        "cpu_offload": true
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 5e-5,
            "betas": [0.9, 0.999],
            "eps": 1e-6,
            "weight_decay": 0.01
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 5e-5,
            "warmup_num_steps": 10000
        }
    }
}

I am using it to train a modified XLNet model (from the transformers library) on four 1080 Ti GPUs.

However, after ~20 iterations, once the gradients scale correctly and training begins, it crashes in this function:

def complete_grad_norm_calculation_for_cpu_offload(self, params):
    total_norm = 0.0
    norm_type = 2.0
    for p in params:
        if is_model_parallel_parameter(p) or (self.model_parallel_rank == 0):
            param_id = self.get_param_id(p)
            param_norm = self.norm_for_param_grads[param_id]
            total_norm += param_norm.item()**2

It fails with a KeyError at self.norm_for_param_grads[param_id].

I sidestepped this with a try/except:

try:
    param_norm = self.norm_for_param_grads[param_id]
    total_norm += param_norm.item()**2
except:
    pass

and it continues to train. Does anyone know what is happening?

@pedrocolon93
Author

As an aside, I need CPU offload; without it, training runs out of memory (OOM).

@mrgjbd

mrgjbd commented Feb 3, 2021

There are unused nodes in the computation graph when the gradients are computed.

@tjruwase
Contributor

tjruwase commented Feb 3, 2021

@pedrocolon93 thanks for reporting this issue. And thanks @mrgjbd for your suggestion, which I think is correct.

@pedrocolon93 are you using model-parallelism in this training? Also, does the key error happen on all ranks?

@pedrocolon93
Author

I'm not sure whether it's happening on all ranks. I believe I'm using a distributed model rather than model parallelism, but I may be wrong.

@Soonhwan-Kwon

Soonhwan-Kwon commented Feb 16, 2021

I faced the same issue here. I'm also using a distributed model, on all ranks. And yes, it ran for exactly 20 iterations before crashing.

@pedrocolon93
Author

@Soonhwan-Kwon If you need to continue training, patch it with a try/except. It's not an elegant fix, but it will get the job done.

@Soonhwan-Kwon

Quoting @pedrocolon93: "If you need to continue training, patch it with a try/except. It's not an elegant fix, but it will get the job done."

Thank you for the suggestion. I'll try it right away and see what happens.

@HHousen

HHousen commented Feb 25, 2021

I am getting this same error. I am not using model-parallelism. (The is_model_parallel_parameter function still returns True because of deepspeed/runtime/pipe/module.py line 246.) huggingface/transformers#9622 fixed a similar crash that happened because of gradient accumulation steps (#671). For me it happens every time after exactly 20 steps. I am using pytorch-lightning with a huggingface/transformers model.

Here is the portion of the traceback involving DeepSpeed:

  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/plugins/precision/deepspeed_precision.py", line 30, in pre_optimizer_step
    deepspeed_engine.step()
  File "/usr/local/lib/python3.7/dist-packages/deepspeed/runtime/engine.py", line 959, in step
    self._take_model_step(lr_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/deepspeed/runtime/engine.py", line 914, in _take_model_step
    self.optimizer.step()
  File "/usr/local/lib/python3.7/dist-packages/deepspeed/runtime/zero/stage2.py", line 1379, in step
    self.params_in_partition[i]))
  File "/usr/local/lib/python3.7/dist-packages/deepspeed/runtime/zero/stage2.py", line 881, in complete_grad_norm_calculation_for_cpu_offload
    param_norm = self.norm_for_param_grads[param_id]
KeyError: 8

@Soonhwan-Kwon

@pedrocolon93 Thank you, it is working now with your suggestion (try/except), but it still leaves a bad taste. @mrgjbd How can we find the unused nodes when computing gradients and remove them? I would greatly appreciate any advice.

self._take_model_step(lr_kwargs)
  File "/home/soouee/anaconda3/envs/pytorch_marco/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 914, in _take_model_step
    self.optimizer.step()
  File "/home/soouee/anaconda3/envs/pytorch_marco/lib/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 1379, in step
    self.params_in_partition[i]))
  File "/home/soouee/anaconda3/envs/pytorch_marco/lib/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 881, in complete_grad_norm_calculation_for_cpu_offload
    param_norm = self.norm_for_param_grads[param_id]
KeyError: 533
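
One way to look for such parameters, sketched here under assumptions: after the first backward pass on a freshly built model, any trainable parameter whose `.grad` is still `None` did not take part in producing the loss. The helper name `find_params_without_grad` is hypothetical and is not part of DeepSpeed or PyTorch. It only assumes objects with `requires_grad` and `grad` attributes, such as those yielded by `model.named_parameters()` in PyTorch. Because the ZeRO optimizer manages gradients itself, this check is probably best run on the plain model without DeepSpeed.

```python
from types import SimpleNamespace


def find_params_without_grad(named_params):
    """Return names of trainable parameters that received no gradient.

    `named_params` is any iterable of (name, param) pairs where each param
    has `requires_grad` and `grad` attributes, e.g. the result of
    `model.named_parameters()` after one `loss.backward()` call.
    """
    return [
        name
        for name, param in named_params
        if param.requires_grad and param.grad is None
    ]


# Stand-in objects so the sketch runs without PyTorch installed.
fake_params = [
    ("encoder.weight", SimpleNamespace(requires_grad=True, grad=[0.1, 0.2])),
    ("unused_head.weight", SimpleNamespace(requires_grad=True, grad=None)),
    ("frozen.bias", SimpleNamespace(requires_grad=False, grad=None)),
]
unused = find_params_without_grad(fake_params)
print(unused)
```

With a real model, the names returned point at the layers to either wire into the loss or freeze with `requires_grad = False`.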

@Soonhwan-Kwon

Soonhwan-Kwon commented Mar 11, 2021

After bypassing the KeyError, I encountered this error:

timer has already been started

It keeps happening, and the model cannot learn at all.

@ghosthamlet
Contributor

ghosthamlet commented Mar 15, 2021

@mrgjbd is right. Here is a more detailed explanation: the KeyError is caused by unused parameters. If you disable DeepSpeed and use torch.nn.parallel.DistributedDataParallel with find_unused_parameters=False, you may see this error message:

    if self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. 
This error indicates that your module has parameters that were not used in producing loss. 
You can enable unused parameter detection by 
(1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; 
(2) making sure all `forward` function outputs participate in calculating loss.
 If you already have done the above two steps, then the distributed data parallel module wasn't able to locate 
the output tensors in the return value of your module's `forward` function. 
Please include the loss function and the structure of the return value of `forward` of your module when 
reporting this issue (e.g. list, dict, iterable).

These errors happen when the model has trainable parameters that are skipped during training. The skipped parameters do not go through backward, so their backward hooks registered in self.create_reduce_and_remove_grad_hooks() of ZeRO stage 2 never run, and they have no entry in norm_for_param_grads.
If skipping them is what you want, then the workaround by @pedrocolon93 (wrapping the lookup in try/except) is the right approach, or better:

if param_id in self.norm_for_param_grads: 
    param_norm = self.norm_for_param_grads[param_id] 
    total_norm += param_norm.item()**2 
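
To make the mechanism concrete, here is a minimal pure-Python sketch, not DeepSpeed's actual code. It assumes a dictionary that is filled only for parameters whose backward hook fired, and compares the unguarded lookup with the guarded one. The names `norm_for_param_grads` and `param_id` follow the snippet above; `grad_norm_unguarded`, `grad_norm_guarded` and the sample values are made up for illustration.

```python
import math


def grad_norm_unguarded(param_ids, norm_for_param_grads):
    """L2 norm over all parameters; raises KeyError for skipped ones."""
    total = 0.0
    for param_id in param_ids:
        total += norm_for_param_grads[param_id] ** 2
    return math.sqrt(total)


def grad_norm_guarded(param_ids, norm_for_param_grads):
    """L2 norm that ignores parameters with no recorded gradient norm."""
    total = 0.0
    for param_id in param_ids:
        if param_id in norm_for_param_grads:
            total += norm_for_param_grads[param_id] ** 2
    return math.sqrt(total)


# Parameter 1 never went through backward, so its hook recorded nothing.
param_ids = [0, 1, 2]
norms = {0: 3.0, 2: 4.0}

try:
    grad_norm_unguarded(param_ids, norms)
    raised = False
except KeyError:
    raised = True

total_norm = grad_norm_guarded(param_ids, norms)
print(raised, total_norm)
```

Skipping a missing entry is equivalent to treating that parameter's gradient norm as zero, which matches the fact that it contributed nothing to the loss.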

ghosthamlet added a commit to ghosthamlet/DeepSpeed that referenced this issue Mar 15, 2021
…ped in training, as in microsoft#707

As some model trainable parameters skipped in training,
their backward hooks in self.create_reduce_and_remove_grad_hooks() will not run, 
so they have no norm_for_param_grads
tjruwase added a commit that referenced this issue Mar 27, 2021
…ped in training (#861)

* Fix zero stage2 cpu_offload when some model trainable parameters skipped in training, as in #707

As some model trainable parameters skipped in training,
their backward hooks in self.create_reduce_and_remove_grad_hooks() will not run, 
so they have no norm_for_param_grads

* Trim space

* Trim space

Co-authored-by: Olatunji Ruwase <[email protected]>
sdtblck added a commit to EleutherAI/DeeperSpeed that referenced this issue Apr 6, 2021
ghosthamlet added a commit to ghosthamlet/DeepSpeed that referenced this issue Apr 12, 2021
As unused parameters in modules may not be expected sometimes, 
add an explicit error msg when it occurred and an option to avoid the error: microsoft#707
sdtblck added a commit to EleutherAI/DeeperSpeed that referenced this issue Apr 22, 2021
tjruwase pushed a commit that referenced this issue Apr 25, 2021
* Add find_unused_parameters option

As unused parameters in modules may not be expected sometimes, 
add an explicit error msg when it occurred and an option to avoid the error: #707

* Fix syntax error

* Fix yapf error

* Fix yapf error

* Fix yapf error

* Fix yapf error

* Move stage2 find_unused_parameters to config file

* Add stage2 find_unused_parameters

* Add stage2 find_unused_parameters

* Add stage2_find_unused_parameters option

* Change error msg to reflect zero_optimization config change

* Fix yapf error

* Fix yapf errors

* Change find_unused_parameters option name

* Change find_unused_parameters option name

* Change find_unused_parameters option name

* Change find_unused_parameters option name

* Change find_unused_parameters option name

* Add UnusedParametersModel for test option find_unused_parameters

* Add unit test for stage2 find_unused_parameters

* Add cpu-adam compatible check

* Remove dups import

* Trim spaces

* Fix yapf errors

* Trim spaces

* Add False Positive test check

* Fix find_unused_parameters test

* Trim spaces

* Fix yapf error
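
For reference, a hedged sketch of how the resulting option might be set. The commit message above does not state the final option name; it appears to have been released as `ignore_unused_parameters` under `zero_optimization`, but treat that name as an assumption and check the configuration documentation for your DeepSpeed version. The dictionary mirrors part of the configuration from the original report, written as a Python dict that can be dumped to a JSON config file.

```python
import json

# Assumption: the option name below is not given in this thread; verify it
# against the DeepSpeed config docs for the version you have installed.
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "contiguous_gradients": True,
        "cpu_offload": True,
        # True: tolerate parameters that receive no gradient.
        # False: stop with an explicit error when one is detected.
        "ignore_unused_parameters": True,
    },
}

config_text = json.dumps(ds_config, indent=4)
print(config_text)
```

Setting the flag to false is the stricter choice when unused parameters would indicate a modelling bug rather than an intentional skip.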