forked from microsoft/DeepSpeed
Another thing to merge. (MY EYES HURT) #1
Merged
…e for the log directory (#296)
* Avoid deadlock for unsynchronized non-zero checkpointing
* Fix formatting issues
Co-authored-by: Jeff Rasley <[email protected]>

* updates to amp to support grad clip and grad accumulation
* zero grad using optimizer if in amp mode

* fix nv_peer_mem version in dockerfile
* fix security issue, remove pillow dependency (this is only needed for cifar example which has its own requirements.txt)
The mpu object is bound to the class instance. The if statement checks `self.mpu`, but the following lines call just `mpu`, which raises a NameError.
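The failure described in this commit message can be reproduced in a few lines. This is a minimal sketch, not DeepSpeed's actual code: `FakeMpu`, `Engine`, and both method names are hypothetical stand-ins chosen for illustration.

```python
class FakeMpu:
    """Hypothetical stand-in for a model-parallel unit object."""

    def get_model_parallel_world_size(self):
        return 2


class Engine:
    """Hypothetical engine that stores mpu on the instance."""

    def __init__(self, mpu=None):
        self.mpu = mpu

    def world_size_buggy(self):
        # The guard reads self.mpu, but the call uses the bare name `mpu`.
        # That name is not defined in this scope, so Python raises NameError
        # as soon as the branch is taken.
        if self.mpu is not None:
            return mpu.get_model_parallel_world_size()  # noqa: F821
        return 1

    def world_size_fixed(self):
        # Fixed form: use the attribute consistently.
        if self.mpu is not None:
            return self.mpu.get_model_parallel_world_size()
        return 1
```

Note that the buggy method only fails when an mpu is actually supplied, which is why a bug of this kind can go unnoticed in runs without model parallelism.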
The parentheses alter the evaluation of the assert: wrapping the condition and message together turns them into a tuple, so the assertion always evaluates to True.
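A short sketch of why the parenthesized form never fails. The function names and the condition are hypothetical and not taken from the DeepSpeed source. The tuple is built in a separate variable here only to make the behavior explicit; writing `assert (cond, "msg")` directly constructs the same tuple.

```python
def check_with_parens(value):
    # Equivalent to: assert (value > 0, "value must be positive")
    # The parentheses create a 2-tuple (condition, message). A non-empty
    # tuple is always truthy, so this assert can never fail.
    condition_and_message = (value > 0, "value must be positive")
    assert condition_and_message
    return True


def check_without_parens(value):
    # Correct form: the condition is evaluated on its own and the message
    # is only used when the assertion fails.
    assert value > 0, "value must be positive"
    return True
```

With `value = -1`, the first function returns normally while the second raises `AssertionError`, assuming assertions are enabled (Python is not run with `-O`).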
Add webinar on-demand links and update readme
* add fix and tests for get_lr from lr_scheduler before training starts
* update fan out flag for pdsh
* turn off multi-node launch if only 1 node
* Create CODEOWNERS
* Update deepspeed_checkpointing.py
* formatting
Co-authored-by: Jeff Rasley <[email protected]>

* Adding gradient accumulation support for ZeRO Stage 2. Changing all Megatron-LM tests to also test gradient accumulation
* Gradient Accumulation support for Stage 2. Model tests added to test the feature
* formatting
* Update deepspeed_light.py: removing comment
* Update ds_config_func_bs8_zero1.json: reverting this file back. It's not needed for this PR
* defining baseline prefix
Co-authored-by: Jeff Rasley <[email protected]>
Renaming config files to gas3
Co-authored-by: Samyam Rajbhandari <[email protected]>
Allow DeepSpeed models to be initialized with optimizer=None Co-authored-by: Shaden Smith <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
Co-authored-by: Reza Yazdani <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]>
Bumps [nokogiri](https://github.com/sparklemotion/nokogiri) from 1.10.10 to 1.11.0.
- [Release notes](https://github.com/sparklemotion/nokogiri/releases)
- [Changelog](https://github.com/sparklemotion/nokogiri/blob/master/CHANGELOG.md)
- [Commits](sparklemotion/nokogiri@v1.10.10...v1.11.0)
Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Jeff Rasley <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
* Remove a very verbose print statement.
* Update engine.py

* Add Linear warmup+decay lr schedule; update lr schedule unit tests
* LR scheduler unit tests for LR Range Test and 1Cycle
* Disable yapf to preserve parameterization
* Disable test_pipe.py for CI debugging
* Disable test_lr_scheduler for CI debugging
* Disable test_lr_scheduler for CI debugging
* Enable all unit tests for CI debugging
Co-authored-by: Jeff Rasley <[email protected]>
) Special thanks to @g-karthik for tracking this issue down.
Co-authored-by: Cheng Li <[email protected]> Co-authored-by: Jeff Rasley <[email protected]>
* move workspace memory-allocation to PyTorch
* refine the code based on the comments
* remove unnecessary options
* remove bsz from set_seq_len function
Co-authored-by: Jeff Rasley <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
* Update README.md
* Update index.md
Co-authored-by: Jeff Rasley <[email protected]>
* Fix ZeRO 2 + Pipelining
Commits on Jul 21, 2020
only global rank 0 can log tensorboard data; avoid multi gpu/node rac… …
1f97242
Commits on Jul 22, 2020
Update setup.py (microsoft#298)
871f7e6
Avoid deadlock for unsynchronized non-zero checkpointing (microsoft#297) …
3cc96e1
Commits on Jul 23, 2020
updates to amp to support grad clip and grad accumulation (microsoft#290 …
eb74c3f
pass steps_per_print to tput timer (microsoft#299)
ec94341
Commits on Jul 24, 2020
bump DSExamples (microsoft#300)
0f94f7e
Commits on Jul 25, 2020
DeepSpeed webinar announcement (microsoft#301)
7ae8f8b
Commits on Jul 27, 2020
Update README.md (microsoft#302)
67821f9
Commits on Jul 28, 2020
Fixing a typo (microsoft#303)
97c5427
Fix nv_peer_mem version (microsoft#304) …
e50b883
Commits on Aug 01, 2020
NameError: name 'mpu' is not defined (microsoft#305) …
9d07d75
Commits on Aug 07, 2020
Removing () from assertion. (microsoft#307) …
c35e944
Add webinar link (microsoft#309) …
29c5fe2
Commits on Aug 08, 2020
updates website gems after kramdown alert (microsoft#311)
903a41a
Commits on Aug 10, 2020
Fix+tests for get_lr from lr_scheduler before training starts (micros… …
cd68e6e
Commits on Aug 12, 2020
bumping DSE commit for pillow security fix (microsoft#312)
892ece6
Update deepspeed_lr_schedules.py (microsoft#314)
3437342
Commits on Aug 13, 2020
Update fan out flag for pdsh (microsoft#315) …
6855ba1
attach empty grad to its param to ensure it's copied after reduction (m… …
e1bea67
Commits on Aug 14, 2020
bump DSE (microsoft#317)
de0523d
Commits on Aug 18, 2020
Turn off multi-node launch if only 1 node (microsoft#322) …
e69b1ee
Commits on Aug 27, 2020
Add code owners for DeepSpeed team (microsoft#335) …
21d5f63
Commits on Aug 28, 2020
bump DSE
6823db3
Commits on Aug 31, 2020
Update deepspeed_checkpointing.py (microsoft#336) …
458c0d9
Samyamr/grad acc stage2 (microsoft#338) …
7240abf
Rename ds_config_func_bs8_zero2_gas10.json to ds_config_func_bs8_zero… …
7a356b2
Rename ds_config_func_bs8_zero0_gas10.json to ds_config_func_bs8_zero… …
6122a74
Update run_func_test.py …
f4726b7
Update .gitignore
e8dd47d
Commits on Sep 01, 2020
Switches BBS example to use mbsize=3 and gas=2 to fit in 16GB of memo… …
838f53b
Sparse attn + ops/runtime refactor + v0.3.0 (microsoft#343) …
e5bbc2e
Update Dockerfile
8716540
Update Dockerfile …
5518aae
Commits on Sep 02, 2020
update DSE and rename SA tests
1661e83
Commits on Sep 03, 2020
Update test_sparse_attention.py
1ebcd6c
Adding link to Sparse Attention in Navigation page (microsoft#355) …
6deac82
Commits on Sep 04, 2020
Jekyll installation instructions (microsoft#351)
ac12833
Commits on Sep 05, 2020
fixed a typo; this was fixed before but seems like it has been lost i… …
a64b0ab
Move code quality tests to Azure-hosted agents. (microsoft#368)
4d4eafb
Commits on Sep 06, 2020
Update installation instructions (microsoft#362)
9e83ef2
Update Sparse Attention Tutorial (microsoft#357) …
9dadf38
Commits on Sep 08, 2020
adding sparse attention to feature index page (microsoft#377)
b73894d
Commits on Sep 09, 2020
temp disable model tests
234bba0
Add 1-bit Adam support to DeepSpeed (microsoft#380) …
01726ce
fixing a link issue with SA tutorial (microsoft#387) …
161e8e6
Update test triggers to exclude docs
79093d7
ZeRO-Offload release (microsoft#391) …
41db1c2
Commits on Sep 10, 2020
Pipeline parallel training engine. (microsoft#392) …
65c2f97
Update documentation for 1-bit Adam (microsoft#388) …
093f09f
Fix datatype issue with sparse attention softmax (microsoft#363) …
dca0b78
Add openmpi to dockerfile
c0d5424
ZeRO tutorials (microsoft#384) …
2dea61f
fix for 16GB v100 nodes (microsoft#393)
b1d4bd7
Sparse attention: updating code tag in documentation (microsoft#394) …
be4b94b
Minjiaz/zero offload (microsoft#382) …
59ce90d
Adding sparse attention news index item (microsoft#376) …
c76769c
Landing page updates (microsoft#395) …
a8a8b3d
Update README.md
7baf3c3
Website edits (microsoft#398) …
6bb5c69
update docker image and bump DSE
b29229b
only add 1bit adam reqs if mpi is installed, update cond build for cp… …
240ea97
bump DSE and doc tweak
4b1df25
Update README.md
9693595
Update _config.yml
ea92ed2
Update news site with press release link
5dc4d6c
Update ZeRO-Offload blog post link (microsoft#401) …
d15015e
remove old pt file
15ca99c
readthedocs upgrade (microsoft#402)
c82756c
Commits on Sep 11, 2020
supporting different intermediate sizes other than 4 * hidden_dim (mi… …
e549be6
Revert "supporting different intermediate sizes other than 4 * hidden… …
4ac9bf6
Commits on Sep 13, 2020
scales throughput by logging freq (microsoft#408)
473ff98
Commits on Sep 15, 2020
pytest skips for tests requiring certain ops (microsoft#411) …
91b4a93
fix bug related to stitching reduced grads across communication parti… …
55ed105
add cpu-adam, reformat, add colors (microsoft#413)
a9e8325
Commits on Sep 16, 2020
Add Linear warmup+decay lr schedule (microsoft#414) …
0e942df
Minor doc fixes (microsoft#417) …
7d91be9
Overflow fix (microsoft#416) …
f5cce75
Fix a typo in comments (microsoft#415) …
4fef478
readthedocs yaml configuration (microsoft#410) …
5812e84
Commits on Sep 17, 2020
Fix few typos in the docs (microsoft#418)
c66f388
Remove pip --use-feature (microsoft#419)
5bc7d4e
Commits on Sep 18, 2020
Activation checkpointing bugfix and unit tests (microsoft#420) …
01b6e27
Revert "Activation checkpointing bugfix and unit tests (microsoft#420)… …
a74a604
Fix activation checkpoint unit tests for GPU systems (microsoft#421)
a825f99
Commits on Sep 21, 2020
Add configurable intermediate size to transformer kernels (microsoft#423 …
a148bd3
DSE bump (microsoft#427)
71f7df3
support dynamic sequence length in transformer kernels (microsoft#424) …
f0f2a70
Commits on Sep 24, 2020
Fix urls in tutorial (microsoft#436) …
5d40f00
Update azure.md (microsoft#437)
192cf7c
Update pipeline.md (microsoft#439)
0ca8215
Commits on Sep 25, 2020
link fix part two :-) (microsoft#441)
6d176c4
unit test rename (microsoft#442)
5412a33
Commits on Sep 28, 2020
fix typos (microsoft#446)
6f28ea3
Commits on Sep 29, 2020
Disable default installation of CPU Adam (microsoft#450) …
7b8be2a
Commits on Oct 01, 2020
Use parentesis around min and max to enable Windows build (microsoft#449 …
9557557
Commits on Oct 05, 2020
Update engine.py (microsoft#458) …
6717638
Commits on Oct 06, 2020
temporarily disable lr unit tests
11cf47e
turning off different tests (temp)
679fc13
Commits on Oct 07, 2020
gan tutorial (microsoft#462) …
2efea69
Fix printing momentum for non-deepspeed optimizer (microsoft#464) …
c39a76f
Commits on Oct 10, 2020
Add DeepSpeed_Adam optimizer (microsoft#468) …
23fc48f
Commits on Oct 12, 2020
fixing typo (microsoft#460)
e25f2a2
add compute cap of 6.0 to transformer kernels …
b8eb40e
revert previous (accidental) change
1afca8f
Commits on Oct 14, 2020
Add support for p100 in transformer kernels (microsoft#470) …
7ddfda8
Commits on Oct 19, 2020
updating website dependencies (microsoft#475)
d720fdb
Commits on Oct 30, 2020
Add CPUAdam optimizer for zero-offload in deepspeed engine (microsoft… …
f5aa254
fixing the AVX_256 compatibility (microsoft#497)
4c37d70
Commits on Nov 05, 2020
Fixing CPU-Adam convergence issue (microsoft#503) …
7d4d742
Commits on Nov 09, 2020
PLD documentation (microsoft#514) …
e351090
Fix PLD news url (microsoft#515) …
41fb24b
Commits on Nov 10, 2020
updating pld docs (microsoft#517)
e082d47
PLD release (microsoft#513) …
be1147c
Commits on Nov 11, 2020
fix bug on non-DLTS infra when no output path set (microsoft#523)
eea1c28
Update zero.md tutorial (microsoft#495) …
0ad4fd8
Commits on Nov 12, 2020
DeepSpeed JIT op + PyPI support (microsoft#496) …
31f46fe
ds_report bug fix on cpu and guard torch import in setup.py (microsof… …
ca9ab12
Installation documentation updates. (microsoft#525) …
d779bd5
Commits on Nov 13, 2020
Dependency pruning (microsoft#528) …
0dc8420
bump version
9941ce7
Commits on Nov 17, 2020
Fix layout bug in ZeRO Stage 1 checkpoint logic (microsoft#531) …
7752dc5
Commits on Nov 18, 2020
append job-name if explicit output dir is given (microsoft#539)
5b09be6
more fine-grained manifest file for includes/excludes (microsoft#540)
fdd81c3
Commits on Nov 19, 2020
ZeRO-1 tune max-elems + bug fix (microsoft#532) …
08c96a1
bump to v0.3.3
9de21b7
backwards compatability w. v020 ckpts, fix issue with zero-1 ckpts (m… …
dce054d
Fix setup.py for cpu-only environment installation (microsoft#538) …
d81cb26
Discover variables for NCCL backend on AML without mpi4py (microsoft#542 …
1b45917
bump version 0.3.4
6b28bc5
Commits on Nov 20, 2020
Fix unbalanced gradients bug in ZeRO-2 gradient accumulation (microso… …
0178e6c
Commits on Nov 21, 2020
Support non-tensor state in checkpoint (microsoft#548)
6021b70
Commits on Nov 22, 2020
Adding static_loss_scale to unfused optimizer (microsoft#546)
bcd56f9
Commits on Nov 23, 2020
Bug fix for norm calculation in absence of model parallel group (micr… …
00c3a25
bump to 0.3.5
16313a9
Commits on Nov 24, 2020
Create main.yml
c18fb0d
Switch to CI to GitHub Actions (microsoft#556)
3347460
Update badges and CI name (microsoft#557)
1ef5cd2
Deprecate client ability to disable gradient reduction (microsoft#552) …
6e65c2c
Simplify dist init and only init if needed. (microsoft#553) …
0e831e2
Turn back on PP tests (microsoft#558)
eec44af
Commits on Nov 25, 2020
Adds long_description to setup.py (microsoft#560)
6009713
bump to 0.3.6 and fix manifest to include reqs (microsoft#561)
73c3262
update manifest
e4e2066
bump to 0.3.7
c51fa65
Commits on Nov 27, 2020
[doc] typo fix and clarification (microsoft#563) …
17f36f1
Commits on Dec 01, 2020
supporting different hidden dimensions (microsoft#559) …
c78c29f
tracking optimizer step in cpu-adam when loading checkpoint (microsof… …
9f52a36
Commits on Dec 02, 2020
[cifar tutorial] improve readability (microsoft#567) …
7a75f8b
Add 'latest' checkpoint save/load support (microsoft#569)
845921b
[engine] train should be able to get `mode` arg (microsoft#571)
2d1f7c0
Add compute capability 8.0 if on cuda 11+ (microsoft#572)
be33bea
[build] build against installed cuda-11.1 while torch built w/ cuda-1… …
ff58fa7
Commits on Dec 04, 2020
Fix potential random layout inconsistency issues in sparse attention … …
1e44d48
Commits on Dec 07, 2020
[build] make builder smarter and configurable wrt compute capabilitie… …
ce363d0
[build] add compute_86 (microsoft#577) …
e8b126d
Commits on Dec 08, 2020
Pipeline warnings and checkpoint portability (microsoft#588) …
2f62697
Commits on Dec 09, 2020
Pin triton to 0.2.3 for now, 0.3.0 is broken
d901a6d
bump to 0.3.8
cb7c7da
Add papers/videos to readme/website (microsoft#592)
19acd6c
Add AML video link
7300f3e
Commits on Dec 11, 2020
add manual workflow to run tests with precompiled ops
0518252
[build] fix computer capability arch flags, add PTX, handle PTX (micr… …
8a184b6
add DeepSpeedZeroConfig repr method (microsoft#596) …
66268bd
Supported customizing kwargs for lr_scheduler (microsoft#584) …
a4763f5
Update launcher to set local rank environ variable (microsoft#597) …
c5a449f
Commits on Dec 14, 2020
implement missing get_last_lr (microsoft#595) …
9f8e8f3
Commits on Dec 15, 2020
[doc] xref to hostfile discussion (microsoft#604) …
007466e
Fixes for RTD build errors (microsoft#606) …
6380ee3
Commits on Dec 17, 2020
Transformer-kernel - supporting any arbitrary sequence-length (micros… …
fd2f970
Commits on Dec 18, 2020
Ability to initialize distributed backend outside deepspeed runtime (m… …
7435b2f
Commits on Dec 23, 2020
Elastic training support (microsoft#602) …
81aeea3
Commits on Jan 04, 2021
update SA comp check to fix torch-cpu issue (microsoft#631)
24e0739
Support initialization with dict configuration (microsoft#632)
e6ac731
Commits on Jan 05, 2021
Allow DeepSpeed models to be initialized with optimizer=None (microso… …
a9a83a6
change dist to torch.distributed to fix bug in assert. (microsoft#638)
d38ad6a
docs: minor spelling tweaks (microsoft#623) …
46d2e28
Fix docstring format (microsoft#640)
5ab1279
Commits on Jan 06, 2021
Module replacement support (microsoft#586) …
44bd538
Update builder.py (microsoft#642)
64461da
Commits on Jan 07, 2021
Bump nokogiri from 1.10.10 to 1.11.0 in /docs (microsoft#630) …
8cea96d
Add deepspeed.init_distributed to RTD page (microsoft#645) …
4e2dc4e
Commits on Jan 08, 2021
document deepspeed.initialize() (microsoft#644) …
828d75b
add additional validation checks in elastic config (microsoft#646)
bc046dc
Remove a very verbose print statement. (microsoft#649) …
af212f6
version bump to 0.3.10
c14b839
LR scheduler unit tests (microsoft#429) …
da5563a
Commits on Jan 12, 2021
Handle actvitation checkpointing args that are None or non-tensors (m… …
adcfd26
squash latest flops profiling changes (microsoft#1) (microsoft#664) …
e2fbe4d
Move workspace memory-allocation to PyTorch (microsoft#661) …
981bc7d
Commits on Jan 14, 2021
Validate consistent ckpt tags across ranks (microsoft#667)
f032e56
Commits on Jan 15, 2021
Support optimizer AdamW type (microsoft#670)
865104b
skip empty lines in hostfile (microsoft#669)
6217a6c
Add AdamW to the supported optimizers (microsoft#672) …
c5e4264
add missing config menu entries (microsoft#652) …
e729a3f
doc fix (microsoft#651) …
7b07e12
Commits on Jan 19, 2021
add zero-offload paper (microsoft#680) …
82cecf6
Commits on Jan 20, 2021
[tutorials] typos (microsoft#676) …
7b0bee0
make test_pipe more stable (microsoft#683)
e59ba12
Fix ZeRO 2 + Pipelining (microsoft#677) …
34c83a5