
offline full fp32 weights reconstruction for zero 2+3 checkpoints #892

Merged
merged 1 commit into microsoft:master from consolidate-fp32-weights on Mar 26, 2021

Conversation

stas00
Collaborator

@stas00 stas00 commented Mar 26, 2021

This PR adds a magical script that extracts and reconstructs the full fp32 weights from either a ZeRO-2 or ZeRO-3 checkpoint. This consolidated state dict can then be used anywhere w/o DeepSpeed.

  • I had to change the checkpoint-saving code to also add the param names and shapes to the checkpoint optim files (a conceptual sketch of how these get used follows below)
  • The script doesn't require a live engine; everything is done in completely standalone logic. Unfortunately I couldn't fully remove the dependency on deepspeed, since the pickled data refers to DeepSpeedEngine and can't be unpickled w/o `import deepspeed`, but otherwise it's completely standalone.
  • When a new checkpoint is saved, this script gets copied automatically to the top-level checkpoint folder so that it's trivial for the user to get the weights out, but of course the API can be used directly - it's just a really bad idea to do it live, since for large models it'd be very slow and memory-consuming to mix it into normal training. But if someone wants it:
deepspeed.save_checkpoint(output_dir)
dist.barrier()
if torch.distributed.get_rank() == 0:
    convert_zero_chkpt_to_fp32_consolid_state_dict(f"{output_dir}/global_step1", output_file)

I guess the only non-smooth part is figuring out that "global_stepXX" part - we should probably expose some of that via the deepspeed API.
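For intuition, here is a rough, simplified sketch of the reconstruction idea in isolation - each rank's optim file contributes a flat fp32 partition, and the saved param names/shapes are used to carve the concatenated buffer back into per-parameter tensors. The helper name, its inputs, and the absence of padding handling are simplifications for illustration, not the script's actual internals:

import torch

# Rough sketch only: the real script also handles padding, zero-2 vs zero-3 layouts
# and the actual checkpoint structure; this helper and its inputs are illustrative.
def rebuild_fp32_state_dict(flat_partitions, param_shapes):
    # flat_partitions: one flat fp32 tensor per rank, in rank order
    # param_shapes: ordered mapping of param name -> shape, as saved with the checkpoint
    flat = torch.cat(flat_partitions)
    state_dict, offset = {}, 0
    for name, shape in param_shapes.items():
        numel = int(torch.Size(shape).numel())
        state_dict[name] = flat[offset:offset + numel].view(shape)
        offset += numel
    return state_dict

# toy usage: two fake per-rank shards of a single 2x3 weight
shards = [torch.arange(3, dtype=torch.float32), torch.arange(3, 6, dtype=torch.float32)]
sd = rebuild_fp32_state_dict(shards, {"linear.weight": (2, 3)})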

Running the script from the top-level checkpoint dir is as simple as:

$ cd /path/to/checkpoints_dir
$ ./zero_to_fp32.py global_step1 pytorch_model.bin
Processing zero checkpoint at global_step1
Detected checkpoint of type zero stage 3, world_size: 2
Saving fp32 state dict to pytorch_model.bin (total_numel=60506624)
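
And to illustrate the "usable anywhere w/o DeepSpeed" part - the produced file is an ordinary PyTorch state dict, so loading it is a one-liner; the commented model lines below are placeholders for whatever architecture produced the checkpoint:

import torch

# pytorch_model.bin produced above is a plain dict of fp32 tensors - no deepspeed import needed
state_dict = torch.load("pytorch_model.bin", map_location="cpu")
print(len(state_dict), "tensors, total numel:", sum(t.numel() for t in state_dict.values()))

# model = MyModel(config)            # placeholder: rebuild the same architecture as in training
# model.load_state_dict(state_dict)  # then the weights load like any other PyTorch checkpoint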

Future improvements:

  • currently the script uses 2x the memory of the final checkpoint - eventually it'd be good to find a smart way to avoid that requirement
  • currently the script relies on sufficient RAM to do the work - for huge models some kind of memory mapping will need to be used

TODO:

  • Where would be a good place to document this?

Lots of thanks to @tjruwase for the guidance. Once I understood how ZeRO handles its data, it was surprisingly easy to do the rest. Kudos for building an easy-to-work-with system!

p.s. a PR with an fp16 live consolidator for ZeRO-3 addressing #872 is coming soon

@tjruwase

Fixes: #800

@stas00 stas00 changed the title full fp32 weights reconstruction for zero 2+3 offline full fp32 weights reconstruction for zero 2+3 checkpoints Mar 26, 2021
@tjruwase tjruwase merged commit 7531c6b into microsoft:master Mar 26, 2021
@stas00 stas00 deleted the consolidate-fp32-weights branch March 26, 2021 22:19
sdtblck added a commit to EleutherAI/DeeperSpeed that referenced this pull request Apr 6, 2021
sdtblck added a commit to EleutherAI/DeeperSpeed that referenced this pull request Apr 22, 2021
@humza909

humza909 commented Apr 4, 2023

@stas00 I think this offline reconstruction is only casting fp16 weights to fp32?

Then how can we use stage 3 with fp32 weights? I could not see any method to save 32-bit partitioned weights.
This means that if a model was originally trained in fp32, converting it to fp16 will cause a precision loss, and if someone wants to avoid that loss there is no way to train and save with stage 3.

Please share your thoughts.

Thanks

@stas00
Collaborator Author

stas00 commented Apr 4, 2023

@stas00 I think this offline reconstruction is only casting fp16 weights to fp32?

No, it extracts the fp32 weight shards and reconstructs the full fp32 weights from them. fp16 is not involved anywhere in this process.

@humza909

humza909 commented Apr 5, 2023

@stas00 Thanks for your response.

A few questions regarding stage 3:

  1. If a model is trained in an fp16 setting and consolidated via gather_16bit_weights_on_model_save, then how will the script zero_to_fp32.py extract fp32 weights?
  2. If I am training with only fp32 enabled, then which method will consolidate the weights? Do we still use gather_16bit_weights_on_model_save? There is no other config option available. If it is still saving in 16-bit, then where are the remaining bits saved? How are the bits split? Why was the model originally converted to 16-bit and used for the forward and backward passes?
  3. Why not save the weights directly in whichever precision is enabled in the config - fp16 or fp32 - without the additional steps? I know going from fp16 to fp32 will take additional RAM, but if fp32 is enabled, save the model in fp32.
  4. In the listing below, are the model parameters saved in the optim_states.pt files?

ls -lh global_step100/
total 15G
-rw-rw-r-- 1 snxds snxds  97M Apr  4 08:07 zero_pp_rank_0_mp_rank_00_model_states.pt
-rw-rw-r-- 1 snxds snxds 7.4G Apr  4 08:07 zero_pp_rank_0_mp_rank_00_optim_states.pt
-rw-rw-r-- 1 snxds snxds  97M Apr  4 08:07 zero_pp_rank_1_mp_rank_00_model_states.pt
-rw-rw-r-- 1 snxds snxds 7.4G Apr  4 08:07 zero_pp_rank_1_mp_rank_00_optim_states.pt

Thanks

@stas00
Collaborator Author

stas00 commented Apr 5, 2023

@stas00 Thanks for your response.

Two/Three questions regarding stage3:

1. If a model is trained in an fp16 setting and consolidated via `gather_16bit_weights_on_model_save`, then how will the script `zero_to_fp32.py` extract fp32 weights?

If you did save_checkpoint, it'll extract the fp32 weights.

gather_16bit_weights_on_model_save has nothing to do with the normal DeepSpeed sharded save/load weight cycle.

2. If I am training with only fp32 enabled, then which method will consolidate the weights? Do we still use `gather_16bit_weights_on_model_save`? There is no other config option available. If it is still saving in 16-bit, then where are the remaining bits saved? How are the bits split? Why was the model originally converted to 16-bit and used for the forward and backward passes?

The fp32 weights are saved as part of save_checkpoint, and zero_to_fp32.py extracts those.

You're asking a good question about gather_16bit_weights_on_model_save - I asked for it specifically to deal with half-precision training. You simply don't call this method if you're not doing half precision. If there is a need, you could ask for a gather_32bit_weights_on_model_save.

3. Why not save the weights directly in whichever precision is enabled in the config - fp16 or fp32 - without the additional steps? I know going from fp16 to fp32 will take additional RAM, but if fp32 is enabled, save the model in fp32.

Because it's super-expensive, both in CPU RAM and time. When you do serious training you must save checkpoints frequently due to hardware crashes, and if this process isn't fast you're wasting GPU idle time. So the helper method was added to help beginner users get things working. Once you have everything else working, you don't want to use gathering ever again for serious training of any largish model.

4. In the listing below, are the model parameters saved in the optim_states.pt files?

what is question 4?
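
For context, here is a hedged sketch of how the 16-bit gathering is typically wired up. The exact option and method names (stage3_gather_16bit_weights_on_model_save, save_16bit_model) are recalled from later DeepSpeed releases and should be treated as assumptions to verify against your installed version, not as part of this PR:

# Sketch only - verify the option/method names against your installed DeepSpeed version.
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}

# after engine, *_ = deepspeed.initialize(...):
# engine.save_checkpoint("checkpoints")                        # sharded state -> zero_to_fp32.py later
# engine.save_16bit_model("checkpoints", "pytorch_model.bin")  # consolidated 16-bit weights, if enabled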

@humza909
Copy link

humza909 commented Apr 6, 2023

@stas00 Thanks for your response.

Please ignore question 4. This is likely my last series of questions.

  1. I am still confused about why everything in stage 3 is in 16-bit, even though I am using full precision (fp32).

  2. Your answer to my first question is that save_checkpoint will extract fp32 weights, but how? My model was converted to fp16 before the start of training, and the fp32 weights were discarded at the beginning. Is it just type casting now?

Any recommended links where I can get the implementation details would be very helpful.

Thanks for you time :)

@stas00
Collaborator Author

stas00 commented Apr 6, 2023

  1. I am still confused about why everything in stage 3 is in 16-bit, even though I am using full precision (fp32).

It's not. If you're not using half precision, nothing is in fp16 - everything is in fp32.

You should never convert your model to any format yourself when you use deepspeed. It'll do it for you on an as-needed basis.

  2. Your answer to my first question is that save_checkpoint will extract fp32 weights, but how? My model was converted to fp16 before the start of training, and the fp32 weights were discarded at the beginning. Is it just type casting now?

See the previous answer. When not using half precision, it'll upcast the model to fp32 if you pre-converted it to fp16.

To give you a hint: deepspeed performs the same protocol as mixed precision, but in reverse - it converts the model to fp16 and keeps a copy of the fp32 weights in its optimizer. So you always have both sets of weights. save_checkpoint saves only the fp32 weights.

So if you're not doing mixed precision, you only have fp32 weights and they are saved with save_checkpoint.
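
As a toy illustration of that protocol (plain torch, not DeepSpeed's actual implementation, and it assumes a CUDA device for the fp16 math): the fp16 copy does the forward/backward, while an fp32 master copy lives with the optimizer and is what a checkpoint preserves.

import torch
from torch import nn

# toy mixed-precision loop: fp16 compute weights + fp32 master weights (requires a GPU)
model = nn.Linear(8, 8).half().cuda()
master = [p.detach().float().clone().requires_grad_(True) for p in model.parameters()]
opt = torch.optim.SGD(master, lr=0.1)

x = torch.randn(4, 8, dtype=torch.float16, device="cuda")
model(x).float().sum().backward()                    # gradients land on the fp16 params

for p16, p32 in zip(model.parameters(), master):
    p32.grad = p16.grad.float()                      # upcast grads for the fp32 update
opt.step()
with torch.no_grad():
    for p16, p32 in zip(model.parameters(), master):
        p16.copy_(p32)                               # refresh the fp16 copy from the fp32 master

# it is the fp32 `master` tensors that the checkpoint keeps, which is why fp32 weights
# can always be reconstructed even when training ran in fp16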

Any recommended links where I can get the implementation details would be very helpful.

I think there isn't much documentation for this particular topic; you can read the papers here:
https://huggingface.co/docs/transformers/main/main_classes/deepspeed#main-deepspeed-resources
The rest of that doc might be useful as general reading as well, but it's specific to the transformers integration.

You can make a request to document these nuances. These are all excellent questions and are important to understand.

@khulaifi95

@stas00 Sorry, but I find this issue related to my problem. To extend the last question from @humza909: for ZeRO stage-3 training across 2 nodes, is it ever possible to reconstruct the model weights from the checkpoint contents saved on just one node?

@zcakzhuu

(quoting @stas00's reply above in full)

Hi, thanks for the great answers! They are really helpful. I have a question about fine-tuning in fp32 using peft. Regarding question 2: if I am checkpointing a model that is fine-tuned using peft in fp32 with a transformers Trainer, do I have to run zero_to_fp32.py before resuming training from the checkpoint with the Trainer? Aren't the weights in the .bin file fp32 already?

@stas00
Collaborator Author

stas00 commented Jul 12, 2023

@zcakzhuu,

You don't need to do anything if you resume from a deepspeed checkpoint saved during the first run - it will do the right thing.

As long as you remain within the deepspeed realm you don't need to do anything.

Only when you're done training and want to take the weights elsewhere would you use zero_to_fp32.py - if you were training in mixed precision, or if you were training in fp32 but saving only the deepspeed checkpoint.

Just give it a try - train for 1 step, save and resume and train for another step. It shouldn't be too difficult to see what you get.
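
For reference, a rough sketch of that save/resume cycle staying entirely inside DeepSpeed. The model, config values and paths are placeholders, it needs to run under the deepspeed launcher, and the HF Trainer does the equivalent for you when you pass resume_from_checkpoint:

import deepspeed
from torch import nn

model = nn.Linear(8, 8)   # placeholder model; ds_config below is a minimal assumption
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "zero_optimization": {"stage": 3},
}

engine, _, _, _ = deepspeed.initialize(model=model, model_parameters=model.parameters(), config=ds_config)

# ... train some steps ...
engine.save_checkpoint("checkpoints")    # writes the sharded ZeRO checkpoint (fp32 state inside)

# on restart: rebuild the engine the same way, then
engine.load_checkpoint("checkpoints")    # restores model + optimizer shards in place; no zero_to_fp32.py needed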

@stas00
Collaborator Author

stas00 commented Jul 12, 2023

@khulaifi95, I didn't understand your question, could you please try to reframe it?

Successfully merging this pull request may close these issues.

[request] new method to save the fp32 model
5 participants