
[RFC] hivemind.Optimizer overhaul #400

Closed
Conversation

justheuristic (Member) commented Nov 1, 2021

Current experience with hivemind optimizers

Where it hurts:

  • it is unclear which optimizer one should use (e.g. DecentralizedSGD vs CollaborativeOptimizer)
  • optimizers do not always work well with default parameters
  • we have to re-implement the same features for each optimizer (e.g. add decentralized learning rate scheduler and epochs abstraction #252)
  • implementing complex features (e.g. DPU) in CollaborativeOptimizer turns it into a huge blob of code that is difficult to maintain
  • OffloadOptimizerWrapper (in sahajbert-xl) was painful to maintain and inefficient w.r.t. CPU-GPU communication

What worked:

  • mimicking the PyTorch optimizer interface still seems to be a good idea
  • tracking the global batch size in CollaborativeOptimizer works well enough for practical use
  • wrapping an arbitrary optimizer and scheduler proved convenient

Additional constraints:

  • we need to finalize the interface by the NeurIPS demo (not necessarily all of the functionality)
  • we must be able to support features such as PowerSGD, EF21, etc.

Proposal: interface

Replace the existing optimizers with a single one:

hivemind.Optimizer(
  params=model.parameters(), optim_cls=partial(Adam, betas=(0.9, 0.95)),  # alternative: opt=my_optimizer_instance
  average_gradients=True, average_parameters=True, average_statistics=['exp_avg_sq'],
  async_state_averaging=True, async_gradient_averaging=True, async_optimizer_step=True, offload_optimizer=True,
)

The new optimizer would cover all 5 existing ones as special cases:

  • DecentralizedSGD: Optimizer(opt=SGD(...), average_parameters=True)
  • DecentralizedAdam: Optimizer(opt=Adam(...), average_parameters=True, average_statistics=("exp_avg_sq",))
  • CollaborativeOptimizer: Optimizer(opt=..., average_gradients=True, target_batch_size=4096, average_parameters=True)
  • CollaborativeAdaptiveOptimizer: Optimizer(opt=..., average_gradients=True, target_batch_size=4096, average_parameters=True, average_statistics=("exp_avg_sq",))
  • DPUOptimizer: Optimizer(..., average_gradients=True, async_gradient_averaging=True, **same_as_above)
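
For illustration, a CollaborativeOptimizer-style training loop under this proposal might look roughly as follows. This is only a sketch of the interface proposed above: the argument names are taken from the constructor call and the list, not from any released API, and the model/data are random placeholders.

from functools import partial

import torch
from torch.optim import Adam

import hivemind

model = torch.nn.Linear(512, 2)  # placeholder model

# CollaborativeOptimizer-like mode from the list above: accumulate gradients
# until target_batch_size samples are gathered globally, then average and step.
opt = hivemind.Optimizer(
    params=model.parameters(),
    optim_cls=partial(Adam, betas=(0.9, 0.95)),
    average_gradients=True,
    average_parameters=True,
    target_batch_size=4096,
)

for step in range(100):
    inputs = torch.randn(32, 512)           # placeholder batch
    targets = torch.randint(0, 2, (32,))
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    loss.backward()
    opt.step()       # reports local progress; performs a global step once the swarm is ready
    opt.zero_grad()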

Proposal: internals

[diagram: proposed hivemind.Optimizer internals]

TODOs

  • create hivemind.Optimizer

    • test convergence rate for some transformer
    • add a CI test that checks the correctness after several optimizer steps
    • add a CI test that ensures that asynchronous averaging works for both gradients and parameters
  • basic features

    • add an option to not average gradients at all (fully asynchronous training)
      • warning! this will require a more careful ProgressTracker to prevent the following behavior (see the sketch after this TODO list):
        • the first peer switches to the next epoch ahead of the others
        • the other peers then see fewer accumulated samples (the first peer has reset its sample count)
        • the other peers eventually consider themselves out_of_sync and load state from the first peer
        • how to fix: make sure the progress tracker records the last time each peer updated its progress on the current step
    • implement ahead-of-time scheduling of gradient averaging
    • implement averaging parameters every k steps
    • implement asynchronous gradient/state averaging
    • add an option to run optimizer on top of averaged parameters/gradients (RAM efficiency)
  • advanced features

    • implement EF21+ gradient averager
    • implement PowerSGD averager
  • update tutorials

    • documentation page
    • switch quickstart.html to the new optimizer
    • switch examples/albert to the new optimizer
  • refactoring / chores

    • HivemindGradScaler -> hivemind.GradScaler
    • HivemindOptimizer -> hivemind.Optimizer
    • deprecate hivemind.optim.*
    • deprecate hivemind.averaging.training
    • remove the old optimizers in v1.1.0
    • set target_group_size default = unlimited
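
Regarding the ProgressTracker concern in the "basic features" item above, here is a minimal, hypothetical sketch of the suggested fix: track when each peer last reported progress and only count fresh reports for the current epoch. Class and field names are illustrative and do not correspond to the actual hivemind classes.

import time
from dataclasses import dataclass
from typing import Dict


@dataclass
class PeerProgress:
    epoch: int                # global step the peer is currently accumulating samples for
    samples_accumulated: int  # samples reported for that epoch
    updated_at: float         # monotonic timestamp of the last report


class ProgressTrackerSketch:
    def __init__(self, expiration: float = 30.0):
        self.peers: Dict[str, PeerProgress] = {}
        self.expiration = expiration

    def report(self, peer_id: str, epoch: int, samples: int) -> None:
        self.peers[peer_id] = PeerProgress(epoch, samples, time.monotonic())

    def samples_for_epoch(self, local_epoch: int) -> int:
        # Count only fresh reports from peers on the same epoch, so a peer that
        # already advanced (and reset its sample counter) does not make the rest
        # of the swarm look out of sync.
        now = time.monotonic()
        return sum(
            p.samples_accumulated
            for p in self.peers.values()
            if p.epoch == local_epoch and now - p.updated_at < self.expiration
        )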

codecov bot commented Nov 1, 2021

Codecov Report

Merging #400 (b755f8a) into master (688b514) will increase coverage by 0.10%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master     #400      +/-   ##
==========================================
+ Coverage   84.09%   84.19%   +0.10%     
==========================================
  Files          77       77              
  Lines        7891     7891              
==========================================
+ Hits         6636     6644       +8     
+ Misses       1255     1247       -8     
Impacted Files Coverage Δ
hivemind/optim/experimental/progress_tracker.py 97.75% <0.00%> (-1.13%) ⬇️
hivemind/optim/experimental/optimizer.py 62.53% <0.00%> (-0.30%) ⬇️
hivemind/averaging/averager.py 87.65% <0.00%> (+0.72%) ⬆️
hivemind/utils/asyncio.py 100.00% <0.00%> (+0.86%) ⬆️
hivemind/utils/mpfuture.py 95.00% <0.00%> (+0.90%) ⬆️
hivemind/averaging/matchmaking.py 88.72% <0.00%> (+1.48%) ⬆️

justheuristic (Member, Author) commented:
So far, all the discussion we've had on this was verbal, so let me summarize what we've agreed on with @mryab, @borzunov, and @yhn112:

  • In the first version, we introduce hivemind.Optimizer in hivemind.optim.experimental but import it as hivemind.Optimizer (see the re-export sketch below)
    • the first version should support the (optimized) CollaborativeOptimizer mode, other functionality may not be implemented yet.
    • the first version should be merged without deprecation
  • In the second version, implement a behavior that mimics DecentralizedSGD/DecentralizedAdam
    • add test that checks for this specific behavior
    • deprecate the corresponding classes
    • update quickstart.md

... and then we'll figure out the rest of the features
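
For the "import it as hivemind.Optimizer" item above, a plain re-export should suffice. The module path below comes from the coverage report in this thread; the exact line is a sketch rather than the merged code.

# hivemind/__init__.py (sketch): expose the experimental class under the stable name,
# so that `import hivemind; hivemind.Optimizer(...)` works while the implementation
# stays in hivemind/optim/experimental/optimizer.py.
from hivemind.optim.experimental.optimizer import Optimizer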

justheuristic added a commit that referenced this pull request Nov 8, 2021
This PR implements GradientAverager, a subclass of DecentralizedAverager that supports accumulating and aggregating gradients. This class supports pre-scheduling and delayed averaging (for DPU, #394) for use in hivemind.Optimizer (#400).

Co-authored-by: Max Ryabinin <[email protected]>
Co-authored-by: Aleksandr Borzunov <[email protected]>
justheuristic added a commit that referenced this pull request Nov 15, 2021
This PR implements a component of hivemind.Optimizer (#400) that holds the training state and supports (delayed) optimizer steps and averaging rounds.
Unlike TrainingAverager, this class does not need data locks, as it only updates model parameters during .step.

Co-authored-by: Roman Zhytar <[email protected]>
Co-authored-by: Anton Sinitsin <[email protected]>
Co-authored-by: Max Ryabinin <[email protected]>
Co-authored-by: Aleksandr Borzunov <[email protected]>
justheuristic (Member, Author) commented Nov 30, 2021

Status report:

  • hivemind.Optimizer core functionality was merged in Implement core functionality of hivemind.Optimizer #403
  • The current effort is three-fold:
    • 1) Documentation & tutorials
      • add hivemind.Optimizer to RTFD
      • switch quickstart tutorial to use hivemind.Optimizer and test stability (@justheuristic )
      • switch examples/albert to use hivemind.Optimizer and test convergence (TODO)
      • add developer's guide to hivemind.Optimizer (TODO)
    • 2) Stability and support
      • verify full convergence for mingpt
      • verify full convergence with volunteers (TODO)
      • background thread: address any bugs and provide support as necessary
    • 3) Advanced features

borzunov (Member) commented:
Most of this is implemented in several PRs towards the 1.0.0 release.
