
[RFC] hivemind.Optimizer overhaul #400

Closed
Conversation

justheuristic (Member) commented Nov 1, 2021

Current experience with hivemind optimizers

Where it hurts:

  • it is unclear which optimizer one should use (e.g. DecentralizedSGD vs CollaborativeOptimizer)
  • optimizers do not always work well with default parameters
  • we have to re-implement the same features for each optimizer (e.g. add decentralized learning rate scheduler and epochs abstraction #252)
  • implementing complex features (e.g. DPU) in CollaborativeOptimizer turns it into a huge blob of code that is difficult to maintain
  • OffloadOptimizerWrapper (in sahajbert-xl) was painful to maintain and inefficient w.r.t. CPU-GPU communication

What worked:

  • mimicking the PyTorch optimizer interface still seems to be a good idea
  • tracking the global batch size in CollaborativeOptimizer works well enough for practical use
  • wrapping an arbitrary optimizer and scheduler proved convenient

Additional constraints:

  • we need to finalize the interface by the NeurIPS demo (not necessarily all of the functionality)
  • we must be able to support features such as PowerSGD, EF21, etc.

Proposal: interface

Replace the existing optimizers with a single one:

hivemind.Optimizer(
  params=model.parameters(), optim_cls=partial(Adam, betas=(0.9, 0.95)),  # alternative: opt=my_optimizer_instance
  average_gradients=True, average_parameters=True, average_statistics=['exp_avg_sq'],
  async_state_averaging=True, async_gradient_averaging=True, async_optimizer_step=True, offload_optimizer=True,
)

The new optimizer would cover all 5 existing ones as special cases:

  • DecentralizedSGD: Optimizer(opt=SGD(...), average_parameters=True)
  • DecentralizedAdam: Optimizer(opt=Adam(...), average_parameters=True, average_statistics=("exp_avg_sq",))
  • CollaborativeOptimizer: Optimizer(opt=..., average_gradients=True, target_batch_size=4096, average_parameters=True)
  • CollaborativeAdaptiveOptimizer: Optimizer(opt=..., average_gradients=True, target_batch_size=4096, average_parameters=True, average_statistics=("exp_avg_sq",))
  • DPUOptimizer: Optimizer(..., average_gradients=True, async_gradient_averaging=True, **same_as_above)
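
For illustration, a CollaborativeOptimizer-style training loop under this proposal might look roughly as follows. This is only a sketch of the interface proposed above: the argument names are taken from the constructor call and the list, not from any released API, and the model/data are random placeholders.

from functools import partial

import torch
from torch.optim import Adam

import hivemind

model = torch.nn.Linear(512, 2)  # placeholder model

# CollaborativeOptimizer-like mode from the list above: accumulate gradients
# until target_batch_size samples are gathered globally, then average and step.
opt = hivemind.Optimizer(
    params=model.parameters(),
    optim_cls=partial(Adam, betas=(0.9, 0.95)),
    average_gradients=True,
    average_parameters=True,
    target_batch_size=4096,
)

for step in range(100):
    inputs = torch.randn(32, 512)           # placeholder batch
    targets = torch.randint(0, 2, (32,))
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    loss.backward()
    opt.step()       # reports local progress; performs a global step once the swarm is ready
    opt.zero_grad()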

Proposal: internals

[diagram: proposed hivemind.Optimizer internals]

TODOs

  • create hivemind.Optimizer

    • test convergence rate for some transformer
    • add a CI test that checks the correctness after several optimizer steps
    • add a CI test that ensures that asynchronous averaging works for both gradients and parameters
  • basic features

    • add an option to not average gradients at all (fully asynchronous training)
      • warning! this will require a more careful ProgressTracker to prevent the following behavior (see the sketch after this TODO list):
        • the first peer switches to the next epoch ahead of the others
        • the other peers then see fewer accumulated samples (the first peer has reset its sample count)
        • the other peers eventually consider themselves out_of_sync and load state from the first peer
        • how to fix: make sure the progress tracker records the last time each peer updated its progress on the current step
    • implement ahead-of-time scheduling of gradient averaging
    • implement averaging parameters every k steps
    • implement asynchronous gradient/state averaging
    • add an option to run optimizer on top of averaged parameters/gradients (RAM efficiency)
  • advanced features

    • implement EF21+ gradient averager
    • implement PowerSGD averager
  • update tutorials

    • documentation page
    • switch quickstart.html to the new optimizer
    • switch examples/albert to the new optimizer
  • refactoring / chores

    • HivemindGradScaler -> hivemind.GradScaler
    • HivemindOptimizer -> hivemind.Optimizer
    • deprecate hivemind.optim.*
    • deprecate hivemind.averaging.training
    • remove the old optimizers in v1.1.0
    • set target_group_size default = unlimited
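
Regarding the ProgressTracker concern in the "basic features" item above, here is a minimal, hypothetical sketch of the suggested fix: track when each peer last reported progress and only count fresh reports for the current epoch. Class and field names are illustrative and do not correspond to the actual hivemind classes.

import time
from dataclasses import dataclass
from typing import Dict


@dataclass
class PeerProgress:
    epoch: int                # global step the peer is currently accumulating samples for
    samples_accumulated: int  # samples reported for that epoch
    updated_at: float         # monotonic timestamp of the last report


class ProgressTrackerSketch:
    def __init__(self, expiration: float = 30.0):
        self.peers: Dict[str, PeerProgress] = {}
        self.expiration = expiration

    def report(self, peer_id: str, epoch: int, samples: int) -> None:
        self.peers[peer_id] = PeerProgress(epoch, samples, time.monotonic())

    def samples_for_epoch(self, local_epoch: int) -> int:
        # Count only fresh reports from peers on the same epoch, so a peer that
        # already advanced (and reset its sample counter) does not make the rest
        # of the swarm look out of sync.
        now = time.monotonic()
        return sum(
            p.samples_accumulated
            for p in self.peers.values()
            if p.epoch == local_epoch and now - p.updated_at < self.expiration
        )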

codecov bot commented Nov 1, 2021

Codecov Report

Merging #400 (b755f8a) into master (688b514) will increase coverage by 0.10%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master     #400      +/-   ##
==========================================
+ Coverage   84.09%   84.19%   +0.10%     
==========================================
  Files          77       77              
  Lines        7891     7891              
==========================================
+ Hits         6636     6644       +8     
+ Misses       1255     1247       -8     
Impacted Files Coverage Δ
hivemind/optim/experimental/progress_tracker.py 97.75% <0.00%> (-1.13%) ⬇️
hivemind/optim/experimental/optimizer.py 62.53% <0.00%> (-0.30%) ⬇️
hivemind/averaging/averager.py 87.65% <0.00%> (+0.72%) ⬆️
hivemind/utils/asyncio.py 100.00% <0.00%> (+0.86%) ⬆️
hivemind/utils/mpfuture.py 95.00% <0.00%> (+0.90%) ⬆️
hivemind/averaging/matchmaking.py 88.72% <0.00%> (+1.48%) ⬆️

justheuristic (Member, Author) commented:
So far, all the discussion we've had on this was verbal, so let me summarize what we've agreed on with @mryab, @borzunov, and @yhn112:

  • In the first version, we introduce hivemind.Optimizer in hivemind.optim.experimental but import it as hivemind.Optimizer (see the re-export sketch below)
    • the first version should support the (optimized) CollaborativeOptimizer mode, other functionality may not be implemented yet.
    • the first version should be merged without deprecation
  • In the second version, implement a behavior that mimics DecentralizedSGD/DecentralizedAdam
    • add test that checks for this specific behavior
    • deprecate the corresponding classes
    • update quickstart.md

... and then we'll figure out the rest of the features
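
For the "import it as hivemind.Optimizer" item above, a plain re-export should suffice. The module path below comes from the coverage report in this thread; the exact line is a sketch rather than the merged code.

# hivemind/__init__.py (sketch): expose the experimental class under the stable name,
# so that `import hivemind; hivemind.Optimizer(...)` works while the implementation
# stays in hivemind/optim/experimental/optimizer.py.
from hivemind.optim.experimental.optimizer import Optimizer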

justheuristic added a commit that referenced this pull request Nov 8, 2021
This PR implements GradientAverager, a subclass of DecentralizedAverager that supports accumulating and aggregating gradients. This class supports pre-scheduling and delayed averaging (for DPU, #394) for use in hivemind.Optimizer (#400).

Co-authored-by: Max Ryabinin <[email protected]>
Co-authored-by: Aleksandr Borzunov <[email protected]>
justheuristic added a commit that referenced this pull request Nov 15, 2021
This PR implements a component of hivemind.Optimizer (#400) that holds the training state and supports (delayed) optimizer steps and averaging rounds.
Unlike TrainingAverager, this class does not need data locks, as it only updates model parameters during .step.

Co-authored-by: Roman Zhytar <[email protected]>
Co-authored-by: Anton Sinitsin <[email protected]>
Co-authored-by: Max Ryabinin <[email protected]>
Co-authored-by: Aleksandr Borzunov <[email protected]>
justheuristic (Member, Author) commented Nov 30, 2021

Status report:

  • hivemind.Optimizer core functionality was merged in Implement core functionality of hivemind.Optimizer #403
  • The current effort is three-fold:
    • 1) Documentation & tutorials
      • add hivemind.Optimizer to RTFD
      • switch quickstart tutorial to use hivemind.Optimizer and test stability (@justheuristic )
      • switch examples/albert to use hivemind.Optimizer and test convergence (TODO)
      • add developer's guide to hivemind.Optimizer (TODO)
    • 2) Stability and support
      • verify full convergence for mingpt
      • verify full convergence with volunteers (TODO)
      • background thread: address any bugs and provide support as necessary
    • 3) Advanced features

borzunov (Member) commented:
Most of this is implemented in several PRs towards the 1.0.0 release.
