
Fix offloaded optimizer with single peer #450

Merged: 8 commits from single-peer-fix into master on Jan 19, 2022

Conversation

justheuristic (Member)

The bug was originally found by @elricwan and @finger92 (seemingly independently) and reported in #447.

Here's what caused the issue:

I investigated what went wrong when training with only one trainer. Currently, hivemind.Optimizer is hard-wired to use the averaged gradients -- as in "averaged with peers".

If you are the only peer, gradients are never actually averaged, so the optimizer runs with zero gradients all the time. This change should fix the problem in that specific case: 4ffd9ca. I seemingly introduced the bug myself in #440; it only affects the GitHub version of hivemind (i.e., not the PyPI version).
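
To illustrate the symptom with plain PyTorch (a toy sketch, not hivemind internals): when the optimizer steps on an offloaded copy of the parameters whose gradient buffer was never filled, the model makes no progress even though local gradients exist.

    import torch

    # Toy illustration of the symptom (plain PyTorch, not hivemind internals).
    # With offload_optimizer=True, the optimizer steps on a separate copy of the
    # parameters; if averaged gradients are never written into that copy, it steps on zeros.
    param = torch.nn.Parameter(torch.ones(3))                # "local" model parameter
    offloaded = torch.nn.Parameter(param.detach().clone())   # offloaded copy used by the optimizer
    opt = torch.optim.SGD([offloaded], lr=0.1)

    param.grad = torch.full_like(param, 2.0)      # local gradients exist...
    offloaded.grad = torch.zeros_like(offloaded)  # ...but with one peer, nothing is averaged in

    opt.step()
    print(offloaded.data)  # tensor([1., 1., 1.]) -- no progress: the optimizer saw zero gradients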

The bug was introduced in #440 and affects setups where all three of the following are true:

  • hivemind was installed from GitHub (not PyPI)
  • there is only one training peer in the swarm
  • offload_optimizer is True
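
A setup meeting all three conditions might look roughly like this (a hedged sketch modeled on hivemind's quickstart-style API; the run id, batch sizes, and model are made-up placeholders, and offload_optimizer is the flag in question):

    import torch
    import hivemind

    dht = hivemind.DHT(start=True)   # a fresh swarm with no other peers connected
    model = torch.nn.Linear(10, 2)   # placeholder model

    opt = hivemind.Optimizer(
        dht=dht,
        run_id="demo_run",            # placeholder experiment name
        batch_size_per_step=32,       # placeholder batch sizes
        target_batch_size=256,
        params=model.parameters(),
        optimizer=lambda params: torch.optim.Adam(params),
        offload_optimizer=True,       # the affected flag
        verbose=True,
    )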

Here's the behavior after the fix is introduced:
[image: training behavior after the fix]

codecov bot commented Jan 19, 2022

Codecov Report

Merging #450 (d50dedd) into master (8aa798d) will increase coverage by 0.37%.
The diff coverage is 85.71%.

@@            Coverage Diff             @@
##           master     #450      +/-   ##
==========================================
+ Coverage   83.72%   84.10%   +0.37%     
==========================================
  Files          78       78              
  Lines        7928     7931       +3     
==========================================
+ Hits         6638     6670      +32     
+ Misses       1290     1261      -29     
Impacted Files                        Coverage Δ
hivemind/optim/optimizer.py           62.28% <85.71%> (+2.05%) ⬆️
hivemind/optim/progress_tracker.py    97.80% <0.00%> (-1.10%) ⬇️
hivemind/averaging/averager.py        87.65% <0.00%> (+0.72%) ⬆️
hivemind/utils/asyncio.py            100.00% <0.00%> (+0.86%) ⬆️
hivemind/optim/grad_averager.py       93.81% <0.00%> (+1.03%) ⬆️
hivemind/dht/node.py                  92.63% <0.00%> (+1.18%) ⬆️
hivemind/averaging/matchmaking.py     88.69% <0.00%> (+4.46%) ⬆️

@@ -618,6 +618,12 @@ def _load_averaged_gradients_into_optimizer_(self):

         self.grad_averager.notify_used_averaged_gradients()

+    def _load_local_gradients_into_optimizer(self):
+        """Fallback to using local gradients in the optimizer (instead of averaged gradients)"""
+        logger.log(self.status_loglevel, f"Proceeding with local gradients")
Member

Please add a comment that this can be optimized in the case of one peer (if we'd ever need to optimize this case).

justheuristic (Member, Author)

just did it
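
To see the fix's logic in one place, here is a hedged, self-contained sketch of the fallback pattern (an illustration only, not hivemind's actual Optimizer; the two method names mirror the diff above, everything else is assumed):

    import torch

    class ToyOffloadedOptimizer:
        """Illustrative stand-in for the relevant part of hivemind.Optimizer (assumption, not the real class)."""

        def __init__(self, local_params, lr=0.1):
            self.local_params = list(local_params)
            # Offloaded copies that the inner optimizer actually steps on
            self.offloaded_params = [torch.nn.Parameter(p.detach().clone()) for p in self.local_params]
            self.inner = torch.optim.SGD(self.offloaded_params, lr=lr)
            self.averaged_grads = None  # set only when averaging with other peers succeeds

        def _load_averaged_gradients_into_optimizer_(self):
            for p, g in zip(self.offloaded_params, self.averaged_grads):
                p.grad = g.clone()

        def _load_local_gradients_into_optimizer(self):
            # Fallback to using local gradients in the optimizer (instead of averaged gradients);
            # assumes backward() has already populated the local .grad buffers
            for p, local in zip(self.offloaded_params, self.local_params):
                p.grad = local.grad.clone()

        def step(self):
            # The gist of the fix: with a single peer there is nothing averaged,
            # so fall back to local gradients instead of stepping on zeros.
            if self.averaged_grads is not None:
                self._load_averaged_gradients_into_optimizer_()
            else:
                self._load_local_gradients_into_optimizer()
            self.inner.step()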

justheuristic merged commit a974b55 into master on Jan 19, 2022
justheuristic deleted the single-peer-fix branch on January 19, 2022 at 19:43