[hivemind.Optimizer] ProgressTracker #408

Merged: 23 commits merged into master on Nov 16, 2021

Conversation

justheuristic (Member)

Auxiliary class that keeps track of local and global training progress, measured in epochs.
An epoch is incremented once the collaboration accumulates a specified number of gradients (target_batch_size).
Similarly to a PyTorch LR scheduler, an epoch can correspond to a single optimizer update or to many local updates.
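
For illustration, a rough usage sketch based only on names visible in this PR; the prefix argument and the report_local_progress method are assumptions, not necessarily the merged API:

import hivemind
from hivemind.optim.experimental.progress_tracker import ProgressTracker

dht = hivemind.DHT(start=True)
# `prefix` and `report_local_progress` are assumptions for illustration;
# only target_batch_size, pause_updates() and shutdown() appear in this PR
tracker = ProgressTracker(dht=dht, prefix="my_experiment", target_batch_size=4096)

local_epoch, samples_accumulated = 0, 0
for batch in [list(range(32))] * 8:  # stand-in for a real data loader
    samples_accumulated += len(batch)
    tracker.report_local_progress(local_epoch, samples_accumulated)  # hypothetical
    if tracker.global_progress.samples_accumulated >= tracker.target_batch_size:
        local_epoch, samples_accumulated = local_epoch + 1, 0  # collaboration hit the target

tracker.shutdown()
dht.shutdown()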

codecov bot commented Nov 15, 2021

Codecov Report

Merging #408 (608b2cc) into master (99a0c18) will increase coverage by 0.40%.
The diff coverage is 97.48%.

@@            Coverage Diff             @@
##           master     #408      +/-   ##
==========================================
+ Coverage   84.17%   84.57%   +0.40%     
==========================================
  Files          75       76       +1     
  Lines        7121     7280     +159     
==========================================
+ Hits         5994     6157     +163     
+ Misses       1127     1123       -4     
Impacted Files                                     Coverage Δ
hivemind/optim/experimental/progress_tracker.py    97.48% <97.48%> (ø)
hivemind/utils/mpfuture.py                         94.95% <0.00%> (+0.91%) ⬆️
hivemind/utils/performance_ema.py                  94.87% <0.00%> (+15.38%) ⬆️

justheuristic (Member, Author) commented Nov 15, 2021

Massive test results:

#5a4c56b: test accuracy = 29/31 (too flaky; trying more lenient thresholds for epoch == 2)
[screenshot: test results]

@contextlib.contextmanager
def pause_updates(self):
    """Temporarily stop progress tracker from updating global training state"""
    with self.lock_global_progress, self.performance_ema.pause():
Review comment (Member):

(may be out of scope of this PR)

I'm surprised that performance_ema.pause() resets the timer instead of actually pausing it.
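
For contrast, a minimal sketch of a pause() that truly pauses: it shifts the reference timestamp forward by the pause duration, so paused time is excluded from the next rate estimate rather than the estimator being reset. All names here are hypothetical; this is not hivemind.utils.performance_ema:

import time
from contextlib import contextmanager

class PausableEMA:
    """Hypothetical EMA throughput estimator whose pause() excludes paused time."""

    def __init__(self, alpha: float = 0.1):
        self.alpha = alpha
        self.samples_per_second = 0.0
        self.last_timestamp = time.perf_counter()

    def update(self, num_samples: int) -> float:
        now = time.perf_counter()
        elapsed = max(now - self.last_timestamp, 1e-9)
        self.last_timestamp = now
        instantaneous = num_samples / elapsed
        self.samples_per_second = self.alpha * instantaneous + (1 - self.alpha) * self.samples_per_second
        return self.samples_per_second

    @contextmanager
    def pause(self):
        # shift the reference timestamp forward by the pause duration so the
        # paused interval is excluded from elapsed time (no reset of the EMA)
        pause_start = time.perf_counter()
        try:
            yield
        finally:
            self.last_timestamp += time.perf_counter() - pause_start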

try:
    while not self.shutdown_triggered.is_set():
        wait_timeout = max(0.0, last_report_time + self.metadata_expiration - get_dht_time())
        logger.debug(f"Will report progress again in {wait_timeout} seconds or on user command.")
Review comment (Member):

nit: Here and further:

Suggested change:
- logger.debug(f"Will report progress again in {wait_timeout} seconds or on user command.")
+ logger.debug(f"Will report progress again in {wait_timeout} seconds or on user command")
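
As context for this thread, the snippet above implements a shutdown-aware periodic reporter. A self-contained sketch of the same pattern, with threading.Event and time.monotonic standing in for hivemind's shutdown_triggered and get_dht_time(), and report as a hypothetical callback:

import threading
import time

def reporter_loop(shutdown: threading.Event, report, expiration: float = 30.0):
    last_report_time = time.monotonic()
    while not shutdown.is_set():
        wait_timeout = max(0.0, last_report_time + expiration - time.monotonic())
        # Event.wait doubles as an interruptible sleep: it returns True (early) once shutdown is set
        if shutdown.wait(timeout=wait_timeout):
            break
        report()
        last_report_time = time.monotonic()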

Comment on lines 130 to 136:

if self.global_epoch > self.local_progress.epoch:
    return True
elif self.global_progress.samples_accumulated >= self.target_batch_size:
    return True
elif get_dht_time() >= self.global_progress.eta_next_epoch:
    return True
return False
Review comment (Member):

Suggested change:
- if self.global_epoch > self.local_progress.epoch:
-     return True
- elif self.global_progress.samples_accumulated >= self.target_batch_size:
-     return True
- elif get_dht_time() >= self.global_progress.eta_next_epoch:
-     return True
- return False
+ return (
+     self.global_epoch > self.local_progress.epoch or
+     self.global_progress.samples_accumulated >= self.target_batch_size or
+     get_dht_time() >= self.global_progress.eta_next_epoch
+ )

assert not tracker.is_alive()

mean_step_time = sum(step_time_deltas) / len(step_time_deltas)
for i in (0, 1, 5):
Review comment (Member):

Suggested change:
- for i in (0, 1, 5):
+ for i in (0, 1, 5):  # Without the 4th worker (the fastest one)

mean_step_time = sum(step_time_deltas) / len(step_time_deltas)
for i in (0, 1, 5):
    assert 1.05 * mean_step_time < step_time_deltas[i] < 2.0 * mean_step_time
for i in (2, 3, 4):
Review comment (Member):

Suggested change:
- for i in (2, 3, 4):
+ for i in (2, 3, 4):  # With the 4th worker

performance_ema_alpha: float = 0.1,
metadata_expiration: float = 30.0,
status_loglevel: int = logging.DEBUG,
private_key: PrivateKey = None,
Review comment (Member):

Suggested change:
- private_key: PrivateKey = None,
+ private_key: Optional[RSAPrivateKey] = None,
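
Rationale (inferred, not stated in the review): a None default calls for an explicit Optional[...] annotation under PEP 484, and RSAPrivateKey pins the hint to the RSA-specific key type consumed by the RSASignatureValidator mentioned in the thread below.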

tracker.shutdown()
dht.shutdown()

# note: we use processes here because RSASignatureValidator inside trackers uses process-wide RSA keys
Review comment (Member):

Suggested change (delete the comment):
- # note: we use processes here because RSASignatureValidator inside trackers uses process-wide RSA keys

justheuristic merged commit d883387 into master on Nov 16, 2021
justheuristic deleted the progress_tracker branch on Nov 16, 2021 at 22:46