[hivemind.Optimizer] ProgressTracker #408

Merged: 23 commits merged into master on Nov 16, 2021

Conversation

justheuristic (Member)

Auxiliary class that keeps track of local and global training progress, measured in epochs.
An epoch is incremented once the collaboration accumulates a specified number of gradients (target_batch_size).
Similarly to a PyTorch LR scheduler, an epoch can correspond to a single optimizer update or to many local updates.
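
For illustration, a rough usage sketch based only on names visible in this PR; the prefix argument and the report_local_progress method are assumptions, not necessarily the merged API:

import hivemind
from hivemind.optim.experimental.progress_tracker import ProgressTracker

dht = hivemind.DHT(start=True)
# `prefix` and `report_local_progress` are assumptions for illustration;
# only target_batch_size, pause_updates() and shutdown() appear in this PR
tracker = ProgressTracker(dht=dht, prefix="my_experiment", target_batch_size=4096)

local_epoch, samples_accumulated = 0, 0
for batch in [list(range(32))] * 8:  # stand-in for a real data loader
    samples_accumulated += len(batch)
    tracker.report_local_progress(local_epoch, samples_accumulated)  # hypothetical
    if tracker.global_progress.samples_accumulated >= tracker.target_batch_size:
        local_epoch, samples_accumulated = local_epoch + 1, 0  # collaboration hit the target

tracker.shutdown()
dht.shutdown()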

codecov bot commented Nov 15, 2021

Codecov Report

Merging #408 (608b2cc) into master (99a0c18) will increase coverage by 0.40%.
The diff coverage is 97.48%.

@@            Coverage Diff             @@
##           master     #408      +/-   ##
==========================================
+ Coverage   84.17%   84.57%   +0.40%     
==========================================
  Files          75       76       +1     
  Lines        7121     7280     +159     
==========================================
+ Hits         5994     6157     +163     
+ Misses       1127     1123       -4     
Impacted Files                                     Coverage Δ
hivemind/optim/experimental/progress_tracker.py    97.48% <97.48%> (ø)
hivemind/utils/mpfuture.py                         94.95% <0.00%> (+0.91%) ⬆️
hivemind/utils/performance_ema.py                  94.87% <0.00%> (+15.38%) ⬆️

justheuristic (Member, Author) commented Nov 15, 2021

Massive test results:

#5a4c56b: test accuracy = 29/31 (too flaky; trying more lenient thresholds for epoch == 2)
[screenshot: test results]

@contextlib.contextmanager
def pause_updates(self):
    """Temporarily stop progress tracker from updating global training state"""
    with self.lock_global_progress, self.performance_ema.pause():
Review comment (Member):

(may be out of scope of this PR)

I'm surprised that performance_ema.pause() resets the timer instead of actually pausing it.
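
For contrast, a minimal sketch of a pause() that truly pauses: it shifts the reference timestamp forward by the pause duration, so paused time is excluded from the next rate estimate rather than the estimator being reset. All names here are hypothetical; this is not hivemind.utils.performance_ema:

import time
from contextlib import contextmanager

class PausableEMA:
    """Hypothetical EMA throughput estimator whose pause() excludes paused time."""

    def __init__(self, alpha: float = 0.1):
        self.alpha = alpha
        self.samples_per_second = 0.0
        self.last_timestamp = time.perf_counter()

    def update(self, num_samples: int) -> float:
        now = time.perf_counter()
        elapsed = max(now - self.last_timestamp, 1e-9)
        self.last_timestamp = now
        instantaneous = num_samples / elapsed
        self.samples_per_second = self.alpha * instantaneous + (1 - self.alpha) * self.samples_per_second
        return self.samples_per_second

    @contextmanager
    def pause(self):
        # shift the reference timestamp forward by the pause duration so the
        # paused interval is excluded from elapsed time (no reset of the EMA)
        pause_start = time.perf_counter()
        try:
            yield
        finally:
            self.last_timestamp += time.perf_counter() - pause_start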

try:
    while not self.shutdown_triggered.is_set():
        wait_timeout = max(0.0, last_report_time + self.metadata_expiration - get_dht_time())
        logger.debug(f"Will report progress again in {wait_timeout} seconds or on user command.")
Review comment (Member):

nit: Here and further:

Suggested change:
- logger.debug(f"Will report progress again in {wait_timeout} seconds or on user command.")
+ logger.debug(f"Will report progress again in {wait_timeout} seconds or on user command")
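
As context for this thread, the snippet above implements a shutdown-aware periodic reporter. A self-contained sketch of the same pattern, with threading.Event and time.monotonic standing in for hivemind's shutdown_triggered and get_dht_time(), and report as a hypothetical callback:

import threading
import time

def reporter_loop(shutdown: threading.Event, report, expiration: float = 30.0):
    last_report_time = time.monotonic()
    while not shutdown.is_set():
        wait_timeout = max(0.0, last_report_time + expiration - time.monotonic())
        # Event.wait doubles as an interruptible sleep: it returns True (early) once shutdown is set
        if shutdown.wait(timeout=wait_timeout):
            break
        report()
        last_report_time = time.monotonic()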

Comment on lines 130 to 136:

if self.global_epoch > self.local_progress.epoch:
    return True
elif self.global_progress.samples_accumulated >= self.target_batch_size:
    return True
elif get_dht_time() >= self.global_progress.eta_next_epoch:
    return True
return False
Review comment (Member):

Suggested change:
- if self.global_epoch > self.local_progress.epoch:
-     return True
- elif self.global_progress.samples_accumulated >= self.target_batch_size:
-     return True
- elif get_dht_time() >= self.global_progress.eta_next_epoch:
-     return True
- return False
+ return (
+     self.global_epoch > self.local_progress.epoch or
+     self.global_progress.samples_accumulated >= self.target_batch_size or
+     get_dht_time() >= self.global_progress.eta_next_epoch
+ )

assert not tracker.is_alive()

mean_step_time = sum(step_time_deltas) / len(step_time_deltas)
for i in (0, 1, 5):
Review comment (Member):

Suggested change:
- for i in (0, 1, 5):
+ for i in (0, 1, 5):  # Without the 4th worker (the fastest one)

mean_step_time = sum(step_time_deltas) / len(step_time_deltas)
for i in (0, 1, 5):
    assert 1.05 * mean_step_time < step_time_deltas[i] < 2.0 * mean_step_time
for i in (2, 3, 4):
Review comment (Member):

Suggested change:
- for i in (2, 3, 4):
+ for i in (2, 3, 4):  # With the 4th worker

performance_ema_alpha: float = 0.1,
metadata_expiration: float = 30.0,
status_loglevel: int = logging.DEBUG,
private_key: PrivateKey = None,
Review comment (Member):

Suggested change:
- private_key: PrivateKey = None,
+ private_key: Optional[RSAPrivateKey] = None,
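
Rationale (inferred, not stated in the review): a None default calls for an explicit Optional[...] annotation under PEP 484, and RSAPrivateKey pins the hint to the RSA-specific key type consumed by the RSASignatureValidator mentioned in the thread below.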

tracker.shutdown()
dht.shutdown()

# note: we use processes here because RSASignatureValidator inside trackers uses process-wide RSA keys
Review comment (Member):

Suggested change (delete the comment):
- # note: we use processes here because RSASignatureValidator inside trackers uses process-wide RSA keys

justheuristic merged commit d883387 into master on Nov 16, 2021
justheuristic deleted the progress_tracker branch on Nov 16, 2021 at 22:46