Simplify ExpertBackend interface #483
Conversation
- extract batching and clipping from ExpertBackend, reassign this role to the optimizer/scheduler
- rename full_state -> state_dict; rationale: there is no "non-full" state in this context
- rename ExpertBackend.expert -> ExpertBackend.module to avoid confusion
@@ -54,7 +54,8 @@ def main():
                         help='Server will report experts to DHT once in this many seconds')
     parser.add_argument('--expiration', type=float, required=False, default=None,
                         help='DHT entries will expire after this many seconds')
-    parser.add_argument('--num_total_steps', type=int, required=False, help='The total number of steps for LR schedule')
+    parser.add_argument('--num_training_steps', type=int, required=False, help='The total number of steps for LR schedule')
Name changed in the source: https://github.com/huggingface/transformers/blob/v4.19.4/src/transformers/optimization.py#L75
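For context, `num_training_steps` is the parameter name used by the linked `transformers` schedule (`get_linear_schedule_with_warmup`). A minimal pure-Python sketch of that schedule's LR multiplier, mirroring the formula in the linked source (the function name `lr_lambda` is illustrative):

```python
def lr_lambda(current_step: int, num_warmup_steps: int, num_training_steps: int) -> float:
    """LR multiplier: linear warmup from 0 to 1, then linear decay back to 0."""
    if current_step < num_warmup_steps:
        return current_step / max(1, num_warmup_steps)
    return max(
        0.0,
        (num_training_steps - current_step) / max(1, num_training_steps - num_warmup_steps),
    )
```

In real code this function would be passed (with warmup/total steps bound) to `torch.optim.lr_scheduler.LambdaLR`.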
@@ -163,65 +141,47 @@ def backward(self, *inputs: torch.Tensor) -> Tuple[torch.Tensor, ...]:
         torch.autograd.backward(
             outputs_flat, grad_tensors=grad_outputs_flat, create_graph=False, retain_graph=False
         )
-        self.apply_gradients(batch_size)
+        self.on_backward(batch_size)
rationale: this does not necessarily apply gradients, e.g.
- virtual batching applies gradients once every k steps
- pretrained models do not apply the returned gradients
Codecov Report

@@            Coverage Diff             @@
##           master     #483      +/-   ##
==========================================
- Coverage   85.97%   85.78%   -0.20%
==========================================
  Files          79       80       +1
  Lines        7772     7808      +36
==========================================
+ Hits         6682     6698      +16
- Misses       1090     1110      +20
hivemind/moe/server/server.py
Outdated
optimizer = optim_cls(expert.parameters()) if optim_cls is not None else None
scheduler = scheduler_cls(optimizer) if scheduler_cls is not None else None
if clip_grad_norm is not None:
    scheduler = ClippingWrapper(scheduler, clip_grad_norm)
TODO FIX
import torch
from torch import nn

from hivemind.moe.server.task_pool import TaskPool
from hivemind.optim.state_averager import LRSchedulerBase
Since this import is not actually related to hivemind.optim, I'd suggest to simply inline the statement that is being imported.
fixed
It should return gradients w.r.t. inputs that follow ``nested_flatten(self.outputs_schema)``;

.. todo we handle layer states (e.g. batchnorm stats) incorrectly, updating them twice.
Is this not correct anymore? :)
This is arguably no different than gradient checkpointing, and it is unlikely that we can fix it here for all cases without the user defining custom layers. I can keep the todo if you insist [please reply here if so]. Alternatively, perhaps it would be best to change this todo into a warning/note. Your call?
A warning would be sufficient, I suppose
restored the warning
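The double-update issue discussed above can be demonstrated abstractly. In this sketch (all names are mine, and a call counter stands in for BatchNorm's running mean/var), re-running the forward pass during backward, as activation checkpointing does, updates the running statistics a second time:

```python
class StatefulLayer:
    """Stand-in for a layer with running statistics (e.g. BatchNorm in training mode)."""
    def __init__(self):
        self.stat_updates = 0

    def forward(self, x):
        self.stat_updates += 1  # running stats change on every training-mode forward
        return x


def backward_with_recompute(layer, x):
    """Like checkpointing: the forward is re-run during backward, updating stats again."""
    layer.forward(x)
```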
hivemind/moe/server/layers/optim.py
Outdated
"""A wrapper for pytorch.optim.Optimizer that forwards all methods to the wrapped optimizer"""

def __init__(self, optim: torch.optim.Optimizer):
    object.__init__(self)
object? In that case, it’s not a true Optimizer
On the contrary: it defines all optimizer fields and methods as properties of the wrapper.
return self.optim.add_param_group(param_group)


class ClippingWrapper(OptimizerWrapper):
In case we need OptimizerWrapper just for this application, I’d suggest not to overcomplicate the code and just write one class for a specific use case
Co-authored-by: Max Ryabinin <[email protected]>
The core idea is that we should not make hivemind internals conditional on a specific training technique, such as linear warmup or gradient clipping by norm. Instead, we let the user define their own scheduler and/or optimizer as necessary.
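The resulting contract can be sketched as follows, mirroring the `optim_cls(...)` / `scheduler_cls(...)` call sites in the diff above (the factory function itself is my illustration, not hivemind API; in real code one would pass e.g. `optim_cls=partial(torch.optim.SGD, lr=0.01)`):

```python
from functools import partial  # used by callers to bind hyperparameters (see note above)


def make_backend_components(parameters, optim_cls=None, scheduler_cls=None):
    """User-chosen optimizer/scheduler: any warmup or clipping lives inside these classes."""
    optimizer = optim_cls(parameters) if optim_cls is not None else None
    scheduler = scheduler_cls(optimizer) if scheduler_cls is not None else None
    return optimizer, scheduler
```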