Convert averager to libp2p backend #323

Merged
merged 70 commits into from
Jul 28, 2021

Conversation

borzunov
Member

@borzunov borzunov commented Jul 15, 2021

Current status: Finished.

What I have tested:

  • Training (1 monitor + 2 trainers) works, peers average successfully.

TODO:

  • Provide benchmark results (default and for large tensors)
  • Add experiment_prefix to handler name for averager RPCs to enable using several different averagers simultaneously
  • Follow-up PR: rename PeerID -> Endpoint, use bytes for PeerIDs in protobufs (instead of string) (moved to #276, "[REFACTOR] updates to DHT internals")

@borzunov borzunov force-pushed the averager-libp2p branch 2 times, most recently from 8590bd9 to 8109d93 Compare July 15, 2021 22:59
@borzunov borzunov force-pushed the averager-libp2p branch 2 times, most recently from cb09676 to 955c058 Compare July 16, 2021 18:37
@borzunov
Member Author

borzunov commented Jul 23, 2021

We have found that #317 is the cause of the periodic test freezes in both this branch and master.

3/30 test runs freeze for the current MPFuture implementation: report.

0/30 test runs freeze for the reverted MPFuture implementation (to the version with torch shared memory): report

Now, we are thinking about ways to fix that.

Member

@mryab mryab left a comment

Thanks for this monumental contribution! Before we merge, though, I'd also like to see two things:

  1. Passing tests (maybe #337, "Resolve deadlock in MPFuture", and #334, "Reduce complexity of several DHT tests", will be useful in that regard)
  2. Performance benchmarks comparing this one with the master branch

@@ -354,7 +354,7 @@ def report_training_progress(self):
         with self.lock_local_progress:
             current_time = get_dht_time()
             local_state_info = TrainingState(
-                endpoint=self.averager.endpoint,
+                peer_id=self.averager.endpoint.to_base58(),
Member

I'm slightly against casting to base58 in multiple places all over the code; would appreciate if it was possible to come up with a way to reduce this casting :)

Or maybe you can just call __str__ everywhere, since to_base58 is an implementation detail

Member Author

@borzunov borzunov Jul 26, 2021

We can use str(endpoint); however, we would still need to call PeerID.from_base58(value) to deserialize. Therefore, I'd suggest keeping these operations symmetric.

Also, it is actually more natural to use bytes for representing PeerIDs in protobufs (and change endpoint.to_base58()/PeerID.from_base58(value) to endpoint.to_bytes()/PeerID(value)). However, DHT already uses str for PeerIDs in protobufs, so I'd like to make the code consistent in this PR (but I don't mind changing everything to bytes in a separate PR).

Member

Sure, let's change it in a follow-up
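The symmetry argument in this thread can be illustrated with a minimal sketch. The method names (to_base58, from_base58, to_bytes, PeerID(value)) come from the discussion above, but the class body and the pure-Python Base58 helpers are assumptions made for illustration; this is not hivemind's actual PeerID implementation.

```python
# Hypothetical stand-in for a PeerID type; only the method names come from the thread.
_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def b58encode(data: bytes) -> str:
    """Encode bytes as a Base58 string (Bitcoin alphabet)."""
    num = int.from_bytes(data, "big")
    digits = []
    while num > 0:
        num, rem = divmod(num, 58)
        digits.append(_ALPHABET[rem])
    # Each leading zero byte is represented by one leading "1".
    n_leading_zeros = len(data) - len(data.lstrip(b"\0"))
    return "1" * n_leading_zeros + "".join(reversed(digits))


def b58decode(text: str) -> bytes:
    """Decode a Base58 string produced by b58encode back into bytes."""
    num = 0
    for char in text:
        num = num * 58 + _ALPHABET.index(char)
    body = num.to_bytes((num.bit_length() + 7) // 8, "big")
    n_leading_ones = len(text) - len(text.lstrip("1"))
    return b"\0" * n_leading_ones + body


class PeerID:
    def __init__(self, peer_id_bytes: bytes):
        self._bytes = peer_id_bytes

    # bytes form: PeerID(value) <-> peer_id.to_bytes()
    def to_bytes(self) -> bytes:
        return self._bytes

    # string form: PeerID.from_base58(value) <-> peer_id.to_base58()
    def to_base58(self) -> str:
        return b58encode(self._bytes)

    @classmethod
    def from_base58(cls, value: str) -> "PeerID":
        return cls(b58decode(value))

    def __str__(self) -> str:
        return self.to_base58()

    def __eq__(self, other: object) -> bool:
        return isinstance(other, PeerID) and self._bytes == other._bytes

    def __hash__(self) -> int:
        return hash(self._bytes)


if __name__ == "__main__":
    peer = PeerID(b"\x00\x12 example id")
    # Both serialization pairs round-trip, which is the symmetry being argued for.
    assert PeerID.from_base58(peer.to_base58()) == peer
    assert PeerID(peer.to_bytes()) == peer
```

Under these assumptions, str(peer) works for serialization, but the reader of the code still has to know that the inverse is from_base58, which is why keeping the explicit pair (or switching both sides to bytes) reads more clearly.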

if len(spec.args) < 3:
    raise ValueError(
        f"{method_name} is expected to have at least three positional arguments "
        f"(self: TServicer, request: TInputProtobuf, context: hivemind.p2p.P2PContext)"
Member

Thankfully, TServicer and TInputProtobuf are no more :)

Member Author

I've removed the mention of TServicer from this comment. However, I'd still suggest using the T prefix for TypeVars to distinguish them from ordinary types, so I am keeping TInputProtobuf for now :)
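For context, the check under review can be sketched as a standalone function. This assumes that spec is the result of inspect.getfullargspec; the function name check_handler_signature and the ExampleServicer class are hypothetical and only serve the illustration.

```python
import inspect
from typing import Any, Callable


def check_handler_signature(method_name: str, method: Callable[..., Any]) -> None:
    """Raise ValueError unless the handler takes at least (self, request, context)."""
    # Assumption: the `spec` in the quoted snippet comes from inspect.getfullargspec.
    spec = inspect.getfullargspec(method)
    if len(spec.args) < 3:
        raise ValueError(
            f"{method_name} is expected to have at least three positional arguments "
            f"(self, request: TInputProtobuf, context: hivemind.p2p.P2PContext)"
        )


class ExampleServicer:  # hypothetical servicer used only for this example
    async def rpc_ok(self, request, context):
        return request

    async def rpc_bad(self, request):  # missing the `context` argument
        return request


if __name__ == "__main__":
    check_handler_signature("rpc_ok", ExampleServicer.rpc_ok)  # passes silently
    try:
        check_handler_signature("rpc_bad", ExampleServicer.rpc_bad)
    except ValueError as e:
        print(f"Rejected: {e}")
```

The methods are inspected as plain functions on the class, so self is counted among the positional arguments, matching the "at least three" wording of the error message.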

@@ -59,6 +59,16 @@ async def aenumerate(aiterable: AsyncIterable[T]) -> AsyncIterable[Tuple[int, T]
index += 1


async def asingle(aiter: AsyncIterable[T]) -> T:
Member Author

The name is inspired by Single() from LINQ (.NET functional programming functions).

@borzunov
Member Author

borzunov commented Jul 27, 2021

Benchmark Results

Setup

num_peers = 16
target_group_size = 16
request_timeout = 1
hid_size = 8192
num_layers = 1
averaging_expiration = 300

Branch master (39afa97)

Part size: 2 ** 20 bytes
Averaging step time: mean 23.3 sec (std 1.2 sec, based on 3 runs)

Branch averager-libp2p (fc8d296)

The plot shows the mean averaging time ± std (based on 3 runs).

Part size (optimal): 2 ** 19 bytes
Averaging step time: mean 25.8 sec (std 0.8 sec, based on 3 runs)
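To put the two means side by side, here is a small calculation using only the numbers reported above. With three runs per branch and standard deviations of about one second, the result should be read as a rough estimate rather than a precise measurement.

```python
# Mean averaging step times reported above, in seconds.
master_mean = 23.3   # branch master (39afa97)
libp2p_mean = 25.8   # branch averager-libp2p (fc8d296)

absolute_slowdown = libp2p_mean - master_mean
relative_slowdown = absolute_slowdown / master_mean

print(f"{absolute_slowdown:.1f} sec slower ({relative_slowdown:.1%})")
# prints: 2.5 sec slower (10.7%)
```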

@borzunov borzunov requested a review from mryab July 27, 2021 01:44
@justheuristic justheuristic merged commit 3f691fc into master Jul 28, 2021
@justheuristic justheuristic deleted the averager-libp2p branch July 28, 2021 17:56
borzunov added a commit that referenced this pull request Jul 29, 2021
This PR follows #323 and does the remaining mass refactors:

1. Rename `Endpoint` to `PeerID` in averager (+ related variable names)
2. Rename the `P2P.id` field to `P2P.peer_id` (because the local peer ID is stored in the `.peer_id` fields in all other classes)
3. Serialize `PeerID`s as `bytes` instead of Base58 strings
4. Remove `JoinRequest.peer_id` and `AveragingData.peer_id` fields (they duplicate `context.remote_id`)
5. Remove the `DecentralizedAveraging` gRPC interface (not used anymore)