
Convert hivemind.Server/RemoteModuleCall/RemoteCallMany to libp2p backend #242

Closed · 2 of 4 tasks
justheuristic opened this issue Apr 22, 2021 · 6 comments
Labels: enhancement (New feature or request), server


justheuristic commented Apr 22, 2021

[depends on #238 being merged]
After we've implemented P2P transport with NAT traversal, we should switch the main components to the libp2p backend to take advantage of this new transport.

One of the three main components is hivemind.server.Server and its counterpart hivemind.client.RemoteExpert.

On the client side, hivemind creates a RemoteExpert PyTorch module that calls experts via _RemoteModuleCall (and _RemoteCallMany for DMoE).

A server receives incoming connections with several ConnectionHandler processes running in parallel. These processes currently run gRPC servers and hence should be switched to libp2p.

  • Checklist
    • find some way to attach several processes to one RPC (as in server/connection_handler.py)
    • make sure it passes tests/test_moe.py
    • make sure it passes tests/test_training.py
    • tune performance in tests/benchmark_throughput.py
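The first checklist item (attaching several processes to one RPC) boils down to a pool of workers consuming from one shared request source. Below is a minimal sketch of that pattern; it assumes nothing about hivemind's internals, all names are illustrative, and threads stand in for the separate processes hivemind actually uses, just to keep the example self-contained:

```python
# Hypothetical sketch (not hivemind's actual code) of the "several handlers,
# one RPC endpoint" pattern: multiple workers pull requests from one shared
# queue, so any free handler can serve the next incoming call.
import queue
import threading

def handler(requests: "queue.Queue", results: "queue.Queue") -> None:
    """Each handler consumes requests from the shared queue until poisoned."""
    while True:
        item = requests.get()
        if item is None:  # poison pill: shut this handler down
            break
        request_id, payload = item
        results.put((request_id, payload * 2))  # stand-in for an expert forward()

requests, results = queue.Queue(), queue.Queue()
workers = [threading.Thread(target=handler, args=(requests, results)) for _ in range(4)]
for w in workers:
    w.start()
for i in range(8):
    requests.put((i, i))
for _ in workers:
    requests.put(None)  # one pill per handler
for w in workers:
    w.join()
answers = dict(results.get() for _ in range(8))
assert answers == {i: i * 2 for i in range(8)}
```

The hard part the checklist refers to is doing the equivalent of this across OS processes behind a single libp2p endpoint, which is why it is called out separately.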
@GreenFatGuy (Collaborator) commented:

working on it


justheuristic commented Feb 24, 2022

Current status:

  • we've broken the DHT get_experts (or declare_experts) capability, gotta fix it
  • there's some error when dealing with large messages (@deniskamazur PLZ add an example that fails, e.g. for benchmark_throughput)
  • gotta merge changes from master

On merging:

  • Diff: master...server-p2p
  • server/__init__.py was renamed to server.py; apply this: d29101e (but keep the Server class from server-p2p)
  • benchmarks/benchmark_throughput_p2p.py -- accept ours
  • run_server - one line change, accept it
  • expert.py: (1) accept everything in this branch, (2) backport a single commit from master: b442369
  • connection_handler: (same as expert.py) (1) accept everything in this branch, (2) backport a single commit from master: b442369
  • dht_handler - accept this branch -- but it's actually buggy, will fix

Current affairs

  • @GreenFatGuy tries to bring this branch up to date and/or figure out how to connect to servers via DHT (examples/moe scenario)
  • @deniskamazur tries to reproduce the error with large messages in benchmark_throughput_p2p.py

@GreenFatGuy (Collaborator) commented:

Current progress:

  • the merge of master is done
  • the DHT bug is fixed: the problem was that the DHT did not have enough data to initialize RemoteExpert properly; also, RemoteExpert should now be constructed on the client side
  • working on fixes for big benchmarks: the plan is to split the data into partitions
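The client-side construction mentioned above can be pictured as a lazy handle: the client holds only cheap expert metadata from the DHT, and the expensive connection object is created on first use. A sketch under assumed names (LazyRemoteExpert and its fields are made up for illustration; they are not hivemind's real classes):

```python
# Hypothetical sketch of lazy client-side construction of a RemoteExpert-like
# handle: holding DHT metadata is cheap, the connection appears on first use.
class LazyRemoteExpert:
    def __init__(self, uid: str, peer_info: dict):
        self.uid = uid
        self.peer_info = peer_info
        self._connection = None  # stand-in for a real P2P instance

    @property
    def connection(self):
        """Create the underlying connection lazily, exactly once."""
        if self._connection is None:
            self._connection = {"connected_to": self.peer_info["peer_id"]}
        return self._connection

expert = LazyRemoteExpert("expert.0", {"peer_id": "QmPeer"})
assert expert._connection is None                  # nothing connected yet
assert expert.connection["connected_to"] == "QmPeer"  # created on first access
```

This keeps listing many experts from the DHT cheap, since only the experts that are actually called pay the connection cost.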

@GreenFatGuy (Collaborator) commented:

Current status:

Implemented streaming of the forward and backward passes for moe/expert/RemoteModuleCall. We now split all forward/backward inputs and outputs into smaller parts before sending, stream these parts, and reassemble them on the other end. Big batches are no longer a problem.
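The split-stream-reassemble scheme described above can be sketched as follows. The function names and the chunk size are assumptions for illustration; hivemind's real code splits serialized tensors rather than raw bytes:

```python
# Illustrative sketch of chunking a payload for streaming and reassembling
# it on the receiving side, so no single message exceeds the transport limit.
from typing import Iterable, Iterator

CHUNK_SIZE = 64 * 1024  # assumed part size, well under any per-message cap

def split_for_streaming(payload: bytes, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield consecutive parts of the payload, each at most chunk_size bytes."""
    for offset in range(0, len(payload), chunk_size):
        yield payload[offset:offset + chunk_size]

def combine_from_streaming(parts: Iterable[bytes]) -> bytes:
    """Reassemble the original payload from the streamed parts."""
    return b"".join(parts)

payload = bytes(200_000)  # a "big batch", larger than one chunk
parts = list(split_for_streaming(payload))
assert len(parts) > 1                               # actually streamed in pieces
assert combine_from_streaming(parts) == payload     # lossless round trip
```

Since chunks arrive in order within one stream, reassembly is a simple concatenation; no per-chunk indices are needed in this sketch.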

In plans:

  • fix tests for the new MoE expert/server on p2p
  • make the forward/backward call adaptive: use the non-streaming version (the old one) when possible, otherwise fall back to streaming. This could possibly increase throughput (needs an experiment).
  • try different architectures. I need a server for this, because my machine cannot handle it


GreenFatGuy commented May 20, 2022

Current status:

  • Implemented the adaptive forward/backward calls: small packets are handled in a single message, big ones are streamed.
  • All variants of benchmark_throughput_p2p.py are working.
  • Added forwarding of the P2P instance from the current DHT to the RemoteExpert (reminder: remote experts are now more complex than just a host address; they hold a P2P instance inside, so it is created lazily on the client side).
  • Achieved better performance on the benchmark by using multiple instances of the connection handler. This was achieved by some changes in the p2p daemon. (TODO: link to the changes)
  • Beam search was updated to handle lazy expert creation. tests/test_dht_experts.py passes.
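The "small packets at once, big ones streamed" dispatch from the first bullet can be sketched as a size check against a threshold. The threshold value and function name here are assumptions, not hivemind's actual constants:

```python
# Hedged sketch of size-based dispatch: payloads below the limit go out as a
# single unary message (lower latency), larger ones are cut into chunks and
# streamed (no per-message size cap).
UNARY_LIMIT = 2 * 1024 * 1024  # assumed threshold, below the transport's cap

def send_request(payload: bytes):
    """Return the chosen mode and the message(s) that would be sent."""
    if len(payload) <= UNARY_LIMIT:
        return "unary", [payload]  # one message, old non-streaming path
    chunks = [payload[i:i + UNARY_LIMIT]
              for i in range(0, len(payload), UNARY_LIMIT)]
    return "stream", chunks        # many messages, streaming path

mode, messages = send_request(bytes(100))
assert mode == "unary" and len(messages) == 1
mode, messages = send_request(bytes(5 * 1024 * 1024))
assert mode == "stream" and b"".join(messages) == bytes(5 * 1024 * 1024)
```

The benefit is that the common case (small activations) avoids the bookkeeping overhead of streaming, while big batches still fit through the transport.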

In plans:

  • fix the other tests (first tests/test_training.py and tests/test_moe.py, then any others that are broken)
  • open a pull request, address review comments, merge to master


borzunov commented Jun 9, 2022

Done in #470. Further minor issues are listed in #478.

@borzunov borzunov closed this as completed Jun 9, 2022