
Convert hivemind.Server/RemoteModuleCall/RemoteCallMany to libp2p backend #242

Closed · 2 of 4 tasks
justheuristic opened this issue Apr 22, 2021 · 6 comments
Labels: enhancement (New feature or request), server


justheuristic commented Apr 22, 2021

[depends on #238 being merged]
After we've implemented P2P transport with NAT traversal, we should switch the main components to the libp2p backend to take advantage of this new transport.

One of the three main components is hivemind.server.Server and its counterpart hivemind.client.RemoteExpert.

On the client side, hivemind creates a RemoteExpert PyTorch module that calls experts via _RemoteModuleCall (and _RemoteCallMany for DMoE).

A server receives incoming connections with several ConnectionHandler processes running in parallel. These processes currently run gRPC servers and hence should be switched to libp2p.

  • Checklist
    • find some way to attach several processes to one RPC (as in server/connection_handler.py)
    • make sure it passes tests/test_moe.py
    • make sure it passes tests/test_training.py
    • tune performance in tests/benchmark_throughput.py
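The first checklist item (attaching several processes to one RPC) boils down to a pool of workers consuming from one shared request source. Below is a minimal sketch of that pattern; it assumes nothing about hivemind's internals, all names are illustrative, and threads stand in for the separate processes hivemind actually uses, just to keep the example self-contained:

```python
# Hypothetical sketch (not hivemind's actual code) of the "several handlers,
# one RPC endpoint" pattern: multiple workers pull requests from one shared
# queue, so any free handler can serve the next incoming call.
import queue
import threading

def handler(requests: "queue.Queue", results: "queue.Queue") -> None:
    """Each handler consumes requests from the shared queue until poisoned."""
    while True:
        item = requests.get()
        if item is None:  # poison pill: shut this handler down
            break
        request_id, payload = item
        results.put((request_id, payload * 2))  # stand-in for an expert forward()

requests, results = queue.Queue(), queue.Queue()
workers = [threading.Thread(target=handler, args=(requests, results)) for _ in range(4)]
for w in workers:
    w.start()
for i in range(8):
    requests.put((i, i))
for _ in workers:
    requests.put(None)  # one pill per handler
for w in workers:
    w.join()
answers = dict(results.get() for _ in range(8))
assert answers == {i: i * 2 for i in range(8)}
```

The hard part the checklist refers to is doing the equivalent of this across OS processes behind a single libp2p endpoint, which is why it is called out separately.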
@GreenFatGuy (Collaborator) commented:

working on it


justheuristic commented Feb 24, 2022

Current status:

  • we've broken the DHT get_experts (or declare_experts) capability, gotta fix it
  • there's some error when dealing with large messages (@deniskamazur PLZ add an example that fails, e.g. for benchmark_throughput)
  • gotta merge changes from master

On merging:

  • Diff: master...server-p2p
  • server/__init__.py was renamed to server.py; apply this: d29101e (but keep the Server class from server-p2p)
  • benchmarks/benchmark_throughput_p2p.py -- accept ours
  • run_server - one line change, accept it
  • expert.py: (1) accept everything in this branch, (2) backport a single commit from master: b442369
  • connection_handler: (same as expert.py) (1) accept everything in this branch, (2) backport a single commit from master: b442369
  • dht_handler - accept this branch -- but it's actually buggy, will fix

Current affairs

  • @GreenFatGuy tries to bring this branch up to date and/or figure out how to connect to servers via DHT (examples/moe scenario)
  • @deniskamazur tries to reproduce the error with large messages in benchmark_throughput_p2p.py

@GreenFatGuy (Collaborator) commented:

Current progress:

  • the merge of master is done
  • the DHT bug is fixed: the problem was that the DHT did not have enough data to initialize RemoteExpert properly; also, RemoteExpert should now be constructed on the client side
  • working on fixes for big benchmarks: the plan is to split the data into partitions
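The client-side construction mentioned above can be pictured as a lazy handle: the client holds only cheap expert metadata from the DHT, and the expensive connection object is created on first use. A sketch under assumed names (LazyRemoteExpert and its fields are made up for illustration; they are not hivemind's real classes):

```python
# Hypothetical sketch of lazy client-side construction of a RemoteExpert-like
# handle: holding DHT metadata is cheap, the connection appears on first use.
class LazyRemoteExpert:
    def __init__(self, uid: str, peer_info: dict):
        self.uid = uid
        self.peer_info = peer_info
        self._connection = None  # stand-in for a real P2P instance

    @property
    def connection(self):
        """Create the underlying connection lazily, exactly once."""
        if self._connection is None:
            self._connection = {"connected_to": self.peer_info["peer_id"]}
        return self._connection

expert = LazyRemoteExpert("expert.0", {"peer_id": "QmPeer"})
assert expert._connection is None                  # nothing connected yet
assert expert.connection["connected_to"] == "QmPeer"  # created on first access
```

This keeps listing many experts from the DHT cheap, since only the experts that are actually called pay the connection cost.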

@GreenFatGuy (Collaborator) commented:

Current status:

Implemented streaming of the forward and backward passes for moe/expert/RemoteModuleCall. We now split all forward/backward inputs and outputs into smaller parts before sending, stream these parts, and reassemble them on the other end. Big batches are no longer a problem.
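The split-stream-reassemble scheme described above can be sketched as follows. The function names and the chunk size are assumptions for illustration; hivemind's real code splits serialized tensors rather than raw bytes:

```python
# Illustrative sketch of chunking a payload for streaming and reassembling
# it on the receiving side, so no single message exceeds the transport limit.
from typing import Iterable, Iterator

CHUNK_SIZE = 64 * 1024  # assumed part size, well under any per-message cap

def split_for_streaming(payload: bytes, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield consecutive parts of the payload, each at most chunk_size bytes."""
    for offset in range(0, len(payload), chunk_size):
        yield payload[offset:offset + chunk_size]

def combine_from_streaming(parts: Iterable[bytes]) -> bytes:
    """Reassemble the original payload from the streamed parts."""
    return b"".join(parts)

payload = bytes(200_000)  # a "big batch", larger than one chunk
parts = list(split_for_streaming(payload))
assert len(parts) > 1                               # actually streamed in pieces
assert combine_from_streaming(parts) == payload     # lossless round trip
```

Since chunks arrive in order within one stream, reassembly is a simple concatenation; no per-chunk indices are needed in this sketch.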

In plans:

  • fix tests for the new MoE expert/server on p2p
  • make the forward/backward call adaptive: use the non-streaming version (the old one) when possible, otherwise fall back to streaming. This could possibly increase throughput (needs an experiment).
  • try different architectures. I need a server for this, because my machine cannot handle it


GreenFatGuy commented May 20, 2022

Current status:

  • Implemented the adaptive forward/backward calls: small packets are handled in a single message, big ones are streamed.
  • All variants of benchmark_throughput_p2p.py are working.
  • Added forwarding of the P2P instance from the current DHT to the RemoteExpert (reminder: remote experts are now more complex than just a host address; they hold a P2P instance inside, so it is created lazily on the client side).
  • Achieved better performance on the benchmark by using multiple instances of the connection handler. This was achieved by some changes in the p2p daemon. (TODO: link to the changes)
  • Beam search was updated to handle lazy expert creation. tests/test_dht_experts.py passes.
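The "small packets at once, big ones streamed" dispatch from the first bullet can be sketched as a size check against a threshold. The threshold value and function name here are assumptions, not hivemind's actual constants:

```python
# Hedged sketch of size-based dispatch: payloads below the limit go out as a
# single unary message (lower latency), larger ones are cut into chunks and
# streamed (no per-message size cap).
UNARY_LIMIT = 2 * 1024 * 1024  # assumed threshold, below the transport's cap

def send_request(payload: bytes):
    """Return the chosen mode and the message(s) that would be sent."""
    if len(payload) <= UNARY_LIMIT:
        return "unary", [payload]  # one message, old non-streaming path
    chunks = [payload[i:i + UNARY_LIMIT]
              for i in range(0, len(payload), UNARY_LIMIT)]
    return "stream", chunks        # many messages, streaming path

mode, messages = send_request(bytes(100))
assert mode == "unary" and len(messages) == 1
mode, messages = send_request(bytes(5 * 1024 * 1024))
assert mode == "stream" and b"".join(messages) == bytes(5 * 1024 * 1024)
```

The benefit is that the common case (small activations) avoids the bookkeeping overhead of streaming, while big batches still fit through the transport.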

In plans:

  • fix the other tests (first tests/test_training.py and tests/test_moe.py, then any others that are broken)
  • open a pull request, address review comments, merge to master


borzunov commented Jun 9, 2022

Done in #470. Further minor issues are listed in #478.

@borzunov borzunov closed this as completed Jun 9, 2022