v0.8: stable large-scale infrastructure

Closed Aug 29, 2020 (100% complete)

Goal: actually support training on 1000s of nodes

Features:

  • server: switch from pythonic connection_handler to asyncio + gRPC
  • client: implement parallel fault-tolerant backward for moe.py
  • dht: implement bulk store/get operations with caching
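
To illustrate what "parallel fault-tolerant" can mean for the client item, here is a minimal asyncio sketch. It is not the code from moe.py. The names call_expert, fault_tolerant_gather and demo are hypothetical, and the remote call is simulated with asyncio.sleep instead of a real gRPC request. The idea shown is only this: issue all requests concurrently, and drop the ones that fail or time out instead of aborting the whole step.

```python
import asyncio


async def call_expert(name, delay, fail=False):
    # Hypothetical stand-in for a remote expert's backward call.
    await asyncio.sleep(delay)
    if fail:
        raise ConnectionError(f"{name} unreachable")
    return name, delay


async def fault_tolerant_gather(calls, timeout):
    # Run all calls concurrently; keep successes, drop failures and timeouts.
    async def guarded(coro):
        try:
            return await asyncio.wait_for(coro, timeout)
        except (asyncio.TimeoutError, ConnectionError):
            return None

    results = await asyncio.gather(*(guarded(c) for c in calls))
    return [r for r in results if r is not None]


def demo():
    calls = [
        call_expert("expert.0", 0.01),
        call_expert("expert.1", 0.01, fail=True),  # simulated dead node
        call_expert("expert.2", 1.0),  # slower than the timeout below
    ]
    return asyncio.run(fault_tolerant_gather(calls, timeout=0.2))
```

In this sketch, demo() keeps only the result from expert.0; the failed and the slow call are skipped. A real client would also have to decide how many surviving responses are enough to proceed.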
