v0.8: stable large-scale infrastructure
Closed Aug 29, 2020
100% complete
Goal: support training on thousands of nodes in practice
Features:
- server: switch from pythonic connection_handler to asyncio + gRPC
- client: implement parallel fault-tolerant backward for moe.py
- dht: implement bulk store/get operations with caching
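
To illustrate the "parallel fault-tolerant backward" item, here is a minimal sketch of the general pattern: send requests to several experts concurrently and drop the ones that fail or time out. It is not the code in moe.py. `call_expert`, `fault_tolerant_backward`, the expert names, and the timeout value are all hypothetical stand-ins invented for this example, and the remote call is simulated with `asyncio.sleep` rather than a real gRPC request.

```python
import asyncio


async def call_expert(name: str, delay: float, fail: bool) -> str:
    """Hypothetical stand-in for a remote backward call to a single expert."""
    await asyncio.sleep(delay)  # simulated network + compute latency
    if fail:
        raise ConnectionError(f"{name} is unreachable")
    return f"grad from {name}"


async def fault_tolerant_backward(experts, timeout: float = 0.5):
    """Query all experts concurrently and keep only the successful responses."""
    # Each call gets its own deadline, so one slow expert cannot stall the rest.
    tasks = [asyncio.wait_for(call_expert(*spec), timeout) for spec in experts]
    # return_exceptions=True turns failures into return values instead of
    # aborting the whole batch on the first error.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return [r for r in results if not isinstance(r, BaseException)]


def run_demo():
    experts = [
        ("expert_a", 0.01, False),  # responds normally
        ("expert_b", 0.01, True),   # raises a connection error
        ("expert_c", 5.0, False),   # too slow, hits the timeout
    ]
    return asyncio.run(fault_tolerant_backward(experts, timeout=0.2))


if __name__ == "__main__":
    print(run_demo())
```

In this sketch the caller proceeds with whatever subset of experts answered. Whether a real mixture-of-experts layer should retry, re-route, or require a minimum number of responses is a design choice the milestone text does not specify.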