Distributed training progress and todo #1820
Shall we first implement a "workable" version that will run on paddle cloud? A simple fault-tolerant version needs the features below. This version, running async SGD, can handle trainer failures and scale the number of trainers.
I agree with having a first "workable" implementation of fault-tolerant distributed training. Supporting only async SGD at first seems very reasonable to me. As a side note, Google internally uses async SGD for the majority of its jobs, and so does Cai Cloud Technology.
Let's discuss more during the meeting :)
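To make the async SGD idea above concrete, here is a minimal single-process sketch, not Paddle code. It assumes a parameter server that applies each gradient as soon as it arrives (no barrier between trainers) and a task queue that re-queues the batch of a failed trainer. The names `ParameterServer`, `train`, `fail_prob`, and `BATCHES` are hypothetical and chosen only for illustration.

```python
import random
from collections import deque


class ParameterServer:
    """Holds one weight and applies every gradient immediately (hypothetical)."""

    def __init__(self, w=0.0, lr=0.05):
        self.w = w
        self.lr = lr

    def pull(self):
        return self.w

    def push(self, grad):
        self.w -= self.lr * grad


def gradient(w, batch):
    # Mean gradient of 0.5 * (w * x - y)^2 over the batch.
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def train(batches, num_trainers=3, fail_prob=0.2, epochs=50, seed=0):
    rng = random.Random(seed)
    ps = ParameterServer()
    todo = deque(batches * epochs)  # task queue shared by all trainers
    failures = 0
    while todo:
        # Every trainer claims a task and pulls the weight before anyone
        # pushes, so all gradients in this round use the same stale weight.
        n = min(num_trainers, len(todo))
        claimed = [(todo.popleft(), ps.pull()) for _ in range(n)]
        for batch, stale_w in claimed:
            if rng.random() < fail_prob:
                # Simulated trainer failure: put the task back in the queue
                # so another trainer can pick it up later.
                failures += 1
                todo.append(batch)
                continue
            ps.push(gradient(stale_w, batch))
    return ps.w, failures


# Toy data generated from y = 3 * x, split into three batches.
BATCHES = [
    [(0.5, 1.5), (1.0, 3.0)],
    [(1.5, 4.5), (2.0, 6.0)],
    [(1.0, 3.0), (2.0, 6.0)],
]

if __name__ == "__main__":
    w, failures = train(BATCHES)
    print(f"learned w = {w:.6f} after {failures} simulated trainer failures")
```

Because no step waits on a particular trainer, changing `num_trainers` or losing a trainer mid-run only changes how fast the queue drains. That property is what makes async SGD a convenient starting point for fault tolerance and trainer scaling.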
Moved to #1860
Progress:
Design docs:
TODO:
For the first-level bullet points, please refer to the corresponding design doc. The second-level bullet points are questions that we still need to figure out.