[v1alpha2] The state of distributed model training. #544
Personally, I think we should not make decisions for users. We consider the job failed if one worker fails and the user does not specify a chief worker. |
We will use worker-0 as chief worker by default.
|
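To make the worker-0 convention above concrete, here is a minimal Go sketch of how a controller could pick the chief when none is declared. The `ReplicaType`, `TFJobSpec`, and `chiefIdentity` names are illustrative stand-ins and are not the actual v1alpha2 API types.

```go
// Hypothetical sketch: if the user does not declare a chief replica,
// fall back to treating worker index 0 as the chief.
package main

import "fmt"

type ReplicaType string

const (
	ReplicaChief  ReplicaType = "Chief"
	ReplicaWorker ReplicaType = "Worker"
)

// TFJobSpec is a simplified stand-in for the real spec.
type TFJobSpec struct {
	Replicas map[ReplicaType]int // replica type -> requested count
}

// chiefIdentity returns which replica acts as the chief:
// the declared Chief if present, otherwise Worker 0 by default.
func chiefIdentity(spec TFJobSpec) (ReplicaType, int) {
	if n, ok := spec.Replicas[ReplicaChief]; ok && n > 0 {
		return ReplicaChief, 0
	}
	return ReplicaWorker, 0
}

func main() {
	spec := TFJobSpec{Replicas: map[ReplicaType]int{ReplicaWorker: 3}}
	rt, idx := chiefIdentity(spec)
	fmt.Printf("chief is %s-%d\n", rt, idx) // chief is Worker-0
}
```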
How many cases do we support now? Or what's the default policy for distributed TFJob? |
@DjangoPeng @gaocegege @ScorpioCPH 1. If the user sets the container restart policy to Always, then once the pod is running we keep treating the replica as running even if the pod later fails; whether the final training turned out well or badly is left to the user's own judgment (see the sketch after this comment).
WDYT? |
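A rough sketch of the reading in the previous comment, assuming the standard `k8s.io/api/core/v1` types; `replicaLooksFailed` is a hypothetical helper, not the operator's real logic. With restart policy `Always`, a pod failure is not treated as a replica failure, since the kubelet will restart the container.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// replicaLooksFailed decides whether a single replica's pod should be
// treated as failed, given the container restart policy the user chose.
func replicaLooksFailed(policy corev1.RestartPolicy, pod *corev1.Pod) bool {
	if policy == corev1.RestartPolicyAlways {
		// The kubelet will restart the container; keep reporting Running
		// and leave success/failure of training to the user's judgment.
		return false
	}
	return pod.Status.Phase == corev1.PodFailed
}

func main() {
	pod := &corev1.Pod{Status: corev1.PodStatus{Phase: corev1.PodFailed}}
	fmt.Println(replicaLooksFailed(corev1.RestartPolicyAlways, pod)) // false
	fmt.Println(replicaLooksFailed(corev1.RestartPolicyNever, pod))  // true
}
```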
IMO, the overall state for distributed training should be:
|
@0olwzo0 My understanding is:
|
Ref #562 |
dup with #562 |
@gaocegege @ScorpioCPH @DjangoPeng
Regarding how we judge the state of model training, I think there are usually several situations (a short code sketch follows the list):
1. The best case is that a non-chief worker task fails, because these tasks are effectively stateless. When such a worker task is restored, it reconnects to its PS tasks and resumes the work that was previously interrupted.
2. A worse situation is that a PS task fails. This is a problem because PS tasks are stateful: all worker tasks rely on them to send their gradients and fetch updated parameter values. In this case, the chief worker task is responsible for monitoring this error; if it occurs, the chief worker interrupts the entire training and restores all PS tasks from the previous checkpoint.
3. The worst case is that the chief worker task itself fails. Since we make it responsible for all the other tasks, once it has failed we can no longer ensure that the rest of the tasks in the cluster are in a good state, so what we do is interrupt the training.
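The three recovery behaviours above can be summarised as a lookup from the failed replica's role to the action taken. This is only an illustration of the description; `Role` and `recoveryFor` are made-up names.

```go
package main

import "fmt"

type Role string

const (
	RolePS     Role = "PS"
	RoleWorker Role = "Worker" // non-chief worker
	RoleChief  Role = "Chief"
)

// recoveryFor maps a failed role to the handling described above.
func recoveryFor(failed Role) string {
	switch failed {
	case RoleWorker:
		// Stateless: restart the worker and let it reconnect to its PS tasks.
		return "restart worker; it reconnects to PS and resumes"
	case RolePS:
		// Stateful: the chief interrupts training and restores PS from the last checkpoint.
		return "chief interrupts training and restores all PS tasks from checkpoint"
	case RoleChief:
		// Nothing is supervising the cluster any more: stop the training.
		return "interrupt the whole training"
	default:
		return "unknown role"
	}
}

func main() {
	for _, r := range []Role{RoleWorker, RolePS, RoleChief} {
		fmt.Printf("%s failed -> %s\n", r, recoveryFor(r))
	}
}
```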
Based on the above analysis, I think the handling can be divided into the following cases:
The user sets a chief, so when the chief node fails, I think the distributed training has failed;
The user does not set a chief, in which case there are two situations:
(1) the user may be using worker0 as the chief node, so when worker0 has a problem we consider the distributed training failed;
(2) the user does not use worker0 as the chief node, so it is when a PS has a problem that we consider the distributed training failed;
But in both of the above cases we have no way to tell whether the user is actually using worker0 as the chief node, so I think in this situation the distributed training should be considered failed whenever worker0 or any PS node has a problem (as sketched below).
What do you think?
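A minimal sketch of the failure policy proposed in this comment, using hypothetical types (`FailedReplica`, `jobFailed`): if a chief is declared, only its failure fails the job; otherwise a failure of worker-0 or of any PS replica fails the job, because the controller cannot tell whether worker-0 is acting as the chief.

```go
package main

import "fmt"

// FailedReplica records one replica that has been observed as failed.
type FailedReplica struct {
	Role  string // "Chief", "Worker" or "PS"
	Index int
}

// jobFailed implements the proposed rule for deciding the overall TFJob state.
func jobFailed(chiefDeclared bool, failures []FailedReplica) bool {
	for _, f := range failures {
		if chiefDeclared {
			if f.Role == "Chief" {
				return true
			}
			continue
		}
		// No explicit chief: worker-0 may be acting as chief, and PS state
		// cannot be recovered without one, so either failure fails the job.
		if (f.Role == "Worker" && f.Index == 0) || f.Role == "PS" {
			return true
		}
	}
	return false
}

func main() {
	// No chief declared, worker-1 crashed: job still considered running.
	fmt.Println(jobFailed(false, []FailedReplica{{Role: "Worker", Index: 1}})) // false
	// No chief declared, a PS crashed: job considered failed.
	fmt.Println(jobFailed(false, []FailedReplica{{Role: "PS", Index: 0}})) // true
}
```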