Only identify specific exit codes as retryable error #518
Conversation
/assign @jlewi
If we treat all errors as permanent then how do we deal with retryable errors like the gRPC server going down? Or the process being killed by SIGTERM because a node became unhealthy?
It depends on how we define retryable errors. The gRPC server going down can be related to user code, so it makes some sense to call it a user error (though it's not permanent). For the SIGTERM case, it's also hard to tell whether the node was manually shut down by the user (user error) or by a service like GCE (retryable?).
I can say from experience that retryable errors are an issue, and treating all errors as permanent is a no-go. In the cloud, VMs can die, which causes workers to die. Treating that as a permanent error and failing the job is not the right thing to do. Do you agree? Can we close this PR?
I updated the PR to only allow retries for some known cases from CloudML Engine. I think the list can be extended if we see other data points.
Doesn't it make more sense to treat all errors as retryable and classify certain ones we know are permanent? Basically the reverse of what is being proposed here. Kubernetes already has the
What's an example of an error code that would be misclassified using the current schema?

I don't think this list is sufficient. TensorFlow is a distributed system, so when one process goes down (e.g. SIGTERM because a VM goes down), the error can propagate to other TF workers, for example because gRPC now encounters errors causing exceptions to be thrown. These exceptions need to be caught and turned into exit codes that indicate the proper behavior. So I think users need a schema that gives them the ability to indicate whether an exit is retryable or not.

The original schema was chosen for simplicity, symmetry, and to give users the ability to define an error as retryable or permanent:

1-128 - permanent
129-255 - retryable

This automatically classifies most unix-triggered failures as retryable, which I believe is the correct behavior. Does anyone have a counter example? We treat OOMs as permanent, but we detect those based on K8s/Docker signals, not exit codes.
I am not sure if this is a convention from Google, so I did not leave comments here. Personally, I agree with @jlewi. And if we need to support different policies for different clouds, there are two options:
FYI, there is a doc about sys.exit in Python.
It's not true that all unix-triggered failures are retryable. An example we have seen on CloudML Engine is:

To me, from the cloud-service perspective, keeping the resources running after a permanent error is misclassified as retryable is not a good idea (though that may not be true for non-cloud environments). I agree with @gaocegege, supporting a custom restart policy is a better option. But I still don't think [128, 255] is the right default range of exit codes to retry.
Thanks, that's a good example. Does anyone know offhand what exit code Python uses for an unhandled exception? I think the default behavior for that should be retryable. If we want to be explicit about exit codes, then I think we should do the following:
Any thoughts?
Does it mean that users provide a predefined list of exit codes to retry?
The question is how to get the full list of retryable signals. Do you think SIGTERM and SIGKILL are good enough as a starting point?
What do you mean by 'not make any guarantees'? Is that to identify them as permanent errors?
@ddysher I think so. We should have a plan to support a customized restart policy in v1alpha2.
I think in addition to SIGTERM and SIGKILL we should figure out what exit code Python uses by default for unhandled exceptions and map that to retryable errors. We should also pick two exit codes: one to correspond to a user-defined retryable error and one to a user-defined permanent error. @ddysher Yes, I think the behavior should be the same for v1alpha2. We should define a function IsRetryableExitCode so that we can use the same code in both implementations.
How about
SGTM |
FYI, I found this link helpful.
I think it's hard to define the behavior for every sys signal, so it's better to start with known ones. What do you guys think?
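For reference, the exit codes in this discussion follow the common shell convention of 128 plus the signal number described in the linked bash docs. A small standalone Go snippet (not part of this PR, and assuming Linux signal numbers) illustrates the mapping:

```go
package main

import (
	"fmt"
	"syscall"
)

func main() {
	// Conventional container/shell mapping: exit code = 128 + signal number.
	fmt.Println("SIGINT  ->", 128+int(syscall.SIGINT))  // 130 (Control-C)
	fmt.Println("SIGKILL ->", 128+int(syscall.SIGKILL)) // 137
	fmt.Println("SIGTERM ->", 128+int(syscall.SIGTERM)) // 143
}
```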
This looks good to me. @gaocegege @ScorpioCPH thoughts?
SGTM |
Updated the PR to address the discussion. PTAL!
pkg/trainer/training.go
```go
// We don't want to retry for both cases.
// More info about exit status can be found in:
// https://www.gnu.org/software/bash/manual/html_node/Exit-Status.html
if s.ExitCode == 1 || s.ExitCode == 2 || s.ExitCode == 126 ||
```
Can we make this a utility function IsRetryableExitCode? I'd like to be able to use the same function in the v1alpha1 and v1alpha2 controllers.
Done, moved it to pkg/util/train/train_util.go.
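As a rough illustration only, here is a minimal sketch of what such a helper in pkg/util/train/train_util.go could look like, assuming it whitelists only the three codes agreed on above; the actual implementation in the PR may differ:

```go
package train

// IsRetryableExitCode reports whether a container exit code should be treated
// as retryable. Only codes that usually indicate an externally triggered,
// transient termination are retried; every other non-zero code is considered
// a permanent failure.
func IsRetryableExitCode(exitCode int32) bool {
	switch exitCode {
	case 130, 137, 143: // 128+SIGINT, 128+SIGKILL, 128+SIGTERM
		return true
	default:
		return false
	}
}
```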
LGTM, and I agree with jlewi on adding a function to the utility package.
/ok-to-test
Addressed the comment, PTAL!
@jlewi LGTY?
Thank you so much, this is a great change. /lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jlewi. The full list of commands accepted by this bot can be found here. The pull request process is described here.
* Should return ReplicaStateFailed if container exit code is not 0
* Update the criteria for retryable errors.
* Reformat
* Reformat
* Reformat
* Fix lint error.
* Handle the exit code more explicitly.
* Reformat.
* Create a util func for IsRetryableExitCode.
The criteria used to decide between permanent and retryable errors are not correct. Basically, the current exit code range [128, 255] for retryable errors is too broad, and it will misclassify some permanent errors as retryable. For example:
The proposal is to only allow retries for specific exit codes that are more likely to be caused by transient issues (e.g. the VM was rescheduled or the VM was deleted by mistake):
130 = (128+2) Container terminated by Control-C
137 = (128+9) Container received a SIGKILL
143 = (128+15) Container received a SIGTERM
The list can be extended if we see other retryable cases.
The safe approach is to classify all the other non-zero exit codes as permanent errors.
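To make the proposed policy concrete, here is a small self-contained Go sketch of how a controller could map container exit codes to replica states; the type and state names below are hypothetical and only echo the discussion above, not the operator's actual code:

```go
package main

import "fmt"

// Hypothetical replica states, standing in for the operator's real ones.
type replicaState string

const (
	replicaStateRunning replicaState = "Running"
	replicaStateFailed  replicaState = "Failed"
)

// isRetryableExitCode mirrors the whitelist from the proposal above.
func isRetryableExitCode(exitCode int32) bool {
	return exitCode == 130 || exitCode == 137 || exitCode == 143
}

// stateForExit treats whitelisted codes as retryable and every other
// non-zero code as a permanent failure.
func stateForExit(exitCode int32) replicaState {
	if exitCode != 0 && !isRetryableExitCode(exitCode) {
		return replicaStateFailed
	}
	return replicaStateRunning
}

func main() {
	// 139 (128+SIGSEGV) is an example of a code outside the whitelist.
	for _, code := range []int32{0, 1, 130, 137, 139, 143} {
		fmt.Printf("exit code %3d -> %s\n", code, stateForExit(code))
	}
}
```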