Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only identify specific exit codes as retryable error #518

Merged
merged 12 commits into from
Apr 20, 2018

Conversation

wenzhel101
Copy link
Contributor

@wenzhel101 wenzhel101 commented Mar 30, 2018

The criteria to decide permanent error vs retryable error is not correct. Basically, the current exit code range[128, 255] for retryable errors is too broad and it will misclassify some permanent errors as retryable. For example:

  1. user code exits with negative code(e.g. sys.exit(-1)). The container exit code will be 255(exit_code % 256).
  2. user code was killed by SIGSEGV, SIGABRT, etc. Those cases are mostly caused by the problem in mem allocation from user code.

The proposal is to only allow retry for specific error codes that are more likely to be caused by transient issues(e.g. VM was rescheduled or VM was deleted by mistake):
130 = (128+2) Container terminated by Control-C
137 = (128+9) Container received a SIGKILL
143 = (128+15) Container received a SIGTERM

The list can be extended if we see other retryable cases.

The safe approach is to classify all the non-zero exit code as permanent errors.


This change is Reviewable

@coveralls
Copy link

coveralls commented Mar 30, 2018

Coverage Status

Coverage increased (+1.1%) to 49.643% when pulling 96d933e on 0olwzo0:master into ff2958a on kubeflow:master.

@wenzhel101
Copy link
Contributor Author

/assign @jlewi

@jlewi
Copy link
Contributor

jlewi commented Mar 30, 2018

If we treat all errors as permanent then how do we deal with retryable errors like the gRPC server going down?

Or the process being killed by SIGTERM because a node became unhealthy?

@wenzhel101
Copy link
Contributor Author

It depends that how to define the retryable error. The gRPC server going down can be related to user code. Then it makes some sense to say it's still user error(though it's not permanent). And for the SIGTERM case, it's also hard to tell if the node was manually shutdown by user(user error) or by the service like GCE(retryable?).
I agree that we may not be able to handle it perfectly. It depends on how frequently the edge cases happen. IMO, If all of the edge cases are not 'likely' to happen, it's fine to treat them all as permanent errors.

@jlewi
Copy link
Contributor

jlewi commented Mar 30, 2018

I can say from experience that retryable errors are an issue and treating all errors is permanent as a no go.

On Cloud VMs can die and this will cause workers to die. Treating this as a permanent error and failing the job is not the right thing to do.

Do you agree can we close this PR?

@k8s-ci-robot k8s-ci-robot added size/M and removed size/L labels Apr 2, 2018
@wenzhel101 wenzhel101 changed the title Should return ReplicaStateFailed if container exit code is not 0 Only identify specific exit codes as retryable error Apr 2, 2018
@wenzhel101
Copy link
Contributor Author

I updated the PR to only allow to retry for some known cases from CloudML Engine. And I think the list can be extended if we see other data points.
What do you think, @jlewi?

@jose5918
Copy link
Contributor

jose5918 commented Apr 3, 2018

Doesn't it make more sense to treat all errors as retryable and classify certain ones we know are permanent. Basically the reverse of what is being proposed here. Kubernetes already has the CrashLoopBackOff for things that are failing and it's possible to put a limit on the time the operator lets a job stay in a failed state.

@jlewi
Copy link
Contributor

jlewi commented Apr 3, 2018

The criteria to decide permanent error vs retryable error is not correct. Basically, the current exit code range[128, 255] for retryable errors is too broad and it will misclassify some permanent errors as retryable.

What's an example of an error code that would be missclassified using the current schema?

I don't think this list is sufficient
130 = (128+2) Container terminated by Control-C
137 = (128+9) Container received a SIGKILL
143 = (128+15) Container received a SIGTERM

TensorFlow is a distributed system. So when one process goes down; e.g. SIGTERM because VM goes down. This error can propogate to other TF workers e.g. because gRPC now encounters errors causing exceptions to be thrown.

These exceptions need to be caught and turned into exit codes that indicate the proper behavior e.g.
permanent or retryable.

So I think users need a schema that gives them the ability to indicate whether an exit is retryable or not.

The original schema was chosen for simplicity, symmetry and to give users the ability to define an error as retryable or permanent

1-128 - permanent
129-255 - retryable

This automatically classifies most unix triggered failures as retryable which I believe is the correct behavior. Does anyone have a counter example?

We treat OOMs as permanent but we detect those based on K8s/Docker signals not exit codes.

/cc @gaocegege @ScorpioCPH

@gaocegege
Copy link
Member

I am not sure if it is the conventions from Google so I did not leave comments here.

Personally, I agree with @jlewi . And if we need to support different policies for different clouds. There are two options:

FYI There is a doc about sys.exit in python.

Most systems require it to be in the range 0–127, and produce undefined results otherwise. Some systems have a convention for assigning specific meanings to specific exit codes, but these are generally underdeveloped

@wenzhel101
Copy link
Contributor Author

This automatically classifies most unix triggered failures as retryable which I believe is the correct
behavior. Does anyone have a counter example?

It's not true that all unix triggered failures are retryable. An example we have seen on CloudML Engine is SIGSEGV, which is related to user code and won't recover after retry. In this case, keeping the job running is a wrong behavior and the issue may not be noticed by users(TFJob is RUNNING).
We also saw some users do sys.exit(-1) to exit from their programs. They meant to stop executing the program but it will be retried using tf-operator.

To me, from the cloud service perspective, keeping the resources running after a misclassification of retryable error is not a good idea(but it may be not true for non-cloud environment). I agree with @gaocegege, supporting custom restart policy is a better option. But I still don't think [128, 255] is the right default range for exit code to retry.

@jlewi
Copy link
Contributor

jlewi commented Apr 3, 2018

Thanks that's a good example.

Does anyone know off hand what exit code Python uses for an unhandled exception? I think the default behavior for that should be retyable.

If we want to be explicit about exist codes then I think we should do the following

  • Define exit codes corresponding to user defined retryable and permanent errors

  • Define behavior for exit codes corresponding to relevant unix signals

  • Explicitly define all other exit codes as undefined and not make any guarantees.

@jlewi
Copy link
Contributor

jlewi commented Apr 10, 2018

Any thoughts?

@wenzhel101
Copy link
Contributor Author

Define exit codes corresponding to user defined retryable and permanent errors

Does it mean that users provide a predefined list of exit code to retry?

Define behavior for exit codes corresponding to relevant unix signals

The question is how to get the full list of the retryable signals. Do you think SIGTERM and SIGKILL are good enough as the start point?

Explicitly define all other exit codes as undefined and not make any guarantees.

What do you mean by 'not make any guarantees'? Is it to identify them as permanent error?

@gaocegege
Copy link
Member

@ddysher I think so. We should have a plan to support customized restartpolicy in v1alpha2.

@jlewi
Copy link
Contributor

jlewi commented Apr 11, 2018

What do you mean by 'not make any guarantees'? Is it to identify them as permanent error?
I mean the behavior is undefined; i.e. we don't specify whether an exit code will be treated as retryable or permanent. Its left to the implementation to decide.

I think in addition to SIGTERM and SIGKILL we should figure out what exit code Python by default uses for unhandled exceptions and map that to retryable errors.

We should also pick two exit codes one to correspond to user defined retryable and one to correspond to user defined exit codes.

@ddysher Yes I think the behavior should be the same for v1alpha2. We should define function IsRetryableExitCode so that we can use the same code in both implementations.

@wenzhel101
Copy link
Contributor Author

We should also pick two exit codes one to correspond to user defined retryable and one to correspond to user defined exit codes.

How about SIGUSR1 & SIGUSR2?

@jlewi
Copy link
Contributor

jlewi commented Apr 12, 2018

SGTM

@wenzhel101
Copy link
Contributor Author

FYI, I found this link is helpful.
Here is my proposal:

  • Permanent Error:

    • 1: general errors
    • 2: Misuse of shell builtins
    • 126: Command invoked cannot execute
    • 127: Command not found
    • 128: Invalid argument to exit
    • 139: terminated by SIGSEGV(Invalid memory reference)
  • Retryable Error:

    • 130: SIGINT(interrupted from keyboard CTRL-C)
    • 137: terminated by SIGKILL
    • 143: terminated by SIGTERM
    • 138: corresponds to SIGUSR1. Reserved in tf-operator for user specified retryable errors.
  • Others are undefined, no guarantee about the behavior(currently handle them as permanent error).

I think it's hard to define the behavior for every sys signal, so it's better to start with known ones. What do you guys think?

@jlewi
Copy link
Contributor

jlewi commented Apr 17, 2018

This looks good to me.

@gaocegege @ScorpioCPH thoughts?

@gaocegege
Copy link
Member

SGTM

@wenzhel101
Copy link
Contributor Author

Updated the PR to address the discussion. PTAL!

// We don't want to retry for both cases.
// More info about exit status can be found in:
// https://www.gnu.org/software/bash/manual/html_node/Exit-Status.html
if s.ExitCode == 1 || s.ExitCode == 2 || s.ExitCode == 126 ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this a utility function IsRetryableExitCode? I'd like to be able to use the same function in the v1alpha1 and v1alpha2 controllers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, Move it to pkg/util/train/train_util.go

Copy link
Member

@gaocegege gaocegege left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, and I agree with jlewi to add a function in utility package.

@gaocegege
Copy link
Member

/ok-to-test

@wenzhel101
Copy link
Contributor Author

Addressed the comment, PTAL!

@gaocegege
Copy link
Member

@jlewi LGTY?

@jlewi
Copy link
Contributor

jlewi commented Apr 20, 2018

Thank you so much this is a great change.
Apologies it took so long. We are trying to ramp up as a community and need to grow the pool of reviewers.

/lgtm
/approver

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit c4ad789 into kubeflow:master Apr 20, 2018
yph152 pushed a commit to yph152/tf-operator that referenced this pull request Jun 18, 2018
* Should return ReplicaStateFailed if container exit code is not 0

* Update the criteria for retryable errors.

* Reformat

* Reformat

* Reformat

* Fix lint error.

* Handle the exit code more explicitly.

* Reformat.

* Create a util func for IsRetryableExitCode.
jetmuffin pushed a commit to jetmuffin/tf-operator that referenced this pull request Jul 9, 2018
* Should return ReplicaStateFailed if container exit code is not 0

* Update the criteria for retryable errors.

* Reformat

* Reformat

* Reformat

* Fix lint error.

* Handle the exit code more explicitly.

* Reformat.

* Create a util func for IsRetryableExitCode.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants