Implement configurable failure policy. #537
Hi @jedwins1998. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
There is one TODO left for an additional test I would like to add, and I still need to implement webhook validation for OnJobFailureReasons. Otherwise, I consider the code ready for review.
/ok-to-test
Can we move all the helpers in jobset_controller.go specific to failure policies to a failure_policy.go file, and add unit tests for any important ones in failure_policy_test.go? Same as success_policy.go and success_policy_test.go.
I'll take a deeper look next week. Thanks for working on this!
Can do.
b94061d to afc73b0
Did a quick pass while you are working on the refactor, looks good so far!
afc73b0 to 591d2ee
This is now done.
72b6e44 to c1db56b
I didn't look at the integration tests yet
1bbfe17 to 0283adf
…to be the first failure policy rule test.
…case names more clear.
8381433 to 5b8f55d
I added `[failure policy]` to the beginning of the name of each test related to failure policies so that it is easier to select only those tests to run. I also updated the tests to check that `RestartsCountTowardsMax` increments only when expected.
LGTM after a few final comments are addressed. In a follow-up PR, we should add an example JobSet spec to the examples/ folder showcasing how the feature works, which we can use to do manual testing as well.
RestartJobSetAndIgnoreMaxRestarts FailurePolicyAction = "RestartJobSetAndIgnoreMaxRestarts"
)

// FailurePolicyRule defines a FailurePolicyAction to be executed if a child job
(This is a general comment unrelated to this file) Now that feature gate support has been merged in #557 I think we should add a feature gate (default on) for this feature. If the feature is not enabled, fall back to the current behavior.
We can do this in a follow up PR.
I want to avoid a scenario where we publish the v0.6.0 release, an important customer is using this feature, and they encounter a bug that slipped through the cracks. We couldn't simply downgrade to v0.5.0 to mitigate, because their JobSet spec (often defined in Python/Go code checked into their codebase) would be using fields which only exist in v0.6.0. That would require an emergency rollout on their end to revert their Python/Go code to a spec usable by JobSet v0.5.0, and only then downgrade the JobSet deployment to v0.5.0.
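To illustrate the fallback being proposed here, the following is a minimal sketch of a default-on gate that routes back to the existing behavior when disabled. Everything in it is hypothetical: the gate name `ConfigurableFailurePolicy`, the `gates` map, and the `currentBehavior` placeholder are assumptions made for illustration, not the feature gate API merged in #557.

```go
package main

import "fmt"

// gates is a hypothetical in-memory gate table. The gate name is an
// assumption for this sketch; it defaults to on, as suggested in the comment.
var gates = map[string]bool{"ConfigurableFailurePolicy": true}

// currentBehavior is a placeholder label for whatever the controller
// did before failure policy rules existed.
const currentBehavior = "currentBehavior"

// resolveAction returns the action from the matched rule when the gate is
// enabled, and falls back to the pre-existing behavior otherwise.
func resolveAction(ruleAction string) string {
	if !gates["ConfigurableFailurePolicy"] {
		return currentBehavior
	}
	return ruleAction
}

func main() {
	fmt.Println(resolveAction("RestartJobSetAndIgnoreMaxRestarts"))

	// Disabling the gate makes rule actions inert without requiring
	// users to remove the new fields from their specs.
	gates["ConfigurableFailurePolicy"] = false
	fmt.Println(resolveAction("RestartJobSetAndIgnoreMaxRestarts"))
}
```

With a gate like this, mitigation could be a flag flip on the controller rather than a spec rewrite plus a downgrade.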
/lgtm Thanks for working on this! Will leave approval for @ahg-g
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ahg-g, jedwins1998
/hold cancel
This pull request implements a configurable failure policy.
There is one difference to note from the KEP. I added a new field to the JobSetStatus that tracks the number of restarts which count towards the restart limit. I then use this variable to allow some restarts to not count towards the maximum number of restarts.
This resolves #262.
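The restart accounting described above can be sketched roughly as follows. This is a simplified illustration, not the controller code from this PR: `Status` stands in for JobSetStatus, the `Restarts` field and the `applyFailure` helper are hypothetical, and the `FailJobSet` and `RestartJobSet` action values are assumed for the sketch (only `RestartJobSetAndIgnoreMaxRestarts` appears in the diff shown in this conversation).

```go
package main

import "fmt"

// FailurePolicyAction mirrors the type name visible in the PR diff.
type FailurePolicyAction string

const (
	// FailJobSet and RestartJobSet are assumed values for this sketch.
	FailJobSet    FailurePolicyAction = "FailJobSet"
	RestartJobSet FailurePolicyAction = "RestartJobSet"
	// RestartJobSetAndIgnoreMaxRestarts is taken from the PR diff.
	RestartJobSetAndIgnoreMaxRestarts FailurePolicyAction = "RestartJobSetAndIgnoreMaxRestarts"
)

// Status is a simplified stand-in for JobSetStatus.
type Status struct {
	Restarts                int32 // every restart (assumed field)
	RestartsCountTowardsMax int32 // only restarts that count against the limit
	Failed                  bool
}

// applyFailure is a hypothetical helper that updates the status for one
// child job failure, given the action chosen by the matching rule.
func applyFailure(s Status, action FailurePolicyAction, maxRestarts int32) Status {
	switch action {
	case RestartJobSetAndIgnoreMaxRestarts:
		// Restart without consuming the restart budget.
		s.Restarts++
	case RestartJobSet:
		if s.RestartsCountTowardsMax >= maxRestarts {
			s.Failed = true
			return s
		}
		s.Restarts++
		s.RestartsCountTowardsMax++
	default:
		// FailJobSet or any unrecognized action fails the JobSet.
		s.Failed = true
	}
	return s
}

func main() {
	s := Status{}
	s = applyFailure(s, RestartJobSetAndIgnoreMaxRestarts, 1)
	s = applyFailure(s, RestartJobSet, 1)
	s = applyFailure(s, RestartJobSet, 1)
	fmt.Printf("%+v\n", s) // {Restarts:2 RestartsCountTowardsMax:1 Failed:true}
}
```

Keeping a separate counter means the limit check only ever looks at restarts that were meant to count, while the total number of restarts remains available.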