Add handling for podFailurePolicy #269
Changes from all commits: 94c3e70, 1dfa069, 99b7cb5, 81ae7d5
@@ -42,6 +42,11 @@ import (
 const (
 	RestartsKey string = "jobset.sigs.k8s.io/restart-attempt"
 	parallelDeletions int = 50
+
+	// The JobConditionReasonPodFailurePolicy constant is defined here
+	// since the constant is not exported as part of Job API and thus
+	// cannot be referenced directly, so we have to redefine it here for now.
+	JobConditionReasonPodFailurePolicy = "PodFailurePolicy"
 )
 
 var (
@@ -459,12 +464,30 @@ func (r *JobSetReconciler) executeFailurePolicy(ctx context.Context, js *jobset.
 }
 
 func (r *JobSetReconciler) executeRestartPolicy(ctx context.Context, js *jobset.JobSet, ownedJobs *childJobs) error {
-	if js.Spec.FailurePolicy.MaxRestarts == 0 {
+	if js.Spec.FailurePolicy.MaxRestarts == 0 || r.triggeredPodFailurePolicy(ctx, js, ownedJobs) {
 		return r.failJobSet(ctx, js)
 	}
 	return r.restartPolicyRecreateAll(ctx, js, ownedJobs)
 }
 
+// If a child job has failed due to triggering a PodFailurePolicy,
+// we should fail the JobSet immediately rather than restarting.
+// This allows the user to configure a PodFailurePolicy such that
+// job failures under certain conditions do not cause the JobSet to
+// restart, while others do.
+func (r *JobSetReconciler) triggeredPodFailurePolicy(ctx context.Context, js *jobset.JobSet, ownedJobs *childJobs) bool {
+	log := ctrl.LoggerFrom(ctx)
+	for _, failedJob := range ownedJobs.failed {
+		for _, c := range failedJob.Status.Conditions {
Review thread on this line:

You could use IsStatusConditionTrue

Looking at this more, I don't think you would be able to use this. That allows you to see if you have a FailedJob condition, but it doesn't match on the reason. I guess you could match on the FailedJob condition and then check the reason (see the sketch after this diff hunk).
+			if c.Type == batchv1.JobFailed && c.Reason == JobConditionReasonPodFailurePolicy && c.Status == corev1.ConditionTrue {
+				log.V(2).Info("child job failed due to triggering a PodFailurePolicy", "jobset", js.Name, "job", failedJob.Name)
+				return true
+			}
+		}
+	}
+	return false
+}
+
 func (r *JobSetReconciler) restartPolicyRecreateAll(ctx context.Context, js *jobset.JobSet, ownedJobs *childJobs) error {
 	log := ctrl.LoggerFrom(ctx)
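For reference, here is a minimal sketch of the two-step check suggested in the review thread above: first find a JobFailed condition that is true, then check its reason separately. The helper name jobFailedDueToPodFailurePolicy, the package name, and the locally redefined constant are illustrative only and are not part of this PR.

```go
package controllers

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// Mirrors the constant added in this PR; upstream does not export it.
const jobConditionReasonPodFailurePolicy = "PodFailurePolicy"

// jobFailedDueToPodFailurePolicy (hypothetical helper) reports whether the Job
// carries a JobFailed condition that is true and whose reason indicates the
// Job's podFailurePolicy was triggered.
func jobFailedDueToPodFailurePolicy(job *batchv1.Job) bool {
	for _, c := range job.Status.Conditions {
		// Step 1: match on the JobFailed condition type and its status.
		if c.Type != batchv1.JobFailed || c.Status != corev1.ConditionTrue {
			continue
		}
		// Step 2: generic condition helpers stop at type/status; the reason
		// still has to be checked explicitly to distinguish a podFailurePolicy
		// failure from, say, hitting the backoff limit.
		if c.Reason == jobConditionReasonPodFailurePolicy {
			return true
		}
	}
	return false
}
```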
So just checking: we only want to obey PodFailurePolicy if RestartPolicy is specified. Default behavior would be to fail on any failure?

If no jobset failure policy is specified, the jobset will fail immediately without restarts anyway. So the only place we need to do this podFailurePolicy check is if restarting is an option.

Not sure if this behavior would be immediately intuitive to users: if one Job fails due to a PodFailurePolicy, the entire JobSet fails. Either make sure this is properly documented, or make it part of the JobSet's FailurePolicy whether to respect the Job's.
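To make the discussed behavior concrete, here is a sketch (not from this PR) of a child Job template such as a user might place in a JobSet: its podFailurePolicy fails the Job when the "main" container exits with code 42, and with this PR that failure fails the whole JobSet immediately instead of consuming a restart, while other failures still go through the normal MaxRestarts path. The container name and exit code are made up for illustration.

```go
package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
)

// childJobTemplateWithPodFailurePolicy builds a Job template (as would be used
// for a JobSet child job) whose podFailurePolicy marks the Job failed when the
// "main" container exits with code 42.
func childJobTemplateWithPodFailurePolicy() batchv1.JobTemplateSpec {
	mainContainer := "main" // illustrative container name
	return batchv1.JobTemplateSpec{
		Spec: batchv1.JobSpec{
			PodFailurePolicy: &batchv1.PodFailurePolicy{
				Rules: []batchv1.PodFailurePolicyRule{{
					// FailJob gives the child Job a JobFailed condition with
					// reason "PodFailurePolicy", which the new
					// triggeredPodFailurePolicy check looks for.
					Action: batchv1.PodFailurePolicyActionFailJob,
					OnExitCodes: &batchv1.PodFailurePolicyOnExitCodesRequirement{
						ContainerName: &mainContainer,
						Operator:      batchv1.PodFailurePolicyOnExitCodesOpIn,
						Values:        []int32{42},
					},
				}},
			},
		},
	}
}

func main() {
	tmpl := childJobTemplateWithPodFailurePolicy()
	fmt.Println(tmpl.Spec.PodFailurePolicy.Rules[0].Action) // prints "FailJob"
}
```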