Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix resuming Jobset after restoring PodTemplate (by deleting Jobs) #625

Closed
wants to merge 3 commits into from

Conversation

mimowo
Copy link
Contributor

@mimowo mimowo commented Jul 25, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #624
Fixes #535

Special notes for your reviewer:

This approach relies on updating the Jobs on resume, rather the deleting Jobs on suspend.
Alternative implementation was done in: #625

Does this PR introduce a user-facing change?

Allow restoring PodTemplate on suspend, and fix resuming JobSet after restoring 
the PodTemplate. This fixes the integration with Kueue to support eviction of workloads
and re-admitting in another ResourceFlavor (with other nodeSelectors, or without nodeSelectors).
It is achieved by deleting the Jobs while the JobSet is suspended.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 25, 2024
@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Jul 25, 2024
Copy link

netlify bot commented Jul 25, 2024

Deploy Preview for kubernetes-sigs-jobset canceled.

Name Link
🔨 Latest commit 8022696
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-jobset/deploys/66a8fb8490146c0008e03f32

@mimowo
Copy link
Contributor Author

mimowo commented Jul 25, 2024

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jul 25, 2024
// This is needed for integration with Kueue/DWS.
if ptr.Deref(oldJS.Spec.Suspend, false) {
if ptr.Deref(oldJS.Spec.Suspend, false) || ptr.Deref(js.Spec.Suspend, false) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if ptr.Deref(oldJS.Spec.Suspend, false) || ptr.Deref(js.Spec.Suspend, false) {
if ptr.Deref(oldJS.Spec.Suspend, false) || (ptr.Deref(js.Spec.Suspend, false) && js.Spec.ReplicatedJobs[*].status.startTime == nil) {

Wouldn't it be helpful to validate if the jobs don't have startTime?
Because if any jobs have startTime, this operation should fail due to batch/job validation.
But, I'm not sure if we should add validations deeply depending on the batch/job specifications.

@danielvegamyhre @mimowo WDYT?

Copy link
Contributor Author

@mimowo mimowo Jul 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the flow is that we first suspend the parent JobSet, then it suspends the Jobs. So, if we added this condition, the startTime on running child jobs would prevent the update.

However, this is complex, and I think I should have added an e2e test in JobSet to verify the suspending works.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I’m thinking integration and e2e test solidifying the requirements that JobSet needs for working with Kueue will be a great idea.

We obviously don’t want to bring in Kueue as a dependency so I think just verifying that the updates/patches work in integration/e2e will be welcome

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the flow is that we first suspend the parent JobSet, then it suspends the Jobs. So, if we added this condition, the startTime on running child jobs would prevent the update.

Uhm, that makes sense.
But, I guess that users fall into the rabbit hole when they accidentally modify the scheduling directives against the ReplicatedJob with startTime because the jobset-controller just output errors in the controller logs, this is not directly user feedback.

Maybe we can improve these implied specifications for the JobSet users once we introduce kubernetes/kubernetes#113221.

So, I'm ok with the current implementation for now.

Copy link
Contributor Author

@mimowo mimowo Jul 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the e2e test would not work (see in the PR I already added it), because there is a deeper issue - JobSet does not delete the Jobs once suspended: #535. The fact that we keep the old jobs means we never create new jobs on resuming again if the PodTemplate is updated in JobSet.

I'm trying to implement it by deleting the Jobs on suspend, and it looks promising. There are some integration tests for startupPolicy to adjust I yet need to look into.

Copy link
Contributor Author

@mimowo mimowo Jul 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if you have some more context knowledge if traps I may encounter with this approach (I will probably continue next week), but essentially this is the only I see, and it would mimic the Job behavior, where suspended Jobs deletes the pods.

@danielvegamyhre
Copy link
Contributor

Thanks for working on the fix for this @mimowo. Please take a look at my comment here #624 (comment) as well, I'd like to improve our coordination to prevent these kinds of issues from occurring in the future.

@mimowo
Copy link
Contributor Author

mimowo commented Jul 25, 2024

/hold
for e2e or integration test inside the JobSet repo

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 25, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 26, 2024
@mimowo mimowo force-pushed the fix-suspending-jobset branch 6 times, most recently from 85077d4 to 0974200 Compare July 26, 2024 10:48
@danielvegamyhre
Copy link
Contributor

Let me know when this is ready for review and I'll take a look. Thanks!

@mimowo mimowo force-pushed the fix-suspending-jobset branch 2 times, most recently from b9678b7 to b40a6d2 Compare July 26, 2024 17:53
@mimowo
Copy link
Contributor Author

mimowo commented Jul 26, 2024

Let me know when this is ready for review and I'll take a look. Thanks!

Thank you @danielvegamyhre ! I think it is ready now, let me also know if some parts require additional explanation.

@mimowo
Copy link
Contributor Author

mimowo commented Jul 26, 2024

/hold cancel
Two new e2e tests are added on suspending and resuming, one focuses on fast suspend -> resume -> suspend -> resume path to check for any race conditions. The other tracks deletion and creation of pods also.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 26, 2024
@danielvegamyhre danielvegamyhre self-assigned this Jul 26, 2024
@mimowo
Copy link
Contributor Author

mimowo commented Jul 29, 2024

@danielvegamyhre I'm wondering about introducing DeleteJobsOnSuspend feature gate (true by default in Beta) that could be used in emergency (a bug found) to revert the behavior for a user without recompiling, but I didn't see any other feature gates in JobSet yet. OTOH, the amount of testing to cover both paths would be huge, plus it is rather a bug than a new feature, so I'm hesitant. WDYT?

rStatus := js.Status.ReplicatedJobsStatus
// Don't allow to mutate PodTemplate on unsuspending if there
// are still active or suspended jobs from the previous run
if len(rStatus) > index && (rStatus[index].Active > 0 || rStatus[index].Suspended > 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we assign len(rStatus) > index to a well-named variable to make it more clear/readable why it matters if the number of replicatedJob statuses is greater than the current replicatedJob index?

Furthermore, could we abstract this block out into a helper function to make it clear at a glance what we are trying to check here, and make it more concise/readable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, instead of using len(rStatus) > index I basically want to check if the ReplicatedJobsStatus is initialized. If not, then I assume the JobSet didn't run yet, so there are no Jobs. I have also extracted the code to a function and documented. PTAL.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mimowo
Once this PR has been reviewed and has the lgtm label, please ask for approval from danielvegamyhre. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

pkg/controllers/jobset_controller.go Show resolved Hide resolved
pkg/webhooks/jobset_webhook.go Show resolved Hide resolved
// controller to delete the Jobs from the previous run if they could
// conflict with the Job creation on resume.
rStatus := js.Status.ReplicatedJobsStatus
if rStatus[index].Active > 0 || rStatus[index].Suspended > 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we should include terminating?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, currently we only count as "active" Jobs which have at least one "active" pods. This means that it is possible that some Jobs which have only terminating pods may still exist.

One fixing idea would be to count Jobs as active if they have at least one active or terminating pod. Another is to introduce a new counter for "terminating" jobs - Jobs which have all pods terminating. WDYT?

I guess this is a corner case for now, so maybe could be done in a follow up?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is not too bad in this PR, because a terminating Job (all pods are terminating) is going to be deleted eventually. Yes, if we resume the JobSet while it exists, then it can block creating replacement Job for a while, but IIUC the JobSet controller will create replacement for such a Job as soon as it is fully gone, and the replacement Job will have a good PodTemplate.

In order to prevent resuming a JobSet while there is a terminating Job we could:

  1. include such a Job as active, if job.Status.Active > 0 || ptr.Deref(job.Status.Terminating, 0) > 0 here
  2. introduce a new counter for Terminating Jobs. Terminating would be a fallback (so job.Status.Active == 0) counter if ptr.Deref(job.Status.Terminating, 0) > 0

I'm leaning towards option (2.). I could implement it either in this PR or a follow up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a follow up is fine. Terminating on Jobs is currently a beta feature so logic may be somewhat complex.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Terminating on Jobs is currently a beta feature so logic may be somewhat complex.

I think we don't need any code in JobSet depending on the feature-gate in k8s. If the feature gate is disabled in k8s the field is not set. So, the code in JobSet would see nil, we could prevent panics with ptr.Deref(job.Status.Terminating, 0), but this should be enough, IIUC.

test/e2e/e2e_test.go Show resolved Hide resolved
test/e2e/e2e_test.go Show resolved Hide resolved
@mimowo mimowo mentioned this pull request Jul 30, 2024
@danielvegamyhre
Copy link
Contributor

@danielvegamyhre I'm wondering about introducing DeleteJobsOnSuspend feature gate (true by default in Beta) that could be used in emergency (a bug found) to revert the behavior for a user without recompiling, but I didn't see any other feature gates in JobSet yet. OTOH, the amount of testing to cover both paths would be huge, plus it is rather a bug than a new feature, so I'm hesitant. WDYT?

I don't think we need a feature gate for this.

@mimowo
Copy link
Contributor Author

mimowo commented Aug 1, 2024

@danielvegamyhre @kannon92 thanks for reviewing. let me know what are the remaining things to address in this PR.
It seems this is the outstanding comment #625 (comment).

@mimowo
Copy link
Contributor Author

mimowo commented Aug 5, 2024

/hold
I have synced with @danielvegamyhre and we discussed that it would be good to see if we can achieve the functionality needed for Kueue without deleting the Jobs on suspend, reusing bits introduced in https://github.com/kubernetes-sigs/jobset/pull/590/files#diff-68677f8530f00fbbb399ac36268f928ee584948dd09091644a8fad81a8658153R457-R478.

I agree this is worth exploring as this is closer to how JobSet operates currently, so the impact is more predictable. I will try this approach and update the PR or summarize if there are any blockers.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 5, 2024
@mimowo mimowo changed the title Fix restoring PodTemplate on JobSet suspend Fix resuming Jobset after restoring PodTemplate (by deleting Jobs) Aug 6, 2024
@mimowo
Copy link
Contributor Author

mimowo commented Aug 6, 2024

@danielvegamyhre @kannon92
I have prepared another PR which fixes the issue by updating the Jobs on resume, rather than recreating them. I'm happy to move forward with that approach. PTAL: #640. I have also update the issue description to make it more clear.
Finally, there is a small PR with e2e test showing the issue: #635.

@mimowo
Copy link
Contributor Author

mimowo commented Aug 6, 2024

/close
in favor of #640

@k8s-ci-robot
Copy link
Contributor

@mimowo: Closed this PR.

In response to this:

/close
in favor of #640

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
5 participants