Allow to mutate PodTemplate when suspending a JobSet and support resuming such JobSet #624

Closed
mimowo opened this issue Jul 25, 2024 · 4 comments · Fixed by #644

@mimowo
Contributor

mimowo commented Jul 25, 2024

There are currently two related issues that prevent the JobSet-Kueue integration from working:

  1. JobSet rejects mutation of PodTemplate on suspend

When Kueue evicts a workload (represented by a JobSet), it stops the JobSet and tries to restore the PodTemplate so that the same JobSet can be re-admitted to another ResourceFlavor (with potentially different nodeSelectors).
For example, the following e2e test for Job shows how Kueue can preempt a workload and re-admit it with another nodeSelector: link.

However, the integration with Kueue does not currently work, because the Kueue request to suspend the JobSet fails if the same request also updates the PodTemplate.
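
To make the requested behavior concrete, here is a minimal sketch of the relaxed validation rule, not JobSet's actual webhook code. The `JobSet` and `PodTemplate` types, the `validateUpdate` function, and the `flavor` label are all hypothetical, simplified stand-ins. The sketch models only the suspend transition: a PodTemplate change is accepted when the updated JobSet is suspended, so one request can both suspend the JobSet and restore its template.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// PodTemplate is a simplified, hypothetical stand-in for the pod template
// embedded in a JobSet; only the field relevant to this issue is modeled.
type PodTemplate struct {
	NodeSelector map[string]string
}

// JobSet is a simplified, hypothetical stand-in for the real JobSet object.
type JobSet struct {
	Suspend     bool
	PodTemplate PodTemplate
}

var errImmutable = errors.New("pod template cannot change unless the JobSet is suspended")

// validateUpdate sketches the relaxed rule: an unchanged template is always
// fine, and a changed template is accepted only if the updated JobSet is
// suspended. That covers a single update that suspends and mutates at once.
func validateUpdate(oldJS, newJS JobSet) error {
	if reflect.DeepEqual(oldJS.PodTemplate, newJS.PodTemplate) {
		return nil
	}
	if newJS.Suspend {
		return nil
	}
	return errImmutable
}

func main() {
	// A running JobSet that was admitted with a flavor-specific nodeSelector.
	running := JobSet{PodTemplate: PodTemplate{NodeSelector: map[string]string{"flavor": "a"}}}

	// Eviction: suspend and restore the original (empty) nodeSelector in one update.
	evicted := JobSet{Suspend: true}
	fmt.Println("suspend + restore:", validateUpdate(running, evicted))

	// Changing the nodeSelector while the JobSet keeps running is still rejected.
	changed := JobSet{PodTemplate: PodTemplate{NodeSelector: map[string]string{"flavor": "b"}}}
	fmt.Println("mutate while running:", validateUpdate(running, changed))
}
```

The real rules may need to be broader, for example to also cover updates to a JobSet that is already suspended or that is being resumed.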

@mimowo
Contributor Author

mimowo commented Jul 25, 2024

/assign
/cc @danielvegamyhre @tenzen-y

@danielvegamyhre
Contributor

danielvegamyhre commented Jul 25, 2024

Let's fix this, but rather than solely doing a one-off fix here, we need to iron out the specific requirements for JobSet + Kueue integration, as well as align our roadmaps so changes in Kueue don't break JobSet integration.

We had a similar issue a couple of months ago: Kueue tried to mutate certain podTemplate fields on suspended JobSets, but those fields are immutable in JobSet, which led to a customer/user reporting the issue (#579).

One thing we could potentially do is make the entire podTemplate mutable in JobSet, to prevent any further issues like this.

cc @alculquicondor @mimowo @ahg-g @kannon92

@mimowo
Contributor Author

mimowo commented Jul 25, 2024

I think this is a good point. At the technical level, we should keep extending the JobSet e2e test suite in Kueue, which was started recently.

EDIT: the test suite for reference: https://github.com/kubernetes-sigs/kueue/blob/main/test/e2e/singlecluster/jobset_test.go. I'm going to extend it as part of kubernetes-sigs/kueue#2691 (started the PR in kubernetes-sigs/kueue#2700).

@mimowo
Contributor Author

mimowo commented Jul 25, 2024

The proposal for the e2e test scenario which covers this and #623: #623 (comment)

@mimowo changed the title from "Allow JobSet to mutate PodTemplate when suspending a Job" to "Allow JobSet to mutate PodTemplate when suspending a Job and support resuming such JobSet" on Aug 6, 2024
@mimowo changed the title from "Allow JobSet to mutate PodTemplate when suspending a Job and support resuming such JobSet" to "Allow to mutate PodTemplate when suspending a JobSet and support resuming such JobSet" on Aug 6, 2024