JobSet fails to suspend if nodeSelectors are added on admission #2691
Comments
/assign
I think this is a fairly big deal for GPU workloads, since some users inject the tolerations for GPU nodes via a ResourceFlavor. Because of this limitation, we cannot preempt or deactivate Jobs with tolerations inserted by the ResourceFlavor, right?
To work around this problem we have implemented a custom stop in Kueue, which takes 3 requests to stop:
While it is not very performant, it works, so this is high priority but I would not call it urgent. We could follow a similar pattern for JobSet (one request to suspend, and one to restore the PodTemplate), but I hope we can fix it in JobSet itself.
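A minimal sketch of the two-request JobSet stop pattern described above, assuming a JobSet named `my-jobset` with its first replicated job carrying an injected `nodeSelector` (the name and field path are illustrative, not taken from the issue):

```shell
# Step 1: suspend the JobSet (hypothetical object name "my-jobset").
kubectl patch jobset my-jobset --type merge -p '{"spec":{"suspend":true}}'

# Step 2: restore the PodTemplate by removing the nodeSelector injected
# on admission. The JSON-patch path assumes the first replicated job;
# adjust the index for your JobSet layout.
kubectl patch jobset my-jobset --type json -p '[
  {"op": "remove",
   "path": "/spec/replicatedJobs/0/template/spec/template/spec/nodeSelector"}
]'
```

This is exactly the ordering constraint that matters here: the PodTemplate restore must come in a second request after the suspend, because the JobSet webhook rejects scheduling-directive changes while the JobSet is running.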
I think it didn't surface until now because JobSet is mostly used with DWS ProvisioningRequests, where the quota is set very high, so evictions are rare.
Yeah, that is the reason why we introduced the
One concern is preemption and deactivation throughput: this 3-step stop mechanism would obviously lower preemption throughput.
IIUC kubernetes/kubernetes#113221 is about this - giving us an "official" API to do what we do in Kueue. I think the idea is: "We could implement mutable scheduling directives after suspension if we set .status.startTime to nil when we are confident that all the running pods are at least terminating."
I agree with you. Additionally, it would be nice to have verifications if the
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
I think this is already addressed by the fix in JobSet: kubernetes-sigs/jobset#644, followed up by the merge of the e2e test in Kueue: #2700 (the e2e test covers the scenario in the issue).
@mimowo: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What happened:
JobSet fails to suspend if there are nodeSelectors added on admission. It fails because the JobSet webhook rejects the update.
What you expected to happen:
JobSet can be suspended and the PodTemplate is restored.
How to reproduce it (as minimally and precisely as possible):
Run `k edit clusterqueue/cluster-queue` and set `stopPolicy: HoldAndDrain`.
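The same change can be made non-interactively with a single patch (the ClusterQueue name `cluster-queue` comes from the reproduction step; `stopPolicy` is a field of the ClusterQueue spec):

```shell
# Set the ClusterQueue's stopPolicy to HoldAndDrain, which should evict
# (suspend) all admitted workloads, including the JobSet.
kubectl patch clusterqueue cluster-queue --type merge \
  -p '{"spec":{"stopPolicy":"HoldAndDrain"}}'
```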
Issue: This should result in stopping the JobSet. However, it continues to run and remains admitted. The Kueue logs show that the mutation request is rejected by the JobSet webhook.
The issue is reproduced by the e2e test in the WIP PR: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_kueue/2700/pull-kueue-test-e2e-main-1-29/1816503837968568320
Anything else we need to know?:
I will open a dedicated issue in JobSet and see if we can fix it there. We may need a work-around in Kueue to do a 2-step stop:
suspend first, then restore the PodTemplate, but I prefer to avoid it. This is something we aim to avoid for Job in the long run too: kubernetes/kubernetes#113221
Environment:
- Kubernetes version (use `kubectl version`):
- Kueue version (use `git describe --tags --dirty --always`):
- OS (e.g. `cat /etc/os-release`):
- Kernel (e.g. `uname -a`):