JobSet fails to suspend if nodeSelectors are added on admission #2691

Closed
mimowo opened this issue Jul 25, 2024 · 12 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@mimowo
Contributor

mimowo commented Jul 25, 2024

What happened:

JobSet fails to suspend if there are nodeSelectors added on admission. It fails because the JobSet webhook rejects the update.

What you expected to happen:

JobSet can be suspended and the PodTemplate is restored.

How to reproduce it (as minimally and precisely as possible):

  1. Create a kind cluster and install Kueue 0.8.0 and JobSet 0.5.2
  2. Configure Kueue using the following (note the nodeLabels for the flavor):
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "worker-flavor"
spec:
  nodeLabels:
    "kubernetes.io/hostname": "kind-worker"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cluster-queue"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: "worker-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 9
      - name: "memory"
        nominalQuota: 36Gi
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: "default"
  name: "user-queue"
spec:
  clusterQueue: "cluster-queue"
  3. Create a JobSet that will get admitted to the queue:
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: sleep
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  suspend: true
  replicatedJobs:
  - name: workers
    template:
      spec:
        parallelism: 4
        completions: 4
        backoffLimit: 0
        template:
          spec:
            containers:
            - name: sleep
              image: busybox
              command:
                - sleep
              args:
                - 600s
              resources:
                requests:
                  memory: 100Mi
                  cpu: 100m
  - name: driver
    template:
      spec:
        parallelism: 1
        completions: 1
        backoffLimit: 0
        template:
          spec:
            containers:
            - name: sleep
              image: busybox
              command:
                - sleep
              args:
                - 600s
              resources:
                requests:
                  memory: 100Mi
                  cpu: 100m
  4. Stop the ClusterQueue (k edit clusterqueue/cluster-queue) and set stopPolicy: HoldAndDrain (see the snippet below).
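For illustration, this is roughly what the edited ClusterQueue looks like after the change; stopPolicy with the HoldAndDrain value is the Kueue field used here, and it stops admission of new workloads while evicting the already-admitted ones (the rest of the spec stays as above):
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cluster-queue"
spec:
  stopPolicy: HoldAndDrain # stop admitting new workloads and drain the admitted ones
  # ... the remaining spec (namespaceSelector, resourceGroups) is unchanged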

Issue: This should result in stopping the JobSet. However, it continues to run and remains admitted. The Kueue logs show that the mutation request is rejected by the JobSet webhook.
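To make the rejected mutation concrete, this is roughly the pod-template change involved (a sketch based on the flavor above; on admission Kueue injects the flavor's nodeLabels as a nodeSelector, and on suspend it tries to remove it again, which is the update the JobSet webhook rejects):
spec:
  replicatedJobs:
  - name: workers
    template:
      spec:
        template:
          spec:
            # injected by Kueue on admission from the ResourceFlavor's nodeLabels;
            # removing it when suspending is the rejected update
            nodeSelector:
              kubernetes.io/hostname: kind-worker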

The issue is reproduced by the e2e test in the WIP PR: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_kueue/2700/pull-kueue-test-e2e-main-1-29/1816503837968568320

Anything else we need to know?:

I will open a dedicated issue in JobSet and see if we can fix it there. We may need a workaround in Kueue to do a 2-step stop:
suspend first, then restore the PodTemplate, but I would prefer to avoid it. This is something we aim to avoid for Job in the long run too: kubernetes/kubernetes#113221

Environment:

  • Kubernetes version (use kubectl version):
  • Kueue version (use git describe --tags --dirty --always):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@mimowo mimowo added the kind/bug Categorizes issue or PR as related to a bug. label Jul 25, 2024
@mimowo
Contributor Author

mimowo commented Jul 25, 2024

/assign
/cc @tenzen-y

@tenzen-y
Member

tenzen-y commented Jul 25, 2024

I think this is quite a big deal for GPU workloads, since some users inject tolerations for the GPU nodes via ResourceFlavor.

Due to this limitation, we cannot preempt or deactivate Jobs with tolerations inserted by the ResourceFlavor, right?
If so, kubernetes/kubernetes#113221 is a higher priority for Kueue.

@mimowo
Contributor Author

mimowo commented Jul 25, 2024

Due to this limitation, we cannot preempt or deactivate Jobs with tolerations inserted by the ResourceFlavor, right?
If so, kubernetes/kubernetes#113221 is a higher priority for Kueue.

To work around this problem we have implemented a custom stop in Kueue, which takes 3 requests to stop:

  1. suspend
  2. set startTime to nil
  3. restore the PodTemplate

So, while it is not very performant, it works. It is a high priority, but I would not call it urgent. We could follow a similar pattern for JobSet (one request to suspend, and one to restore the PodTemplate), but I hope we can fix it in JobSet itself.
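For illustration, the three requests correspond roughly to the following patch bodies against the batch/v1 Job object (a sketch; Kueue issues these through its client, and the nodeSelector shown is just an example of a field injected on admission):
# request 1: suspend the Job
spec:
  suspend: true
---
# request 2: clear status.startTime (an update of the status subresource)
status:
  startTime: null
---
# request 3: restore the PodTemplate injected on admission, e.g. drop the flavor's nodeSelector
spec:
  template:
    spec:
      nodeSelector: null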

@mimowo
Contributor Author

mimowo commented Jul 25, 2024

I think it didn't surface until now because JobSet is mostly used with DWS ProvisioningRequests, where the quota is set very high, so evictions are rare.

@tenzen-y
Member

Due to this limitation, we cannot preempt or deactivate Jobs with tolerations inserted by the ResourceFlavor, right?
If so, kubernetes/kubernetes#113221 is a higher priority for Kueue.

To work around this problem we have implemented a custom stop in Kueue, which takes 3 requests to stop:

  1. suspend
  2. set startTime to nil
  3. restore the PodTemplate

So, while it is not very performant, it works. It is a high priority, but I would not call it urgent. We could follow a similar pattern for JobSet (one request to suspend, and one to restore the PodTemplate), but I hope we can fix it in JobSet itself.

Yeah, that is the reason why we introduced the JobWithCustomStop interface...

@tenzen-y
Member

tenzen-y commented Jul 25, 2024

So, while it is not very performant, it works. It is a high priority, but I would not call it urgent. We could follow a similar pattern for JobSet (one request to suspend, and one to restore the PodTemplate), but I hope we can fix it in JobSet itself.

One concern is preemption/deactivation throughput: this 3-step stop mechanism would obviously lower preemption throughput.
I guess this may matter a bit in a big cluster. But I'm not sure if this could be one of the use cases for kubernetes/kubernetes#113221.

@mimowo
Contributor Author

mimowo commented Jul 25, 2024

IIUC kubernetes/kubernetes#113221 is exactly about that - giving us an "official" API for what we do in Kueue. The fact that Kueue sets status.startTime=nil is hacky - in principle, external controllers should not mutate the status of objects managed by another controller.

I think the idea "We could implement mutable scheduling directives after suspension if we set .status.startTime to nil when we are confident that all the running pods are at least terminating." is that .status.startTime is set to nil by the k8s Job controller, not by an external controller.

@tenzen-y
Member

IIUC kubernetes/kubernetes#113221 is exactly about that - giving us an "official" API for what we do in Kueue. The fact that Kueue sets status.startTime=nil is hacky - in principle, external controllers should not mutate the status of objects managed by another controller.

I think the idea "We could implement mutable scheduling directives after suspension if we set .status.startTime to nil when we are confident that all the running pods are at least terminating." is that .status.startTime is set to nil by the k8s Job controller, not by an external controller.

I agree with you. Additionally, it would be nice to verify that .status.active and .status.terminating are 0.
Anyway, it sounds attractive to discuss this in the sig-apps or wg-batch meeting for the next Kubernetes release iteration.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 23, 2024
@tenzen-y
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 23, 2024
@mimowo
Contributor Author

mimowo commented Oct 24, 2024

I think this is already addressed by the fix in JobSet: kubernetes-sigs/jobset#644, followed up by the merged e2e test in Kueue: #2700 (the e2e test covers the scenario from this issue).
/close

@k8s-ci-robot
Contributor

@mimowo: Closing this issue.

In response to this:

I think this is already addressed by the fix in JobSet: kubernetes-sigs/jobset#644, followed up by the merged e2e test in Kueue: #2700 (the e2e test covers the scenario from this issue).
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
