JobSet stays in suspend state if kueue is managing it #3349

Closed
kannon92 opened this issue Oct 28, 2024 · 18 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@kannon92
Contributor

kannon92 commented Oct 28, 2024

What happened:
If I submit a simple JobSet with Kueue, the workload stays in a suspended state.
What you expected to happen:
Kueue unsuspends the JobSet and the workload runs successfully.

How to reproduce it (as minimally and precisely as possible):

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: paralleljobs
  namespace: kueue-demo
  labels:
    kueue.x-k8s.io/queue-name: queue
spec:
  replicatedJobs:
  - name: workers
    replicas: 2
    template:
      spec:
        parallelism: 4
        completions: 4
        backoffLimit: 0
        template:
          spec:
            containers:
            - name: sleep
              image: quay.io/quay/busybox
              command: 
                - sleep
              args:
                - 100s
  - name: driver
    template:
      spec:
        parallelism: 1
        completions: 1
        backoffLimit: 0
        template:
          spec:
            containers:
            - name: sleep
              image: quay.io/quay/busybox
              command: 
                - sleep
              args:
                - 100s
  1. Submit a JobSet that uses Kueue (i.e., add the kueue.x-k8s.io/queue-name label, as in the manifest above).
  2. The Workload stays in a suspended state.

Anything else we need to know?:
JobSet is 0.7.0.
Environment:

  • Kubernetes version (use kubectl version):
  • Kueue version (use git describe --tags --dirty --always): 0.8.1
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@kannon92 kannon92 added the kind/bug Categorizes issue or PR as related to a bug. label Oct 28, 2024
@kannon92
Contributor Author

kannon92 commented Oct 28, 2024

The Workload says it's admitted:

kueue$ oc get workloads -n kueue-demo
NAME                         QUEUE   RESERVED IN     ADMITTED   FINISHED   AGE
job-sample-job-55pkz-a624b   queue   cluster-queue   True       True       20m
jobset-paralleljobs-b13e4    queue   cluster-queue   True                  12m

But the jobset is suspended:

kehannon@kehannon-thinkpadp1gen4i:~/Work/openshift/kubecon-na-2024/kueue$ oc get jobset -n kueue-demo
NAME           TERMINALSTATE   RESTARTS   COMPLETED   SUSPENDED   AGE
paralleljobs                                          true        13m

If I submit this JobSet without the Kueue label, the workload runs without issue.

@kannon92
Contributor Author

"error":"JobSet.jobset.x-k8s.io \"paralleljobs\" is invalid: spec.network: Invalid value: \"object\": Value is immutable","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}

The Kueue manager logs show this error.

@mbobrovskyi
Contributor

mbobrovskyi commented Oct 29, 2024

We already fixed this in #3132, on 0.9.

@mbobrovskyi
Contributor

@mimowo @tenzen-y maybe we should cherry-pick to 0.8?

@mimowo
Contributor

mimowo commented Oct 29, 2024

I think it might be a good idea indeed. The fix does not require API changes. We deferred due to possibly many conflicts but I think it is worth trying.

Could you please try to prepare a minimal cherry-pick so that we can assess what it entails?

@mbobrovskyi
Contributor

/assign

@mbobrovskyi
Contributor

mbobrovskyi commented Oct 29, 2024

OK. In this case I think we need to cherry-pick #3102 and #3132.

@tenzen-y
Member

I think it might be a good idea indeed. The fix does not require API changes. We deferred due to possibly many conflicts but I think it is worth trying.

Could you please try to prepare a minimal cherry-pick so that we can assess what it entails?

SGTM

@mimowo
Contributor

mimowo commented Oct 29, 2024

I think I'm OK with that: there are no API / schema changes in the diffs, but the changes are big, so let me confirm with @tenzen-y. Actually, we discussed the cherry-picking before, and the main argument was that we still had time before the release of new CRDs, which this issue proves wrong.

OTOH, we are just a week from releasing 0.9.0, and based on the comment #3349 (comment) @kannon92 could probably mitigate by using 0.9.0-rc.1

@tenzen-y
Member

I think I'm OK with that: there are no API / schema changes in the diffs, but the changes are big, so let me confirm with @tenzen-y. Actually, we discussed the cherry-picking before, and the main argument was that we still had time before the release of new CRDs, which this issue proves wrong.

OTOH, we are just a week from releasing 0.9.0, and based on the comment #3349 (comment) @kannon92 could probably mitigate by using 0.9.0-rc.1

Yes, that's right. However, the discussion result was based on the already resolved RayJob issue.
So, based on this JobSet issue, we might want to cherry-pick. Or, we may be able to just upgrade the JobSet version in the release-0.8 branch.

@mimowo
Contributor

mimowo commented Oct 29, 2024

Or, we may be able to just upgrade the JobSet version in the release-0.8 branch.

This could be an option indeed. If it means fewer changes, I'm OK to start with that.

@tenzen-y
Member

@mbobrovskyi, could you check if we can upgrade the JobSet module version with fewer changes?

@mbobrovskyi
Contributor

mbobrovskyi commented Oct 29, 2024

@mbobrovskyi, could you check if we can upgrade the JobSet module version with fewer changes?

Ah, it requires upgrading the Kubernetes version to v0.31.1, and there are a lot of changes :)

@mimowo
Contributor

mimowo commented Oct 29, 2024

In that case, let's go with the fix for field dropping.
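
For readers unfamiliar with the term, "field dropping" can be illustrated with a small sketch. This is a simplified model and not Kueue or JobSet code. It assumes that a client built against an older schema loses the fields it does not know about when it reads an object and writes the whole object back, and that the server rejects any change to spec.network, as the logged error suggests. Every function and field name below (round_trip_through_old_client, fieldAddedLater, and so on) is hypothetical.

```python
import copy

# Hypothetical: the only spec.network field the older client's schema knows.
OLD_CLIENT_NETWORK_FIELDS = {"fieldKnownToOldClient"}


def round_trip_through_old_client(obj):
    """Simulate decoding into an older typed schema: unknown network fields are lost."""
    out = copy.deepcopy(obj)
    network = out["spec"].get("network", {})
    out["spec"]["network"] = {
        key: value
        for key, value in network.items()
        if key in OLD_CLIENT_NETWORK_FIELDS
    }
    return out


def validate_update(old, new):
    """Simulate a server-side rule that spec.network must not change."""
    if old["spec"].get("network") != new["spec"].get("network"):
        raise ValueError("spec.network: Invalid value: Value is immutable")


def unsuspend_with_full_update(stored):
    """Write back the whole object as seen by the older client (assumed failure mode)."""
    updated = round_trip_through_old_client(stored)
    updated["spec"]["suspend"] = False
    validate_update(stored, updated)
    return updated


def unsuspend_touching_only_suspend(stored):
    """Change only spec.suspend and leave every other field as the server stored it."""
    updated = copy.deepcopy(stored)
    updated["spec"]["suspend"] = False
    validate_update(stored, updated)
    return updated


# Hypothetical stored object: the server holds a field the older client does not know.
stored = {
    "spec": {
        "suspend": True,
        "network": {"fieldKnownToOldClient": True, "fieldAddedLater": True},
    }
}

if __name__ == "__main__":
    try:
        unsuspend_with_full_update(stored)
    except ValueError as err:
        print("full update rejected:", err)
    result = unsuspend_touching_only_suspend(stored)
    print("targeted change accepted, suspend =", result["spec"]["suspend"])
```

Whether #3132 takes this exact approach is not stated in this thread. The sketch only shows why a full write-back from an older schema could trip an immutability check while the object stays suspended.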

@kannon92
Contributor Author

Thank you all! My hope was to test Kueue with released containers for KubeCon, so using the RC isn’t ideal.

Either way, I think having this change in 0.8 will be useful, since 0.9 requires Kubernetes 1.31.

@mbobrovskyi
Contributor

mbobrovskyi commented Oct 30, 2024

/close

Fixed by #3358.

@k8s-ci-robot
Contributor

@mbobrovskyi: Closing this issue.

In response to this:

/close

Fixed by #3358.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@mimowo
Contributor

mimowo commented Oct 30, 2024

FYI we are going to release 0.8.2 which will include the fix: #3371
