Support MPIJob managedBy feature for the MultiKueue #3289

Merged

mszadkow
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

Support the MPIJob managedBy feature in MultiKueue.
Allow installing the mpi-operator in the management cluster instead of only its CRDs.

Which issue(s) this PR fixes:

Fixes #3257

Special notes for your reviewer:

Does this PR introduce a user-facing change?

MultiKueue: Add support for the MPIJob `spec.runPolicy.managedBy` field

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. labels Oct 23, 2024
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 23, 2024

netlify bot commented Oct 23, 2024

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit 0618ad7
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/671a6219594bfa0008ddd35c

@mszadkow
Contributor Author

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Oct 23, 2024
@mszadkow
Contributor Author

/retest

@mszadkow mszadkow force-pushed the featue/managed-by-mpi-operator branch from 5d873dd to 7fa0893 Compare October 23, 2024 14:14
@mszadkow
Contributor Author

/retest

@mszadkow mszadkow force-pushed the featue/managed-by-mpi-operator branch from 7fa0893 to 89cc222 Compare October 23, 2024 19:21
@mszadkow
Contributor Author

/retest

@mszadkow mszadkow marked this pull request as ready for review October 23, 2024 19:36
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 23, 2024
Contributor

@mbobrovskyi mbobrovskyi left a comment

Should we also implement defaulting webhooks?

@mszadkow
Contributor Author

Should we also implement defaulting webhooks?

Good question, I guess it's true...

@mimowo
Contributor

mimowo commented Oct 24, 2024

Yes, we want the webhook defaulting too.

@tenzen-y
Member

As I mentioned in the issue, let's implement a dedicated webhook instead of using the base one:

SetupMPIJobWebhook = jobframework.BaseWebhookFactory(NewJob(), fromObject)

@mszadkow mszadkow force-pushed the featue/managed-by-mpi-operator branch from 89cc222 to 8615a3f Compare October 24, 2024 09:44
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 24, 2024
@mszadkow
Contributor Author

/retest

@mszadkow mszadkow force-pushed the featue/managed-by-mpi-operator branch from 8615a3f to d87e5e4 Compare October 24, 2024 09:56
@mszadkow
Contributor Author

/retest

@mszadkow mszadkow force-pushed the featue/managed-by-mpi-operator branch from 541f2f1 to eb0238c Compare October 24, 2024 11:59
@mszadkow
Contributor Author

/retest

@mbobrovskyi
Contributor

/lgtm

Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 24, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: e940cc9c7145f77bc65e779f0ebfcfff5874dc32

Member

@tenzen-y tenzen-y left a comment

@mszadkow Thank you for driving this feature across kubeflow and Kueue!
I left a few nit comments.

@@ -116,16 +116,53 @@ func TestMultikueueAdapter(t *testing.T) {
				return adapter.DeleteRemoteObject(ctx, workerClient, types.NamespacedName{Name: "mpijob1", Namespace: TestNamespace})
			},
		},
Member

Could you add the same case as `"missing jobset is not considered managed": {`?

Contributor Author

Added
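
For reference, the kind of check such a test case exercises can be sketched as below. This is a minimal, self-contained illustration and not the adapter code from this PR: `mpiJob`, `isManagedByMultiKueue`, the map-based store, and `errNotFound` are hypothetical stand-ins, and the controller name string is an assumed value for `kueue.MultiKueueControllerName`. The sketch also assumes that a missing object is reported as an error and is never treated as managed.

```go
package main

import (
	"errors"
	"fmt"
)

// Assumed value standing in for kueue.MultiKueueControllerName.
const multiKueueControllerName = "kueue.x-k8s.io/multikueue"

// errNotFound is a hypothetical stand-in for an API "not found" error.
var errNotFound = errors.New("mpijob not found")

// mpiJob is a hypothetical, trimmed-down stand-in for the real MPIJob type.
type mpiJob struct {
	managedBy *string // mirrors spec.runPolicy.managedBy
}

// isManagedByMultiKueue reports whether the job stored under key is managed
// by MultiKueue. A missing job yields an error and is never "managed".
func isManagedByMultiKueue(store map[string]mpiJob, key string) (bool, string, error) {
	job, ok := store[key]
	if !ok {
		return false, "", errNotFound
	}
	if job.managedBy == nil || *job.managedBy != multiKueueControllerName {
		got := ""
		if job.managedBy != nil {
			got = *job.managedBy
		}
		reason := fmt.Sprintf("expecting spec.runPolicy.managedBy to be %q, not %q",
			multiKueueControllerName, got)
		return false, reason, nil
	}
	return true, "", nil
}

func main() {
	mk := multiKueueControllerName
	store := map[string]mpiJob{
		"ns/managed":   {managedBy: &mk},
		"ns/unmanaged": {},
	}
	for _, key := range []string{"ns/managed", "ns/unmanaged", "ns/missing"} {
		managed, reason, err := isManagedByMultiKueue(store, key)
		fmt.Printf("%s: managed=%t reason=%q err=%v\n", key, managed, reason, err)
	}
}
```

The three keys in `main` correspond to the managed, unmanaged, and missing cases a table-driven test would cover.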

Comment on lines +75 to +100
	if canDefaultManagedBy(mpiJob.Spec.RunPolicy.ManagedBy) {
		localQueueName, found := mpiJob.Labels[constants.QueueLabel]
		if !found {
			return nil
		}
		clusterQueueName, ok := w.queues.ClusterQueueFromLocalQueue(queue.QueueKey(mpiJob.ObjectMeta.Namespace, localQueueName))
		if !ok {
			log.V(5).Info("Cluster queue for local queue not found", "mpijob", klog.KObj(mpiJob), "localQueue", localQueueName)
			return nil
		}
		for _, admissionCheck := range w.cache.AdmissionChecksForClusterQueue(clusterQueueName) {
			if admissionCheck.Controller == kueue.MultiKueueControllerName {
				log.V(5).Info("Defaulting ManagedBy", "mpijob", klog.KObj(mpiJob), "oldManagedBy", mpiJob.Spec.RunPolicy.ManagedBy, "managedBy", kueue.MultiKueueControllerName)
				mpiJob.Spec.RunPolicy.ManagedBy = ptr.To(kueue.MultiKueueControllerName)
				return nil
			}
		}
	}

	return nil
}

func canDefaultManagedBy(mpiJobSpecManagedBy *string) bool {
	return features.Enabled(features.MultiKueue) &&
		(mpiJobSpecManagedBy == nil || *mpiJobSpecManagedBy == v2beta1.KubeflowJobController)
}
Member

We may want to move this into the jobframework package as shared code, but we can consider that as a follow-up.

Contributor Author

A follow up, fair deal ;)
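
As an illustration of the defaulting rule in the snippet under review, here is a minimal, self-contained sketch. It is not the webhook code from this PR: the feature gate is replaced by a plain variable, the two constant strings are assumed values for `kueue.MultiKueueControllerName` and `v2beta1.KubeflowJobController`, and `defaultManagedBy` with its `admissionCheckControllers` parameter is a hypothetical helper. The queue-label and ClusterQueue lookups from the original are left out.

```go
package main

import "fmt"

// Assumed values standing in for the constants the snippet refers to.
const (
	multiKueueControllerName = "kueue.x-k8s.io/multikueue" // kueue.MultiKueueControllerName
	kubeflowJobController    = "kubeflow.org/mpi-operator" // v2beta1.KubeflowJobController
)

// multiKueueEnabled stands in for features.Enabled(features.MultiKueue).
var multiKueueEnabled = true

// canDefaultManagedBy mirrors the rule from the snippet: defaulting is allowed
// only when the field is unset or still points at the default mpi-operator.
func canDefaultManagedBy(managedBy *string) bool {
	return multiKueueEnabled &&
		(managedBy == nil || *managedBy == kubeflowJobController)
}

// defaultManagedBy is a hypothetical helper that returns the value managedBy
// would have after defaulting. admissionCheckControllers lists the controller
// names of the admission checks configured on the job's ClusterQueue.
func defaultManagedBy(managedBy *string, admissionCheckControllers []string) *string {
	if !canDefaultManagedBy(managedBy) {
		return managedBy
	}
	for _, controller := range admissionCheckControllers {
		if controller == multiKueueControllerName {
			v := multiKueueControllerName
			return &v
		}
	}
	return managedBy
}

// show renders an optional string for printing.
func show(p *string) string {
	if p == nil {
		return "<nil>"
	}
	return *p
}

func main() {
	custom := "example.com/custom-controller"
	withMultiKueue := []string{multiKueueControllerName}

	fmt.Println(show(defaultManagedBy(nil, withMultiKueue)))     // unset field, MultiKueue check present
	fmt.Println(show(defaultManagedBy(&custom, withMultiKueue))) // user-set value is left alone
	fmt.Println(show(defaultManagedBy(nil, nil)))                // no MultiKueue check, nothing to default
}
```

Keeping the decision in a small pure function like this is one way the logic could later be shared across job integrations, which is the follow-up discussed in this thread.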

pkg/controller/jobs/mpijob/mpijob_webhook.go (outdated, resolved)
Contributor

@trasc trasc left a comment

one nit

pkg/controller/jobs/mpijob/mpijob_multikueue_adapter.go (outdated, resolved)
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 24, 2024
@mszadkow
Contributor Author

/retest

Member

@tenzen-y tenzen-y left a comment

Thanks!
@mszadkow Could you open a PR to update the documentation?
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 24, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: aedfc0e34b32347d5a300404ed345d672c2ee0d5

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mszadkow, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 24, 2024
@k8s-ci-robot k8s-ci-robot merged commit 34d1fef into kubernetes-sigs:main Oct 24, 2024
16 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.9 milestone Oct 24, 2024
@mbobrovskyi mbobrovskyi deleted the featue/managed-by-mpi-operator branch October 25, 2024 02:31
@mszadkow
Contributor Author

Thanks! @mszadkow Could you open a PR to update the documentation? /lgtm /approve

Thank you!
Yes, I will open it.

@mszadkow
Contributor Author

#3313

@mszadkow
Contributor Author

#3316 - docs PR

PBundyra pushed a commit to PBundyra/kueue that referenced this pull request Nov 5, 2024
…3289)

* Add managedBy field impl and unit tests

* Update MpiJob multikueue integration test

* Update e2e tests and start to use mpi-operator on management cluster

* Implement webhook defaulting

* Update after code review
kannon92 pushed a commit to openshift-kannon92/kubernetes-sigs-kueue that referenced this pull request Nov 19, 2024
…3289)

* Add managedBy field impl and unit tests

* Update MpiJob multikueue integration test

* Update e2e tests and start to use mpi-operator on management cluster

* Implement webhook defaulting

* Update after code review
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.