Support kubeflow operator #297
Comments
Note that the latest MPIJob version is not currently part of the training-operator; see kubeflow/training-operator#1479. |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/lifecycle frozen |
This is currently blocked on kubeflow/common#196 |
/assign |
As a first step, I opened a PR to add the PyTorchJob support, and then I will add the following framework support:
Also, I'm on the fence about whether we should support MPIJob v1, which is hosted only on kubeflow/training-operator (currently, MPIJob v2 is hosted only on kubeflow/mpi-operator). Regarding MPIJob v1, wdyt? @alculquicondor @mimowo @kerthcet @trasc |
I'm ok leaving it out if it's not trivial to support 2 API versions. I think the CRD objects themselves are not compatible. |
Right.
We cannot support both the v1 and v2 APIs with a single controller: https://github.com/kubernetes-sigs/kueue/tree/a103723023aa6c5a63cc8c1248fd38d8640d7003/pkg/controller/jobs/mpijob. However, once we implement a separate controller for v1, like https://github.com/kubernetes-sigs/kueue/blob/3589969054023cb8b584a4639f4b9dec8c371a67/pkg/controller/jobs/kubeflow/jobs/pytorchjob/pytorchjob_controller.go, we can support v1. |
Anyway, I think MPIJob v1 is a lower priority since we already support MPIJob v2. |
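To illustrate the point about separate controllers, here is a minimal sketch of keying one integration per group/version/kind, so that the two MPIJob APIs are registered independently. Everything in it is hypothetical and for illustration only: the `GVK`, `JobAdapter`, and `Registry` names are not Kueue's actual types, and the `kubeflow.org/v1` and `kubeflow.org/v2beta1` version strings are assumptions about how the two MPIJob APIs are published.

```go
package main

import "fmt"

// GVK identifies an API type by group, version, and kind.
// Hypothetical stand-in for the Kubernetes GroupVersionKind type.
type GVK struct {
	Group, Version, Kind string
}

func (g GVK) String() string {
	return g.Group + "/" + g.Version + ", Kind=" + g.Kind
}

// JobAdapter is a hypothetical per-API integration. Each adapter
// understands exactly one object schema.
type JobAdapter interface {
	Name() string
}

type namedAdapter string

func (a namedAdapter) Name() string { return string(a) }

// Registry maps each API type to the single adapter that handles it.
type Registry struct {
	adapters map[GVK]JobAdapter
}

func NewRegistry() *Registry {
	return &Registry{adapters: make(map[GVK]JobAdapter)}
}

// Register rejects a second adapter for the same group/version/kind.
func (r *Registry) Register(gvk GVK, a JobAdapter) error {
	if _, exists := r.adapters[gvk]; exists {
		return fmt.Errorf("adapter already registered for %s", gvk)
	}
	r.adapters[gvk] = a
	return nil
}

// Lookup returns the adapter for the given API type, if any.
func (r *Registry) Lookup(gvk GVK) (JobAdapter, bool) {
	a, ok := r.adapters[gvk]
	return a, ok
}

func main() {
	// Assumed version strings; the kind is the same but the versions differ.
	mpiV1 := GVK{Group: "kubeflow.org", Version: "v1", Kind: "MPIJob"}
	mpiV2 := GVK{Group: "kubeflow.org", Version: "v2beta1", Kind: "MPIJob"}

	r := NewRegistry()
	// Only the v2 adapter is registered, mirroring the current state
	// described in this thread.
	if err := r.Register(mpiV2, namedAdapter("mpijob-v2")); err != nil {
		panic(err)
	}

	for _, gvk := range []GVK{mpiV1, mpiV2} {
		if a, ok := r.Lookup(gvk); ok {
			fmt.Printf("%s handled by %s\n", gvk, a.Name())
		} else {
			fmt.Printf("%s has no adapter\n", gvk)
		}
	}
}
```

Under these assumptions, adding v1 support later would mean one more `Register` call with its own adapter, without touching the v2 one.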
+1 to defer the work unless we receive strong demand. |
I agree. |
Tasks:
What would you like to be added:
Support kubeflow training operator.
Why is this needed:
This issue tracks the status of Kueue's support for the Kubeflow training operator.