MPI Operator v1alpha2 API Design Proposal #92
For things like JobStatus, we should aim to have a common implementation across operators. Please see https://github.com/kubeflow/tf-operator/blob/master/pkg/apis/common/v1beta2/common_types.go. The pytorch operator is a great example of using the common types and libraries.
Yes. As @richardsliu suggested, it would be better if you use the common JobStatus. It would be easier to implement at this point. We are aiming to reach a point where all operators share a common JobStatus type so that other components can make use of it. In the PyTorch operator, see https://github.com/kubeflow/pytorch-operator/blob/master/pkg/apis/pytorch/v1beta1/types.go#L41
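For reference, here is an abridged sketch of the shape of those shared types (field names follow common_types.go in kubeflow/tf-operator at the time of this discussion, trimmed for brevity, so treat it as illustrative rather than authoritative):

```go
package common

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ReplicaType identifies a group of replicas, e.g. "Launcher" or "Worker".
type ReplicaType string

// ReplicaStatus holds the observed pod counts for one replica type.
type ReplicaStatus struct {
	Active    int32 `json:"active,omitempty"`    // pods currently running
	Succeeded int32 `json:"succeeded,omitempty"` // pods that completed
	Failed    int32 `json:"failed,omitempty"`    // pods that failed
}

// JobConditionType takes values such as Created, Running, Restarting,
// Succeeded, and Failed.
type JobConditionType string

// JobCondition describes one observed condition of the job.
type JobCondition struct {
	Type    JobConditionType       `json:"type"`
	Status  corev1.ConditionStatus `json:"status"`
	Reason  string                 `json:"reason,omitempty"`
	Message string                 `json:"message,omitempty"`
}

// JobStatus is the status type shared across the Kubeflow operators.
type JobStatus struct {
	Conditions        []JobCondition                 `json:"conditions"`
	ReplicaStatuses   map[ReplicaType]*ReplicaStatus `json:"replicaStatuses"`
	StartTime         *metav1.Time                   `json:"startTime,omitempty"`
	CompletionTime    *metav1.Time                   `json:"completionTime,omitempty"`
	LastReconcileTime *metav1.Time                   `json:"lastReconcileTime,omitempty"`
}
```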
+1 on common job status. I have recently been working on kubebench and found it is better to have a common job status to orchestrate workflows without extra per-operator status handling. Otherwise, I have to define job-finish conditions differently for each DL framework operator's JobStatus.
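A minimal sketch of the orchestration point above, assuming the shared v1beta2 common package and its condition constants (JobSucceeded, JobFailed); with a common JobStatus, a workflow tool can test for completion the same way regardless of which operator produced the status:

```go
package workflow

import (
	common "github.com/kubeflow/tf-operator/pkg/apis/common/v1beta2"
	corev1 "k8s.io/api/core/v1"
)

// hasCondition reports whether status carries condType with status True.
func hasCondition(status common.JobStatus, condType common.JobConditionType) bool {
	for _, c := range status.Conditions {
		if c.Type == condType && c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}

// isFinished reports whether the job reached a terminal state. Because the
// status type is shared, the same check covers TF, PyTorch, MPI, etc.
func isFinished(status common.JobStatus) bool {
	return hasCondition(status, common.JobSucceeded) ||
		hasCondition(status, common.JobFailed)
}
```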
Thanks @richardsliu @johnugeorge @Jeffwan for the suggestion. I totally agree that we can reuse the common JobStatus.
+1 on using the common job status and spec. Looking closer: (a) ReplicaStatus in both lists only the number of Active/Running, Succeeded, and Failed replicas; what about other states like Pending, when some replicas are waiting for resources to be scheduled? (b) BackoffLimit would be a good candidate to move into a common RestartPolicy (but that is probably a broader change).
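To make (a) concrete, here is a hypothetical extension of the common ReplicaStatus; the Pending field does not exist in the common types and is shown only to illustrate the suggestion:

```go
// Hypothetical extension of the common ReplicaStatus addressing (a).
// The Pending field below is NOT part of the current common types.
type ReplicaStatus struct {
	Active    int32 `json:"active,omitempty"`
	Succeeded int32 `json:"succeeded,omitempty"`
	Failed    int32 `json:"failed,omitempty"`

	// Pending counts replicas whose pods exist but are still waiting
	// to be scheduled (illustrative only).
	Pending int32 `json:"pending,omitempty"`
}
```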
@terrytangyuan we have been thinking about this for some time. It hasn't happened because of the effort required. We also have a JobController that was designed to share features across operators. It should also be moved to a common repo. (See https://github.com/kubeflow/tf-operator/tree/master/pkg/common/jobcontroller) @richardsliu Please add your thoughts too.
+1 to separate the common module out. |
Thanks for everyone's feedback! I've created a PR (#95) for the initial v1alpha2 MPIJob API spec based on everyone's feedback. Please take a look and let me know if there's anything else that needs to be addressed. Note that I copied the common types from kubeflow/tf-operator for now.
I'll defer to @richardsliu @johnugeorge since they have largely been driving operators these days. |
We can take this up after the 0.5 release.
Hi community,
I am proposing the design for the v1alpha2 API version of the MPI Operator. You are very welcome to join the discussion here if you have any questions, comments, concerns, or suggestions. Once we have a consensus from the community, we can start working on individual items.
Here are the main API changes before we dive into the detailed API spec (not including specific implementations):

- Removes GPUs and GPUsPerNode. This is the remaining work from "Support processing resource types other than GPU" (#75) and "Move processing unit specific flags to MPIJobSpec" (#85).
- Separates Template into LauncherSpec and WorkerSpec. See "Separate out worker and launcher pod specs" (#54) and "Launcher and worker statuses do not correctly indicate the underlying states" (#90).
- Replaces MPIJobLauncherStatusType with a more generic MPIJobPodStatusType that represents the different states of either the launcher or worker pods.
- Adds ReplicaStatuses, which represents the statuses of all the worker replicas, and removes WorkerReplicas since it can be inferred from ReplicaStatuses. See #90.

Below is the proposed API spec for v1alpha2:
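What follows is only a sketch of that spec, reconstructed from the four changes above (see #95 for the actual proposal): the LauncherSpec, WorkerSpec, MPIJobPodStatusType, and ReplicaStatuses names come from the list, while the remaining field names, types, and constants are assumptions for illustration.

```go
package v1alpha2

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// MPIJob is the top-level resource.
type MPIJob struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   MPIJobSpec   `json:"spec,omitempty"`
	Status MPIJobStatus `json:"status,omitempty"`
}

// MPIJobSpec reflects change (2): the single Template is split into
// separate launcher and worker pod specs.
type MPIJobSpec struct {
	// LauncherSpec describes the pod that runs mpirun.
	LauncherSpec corev1.PodTemplateSpec `json:"launcherSpec"`

	// WorkerSpec describes the worker pods. Per change (1), GPUs and
	// GPUsPerNode are gone; resources are expressed via the standard
	// container resource requests/limits inside the pod template.
	WorkerSpec corev1.PodTemplateSpec `json:"workerSpec"`

	// Replicas is the desired number of workers (hypothetical name).
	Replicas *int32 `json:"replicas,omitempty"`
}

// MPIJobPodStatusType is the generic per-pod state from change (3),
// applicable to the launcher and the workers alike.
type MPIJobPodStatusType string

const (
	PodCreated   MPIJobPodStatusType = "Created"
	PodRunning   MPIJobPodStatusType = "Running"
	PodSucceeded MPIJobPodStatusType = "Succeeded"
	PodFailed    MPIJobPodStatusType = "Failed"
)

// MPIJobStatus reflects change (4): ReplicaStatuses reports every worker,
// and WorkerReplicas is dropped since it can be inferred from it.
type MPIJobStatus struct {
	LauncherStatus  MPIJobPodStatusType   `json:"launcherStatus,omitempty"`
	ReplicaStatuses []MPIJobPodStatusType `json:"replicaStatuses,omitempty"`
}
```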
cc: @rongou @anfeng @jlewi @everpeace @gaocegege @Nivedita-V @madhukarkm @ywskycn @ScorpioCPH @jian-he @cheyang @richardsliu
Feel free to tag others who might be interested.