
[discussion] Refactor pytorch operator APIs #84

Closed
richardsliu opened this issue Oct 11, 2018 · 7 comments

Comments

@richardsliu
Contributor

Most of the types defined in https://github.com/kubeflow/pytorch-operator/blob/master/pkg/apis/pytorch/v1alpha2/types.go overlap with TFJob. The structures of the APIs in tf-operator and pytorch-operator are similar enough that they should extend from a single API.

I propose something like:

Common:

  • JobStatus
  • ReplicaStatus
  • JobCondition
  • JobConditionType
  • CleanPodPolicy
  • RestartPolicy

TFJob:

  • TFJobSpec
  • TFJobReplicaSpec
  • TFJobReplicaType

PyTorch:

  • PyTorchJobSpec
  • PyTorchReplicaSpec
  • PyTorchReplicaType

The common types can reside in the tf-operator repository for now. This will allow us to:

  1. Keep API semantics and vocabulary consistent across all training components;
  2. Reduce code duplication when possible;
  3. Ensure feature parity across components.
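
For concreteness, here is a rough Go sketch of what the shared piece could look like. The package name, field sets, and JSON tags below are illustrative assumptions for discussion, not the actual kubeflow API:

```go
// Illustrative sketch only: shared types that both operators could import.
// Package path and fields are assumptions, not the real API.
package common

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// CleanPodPolicy describes how to deal with pods when the job finishes.
type CleanPodPolicy string

// RestartPolicy describes how replicas should be restarted.
type RestartPolicy string

// JobConditionType is the type of a JobCondition (e.g. Created, Running, Succeeded).
type JobConditionType string

// JobCondition describes the state of the job at a certain point.
type JobCondition struct {
	Type               JobConditionType       `json:"type"`
	Status             corev1.ConditionStatus `json:"status"`
	Reason             string                 `json:"reason,omitempty"`
	Message            string                 `json:"message,omitempty"`
	LastUpdateTime     metav1.Time            `json:"lastUpdateTime,omitempty"`
	LastTransitionTime metav1.Time            `json:"lastTransitionTime,omitempty"`
}

// ReplicaStatus represents the observed state of one replica type.
type ReplicaStatus struct {
	Active    int32 `json:"active,omitempty"`
	Succeeded int32 `json:"succeeded,omitempty"`
	Failed    int32 `json:"failed,omitempty"`
}

// JobStatus represents the observed state of a training job.
type JobStatus struct {
	Conditions      []JobCondition            `json:"conditions"`
	ReplicaStatuses map[string]*ReplicaStatus `json:"replicaStatuses"`
	StartTime       *metav1.Time              `json:"startTime,omitempty"`
	CompletionTime  *metav1.Time              `json:"completionTime,omitempty"`
}
```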

Thoughts?

@gaocegege
Member

How about placing all APIs and clients in one repository, named kubeflow/clients or kubeflow/clientsets?

@richardsliu
Contributor Author

Pytorch-operator currently imports a lot of reusable code from tf-operator. The idea is to make tf-operator the "canonical" operator that other training components can extend.
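
To illustrate, here is a hedged sketch of how the PyTorch-specific types could build on such a shared package. The import path and fields are hypothetical, not the actual layout:

```go
// Illustrative sketch only: PyTorch-specific types reusing shared ones.
// The import path for "common" is an assumption for discussion.
package v1alpha2

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	common "github.com/kubeflow/tf-operator/pkg/apis/common" // hypothetical path
)

// PyTorchReplicaType is the replica type, e.g. "Master" or "Worker".
type PyTorchReplicaType string

// PyTorchReplicaSpec describes one replica set of the job.
type PyTorchReplicaSpec struct {
	Replicas      *int32                 `json:"replicas,omitempty"`
	Template      corev1.PodTemplateSpec `json:"template,omitempty"`
	RestartPolicy common.RestartPolicy   `json:"restartPolicy,omitempty"` // shared type
}

// PyTorchJobSpec is the desired state of a PyTorchJob.
type PyTorchJobSpec struct {
	CleanPodPolicy *common.CleanPodPolicy                     `json:"cleanPodPolicy,omitempty"` // shared type
	ReplicaSpecs   map[PyTorchReplicaType]*PyTorchReplicaSpec `json:"pytorchReplicaSpecs"`
}

// PyTorchJob is the top-level CRD object.
type PyTorchJob struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   PyTorchJobSpec   `json:"spec,omitempty"`
	Status common.JobStatus `json:"status,omitempty"` // shared status type
}
```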

@johnugeorge
Member

@richardsliu Thanks for taking this up. I was thinking along the same lines while I was restructuring the operator code. We planned to keep it this way so that the CRDs of each operator are completely independent of each other (which gives more flexibility to each operator). This came up in one of the discussions in the kubeflow-discuss group. For example, the CleanPodPolicyRunning policy is supported in TF but not in PyTorch.

However, it is just a design choice whether we want to share the status of the job (JobStatus) across all operators. We would then have a consistent status field for every operator.

@jlewi

@gaocegege
Member

I think we do not have conflicts across the operators; TF has some extra policies that are not useful in PyTorch. But we can share the status fields, and the pytorch operator could ignore the policies it does not support.

Personally, I think sharing could help us keep things consistent, which is helpful for users.
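
One way the PyTorch side could handle a shared policy type it does not implement is to validate it and reject unsupported values. A minimal sketch with illustrative names (in practice the type would come from the shared package rather than being redefined here):

```go
// Illustrative sketch only: validating a shared CleanPodPolicy in the
// PyTorch operator, rejecting the "Running" value it does not support.
package validation

import "fmt"

// CleanPodPolicy mirrors the shared type for self-containment of this sketch.
type CleanPodPolicy string

const (
	CleanPodPolicyAll     CleanPodPolicy = "All"
	CleanPodPolicyRunning CleanPodPolicy = "Running"
	CleanPodPolicyNone    CleanPodPolicy = "None"
)

// validateCleanPodPolicy returns an error for policy values the PyTorch
// operator does not implement, even though they exist in the shared type.
func validateCleanPodPolicy(p *CleanPodPolicy) error {
	if p == nil {
		return nil // nothing set; the controller applies its default
	}
	switch *p {
	case CleanPodPolicyAll, CleanPodPolicyNone:
		return nil
	case CleanPodPolicyRunning:
		return fmt.Errorf("cleanPodPolicy %q is not supported by the PyTorch operator", *p)
	default:
		return fmt.Errorf("unknown cleanPodPolicy %q", *p)
	}
}
```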

@johnugeorge
Member

Implemented

TF: kubeflow/training-operator#859

Pytorch: #93

@johnugeorge
Member

Closing this issue
/close

@k8s-ci-robot

@johnugeorge: Closing this issue.

In response to this:

Closing this issue
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
