
[discussion] Refactor pytorch operator APIs #84

Closed
richardsliu opened this issue Oct 11, 2018 · 7 comments

Comments

@richardsliu
Contributor

Most of the types defined in https://github.com/kubeflow/pytorch-operator/blob/master/pkg/apis/pytorch/v1alpha2/types.go overlap with TFJob. The structures of the APIs in tf-operator and pytorch-operator are similar enough that they should extend from a single API.

I propose something like:

Common:

  • JobStatus
  • ReplicaStatus
  • JobCondition
  • JobConditionType
  • CleanPodPolicy
  • RestartPolicy

TFJob:

  • TFJobSpec
  • TFJobReplicaSpec
  • TFJobReplicaType

PyTorch:

  • PyTorchJobSpec
  • PyTorchReplicaSpec
  • PyTorchReplicaType

The common types can reside in the tf-operator repository for now. This will allow us to:

  1. Keep API semantics and vocabulary consistent across all training components;
  2. Reduce code duplication when possible;
  3. Ensure feature parity across components.
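
For concreteness, here is a rough Go sketch of what the shared piece could look like. The package name, field sets, and JSON tags below are illustrative assumptions for discussion, not the actual kubeflow API:

```go
// Illustrative sketch only: shared types that both operators could import.
// Package path and fields are assumptions, not the real API.
package common

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// CleanPodPolicy describes how to deal with pods when the job finishes.
type CleanPodPolicy string

// RestartPolicy describes how replicas should be restarted.
type RestartPolicy string

// JobConditionType is the type of a JobCondition (e.g. Created, Running, Succeeded).
type JobConditionType string

// JobCondition describes the state of the job at a certain point.
type JobCondition struct {
	Type               JobConditionType       `json:"type"`
	Status             corev1.ConditionStatus `json:"status"`
	Reason             string                 `json:"reason,omitempty"`
	Message            string                 `json:"message,omitempty"`
	LastUpdateTime     metav1.Time            `json:"lastUpdateTime,omitempty"`
	LastTransitionTime metav1.Time            `json:"lastTransitionTime,omitempty"`
}

// ReplicaStatus represents the observed state of one replica type.
type ReplicaStatus struct {
	Active    int32 `json:"active,omitempty"`
	Succeeded int32 `json:"succeeded,omitempty"`
	Failed    int32 `json:"failed,omitempty"`
}

// JobStatus represents the observed state of a training job.
type JobStatus struct {
	Conditions      []JobCondition            `json:"conditions"`
	ReplicaStatuses map[string]*ReplicaStatus `json:"replicaStatuses"`
	StartTime       *metav1.Time              `json:"startTime,omitempty"`
	CompletionTime  *metav1.Time              `json:"completionTime,omitempty"`
}
```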

Thoughts?

@gaocegege
Member

How about placing all APIs and clients in one repository, named kubeflow/clients or kubeflow/clientsets?

@richardsliu
Contributor Author

Pytorch-operator currently imports a lot of reusable code from tf-operator. The idea is to make tf-operator the "canonical" operator that other training components can extend.
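
To illustrate, here is a hedged sketch of how the PyTorch-specific types could build on such a shared package. The import path and fields are hypothetical, not the actual layout:

```go
// Illustrative sketch only: PyTorch-specific types reusing shared ones.
// The import path for "common" is an assumption for discussion.
package v1alpha2

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	common "github.com/kubeflow/tf-operator/pkg/apis/common" // hypothetical path
)

// PyTorchReplicaType is the replica type, e.g. "Master" or "Worker".
type PyTorchReplicaType string

// PyTorchReplicaSpec describes one replica set of the job.
type PyTorchReplicaSpec struct {
	Replicas      *int32                 `json:"replicas,omitempty"`
	Template      corev1.PodTemplateSpec `json:"template,omitempty"`
	RestartPolicy common.RestartPolicy   `json:"restartPolicy,omitempty"` // shared type
}

// PyTorchJobSpec is the desired state of a PyTorchJob.
type PyTorchJobSpec struct {
	CleanPodPolicy *common.CleanPodPolicy                     `json:"cleanPodPolicy,omitempty"` // shared type
	ReplicaSpecs   map[PyTorchReplicaType]*PyTorchReplicaSpec `json:"pytorchReplicaSpecs"`
}

// PyTorchJob is the top-level CRD object.
type PyTorchJob struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   PyTorchJobSpec   `json:"spec,omitempty"`
	Status common.JobStatus `json:"status,omitempty"` // shared status type
}
```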

@johnugeorge
Member

@richardsliu Thanks for taking this up. I was thinking along the same lines while I was restructuring the operator code. We planned to keep it this way so that the CRDs of each operator are completely independent of each other (which gives more flexibility to each operator). This came up in one of the discussions in the kubeflow-discuss group. For example, the CleanPodPolicyRunning policy is supported in TF but not in PyTorch.

However, it is just a design choice whether we want to share the status of the job (JobStatus) across all operators. We would then have a consistent status field for every operator.

@jlewi

@gaocegege
Member

I think we do not have conflicts across the operators; TF has some extra policies that are not useful in PyTorch. But we can share the status fields, and the pytorch operator could ignore the policies it does not support.

Personally, I think sharing could help us keep things consistent, which is helpful for users.
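
One way the PyTorch side could handle a shared policy type it does not implement is to validate it and reject unsupported values. A minimal sketch with illustrative names (in practice the type would come from the shared package rather than being redefined here):

```go
// Illustrative sketch only: validating a shared CleanPodPolicy in the
// PyTorch operator, rejecting the "Running" value it does not support.
package validation

import "fmt"

// CleanPodPolicy mirrors the shared type for self-containment of this sketch.
type CleanPodPolicy string

const (
	CleanPodPolicyAll     CleanPodPolicy = "All"
	CleanPodPolicyRunning CleanPodPolicy = "Running"
	CleanPodPolicyNone    CleanPodPolicy = "None"
)

// validateCleanPodPolicy returns an error for policy values the PyTorch
// operator does not implement, even though they exist in the shared type.
func validateCleanPodPolicy(p *CleanPodPolicy) error {
	if p == nil {
		return nil // nothing set; the controller applies its default
	}
	switch *p {
	case CleanPodPolicyAll, CleanPodPolicyNone:
		return nil
	case CleanPodPolicyRunning:
		return fmt.Errorf("cleanPodPolicy %q is not supported by the PyTorch operator", *p)
	default:
		return fmt.Errorf("unknown cleanPodPolicy %q", *p)
	}
}
```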

@johnugeorge
Member

Implemented

TF: kubeflow/training-operator#859

Pytorch: #93

@johnugeorge
Member

Closing this issue
/close

@k8s-ci-robot

@johnugeorge: Closing this issue.

In response to this:

Closing this issue
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
