Cherry pick #1415 #1418 to v1.3-branch #1428

Merged 2 commits on Oct 3, 2021
6 changes: 3 additions & 3 deletions README.md
@@ -12,9 +12,9 @@ run distributed or non-distributed TensorFlow/PyTorch/MXNet/XGBoost jobs on Kube

- For a complete reference of the custom resource definitions, please refer to the API Definition.
- [Tensorflow API Definition](pkg/apis/tensorflow/v1/types.go)
- - [PyTorch API Definition](pkg/apis/pytorch/v1/types.go)
- - [MXNet API Definition](pkg/apis/mxnet/v1/types.go)
- - [XGBoost API Definition](pkg/apis/xgboost/v1/types.go)
+ - [PyTorch API Definition](pkg/apis/pytorch/v1/pytorchjob_types.go)
+ - [MXNet API Definition](pkg/apis/mxnet/v1/mxjob_types.go)
+ - [XGBoost API Definition](pkg/apis/xgboost/v1/xgboostjob_types.go)
- For details on API design, please refer to the [v1alpha2 design doc](https://github.com/kubeflow/community/blob/master/proposals/tf-operator-design-v1alpha2.md).
- For details of all-in-one operator design, please refer to the [All-in-one Kubeflow Training Operator](https://docs.google.com/document/d/1x1JPDQfDMIbnoQRftDH1IzGU0qvHGSU4W6Jl4rJLPhI/edit#heading=h.e33ufidnl8z6)
- For details on its obersibility, please refer to the [monitoring design doc](docs/monitoring/README.md).
11 changes: 1 addition & 10 deletions pkg/controller.v1/mxnet/mxjob_controller.go
@@ -166,17 +166,8 @@ func (r *MXJobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl
 		replicas[commonv1.ReplicaType(k)] = v
 	}

-	// Construct RunPolicy based on MXJob.Spec
-	runPolicy := &commonv1.RunPolicy{
-		CleanPodPolicy:          mxjob.Spec.RunPolicy.CleanPodPolicy,
-		TTLSecondsAfterFinished: mxjob.Spec.RunPolicy.TTLSecondsAfterFinished,
-		ActiveDeadlineSeconds:   mxjob.Spec.RunPolicy.ActiveDeadlineSeconds,
-		BackoffLimit:            mxjob.Spec.RunPolicy.BackoffLimit,
-		SchedulingPolicy:        nil,
-	}
-
 	// Use common to reconcile the job related pod and service
-	err = r.ReconcileJobs(mxjob, replicas, mxjob.Status, runPolicy)
+	err = r.ReconcileJobs(mxjob, replicas, mxjob.Status, &mxjob.Spec.RunPolicy)
 	if err != nil {
 		logrus.Warnf("Reconcile MX Job error %v", err)
 		return ctrl.Result{}, err
11 changes: 1 addition & 10 deletions pkg/controller.v1/pytorch/pytorchjob_controller.go
@@ -155,17 +155,8 @@ func (r *PyTorchJobReconciler) Reconcile(ctx context.Context, req ctrl.Request)
 	// Set default priorities to pytorch job
 	r.Scheme.Default(pytorchjob)

-	// Construct RunPolicy based on PyTorchJob.Spec
-	runPolicy := &commonv1.RunPolicy{
-		CleanPodPolicy:          pytorchjob.Spec.RunPolicy.CleanPodPolicy,
-		TTLSecondsAfterFinished: pytorchjob.Spec.RunPolicy.TTLSecondsAfterFinished,
-		ActiveDeadlineSeconds:   pytorchjob.Spec.RunPolicy.ActiveDeadlineSeconds,
-		BackoffLimit:            pytorchjob.Spec.RunPolicy.BackoffLimit,
-		SchedulingPolicy:        nil,
-	}
-
 	// Use common to reconcile the job related pod and service
-	err = r.ReconcileJobs(pytorchjob, pytorchjob.Spec.PyTorchReplicaSpecs, pytorchjob.Status, runPolicy)
+	err = r.ReconcileJobs(pytorchjob, pytorchjob.Spec.PyTorchReplicaSpecs, pytorchjob.Status, &pytorchjob.Spec.RunPolicy)
 	if err != nil {
 		logrus.Warnf("Reconcile PyTorch Job error %v", err)
 		return ctrl.Result{}, err
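
Note on the two controller changes: the removed code built a fresh RunPolicy with SchedulingPolicy forced to nil, while the new code passes a pointer to the job's own Spec.RunPolicy. A minimal Go sketch of that difference follows. The types are simplified, hypothetical stand-ins for the commonv1 types (the `Queue` field and both helper names are assumptions for illustration). How ReconcileJobs uses the scheduling policy is not shown in this diff.

```go
package main

import "fmt"

// SchedulingPolicy and RunPolicy are hypothetical, simplified stand-ins for
// the commonv1 types named in the diff. Field types are assumptions.
type SchedulingPolicy struct {
	Queue string // illustrative field only
}

type RunPolicy struct {
	CleanPodPolicy          *string
	TTLSecondsAfterFinished *int32
	ActiveDeadlineSeconds   *int64
	BackoffLimit            *int32
	SchedulingPolicy        *SchedulingPolicy
}

// copyWithoutScheduling mirrors the removed code: four fields are copied and
// SchedulingPolicy is explicitly set to nil.
func copyWithoutScheduling(spec RunPolicy) *RunPolicy {
	return &RunPolicy{
		CleanPodPolicy:          spec.CleanPodPolicy,
		TTLSecondsAfterFinished: spec.TTLSecondsAfterFinished,
		ActiveDeadlineSeconds:   spec.ActiveDeadlineSeconds,
		BackoffLimit:            spec.BackoffLimit,
		SchedulingPolicy:        nil,
	}
}

// passThrough mirrors the added code: the caller hands over a pointer to the
// spec's own RunPolicy, so every field, including SchedulingPolicy, survives.
func passThrough(spec *RunPolicy) *RunPolicy {
	return spec
}

func main() {
	ttl := int32(60)
	spec := RunPolicy{
		TTLSecondsAfterFinished: &ttl,
		SchedulingPolicy:        &SchedulingPolicy{Queue: "training"},
	}

	before := copyWithoutScheduling(spec)
	after := passThrough(&spec)

	fmt.Println(before.SchedulingPolicy == nil) // true
	fmt.Println(after.SchedulingPolicy.Queue)   // training
}
```

In this sketch, the pass-through version also shares the pointer with the spec, so any mutation made by the callee would be visible on the job object. Whether that matters here depends on ReconcileJobs, which this diff does not show.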