Cherry pick #1415 #1418 to v1.3-branch #1427

Closed
wants to merge 4 commits into from
4 changes: 2 additions & 2 deletions Makefile
@@ -2,7 +2,7 @@
# Image URL to use all building/pushing image targets
IMG ?= kubeflow/training-operator:latest
# Produce CRDs that work back to Kubernetes 1.11 (no version conversion)
-CRD_OPTIONS ?= "crd:trivialVersions=true,preserveUnknownFields=false"
+CRD_OPTIONS ?= "crd:trivialVersions=true,preserveUnknownFields=false,generateEmbeddedObjectMeta=true"

# Get the currently used golang install path (in GOPATH/bin, unless GOBIN is set)
ifeq (,$(shell go env GOBIN))
@@ -87,7 +87,7 @@ undeploy: ## Undeploy controller from the K8s cluster specified in ~/.kube/confi

CONTROLLER_GEN = $(shell pwd)/bin/controller-gen
controller-gen: ## Download controller-gen locally if necessary.
-$(call go-get-tool,$(CONTROLLER_GEN),sigs.k8s.io/controller-tools/cmd/controller-gen@v0.4.1)
+$(call go-get-tool,$(CONTROLLER_GEN),sigs.k8s.io/controller-tools/cmd/controller-gen@v0.6.0)

KUSTOMIZE = $(shell pwd)/bin/kustomize
kustomize: ## Download kustomize locally if necessary.
6 changes: 3 additions & 3 deletions README.md
@@ -12,9 +12,9 @@ run distributed or non-distributed TensorFlow/PyTorch/MXNet/XGBoost jobs on Kube

- For a complete reference of the custom resource definitions, please refer to the API Definition.
  - [Tensorflow API Definition](pkg/apis/tensorflow/v1/types.go)
-  - [PyTorch API Definition](pkg/apis/pytorch/v1/types.go)
-  - [MXNet API Definition](pkg/apis/mxnet/v1/types.go)
-  - [XGBoost API Definition](pkg/apis/xgboost/v1/types.go)
+  - [PyTorch API Definition](pkg/apis/pytorch/v1/pytorchjob_types.go)
+  - [MXNet API Definition](pkg/apis/mxnet/v1/mxjob_types.go)
+  - [XGBoost API Definition](pkg/apis/xgboost/v1/xgboostjob_types.go)
- For details on API design, please refer to the [v1alpha2 design doc](https://github.com/kubeflow/community/blob/master/proposals/tf-operator-design-v1alpha2.md).
- For details of all-in-one operator design, please refer to the [All-in-one Kubeflow Training Operator](https://docs.google.com/document/d/1x1JPDQfDMIbnoQRftDH1IzGU0qvHGSU4W6Jl4rJLPhI/edit#heading=h.e33ufidnl8z6)
- For details on its observability, please refer to the [monitoring design doc](docs/monitoring/README.md).
36 changes: 35 additions & 1 deletion manifests/base/crds/kubeflow.org_mxjobs.yaml
@@ -4,7 +4,7 @@ apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
annotations:
-controller-gen.kubebuilder.io/version: v0.4.1
+controller-gen.kubebuilder.io/version: v0.6.0
creationTimestamp: null
name: mxjobs.kubeflow.org
spec:
@@ -60,6 +60,23 @@ spec:
properties:
metadata:
description: 'Standard object''s metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata'
+properties:
+  annotations:
+    additionalProperties:
+      type: string
+    type: object
+  finalizers:
+    items:
+      type: string
+    type: array
+  labels:
+    additionalProperties:
+      type: string
+    type: object
+  name:
+    type: string
+  namespace:
+    type: string
type: object
spec:
description: 'Specification of the desired behavior of the
@@ -5582,6 +5599,23 @@ spec:
that will be copied into the PVC when
creating it. No other fields are allowed
and will be rejected during validation.
+properties:
+  annotations:
+    additionalProperties:
+      type: string
+    type: object
+  finalizers:
+    items:
+      type: string
+    type: array
+  labels:
+    additionalProperties:
+      type: string
+    type: object
+  name:
+    type: string
+  namespace:
+    type: string
type: object
spec:
description: The specification for the
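Review note: a rough illustration of why the expanded `metadata` schema in this file matters, assuming the API server prunes fields that the CRD schema does not declare when `preserveUnknownFields=false`. With only `type: object` on an embedded `metadata`, nested keys such as `labels` would be dropped under that assumption; with the generated `properties` they are kept. The `schema` type and the `prune` function below are hypothetical simplifications written for this note, not Kubernetes or controller-gen APIs.

```go
package main

import "fmt"

// schema is a hypothetical, heavily simplified stand-in for a structural
// OpenAPI schema. It records only which child properties are declared.
type schema struct {
	properties map[string]*schema // declared child fields, if any
	additional bool               // models additionalProperties: any key is allowed
}

// prune returns a copy of obj that keeps only the keys the schema declares.
// Nested objects are pruned recursively against the child schema.
func prune(obj map[string]interface{}, s *schema) map[string]interface{} {
	out := map[string]interface{}{}
	if s == nil {
		return out
	}
	for k, v := range obj {
		if s.additional {
			out[k] = v
			continue
		}
		child, declared := s.properties[k]
		if !declared {
			continue // undeclared field: dropped
		}
		if nested, isObject := v.(map[string]interface{}); isObject {
			out[k] = prune(nested, child)
		} else {
			out[k] = v
		}
	}
	return out
}

func main() {
	meta := map[string]interface{}{
		"labels": map[string]interface{}{"app": "demo"},
		"name":   "worker",
	}

	// Schema as generated before this change: metadata is just "type: object".
	before := &schema{}

	// Schema as generated after this change: the five fields shown in the diff.
	after := &schema{properties: map[string]*schema{
		"annotations": {additional: true},
		"finalizers":  {},
		"labels":      {additional: true},
		"name":        {},
		"namespace":   {},
	}}

	fmt.Println(len(prune(meta, before))) // 0
	fmt.Println(len(prune(meta, after)))  // 2
}
```

The same reasoning would apply to the other three CRD files in this change, which receive identical `properties` blocks.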
36 changes: 35 additions & 1 deletion manifests/base/crds/kubeflow.org_pytorchjobs.yaml
@@ -4,7 +4,7 @@ apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
annotations:
-controller-gen.kubebuilder.io/version: v0.4.1
+controller-gen.kubebuilder.io/version: v0.6.0
creationTimestamp: null
name: pytorchjobs.kubeflow.org
spec:
@@ -56,6 +56,23 @@ spec:
properties:
metadata:
description: 'Standard object''s metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata'
+properties:
+  annotations:
+    additionalProperties:
+      type: string
+    type: object
+  finalizers:
+    items:
+      type: string
+    type: array
+  labels:
+    additionalProperties:
+      type: string
+    type: object
+  name:
+    type: string
+  namespace:
+    type: string
type: object
spec:
description: 'Specification of the desired behavior of the
@@ -5578,6 +5595,23 @@ spec:
that will be copied into the PVC when
creating it. No other fields are allowed
and will be rejected during validation.
+properties:
+  annotations:
+    additionalProperties:
+      type: string
+    type: object
+  finalizers:
+    items:
+      type: string
+    type: array
+  labels:
+    additionalProperties:
+      type: string
+    type: object
+  name:
+    type: string
+  namespace:
+    type: string
type: object
spec:
description: The specification for the
36 changes: 35 additions & 1 deletion manifests/base/crds/kubeflow.org_tfjobs.yaml
@@ -4,7 +4,7 @@ apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
annotations:
-controller-gen.kubebuilder.io/version: v0.4.1
+controller-gen.kubebuilder.io/version: v0.6.0
creationTimestamp: null
name: tfjobs.kubeflow.org
spec:
@@ -112,6 +112,23 @@ spec:
properties:
metadata:
description: 'Standard object''s metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata'
+properties:
+  annotations:
+    additionalProperties:
+      type: string
+    type: object
+  finalizers:
+    items:
+      type: string
+    type: array
+  labels:
+    additionalProperties:
+      type: string
+    type: object
+  name:
+    type: string
+  namespace:
+    type: string
type: object
spec:
description: 'Specification of the desired behavior of the
@@ -5634,6 +5651,23 @@ spec:
that will be copied into the PVC when
creating it. No other fields are allowed
and will be rejected during validation.
+properties:
+  annotations:
+    additionalProperties:
+      type: string
+    type: object
+  finalizers:
+    items:
+      type: string
+    type: array
+  labels:
+    additionalProperties:
+      type: string
+    type: object
+  name:
+    type: string
+  namespace:
+    type: string
type: object
spec:
description: The specification for the
36 changes: 35 additions & 1 deletion manifests/base/crds/kubeflow.org_xgboostjobs.yaml
@@ -4,7 +4,7 @@ apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
annotations:
-controller-gen.kubebuilder.io/version: v0.4.1
+controller-gen.kubebuilder.io/version: v0.6.0
creationTimestamp: null
name: xgboostjobs.kubeflow.org
spec:
@@ -104,6 +104,23 @@ spec:
properties:
metadata:
description: 'Standard object''s metadata. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata'
+properties:
+  annotations:
+    additionalProperties:
+      type: string
+    type: object
+  finalizers:
+    items:
+      type: string
+    type: array
+  labels:
+    additionalProperties:
+      type: string
+    type: object
+  name:
+    type: string
+  namespace:
+    type: string
type: object
spec:
description: 'Specification of the desired behavior of the
@@ -5626,6 +5643,23 @@ spec:
that will be copied into the PVC when
creating it. No other fields are allowed
and will be rejected during validation.
+properties:
+  annotations:
+    additionalProperties:
+      type: string
+    type: object
+  finalizers:
+    items:
+      type: string
+    type: array
+  labels:
+    additionalProperties:
+      type: string
+    type: object
+  name:
+    type: string
+  namespace:
+    type: string
type: object
spec:
description: The specification for the
2 changes: 1 addition & 1 deletion manifests/overlays/kubeflow/kustomization.yaml
@@ -7,4 +7,4 @@ resources:
images:
- name: kubeflow/training-operator
newName: public.ecr.aws/j1r0q0g6/training/training-operator
-newTag: "a1a0c188de17e0914bd7adfa79d16052276bffb1"
+newTag: "d4423c83124ce7ab58b9a61a2e909b2e9c14c236"
2 changes: 1 addition & 1 deletion manifests/overlays/standalone/kustomization.yaml
@@ -7,4 +7,4 @@ resources:
images:
- name: kubeflow/training-operator
newName: public.ecr.aws/j1r0q0g6/training/training-operator
-newTag: "a1a0c188de17e0914bd7adfa79d16052276bffb1"
+newTag: "d4423c83124ce7ab58b9a61a2e909b2e9c14c236"
11 changes: 1 addition & 10 deletions pkg/controller.v1/mxnet/mxjob_controller.go
@@ -166,17 +166,8 @@ func (r *MXJobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl
replicas[commonv1.ReplicaType(k)] = v
}

-// Construct RunPolicy based on MXJob.Spec
-runPolicy := &commonv1.RunPolicy{
-	CleanPodPolicy: mxjob.Spec.RunPolicy.CleanPodPolicy,
-	TTLSecondsAfterFinished: mxjob.Spec.RunPolicy.TTLSecondsAfterFinished,
-	ActiveDeadlineSeconds: mxjob.Spec.RunPolicy.ActiveDeadlineSeconds,
-	BackoffLimit: mxjob.Spec.RunPolicy.BackoffLimit,
-	SchedulingPolicy: nil,
-}
-
// Use common to reconcile the job related pod and service
-err = r.ReconcileJobs(mxjob, replicas, mxjob.Status, runPolicy)
+err = r.ReconcileJobs(mxjob, replicas, mxjob.Status, &mxjob.Spec.RunPolicy)
if err != nil {
logrus.Warnf("Reconcile MX Job error %v", err)
return ctrl.Result{}, err
11 changes: 1 addition & 10 deletions pkg/controller.v1/pytorch/pytorchjob_controller.go
@@ -155,17 +155,8 @@ func (r *PyTorchJobReconciler) Reconcile(ctx context.Context, req ctrl.Request)
// Set default priorities to pytorch job
r.Scheme.Default(pytorchjob)

-// Construct RunPolicy based on PyTorchJob.Spec
-runPolicy := &commonv1.RunPolicy{
-	CleanPodPolicy: pytorchjob.Spec.RunPolicy.CleanPodPolicy,
-	TTLSecondsAfterFinished: pytorchjob.Spec.RunPolicy.TTLSecondsAfterFinished,
-	ActiveDeadlineSeconds: pytorchjob.Spec.RunPolicy.ActiveDeadlineSeconds,
-	BackoffLimit: pytorchjob.Spec.RunPolicy.BackoffLimit,
-	SchedulingPolicy: nil,
-}
-
// Use common to reconcile the job related pod and service
-err = r.ReconcileJobs(pytorchjob, pytorchjob.Spec.PyTorchReplicaSpecs, pytorchjob.Status, runPolicy)
+err = r.ReconcileJobs(pytorchjob, pytorchjob.Spec.PyTorchReplicaSpecs, pytorchjob.Status, &pytorchjob.Spec.RunPolicy)
if err != nil {
logrus.Warnf("Reconcile PyTorch Job error %v", err)
return ctrl.Result{}, err
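
Review note: the effect of the MXJob and PyTorchJob controller changes can be sketched in isolation. As the diff reads, the removed code copied four fields and set `SchedulingPolicy` to nil, so a scheduling policy set on the job spec was not forwarded to `ReconcileJobs`; passing a pointer to the spec's `RunPolicy` forwards it unchanged. The types below are simplified stand-ins for the `commonv1` types (only a few fields are kept), and `copySelectedFields` and `passThrough` are hypothetical helper names used only for this illustration.

```go
package main

import "fmt"

// SchedulingPolicy and RunPolicy are simplified stand-ins for the commonv1
// types referenced in the diff; most fields are omitted on purpose.
type SchedulingPolicy struct {
	MinAvailable *int32
	Queue        string
}

type RunPolicy struct {
	TTLSecondsAfterFinished *int32
	BackoffLimit            *int32
	SchedulingPolicy        *SchedulingPolicy
}

// copySelectedFields mirrors the removed code path: it builds a new
// RunPolicy from selected fields and explicitly sets SchedulingPolicy to nil.
func copySelectedFields(in RunPolicy) *RunPolicy {
	return &RunPolicy{
		TTLSecondsAfterFinished: in.TTLSecondsAfterFinished,
		BackoffLimit:            in.BackoffLimit,
		SchedulingPolicy:        nil,
	}
}

// passThrough mirrors the new code path: the caller hands over a pointer to
// the RunPolicy stored in the job spec, so every field is forwarded as-is.
func passThrough(in *RunPolicy) *RunPolicy {
	return in
}

func main() {
	minAvailable := int32(2)
	spec := RunPolicy{
		SchedulingPolicy: &SchedulingPolicy{MinAvailable: &minAvailable, Queue: "training"},
	}

	fmt.Println(copySelectedFields(spec).SchedulingPolicy == nil) // true
	fmt.Println(passThrough(&spec).SchedulingPolicy.Queue)        // training
}
```

One consequence of the pointer form is that the reconciler and the job object share the same `RunPolicy` value, so any mutation made downstream would be visible on the job spec; whether that matters depends on how `ReconcileJobs` treats its argument, which this diff does not show.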