-
Notifications
You must be signed in to change notification settings - Fork 701
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-2170: Implement TrainJob Reconciler to manage objects #2295
KEP-2170: Implement TrainJob Reconciler to manage objects #2295
Conversation
b31fc96
to
e0fc3f9
Compare
Pull Request Test Coverage Report for Build 11469279753Warning: This coverage report may be inaccurate.This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls |
e0fc3f9
to
55ac657
Compare
Signed-off-by: Yuki Iwai <[email protected]>
55ac657
to
9e04bdd
Compare
/assign @kubeflow/wg-training-leads |
/hold |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this @tenzen-y!
I left a few comments.
/assign @kubeflow/wg-training-leads @kuizhiqing @akshaychitneni @varshaprasad96
Makefile
Outdated
@@ -127,3 +128,18 @@ controller-gen: ## Download controller-gen locally if necessary. | |||
KUSTOMIZE = $(shell pwd)/bin/kustomize | |||
kustomize: ## Download kustomize locally if necessary. | |||
GOBIN=$(PROJECT_DIR)/bin go install sigs.k8s.io/kustomize/kustomize/[email protected] | |||
|
|||
## Download external CRDs for the integration testings. | |||
EXTERNAL_CRDS_DIR ?= $(PROJECT_DIR)/dep-crds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should put them to /manifests/external-crds
for consistency ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only for integration testing, so these manifests is not stored git. Indeed, this directory is specified in the .gitignore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I understand this.
Just for the developers convenience, I suggested to put them under manifests/external-crds
and add this folder to the .gitignore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. I will move the dep-crds
directory to the manifests/external-crds
directory.
Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
} | ||
|
||
func NewTrainJobReconciler(client client.Client, recorder record.EventRecorder) *TrainJobReconciler { | ||
func NewTrainJobReconciler(client client.Client, recorder record.EventRecorder, runs map[string]runtime.Runtime) *TrainJobReconciler { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's call it runtimes for consistency.
func NewTrainJobReconciler(client client.Client, recorder record.EventRecorder, runs map[string]runtime.Runtime) *TrainJobReconciler { | |
func NewTrainJobReconciler(client client.Client, recorder record.EventRecorder, runtimes map[string]runtime.Runtime) *TrainJobReconciler { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
log := ctrl.LoggerFrom(ctx) | ||
|
||
// Controller assumes the runtime existence has already verified in the webhook on TrainJob creation. | ||
run := r.runtimes[runtimeRefToGroupKind(trainJob.Spec.RuntimeRef).String()] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
run := r.runtimes[runtimeRefToGroupKind(trainJob.Spec.RuntimeRef).String()] | |
runtime := r.runtimes[runtimeRefToGroupKind(trainJob.Spec.RuntimeRef).String()] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Group: ptr.Deref(runtimeRef.APIGroup, ""), | ||
Kind: ptr.Deref(runtimeRef.Kind, ""), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to use Deref here if APIGroup and Kind can't be empty at this stage ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess that we can not restrict these field like non empty. Especially, these should be accepted an empty value so that they can specify the core API type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Especially, these should be accepted an empty value so that they can specify the core API type.
What is the use-case when users can reference the object from Kubernetes Core, like Service or Pods ?
I thought that users can only reference runtimes that being registered within our Training Operator manager.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess that they want to register Pod to the runtimes, and I sometimes have heard from some users the use cases since they often use the plain Pod with many special annotations in the production cluster to mitigate CRD migration costs. For sure, I understand that plain Pod with special annotations is not ideal approach. But, no restrictions would be better in the runtime registration level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, for the Kind, should it be always non-empty ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, for the Kind, should it be always non-empty ?
As runtime specifications, Kind should be always non-empty.
But, there are some ways to avoid the validations like failurePolicy. So, I would recommend leaving this deref.
"namespace", obj.GetNamespace(), | ||
"name", obj.GetName(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we set name and namespace for the build objects ?
E.g. here I can see that we only set the TypeMeta and Spec:
training-operator/pkg/runtime.v2/core/trainingruntime.go
Lines 113 to 119 in 9e04bdd
info := runtime.NewInfo(&jobsetv1alpha2.JobSet{ | |
TypeMeta: metav1.TypeMeta{ | |
APIVersion: jobsetv1alpha2.SchemeGroupVersion.String(), | |
Kind: "JobSet", | |
}, | |
Spec: *jobSetTemplateSpec.Spec.DeepCopy(), | |
}, opts...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we do. Actual resource names are set in each plugins in the following:
JobSet:
training-operator/pkg/runtime.v2/framework/plugins/jobset/builder.go
Lines 42 to 43 in 9e04bdd
Namespace: objectKey.Namespace, | |
Name: objectKey.Name, |
PodGroup:
training-operator/pkg/runtime.v2/framework/plugins/coscheduling/coscheduling.go
Lines 121 to 122 in 9e04bdd
Name: trainJob.Name, | |
Namespace: trainJob.Namespace, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why for JobSet plugin you want to have separate Builder, but for co-scheduling you create a PodSpec object directly under Build()
API ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why for JobSet plugin you want to have separate Builder, but for co-scheduling you create a PodSpec object directly under Build() API ?
That is simply reason. There are so many fields and nested levels in the JobSet, but the PodGroup has a few field and a few nested level. For sure, we can provide the builder in the PodGroup as well. But, I guess that the PodGroup builder seems to be slightly overkill.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good!
}, | ||
Spec: raw.Spec, | ||
}) | ||
oldJobSet = nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need this ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we need this for the needsCreateOrUpdate
function.
switch { | ||
case created: | ||
log.V(5).Info("Succeeded to create object", logKeysAndValues) | ||
continue | ||
case client.IgnoreAlreadyExists(creationErr) != nil: | ||
return creationErr | ||
default: | ||
// This indicates CREATE operation has not been performed or the object has already existed in the cluster. | ||
if err = r.client.Update(ctx, obj); err != nil { | ||
return err | ||
} | ||
log.V(5).Info("Succeeded to update object", logKeysAndValues) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to make this condition simpler ? E.g. if GetResourceVersion != ""
we know that object exists and we should perform UPDATE operation, otherwise we should just perform CREATE operation.
If CREATE or UPDATE action fails, we return the error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that we need to check if the creationErr
is AlreadyExists error. Because we do not know if the runtime interface returns objects with resourceVersion for UPDATE operation, and can not guarantee to have the resourceVersion as an implementation level.
After we migrate to the SSA patch (#2297), we can avoid this complexity and just issue the SSA patch for CREATE and UPDATE.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, that makes sense!
Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: Yuki Iwai <[email protected]>
@andreyvelich I addressed all your comments. PTAL, thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @tenzen-y!
Feel free to unblock it
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andreyvelich The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Thank you for the review. @varshaprasad96 FYI, now you can start SSA patch implementations. |
What this PR does / why we need it:
I implemented the object creation and update in the TrainJob reconciler.
I will implement the TrainJob status operation in a follow-up PR.
Additionally, I restricted the directories in
make test
so that we can perform tests only for the v1 APIs and controllers.Which issue(s) this PR fixes (optional, in
Fixes #<issue number>, #<issue number>, ...
format, will close the issue(s) when PR gets merged):Part-of #2207
Relates to #2170
Checklist: