Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-2170: Implement TrainJob Reconciler to manage objects #2295

Merged

Conversation

tenzen-y
Copy link
Member

@tenzen-y tenzen-y commented Oct 20, 2024

What this PR does / why we need it:
I implemented the object creation and update in the TrainJob reconciler.
I will implement the TrainJob status operation in a follow-up PR.

Additionally, I restricted the directories in make test so that we can perform tests only for the v1 APIs and controllers.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Part-of #2207
Relates to #2170

Checklist:

  • Docs included if any changes are user facing

@coveralls
Copy link

coveralls commented Oct 20, 2024

Pull Request Test Coverage Report for Build 11469279753

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 7 of 7 (100.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals Coverage Status
Change from base Build 11426928089: 0.0%
Covered Lines: 77
Relevant Lines: 77

💛 - Coveralls

@tenzen-y tenzen-y force-pushed the implement-trainjob-reconciler branch from e0fc3f9 to 55ac657 Compare October 20, 2024 17:20
@tenzen-y tenzen-y force-pushed the implement-trainjob-reconciler branch from 55ac657 to 9e04bdd Compare October 20, 2024 17:30
@tenzen-y
Copy link
Member Author

/assign @kubeflow/wg-training-leads

@tenzen-y
Copy link
Member Author

/hold

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this @tenzen-y!
I left a few comments.
/assign @kubeflow/wg-training-leads @kuizhiqing @akshaychitneni @varshaprasad96

Makefile Outdated
@@ -127,3 +128,18 @@ controller-gen: ## Download controller-gen locally if necessary.
KUSTOMIZE = $(shell pwd)/bin/kustomize
kustomize: ## Download kustomize locally if necessary.
GOBIN=$(PROJECT_DIR)/bin go install sigs.k8s.io/kustomize/kustomize/[email protected]

## Download external CRDs for the integration testings.
EXTERNAL_CRDS_DIR ?= $(PROJECT_DIR)/dep-crds
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should put them to /manifests/external-crds for consistency ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only for integration testing, so these manifests is not stored git. Indeed, this directory is specified in the .gitignore.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I understand this.
Just for the developers convenience, I suggested to put them under manifests/external-crds and add this folder to the .gitignore

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. I will move the dep-crds directory to the manifests/external-crds directory.
Thanks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}

func NewTrainJobReconciler(client client.Client, recorder record.EventRecorder) *TrainJobReconciler {
func NewTrainJobReconciler(client client.Client, recorder record.EventRecorder, runs map[string]runtime.Runtime) *TrainJobReconciler {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's call it runtimes for consistency.

Suggested change
func NewTrainJobReconciler(client client.Client, recorder record.EventRecorder, runs map[string]runtime.Runtime) *TrainJobReconciler {
func NewTrainJobReconciler(client client.Client, recorder record.EventRecorder, runtimes map[string]runtime.Runtime) *TrainJobReconciler {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

log := ctrl.LoggerFrom(ctx)

// Controller assumes the runtime existence has already verified in the webhook on TrainJob creation.
run := r.runtimes[runtimeRefToGroupKind(trainJob.Spec.RuntimeRef).String()]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
run := r.runtimes[runtimeRefToGroupKind(trainJob.Spec.RuntimeRef).String()]
runtime := r.runtimes[runtimeRefToGroupKind(trainJob.Spec.RuntimeRef).String()]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines +112 to +113
Group: ptr.Deref(runtimeRef.APIGroup, ""),
Kind: ptr.Deref(runtimeRef.Kind, ""),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to use Deref here if APIGroup and Kind can't be empty at this stage ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that we can not restrict these field like non empty. Especially, these should be accepted an empty value so that they can specify the core API type.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Especially, these should be accepted an empty value so that they can specify the core API type.

What is the use-case when users can reference the object from Kubernetes Core, like Service or Pods ?
I thought that users can only reference runtimes that being registered within our Training Operator manager.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that they want to register Pod to the runtimes, and I sometimes have heard from some users the use cases since they often use the plain Pod with many special annotations in the production cluster to mitigate CRD migration costs. For sure, I understand that plain Pod with special annotations is not ideal approach. But, no restrictions would be better in the runtime registration level.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, for the Kind, should it be always non-empty ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, for the Kind, should it be always non-empty ?

As runtime specifications, Kind should be always non-empty.
But, there are some ways to avoid the validations like failurePolicy. So, I would recommend leaving this deref.

Comment on lines +82 to +83
"namespace", obj.GetNamespace(),
"name", obj.GetName(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we set name and namespace for the build objects ?
E.g. here I can see that we only set the TypeMeta and Spec:

info := runtime.NewInfo(&jobsetv1alpha2.JobSet{
TypeMeta: metav1.TypeMeta{
APIVersion: jobsetv1alpha2.SchemeGroupVersion.String(),
Kind: "JobSet",
},
Spec: *jobSetTemplateSpec.Spec.DeepCopy(),
}, opts...)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we do. Actual resource names are set in each plugins in the following:

JobSet:

Namespace: objectKey.Namespace,
Name: objectKey.Name,

PodGroup:
Name: trainJob.Name,
Namespace: trainJob.Namespace,

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why for JobSet plugin you want to have separate Builder, but for co-scheduling you create a PodSpec object directly under Build() API ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why for JobSet plugin you want to have separate Builder, but for co-scheduling you create a PodSpec object directly under Build() API ?

That is simply reason. There are so many fields and nested levels in the JobSet, but the PodGroup has a few field and a few nested level. For sure, we can provide the builder in the PodGroup as well. But, I guess that the PodGroup builder seems to be slightly overkill.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good!

},
Spec: raw.Spec,
})
oldJobSet = nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need this ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we need this for the needsCreateOrUpdate function.

Comment on lines +93 to +105
switch {
case created:
log.V(5).Info("Succeeded to create object", logKeysAndValues)
continue
case client.IgnoreAlreadyExists(creationErr) != nil:
return creationErr
default:
// This indicates CREATE operation has not been performed or the object has already existed in the cluster.
if err = r.client.Update(ctx, obj); err != nil {
return err
}
log.V(5).Info("Succeeded to update object", logKeysAndValues)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to make this condition simpler ? E.g. if GetResourceVersion != "" we know that object exists and we should perform UPDATE operation, otherwise we should just perform CREATE operation.
If CREATE or UPDATE action fails, we return the error.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that we need to check if the creationErr is AlreadyExists error. Because we do not know if the runtime interface returns objects with resourceVersion for UPDATE operation, and can not guarantee to have the resourceVersion as an implementation level.

After we migrate to the SSA patch (#2297), we can avoid this complexity and just issue the SSA patch for CREATE and UPDATE.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, that makes sense!

@tenzen-y
Copy link
Member Author

@andreyvelich I addressed all your comments. PTAL, thanks.

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tenzen-y!
Feel free to unblock it
/lgtm
/approve

Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tenzen-y
Copy link
Member Author

Thanks @tenzen-y!

Feel free to unblock it

/lgtm

/approve

Thank you for the review.
/hold cancel

@varshaprasad96 FYI, now you can start SSA patch implementations.

@google-oss-prow google-oss-prow bot merged commit 74b093e into kubeflow:master Oct 23, 2024
40 of 41 checks passed
@tenzen-y tenzen-y deleted the implement-trainjob-reconciler branch October 23, 2024 02:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants