Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-2170: Replace UPSERT operation for the objects with SSA PATCH #2297

Open
Tracked by #2170
tenzen-y opened this issue Oct 20, 2024 · 7 comments
Open
Tracked by #2170

KEP-2170: Replace UPSERT operation for the objects with SSA PATCH #2297

tenzen-y opened this issue Oct 20, 2024 · 7 comments
Assignees

Comments

@tenzen-y
Copy link
Member

What you would like to be added?

Currently, the TrainJob reconciler does UPSERT operations to create or update objects.

// TODO (tenzen-y): Ideally, we should use the SSA instead of checking existence.
// Non-empty resourceVersion indicates UPDATE operation.
var creationErr error
var created bool
if obj.GetResourceVersion() == "" {
creationErr = r.client.Create(ctx, obj)
created = creationErr == nil
}
switch {
case created:
log.V(5).Info("Succeeded to create object", logKeysAndValues)
continue
case client.IgnoreAlreadyExists(creationErr) != nil:
return creationErr
default:
// This indicates CREATE operation has not been performed or the object has already existed in the cluster.
if err = r.client.Update(ctx, obj); err != nil {
return err
}
log.V(5).Info("Succeeded to update object", logKeysAndValues)
}

Doing only the SSA PATCH operation for CREATE and UPDATE would be great.

Why is this needed?

Eliminating UPSERT operations could mitigate the complexity of the object operating mechanism, and improve performance by reduced API calls.

Love this feature?

Give it a 👍 We prioritize the features with most 👍

@tenzen-y
Copy link
Member Author

I'm wondering if @varshaprasad96 has knowledge and is interested in this enhancement.

@tenzen-y
Copy link
Member Author

/remove-label lifecycle/needs-triage

@varshaprasad96
Copy link
Contributor

@tenzen-y Sure, I can take this up, SSA would help in managing conflicts better.

@varshaprasad96
Copy link
Contributor

/assign

@tenzen-y
Copy link
Member Author

@tenzen-y Sure, I can take this up, SSA would help in managing conflicts better.

Thank you for taking this issue.

One question is I'm curious if we should take a similar approach as cluster-api so that we can directly handle the client.Object and reduce unneeded reconciliation. But, now, I'm not clear which fields we should drop or add: https://github.com/kubernetes-sigs/cluster-api/blob/578b70f79659003a005f390cc022cf17f151cebc/internal/util/ssa/patch.go#L64

@varshaprasad96
Copy link
Contributor

varshaprasad96 commented Oct 21, 2024

I'm curious if we should take a similar approach as cluster-api so that we can directly handle the client.Object and reduce unneeded reconciliation.

IIUC, CAPI uses dry run and a cache specifically for SSA to determine if update request is to be sent by calculating the diff b/w server and expected state. This definitely has benefits - especially given it is used in places where multiple controllers are acting on the same object so no. of reconcile calls are indeed a load on API server (I'm not super familiar with the CAPI controllers implementation, so feel free to correct me if I'm wrong).

I'm curious if reconciliations on trainJob are going to be so frequent enough that we need to implement this in the first iteration (given they are majorly being batch workloads). How about we just use SSA with ApplyConfigurations for now, and proceed on to implementing a caching layer in follow ups if frequent reconciliation is turning out to be an issue. Meanwhile I would also be curious in general on how caching metrics for ssa cache turn out in CAPI (kubernetes-sigs/cluster-api#10527).

@tenzen-y
Copy link
Member Author

IIUC, CAPI uses dry run and a cache specifically for SSA to determine if update request is to be sent by calculating the diff b/w server and expected state.

That is the same with my understanding to CAPI SSA PATCH mechanism.

How about we just use SSA with ApplyConfigurations for now, and proceed on to implementing a caching layer in follow ups if frequent reconciliation is turning out to be an issue.

Actually, the dry run calculation mechanism could bring us to mitigate conflicts since a set of resources (client.Object) like JobSet is interested by TrainJob controllers and JobSet controllers.
However, I agree that we start from simplified ApplyConfiguration mechanism, and postpone the SSA cache mechanism in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

No branches or pull requests

2 participants