KEP-2170: Replace UPSERT operation for the objects with SSA PATCH #2297

tenzen-y · 2024-10-20T18:29:52Z

What you would like to be added?

Currently, the TrainJob reconciler does UPSERT operations to create or update objects.

training-operator/pkg/controller.v2/trainjob_controller.go

Lines 85 to 105 in 9e04bdd

    
    // TODO (tenzen-y): Ideally, we should use the SSA instead of checking existence.
    // Non-empty resourceVersion indicates UPDATE operation.
    var creationErr error
    var created bool
    if obj.GetResourceVersion() == "" {
        creationErr = r.client.Create(ctx, obj)
        created = creationErr == nil
    }
    switch {
    case created:
        log.V(5).Info("Succeeded to create object", logKeysAndValues)
        continue
    case client.IgnoreAlreadyExists(creationErr) != nil:
        return creationErr
    default:
        // This indicates CREATE operation has not been performed or the object has already existed in the cluster.
        if err = r.client.Update(ctx, obj); err != nil {
            return err
        }
        log.V(5).Info("Succeeded to update object", logKeysAndValues)
    }

Doing only the SSA PATCH operation for CREATE and UPDATE would be great.

Why is this needed?

Eliminating the UPSERT logic would reduce the complexity of how the reconciler manages objects and improve performance by reducing the number of API calls.
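To make the API-call argument concrete, here is a minimal sketch that counts requests against a toy in-memory server. Everything in it (`fakeAPI`, `upsert`, the object name) is hypothetical and only models the worst case, where the desired object is built fresh with an empty resourceVersion on every reconcile; it does not use the real client.

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyExists mimics the AlreadyExists error returned by CREATE.
var errAlreadyExists = errors.New("already exists")

// fakeAPI is a hypothetical in-memory stand-in for the API server.
// It only stores objects by name and counts incoming requests.
type fakeAPI struct {
	objects map[string]string
	calls   int
}

func newFakeAPI() *fakeAPI {
	return &fakeAPI{objects: map[string]string{}}
}

// Create fails if the object already exists, like a real CREATE.
func (a *fakeAPI) Create(name, spec string) error {
	a.calls++
	if _, ok := a.objects[name]; ok {
		return errAlreadyExists
	}
	a.objects[name] = spec
	return nil
}

// Update overwrites an existing object.
func (a *fakeAPI) Update(name, spec string) error {
	a.calls++
	if _, ok := a.objects[name]; !ok {
		return errors.New("not found")
	}
	a.objects[name] = spec
	return nil
}

// Apply creates or updates in a single request, like an SSA PATCH.
func (a *fakeAPI) Apply(name, spec string) error {
	a.calls++
	a.objects[name] = spec
	return nil
}

// upsert mirrors the CREATE-then-UPDATE flow of the quoted reconciler
// for an object whose resourceVersion is empty.
func upsert(a *fakeAPI, name, spec string) error {
	err := a.Create(name, spec)
	if err == nil {
		return nil
	}
	if !errors.Is(err, errAlreadyExists) {
		return err
	}
	return a.Update(name, spec)
}

func main() {
	viaUpsert, viaApply := newFakeAPI(), newFakeAPI()
	// Reconcile the same object three times with each strategy.
	for i := 0; i < 3; i++ {
		spec := fmt.Sprintf("replicas=%d", i+1)
		if err := upsert(viaUpsert, "jobset-a", spec); err != nil {
			panic(err)
		}
		if err := viaApply.Apply("jobset-a", spec); err != nil {
			panic(err)
		}
	}
	fmt.Println("upsert calls:", viaUpsert.calls)
	fmt.Println("apply calls:", viaApply.calls)
}
```

In this toy model, every reconcile after the first costs two requests with the upsert flow and one with apply. The real saving depends on how often the reconciler actually hits the AlreadyExists path.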

Love this feature?

Give it a 👍 We prioritize the features with most 👍

tenzen-y · 2024-10-20T18:31:01Z

I'm wondering if @varshaprasad96 has knowledge and is interested in this enhancement.

tenzen-y · 2024-10-20T18:38:29Z

/remove-label lifecycle/needs-triage

varshaprasad96 · 2024-10-20T20:18:11Z

@tenzen-y Sure, I can take this up, SSA would help in managing conflicts better.

varshaprasad96 · 2024-10-20T20:18:16Z

/assign

tenzen-y · 2024-10-21T00:42:00Z

> @tenzen-y Sure, I can take this up, SSA would help in managing conflicts better.

Thank you for taking this issue.

One question: I'm curious if we should take a similar approach as cluster-api so that we can directly handle the client.Object and reduce unneeded reconciliation. For now, though, I'm not sure which fields we should drop or add: https://github.com/kubernetes-sigs/cluster-api/blob/578b70f79659003a005f390cc022cf17f151cebc/internal/util/ssa/patch.go#L64

varshaprasad96 · 2024-10-21T20:19:30Z

> I'm curious if we should take a similar approach as cluster-api so that we can directly handle the client.Object and reduce unneeded reconciliation.

IIUC, CAPI uses dry run and a cache specifically for SSA to determine if update request is to be sent by calculating the diff b/w server and expected state. This definitely has benefits, especially since it is used where multiple controllers act on the same object, so the number of reconcile calls is indeed a load on the API server (I'm not very familiar with the CAPI controller implementation, so feel free to correct me if I'm wrong).

I wonder whether reconciliations on TrainJob will be frequent enough that we need to implement this in the first iteration (given that they are mostly batch workloads). How about we just use SSA with ApplyConfigurations for now, and proceed on to implementing a caching layer in follow ups if frequent reconciliation is turning out to be an issue. Meanwhile, I would also be curious in general about how the caching metrics for the SSA cache turn out in CAPI (kubernetes-sigs/cluster-api#10527).
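For reference, the caching-layer idea can be sketched roughly as follows. This is not CAPI's implementation: it is a simplified request cache keyed on a hash of the desired state, it performs no dry run, and it ignores server-side changes (a real cache would also need to account for something like the object's resourceVersion). The names `desired`, `cachingApplier`, and `send` are all hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// desired is a hypothetical stand-in for the object a controller wants to apply.
type desired struct {
	Name string            `json:"name"`
	Spec map[string]string `json:"spec"`
}

// cachingApplier skips an apply request when the same desired state was
// already applied successfully. send is an assumed callback that would
// issue the real SSA PATCH.
type cachingApplier struct {
	seen map[string]string // object name -> hash of last applied desired state
	send func(d desired) error
	sent int
}

func newCachingApplier(send func(d desired) error) *cachingApplier {
	return &cachingApplier{seen: map[string]string{}, send: send}
}

// hashOf returns a stable hash of the desired state. encoding/json sorts
// map keys, so equal specs produce equal hashes.
func hashOf(d desired) (string, error) {
	raw, err := json.Marshal(d)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:]), nil
}

// Apply reports whether a request was actually sent.
func (c *cachingApplier) Apply(d desired) (bool, error) {
	h, err := hashOf(d)
	if err != nil {
		return false, err
	}
	if c.seen[d.Name] == h {
		return false, nil // unchanged since the last successful apply
	}
	if err := c.send(d); err != nil {
		return false, err
	}
	c.seen[d.Name] = h
	c.sent++
	return true, nil
}

func main() {
	applier := newCachingApplier(func(d desired) error {
		fmt.Println("PATCH sent for", d.Name)
		return nil
	})
	d := desired{Name: "jobset-a", Spec: map[string]string{"replicas": "2"}}
	for i := 0; i < 3; i++ {
		sent, err := applier.Apply(d)
		if err != nil {
			panic(err)
		}
		fmt.Println("reconcile", i, "sent:", sent)
	}
	d.Spec = map[string]string{"replicas": "3"}
	if _, err := applier.Apply(d); err != nil {
		panic(err)
	}
	fmt.Println("requests sent:", applier.sent)
}
```

Starting without this layer keeps the first iteration small, and it can be added later behind the same apply call if reconcile volume turns out to matter.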

tenzen-y · 2024-10-21T20:59:12Z

> IIUC, CAPI uses dry run and a cache specifically for SSA to determine if update request is to be sent by calculating the diff b/w server and expected state.

That matches my understanding of the CAPI SSA PATCH mechanism.

> How about we just use SSA with ApplyConfigurations for now, and proceed on to implementing a caching layer in follow ups if frequent reconciliation is turning out to be an issue.

Actually, the dry-run calculation mechanism could help us mitigate conflicts, since resources (client.Object) such as JobSet are of interest to both the TrainJob controller and the JobSet controller.
However, I agree that we should start with the simplified ApplyConfiguration mechanism and postpone the SSA cache mechanism to a later iteration.
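As a rough illustration of the conflict concern, here is a toy model of per-field ownership between two field managers. It is a sketch under simplifying assumptions, not the API server's algorithm: it omits, for example, the removal of fields that a manager stops applying. The manager names and the `object`/`field` types are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

var errConflict = errors.New("conflict with another field manager")

// field is a value plus the set of managers that have applied it.
type field struct {
	value  string
	owners map[string]bool
}

// object is a toy API object with per-field ownership.
type object struct {
	fields map[string]field
}

func newObject() *object {
	return &object{fields: map[string]field{}}
}

// ownedByOther reports whether any manager other than the given one owns f.
func ownedByOther(f field, manager string) bool {
	for m := range f.owners {
		if m != manager {
			return true
		}
	}
	return false
}

// apply sets the desired fields on behalf of manager. Changing a value owned
// by another manager is a conflict unless force is set, in which case the
// applying manager takes over ownership.
func (o *object) apply(manager string, desired map[string]string, force bool) error {
	// Detect conflicts first so that a rejected apply changes nothing.
	for name, value := range desired {
		cur, exists := o.fields[name]
		if exists && cur.value != value && ownedByOther(cur, manager) && !force {
			return fmt.Errorf("%w: field %q", errConflict, name)
		}
	}
	for name, value := range desired {
		cur, exists := o.fields[name]
		if exists && cur.value == value {
			cur.owners[manager] = true // same value: ownership is shared
			continue
		}
		o.fields[name] = field{value: value, owners: map[string]bool{manager: true}}
	}
	return nil
}

func main() {
	jobSet := newObject()
	if err := jobSet.apply("trainjob-controller", map[string]string{"replicas": "2"}, false); err != nil {
		panic(err)
	}
	err := jobSet.apply("jobset-controller", map[string]string{"replicas": "3"}, false)
	fmt.Println("conflict:", errors.Is(err, errConflict))
	err = jobSet.apply("jobset-controller", map[string]string{"replicas": "3"}, true)
	fmt.Println("forced apply error:", err, "value:", jobSet.fields["replicas"].value)
}
```

If each controller applies only the fields it cares about, the two never collide. Conflicts appear only when both try to set the same field to different values.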

tenzen-y added kind/feature lifecycle/needs-triage labels Oct 20, 2024

tenzen-y added this to KEP-2170: Kubeflow Training V2 API Oct 20, 2024

github-project-automation bot moved this to Todo in KEP-2170: Kubeflow Training V2 API Oct 20, 2024

google-oss-prow bot removed the lifecycle/needs-triage label Oct 20, 2024

tenzen-y mentioned this issue Oct 20, 2024

KEP-2170: Kubeflow Training V2 API #2170

Open

18 tasks

google-oss-prow bot assigned varshaprasad96 Oct 20, 2024

tenzen-y mentioned this issue Oct 21, 2024

KEP-2170: Implement TrainJob Reconciler to manage objects #2295

Merged

1 task
