reconciler/managed: only debug log transient conflict errors PT.2 #540

phisco · 2023-09-11T10:10:45Z

Description of your changes

Follow-up PR to #498.

My understanding of the original PR is that the goal was to properly handle any conflict error by logging it in debug mode and re-queueing it.

To fully achieve what the original PR was trying to achieve we have to first understand that the whole reconciliation loop is built by repeatedly going through the following steps:

do something, potentially updating the reconciled resource, which might result in a conflict error
in case of any error from the previous step, log it in debug, set it as condition and then update the status of the resource, which might result in a conflict error too.

In order to actually address all "object has been modified", we need to properly handle the errors from both the steps above.

The original PR already addressed the second step, while to handle the first step's errors I've updated all error handling branches by checking whether the returned error is a conflict error after having logged it at debug level, but before proceeding with the event emitting and conditions updating phase.

I have:

Read and followed Crossplane's contribution process.
Run make reviewable test to ensure this PR is ready for review.

How has this code been tested

make reviewable

Signed-off-by: Philippe Scorsolini <[email protected]>

negz · 2023-09-18T17:59:43Z

It seems like now we have both a defer handling kerrors.IsConflict and explicit returns. Can we use only explicit returns? For a method as large as this one I would feel more comfortable not having the defer - it seems easy to miss that it exists.

phisco · 2023-09-18T18:12:44Z

@negz, no strong opinions there, the defer would be impossible to bypass by mistake, instead it would be easy to forget adding yet another if kerrors.IsConflict....

turkenh · 2023-09-19T04:04:53Z

I don't have a strong opinion as well, but +1 for being consistent and leaning on explicit returns.

Also, I am wondering if it is time to consider refactoring the logic with a dedicated handleError method (or even an ErrorHandler struct) 🤔 Especially if we go with explicit returns, there will be a lot of duplicate code, which would be quite error-prone.

phisco · 2023-09-20T07:46:28Z

I already tried refactoring it, but didn't find a pleasant solution given the many different behaviours we have, but I was trying to be conservative, I'll give it another shot with a more aggressive refactoring 👍

phisco · 2023-09-20T12:21:25Z

The only thing I can think of would be refactoring it to have a callback function to be called only to handle the error after having "filtered" it, something like this:

// Reconcile a managed resource with an external resource.
func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) { //nolint:gocyclo // See note below.
	log := r.log.WithValues("request", req)
	errHandleFunc, result, err := r.reconcile(ctx, req)
	if err == nil {
		return result, nil
	}
	if kerrors.IsConflict(errors.Cause(err)) {
		// conflict errors are transient in Kubernetes and completely normal.
		// The right reaction is to requeue, but not record the object as error'ing
		// or creating events.
		log.Debug("Transient conflict error", "error", err)
		return reconcile.Result{Requeue: true}, nil
	}
	if errHandleFunc != nil {
		followupErr := errHandleFunc()
		if kerrors.IsConflict(errors.Cause(followupErr)) {
			// conflict errors are transient in Kubernetes and completely normal.
			// The right reaction is to requeue, but not record the object as error'ing
			// or creating events.
			log.Debug("Transient conflict error while handling error", "error", followupErr)
			return reconcile.Result{Requeue: true}, nil
		}
		if followupErr != nil {
			err = errors.Join(err, followupErr)
		}
	}
	return result, err
}

func (r *Reconciler) reconcile(ctx context.Context, req reconcile.Request) (func() error, reconcile.Result, error) {
	// NOTE(negz): This method is a well over our cyclomatic complexity goal.
	// Be wary of adding additional complexity.

	log := r.log.WithValues("request", req)

	log.Debug("Reconciling")

	ctx, cancel := context.WithTimeout(ctx, r.timeout+reconcileGracePeriod)
	defer cancel()

	// Govet linter has a check for lost cancel funcs but it's a false positive
	// for child contexts as because parent's cancel is called, so we skip it
	// for this line.
	externalCtx, _ := context.WithTimeout(ctx, r.timeout) //nolint:govet // See note above.

	managed := r.newManaged()
	if err := r.client.Get(ctx, req.NamespacedName, managed); err != nil {
		// There's no need to requeue if we no longer exist. Otherwise we'll be
		// requeued implicitly because we return an error.
		log.Debug("Cannot get managed resource", "error", err)
		return nil, reconcile.Result{}, errors.Wrap(resource.IgnoreNotFound(err), errGetManaged)
	}

	record := r.record.WithAnnotations("external-name", meta.GetExternalName(managed))
	log = log.WithValues(
		"uid", managed.GetUID(),
		"version", managed.GetResourceVersion(),
		"external-name", meta.GetExternalName(managed),
	)

	managementPoliciesEnabled := r.features.Enabled(feature.EnableAlphaManagementPolicies)
	if managementPoliciesEnabled {
		log.WithValues("managementPolicies", managed.GetManagementPolicies())
	}

	// Create the management policy resolver which will assist us in determining
	// what actions to take on the managed resource based on the management
	// and deletion policies.
	policy := NewManagementPoliciesResolver(managementPoliciesEnabled, managed.GetManagementPolicies(), managed.GetDeletionPolicy(), WithSupportedManagementPolicies(r.supportedManagementPolicies))

	// Check if the resource has paused reconciliation based on the
	// annotation or the management policies.
	// Log, publish an event and update the SYNC status condition.
	if meta.IsPaused(managed) || policy.IsPaused() {
		log.Debug("Reconciliation is paused either through the `spec.managementPolicies` or the pause annotation", "annotation", meta.AnnotationKeyReconciliationPaused)
		return func() error {
			record.Event(managed, event.Normal(reasonReconciliationPaused, "Reconciliation is paused either through the `spec.managementPolicies` or the pause annotation",
				"annotation", meta.AnnotationKeyReconciliationPaused))
			managed.SetConditions(xpv1.ReconcilePaused())
			// if the pause annotation is removed or the management policies changed, we will have a chance to reconcile
			// again and resume and if status update fails, we will reconcile again to retry to update the status
			return errors.Wrap(r.client.Status().Update(ctx, managed), errUpdateManagedStatus)
		}, reconcile.Result{}, nil
	}
...

I think it's even worst than having the defer + the explicit check for each error branch...

sttts · 2023-09-25T14:04:16Z

Related: #556

func Reconcile(...) (reconcile.Result, error) {
    ...

    return errors.SilentlyRequeueOnConflict(reconcile.Result{}, r.hcClient.Patch(ctx, obj, client.MergeFrom(orig)))
}

and

errors.WithSilentRequeueOnConflict(NewReconciler())

Doesn't replace the intermediate client call check with kerrors.IsConflicts, but especially the second variant makes it easy to wire this centrallly without too much noise.

sttts · 2023-09-25T14:06:13Z

@phisco actually your proposal with Reconcile and reconcile is not far from mine. Just realizing that. I only put the logic into a helper.

phisco · 2023-09-25T14:31:11Z

@sttts, I don't think the helper in #556 is actually doing the same thing, I don't think we want to emit the event, nor try updating the status with a condition in case of conflicts on the first level of errors. That's why I'm returning a function as a callback in the proposal above, which is only going to be called in case of a "good" error, while with the helper you mention you would still emit the event and only catch the error on the status update, so you would only replace the defer you originally introduced in #498. No?

sttts · 2023-09-25T15:18:35Z

Oh, you are right. Looked too briefly on the code before, missed the function bit. I agree, that's awkward.

My 2 cents are on 1. #556 for the non-error updates of a reconciler 2. add the if IsConflict blocks for the intermediate update as you have proposed here.

These intermediate updates are not so common in traditional reconcilers. Here, for the external client one the added complexity is fine IMO. It does some kind of checkpointing which make this necessary.

phisco · 2023-09-25T16:25:01Z

@sttts feel free to use the newly introduced helper in #556 and remove the defer here if that's how you envision using it.

sttts · 2023-09-26T10:23:06Z

feel free to use the newly introduced helper in #556 and remove the defer here if that's how you envision using it.

@phisco done

phisco · 2023-09-26T11:24:51Z

@negz, I'd say we can handle refactoring the defer in #556, so the changes in this PR should be good to merge, no?

sttts · 2023-09-26T13:45:22Z

@lsviben please also take a look here. That's the necessary counterpart for our custom error handling in the external client reconciler.

lsviben

Ok if I understand this correctly:
This PR will take care of not publishing events or setting conditions in case of conflicts. And by returning Reque: true, nil also skipping logging the error
#556 will take care of not logging those conflicts which come out of errors this PR could have missed, or anywhere an conflict could be returned? Plus adding the mechanism to apply this to other reconcilers.

I see how #556 could be seen as unneeded in most cases, but I like that we cover everything, so I would be for merging both although I dont feel strongly about it.

phisco · 2023-09-26T14:10:55Z

@lsviben #556 is just refactoring what we already have in the defer to use the new helper function. This PR is stopping the event emission/status updating to happen in case of conflict errors, while the defer is doing the same thing but given that we close most of the error branches with a status update setting the condition, it's just going to catch those, instead of actually handling the original error.

lsviben · 2023-09-26T14:23:14Z

@lsviben #556 is just refactoring what we already have in the defer to use the new helper function. This PR is stopping the event emission/status updating to happen in case of conflict errors, while the defer is doing the same thing but given that we close most of the error branches with a status update setting the condition, it's just going to catch those, instead of actually handling the original error.

Ah i was refering to having or removing the defer at all, but miscommunicated that :).

Gotcha, missed the status updates which the defer will catch, nice.

All good from me then 👍

sttts · 2024-01-27T14:48:50Z

pkg/reconciler/managed/reconciler.go

@@ -991,6 +1027,9 @@ func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (resu
 			// issue we'll be requeued implicitly when we update our status with
 			// the new error condition. If not, we requeue explicitly, which will trigger backoff.
 			log.Debug("Cannot create external resource", "error", err)
+			if kerrors.IsConflict(err) {
+				return reconcile.Result{Requeue: true}, nil
+			}


this is a return of death. The object won't have the critical failed/succeeded annotation yet. With the pending annotation only, the reconciler will refuse to proceed with this object.

phisco requested review from sttts and negz September 11, 2023 10:10

phisco requested review from a team as code owners September 11, 2023 10:10

phisco requested a review from nullable-eth September 11, 2023 10:10

phisco changed the title ~~Fix/conflict errors remake 2~~ reconciler/managed: only debug log transient conflict errors PT.2 Sep 11, 2023

reconciler: requeue without error on conflict

ce49d22

Signed-off-by: Philippe Scorsolini <[email protected]>

phisco force-pushed the fix/conflict-errors-remake-2 branch from 7a84718 to ce49d22 Compare September 11, 2023 15:26

lsviben approved these changes Sep 26, 2023

View reviewed changes

sttts approved these changes Sep 26, 2023

View reviewed changes

phisco merged commit c05e74a into crossplane:master Sep 26, 2023
8 of 9 checks passed

lsviben mentioned this pull request Oct 3, 2023

fix missing import #565

Merged

2 tasks

phisco mentioned this pull request Oct 9, 2023

fix: just requeue all conflict errors crossplane/crossplane#4758

Merged

5 tasks

sttts reviewed Jan 27, 2024

View reviewed changes

sttts mentioned this pull request Jan 27, 2024

reconciler/managed: make more resilient to error conditions #651

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

reconciler/managed: only debug log transient conflict errors PT.2 #540

reconciler/managed: only debug log transient conflict errors PT.2 #540

phisco commented Sep 11, 2023 •

edited

Loading

negz commented Sep 18, 2023

phisco commented Sep 18, 2023

turkenh commented Sep 19, 2023

phisco commented Sep 20, 2023

phisco commented Sep 20, 2023 •

edited

Loading

sttts commented Sep 25, 2023

sttts commented Sep 25, 2023

phisco commented Sep 25, 2023

sttts commented Sep 25, 2023

phisco commented Sep 25, 2023

sttts commented Sep 26, 2023

phisco commented Sep 26, 2023

sttts commented Sep 26, 2023

lsviben left a comment

phisco commented Sep 26, 2023

lsviben commented Sep 26, 2023

sttts Jan 27, 2024

sttts Jan 27, 2024

reconciler/managed: only debug log transient conflict errors PT.2 #540

reconciler/managed: only debug log transient conflict errors PT.2 #540

Conversation

phisco commented Sep 11, 2023 • edited Loading

Description of your changes

How has this code been tested

negz commented Sep 18, 2023

phisco commented Sep 18, 2023

turkenh commented Sep 19, 2023

phisco commented Sep 20, 2023

phisco commented Sep 20, 2023 • edited Loading

sttts commented Sep 25, 2023

sttts commented Sep 25, 2023

phisco commented Sep 25, 2023

sttts commented Sep 25, 2023

phisco commented Sep 25, 2023

sttts commented Sep 26, 2023

phisco commented Sep 26, 2023

sttts commented Sep 26, 2023

lsviben left a comment

Choose a reason for hiding this comment

phisco commented Sep 26, 2023

lsviben commented Sep 26, 2023

sttts Jan 27, 2024

Choose a reason for hiding this comment

sttts Jan 27, 2024

Choose a reason for hiding this comment

phisco commented Sep 11, 2023 •

edited

Loading

phisco commented Sep 20, 2023 •

edited

Loading