Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reconciler/managed: only debug log transient conflict errors PT.2 #540

Merged
merged 1 commit into from
Sep 26, 2023

Conversation

phisco
Copy link
Contributor

@phisco phisco commented Sep 11, 2023

Description of your changes

Follow-up PR to #498.

My understanding of the original PR is that the goal was to properly handle any conflict error by logging it in debug mode and re-queueing it.

To fully achieve what the original PR was trying to achieve we have to first understand that the whole reconciliation loop is built by repeatedly going through the following steps:

  1. do something, potentially updating the reconciled resource, which might result in a conflict error
  2. in case of any error from the previous step, log it in debug, set it as condition and then update the status of the resource, which might result in a conflict error too.

In order to actually address all "object has been modified", we need to properly handle the errors from both the steps above.

The original PR already addressed the second step, while to handle the first step's errors I've updated all error handling branches by checking whether the returned error is a conflict error after having logged it at debug level, but before proceeding with the event emitting and conditions updating phase.

I have:

  • Read and followed Crossplane's contribution process.
  • Run make reviewable test to ensure this PR is ready for review.

How has this code been tested

make reviewable

@phisco phisco requested review from sttts and negz September 11, 2023 10:10
@phisco phisco requested review from a team as code owners September 11, 2023 10:10
@phisco phisco changed the title Fix/conflict errors remake 2 reconciler/managed: only debug log transient conflict errors PT.2 Sep 11, 2023
@negz
Copy link
Member

negz commented Sep 18, 2023

It seems like now we have both a defer handling kerrors.IsConflict and explicit returns. Can we use only explicit returns? For a method as large as this one I would feel more comfortable not having the defer - it seems easy to miss that it exists.

@phisco
Copy link
Contributor Author

phisco commented Sep 18, 2023

@negz, no strong opinions there, the defer would be impossible to bypass by mistake, instead it would be easy to forget adding yet another if kerrors.IsConflict....

@turkenh
Copy link
Member

turkenh commented Sep 19, 2023

I don't have a strong opinion as well, but +1 for being consistent and leaning on explicit returns.

Also, I am wondering if it is time to consider refactoring the logic with a dedicated handleError method (or even an ErrorHandler struct) 🤔 Especially if we go with explicit returns, there will be a lot of duplicate code, which would be quite error-prone.

@phisco
Copy link
Contributor Author

phisco commented Sep 20, 2023

I already tried refactoring it, but didn't find a pleasant solution given the many different behaviours we have, but I was trying to be conservative, I'll give it another shot with a more aggressive refactoring 👍

@phisco
Copy link
Contributor Author

phisco commented Sep 20, 2023

The only thing I can think of would be refactoring it to have a callback function to be called only to handle the error after having "filtered" it, something like this:

// Reconcile a managed resource with an external resource.
func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) { //nolint:gocyclo // See note below.
	log := r.log.WithValues("request", req)
	errHandleFunc, result, err := r.reconcile(ctx, req)
	if err == nil {
		return result, nil
	}
	if kerrors.IsConflict(errors.Cause(err)) {
		// conflict errors are transient in Kubernetes and completely normal.
		// The right reaction is to requeue, but not record the object as error'ing
		// or creating events.
		log.Debug("Transient conflict error", "error", err)
		return reconcile.Result{Requeue: true}, nil
	}
	if errHandleFunc != nil {
		followupErr := errHandleFunc()
		if kerrors.IsConflict(errors.Cause(followupErr)) {
			// conflict errors are transient in Kubernetes and completely normal.
			// The right reaction is to requeue, but not record the object as error'ing
			// or creating events.
			log.Debug("Transient conflict error while handling error", "error", followupErr)
			return reconcile.Result{Requeue: true}, nil
		}
		if followupErr != nil {
			err = errors.Join(err, followupErr)
		}
	}
	return result, err
}

func (r *Reconciler) reconcile(ctx context.Context, req reconcile.Request) (func() error, reconcile.Result, error) {
	// NOTE(negz): This method is a well over our cyclomatic complexity goal.
	// Be wary of adding additional complexity.

	log := r.log.WithValues("request", req)

	log.Debug("Reconciling")

	ctx, cancel := context.WithTimeout(ctx, r.timeout+reconcileGracePeriod)
	defer cancel()

	// Govet linter has a check for lost cancel funcs but it's a false positive
	// for child contexts as because parent's cancel is called, so we skip it
	// for this line.
	externalCtx, _ := context.WithTimeout(ctx, r.timeout) //nolint:govet // See note above.

	managed := r.newManaged()
	if err := r.client.Get(ctx, req.NamespacedName, managed); err != nil {
		// There's no need to requeue if we no longer exist. Otherwise we'll be
		// requeued implicitly because we return an error.
		log.Debug("Cannot get managed resource", "error", err)
		return nil, reconcile.Result{}, errors.Wrap(resource.IgnoreNotFound(err), errGetManaged)
	}

	record := r.record.WithAnnotations("external-name", meta.GetExternalName(managed))
	log = log.WithValues(
		"uid", managed.GetUID(),
		"version", managed.GetResourceVersion(),
		"external-name", meta.GetExternalName(managed),
	)

	managementPoliciesEnabled := r.features.Enabled(feature.EnableAlphaManagementPolicies)
	if managementPoliciesEnabled {
		log.WithValues("managementPolicies", managed.GetManagementPolicies())
	}

	// Create the management policy resolver which will assist us in determining
	// what actions to take on the managed resource based on the management
	// and deletion policies.
	policy := NewManagementPoliciesResolver(managementPoliciesEnabled, managed.GetManagementPolicies(), managed.GetDeletionPolicy(), WithSupportedManagementPolicies(r.supportedManagementPolicies))

	// Check if the resource has paused reconciliation based on the
	// annotation or the management policies.
	// Log, publish an event and update the SYNC status condition.
	if meta.IsPaused(managed) || policy.IsPaused() {
		log.Debug("Reconciliation is paused either through the `spec.managementPolicies` or the pause annotation", "annotation", meta.AnnotationKeyReconciliationPaused)
		return func() error {
			record.Event(managed, event.Normal(reasonReconciliationPaused, "Reconciliation is paused either through the `spec.managementPolicies` or the pause annotation",
				"annotation", meta.AnnotationKeyReconciliationPaused))
			managed.SetConditions(xpv1.ReconcilePaused())
			// if the pause annotation is removed or the management policies changed, we will have a chance to reconcile
			// again and resume and if status update fails, we will reconcile again to retry to update the status
			return errors.Wrap(r.client.Status().Update(ctx, managed), errUpdateManagedStatus)
		}, reconcile.Result{}, nil
	}
...

I think it's even worst than having the defer + the explicit check for each error branch...

@sttts
Copy link
Contributor

sttts commented Sep 25, 2023

Related: #556

func Reconcile(...) (reconcile.Result, error) {
    ...

    return errors.SilentlyRequeueOnConflict(reconcile.Result{}, r.hcClient.Patch(ctx, obj, client.MergeFrom(orig)))
}

and

errors.WithSilentRequeueOnConflict(NewReconciler())

Doesn't replace the intermediate client call check with kerrors.IsConflicts, but especially the second variant makes it easy to wire this centrallly without too much noise.

@sttts
Copy link
Contributor

sttts commented Sep 25, 2023

@phisco actually your proposal with Reconcile and reconcile is not far from mine. Just realizing that. I only put the logic into a helper.

@phisco
Copy link
Contributor Author

phisco commented Sep 25, 2023

@sttts, I don't think the helper in #556 is actually doing the same thing, I don't think we want to emit the event, nor try updating the status with a condition in case of conflicts on the first level of errors. That's why I'm returning a function as a callback in the proposal above, which is only going to be called in case of a "good" error, while with the helper you mention you would still emit the event and only catch the error on the status update, so you would only replace the defer you originally introduced in #498. No?

@sttts
Copy link
Contributor

sttts commented Sep 25, 2023

Oh, you are right. Looked too briefly on the code before, missed the function bit. I agree, that's awkward.

My 2 cents are on 1. #556 for the non-error updates of a reconciler 2. add the if IsConflict blocks for the intermediate update as you have proposed here.

These intermediate updates are not so common in traditional reconcilers. Here, for the external client one the added complexity is fine IMO. It does some kind of checkpointing which make this necessary.

@phisco
Copy link
Contributor Author

phisco commented Sep 25, 2023

@sttts feel free to use the newly introduced helper in #556 and remove the defer here if that's how you envision using it.

@sttts
Copy link
Contributor

sttts commented Sep 26, 2023

feel free to use the newly introduced helper in #556 and remove the defer here if that's how you envision using it.

@phisco done

@phisco
Copy link
Contributor Author

phisco commented Sep 26, 2023

@negz, I'd say we can handle refactoring the defer in #556, so the changes in this PR should be good to merge, no?

@sttts
Copy link
Contributor

sttts commented Sep 26, 2023

@lsviben please also take a look here. That's the necessary counterpart for our custom error handling in the external client reconciler.

Copy link
Contributor

@lsviben lsviben left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok if I understand this correctly:
This PR will take care of not publishing events or setting conditions in case of conflicts. And by returning Reque: true, nil also skipping logging the error
#556 will take care of not logging those conflicts which come out of errors this PR could have missed, or anywhere an conflict could be returned? Plus adding the mechanism to apply this to other reconcilers.

I see how #556 could be seen as unneeded in most cases, but I like that we cover everything, so I would be for merging both although I dont feel strongly about it.

@phisco
Copy link
Contributor Author

phisco commented Sep 26, 2023

@lsviben #556 is just refactoring what we already have in the defer to use the new helper function. This PR is stopping the event emission/status updating to happen in case of conflict errors, while the defer is doing the same thing but given that we close most of the error branches with a status update setting the condition, it's just going to catch those, instead of actually handling the original error.

@lsviben
Copy link
Contributor

lsviben commented Sep 26, 2023

@lsviben #556 is just refactoring what we already have in the defer to use the new helper function. This PR is stopping the event emission/status updating to happen in case of conflict errors, while the defer is doing the same thing but given that we close most of the error branches with a status update setting the condition, it's just going to catch those, instead of actually handling the original error.

Ah i was refering to having or removing the defer at all, but miscommunicated that :).

Gotcha, missed the status updates which the defer will catch, nice.

All good from me then 👍

@phisco phisco merged commit c05e74a into crossplane:master Sep 26, 2023
8 of 9 checks passed
@lsviben lsviben mentioned this pull request Oct 3, 2023
2 tasks
@@ -991,6 +1027,9 @@ func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (resu
// issue we'll be requeued implicitly when we update our status with
// the new error condition. If not, we requeue explicitly, which will trigger backoff.
log.Debug("Cannot create external resource", "error", err)
if kerrors.IsConflict(err) {
return reconcile.Result{Requeue: true}, nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a return of death. The object won't have the critical failed/succeeded annotation yet. With the pending annotation only, the reconciler will refuse to proceed with this object.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants