-
Notifications
You must be signed in to change notification settings - Fork 105
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
reconciler/managed: only debug log transient conflict errors PT.2 #540
Conversation
Signed-off-by: Philippe Scorsolini <[email protected]>
7a84718
to
ce49d22
Compare
It seems like now we have both a defer handling |
@negz, no strong opinions there, the defer would be impossible to bypass by mistake, instead it would be easy to forget adding yet another |
I don't have a strong opinion as well, but +1 for being consistent and leaning on explicit returns. Also, I am wondering if it is time to consider refactoring the logic with a dedicated handleError method (or even an ErrorHandler struct) 🤔 Especially if we go with explicit returns, there will be a lot of duplicate code, which would be quite error-prone. |
I already tried refactoring it, but didn't find a pleasant solution given the many different behaviours we have, but I was trying to be conservative, I'll give it another shot with a more aggressive refactoring 👍 |
The only thing I can think of would be refactoring it to have a callback function to be called only to handle the error after having "filtered" it, something like this: // Reconcile a managed resource with an external resource.
func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) { //nolint:gocyclo // See note below.
log := r.log.WithValues("request", req)
errHandleFunc, result, err := r.reconcile(ctx, req)
if err == nil {
return result, nil
}
if kerrors.IsConflict(errors.Cause(err)) {
// conflict errors are transient in Kubernetes and completely normal.
// The right reaction is to requeue, but not record the object as error'ing
// or creating events.
log.Debug("Transient conflict error", "error", err)
return reconcile.Result{Requeue: true}, nil
}
if errHandleFunc != nil {
followupErr := errHandleFunc()
if kerrors.IsConflict(errors.Cause(followupErr)) {
// conflict errors are transient in Kubernetes and completely normal.
// The right reaction is to requeue, but not record the object as error'ing
// or creating events.
log.Debug("Transient conflict error while handling error", "error", followupErr)
return reconcile.Result{Requeue: true}, nil
}
if followupErr != nil {
err = errors.Join(err, followupErr)
}
}
return result, err
}
func (r *Reconciler) reconcile(ctx context.Context, req reconcile.Request) (func() error, reconcile.Result, error) {
// NOTE(negz): This method is a well over our cyclomatic complexity goal.
// Be wary of adding additional complexity.
log := r.log.WithValues("request", req)
log.Debug("Reconciling")
ctx, cancel := context.WithTimeout(ctx, r.timeout+reconcileGracePeriod)
defer cancel()
// Govet linter has a check for lost cancel funcs but it's a false positive
// for child contexts as because parent's cancel is called, so we skip it
// for this line.
externalCtx, _ := context.WithTimeout(ctx, r.timeout) //nolint:govet // See note above.
managed := r.newManaged()
if err := r.client.Get(ctx, req.NamespacedName, managed); err != nil {
// There's no need to requeue if we no longer exist. Otherwise we'll be
// requeued implicitly because we return an error.
log.Debug("Cannot get managed resource", "error", err)
return nil, reconcile.Result{}, errors.Wrap(resource.IgnoreNotFound(err), errGetManaged)
}
record := r.record.WithAnnotations("external-name", meta.GetExternalName(managed))
log = log.WithValues(
"uid", managed.GetUID(),
"version", managed.GetResourceVersion(),
"external-name", meta.GetExternalName(managed),
)
managementPoliciesEnabled := r.features.Enabled(feature.EnableAlphaManagementPolicies)
if managementPoliciesEnabled {
log.WithValues("managementPolicies", managed.GetManagementPolicies())
}
// Create the management policy resolver which will assist us in determining
// what actions to take on the managed resource based on the management
// and deletion policies.
policy := NewManagementPoliciesResolver(managementPoliciesEnabled, managed.GetManagementPolicies(), managed.GetDeletionPolicy(), WithSupportedManagementPolicies(r.supportedManagementPolicies))
// Check if the resource has paused reconciliation based on the
// annotation or the management policies.
// Log, publish an event and update the SYNC status condition.
if meta.IsPaused(managed) || policy.IsPaused() {
log.Debug("Reconciliation is paused either through the `spec.managementPolicies` or the pause annotation", "annotation", meta.AnnotationKeyReconciliationPaused)
return func() error {
record.Event(managed, event.Normal(reasonReconciliationPaused, "Reconciliation is paused either through the `spec.managementPolicies` or the pause annotation",
"annotation", meta.AnnotationKeyReconciliationPaused))
managed.SetConditions(xpv1.ReconcilePaused())
// if the pause annotation is removed or the management policies changed, we will have a chance to reconcile
// again and resume and if status update fails, we will reconcile again to retry to update the status
return errors.Wrap(r.client.Status().Update(ctx, managed), errUpdateManagedStatus)
}, reconcile.Result{}, nil
}
... I think it's even worst than having the defer + the explicit check for each error branch... |
Related: #556 func Reconcile(...) (reconcile.Result, error) {
...
return errors.SilentlyRequeueOnConflict(reconcile.Result{}, r.hcClient.Patch(ctx, obj, client.MergeFrom(orig)))
} and errors.WithSilentRequeueOnConflict(NewReconciler()) Doesn't replace the intermediate client call check with |
@phisco actually your proposal with |
@sttts, I don't think the helper in #556 is actually doing the same thing, I don't think we want to emit the event, nor try updating the status with a condition in case of conflicts on the first level of errors. That's why I'm returning a function as a callback in the proposal above, which is only going to be called in case of a "good" error, while with the helper you mention you would still emit the event and only catch the error on the status update, so you would only replace the defer you originally introduced in #498. No? |
Oh, you are right. Looked too briefly on the code before, missed the function bit. I agree, that's awkward. My 2 cents are on 1. #556 for the non-error updates of a reconciler 2. add the These intermediate updates are not so common in traditional reconcilers. Here, for the external client one the added complexity is fine IMO. It does some kind of checkpointing which make this necessary. |
@lsviben please also take a look here. That's the necessary counterpart for our custom error handling in the external client reconciler. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok if I understand this correctly:
This PR will take care of not publishing events or setting conditions in case of conflicts. And by returning Reque: true, nil
also skipping logging the error
#556 will take care of not logging those conflicts which come out of errors this PR could have missed, or anywhere an conflict could be returned? Plus adding the mechanism to apply this to other reconcilers.
I see how #556 could be seen as unneeded in most cases, but I like that we cover everything, so I would be for merging both although I dont feel strongly about it.
@lsviben #556 is just refactoring what we already have in the defer to use the new helper function. This PR is stopping the event emission/status updating to happen in case of conflict errors, while the defer is doing the same thing but given that we close most of the error branches with a status update setting the condition, it's just going to catch those, instead of actually handling the original error. |
Ah i was refering to having or removing the defer at all, but miscommunicated that :). Gotcha, missed the status updates which the defer will catch, nice. All good from me then 👍 |
@@ -991,6 +1027,9 @@ func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (resu | |||
// issue we'll be requeued implicitly when we update our status with | |||
// the new error condition. If not, we requeue explicitly, which will trigger backoff. | |||
log.Debug("Cannot create external resource", "error", err) | |||
if kerrors.IsConflict(err) { | |||
return reconcile.Result{Requeue: true}, nil | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a return of death. The object won't have the critical failed/succeeded annotation yet. With the pending annotation only, the reconciler will refuse to proceed with this object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Description of your changes
Follow-up PR to #498.
My understanding of the original PR is that the goal was to properly handle any conflict error by logging it in debug mode and re-queueing it.
To fully achieve what the original PR was trying to achieve we have to first understand that the whole reconciliation loop is built by repeatedly going through the following steps:
In order to actually address all "object has been modified", we need to properly handle the errors from both the steps above.
The original PR already addressed the second step, while to handle the first step's errors I've updated all error handling branches by checking whether the returned error is a conflict error after having logged it at debug level, but before proceeding with the event emitting and conditions updating phase.
I have:
make reviewable test
to ensure this PR is ready for review.How has this code been tested
make reviewable