Allow a result that indicates the reconciliation is incomplete and does not trigger the exponential backoff logic #617

akutz · 2019-09-30T14:52:47Z

After discussing this with @detiber, we realized there's no good solution for the following case:

A request is received
The request cannot proceed because of a missing dependency on resource A
The reconciler is watching events for resource A
The reconciler returns a result that indicates the reconciliation is incomplete due to reason X, but the request should not be requeued

A reconciler is using one or more watches to trigger requests
A request cannot proceed due to a missing dependency
If result.RequeueAfter > 0 then the request is added to the queue for processing after the value specified by result.RequeueAfter
If result.Requeue is true then the request is added to the queue with the same exponential backoff logic used when an error is returned
If an error is returned then the request is added to the queue with the exponential backoff logic

Today there is currently no way to indicate a reconciliation is incomplete without also having the request requeued by the manager either via an explicit amount of time or the exponential backoff logic (due to error or Requeue == true).

There should be a way to signal:

The reconciliation is incomplete
Do not requeue, the reconciler is watching the resources necessary to trigger its own events

Thanks!

The text was updated successfully, but these errors were encountered:

detiber · 2019-09-30T15:12:22Z

To add a bit of additional context it is currently very difficult to test controller-runtime based reconciliation loops with the current behavior if there are cases where only partial reconciliation is expected due to external dependencies that trigger reconciliation based on watches.

Allowing for a case where Requeue is explicitly set to false and an error is returned would allow for easier testing of these cases.

DirectXMan12 · 2019-09-30T21:53:00Z

To summarize: you want an error condition that says "don't requeue" (i.e. what we've referred to as ignorable errors in #377)?

DirectXMan12 · 2019-09-30T21:53:21Z

/kind feature

If so, can we move the discussion over there?

akutz · 2019-09-30T21:58:09Z

Hi @DirectXMan12,

Yes, but I also wonder now if this is the right way to do this. After speaking more about this with @detiber and @vincepri, I created this example that relies on asserting expected, eventual state of the objects.

DirectXMan12 · 2019-09-30T22:36:01Z

Yeah, we've thus far been recommending writing tests like that (using eventually and/or consistently), since it allows you to update your logic to occur over multiple reconciles, etc, w/o needing to update your tests.

P.S. Ginkgo tip: Expect(operation()).To(Succeed()) and Expect(err).NotTo(HaveOccurred()), but try not to Expect(operation()).NotTo(HaveOccurred()), since that's a bit confusing.

Also, Eventually(func() int { return operation() }).Should(BeNumerically(">", 10)) should yield better errors than just returning a bool, but that ones more of a matter of of taste.

detiber · 2019-10-01T13:19:03Z

Yes, the problem is very similar to #377.

The specific use case that is difficult to test for with the current model: We have some resources that we want to wait until they have an OwnerRef from a related resource prior to reconciling. Currently we have no way to test Reconcile in a way that tells us: 1) we haven't mutated the resource 2) We haven't completed a full reconciliation without requeueing.

Since the update to the ownerRef would trigger a new reconciliation, requeueing is pretty pointless here.

akutz · 2019-10-01T16:40:32Z

Hi @DirectXMan12,

Thank you again for your suggestions. I implemented most of them here!

enxebre · 2019-10-02T08:43:18Z

I came through this and #377 as well. @DirectXMan12 Some thoughts:

One approach:

type Result struct {
	// Requeue tells the Controller to requeue the reconcile key.  Defaults to false.
	Requeue bool
        // RequeueError tells the Controller to requeue the reconcile key when error is not nil.  If nil defaults to true.
	RequeueError *bool
	// RequeueAfter if greater than 0, tells the Controller to requeue the reconcile key after the Duration.
	// Implies that Requeue is true, there is no need to set Requeue to true at the same time as RequeueAfter.
	RequeueAfter time.Duration
}

if result, err := c.Do.Reconcile(req); err != nil {
	if requeueError(result.RequeueError) {
		c.Queue.AddRateLimited(req)
		log.Error(err, "Reconciler error", "controller", c.Name, "request", req)
		ctrlmetrics.ReconcileErrors.WithLabelValues(c.Name).Inc()
		ctrlmetrics.ReconcileTotal.WithLabelValues(c.Name, "error").Inc()
		return false
	}
	c.Queue.Forget(obj)
	return false
}

func requeueError(requeueError *bool) bool {
	if reconcileError == nil {
		return true
	}
	return *requeueError
}

Another approach:

if result, err := c.Do.Reconcile(req); err != nil {
	if requeueError(err) {
		c.Queue.AddRateLimited(req)
		log.Error(err, "Reconciler error", "controller", c.Name, "request", req)
		ctrlmetrics.ReconcileErrors.WithLabelValues(c.Name).Inc()
		ctrlmetrics.ReconcileTotal.WithLabelValues(c.Name, "error").Inc()
		return false
	}
	c.Queue.Forget(obj)
	return false
}

func requeueError(err error) bool {
	if apierrors.IsNotFound(err) {
		return false
	}
	if _, ok := err.(*client.IgnorableError); ok {
		return false
	}

	return true
}

// IgnorableError represents an error that should not be
// requeued for further processing.
type IgnorableError struct {}

// Error implements the error interface
func (e *IgnorableError) Error() string {
	return fmt.Sprintf("%s", e)
}

fejta-bot · 2019-12-31T09:00:56Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

detiber · 2019-12-31T15:50:43Z

/remove-lifecycle stale

DirectXMan12 · 2020-01-08T00:25:55Z

I'm tempted to lean towards tying this into ignorable errors, because it seems to be another side of the same die, if you will. There's a few of options there:

make ignorable a property you can implement on errors
make ignorable a wrapper that we test with errors.Is or errors.As (not mutually exclusive with option 1)
make ignorability configurable via some sort of "error filter" that gets to mess with requeue logic. We default to requeuing in most cases except some sane set (like NotFound).

seh · 2020-01-08T01:23:04Z

I vote for option 2. I wonder, though, if the error is ignorable, why return an error at all? Did something actually fail, or did the reconciler just decide not to do its job it does when all the dependencies are available? I see above that @detiber mentions needing to return an error.

DirectXMan12 · 2020-01-14T21:00:16Z

In the case of "ignorable errors", it's about translating "not found" to "decided not to do job until all dependencies are available" automatically.

In this case, it seems like it's about attaching additional information about why we're requeuing -- the error says "we didn't actually do what you asked", which is useful for diagnostic and testing purposes (as opposed to "we did what you asked, and we're trying again for whatever reason"). Practical outcome wise, they're pretty much the same, but it's nice to be able to see why things occurred.

detiber · 2020-01-21T20:17:00Z

I'm not sure I like the terminology "ignorable error", but the behavior would indeed fit our use case.

The main thing we'd like to accomplish is a way to say that reconciliation wasn't completed (generally due to waiting on some dependency), but not to force a requeue since we would already have a watch registered for the resource(s) involved.

I think both option 1 and 2 would fit our needs well, but maybe some other term instead of 'ignorable'? In our use case we aren't as much ignoring the error as much as trying to avoid unnecessary reconciliations of the resource since we'll get an update from the watch when we should re-reconcile.

DirectXMan12 · 2020-01-22T18:37:58Z

but maybe some other term instead of 'ignorable'

Sure, of course :-) If you have ideas, lmk

detiber · 2020-01-22T18:44:15Z

but maybe some other term instead of 'ignorable'

Sure, of course :-) If you have ideas, lmk

If discussing a property of an Error, maybe something to the effect of 'do not requeue'. I'm not a huge fan of the negative, but it would help for the purposes of defaulting.

For a test wrapper maybe ShouldRequeue?

DirectXMan12 · 2020-01-22T19:46:23Z

sure, seems reasonable

vincepri · 2020-02-20T18:49:20Z

/kind design
/help

k8s-ci-robot · 2020-02-20T18:49:21Z

@vincepri:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/kind design
/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

vincepri · 2020-02-20T18:49:29Z

/priority important-longterm

mhrivnak · 2020-02-20T20:45:07Z

Returning an error from a Reconciler seems to have limited value. It's hard to know what to do in response that's useful. I think we should consider changing the Reconciler interface so that it does not return an error at all.

When a Reconciler returns an error, three things happen (I'm ignoring metrics for the moment):

An error message gets logged
The request is requeued with backoff
reconcileHandler returns false, whose meaning is not documented, but causes the worker function to return and get called again after a jitter period.

(2) is hard to get right, as identified by this issue. It's hard to know which errors deserve a retry. Isn't it better to just let the Reconciler make its own decision, and set Requeue: true on the response when it determines that's appropriate? It's easy for a Reconciler author to make their own helper function that modifies or produces a response based on an error. That may be just as easy as giving them some way to pass a filter function or similar to controller-runtime that does the same.

(3) I'm not sure what the value is in this. What's it accomplishing?

(1) also seems limited in value. Should all errors be logged the same way? Do all errors deserve to be logged at all? Generally no. We could instead expect the Reconciler to do its own error handling and log a useful message if/when/how appropriate. An optional helper function to generically log errors from within the Reconciler would be just as useful as the log behavior today.

Back to metrics, as it is today with a generic error count, it's not clear what is being measured. I don't think a broad error count, where an error could mean many different things, is actionable or particularly useful. A Reconciler implementation can capture its own metrics that are more meaningful.

Lastly to testing. As already observed, it's hard to define what it means for a Reconciler run to be "incomplete". Many Reconcilers are designed to make incremental progress and re-run many times while converging to desired state. Perhaps what is most useful to communicate in terms of "completeness" is whether the Reconcile logic is blocked from progressing toward desired state. I'm not sure if there's a good generic way to capture that, or if that should be part of the Reconciler interface at all. Maybe that's best handled and tested as an implementation detail behind the Reconciler interface.

In many cases when progress toward desired state is blocked, it's useful to communicate that on the object's Status. An example is when a required Secret is missing or has invalid credentials, because it's a problem that the API user can fix. In these cases, the Status is a natural place for a test to determine success or failure.

In sum, when a Reconciler returns an error, it's hard to know what to do with it. Rather than have an interface that lets a Reconciler pass us an error and something that helps us understand how to handle it, we could just let the Reconciler handle it.

detiber · 2020-02-20T21:25:28Z

I like @mhrivnak's proposal, however if we go down that path, it would be nice if there was a way in the result to distinguish between:

requeue immediately
requeue after a set period of time
requeue with backoff (preferably where the backoff can be customized when instantiating the reconciler)

I'd rather not have to attempt to bolt on some type of backoff mechanism on top of the current result struct.

mhrivnak · 2020-02-20T21:35:06Z

I like @mhrivnak's proposal, however if we go down that path, it would be nice if there was a way in the result to distinguish between:

requeue immediately

requeue after a set period of time

requeue with backoff (preferably where the backoff can be customized when instantiating the reconciler)

Agreed. It seems that right now, setting Requeue: true always uses a backoff (am I reading that correctly?), but I suspect that's often not what people want or expect. Perhaps most controllers don't requeue so many times in a row that the backoff gets noticed.

alexeldeib · 2020-02-21T16:53:16Z

I like this idea! As a small side note, I don't think you get backoff while using RequeueAfter even if Requeue is true (in the non-error case).

DirectXMan12 · 2020-03-13T23:40:07Z

It doesn't seem like it would help with the problem where descriptions are simply too long

That's correct (just double-checked the code) -- Requeue and RequeueAfter are mutually exclusive (/me grumbles about Go's lack of tagged unions):

controller-runtime/pkg/internal/controller/controller.go

Lines 262 to 275 in e00985b

    
           } else if result.RequeueAfter > 0 { 
        
           	// The result.RequeueAfter request will be lost, if it is returned 
        
           	// along with a non-nil error. But this is intended as 
        
           	// We need to drive to stable reconcile loops before queuing due 
        
           	// to result.RequestAfter 
        
           	c.Queue.Forget(obj) 
        
           	c.Queue.AddAfter(req, result.RequeueAfter) 
        
           	ctrlmetrics.ReconcileTotal.WithLabelValues(c.Name, "requeue_after").Inc() 
        
           	return true 
        
           } else if result.Requeue { 
        
           	c.Queue.AddRateLimited(req) 
        
           	ctrlmetrics.ReconcileTotal.WithLabelValues(c.Name, "requeue").Inc() 
        
           	return true 
        
           }

fejta-bot · 2020-06-11T23:53:33Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

DirectXMan12 · 2020-06-12T21:29:58Z

/remove-lifecycle stale

fejta-bot · 2020-09-10T22:19:07Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

detiber · 2020-10-05T15:32:07Z

/lifecycle frozen

In the reconciler, we considered a pending uninstall operation as an error. It resulted in slower reconciliation because of exponential backoff. To avoid the exponential backoff, we need to return the request with the requeueAfter value set. See: kubernetes-sigs/controller-runtime#617 Signed-off-by: Raghavendra Talur <[email protected]>

- Reduce log noise by logging errors instead of successes - Use context logger provided by controller-runtime - Patch status instead of update to avoid "the object has been modified; please apply your changes to the latest version and try again" - Add finalizer even if object is already under deletion, in case we never got a chance yet - Don't set RequeueAfter on errors since it is ignored anyway [0] [0]: kubernetes-sigs/controller-runtime#617 Change-Id: Ic06aa74f9e1465d3f7e32137559231e940c8a74d

k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Sep 30, 2019

detiber mentioned this issue Dec 3, 2019

🏃 add tests for gcp_machine_controller kubernetes-sigs/cluster-api-provider-gcp#260

Closed

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 31, 2019

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 31, 2019

enxebre mentioned this issue Feb 5, 2020

REQUEST: New membership for enxebre kubernetes/org#1614

Closed

6 tasks

vincepri mentioned this issue Feb 20, 2020

Ignorable Errors #377

Closed

k8s-ci-robot added kind/design Categorizes issue or PR as related to design. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Feb 20, 2020

k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Feb 20, 2020

vincepri added this to the Next milestone Feb 20, 2020

vincepri added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/feature Categorizes issue or PR as related to a new feature. labels Feb 20, 2020

vincepri mentioned this issue Feb 20, 2020

Requeue Overhaul Tracking #417

Closed

3 tasks

djzager mentioned this issue Feb 20, 2020

Document ways to trigger/not trigger exponential backoff #808

Open

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 11, 2020

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 12, 2020

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 10, 2020

k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 5, 2020

mhrivnak mentioned this issue Apr 28, 2021

MGMT-4261 Agent CR and ACI cleanup when deleting CD openshift/assisted-service#1606

Merged

TomNeyland mentioned this issue Mar 7, 2023

Consider adding a terminal error type that doesn't get retried #2215

Closed

tommyknows mentioned this issue Mar 10, 2023

[TIKI-3] feat: implement store backend to send data snyk/kubernetes-scanner#4

Merged

pmalek mentioned this issue Sep 23, 2024

feat(konnect): KongTarget reconciler Kong/gateway-operator#627

Merged

1 task

ShyamsundarR mentioned this issue Dec 3, 2024

Silence "deletion in progress" errors when deleting VRGs RamenDR/ramen#1689

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow a result that indicates the reconciliation is incomplete and does not trigger the exponential backoff logic #617

Allow a result that indicates the reconciliation is incomplete and does not trigger the exponential backoff logic #617

akutz commented Sep 30, 2019 •

edited

Loading

detiber commented Sep 30, 2019

DirectXMan12 commented Sep 30, 2019

DirectXMan12 commented Sep 30, 2019

akutz commented Sep 30, 2019

DirectXMan12 commented Sep 30, 2019 •

edited

Loading

detiber commented Oct 1, 2019 •

edited

Loading

akutz commented Oct 1, 2019

enxebre commented Oct 2, 2019 •

edited

Loading

fejta-bot commented Dec 31, 2019

detiber commented Dec 31, 2019

DirectXMan12 commented Jan 8, 2020

seh commented Jan 8, 2020

DirectXMan12 commented Jan 14, 2020

detiber commented Jan 21, 2020

DirectXMan12 commented Jan 22, 2020

detiber commented Jan 22, 2020

DirectXMan12 commented Jan 22, 2020

vincepri commented Feb 20, 2020

k8s-ci-robot commented Feb 20, 2020

vincepri commented Feb 20, 2020

mhrivnak commented Feb 20, 2020

detiber commented Feb 20, 2020

mhrivnak commented Feb 20, 2020

alexeldeib commented Feb 21, 2020

DirectXMan12 commented Mar 13, 2020

fejta-bot commented Jun 11, 2020

DirectXMan12 commented Jun 12, 2020

fejta-bot commented Sep 10, 2020

detiber commented Oct 5, 2020

Allow a result that indicates the reconciliation is incomplete and does not trigger the exponential backoff logic #617

Allow a result that indicates the reconciliation is incomplete and does not trigger the exponential backoff logic #617

Comments

akutz commented Sep 30, 2019 • edited Loading

detiber commented Sep 30, 2019

DirectXMan12 commented Sep 30, 2019

DirectXMan12 commented Sep 30, 2019

akutz commented Sep 30, 2019

DirectXMan12 commented Sep 30, 2019 • edited Loading

detiber commented Oct 1, 2019 • edited Loading

akutz commented Oct 1, 2019

enxebre commented Oct 2, 2019 • edited Loading

fejta-bot commented Dec 31, 2019

detiber commented Dec 31, 2019

DirectXMan12 commented Jan 8, 2020

seh commented Jan 8, 2020

DirectXMan12 commented Jan 14, 2020

detiber commented Jan 21, 2020

DirectXMan12 commented Jan 22, 2020

detiber commented Jan 22, 2020

DirectXMan12 commented Jan 22, 2020

vincepri commented Feb 20, 2020

k8s-ci-robot commented Feb 20, 2020

vincepri commented Feb 20, 2020

mhrivnak commented Feb 20, 2020

detiber commented Feb 20, 2020

mhrivnak commented Feb 20, 2020

alexeldeib commented Feb 21, 2020

DirectXMan12 commented Mar 13, 2020

fejta-bot commented Jun 11, 2020

DirectXMan12 commented Jun 12, 2020

fejta-bot commented Sep 10, 2020

detiber commented Oct 5, 2020

akutz commented Sep 30, 2019 •

edited

Loading

DirectXMan12 commented Sep 30, 2019 •

edited

Loading

detiber commented Oct 1, 2019 •

edited

Loading

enxebre commented Oct 2, 2019 •

edited

Loading