Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow a result that indicates the reconciliation is incomplete and does not trigger the exponential backoff logic #617

Open
akutz opened this issue Sep 30, 2019 · 29 comments
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/design Categorizes issue or PR as related to design. kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.
Milestone

Comments

@akutz
Copy link
Contributor

akutz commented Sep 30, 2019

After discussing this with @detiber, we realized there's no good solution for the following case:

  • A request is received
  • The request cannot proceed because of a missing dependency on resource A
  • The reconciler is watching events for resource A
  • The reconciler returns a result that indicates the reconciliation is incomplete due to reason X, but the request should not be requeued

Instead the current logic is:

  • A reconciler is using one or more watches to trigger requests
  • A request cannot proceed due to a missing dependency
  • If result.RequeueAfter > 0 then the request is added to the queue for processing after the value specified by result.RequeueAfter
  • If result.Requeue is true then the request is added to the queue with the same exponential backoff logic used when an error is returned
  • If an error is returned then the request is added to the queue with the exponential backoff logic

Today there is currently no way to indicate a reconciliation is incomplete without also having the request requeued by the manager either via an explicit amount of time or the exponential backoff logic (due to error or Requeue == true).

There should be a way to signal:

  • The reconciliation is incomplete
  • Do not requeue, the reconciler is watching the resources necessary to trigger its own events

Thanks!

@detiber
Copy link
Member

detiber commented Sep 30, 2019

To add a bit of additional context it is currently very difficult to test controller-runtime based reconciliation loops with the current behavior if there are cases where only partial reconciliation is expected due to external dependencies that trigger reconciliation based on watches.

Allowing for a case where Requeue is explicitly set to false and an error is returned would allow for easier testing of these cases.

@DirectXMan12
Copy link
Contributor

To summarize: you want an error condition that says "don't requeue" (i.e. what we've referred to as ignorable errors in #377)?

@DirectXMan12
Copy link
Contributor

/kind feature

If so, can we move the discussion over there?

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Sep 30, 2019
@akutz
Copy link
Contributor Author

akutz commented Sep 30, 2019

Hi @DirectXMan12,

Yes, but I also wonder now if this is the right way to do this. After speaking more about this with @detiber and @vincepri, I created this example that relies on asserting expected, eventual state of the objects.

@DirectXMan12
Copy link
Contributor

DirectXMan12 commented Sep 30, 2019

Yeah, we've thus far been recommending writing tests like that (using eventually and/or consistently), since it allows you to update your logic to occur over multiple reconciles, etc, w/o needing to update your tests.

P.S. Ginkgo tip: Expect(operation()).To(Succeed()) and Expect(err).NotTo(HaveOccurred()), but try not to Expect(operation()).NotTo(HaveOccurred()), since that's a bit confusing.

Also, Eventually(func() int { return operation() }).Should(BeNumerically(">", 10)) should yield better errors than just returning a bool, but that ones more of a matter of of taste.

@detiber
Copy link
Member

detiber commented Oct 1, 2019

Yes, the problem is very similar to #377.

The specific use case that is difficult to test for with the current model: We have some resources that we want to wait until they have an OwnerRef from a related resource prior to reconciling. Currently we have no way to test Reconcile in a way that tells us: 1) we haven't mutated the resource 2) We haven't completed a full reconciliation without requeueing.

Since the update to the ownerRef would trigger a new reconciliation, requeueing is pretty pointless here.

@akutz
Copy link
Contributor Author

akutz commented Oct 1, 2019

Hi @DirectXMan12,

Thank you again for your suggestions. I implemented most of them here!

@enxebre
Copy link
Member

enxebre commented Oct 2, 2019

I came through this and #377 as well. @DirectXMan12 Some thoughts:

One approach:

type Result struct {
	// Requeue tells the Controller to requeue the reconcile key.  Defaults to false.
	Requeue bool
        // RequeueError tells the Controller to requeue the reconcile key when error is not nil.  If nil defaults to true.
	RequeueError *bool
	// RequeueAfter if greater than 0, tells the Controller to requeue the reconcile key after the Duration.
	// Implies that Requeue is true, there is no need to set Requeue to true at the same time as RequeueAfter.
	RequeueAfter time.Duration
}

if result, err := c.Do.Reconcile(req); err != nil {
	if requeueError(result.RequeueError) {
		c.Queue.AddRateLimited(req)
		log.Error(err, "Reconciler error", "controller", c.Name, "request", req)
		ctrlmetrics.ReconcileErrors.WithLabelValues(c.Name).Inc()
		ctrlmetrics.ReconcileTotal.WithLabelValues(c.Name, "error").Inc()
		return false
	}
	c.Queue.Forget(obj)
	return false
}

func requeueError(requeueError *bool) bool {
	if reconcileError == nil {
		return true
	}
	return *requeueError
}

Another approach:

if result, err := c.Do.Reconcile(req); err != nil {
	if requeueError(err) {
		c.Queue.AddRateLimited(req)
		log.Error(err, "Reconciler error", "controller", c.Name, "request", req)
		ctrlmetrics.ReconcileErrors.WithLabelValues(c.Name).Inc()
		ctrlmetrics.ReconcileTotal.WithLabelValues(c.Name, "error").Inc()
		return false
	}
	c.Queue.Forget(obj)
	return false
}

func requeueError(err error) bool {
	if apierrors.IsNotFound(err) {
		return false
	}
	if _, ok := err.(*client.IgnorableError); ok {
		return false
	}

	return true
}

// IgnorableError represents an error that should not be
// requeued for further processing.
type IgnorableError struct {}

// Error implements the error interface
func (e *IgnorableError) Error() string {
	return fmt.Sprintf("%s", e)
}

@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 31, 2019
@detiber
Copy link
Member

detiber commented Dec 31, 2019

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 31, 2019
@DirectXMan12
Copy link
Contributor

I'm tempted to lean towards tying this into ignorable errors, because it seems to be another side of the same die, if you will. There's a few of options there:

  1. make ignorable a property you can implement on errors
  2. make ignorable a wrapper that we test with errors.Is or errors.As (not mutually exclusive with option 1)
  3. make ignorability configurable via some sort of "error filter" that gets to mess with requeue logic. We default to requeuing in most cases except some sane set (like NotFound).

@seh
Copy link

seh commented Jan 8, 2020

I vote for option 2. I wonder, though, if the error is ignorable, why return an error at all? Did something actually fail, or did the reconciler just decide not to do its job it does when all the dependencies are available? I see above that @detiber mentions needing to return an error.

@DirectXMan12
Copy link
Contributor

In the case of "ignorable errors", it's about translating "not found" to "decided not to do job until all dependencies are available" automatically.

In this case, it seems like it's about attaching additional information about why we're requeuing -- the error says "we didn't actually do what you asked", which is useful for diagnostic and testing purposes (as opposed to "we did what you asked, and we're trying again for whatever reason"). Practical outcome wise, they're pretty much the same, but it's nice to be able to see why things occurred.

@detiber
Copy link
Member

detiber commented Jan 21, 2020

I'm not sure I like the terminology "ignorable error", but the behavior would indeed fit our use case.

The main thing we'd like to accomplish is a way to say that reconciliation wasn't completed (generally due to waiting on some dependency), but not to force a requeue since we would already have a watch registered for the resource(s) involved.

I think both option 1 and 2 would fit our needs well, but maybe some other term instead of 'ignorable'? In our use case we aren't as much ignoring the error as much as trying to avoid unnecessary reconciliations of the resource since we'll get an update from the watch when we should re-reconcile.

@DirectXMan12
Copy link
Contributor

but maybe some other term instead of 'ignorable'

Sure, of course :-) If you have ideas, lmk

@detiber
Copy link
Member

detiber commented Jan 22, 2020

but maybe some other term instead of 'ignorable'

Sure, of course :-) If you have ideas, lmk

If discussing a property of an Error, maybe something to the effect of 'do not requeue'. I'm not a huge fan of the negative, but it would help for the purposes of defaulting.

For a test wrapper maybe ShouldRequeue?

@DirectXMan12
Copy link
Contributor

sure, seems reasonable

@vincepri
Copy link
Member

/kind design
/help

@k8s-ci-robot
Copy link
Contributor

@vincepri:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/kind design
/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added kind/design Categorizes issue or PR as related to design. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Feb 20, 2020
@vincepri
Copy link
Member

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Feb 20, 2020
@vincepri vincepri added this to the Next milestone Feb 20, 2020
@vincepri vincepri added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/feature Categorizes issue or PR as related to a new feature. labels Feb 20, 2020
@vincepri vincepri mentioned this issue Feb 20, 2020
3 tasks
@mhrivnak
Copy link
Contributor

Returning an error from a Reconciler seems to have limited value. It's hard to know what to do in response that's useful. I think we should consider changing the Reconciler interface so that it does not return an error at all.

When a Reconciler returns an error, three things happen (I'm ignoring metrics for the moment):

  1. An error message gets logged
  2. The request is requeued with backoff
  3. reconcileHandler returns false, whose meaning is not documented, but causes the worker function to return and get called again after a jitter period.

(2) is hard to get right, as identified by this issue. It's hard to know which errors deserve a retry. Isn't it better to just let the Reconciler make its own decision, and set Requeue: true on the response when it determines that's appropriate? It's easy for a Reconciler author to make their own helper function that modifies or produces a response based on an error. That may be just as easy as giving them some way to pass a filter function or similar to controller-runtime that does the same.

(3) I'm not sure what the value is in this. What's it accomplishing?

(1) also seems limited in value. Should all errors be logged the same way? Do all errors deserve to be logged at all? Generally no. We could instead expect the Reconciler to do its own error handling and log a useful message if/when/how appropriate. An optional helper function to generically log errors from within the Reconciler would be just as useful as the log behavior today.

Back to metrics, as it is today with a generic error count, it's not clear what is being measured. I don't think a broad error count, where an error could mean many different things, is actionable or particularly useful. A Reconciler implementation can capture its own metrics that are more meaningful.

Lastly to testing. As already observed, it's hard to define what it means for a Reconciler run to be "incomplete". Many Reconcilers are designed to make incremental progress and re-run many times while converging to desired state. Perhaps what is most useful to communicate in terms of "completeness" is whether the Reconcile logic is blocked from progressing toward desired state. I'm not sure if there's a good generic way to capture that, or if that should be part of the Reconciler interface at all. Maybe that's best handled and tested as an implementation detail behind the Reconciler interface.

In many cases when progress toward desired state is blocked, it's useful to communicate that on the object's Status. An example is when a required Secret is missing or has invalid credentials, because it's a problem that the API user can fix. In these cases, the Status is a natural place for a test to determine success or failure.

In sum, when a Reconciler returns an error, it's hard to know what to do with it. Rather than have an interface that lets a Reconciler pass us an error and something that helps us understand how to handle it, we could just let the Reconciler handle it.

@detiber
Copy link
Member

detiber commented Feb 20, 2020

I like @mhrivnak's proposal, however if we go down that path, it would be nice if there was a way in the result to distinguish between:

  • requeue immediately
  • requeue after a set period of time
  • requeue with backoff (preferably where the backoff can be customized when instantiating the reconciler)

I'd rather not have to attempt to bolt on some type of backoff mechanism on top of the current result struct.

@mhrivnak
Copy link
Contributor

I like @mhrivnak's proposal, however if we go down that path, it would be nice if there was a way in the result to distinguish between:

  • requeue immediately
  • requeue after a set period of time
  • requeue with backoff (preferably where the backoff can be customized when instantiating the reconciler)

Agreed. It seems that right now, setting Requeue: true always uses a backoff (am I reading that correctly?), but I suspect that's often not what people want or expect. Perhaps most controllers don't requeue so many times in a row that the backoff gets noticed.

@alexeldeib
Copy link
Contributor

I like this idea! As a small side note, I don't think you get backoff while using RequeueAfter even if Requeue is true (in the non-error case).

@DirectXMan12
Copy link
Contributor

It doesn't seem like it would help with the problem where descriptions are simply too long

That's correct (just double-checked the code) -- Requeue and RequeueAfter are mutually exclusive (/me grumbles about Go's lack of tagged unions):

} else if result.RequeueAfter > 0 {
// The result.RequeueAfter request will be lost, if it is returned
// along with a non-nil error. But this is intended as
// We need to drive to stable reconcile loops before queuing due
// to result.RequestAfter
c.Queue.Forget(obj)
c.Queue.AddAfter(req, result.RequeueAfter)
ctrlmetrics.ReconcileTotal.WithLabelValues(c.Name, "requeue_after").Inc()
return true
} else if result.Requeue {
c.Queue.AddRateLimited(req)
ctrlmetrics.ReconcileTotal.WithLabelValues(c.Name, "requeue").Inc()
return true
}

@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 11, 2020
@DirectXMan12
Copy link
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 12, 2020
@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 10, 2020
@detiber
Copy link
Member

detiber commented Oct 5, 2020

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 5, 2020
raghavendra-talur added a commit to raghavendra-talur/ocs-operator that referenced this issue Oct 19, 2020
In the reconciler, we considered a pending uninstall operation as an
error. It resulted in slower reconciliation because of exponential
backoff. To avoid the exponential backoff, we need to return the request
with the requeueAfter value set.

See: kubernetes-sigs/controller-runtime#617

Signed-off-by: Raghavendra Talur <[email protected]>
raghavendra-talur added a commit to raghavendra-talur/ocs-operator that referenced this issue Oct 19, 2020
In the reconciler, we considered a pending uninstall operation as an
error. It resulted in slower reconciliation because of exponential
backoff. To avoid the exponential backoff, we need to return the request
with the requeueAfter value set.

See: kubernetes-sigs/controller-runtime#617

Signed-off-by: Raghavendra Talur <[email protected]>
raghavendra-talur added a commit to raghavendra-talur/ocs-operator that referenced this issue Oct 19, 2020
In the reconciler, we considered a pending uninstall operation as an
error. It resulted in slower reconciliation because of exponential
backoff. To avoid the exponential backoff, we need to return the request
with the requeueAfter value set.

See: kubernetes-sigs/controller-runtime#617

Signed-off-by: Raghavendra Talur <[email protected]>
raghavendra-talur added a commit to raghavendra-talur/ocs-operator that referenced this issue Oct 19, 2020
In the reconciler, we considered a pending uninstall operation as an
error. It resulted in slower reconciliation because of exponential
backoff. To avoid the exponential backoff, we need to return the request
with the requeueAfter value set.

See: kubernetes-sigs/controller-runtime#617

Signed-off-by: Raghavendra Talur <[email protected]>
raghavendra-talur added a commit to raghavendra-talur/ocs-operator that referenced this issue Oct 20, 2020
In the reconciler, we considered a pending uninstall operation as an
error. It resulted in slower reconciliation because of exponential
backoff. To avoid the exponential backoff, we need to return the request
with the requeueAfter value set.

See: kubernetes-sigs/controller-runtime#617

Signed-off-by: Raghavendra Talur <[email protected]>
raghavendra-talur added a commit to raghavendra-talur/ocs-operator that referenced this issue Oct 20, 2020
In the reconciler, we considered a pending uninstall operation as an
error. It resulted in slower reconciliation because of exponential
backoff. To avoid the exponential backoff, we need to return the request
with the requeueAfter value set.

See: kubernetes-sigs/controller-runtime#617

Signed-off-by: Raghavendra Talur <[email protected]>
airshipbot pushed a commit to airshipit/vino that referenced this issue Jan 22, 2021
- Reduce log noise by logging errors instead of successes
- Use context logger provided by controller-runtime
- Patch status instead of update to avoid "the object has been modified; please apply your changes to the latest version and try again"
- Add finalizer even if object is already under deletion, in case we never got a chance yet
- Don't set RequeueAfter on errors since it is ignored anyway [0]

[0]: kubernetes-sigs/controller-runtime#617
Change-Id: Ic06aa74f9e1465d3f7e32137559231e940c8a74d
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/design Categorizes issue or PR as related to design. kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.
Projects
None yet
Development

No branches or pull requests

10 participants