Fixes the requeueing bug that requeues a reconcile request in the wrong workqueue #259
Description of your changes
During some load tests with upbound/[email protected], we spotted a bug where a reconcile request might be requeued in the wrong workqueue. The bug manifests itself in the provider debug logs with messages similar to the following:

The issue is that the reconciler attempts to observe a resource using the wrong kind, which results in a not-found error. The root cause is that we are unintentionally sharing the event handlers across all the reconcilers in a given API group. The fix is to have a separate event handler per GVK.
This bug causes the reconciler to do unnecessary work: reconciliation opportunities are lost for a given (GVK, name) pair because another pair is observed instead. It does not affect the periodic reconciles performed at each poll period, so an affected resource is eventually reconciled successfully, but successful reconciliation takes more attempts. This results in, for instance, increased TTR.
This PR also improves some debug logging that proved useful in interpreting the results we observed in the load tests. We are also planning to expose further custom metrics reporting these measurements to improve monitoring.
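For illustration only, here is a minimal sketch of the kind of contextual debug logging that makes such misrouted requests visible, assuming a logr-style logger; the helper name and log keys are made up for this example and are not the actual change:

```go
package reconciler

import (
	"github.com/go-logr/logr"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
)

// withReconcileValues attaches the identifiers that help spot a mismatch
// between the GVK a reconciler was built for and the object it is asked to
// reconcile.
func withReconcileValues(log logr.Logger, gvk schema.GroupVersionKind, nn types.NamespacedName) logr.Logger {
	return log.WithValues(
		"gvk", gvk.String(),
		"name", nn.Name,
		"namespace", nn.Namespace,
	)
}

// Usage inside a reconcile loop, for example:
//
//	log := withReconcileValues(baseLog, gvk, req.NamespacedName)
//	log.V(1).Info("Reconciling external resource")
```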
I have:
- Run `make reviewable` to ensure this PR is ready for review.
- Added `backport release-x.y` labels to auto-backport this PR if necessary.

How has this code been tested
Performed load tests with 3000 MRs. Based on a single observation, it initially took ~2 h for a resource affected by this bug to transition to the Ready state under heavy load (~99% total CPU utilization across all available CPUs). With the fix, it took ~1857 s on average over 23 tries (experiments) with a single resource on top of 3000 MRs.