Make HandleError prevent hot-loops #40497
Conversation
I will make another PR with a test. This is the minimum fix and it should be easy to cherrypick.
// package for that to be accessible here.
lastErrorTime time.Time
minPeriod time.Duration
lock sync.Mutex
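For context, here is a minimal sketch of what a sleep-based backoff built from these fields might look like. This is illustrative, not the exact code in the PR; the type and method names are assumptions, and the lastErrorTimeLock naming follows the review comment below.

package runtimeutil // illustrative package name

import (
	"sync"
	"time"
)

// rudimentaryErrorBackoff sleeps when errors arrive faster than minPeriod,
// so a caller that reports errors in a tight loop is slowed down.
type rudimentaryErrorBackoff struct {
	minPeriod time.Duration // immutable after construction

	lastErrorTimeLock sync.Mutex // protects lastErrorTime (per the review nit)
	lastErrorTime     time.Time
}

// OnError blocks until at least minPeriod has elapsed since the previous error.
func (r *rudimentaryErrorBackoff) OnError() {
	r.lastErrorTimeLock.Lock()
	defer r.lastErrorTimeLock.Unlock()

	if since := time.Since(r.lastErrorTime); since < r.minPeriod {
		time.Sleep(r.minPeriod - since)
	}
	r.lastErrorTime = time.Now()
}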
nit: the lock should be the field above the thing it locks and should be named lastErrorTimeLock or something.
ok, will fix here and in the cherrypick branch
cherrypick branch is fixed.
[APPROVALNOTIFIER] Needs approval from an approver in each of these OWNERS files. We suggest the following people:
/lgtm
I want to make the fix in the release branch here before this merges.
(but I'm building and testing the release branch first)
Add an error "handler" that just sleeps for a bit if errors happen more often than once every 500ms. Manually tested against kubernetes#39816.
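For illustration, here is a caller-side view of why the chosen delay matters. This is a sketch, not code from this PR; the import path shown is the post-1.6 apimachinery location (in the 1.5 tree the package lived under k8s.io/kubernetes/pkg/util/runtime), and the loop is contrived. Every component that reports errors through HandleError picks up the global sleep, so a tight error loop gets throttled, but well-behaved callers also pay the delay.

package main

import (
	"fmt"
	"time"

	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
)

func main() {
	for i := 0; i < 5; i++ {
		start := time.Now()
		// With this change, reporting errors faster than the minimum period
		// causes HandleError to sleep before returning.
		utilruntime.HandleError(fmt.Errorf("simulated failure %d", i))
		fmt.Printf("iteration %d took %v\n", i, time.Since(start))
	}
}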
OK, this should be mergeable now.
Added a tag to resolve questions about alternative solutions before this merges and affects all callers.
I'll second what @deads2k said. Let's focus on the code that's hot-looping and formalize our error handling functionality in the controller framework.
So there are two different arguments going on here:
David is concerned that 2 should be fixed at the same time as 1, and that 1 makes 2 not work as well (because it allows other code that is not controllers to break controllers). I agree with this, but we'd need to police the mechanism for 2. So:
Doesn't glog have a rate limiter? Why wouldn't we also set that?
It's being used by the majority, but it's not being used by GC. This effectively breaks the backoff handling of most controllers in a way that unnecessarily blocks execution of that controller (AddRateLimited doesn't), and the majority of callers from the controller packages don't want a delay like this.

Further, if you're building something just to stop a hotloop on an unconsidered error (again, not the case in the majority of controllers), you can sleep for a very short period (tens of milliseconds), and you'd do it unconditionally since the purpose is just to avoid DDoS-ing yourself. However, since it's the opposite of what the majority of controllers want, all the existing controllers with proper handling need to be updated to not use this new (or severely changed) method.

Ending up in this place means that, instead of fixing rate limiting for the GC controller (clients have rate limiters) and instead of fixing the GC controller to use AddRateLimited (libraries already exist), we've taken the biggest hammer we have, applied it across multiple processes, and allowed errors in one goroutine to negatively rate limit already rate-limited controllers for one bad apple.
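For comparison, the per-item backoff pattern referred to above looks roughly like this. It is a sketch under the assumption of the modern client-go workqueue import path (at the time of this PR the package lived in the main kubernetes repo); the queue, key, and process names are illustrative. A failure delays only the failing item instead of sleeping inside a shared error handler.

package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

// process stands in for a controller's real reconcile logic.
func process(key string) error {
	return fmt.Errorf("transient failure for %s", key)
}

func main() {
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	queue.Add("namespace/name")

	key, shutdown := queue.Get()
	if shutdown {
		return
	}
	defer queue.Done(key)

	if err := process(key.(string)); err != nil {
		// Re-queue with per-item exponential backoff; other items keep flowing.
		queue.AddRateLimited(key)
		return
	}
	// On success, clear the item's failure history.
	queue.Forget(key)
}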
Yes, that's 2, and we need 3 to prevent it from being abused.
I had the expectation that
I would expect us to fix that in 1.6.
#38679 switches GC to a rate-limited work queue for 1.6, and a simple wait could be added here https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/garbagecollector/garbagecollector.go#L594 as a minimal touch in 1.5, instead of disrupting most of the other controllers in the 1.5 stream.
I did this because I never want to have an afternoon destroyed by someone logging every 20 microseconds again, and this seemed like the most general place with the fewest side effects. I don't care if we make it 50ms instead of 500. Should we really be
We spoke in slack and decided that a 1ms delay would protect infrastructure with minimal impact to callers like controllers. Once it's updated, I'm ok with sleeping here.
@lavalamp can you cherrypick into 1.4 while you are at it?
We had a discussion on slack and it seems 1ms is the number everyone can live with for a global limit like this. I will update this PR and send an adjustment to the 1.5 branch.
OK, number adjusted, in a second commit so it'll be easier to cherrypick.
/lgtm
@grodrigues3 @apelisse The bot seems confused about the LGTM ordering here? Or is there some other reason why this is stuck?
Agreed, there is some confusion. Thanks, Lavalamp; somebody noticed this bug before but we couldn't figure out what was going wrong. I think I somewhat understand now.
Automatic merge from submit-queue
Commit found in the "release-1.5" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error, find help to get your PR picked.
Add an error "handler" that just sleeps for a bit if errors happen more often than once every 500ms. Manually tested against #39816. This doesn't fix #39816, but it does keep it from crippling a cluster.