
kvserver: exempt certain requests from circuit breakers #76673

Merged: 1 commit merged into cockroachdb:master on Feb 21, 2022

Conversation

tbg
Member

@tbg tbg commented Feb 16, 2022

While running the nightly roachtest suite with circuit breakers enabled
(#76146) I observed undesirable interactions between circuit breakers
and bulk operations.

Bulk operations tend to overload the cluster. When this happens, circuit
breakers may fire "spuriously" (at the very least if liveness is lost
across multiple nodes for some time), which returns hard errors to the
bulk job. The job will then roll back any work it has done, which may
also fail due to the circuit breaker (leaving the job in limbo).

For long-running bulk jobs, it seems desirable to bypass the circuit
breaker entirely. When unavailability occurs, the job will not be able
to make progress, but it should be left to the operator whether to
cancel it or not; after all, once the outage resolves the job may
be able to resume normally.

This PR adds machinery that allows request types to bypass circuit
breakers. Concretely, any batch that contains any of the following
request types bypasses the breaker (see the sketch after this
description):

  • Export
  • AddSSTable
  • RevertRange
  • ClearRange
  • GC
  • Probe

Touches #33007.

Release note: None
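
Not part of the PR text: a minimal sketch, in Go, of what such a batch-level exemption check could look like. Everything here (the Method constants, bypassesCircuitBreaker) is hypothetical and only illustrates the idea; the actual check lives in the kvserver/roachpb code reviewed below.

```go
package main

import "fmt"

// Method stands in for the request-type enum; the names mirror the exempt
// request types listed in the PR description, plus two regular ones.
type Method int

const (
	Export Method = iota
	AddSSTable
	RevertRange
	ClearRange
	GC
	Probe
	Get
	Put
)

// bypassesCircuitBreaker reports whether a batch containing the given request
// types should skip fail-fast circuit-breaker behavior.
func bypassesCircuitBreaker(methods []Method) bool {
	exempt := map[Method]bool{
		Export: true, AddSSTable: true, RevertRange: true,
		ClearRange: true, GC: true, Probe: true,
	}
	for _, m := range methods {
		if exempt[m] {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(bypassesCircuitBreaker([]Method{Put, AddSSTable})) // true
	fmt.Println(bypassesCircuitBreaker([]Method{Get, Put}))        // false
}
```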

@cockroach-teamcity
Member

This change is Reviewable

@tbg tbg force-pushed the circuit-breaker-exemptoins branch from 0bcc939 to a552ad7 on February 17, 2022 09:58
@tbg tbg marked this pull request as ready for review February 17, 2022 09:59
@tbg tbg requested a review from a team as a code owner February 17, 2022 09:59
@tbg tbg requested review from erikgrinaker and a team February 17, 2022 09:59
Contributor

@erikgrinaker erikgrinaker left a comment


I find this a bit worrisome. If bulk ops trigger the circuit breaker, that indicates that the circuit breaker may be too aggressive.

However, it does make sense for certain work to wait on circuit breakers instead of erroring, typically async vs. interactive work. While request classification is a good proxy for this, it seems like it would be better to let the client decide this e.g. via a request parameter?

Nothing significant to note on the actual implementation though.
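
To make the "request parameter" alternative concrete, here is a hedged sketch of what a client-specified knob might look like: a policy field on the batch header that async work sets once. The names (CircuitBreakerPolicy, BreakerPolicy) are invented for illustration and are not part of this PR.

```go
package sketch

// CircuitBreakerPolicy is a hypothetical client-side knob controlling how a
// request behaves when the replica circuit breaker is tripped.
type CircuitBreakerPolicy int

const (
	// FailFast returns an error as soon as the breaker is tripped (the
	// default behavior for interactive queries).
	FailFast CircuitBreakerPolicy = iota
	// WaitForRecovery blocks, respecting the request's context, until the
	// breaker untrips, i.e. the pre-circuit-breaker behavior.
	WaitForRecovery
)

// BatchHeader is a stand-in for the batch header; a job's client would set
// BreakerPolicy once and have it apply to every request in the batch.
type BatchHeader struct {
	BreakerPolicy CircuitBreakerPolicy
}
```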

Resolved review threads (outdated):

  • pkg/kv/kvserver/replica.go
  • pkg/roachpb/api.go
  • pkg/kv/kvserver/replica_backpressure.go
@tbg
Member Author

tbg commented Feb 17, 2022

> I find this a bit worrisome. If bulk ops trigger the circuit breaker, that indicates that the circuit breaker may be too aggressive.

There are two distinct points here. In my testing, the circuit breaker was set to 15s, which may be too aggressive; that is why #76146 is open. But seeing the fail-fast behavior on bulk operations, I realized it would be problematic even with a more reasonable breaker setting in case of an actual outage: a large (say multi-day) import should ideally preserve partial progress across a temporary loss of availability. The old behavior is better here, and that is what motivates this PR; it's not that I'm trying to get to a place where breakers tripping during "normal" overload is okay.

> it seems like it would be better to let the client decide this e.g. via a request parameter?

I had thought about this before typing up this PR, and as an end state it's possibly desirable. But at this point in the cycle, it seemed problematic to ask the internal consumers to slap a bool on all of their requests (which they may forget the next time they make a change, and likely there won't be an end-to-end test catching it). Maybe there is a good "meta" solution where we say that anything running under a job shouldn't hit circuit breakers? Getting this right becomes a bigger question, and I would rather not hold up this straightforward change, which "gets the job done" well enough for 22.1 (I still have to verify that, btw), for one that will necessarily take some time to reach consensus and be fleshed out; shipping breakers without any kind of bulk opt-out seems possibly dangerous.

@dt
Member

dt commented Feb 17, 2022

We could continue to have an individual request hit and error on the breakers in jobs, but then treat a circuit-breaker error the same way we handle a node failure or any other error at the top-level job execution loop: the job restarts (from persisted progress), replanning its execution graph (which might now exclude the bad node if it fell out of the live instances).
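
As a rough illustration of that loop, assuming hypothetical plan/run hooks and a generic breaker error (this is not CockroachDB's actual jobs code, just the shape of "restart from persisted progress and replan"):

```go
package sketch

import (
	"context"
	"errors"
	"time"
)

var errBreakerOpen = errors.New("replica circuit breaker tripped")

// runJob restarts the job's flow from persisted progress whenever a
// breaker-style error comes back, replanning the execution graph each time.
func runJob(ctx context.Context, plan func() func(context.Context) error) error {
	for {
		run := plan() // replan; nodes that fell out of the live set may be excluded
		err := run(ctx)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errBreakerOpen) {
			return err // unrelated failure: surface it
		}
		// Breaker error: back off briefly, then replan and resume from
		// persisted progress instead of rolling the job back.
		select {
		case <-time.After(10 * time.Second):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```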

@dt
Member

dt commented Feb 17, 2022

(Though a setting to bypass and keep the current behavior seems nice to have no matter what, e.g. in case it turns out we get stuck replanning over and over.)

@tbg
Member Author

tbg commented Feb 17, 2022

Thanks David! That still doesn't seem right, as the retry policy isn't obvious. If a range is down, you basically expect to get that error every time you talk to any of the replicas. So you're going into a replanning loop where we'd have to be careful not to do too much work in a tight loop (you point this out, I know). I think there is probably some work and testing required to make that sound, and it's not completely obvious what the upshot is. I could imagine observability would be better under the fail-fast scenario, assuming it is exposed adequately. But that seems like step 2; right now we're mostly focused on improving the UX for user traffic in this first step, and opting in and out of fail-fast behavior still strikes me as something that is better to add in a separate round (or not at all, if there isn't a need for some time).

I think it comes down to how much Bulk/CDC are willing to invest here. Are you folks interested in making jobs handle fail-fast errors gracefully? Then I can easily make a change at the KV layer to remove the hard-coded opt-out. But I would probably still want to merge this PR to avoid putting the circuit breakers at risk, unless there are reasons that make fail-slow behavior for the requests mentioned in the PR "dangerous". But since this simply preserves the pre-circuit-breaker behavior, I have trouble imagining what this could be.

@dt
Member

dt commented Feb 17, 2022

Sure, yes, I buy that we can improve things for queries now without making jobs worse, then do jobs later, so I have no objection to this PR as-is if it just maintains the status quo for job requests.

That said, jobs also really do suffer from silently hanging requests, and in many ways it is worse than it is for queries: it is obvious that a query that usually takes 8ms is stuck when it hasn't returned after 12 minutes. If a job that runs for 2h has been stuck on a bad range for 12min, it's easy to miss that / think it is just slower today. If/when a job isn't making progress because something it depends on can't fulfill a request, telling the job that instead of just hanging would be a win, and would work towards addressing one of the absolute top issues with big jobs, i.e. that it is consistently hard to tell if, when, and most importantly why, they are not performing as expected.

@dt
Member

dt commented Feb 17, 2022

Some more ideas for when we revisit this (a rough sketch follows the list):

  1. yes, a job is willing to wait longer for the situation to clear up than a low-latency query request; we still want to "fail fast" if we're not going to get a reply, but our idea of "fast" is probably more like 15-30s if the breaker is still open. Even if we then wait a bit and retry the flow later (in 30/60/120/300s, etc.) and then succeed, tearing down a flow only to retry throws away up to a minute, or several GB, of progress per node in most jobs, so we're happy to be more lenient and see if we can just wait it out at first.

  2. If after a delay and retry in the coordinator loop we're still hitting the same breaker error, we could elect to move the job to a paused state with the relevant error message, explicitly signaling to the operator that this job hit some condition in the cluster that indicated it would not be able to complete successfully; they can now elect to cancel it and roll back its progress, or try to resume it again if they've remedied that condition / it has passed on its own.
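
A rough sketch combining the two ideas, with hypothetical attempt/pause hooks and the delays mentioned above (illustrative only, not actual jobs code):

```go
package sketch

import (
	"context"
	"fmt"
	"time"
)

// retryOrPause retries the flow with escalating delays while the breaker
// error persists; if it still fails after the last delay, the job is moved
// to a paused state with the error so an operator can cancel or resume it.
func retryOrPause(ctx context.Context, attempt func(context.Context) error,
	pause func(reason error)) error {
	err := attempt(ctx)
	if err == nil {
		return nil
	}
	for _, d := range []time.Duration{
		30 * time.Second, 60 * time.Second, 120 * time.Second, 300 * time.Second,
	} {
		select {
		case <-time.After(d):
		case <-ctx.Done():
			return ctx.Err()
		}
		if err = attempt(ctx); err == nil {
			return nil
		}
	}
	pause(fmt.Errorf("pausing after repeated circuit-breaker errors: %w", err))
	return err
}
```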

@erikgrinaker
Contributor

I'm fine with this as a temporary solution, and will defer to bulk on the longer-term solution.

To avoid the tedium of a request parameter or something, we could add an RPC middleware when we instantiate the RPC client that adjusts the breaker behavior for the entire client session.
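
For illustration, a hedged sketch of that middleware idea as a plain gRPC client interceptor installed once at dial time; the metadata key, policy string, and address are made up, and this is not how CockroachDB's RPC layer is actually wired.

```go
package sketch

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/metadata"
)

// breakerPolicyInterceptor tags every outgoing RPC on this client connection
// with a session-wide circuit-breaker policy.
func breakerPolicyInterceptor(policy string) grpc.UnaryClientInterceptor {
	return func(ctx context.Context, method string, req, reply interface{},
		cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
		ctx = metadata.AppendToOutgoingContext(ctx, "x-breaker-policy", policy)
		return invoker(ctx, method, req, reply, cc, opts...)
	}
}

// Usage: the whole session waits on tripped breakers instead of failing fast.
//   conn, err := grpc.Dial("node1:26257",
//       grpc.WithUnaryInterceptor(breakerPolicyInterceptor("wait")))
```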

Contributor

@erikgrinaker erikgrinaker left a comment


Preemptive approval, will leave it to you to decide on a path forward.

@tbg
Member Author

tbg commented Feb 17, 2022

> That said, jobs also really do suffer from silently hanging requests, and in many ways it is worse than it is for queries: it is obvious that a query that usually takes 8ms is stuck when it hasn't returned after 12 minutes. If a job that runs for 2h has been stuck on a bad range for 12min, it's easy to miss that / think it is just slower today. If/when a job isn't making progress because something it depends on can't fulfill a request, telling the job that instead of just hanging would be a win, and would work towards addressing one of the absolute top issues with big jobs, i.e. that it is consistently hard to tell if, when, and most importantly why, they are not performing as expected.

That's good to know. The option to fail fast is easy to add once Bulk (or anyone really) is interested in using this. Let me know if Bulk is going to work on this for 22.1 and I can make sure the kvserver piece is in place in time. I can also add it preemptively if you think you just might work on it if it were there.

> yes, a job is willing to wait longer for the situation to clear up than a low-latency query request; we still want to "fail fast" if we're not going to get a reply, but our idea of "fast" is probably more like 15-30s if the breaker is still open. Even if we then wait a bit and retry the flow later (in 30/60/120/300s, etc.) and then succeed, tearing down a flow only to retry throws away up to a minute, or several GB, of progress per node in most jobs, so we're happy to be more lenient and see if we can just wait it out at first.

This sounds like you want a per-request deadline (e.g. context cancellation) but with more information coming back, right? You don't necessarily care if the breaker is open. If it takes 120s to run an ExportRequest you might want to surface that somehow, even if the replication layer is healthy.
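
For illustration, a minimal sketch of that per-request deadline framing, with a hypothetical sendExport callback standing in for issuing the actual ExportRequest:

```go
package sketch

import (
	"context"
	"fmt"
	"time"
)

// exportWithDeadline bounds how long a single request may block (e.g. on a
// tripped breaker) and surfaces slowness even when the request succeeds.
func exportWithDeadline(ctx context.Context,
	sendExport func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()
	start := time.Now()
	err := sendExport(ctx)
	if elapsed := time.Since(start); elapsed > 15*time.Second {
		fmt.Printf("export took %s; replication may be degraded\n", elapsed)
	}
	return err
}
```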

@tbg tbg force-pushed the circuit-breaker-exemptoins branch from 5b9f073 to 957c6a7 on February 21, 2022 15:41
@tbg
Member Author

tbg commented Feb 21, 2022

bors r=erikgrinaker

@tbg
Member Author

tbg commented Feb 21, 2022

bors single on

@craig
Contributor

craig bot commented Feb 21, 2022

Build succeeded.

@craig craig bot merged commit 5b3067f into cockroachdb:master Feb 21, 2022
@tbg tbg deleted the circuit-breaker-exemptoins branch February 21, 2022 22:59