kvserver: exempt certain requests from circuit breakers #76673
Conversation
Force-pushed from 0bcc939 to a552ad7
I find this a bit worrisome. If bulk ops trigger the circuit breaker, that indicates the circuit breaker may be too aggressive.
However, it does make sense for certain work to wait on circuit breakers instead of erroring, typically async vs. interactive work. While request classification is a good proxy for this, it seems like it would be better to let the client decide this e.g. via a request parameter?
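To make the "let the client decide" variant concrete, here is a minimal sketch. The types and names (BatchHeader, WaitOnBreaker, evaluate) are hypothetical stand-ins for illustration, not the actual roachpb/kvserver API: async/bulk callers opt into waiting on a tripped breaker, interactive callers fail fast.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// BatchHeader is a hypothetical request header; the real roachpb.Header
// has no such field today.
type BatchHeader struct {
	// WaitOnBreaker asks the server to queue behind a tripped circuit
	// breaker instead of failing fast. Async/bulk work would set this;
	// interactive queries would leave it false.
	WaitOnBreaker bool
}

var errBreakerOpen = errors.New("replica circuit breaker: tripped")

// evaluate is a stand-in for server-side dispatch: fail fast by default,
// but wait for the breaker to heal when the client opted in.
func evaluate(ctx context.Context, h BatchHeader, tripped func() bool) error {
	if !tripped() {
		return nil
	}
	if !h.WaitOnBreaker {
		return errBreakerOpen // interactive traffic gets a fast error
	}
	for tripped() {
		select {
		case <-ctx.Done():
			return ctx.Err() // give up only when the caller does
		case <-time.After(time.Second):
		}
	}
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	alwaysTripped := func() bool { return true }

	fmt.Println(evaluate(ctx, BatchHeader{}, alwaysTripped))                    // fails fast
	fmt.Println(evaluate(ctx, BatchHeader{WaitOnBreaker: true}, alwaysTripped)) // waits, then ctx error
}
```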
Nothing significant to note on the actual implementation though.
There are two distinct points. In my testing, the circuit breaker was set to 15s, which may be too aggressive; that is why #76146 is open. But seeing the fail-fast behavior on bulk operations, I realized it would be problematic even with a more reasonable breaker setting in case of an actual outage: a large (say multi-day) import should ideally preserve partial progress across a temporary loss of availability. The old behavior is better, and that is what motivates this PR; it's not that I'm trying to get to a place where breakers tripping during "normal" overload is okay.
I had thought about this before typing up this PR, and I think it's possibly desirable as an end state. But at this point in the cycle, it seemed problematic to ask the internal consumers to slap a bool on all of their requests (which they may forget the next time they make a change, and there likely won't be an end-to-end test catching it). Maybe there is a good "meta" solution where we say anything that runs under a job shouldn't hit circuit breakers? Getting this right becomes a bigger question, and I would rather not hold up this straightforward change that "gets the job done" well enough for 22.1 (I still have to verify that, btw) for a change that will necessarily take some time to reach consensus and be fleshed out, since shipping breakers without any kind of bulk opt-out seems possibly dangerous.
We could continue to have an individual request hit and error on the breakers in jobs, but could then treat a circuit breaker error the same way we handle a node failure or other error at the top-level job execution loop: the job has to restart (from persisted progress), replanning its execution graph (which might now exclude the bad node if it fell out of the live instances).
(Though at least having a setting to bypass and keep the current behavior seems nice to have no matter what, e.g. in case it turns out we get stuck replanning over and over.)
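A rough sketch of that replan-on-breaker-error loop, under assumed names (errBreakerOpen, loadProgress, replan, runPlan are hypothetical job-framework hooks, not the real jobs API); the backoff is there precisely so a persistent outage doesn't turn into a tight replanning loop:

```go
package main

import (
	"context"
	"errors"
	"time"
)

var errBreakerOpen = errors.New("replica circuit breaker: tripped")

// Hypothetical hooks into the job framework.
type plan struct{ excludeNodes []int }

func loadProgress(ctx context.Context) (string, error)       { return "", nil }
func replan(ctx context.Context, checkpoint string) (plan, error) { return plan{}, nil }
func runPlan(ctx context.Context, p plan) error               { return nil }

// runJob restarts from persisted progress whenever execution fails with a
// retryable error (a node failure or, here, an open circuit breaker),
// backing off between attempts so a long outage doesn't replan in a tight loop.
func runJob(ctx context.Context) error {
	backoff := time.Second
	for {
		checkpoint, err := loadProgress(ctx)
		if err != nil {
			return err
		}
		p, err := replan(ctx, checkpoint)
		if err != nil {
			return err
		}
		if err := runPlan(ctx, p); err == nil {
			return nil
		} else if !errors.Is(err, errBreakerOpen) {
			return err // non-retryable failure surfaces to the user
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		if backoff < time.Minute {
			backoff *= 2
		}
	}
}

func main() { _ = runJob(context.Background()) }
```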
Thanks David! That still doesn't seem right to me, as the retry policy isn't obvious. If a range is down, you basically expect to get that error every time you talk to any of the replicas. So you're going into a replanning loop where we'd have to be careful not to do too much work in a tight loop (you point this out, I know). I think there is probably some work and testing required to make that sound, and it's not completely obvious what the upshot is. I could imagine observability would be better under the fail-fast scenario, assuming it is exposed adequately. But that seems like step 2; right now we're mostly focused on improving the UX for user traffic in this first step, and opting in and out of fail-fast behavior still strikes me as something that is better to add in a separate round (or not at all, if there isn't a need for some time).

I think it comes down to how much Bulk/CDC are willing to invest here. Are you folks interested in making jobs handle fail-fast errors gracefully? Then I can easily make a change at the KV layer to remove the hard-coded opt-out. But I would probably still want to merge this PR to avoid putting the circuit breakers at risk, unless there are reasons that make fail-slow behavior for the requests mentioned in the PR "dangerous". Since this is simply preserving the pre-circuit-breaker behavior, I have trouble imagining what that could be.
Sure, yes, I buy that we can improve things for queries now without making jobs worse, then do jobs later, so I have no objection to this PR as-is if it just maintains the status quo for job requests. That said, jobs also really do suffer from silently hanging requests, and in many ways it is worse than for queries: it is obvious that a query that normally takes 8ms is stuck if it hasn't returned after 12 minutes, but if a job that runs for 2h has been stuck on a bad range for 12 minutes, it's easy to miss that / think it is just slower today. If/when a job isn't making progress because something it depends on can't fulfill a request, telling the job that instead of just hanging would be a win, and would work towards addressing one of the absolute top issues with big jobs, i.e. that it is consistently hard to tell if, when, and most importantly why they are not performing as expected.
Some more ideas for when we revisit this:
I'm fine with this as a temporary solution, and will defer to Bulk on the longer-term solution. To avoid the tedium of a request parameter or something, we could add an RPC middleware when we instantiate the RPC client which adjusts the breaker behavior for the entire client session.
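As a sketch of that middleware idea (assuming a gRPC unary client interceptor and a made-up metadata key; the actual CockroachDB RPC layer and any server-side handling of the key are not shown), the whole client session could be tagged once at dial time:

```go
package main

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/metadata"
)

// breakerPolicyInterceptor tags every outgoing RPC from this client session
// with a hypothetical metadata key that a server could use to decide whether
// the request waits on or bypasses circuit breakers.
func breakerPolicyInterceptor(policy string) grpc.UnaryClientInterceptor {
	return func(
		ctx context.Context,
		method string,
		req, reply interface{},
		cc *grpc.ClientConn,
		invoker grpc.UnaryInvoker,
		opts ...grpc.CallOption,
	) error {
		ctx = metadata.AppendToOutgoingContext(ctx, "x-breaker-policy", policy)
		return invoker(ctx, method, req, reply, cc, opts...)
	}
}

// dialBulkClient opens a connection on which every RPC carries the "wait"
// policy, so individual requests need no extra parameter.
func dialBulkClient(target string) (*grpc.ClientConn, error) {
	return grpc.Dial(target,
		grpc.WithTransportCredentials(insecure.NewCredentials()), // illustration only
		grpc.WithUnaryInterceptor(breakerPolicyInterceptor("wait")),
	)
}

func main() {
	conn, err := dialBulkClient("localhost:26257")
	if err == nil {
		defer conn.Close()
	}
}
```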
Preemptive approval, will leave it to you to decide on a path forward.
That's good to know. The option to fail fast is easy to add once Bulk (or anyone really) is interested in using this. Let me know if Bulk is going to work on this for 22.1 and I can make sure the kvserver piece is in place in time. I can also add it preemptively if you think you just might work on it if it were there.
This sounds like you want a per-request deadline (e.g. context cancellation) but with more information coming back, right? You don't necessarily care if the breaker is open. If it takes 120s to run an ExportRequest you might want to surface that somehow, even if the replication layer is healthy.
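For reference, the per-request deadline variant is mostly context plumbing; in this sketch sendExport is a stand-in for issuing the actual ExportRequest, and the 5s budget is arbitrary:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// sendExport is a stand-in for issuing an ExportRequest.
func sendExport(ctx context.Context) error {
	select {
	case <-time.After(120 * time.Second): // pretend the request is slow
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	// Per-request deadline: the caller bounds how long it is willing to
	// wait, independently of whether a circuit breaker is open.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	if err := sendExport(ctx); errors.Is(err, context.DeadlineExceeded) {
		// Surface the slowness explicitly instead of hanging silently.
		fmt.Println("export did not finish within 5s:", err)
	}
}
```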
Force-pushed from a552ad7 to 5b9f073
Force-pushed from 5b9f073 to 957c6a7
bors r=erikgrinaker
bors single on
Build succeeded
While running the nightly roachtest suite with circuit breakers enabled
(#76146) I observed undesirable interactions between circuit breakers
and bulk operations.
Bulk operations tend to overload the cluster. When this happens, circuit
breakers may fire "spuriously" (at the very least if liveness is lost
across multiple nodes for some time), which returns hard errors to the
bulk job. The job will then roll back any work it has done, which may
also fail due to the circuit breaker (leaving the job in limbo).
For long-running bulk jobs, it seems desirable to bypass the circuit
breaker entirely. When unavailability occurs, the job will not be able
to make progress, but it should be left to the operator whether to
cancel it or not; after all, once the outage resolves the job may
be able to resume normally.
This PR adds machinery that allows request types to bypass circuit
breakers. Concretely, any batch that contains any of the following
request types bypasses the breaker:
- Export
- AddSSTable
- RevertRange
- ClearRange
- GC
- Probe

Touches #33007.
Release note: None
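For readers skimming the PR, here is a standalone sketch of the classification described above. The types are simplified stand-ins, not the actual roachpb.BatchRequest machinery: a batch is exempt as soon as it contains any request type in the bypass set.

```go
package main

import "fmt"

// Simplified stand-ins for the KV request plumbing.
type Method string

const (
	Export      Method = "Export"
	AddSSTable  Method = "AddSSTable"
	RevertRange Method = "RevertRange"
	ClearRange  Method = "ClearRange"
	GC          Method = "GC"
	Probe       Method = "Probe"
	Get         Method = "Get"
)

type Request struct{ Method Method }
type BatchRequest struct{ Requests []Request }

// Request types that should wait out unavailability rather than fail fast.
var bypassMethods = map[Method]bool{
	Export: true, AddSSTable: true, RevertRange: true,
	ClearRange: true, GC: true, Probe: true,
}

// bypassesCircuitBreaker reports whether the batch should skip the replica
// circuit breaker, based purely on the request types it contains.
func bypassesCircuitBreaker(ba BatchRequest) bool {
	for _, r := range ba.Requests {
		if bypassMethods[r.Method] {
			return true
		}
	}
	return false
}

func main() {
	bulk := BatchRequest{Requests: []Request{{Method: AddSSTable}}}
	interactive := BatchRequest{Requests: []Request{{Method: Get}}}
	fmt.Println(bypassesCircuitBreaker(bulk))        // true: waits on the breaker
	fmt.Println(bypassesCircuitBreaker(interactive)) // false: fails fast
}
```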