
kvserver: exempt certain requests from circuit breakers #76673

Merged: 1 commit merged into cockroachdb:master on Feb 21, 2022

Conversation

tbg
Member

@tbg tbg commented Feb 16, 2022

While running the nightly roachtest suite with circuit breakers enabled
(#76146) I observed undesirable interactions between circuit breakers
and bulk operations.

Bulk operations tend to overload the cluster. When this happens, circuit
breakers may fire "spuriously" (at the very least if liveness is lost
across multiple nodes for some time), which returns hard errors to the
bulk job. The job will then roll back any work it has done, which may
also fail due to the circuit breaker (leaving the job in limbo).

For long-running bulk jobs, it seems desirable to bypass the circuit
breaker entirely. When unavailability occurs, the job will not be able
to make progress, but it should be left to the operator whether to
cancel it or not; after all, once the outage resolves the job may
be able to resume normally.

This PR adds machinery that allows request types to bypass circuit
breakers. Concretely, any batch that contains any of the following
request types bypasses the breaker (see the sketch after this
description):

  • Export
  • AddSSTable
  • RevertRange
  • ClearRange
  • GC
  • Probe

Touches #33007.

Release note: None
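
Not part of the PR text: a minimal sketch, in Go, of what such a batch-level exemption check could look like. Everything here (the Method constants, bypassesCircuitBreaker) is hypothetical and only illustrates the idea; the actual check lives in the kvserver/roachpb code reviewed below.

```go
package main

import "fmt"

// Method stands in for the request-type enum; the names mirror the exempt
// request types listed in the PR description, plus two regular ones.
type Method int

const (
	Export Method = iota
	AddSSTable
	RevertRange
	ClearRange
	GC
	Probe
	Get
	Put
)

// bypassesCircuitBreaker reports whether a batch containing the given request
// types should skip fail-fast circuit-breaker behavior.
func bypassesCircuitBreaker(methods []Method) bool {
	exempt := map[Method]bool{
		Export: true, AddSSTable: true, RevertRange: true,
		ClearRange: true, GC: true, Probe: true,
	}
	for _, m := range methods {
		if exempt[m] {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(bypassesCircuitBreaker([]Method{Put, AddSSTable})) // true
	fmt.Println(bypassesCircuitBreaker([]Method{Get, Put}))        // false
}
```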

@cockroach-teamcity
Member

This change is Reviewable

@tbg tbg force-pushed the circuit-breaker-exemptoins branch from 0bcc939 to a552ad7 on February 17, 2022 09:58
@tbg tbg marked this pull request as ready for review February 17, 2022 09:59
@tbg tbg requested a review from a team as a code owner February 17, 2022 09:59
@tbg tbg requested review from erikgrinaker and a team February 17, 2022 09:59
Contributor

@erikgrinaker erikgrinaker left a comment


I find this a bit worrisome. If bulk ops trigger the circuit breaker, that indicates that the circuit breaker may be too aggressive.

However, it does make sense for certain work to wait on circuit breakers instead of erroring, typically async vs. interactive work. While request classification is a good proxy for this, it seems like it would be better to let the client decide this e.g. via a request parameter?

Nothing significant to note on the actual implementation though.
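
To make the "request parameter" alternative concrete, here is a hedged sketch of what a client-specified knob might look like: a policy field on the batch header that async work sets once. The names (CircuitBreakerPolicy, BreakerPolicy) are invented for illustration and are not part of this PR.

```go
package sketch

// CircuitBreakerPolicy is a hypothetical client-side knob controlling how a
// request behaves when the replica circuit breaker is tripped.
type CircuitBreakerPolicy int

const (
	// FailFast returns an error as soon as the breaker is tripped (the
	// default behavior for interactive queries).
	FailFast CircuitBreakerPolicy = iota
	// WaitForRecovery blocks, respecting the request's context, until the
	// breaker untrips, i.e. the pre-circuit-breaker behavior.
	WaitForRecovery
)

// BatchHeader is a stand-in for the batch header; a job's client would set
// BreakerPolicy once and have it apply to every request in the batch.
type BatchHeader struct {
	BreakerPolicy CircuitBreakerPolicy
}
```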

Resolved review threads (outdated):

  • pkg/kv/kvserver/replica.go
  • pkg/roachpb/api.go
  • pkg/kv/kvserver/replica_backpressure.go
@tbg
Member Author

tbg commented Feb 17, 2022

> I find this a bit worrisome. If bulk ops trigger the circuit breaker, that indicates that the circuit breaker may be too aggressive.

There are two distinct points here. In my testing, the circuit breaker was set to 15s, which may be too aggressive; that is why #76146 is open. But seeing the fail-fast behavior on bulk operations, I realized it would be problematic even with a more reasonable breaker setting in case of an actual outage: a large (say multi-day) import should ideally preserve partial progress across a temporary loss of availability. The old behavior is better here, and that is what motivates this PR; it's not that I'm trying to get to a place where breakers tripping during "normal" overload is okay.

> it seems like it would be better to let the client decide this e.g. via a request parameter?

I had thought about this before typing up this PR, and as an end state it's possibly desirable. But at this point in the cycle, it seemed problematic to ask the internal consumers to slap a bool on all of their requests (which they may forget the next time they make a change, and likely there won't be an end-to-end test catching it). Maybe there is a good "meta" solution where we say that anything running under a job shouldn't hit circuit breakers? Getting this right becomes a bigger question, and I would rather not hold up this straightforward change, which "gets the job done" well enough for 22.1 (I still have to verify that, btw), for one that will necessarily take some time to reach consensus and be fleshed out; shipping breakers without any kind of bulk opt-out seems possibly dangerous.

@dt
Member

dt commented Feb 17, 2022

We could continue to have an individual request hit and error on the breakers in jobs, but then treat a circuit-breaker error the same way we handle a node failure or any other error at the top-level job execution loop: the job restarts (from persisted progress), replanning its execution graph (which might now exclude the bad node if it fell out of the live instances).
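
As a rough illustration of that loop, assuming hypothetical plan/run hooks and a generic breaker error (this is not CockroachDB's actual jobs code, just the shape of "restart from persisted progress and replan"):

```go
package sketch

import (
	"context"
	"errors"
	"time"
)

var errBreakerOpen = errors.New("replica circuit breaker tripped")

// runJob restarts the job's flow from persisted progress whenever a
// breaker-style error comes back, replanning the execution graph each time.
func runJob(ctx context.Context, plan func() func(context.Context) error) error {
	for {
		run := plan() // replan; nodes that fell out of the live set may be excluded
		err := run(ctx)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errBreakerOpen) {
			return err // unrelated failure: surface it
		}
		// Breaker error: back off briefly, then replan and resume from
		// persisted progress instead of rolling the job back.
		select {
		case <-time.After(10 * time.Second):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```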

@dt
Member

dt commented Feb 17, 2022

(Though a setting to bypass and keep the current behavior seems nice to have no matter what, e.g. in case it turns out we get stuck replanning over and over.)

@tbg
Member Author

tbg commented Feb 17, 2022

Thanks David! That still doesn't seem right, as the retry policy isn't obvious. If a range is down, you basically expect to get that error every time you talk to any of the replicas. So you're going into a replanning loop where we'd have to be careful not to do too much work in a tight loop (you point this out, I know). I think there is probably some work and testing required to make that sound, and it's not completely obvious what the upshot is. I could imagine observability would be better under the fail-fast scenario, assuming it is exposed adequately. But that seems like step 2; right now we're mostly focused on improving the UX for user traffic in this first step, and opting in and out of fail-fast behavior still strikes me as something that is better to add in a separate round (or not at all, if there isn't a need for some time).

I think it comes down to how much Bulk/CDC are willing to invest here. Are you folks interested in making jobs handle fail-fast errors gracefully? Then I can easily make a change at the KV layer to remove the hard-coded opt-out. But I would probably still want to merge this PR to avoid putting the circuit breakers at risk, unless there are reasons that make fail-slow behavior for the requests mentioned in the PR "dangerous". But since this simply preserves the pre-circuit-breaker behavior, I have trouble imagining what this could be.

@dt
Member

dt commented Feb 17, 2022

Sure, yes, I buy that we can improve things for queries now without making jobs worse, then do jobs later, so I have no objection to this PR as-is if it just maintains the status quo for job requests.

That said, jobs also really do suffer from silently hanging requests, and in many ways it is worse than it is for queries: it is obvious that a query that usually takes 8ms is stuck when it hasn't returned after 12 minutes. If a job that runs for 2h has been stuck on a bad range for 12min, it's easy to miss that / think it is just slower today. If/when a job isn't making progress because something it depends on can't fulfill a request, telling the job that instead of just hanging would be a win, and would work towards addressing one of the absolute top issues with big jobs, i.e. that it is consistently hard to tell if, when, and most importantly why, they are not performing as expected.

@dt
Member

dt commented Feb 17, 2022

Some more ideas for when we revisit this (a rough sketch follows the list):

  1. yes, a job is willing to wait longer for the situation to clear up than a low-latency query request; we still want to "fail fast" if we're not going to get a reply, but our idea of "fast" is probably more like 15-30s if the breaker is still open. Even if we then wait a bit and retry the flow later (in 30/60/120/300s, etc.) and then succeed, tearing down a flow only to retry throws away up to a minute, or several GB, of progress per node in most jobs, so we're happy to be more lenient and see if we can just wait it out at first.

  2. If after a delay and retry in the coordinator loop we're still hitting the same breaker error, we could elect to move the job to a paused state with the relevant error message, explicitly signaling to the operator that this job hit some condition in the cluster that indicated it would not be able to complete successfully; they can now elect to cancel it and roll back its progress, or try to resume it again if they've remedied that condition / it has passed on its own.
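
A rough sketch combining the two ideas, with hypothetical attempt/pause hooks and the delays mentioned above (illustrative only, not actual jobs code):

```go
package sketch

import (
	"context"
	"fmt"
	"time"
)

// retryOrPause retries the flow with escalating delays while the breaker
// error persists; if it still fails after the last delay, the job is moved
// to a paused state with the error so an operator can cancel or resume it.
func retryOrPause(ctx context.Context, attempt func(context.Context) error,
	pause func(reason error)) error {
	err := attempt(ctx)
	if err == nil {
		return nil
	}
	for _, d := range []time.Duration{
		30 * time.Second, 60 * time.Second, 120 * time.Second, 300 * time.Second,
	} {
		select {
		case <-time.After(d):
		case <-ctx.Done():
			return ctx.Err()
		}
		if err = attempt(ctx); err == nil {
			return nil
		}
	}
	pause(fmt.Errorf("pausing after repeated circuit-breaker errors: %w", err))
	return err
}
```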

@erikgrinaker
Contributor

I'm fine with this as a temporary solution, and will defer to bulk on the longer-term solution.

To avoid the tedium of a request parameter or something, we could add an RPC middleware when we instantiate the RPC client that adjusts the breaker behavior for the entire client session.
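
For illustration, a hedged sketch of that middleware idea as a plain gRPC client interceptor installed once at dial time; the metadata key, policy string, and address are made up, and this is not how CockroachDB's RPC layer is actually wired.

```go
package sketch

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/metadata"
)

// breakerPolicyInterceptor tags every outgoing RPC on this client connection
// with a session-wide circuit-breaker policy.
func breakerPolicyInterceptor(policy string) grpc.UnaryClientInterceptor {
	return func(ctx context.Context, method string, req, reply interface{},
		cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
		ctx = metadata.AppendToOutgoingContext(ctx, "x-breaker-policy", policy)
		return invoker(ctx, method, req, reply, cc, opts...)
	}
}

// Usage: the whole session waits on tripped breakers instead of failing fast.
//   conn, err := grpc.Dial("node1:26257",
//       grpc.WithUnaryInterceptor(breakerPolicyInterceptor("wait")))
```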

Contributor

@erikgrinaker erikgrinaker left a comment


Preemptive approval, will leave it to you to decide on a path forward.

@tbg
Member Author

tbg commented Feb 17, 2022

> That said, jobs also really do suffer from silently hanging requests, and in many ways it is worse than it is for queries: it is obvious that a query that usually takes 8ms is stuck when it hasn't returned after 12 minutes. If a job that runs for 2h has been stuck on a bad range for 12min, it's easy to miss that / think it is just slower today. If/when a job isn't making progress because something it depends on can't fulfill a request, telling the job that instead of just hanging would be a win, and would work towards addressing one of the absolute top issues with big jobs, i.e. that it is consistently hard to tell if, when, and most importantly why, they are not performing as expected.

That's good to know. The option to fail fast is easy to add once Bulk (or anyone really) is interested in using this. Let me know if Bulk is going to work on this for 22.1 and I can make sure the kvserver piece is in place in time. I can also add it preemptively if you think you just might work on it if it were there.

> yes, a job is willing to wait longer for the situation to clear up than a low-latency query request; we still want to "fail fast" if we're not going to get a reply, but our idea of "fast" is probably more like 15-30s if the breaker is still open. Even if we then wait a bit and retry the flow later (in 30/60/120/300s, etc.) and then succeed, tearing down a flow only to retry throws away up to a minute, or several GB, of progress per node in most jobs, so we're happy to be more lenient and see if we can just wait it out at first.

This sounds like you want a per-request deadline (e.g. context cancellation) but with more information coming back, right? You don't necessarily care if the breaker is open. If it takes 120s to run an ExportRequest you might want to surface that somehow, even if the replication layer is healthy.
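
For illustration, a minimal sketch of that per-request deadline framing, with a hypothetical sendExport callback standing in for issuing the actual ExportRequest:

```go
package sketch

import (
	"context"
	"fmt"
	"time"
)

// exportWithDeadline bounds how long a single request may block (e.g. on a
// tripped breaker) and surfaces slowness even when the request succeeds.
func exportWithDeadline(ctx context.Context,
	sendExport func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()
	start := time.Now()
	err := sendExport(ctx)
	if elapsed := time.Since(start); elapsed > 15*time.Second {
		fmt.Printf("export took %s; replication may be degraded\n", elapsed)
	}
	return err
}
```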

@tbg tbg force-pushed the circuit-breaker-exemptoins branch from 5b9f073 to 957c6a7 on February 21, 2022 15:41
@tbg
Member Author

tbg commented Feb 21, 2022

bors r=erikgrinaker

@tbg
Member Author

tbg commented Feb 21, 2022

bors single on

@craig
Contributor

craig bot commented Feb 21, 2022

Build succeeded.

@craig craig bot merged commit 5b3067f into cockroachdb:master Feb 21, 2022
@tbg tbg deleted the circuit-breaker-exemptoins branch February 21, 2022 22:59