
kvserver: provide a way for replicas to re-enter the quota pool #82403

tbg opened this issue Jun 3, 2022 · 3 comments
Labels
A-admission-control C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments


tbg commented Jun 3, 2022

Describe the problem

Replicas can get "ignored" by the quota pool under certain conditions. This primarily happens when a node restarts, as then we want to avoid stalling foreground traffic until the follower has caught up. However, there is no guarantee that the follower will ever catch up.
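
For illustration, here is a minimal, self-contained Go sketch of the idea (not the actual kvserver code; the `Follower` type, the `RecentlyCreated` flag, and the `maxLag` threshold are all hypothetical): followers that look hopeless are excluded from the set whose acks gate quota release, which is why nothing forces them to catch up.

```go
// Hypothetical sketch: which followers does a quota pool wait for?
package main

import "fmt"

type Follower struct {
	ID              int
	Index           uint64 // highest log index this follower has acked
	RecentlyCreated bool   // e.g. node just restarted / awaiting snapshot
}

// minTrackedIndex returns the lowest acked index among followers the pool
// still tracks. Followers that are deemed hopeless are skipped, so quota is
// released without waiting for them -- and they can lag behind indefinitely.
func minTrackedIndex(commitIndex uint64, followers []Follower, maxLag uint64) uint64 {
	min := commitIndex
	for _, f := range followers {
		if f.RecentlyCreated || commitIndex-f.Index > maxLag {
			continue // ignored by the pool
		}
		if f.Index < min {
			min = f.Index
		}
	}
	return min
}

func main() {
	followers := []Follower{
		{ID: 1, Index: 100},
		{ID: 2, Index: 40, RecentlyCreated: true}, // restarted node, ignored
		{ID: 3, Index: 95},
	}
	// Prints 95: follower 2 does not hold up quota release.
	fmt.Println(minTrackedIndex(100, followers, 50))
}
```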

To Reproduce

I haven't actually tried this, but take a write-heavy workload with a few hot ranges, take down a node for 1-2 minutes, and then bring it back up in a state in which it is slightly underprovisioned for the workload. It should lag behind forever, and the quota pool will not help it catch up.

Expected behavior

Hard to formulate! There are different regimes. If a follower is behind and "hopelessly slow", foreground traffic shouldn't slow down in response to it (see #79215). But if it is only marginally slower (making "good progress"), perhaps only because it is a read-only satellite in a faraway region, we need to slowly bring it back into circulation: otherwise AOST reads on this replica will fail forever, and, if it is a voter, availability will remain compromised forever since the replica has to catch up before it can contribute to forward progress.

Additional context

I'm not sure we have struggled with this in practice, but it is a legitimate concern and becomes more important if, for #79215, we take an approach where the quota pool "temporarily" ignores followers that are overloaded (and stops sending appends to them). These nodes will "intentionally" fall behind, but nothing will ensure that they catch up once they have become healthy again.

Jira issue: CRDB-16355

Epic CRDB-39898

@tbg tbg added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv-replication labels Jun 3, 2022

blathers-crl bot commented Jun 3, 2022

cc @cockroachdb/replication


tbg commented Jun 3, 2022

This is related to the following issue, which I was about to file separately but will simply put here:

Is your feature request related to a problem? Please describe.

In the context of #79215, we are thinking about overloaded stores and replicas that are unable to keep up with the incoming raft log due to IO overload. When such a situation occurs, we hope that in the long term the allocator can find a way to move load off that store and solve the problem. But this may not always be possible, and even when it is, it may take some time.

What happens in the meantime? We have three options:

  1. let it run its course, i.e. append to the store anyway, thus overloading it more over time.
  2. don't append to it.
  3. slow down the foreground traffic to match what the overloaded store can support (i.e. some version of distributed admission control).

We currently do some mix of 1) and 3) through the quota pool if the traffic is driven primarily through one range. There is a certain amount of quota available, and it is only returned once all followers have acked a proposal; if a follower slows down and falls further and further behind, quota eventually runs out, and we (choppily) wait for the follower to ack something, then append again, and so on. That looks a lot like a crude version of 3).
If there are many active ranges, no individual quota pool may ever deplete, so the behavior is closer to 1).

Option 1) causes all kinds of memory blowup, severely degrades the node, and in particular disrupts service on ranges for which the slow store holds the lease.
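
To make that mechanism concrete, here is a rough, self-contained sketch of a quota pool (not CockroachDB's implementation; the types and sizes are made up): writers acquire quota sized to their proposal, and quota is returned only once every tracked follower has acked up to that proposal's index.

```go
// Hypothetical sketch of the per-range quota-pool behavior described above.
package main

import (
	"fmt"
	"sync"
)

type proposal struct {
	index uint64
	size  int64
}

type quotaPool struct {
	mu        sync.Mutex
	cond      *sync.Cond
	available int64
	inflight  []proposal // ordered by index
}

func newQuotaPool(capacity int64) *quotaPool {
	qp := &quotaPool{available: capacity}
	qp.cond = sync.NewCond(&qp.mu)
	return qp
}

// acquire blocks until `size` bytes of quota are free; this is where
// foreground writes stall when a tracked follower lags.
func (qp *quotaPool) acquire(index uint64, size int64) {
	qp.mu.Lock()
	defer qp.mu.Unlock()
	for qp.available < size {
		qp.cond.Wait()
	}
	qp.available -= size
	qp.inflight = append(qp.inflight, proposal{index: index, size: size})
}

// followerAcked is called with the minimum acked index across all tracked
// followers; quota for fully-acked proposals is returned to the pool.
func (qp *quotaPool) followerAcked(minAckedIndex uint64) {
	qp.mu.Lock()
	defer qp.mu.Unlock()
	i := 0
	for ; i < len(qp.inflight) && qp.inflight[i].index <= minAckedIndex; i++ {
		qp.available += qp.inflight[i].size
	}
	qp.inflight = qp.inflight[i:]
	qp.cond.Broadcast()
}

func main() {
	qp := newQuotaPool(1 << 20) // 1 MiB of quota for this range
	qp.acquire(1, 512<<10)
	qp.acquire(2, 512<<10)
	// The pool is now empty; a third acquire would block until followers ack.
	qp.followerAcked(2) // all tracked followers caught up through index 2
	fmt.Println("quota released")
}
```

With many active ranges each holding its own pool, spread-out traffic may never deplete any single pool, which is the "closer to 1)" regime above.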

Whatever end state we work towards (see #79755), it will require determining, for a remote store, a signal for how much replication traffic it can absorb, so that the cluster can shape traffic to that store accordingly.

Describe the solution you'd like

Unclear, as it depends a lot on #79755 and the preliminary steps in #79215, which include a lot of experimentation. But one approach that has been discussed quite a bit, and that is roughly in line with how admission control for IO works locally, is to negotiate with the remote store's admission control system a throughput rate that the remote considers acceptable for the near future, and to shape the aggregate append traffic to that node according to that target throughput.
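
A hedged sketch of that shaping idea, assuming the remote store advertises a byte rate it can absorb (the `appendShaper` name and the token-bucket details are illustrative, not an existing CockroachDB API):

```go
// Hypothetical sketch: shape append traffic to a negotiated byte rate.
package main

import (
	"fmt"
	"time"
)

type appendShaper struct {
	bytesPerSec float64 // rate advertised by the remote store's admission control
	tokens      float64
	lastRefill  time.Time
}

func newAppendShaper(bytesPerSec float64) *appendShaper {
	return &appendShaper{bytesPerSec: bytesPerSec, tokens: bytesPerSec, lastRefill: time.Now()}
}

// waitFor blocks until `size` bytes of budget have accrued at the negotiated
// rate, then consumes them; burst is capped at roughly one second of budget.
func (s *appendShaper) waitFor(size float64) {
	for {
		now := time.Now()
		s.tokens += now.Sub(s.lastRefill).Seconds() * s.bytesPerSec
		if s.tokens > s.bytesPerSec {
			s.tokens = s.bytesPerSec
		}
		s.lastRefill = now
		if s.tokens >= size {
			s.tokens -= size
			return
		}
		time.Sleep(10 * time.Millisecond)
	}
}

func main() {
	// Suppose the remote store advertised that it can absorb 4 MiB/s.
	shaper := newAppendShaper(4 << 20)
	start := time.Now()
	for i := 0; i < 3; i++ {
		shaper.waitFor(2 << 20) // each append carries 2 MiB of log entries
		fmt.Printf("append %d sent after %v\n", i, time.Since(start).Round(time.Millisecond))
	}
}
```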

This raises the question of what happens if the target throughput rate is "very low". In such a case, likely a good strategy is to treat the follower as unavailable (as long as quorum requirements allow this on a case-by-case basis).

Cut-offs like this are annoying.

Describe alternatives you've considered

Additional context

Somewhat related: #77035


tbg commented Jul 3, 2023

This issue is obsolete if we disable/remove the quota pool, x-ref #106063
