admission,kvserver: subject snapshot ingestion to admission control #80607
Comments
The snapshot rate limits configured in https://github.com/cockroachlabs/support/issues/1558 were dangerously close to maxing out the disks' configured max throughputs. With everything you suggest above done, wouldn't we still see issues, owing to the fact that streaming the snapshot alone would overload the disks? There is a mismatch between our L0-centric tracking and snapshots in this regard. I suppose this is fine for now; the rate limits should really be adaptive to disk throughput, or at the very least shouldn't be misconfigured like that.
This does worry me a bit. It's just a lot of tokens! Can we acquire the tokens in smaller chunks (say 1mb/s increments) so that the bucket does not go negative, and we're less likely to impose a big delay on a high-priority operation?
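To make the chunking idea concrete, here is a minimal sketch, assuming the store's token bucket can be approximated by golang.org/x/time/rate (an assumption for illustration only; kvStoreTokenGranter is not a rate.Limiter, and the numbers here are arbitrary):

```go
package main

import (
	"context"

	"golang.org/x/time/rate"
)

// chunkBytes is the increment per acquisition. Deducting a large snapshot's
// tokens in small chunks keeps the shared bucket from going deeply negative,
// so a concurrent high-priority operation waits for at most roughly one chunk.
const chunkBytes = 1 << 20 // 1 MiB

func acquireInChunks(ctx context.Context, lim *rate.Limiter, totalBytes int64) error {
	for remaining := totalBytes; remaining > 0; {
		n := int64(chunkBytes)
		if remaining < n {
			n = remaining
		}
		// Other waiters can interleave between chunks.
		if err := lim.WaitN(ctx, int(n)); err != nil {
			return err
		}
		remaining -= n
	}
	return nil
}

func main() {
	// 32 MiB/s refill with a 1 MiB burst; a real snapshot could be up to 512 MiB.
	lim := rate.NewLimiter(rate.Limit(32<<20), chunkBytes)
	_ = acquireInChunks(context.Background(), lim, 8<<20)
}
```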
In that case, we can apply the snapshot, right? It doesn't really matter if we do, since the caller (replicate queue in all likelihood) will assume the snapshot failed & will not make good use of it. But if we "pre-acked" the snapshot the moment we get to the point at which we're waiting for ingestion tokens, the caller could be more lenient and willing to wait a little longer. Playing devil's advocate:
Is it maybe ok to ignore that for snapshots? Let's say we acquire tokens as the snapshot is streamed in, and then ingest no matter what. Since we're doing at most one snapshot per store at any given time, in the scenario we're worried about wouldn't it still be ok:
Do I have the scenario wrong? It already feels fairly contrived. I might not be understanding it correctly.
Are you worried about the reads at the source? If so, it depends on how much of this hits the Pebble block cache and the OS page cache. But yes, it is a valid concern -- and admission control at the source can't help yet with disk bandwidth limits (only CPU). We could consider at least hooking up the reads to admission control.
That's definitely a potential alternative and I did consider that briefly. I discarded it because:
I'll give this alternative some more thought, and flesh out the third alternative I alluded to above, which overcommits by consuming a separate set of tokens for these snapshots that doesn't touch the normal tokens.
I need to understand the deadline situation better. IIUC, the deadline exists so that the source can try another, healthier node, yes? So the leniency above is not indefinite, yes?
Relevant comment by @nicktrav, https://github.com/cockroachlabs/support/issues/1558#issuecomment-1112278968, regarding not ignoring the write bandwidth used to write the initial sstables while the snapshot is streaming in, as it can be substantial (20% in that case).
The approach here is to use a special set of tokens for snapshots, kvStoreTokenGranter.availableRangeSnapshotIOTokens, that allows for over-commitment, so that normal work is not directly affected. There is a long code comment in kvStoreTokenGranter with justification.

Informs cockroachdb#80607

Release note: None
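Roughly, the shape of that approach is sketched below (hypothetical field and method names, not the actual kvStoreTokenGranter code): snapshots draw from their own counter, which is allowed to go negative, so they neither wait on nor deplete the tokens that regular work uses.

```go
package admissionsketch

import "sync"

// storeTokenGranter sketches a granter with a separate, over-committable
// token pool for range snapshot ingestion. Regular work draws only from
// availableIOTokens; snapshot ingests draw only from availableSnapshotIOTokens.
type storeTokenGranter struct {
	mu                        sync.Mutex
	availableIOTokens         int64 // replenished each interval for regular work
	availableSnapshotIOTokens int64 // separate pool; may go negative (over-commit)
}

// tryGetSnapshotTokens admits a snapshot ingest of count bytes. It only
// requires the snapshot pool to be non-negative before deducting, so a single
// large ingest can drive the pool negative; later snapshots then wait for
// replenishment, but regular work is never directly delayed by a snapshot.
func (g *storeTokenGranter) tryGetSnapshotTokens(count int64) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.availableSnapshotIOTokens <= 0 {
		return false
	}
	g.availableSnapshotIOTokens -= count
	return true
}

// replenish is called periodically with the tokens computed for the next
// interval. The snapshot pool is refilled and capped at one interval's worth,
// which bounds how far the over-commitment can run ahead.
func (g *storeTokenGranter) replenish(ioTokens, snapshotIOTokens int64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.availableIOTokens = ioTokens
	g.availableSnapshotIOTokens += snapshotIOTokens
	if g.availableSnapshotIOTokens > snapshotIOTokens {
		g.availableSnapshotIOTokens = snapshotIOTokens
	}
}
```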
I'm leaning towards this, since it is effective and is the simplest -- created a draft PR in #80914.
Moving this to KV (as the new owners of admission control) and tagging @shralex and @mwang1026 to see who can work on this and when. Note there's a prototype PR #79215 (of Sumeer's) that I've looked at and that I think would be good to experiment with. Before we do that, we should write a roachtest (a la #81516), which I filed as #82116.
One approach that was being explored in #82132 is to "ignore" overloaded followers as much as possible - in a sense, saying that when we see an overloaded follower, as long as that follower isn't required for quorum, we are better off leaving it alone for some time, because the alternative involves backpressuring client traffic, which is essentially like introducing some amount of voluntary unavailability. We could do something similar here: avoid raft snapshots to nodes that are overloaded, unless the snapshot is necessary for forward progress (which is something the initiator of the snapshot can reason about, since it is the raft leader). Note that we already don't rebalance to nodes that have an inverted LSM, see #73714.
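As a rough sketch of that decision at the snapshot initiator (a hypothetical helper, not actual kvserver code):

```go
package kvsketch

// shouldDelaySnapshot sketches the leader-side check: an overloaded follower
// is left alone for now, but a snapshot that is required for quorum or
// forward progress is always sent, since skipping it would trade follower
// overload for range unavailability.
func shouldDelaySnapshot(followerOverloaded, requiredForProgress bool) bool {
	if requiredForProgress {
		return false
	}
	return followerOverloaded
}
```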
Closing #113720 in favor of this issue.

The current situation is that we require manual tuning of the rate, and it causes problems because it is "always" either too slow or too fast. This is likely one of the most tuned parameters today, and tuning it is typically reactionary and done frequently. Also, some snapshots are "cheap" in that they are for a new range and don't disrupt the LSM much, while others are "expensive" since they replace existing data. The KV code has some idea of which is which, but doesn't take that into account today.

We have timeouts and rate computations for snapshots, but they don't always work as expected and we end up with timeouts today. Short term this is hard because there are throttles on both sides (2 for send, 1 for receive) and the timeout is based on what the setting is configured to, so if it is set too high, even though we send at the same rate, we time out more often. I would like to remove the timeout as a "normal flow control" mechanism and instead only time out in extreme cases.
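To illustrate the timeout interaction with a toy calculation (assuming, per the description above, that the timeout budget scales with snapshot size divided by the configured rate; this is a simplification, not the actual formula):

```go
package main

import (
	"fmt"
	"time"
)

const (
	snapshotBytes  = 512 << 20 // a full 512 MiB snapshot
	configuredRate = 256 << 20 // rate setting cranked up to 256 MiB/s
	effectiveRate  = 32 << 20  // what the disks/network actually sustain
)

func main() {
	// If the timeout budget is derived from the configured rate, raising the
	// setting without raising the achievable rate shrinks the budget relative
	// to the real transfer time, so we time out more often.
	budget := time.Duration(snapshotBytes/configuredRate) * time.Second // 2s
	actual := time.Duration(snapshotBytes/effectiveRate) * time.Second  // 16s
	fmt.Printf("budget=%v actual=%v timesOut=%v\n", budget, actual, actual > budget)
}
```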
Adding the O-support label given this would have prevented https://github.com/cockroachlabs/support/issues/2841.
This issue has been resolved through our optimizations in Pebble (see #117116). We no longer see the inverted LSM problem described here, due to our use of excise in the ingestion path. This feature is turned on by default in 24.1 and can be turned on in 23.2. We verified the behavior through extensive internal experiments. The end result is that with these settings turned on, we see minimal (often none) of the ingests landing in L0; instead, with the help of excise, the big majority lands in L6. (Before/after comparison graphs and confirming Pebble LSM logs were attached.)
Based on these findings, we can close out this issue.
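For intuition on why excise keeps these ingests out of L0, here is a toy model of target-level selection during ingestion (illustrative only; the types are made up, and the excise here simply deletes overlapping spans, whereas Pebble virtualizes the affected sstables):

```go
package lsmsketch

type span struct{ start, end string }

func overlaps(a, b span) bool { return a.start < b.end && b.start < a.end }

// toyLSM holds one slice of file spans per level, L0 through L6.
type toyLSM struct{ levels [7][]span }

// targetLevel picks the level for an ingested file: it must sit above (at a
// lower-numbered level than) any existing data it overlaps, so reads see the
// newer, ingested keys first. A snapshot for an existing replica overlaps
// that replica's old data, which is what pushes its ingest up toward L0.
func (l *toyLSM) targetLevel(f span) int {
	for lvl := 0; lvl <= 6; lvl++ {
		for _, existing := range l.levels[lvl] {
			if overlaps(f, existing) {
				if lvl == 0 {
					return 0
				}
				return lvl - 1
			}
		}
	}
	return 6 // overlaps nothing: land at the bottom of the LSM
}

// excise removes all existing data overlapping f before the ingest. After
// excising the snapshot's key span, targetLevel(f) returns 6, which matches
// the behavior described above.
func (l *toyLSM) excise(f span) {
	for lvl := range l.levels {
		kept := l.levels[lvl][:0]
		for _, existing := range l.levels[lvl] {
			if !overlaps(f, existing) {
				kept = append(kept, existing)
			}
		}
		l.levels[lvl] = kept
	}
}
```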
Since we are no longer concerned with the snapshot overhead, we have the option to either increase the concurrency of snapshots or simply increase the default ingestion rate (to something like 256MiB). It is kept artificially low with the idea that we would increase it once this enhancement was implemented. Currently many customers manually increase this rate, but when they do, that is when they get into trouble on 23.2. @itsbilal had looked at options to increase this rate as part of disaggregated storage.
Until AC has control over disk bandwidth usage, I think increasing the default is risky in that the disk could become saturated.
I am going to re-open this issue even if we don't deal with it now. As part of closing this we should also resolve #14768. We are in a strange state where the guidance to customers and customer best practices are to increase (or at least modify) this rate in many escalations and runbooks, but we are always walking a fine line to make this work well across different scenarios. Either way, if we want to keep this closed, we should update with guidance about what it should be in 24.1: https://cockroachlabs.atlassian.net/wiki/spaces/CKB/pages/2273673403/How+To+Change+Rebalance+and+Recovery+Rates
Ack, that's fine with me. It is worth noting that we have a tracking issue for the bandwidth-specific problem: #86857. It is a problem we plan to solve soon. I think the original motivation of this issue was increased r-amp due to snapshots landing in L0 on the receiver, which is no longer the case and is why I originally closed this issue. But either way, we should be solving the bandwidth issue with snapshots in mind (ref: #86857 (comment)).
I think the concern is with the change to the defaults, since we don't know how it will react in bandwidth-constrained setups without AC for bandwidth, especially in environments where reads and writes share the same provisioned bandwidth limits. I think it still makes sense to recommend increasing this limit in cases where it can help. FWIW, in the doc referenced above with my results, I was running with this knob tuned up to 256MB.
See #120708 for that followup. Closing this issue.
124591: roachtest: add roachtest for snapshot ingest with excises r=sumeerbhola a=aadityasondhi

This patch adds a roachtest for running snapshots with excises enabled. In this workload, when splits and excises are disabled, we see an inverted LSM and degraded p99 latencies. The test asserts that the LSM stays healthy while doing the snapshot ingest, and that p99 latencies don't spike over a threshold.

Informs #80607.

Release note: None

Co-authored-by: Aaditya Sondhi <[email protected]>
We currently throttle writes on the receiving store based on store health (e.g. via admission control or via specialized AddSSTable throttling). However, this only takes into account the local store health, and not the associated write cost on followers during replication, which isn't always throttled. We've seen this lead to hotspots where follower stores get overwhelmed, since the follower writes bypass admission control. A similar problem exists with snapshot application. This has been touched on in several other issues as well:
A snapshot can be 512MB (the maximum size of a range). Rapid ingestion of snapshots can cause an inverted LSM, e.g. https://github.com/cockroachlabs/support/issues/1558, where we were ingesting ~1.6GB of snapshots every 1min, of which ~1GB were being ingested into L0.

Snapshot streaming is rate limited (see kv.snapshot_recovery.max_rate). Admission control currently calculates tokens at 15s time granularity, and expects admitted work to be short-lived compared to this 15s interval (say < 1s).

Solution sketch:
We assume that the kv.snapshot_recovery.max_rate setting continues to be driven by the rate of resource consumption on the source and of the network. That is, it has nothing to do with how fast the destination can ingest. This is reasonable since the ingestion will be atomic, after all the ssts for the snapshot have been locally written, so it doesn't matter how fast the data for a snapshot arrives. This is ignoring the fact that writing the ssts also consumes disk write bandwidth, which may be constrained -- this is acceptable since the write amplification is the biggest consumer of resources.
We keep the Store.snapshotApplySem to limit the number of snapshots that are streaming over their data. After the local ssts are ready to be ingested, we ask for admission tokens equal to the total size of the ssts. This assumes we have switched store-write admission control to use byte tokens (see admission: change store write admission control to use byte tokens #80480). Once granted, the ingestion is performed, and after that the snapshotApplySem is released. This is where having a single priority for all snapshot ingestion is convenient -- we don't have to worry about a high-priority snapshot waiting for snapshotApplySem while a low-priority snapshot is waiting for store-write admission.
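A rough sketch of that flow, with hypothetical types standing in for Store, the snapshot stream, and the admission interface (not actual kvserver code):

```go
package kvsketch

import (
	"context"

	"golang.org/x/sync/semaphore"
)

// Hypothetical stand-ins; only the ordering of steps matters here.
type storeAdmitter interface {
	AdmitIngest(ctx context.Context, bytes int64) error // blocks until byte tokens are granted
}
type engine interface {
	IngestSSTs(ctx context.Context, paths []string) error
}
type incomingSnapshot interface {
	// Streams the snapshot into local sst files, paced by
	// kv.snapshot_recovery.max_rate; returns the files and their total size.
	StreamToLocalSSTs(ctx context.Context) (paths []string, totalBytes int64, err error)
}

type store struct {
	snapshotApplySem *semaphore.Weighted // limits concurrently streaming snapshots
	admission        storeAdmitter
	eng              engine
}

// applySnapshot sketches the proposed ordering: hold snapshotApplySem while
// streaming, ask for byte tokens equal to the total sst size once the ssts
// are ready (assumes byte-token store admission, per #80480), then ingest and
// finally release the semaphore.
func (s *store) applySnapshot(ctx context.Context, snap incomingSnapshot) error {
	if err := s.snapshotApplySem.Acquire(ctx, 1); err != nil {
		return err
	}
	defer s.snapshotApplySem.Release(1)

	paths, totalBytes, err := snap.StreamToLocalSSTs(ctx)
	if err != nil {
		return err
	}
	// With a single priority for all snapshot ingestion, a high-priority
	// snapshot can never be stuck behind a low-priority one at this point.
	if err := s.admission.AdmitIngest(ctx, totalBytes); err != nil {
		return err
	}
	return s.eng.IngestSSTs(ctx, paths)
}
```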
Potential issues:
Misc things to figure out:
@tbg
Jira issue: CRDB-15330
Epic CRDB-34248