kv/sql: index backfill and truncate temporarily crater write throughput on large tables #62672
Comments
Have you tried on master? |
No, I have not, but I can. What do we expect to be different on master? |
I could have sworn that we mapped this slowdown to cockroachdb/pebble#981 months ago. |
Here is the behavior on 8b552c7. Looks mostly the same: During this reproduction, I explored why ranges were being split after the TRUNCATE, as there had been some confusion over whether this was due to size or load. It does look like ranges are being split due to load (which is good!), but at a rate of only about 2 splits per range per minute.
I wonder whether there are any easy wins here to be had regarding not resetting a range's load-based splitting tracker during a load-based split: cockroach/pkg/kv/kvserver/split_queue.go Lines 261 to 262 in cafa371
|
Not resetting but rather splitting out the samples makes a lot of sense. |
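A minimal sketch of what "splitting out the samples" could look like, assuming the tracker's state can be reduced to a list of sampled keys. `partitionSamples` is a hypothetical helper, not the actual code in `split_queue.go` or the load-based splitter, and any per-sample counters the real tracker keeps would probably need to be re-evaluated after the split.

```go
package main

import "fmt"

// partitionSamples divides the sampled keys of a range at splitKey instead of
// discarding them. Range spans are [start, end), so keys below splitKey stay
// with the left-hand range and all other keys move to the right-hand range.
// Hypothetical helper for illustration only.
func partitionSamples(samples []string, splitKey string) (lhs, rhs []string) {
	for _, k := range samples {
		if k < splitKey {
			lhs = append(lhs, k)
		} else {
			rhs = append(rhs, k)
		}
	}
	return lhs, rhs
}

func main() {
	lhs, rhs := partitionSamples([]string{"a", "c", "e", "g"}, "e")
	fmt.Println(lhs, rhs)
}
```

With something like this, each side of a load-based split would start with part of the sample set rather than from scratch, which might shorten the gap before the next split decision.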
I think we're going to need to do something like this if we want to make a TRUNCATE fully non-disruptive. As @lunevalex astutely pointed out, the ramp-up period after a TRUNCATE looks a lot like immediately after the load is started, though maybe a fixed multiple slower. Even if we could make this just as efficient as the initial ramp-up period, that likely still wouldn't be sufficient, as that still takes 2-3 minutes to reach peak throughput, and longer on larger clusters. |
We could split the new range along the same boundaries as the pre-truncated table. Or some fraction of those boundaries (e.g. every Nth range boundary). I bet we could hack together something which did this using |
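The "every Nth range boundary" idea can be sketched as below. This assumes the old table's split keys are already available as an ordered slice; `everyNth` and the string keys are hypothetical stand-ins, not CockroachDB APIs.

```go
package main

import "fmt"

// everyNth keeps every nth boundary from an ordered list of split keys.
// n <= 1 keeps all boundaries. Hypothetical helper for illustration only.
func everyNth(boundaries []string, n int) []string {
	if n <= 1 {
		return append([]string(nil), boundaries...)
	}
	var out []string
	for i := n - 1; i < len(boundaries); i += n {
		out = append(out, boundaries[i])
	}
	return out
}

func main() {
	// Assumed split keys of the pre-truncate table, in key order.
	old := []string{"b", "d", "f", "h", "j", "l"}
	fmt.Println(everyNth(old, 2))
}
```

Downsampling trades fewer pre-splits (and fewer later merges) against a coarser starting distribution; load-based splitting would then be expected to refine whatever hot spots remain.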
I like this idea a lot. Using the existing range boundaries instead of trying to recreate them from table statistics seems easier, more precise, and has the potential of piggybacking on top of existing load-based splits in case of uneven (but also not divergent) load distribution. @aliher1911 do you have interest in doing some exploration here? We can walk through the relevant code during our pod meeting later today. |
If we pre-split, would we use the range descriptor's "sticky bit" with some arbitrary expiration to keep the splits around for a while? |
If we did add the sticky bit, it might make sense to jitter the expiration times. |
Yes, we'd set some split expiration, similar to what IMPORT does. I was thinking 10 minutes. Jittering the expiration times is a good idea. We should do that in other places as well to avoid waves of simultaneous range merges. |
I hacked together something for this to test out whether pre-splitting based on existing range boundaries would help out, as suggested by @petermattis in #62672 (comment). I then realized @jordanlewis was already testing this out on his LARGE DATA BANK stream in a much cleaner way - sorry to step on your toes! I didn't realize you had actually taken the stream suggestion. I'll let you take over from here. Even the hacky version demonstrated a major improvement. Here's a screenshot of a series of three TRUNCATE operations using the same setup as listed above. TRUNCATE operations occur under three different configurations:
As we see, pre-splitting the new keyspace into some number of ranges and scattering before switching over the load made a very large difference. We should proceed with that approach. |
Haha, no worries @nvanbenschoten! I didn't tell you I was going to work on this on the stream. I also haven't been able to figure out how to nicely test as well as you have here, so hopefully we can merge our two approaches into something that is great. |
63043: sql: TRUNCATE preserves split points of indexes r=jordanlewis a=jordanlewis
Touches #62672
Previously, TRUNCATE unconditionally blew away all index split points when creating the replacement empty indexes. This manifested in some bad behavior when performing TRUNCATE on tables or indexes that had heavy load and were heavily split already, because all of a sudden all of the traffic that was nicely dispersed across all of the ranges would redirect to the single new range that TRUNCATE created. The bad performance would remediate over time as the database re-split the new ranges, but preserving the splits across the index swap boundaries is a faster way to get there.
Release note (sql change): TRUNCATE is now less disruptive on tables with a lot of concurrent traffic.
64395: roachtest: improve varz regexp r=andreimatei a=andreimatei
It was recognizing:
follower_reads_success_count{store="1"} 27606
but not:
sql_select_count 1.652807e+06
Release note: None
Co-authored-by: Jordan Lewis <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
I ran through the original reproduction steps on Things look significantly better. We do see a very minor latency blip during the pre-split-and-scatter and then later as the new table splits further due to load, but these are inconsequential compared to the latency spikes and corresponding throughput craters we were seeing before. Nice job on this @jordanlewis! |
I believe we're seeing a very analogous series of events play out due to an index backfill. The same customer that prompted this issue has seen similar impact to foreground traffic on a table the moment an index backfill is kicked off. On 20.2.8, with @nvanbenschoten's repro steps from above, but instead of The rate of load-based range splits seems to be about the same as what @nvanbenschoten observed in his repro -- I'm seeing 1 split per range once every ~25 seconds. The story is similar on From my very limited understanding of this area, it does seem to check out. The index backfiller only seems to create size based splits as it ingests SSTs, so the moment an index backfill is kicked off, all write traffic will be bottlenecked due to the (new) secondary index being consolidated to a small set of ranges until load based splitting catches up.
I reckon a fix to this would be very similar to #63043 in theory. However, we might not want to do any downsampling in the case of an index backfill. I'm re-opening this issue for now since it has a ton of valuable context. cc @dt as you were looking into this. |
Thanks for writing this up @aayushshah15!
Beyond the impact on
It seems a little less clear what we should do in this case, because we don't trivially know the distribution of rows that will land in the secondary index. With TRUNCATE, things were simpler, because we had some reason to believe that the indexes would look similar to how they looked before the TRUNCATE. Have you given any thought to this? Perhaps we could sample the existing PK and perform the secondary index translation to get a sense for what the secondary index will look like ahead of time? Also, have we confirmed that this is still a pressing issue for the customer? Have we seen the behavior of an index backfill now that the Pebble write stall issues have been resolved? I have to imagine that AddSSTable requests were some of the most impacted operations when file creation was taking 60ms and leading to full engine stalls. So it seems very possible that even though there is work to do here, it no longer has the same urgency for the customer. As much as we can, I think we should re-assess all of the items this customer is interested in to determine whether they were being exacerbated by the write stall issues. |
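The "sample the PK and translate" idea might look roughly like the sketch below. It assumes we already have a sample of primary index rows and a function that produces the secondary index key for a row. The `row` type, `splitPointsFromSample`, and the `email` column are all hypothetical. Note that this estimates the data distribution of the new index, which is not necessarily its load distribution.

```go
package main

import (
	"fmt"
	"sort"
)

// row is a hypothetical sampled row from the primary index.
type row struct {
	id    int
	email string
}

// splitPointsFromSample maps each sampled row to the key it would have in
// the new secondary index, sorts those keys, and returns up to numRanges-1
// evenly spaced quantiles as candidate pre-split points.
// Hypothetical helper for illustration only.
func splitPointsFromSample(sample []row, indexKey func(row) string, numRanges int) []string {
	if numRanges < 2 || len(sample) == 0 {
		return nil
	}
	keys := make([]string, len(sample))
	for i, r := range sample {
		keys[i] = indexKey(r)
	}
	sort.Strings(keys)
	var splits []string
	for i := 1; i < numRanges; i++ {
		k := keys[i*len(keys)/numRanges]
		// Skip duplicates so each split point is distinct.
		if len(splits) == 0 || splits[len(splits)-1] != k {
			splits = append(splits, k)
		}
	}
	return splits
}

func main() {
	sample := []row{
		{1, "h@x"}, {2, "c@x"}, {3, "a@x"}, {4, "f@x"},
		{5, "b@x"}, {6, "g@x"}, {7, "d@x"}, {8, "e@x"},
	}
	byEmail := func(r row) string { return r.email }
	fmt.Println(splitPointsFromSample(sample, byEmail, 4))
}
```

Whether a sample like this is cheap enough to collect before a backfill, and whether data-based split points are close enough to what load-based splitting would eventually choose, would need the kind of prototyping discussed in this thread.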
So it seems like there's no real difference besides the less aggressive range-merges.
Just to confirm, you're referring to cockroachdb/pebble#1125, right? The run on
That's a good point. I'm having a hard time understanding your proposal though. Is the idea that, since the system already has some idea of the mapping from a primary index's data distribution to its load distribution, we can somehow come up with a way to form a mapping between a secondary index's data distribution and its load distribution (and pre-split based on that)? Maybe I'm totally misunderstanding you and this is not what you mean at all. [1]: Range counts on the new secondary index right after the backfill is kicked off
|
If this hypothesis is correct, we should see this dip in SQL throughput just by replacing the index backfill itself with a If so, then I wonder if it might help to instead ramp traffic to the index, e.g. making new |
That said, when we looked at a customer's cluster that had seen throughput fall to single-digit percentage of what it was before the CREATE INDEX, it stayed that way for a long time -- much longer than one or two minutes. While it was in that state, we enabled diagnostics collection for one of the slow queries on the table and observed it waited nearly 1s for quota pool (it usually ran in single-digit millis). Given it stayed in that state for a long period, maybe just waiting for load-based splitting wouldn't help after all. |
That's more a function of the size of the table and the overall traffic to that table, right? Like, in this reproduction, the entire cluster had ~400 ranges before the backfill. It seems like it would just be a longer climb back to the peak if the table had more data. I can run a much larger repro than this and confirm this.
Would you mind pointing me to the trace event that you saw? It also seems like what we saw here is mostly expected -- given that the proposal quota pool has a size of |
Unfortunately the trace was collected and viewed during a live debugging session with the customer. I don't know if it was retained, but can follow-up with you offline for details. |
I'm also referring to the I'm not saying we shouldn't continue exploring how to improve this, just that we should re-evaluate (in part, for the customer's sake) how bad things remain now that the other potentially related issues have been resolved. |
@lunevalex suggested an approach to mitigate the foreground impact that we seem to have previously considered for some other use cases as well. The idea was that we would perform 2 backfills -- one that corresponds to when the index creation is kicked off, say with This should get us out of the main issue of load consolidation on a handful of ranges, as by the time ongoing writes have to update the new index, that index at least has a bunch of size based splits. Notably though, this approach wouldn't capture any load-based splits until write traffic has to start updating the index, so it probably warrants some prototyping to figure out if this'll be adequate. cc @shermanCRL and @pbardea, we should discuss where the prototyping / implementation work items for this should be routed, as the work is mostly going to reside in areas owned by bulk. |
See also: #36850 |
#36850 could also be a solution to a lots-of-SQL-traffic-hitting-un-split-index-span, but it is a pretty heavy-weight way to get some key span to split and would be a non-trivial undertaking. If all we need is to split the span, we might want to look at some shorter-term solutions which might offer quicker wins. Off the top of my head:
We've evaluated #36850 a couple times over the years, but we've always said we want to do some experiments to validate its benefits before we sink too much effort into it because it is non-trivial. See #54955 |
@dt suggested that we should confirm that the backfill itself doesn't have a role to play in this. I ran with a 5 minute sleep right before the index backfiller begins constructing and ingesting SSTs and saw the following. We're seeing one valley right when the index creation starts, and a slightly shorter dip during the backfill (albeit with a pretty severe impact to p99 latencies). The first crater seems to decisively point to load-consolidation as we were suspecting before. For instance, we see the CPU utilization on one of the nodes surge along with a corresponding dip for all the other nodes. The cratering we were seeing when the backfill starts seemed to be caused by the disruptive impact of SST ingestion, as @nvanbenschoten had suspected before: To validate this, I ran with |
I am going to close this, as we have done everything we are going to do in KV here, and there are other issues to track the proposed changes to the index backfiller. |
Note: This issue started off as an investigation into the disruptive impact of a TRUNCATE on foreground traffic. However, we later observed that this problem generalizes to index backfills as well. See #62672 (comment) below.
We've seen that TRUNCATE operations on large, write-heavy tables are very disruptive. They can cause a dramatic drop in throughput that takes tens of minutes to recover from.
During a recent investigation, we determined that this was due to how TRUNCATE impacts the load distribution in KV. TRUNCATE replaces the existing table keyspace with a new one, initially backed by a single range. This causes all load that had been previously well-distributed over the cluster to be directed at a single leaseholder. This leaseholder struggles to serve the load while also splitting into multiple ranges (due to range size and load) and balancing these new ranges across the cluster.
While this hotspot persists, throughput is severely impacted.
Reproduction
This causes the following effect:
Notice that throughput does not return to its pre-truncated level for at least 10 minutes. Things are likely even worse with larger clusters.
Next steps
gz#8170
Epic: CRDB-2625
gz#8886
gz#9032