
kv: speed up the split trigger, don't hold latches while computing stats #22348

Closed · tbg opened this issue Feb 3, 2018 · 9 comments

Labels: A-kv-distribution (relating to rebalancing and leasing) · C-enhancement (solution expected to add code/behavior and preserve backward compatibility; pg compat issues are the exception) · O-23.2-scale-testing (issues found during 23.2 scale testing) · O-support (would prevent or help troubleshoot a customer escalation: bugs, missing observability/tooling, docs) · O-testcluster (issues found or occurred on a test cluster, i.e. a long-running internal cluster) · P-3 (issues/test failures with no fix SLA) · sync-me · T-kv (KV Team) · X-nostale (marks an issue/PR that should be ignored by the stale bot)

Comments

tbg (Member) commented Feb 3, 2018

We want to support (and possibly default to) range sizes much larger than our current default of 64 MB. Part of this is improving snapshots, but another problem is that splitting a large range is O(range size) due to the linear scan needed to recompute the MVCC stats of one half. Even worse, the split blocks pretty much all read/write access to the range while it happens. That is not ideal today, but it will get a lot worse when more data (rows) is affected by the blockage and the trigger takes longer at the same time.

This is mostly a semi-organized brain dump. I think there are two things we should do: first, make the trigger a lot faster, and second (less important once the first is done), allow the split to admit more concurrent commands.

Take the stats computation out of the split trigger.

We could accept that the stats are not accurate. Then we could recompute the stats before running the split trigger, and simply pass MVCCStats in, and mark the halves as needing a recomputation (we have this machinery already). The upside is that this is easy; the downside is that we rely on the accuracy of MVCCStats for rebalancing and potentially other decisions.

Or we could do more work and make the LHS/RHS stats available in the split trigger. The basic idea is to trigger a stats recomputation for the LHS with a custom Raft command, and maintain the MVCCStats deltas of updates to the LHS and RHS separately. This can be done without declaring any keys (i.e. without making anything wait). Then, once the base computation is done, commit the split (supplying it with the base computation, and letting it access the accumulated LHS/RHS deltas from the Range state).
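
A minimal sketch of the delta bookkeeping, using a simplified stand-in for MVCCStats; the custom Raft command, the locking, and the real range state are all elided:

```go
package main

import "fmt"

// mvccStats is a toy stand-in for a few of the counters the real
// enginepb.MVCCStats tracks.
type mvccStats struct {
	KeyBytes, ValBytes, KeyCount int64
}

func (s *mvccStats) add(o mvccStats) {
	s.KeyBytes += o.KeyBytes
	s.ValBytes += o.ValBytes
	s.KeyCount += o.KeyCount
}

// splitStatsHelper accumulates per-side deltas while a background
// recomputation of the LHS base stats is in flight.
type splitStatsHelper struct {
	splitKey string
	lhsBase  mvccStats // result of the (slow) background scan up to splitKey
	lhsDelta mvccStats // writes below splitKey since the scan started
	rhsDelta mvccStats // writes at or above splitKey since the scan started
}

// record bins a write's stats contribution into the LHS or RHS delta.
// (In reality a command could straddle the split key; that complication
// is discussed next.)
func (h *splitStatsHelper) record(key string, delta mvccStats) {
	if key < h.splitKey {
		h.lhsDelta.add(delta)
	} else {
		h.rhsDelta.add(delta)
	}
}

// finish combines the base computation with the accumulated deltas at
// split-commit time: LHS = base + lhsDelta, RHS = total - LHS.
func (h *splitStatsHelper) finish(total mvccStats) (lhs, rhs mvccStats) {
	lhs = h.lhsBase
	lhs.add(h.lhsDelta)
	rhs = mvccStats{
		KeyBytes: total.KeyBytes - lhs.KeyBytes,
		ValBytes: total.ValBytes - lhs.ValBytes,
		KeyCount: total.KeyCount - lhs.KeyCount,
	}
	return lhs, rhs
}

func main() {
	h := splitStatsHelper{splitKey: "m", lhsBase: mvccStats{KeyBytes: 100, ValBytes: 1000, KeyCount: 10}}
	h.record("a", mvccStats{KeyBytes: 10, ValBytes: 90, KeyCount: 1})
	h.record("z", mvccStats{KeyBytes: 20, ValBytes: 80, KeyCount: 2})
	lhs, rhs := h.finish(mvccStats{KeyBytes: 300, ValBytes: 3000, KeyCount: 30})
	fmt.Printf("lhs=%+v rhs=%+v\n", lhs, rhs)
}
```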

The main complication with this is that commands can overlap the split key (and so can't be binned into LHS/RHS). I think there's something here where we can make splits three-phase:

  1. Update the meta records and insert the records that the split would, but with a flag indicating that this is just an "accounting split". The idea is to make our routing layer not send requests that overlap the split point (I think adding two booleans provisional_{left,right} is enough; see the sketch after this list). This also updates the range descriptor with the split key so that the range henceforth rejects commands overlapping the split. The downside is that various components that read meta records and make decisions based on them must know about the flag. This is hopefully not too onerous if RangeLookup is made aware of this complication.

  2. Run the stats checkpoint.

  3. Run the final split transaction, which updates the meta writes (making them non-provisional) and runs the split trigger (which now uses the checkpoint).

As an alternative to 1), we could also instruct the range to keep an RHS delta as it evaluates commands. This might be easier overall, but it requires making some DistSender-like functionality available to the storage layer.
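
To make phase 1 a little more concrete, here is a minimal sketch of a provisional ("accounting") split marker and the routing check it implies. The descriptor field and the rejection logic are hypothetical illustrations, not existing CockroachDB types or APIs:

```go
package main

import (
	"errors"
	"fmt"
)

// rangeDescriptor is a toy stand-in for roachpb.RangeDescriptor with a
// hypothetical provisional split marker from phase 1 bolted on.
type rangeDescriptor struct {
	StartKey, EndKey string
	// ProvisionalSplitKey, if non-empty, marks an "accounting split": the
	// range still owns [StartKey, EndKey), but requests spanning this key
	// must be split by the routing layer before being sent.
	ProvisionalSplitKey string
}

var errSpansProvisionalSplit = errors.New("request spans a provisional split point; split and retry")

// checkRequest rejects requests that straddle the provisional split point so
// that, by the time phase 3 commits the real split, every in-flight command
// can be binned cleanly into LHS or RHS.
func (d rangeDescriptor) checkRequest(startKey, endKey string) error {
	if d.ProvisionalSplitKey == "" {
		return nil
	}
	if startKey < d.ProvisionalSplitKey && endKey > d.ProvisionalSplitKey {
		return errSpansProvisionalSplit
	}
	return nil
}

func main() {
	d := rangeDescriptor{StartKey: "a", EndKey: "z", ProvisionalSplitKey: "m"}
	fmt.Println(d.checkRequest("b", "c")) // entirely on the LHS: <nil>
	fmt.Println(d.checkRequest("k", "q")) // straddles "m": error
}
```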

Declare fewer keys.

At the time of writing, the split declares a lot of keys:

  • RW on all user keys in both the LHS and RHS. For the LHS, I don't understand why that is (@bdarnell?); it seems that declaring a read should be enough to block concurrent writes (mod making sure there's proper locking as the tsCache gets copied over); see the sketch below. And pulling on that thread, I think that once we have the stats checkpoint we don't need to block writes to the LHS at all. For the RHS the same applies, though after the split trigger applies they'll catch a mismatch error, so blocking is perhaps the better option (or we're willing to add some extra complexity that re-routes those requests to the newly created RHS replica instead).
  • RW on LHS+RHS range-local keys (txn records, etc.). The same logic as for the user keys should apply (I don't see a difference in how splits handle these).
  • RW on LHS rangeID-local keys (RHS too, but nobody accesses those prior to the split, so we don't care to improve it). I can understand declaring the read, since we copy a lot of information into the RHS (abort cache, GC threshold, etc.), but it's not clear to me why we declare a write.

PS: I think the code comment in declareKeysEndTransaction about the above is inaccurate. @bdarnell, I'm sure I'm missing some subtleties here.
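
For illustration, here is a toy version of the read-vs-write declaration argued for above. The span and access types are simplified stand-ins rather than the actual spanset API, and (as the comments below point out) a read declaration alone only blocks writes at or below the declared timestamp:

```go
package main

import "fmt"

type access int

const (
	readOnly access = iota
	readWrite
)

// span is a toy key span with an associated latch strength. An empty end key
// denotes a point key.
type span struct {
	acc        access
	start, end string
}

// declareSplitKeys sketches the proposed declaration: once a stats checkpoint
// exists, the split would only read the LHS user keyspace instead of
// write-latching it, while still write-latching the RHS user keys and the
// range-local keys the split transaction actually rewrites.
func declareSplitKeys(lhsStart, splitKey, rhsEnd string) []span {
	return []span{
		// Read (not write) over the LHS user keys: enough to stabilize what
		// the trigger reads without blocking non-conflicting writes.
		{acc: readOnly, start: lhsStart, end: splitKey},
		// Write over the RHS user keys: nothing should slip in underneath the
		// still-uninitialized RHS replica.
		{acc: readWrite, start: splitKey, end: rhsEnd},
		// Write over the descriptors the split transaction rewrites
		// (placeholder key names, not the real local keyspace layout).
		{acc: readWrite, start: "/Local/Range/" + lhsStart + "/Descriptor"},
		{acc: readWrite, start: "/Local/Range/" + splitKey + "/Descriptor"},
	}
}

func main() {
	for _, s := range declareSplitKeys("a", "m", "z") {
		fmt.Printf("%+v\n", s)
	}
}
```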

Epic CRDB-34215

tbg added this to the Later milestone on Feb 3, 2018
petermattis (Collaborator) commented:

mod making sure there's proper locking as the tsCache gets copied over

Reminder: the tsCache is now per-store, not per-replica. There is no copying of tsCache entries during a split.

tbg (Member, Author) commented Feb 4, 2018

Thanks @petermattis, I keep forgetting about that.

nvanbenschoten (Member) commented:

We could accept that the stats are not accurate. Then we could recompute the stats before running the split trigger, and simply pass MVCCStats in, and mark the halves as needing a recomputation.

If we're going to do that, do we even need to compute the stats beforehand at all? We already know that MVCCFindSplitKey will pick a split key that creates a left range of some targetSize (half the range size, at the moment), so we could conceivably just scale the stats by targetSize/(keyBytes+valBytes) and call it close enough. Or we could at least group the stats precomputation in with MVCCFindSplitKey.

Either way, the key question is whether "close enough" is good enough for a short period of time. Perhaps we could address this by adding an Inaccurate flag to the stats or by expanding the ContainsEstimates field to indicate the degree of uncertainty. Any operation that requires 100% correct stats would just need to wait until this field is cleared by the recomputation. We'd then just need to make sure that we aggressively recompute the stats after the split.
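
Roughly, the scaling idea could look like the following sketch, again with a simplified stand-in for MVCCStats (the real ContainsEstimates field and the interaction with MVCCFindSplitKey are elided):

```go
package main

import "fmt"

// mvccStats is a toy subset of the counters in enginepb.MVCCStats.
type mvccStats struct {
	KeyBytes, ValBytes, LiveBytes int64
	ContainsEstimates             bool
}

// scaleForSplit guesses the LHS stats as targetSize/(keyBytes+valBytes) of the
// totals and assigns the remainder to the RHS, flagging both sides as
// estimates so an aggressive recomputation after the split can correct them.
func scaleForSplit(total mvccStats, targetSize int64) (lhs, rhs mvccStats) {
	frac := float64(targetSize) / float64(total.KeyBytes+total.ValBytes)
	if frac > 1 {
		frac = 1
	}
	lhs = mvccStats{
		KeyBytes:          int64(float64(total.KeyBytes) * frac),
		ValBytes:          int64(float64(total.ValBytes) * frac),
		LiveBytes:         int64(float64(total.LiveBytes) * frac),
		ContainsEstimates: true,
	}
	rhs = mvccStats{
		KeyBytes:          total.KeyBytes - lhs.KeyBytes,
		ValBytes:          total.ValBytes - lhs.ValBytes,
		LiveBytes:         total.LiveBytes - lhs.LiveBytes,
		ContainsEstimates: true,
	}
	return lhs, rhs
}

func main() {
	total := mvccStats{KeyBytes: 8 << 20, ValBytes: 56 << 20, LiveBytes: 60 << 20}
	lhs, rhs := scaleForSplit(total, 32<<20) // target: half of a 64 MB range
	fmt.Printf("lhs=%+v\nrhs=%+v\n", lhs, rhs)
}
```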

RW all user keys in both RHS and LHS. For the LHS, I don't understand why that is; it seems that declaring a read should be enough to block concurrent writes

I had the same thought and discussed it with @bdarnell in #14992 (comment). One complication is that declaring the split as a read wouldn't be enough to block all writes, just writes in the past of the split txn's timestamp.
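
For reference, a toy version of the latch compatibility rule being invoked here (simplified: the real spanlatch manager also deals with key spans and sequencing):

```go
package main

import "fmt"

// conflicts implements the basic latch compatibility rule: two writes always
// conflict, two reads never conflict, and a read conflicts with a write only
// if the write's timestamp is at or below the read's timestamp.
func conflicts(aIsWrite bool, aTS int64, bIsWrite bool, bTS int64) bool {
	switch {
	case aIsWrite && bIsWrite:
		return true
	case !aIsWrite && !bIsWrite:
		return false
	case aIsWrite: // a writes, b reads
		return aTS <= bTS
	default: // a reads, b writes
		return bTS <= aTS
	}
}

func main() {
	// A split declared as a read at ts=10 would not block a write at ts=20,
	// which is exactly the gap being pointed out.
	fmt.Println(conflicts(false, 10, true, 20)) // false: the future write slips through
	fmt.Println(conflicts(false, 10, true, 5))  // true: writes in the past are blocked
}
```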

bdarnell (Contributor) commented Feb 5, 2018

PS: I think the code comment in declareKeysEndTransaction about the above is inaccurate.

It's at least missing a discussion of the timestamp issues that Nathan linked to, but other than that I think it's accurate. The main reason we have to block all writes, and not just those in the past of the split transaction's timestamp, is the stats updates (so the two paths here are really the same thing: if we moved stats recomputation out of the split trigger, we could also declare fewer keys).

tbg (Member, Author) commented Feb 8, 2018

Ah, I had missed the fact that a read only blocks writes that affect its outcome, thanks for the reminder @nvanbenschoten.

I'm generally wary of introducing more uncertainty in the stats, but it seems a lot more convenient than the alternative. Perhaps the alternative is manageable, but there are lots of corners that need to be handled (including rolling back or resuming interrupted splits).

petermattis added the C-enhancement and A-kv-distribution labels on Jul 21, 2018
petermattis removed this from the Later milestone on Oct 5, 2018
nvanbenschoten (Member) commented:

Re-opening, since this is still an issue.

blathers-crl bot added the T-kv (KV Team) label on May 9, 2022
nvanbenschoten changed the title from "storage: speed up the split trigger" to "kv: speed up the split trigger, don't hold latches while computing stats" on May 9, 2022
nvanbenschoten reopened this on May 9, 2022
tbg added the X-nostale and O-support labels on May 17, 2022
tbg (Member, Author) commented Jun 8, 2022

This came up again in https://github.com/cockroachlabs/support/issues/1631#issuecomment-1149430201, where a customer was seeing highly elevated foreground request latencies that are likely attributable to a split.

tbg (Member, Author) commented Jun 16, 2023

Came up again here (internal slack)

cockroachdb deleted a comment from amaddahian on Nov 6, 2023
williamkulju added the O-testcluster and O-23.2-scale-testing labels on Nov 10, 2023
nvanbenschoten added the P-3 label on Jan 11, 2024
nvanbenschoten added the release-blocker (indicates a release-blocker; use with a branch-release-2x.x label to denote which branch is blocked) and branch-release-24.1 (used to mark GA and release blockers, technical advisories, and bugs for 24.1) labels on Mar 13, 2024
nvanbenschoten added the GA-blocker label and removed the release-blocker label on Mar 21, 2024
miraradeva removed the GA-blocker and branch-release-24.1 labels on Apr 8, 2024
miraradeva (Contributor) commented:

All the main pieces of this work have been merged, with the exception of #119501, which we might come back to later if we have reason to believe stats recomputation is causing instability.
