
kv: speed up the split trigger, don't hold latches while computing stats #22348

Closed · tbg opened this issue Feb 3, 2018 · 9 comments

Labels: A-kv-distribution (relating to rebalancing and leasing) · C-enhancement (solution expected to add code/behavior and preserve backward compatibility; pg compat issues are the exception) · O-23.2-scale-testing (issues found during 23.2 scale testing) · O-support (would prevent or help troubleshoot a customer escalation: bugs, missing observability/tooling, docs) · O-testcluster (issues found or occurred on a test cluster, i.e. a long-running internal cluster) · P-3 (issues/test failures with no fix SLA) · sync-me · T-kv (KV Team) · X-nostale (marks an issue/PR that should be ignored by the stale bot)

Comments

tbg (Member) commented Feb 3, 2018

We want to support (and possibly default to) range sizes much larger than our current default of 64 MB. Part of this is improving snapshots, but another problem is that splitting a large range is O(range size) due to the linear scan needed to recompute the MVCC stats of one half. Even worse, the split blocks pretty much all read/write access to the range while it happens. That is not ideal today, but it will get a lot worse when more data (rows) is affected by the blockage and the trigger takes longer at the same time.

This is mostly a semi-organized brain dump. I think there are two things we should do: first, make the trigger a lot faster, and second (less important once the first is done), allow the split to admit more concurrent commands.

Take the stats computation out of the split trigger.

We could accept that the stats are not accurate. Then we could recompute the stats before running the split trigger, and simply pass MVCCStats in, and mark the halves as needing a recomputation (we have this machinery already). The upside is that this is easy; the downside is that we rely on the accuracy of MVCCStats for rebalancing and potentially other decisions.

Or we could do more work and make the LHS/RHS stats available in the split trigger. The basic idea is to trigger a stats recomputation for the LHS with a custom Raft command, and maintain the MVCCStats deltas of updates to the LHS and RHS separately. This can be done without declaring any keys (i.e. without making anything wait). Then, once the base computation is done, commit the split (supplying it with the base computation, and letting it access the accumulated LHS/RHS deltas from the Range state).
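
A minimal sketch of the delta bookkeeping, using a simplified stand-in for MVCCStats; the custom Raft command, the locking, and the real range state are all elided:

```go
package main

import "fmt"

// mvccStats is a toy stand-in for a few of the counters the real
// enginepb.MVCCStats tracks.
type mvccStats struct {
	KeyBytes, ValBytes, KeyCount int64
}

func (s *mvccStats) add(o mvccStats) {
	s.KeyBytes += o.KeyBytes
	s.ValBytes += o.ValBytes
	s.KeyCount += o.KeyCount
}

// splitStatsHelper accumulates per-side deltas while a background
// recomputation of the LHS base stats is in flight.
type splitStatsHelper struct {
	splitKey string
	lhsBase  mvccStats // result of the (slow) background scan up to splitKey
	lhsDelta mvccStats // writes below splitKey since the scan started
	rhsDelta mvccStats // writes at or above splitKey since the scan started
}

// record bins a write's stats contribution into the LHS or RHS delta.
// (In reality a command could straddle the split key; that complication
// is discussed next.)
func (h *splitStatsHelper) record(key string, delta mvccStats) {
	if key < h.splitKey {
		h.lhsDelta.add(delta)
	} else {
		h.rhsDelta.add(delta)
	}
}

// finish combines the base computation with the accumulated deltas at
// split-commit time: LHS = base + lhsDelta, RHS = total - LHS.
func (h *splitStatsHelper) finish(total mvccStats) (lhs, rhs mvccStats) {
	lhs = h.lhsBase
	lhs.add(h.lhsDelta)
	rhs = mvccStats{
		KeyBytes: total.KeyBytes - lhs.KeyBytes,
		ValBytes: total.ValBytes - lhs.ValBytes,
		KeyCount: total.KeyCount - lhs.KeyCount,
	}
	return lhs, rhs
}

func main() {
	h := splitStatsHelper{splitKey: "m", lhsBase: mvccStats{KeyBytes: 100, ValBytes: 1000, KeyCount: 10}}
	h.record("a", mvccStats{KeyBytes: 10, ValBytes: 90, KeyCount: 1})
	h.record("z", mvccStats{KeyBytes: 20, ValBytes: 80, KeyCount: 2})
	lhs, rhs := h.finish(mvccStats{KeyBytes: 300, ValBytes: 3000, KeyCount: 30})
	fmt.Printf("lhs=%+v rhs=%+v\n", lhs, rhs)
}
```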

The main complication with this is that commands can overlap the split key (and so can't be binned into LHS/RHS). I think there's something here where we can make splits three-phase:

  1. Update the meta records and insert the records that the split would, but with a flag indicating that this is just an "accounting split". The idea is to make our routing layer not send requests that overlap the split point (I think adding two booleans provisional_{left,right} is enough; see the sketch after this list). This also updates the range descriptor with the split key so that the range henceforth rejects commands overlapping the split. The downside is that various components that read meta records and make decisions based on them must know about the flag. This is hopefully not too onerous if RangeLookup is made aware of this complication.

  2. Run the stats checkpoint.

  3. Run the final split transaction, which updates the meta writes (making them non-provisional) and runs the split trigger (which now uses the checkpoint).

As an alternative to 1), we could also instruct the range to keep an RHS delta as it evaluates commands. This might be easier overall, but it requires making some DistSender-like functionality available to the storage layer.
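
To make phase 1 a little more concrete, here is a minimal sketch of a provisional ("accounting") split marker and the routing check it implies. The descriptor field and the rejection logic are hypothetical illustrations, not existing CockroachDB types or APIs:

```go
package main

import (
	"errors"
	"fmt"
)

// rangeDescriptor is a toy stand-in for roachpb.RangeDescriptor with a
// hypothetical provisional split marker from phase 1 bolted on.
type rangeDescriptor struct {
	StartKey, EndKey string
	// ProvisionalSplitKey, if non-empty, marks an "accounting split": the
	// range still owns [StartKey, EndKey), but requests spanning this key
	// must be split by the routing layer before being sent.
	ProvisionalSplitKey string
}

var errSpansProvisionalSplit = errors.New("request spans a provisional split point; split and retry")

// checkRequest rejects requests that straddle the provisional split point so
// that, by the time phase 3 commits the real split, every in-flight command
// can be binned cleanly into LHS or RHS.
func (d rangeDescriptor) checkRequest(startKey, endKey string) error {
	if d.ProvisionalSplitKey == "" {
		return nil
	}
	if startKey < d.ProvisionalSplitKey && endKey > d.ProvisionalSplitKey {
		return errSpansProvisionalSplit
	}
	return nil
}

func main() {
	d := rangeDescriptor{StartKey: "a", EndKey: "z", ProvisionalSplitKey: "m"}
	fmt.Println(d.checkRequest("b", "c")) // entirely on the LHS: <nil>
	fmt.Println(d.checkRequest("k", "q")) // straddles "m": error
}
```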

Declare fewer keys.

At the time of writing, the split declares a lot of keys:

  • RW on all user keys in both the LHS and RHS. For the LHS, I don't understand why that is (@bdarnell?); it seems that declaring a read should be enough to block concurrent writes (mod making sure there's proper locking as the tsCache gets copied over); see the sketch below. And pulling on that thread, I think that once we have the stats checkpoint we don't need to block writes to the LHS at all. For the RHS the same applies, though after the split trigger applies they'll catch a mismatch error, so blocking is perhaps the better option (or we're willing to add some extra complexity that re-routes those requests to the newly created RHS replica instead).
  • RW on LHS+RHS range-local keys (txn records, etc.). The same logic as for the user keys should apply (I don't see a difference in how splits handle these).
  • RW on LHS rangeID-local keys (RHS too, but nobody accesses those prior to the split, so we don't care to improve it). I can understand declaring the read, since we copy a lot of information into the RHS (abort cache, GC threshold, etc.), but it's not clear to me why we declare a write.

PS: I think the code comment in declareKeysEndTransaction about the above is inaccurate. @bdarnell, I'm sure I'm missing some subtleties here.
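
For illustration, here is a toy version of the read-vs-write declaration argued for above. The span and access types are simplified stand-ins rather than the actual spanset API, and (as the comments below point out) a read declaration alone only blocks writes at or below the declared timestamp:

```go
package main

import "fmt"

type access int

const (
	readOnly access = iota
	readWrite
)

// span is a toy key span with an associated latch strength. An empty end key
// denotes a point key.
type span struct {
	acc        access
	start, end string
}

// declareSplitKeys sketches the proposed declaration: once a stats checkpoint
// exists, the split would only read the LHS user keyspace instead of
// write-latching it, while still write-latching the RHS user keys and the
// range-local keys the split transaction actually rewrites.
func declareSplitKeys(lhsStart, splitKey, rhsEnd string) []span {
	return []span{
		// Read (not write) over the LHS user keys: enough to stabilize what
		// the trigger reads without blocking non-conflicting writes.
		{acc: readOnly, start: lhsStart, end: splitKey},
		// Write over the RHS user keys: nothing should slip in underneath the
		// still-uninitialized RHS replica.
		{acc: readWrite, start: splitKey, end: rhsEnd},
		// Write over the descriptors the split transaction rewrites
		// (placeholder key names, not the real local keyspace layout).
		{acc: readWrite, start: "/Local/Range/" + lhsStart + "/Descriptor"},
		{acc: readWrite, start: "/Local/Range/" + splitKey + "/Descriptor"},
	}
}

func main() {
	for _, s := range declareSplitKeys("a", "m", "z") {
		fmt.Printf("%+v\n", s)
	}
}
```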

Epic CRDB-34215

tbg added this to the Later milestone on Feb 3, 2018
petermattis (Collaborator) commented:

mod making sure there's proper locking as the tsCache gets copied over

Reminder: the tsCache is now per-store, not per-replica. There is no copying of tsCache entries during a split.

tbg (Member, Author) commented Feb 4, 2018

Thanks @petermattis, I keep forgetting about that.

nvanbenschoten (Member) commented:

We could accept that the stats are not accurate. Then we could recompute the stats before running the split trigger, and simply pass MVCCStats in, and mark the halves as needing a recomputation.

If we're going to do that, do we even need to compute the stats beforehand at all? We already know that MVCCFindSplitKey will pick a split key that creates a left range of some targetSize (half the range size, at the moment), so we could conceivably just scale the stats by targetSize/(keyBytes+valBytes) and call it close enough. Or we could at least group the stats precomputation in with MVCCFindSplitKey.

Either way, the key question is whether "close enough" is good enough for a short period of time. Perhaps we could address this by adding an Inaccurate flag to the stats or by expanding the ContainsEstimates field to indicate the degree of uncertainty. Any operation that requires 100% correct stats would just need to wait until this field is cleared by the recomputation. We'd then just need to make sure that we aggressively recompute the stats after the split.
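
Roughly, the scaling idea could look like the following sketch, again with a simplified stand-in for MVCCStats (the real ContainsEstimates field and the interaction with MVCCFindSplitKey are elided):

```go
package main

import "fmt"

// mvccStats is a toy subset of the counters in enginepb.MVCCStats.
type mvccStats struct {
	KeyBytes, ValBytes, LiveBytes int64
	ContainsEstimates             bool
}

// scaleForSplit guesses the LHS stats as targetSize/(keyBytes+valBytes) of the
// totals and assigns the remainder to the RHS, flagging both sides as
// estimates so an aggressive recomputation after the split can correct them.
func scaleForSplit(total mvccStats, targetSize int64) (lhs, rhs mvccStats) {
	frac := float64(targetSize) / float64(total.KeyBytes+total.ValBytes)
	if frac > 1 {
		frac = 1
	}
	lhs = mvccStats{
		KeyBytes:          int64(float64(total.KeyBytes) * frac),
		ValBytes:          int64(float64(total.ValBytes) * frac),
		LiveBytes:         int64(float64(total.LiveBytes) * frac),
		ContainsEstimates: true,
	}
	rhs = mvccStats{
		KeyBytes:          total.KeyBytes - lhs.KeyBytes,
		ValBytes:          total.ValBytes - lhs.ValBytes,
		LiveBytes:         total.LiveBytes - lhs.LiveBytes,
		ContainsEstimates: true,
	}
	return lhs, rhs
}

func main() {
	total := mvccStats{KeyBytes: 8 << 20, ValBytes: 56 << 20, LiveBytes: 60 << 20}
	lhs, rhs := scaleForSplit(total, 32<<20) // target: half of a 64 MB range
	fmt.Printf("lhs=%+v\nrhs=%+v\n", lhs, rhs)
}
```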

RW all user keys in both RHS and LHS. For the LHS, I don't understand why that is; it seems that declaring a read should be enough to block concurrent writes

I had the same thought and discussed it with @bdarnell in #14992 (comment). One complication is that declaring the split as a read wouldn't be enough to block all writes, just writes in the past of the split txn's timestamp.
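
For reference, a toy version of the latch compatibility rule being invoked here (simplified: the real spanlatch manager also deals with key spans and sequencing):

```go
package main

import "fmt"

// conflicts implements the basic latch compatibility rule: two writes always
// conflict, two reads never conflict, and a read conflicts with a write only
// if the write's timestamp is at or below the read's timestamp.
func conflicts(aIsWrite bool, aTS int64, bIsWrite bool, bTS int64) bool {
	switch {
	case aIsWrite && bIsWrite:
		return true
	case !aIsWrite && !bIsWrite:
		return false
	case aIsWrite: // a writes, b reads
		return aTS <= bTS
	default: // a reads, b writes
		return bTS <= aTS
	}
}

func main() {
	// A split declared as a read at ts=10 would not block a write at ts=20,
	// which is exactly the gap being pointed out.
	fmt.Println(conflicts(false, 10, true, 20)) // false: the future write slips through
	fmt.Println(conflicts(false, 10, true, 5))  // true: writes in the past are blocked
}
```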

bdarnell (Contributor) commented Feb 5, 2018

PS: I think the code comment in declareKeysEndTransaction about the above is inaccurate.

It's at least missing a discussion of the timestamp issues that Nathan linked to, but other than that I think it's accurate. The main reason we have to block all writes, and not just those in the past of the split transaction's timestamp, is the stats updates (so the two paths here are really the same thing: if we moved stats recomputation out of the split trigger, we could also declare fewer keys).

tbg (Member, Author) commented Feb 8, 2018

Ah, I had missed the fact that a read only blocks writes that affect its outcome, thanks for the reminder @nvanbenschoten.

I'm generally wary of introducing more uncertainty in the stats, but it seems a lot more convenient than the alternative. Perhaps the alternative is manageable, but there are lots of corners that need to be handled (including rolling back or resuming interrupted splits).

petermattis added the C-enhancement and A-kv-distribution labels on Jul 21, 2018
petermattis removed this from the Later milestone on Oct 5, 2018
nvanbenschoten (Member) commented:

Re-opening, since this is still an issue.

blathers-crl bot added the T-kv (KV Team) label on May 9, 2022
nvanbenschoten changed the title from "storage: speed up the split trigger" to "kv: speed up the split trigger, don't hold latches while computing stats" on May 9, 2022
nvanbenschoten reopened this on May 9, 2022
tbg added the X-nostale and O-support labels on May 17, 2022
tbg (Member, Author) commented Jun 8, 2022

This came up again in https://github.com/cockroachlabs/support/issues/1631#issuecomment-1149430201, where a customer was seeing highly elevated foreground request latencies that are likely attributable to a split.

tbg (Member, Author) commented Jun 16, 2023

Came up again here (internal slack)

cockroachdb deleted a comment from amaddahian on Nov 6, 2023
williamkulju added the O-testcluster and O-23.2-scale-testing labels on Nov 10, 2023
nvanbenschoten added the P-3 label on Jan 11, 2024
nvanbenschoten added the release-blocker (indicates a release-blocker; use with a branch-release-2x.x label to denote which branch is blocked) and branch-release-24.1 (used to mark GA and release blockers, technical advisories, and bugs for 24.1) labels on Mar 13, 2024
nvanbenschoten added the GA-blocker label and removed the release-blocker label on Mar 21, 2024
miraradeva removed the GA-blocker and branch-release-24.1 labels on Apr 8, 2024
miraradeva (Contributor) commented:

All the main pieces of this work have been merged, with the exception of #119501, which we might come back to later if we have reason to believe stats recomputation is causing instability.
