kv: speed up the split trigger, don't hold latches while computing stats #22348
Comments
Reminder: the tsCache is now per-store, not per-replica. There is no copying of tsCache entries during a split.
Thanks @petermattis, I keep forgetting about that.
If we're going to do that, do we even need to compute the stats beforehand at all? We already know that […] Either way, the key question is whether "close enough" is good enough for a short period of time. Perhaps we could address this by adding an […]
I had the same thought and discussed it with @bdarnell in #14992 (comment). One complication is that declaring the split as a read wouldn't be enough to block all writes, just writes in the past of the split txn's timestamp.
It's at least missing a discussion of the timestamp issues that Nathan linked to. But other than that, I think it's accurate. The main reason we have to block all writes, and not just those in the past of the split transaction's timestamp, is the stat updates. (So the two paths here are kind of the same thing: if we moved stat recomputation out of the split trigger, we could also declare fewer keys.)
Ah, I had missed the fact that a read only blocks writes that affect its outcome, thanks for the reminder @nvanbenschoten. I'm generally wary of introducing more uncertainty in the stats, but it seems a lot more convenient than the alternative. Perhaps the alternative is manageable, but there are lots of corners that need to be handled (including rolling back or resuming interrupted splits).
Re-opening, since this is still an issue.
This came up again in https://github.com/cockroachlabs/support/issues/1631#issuecomment-1149430201, where a customer was seeing highly elevated foreground request latencies that were likely attributable to a split.
Came up again here (internal slack).
All the main pieces of this work have been merged, with the exception of #119501, which we might come back to later if we have reason to believe stats recomputation is causing any instability.
We want to support (and possibly default to) ranges much larger than our current default of 64MB. Part of this is improving snapshots, but another problem is that splitting a large range is O(range size) due to the linear scan needed to recompute the MVCC stats of one half. Even worse, this split blocks pretty much all read/write access to the range while it happens. That is not ideal today, and it will get a lot worse when more data (rows) is affected by the blockage and the trigger takes longer at the same time.
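To make the cost concrete, here is a minimal, self-contained Go sketch of why recomputing one half's stats is O(range size): every key/value pair in that half has to be visited and accumulated. The `kv` and `mvccStats` types are made up for illustration; this is not CockroachDB's actual engine API.

```go
package main

import "fmt"

// mvccStats mirrors the rough shape of the stats a split has to produce.
type mvccStats struct {
	KeyCount, ValCount int64
	KeyBytes, ValBytes int64
}

type kv struct {
	key, val []byte
}

// computeStats walks every key in [start, end): a linear scan whose cost grows
// with the size of the range half. This is the work the issue wants moved out
// of the split trigger (or avoided entirely).
func computeStats(data []kv, start, end string) mvccStats {
	var ms mvccStats
	for _, e := range data {
		k := string(e.key)
		if k < start || k >= end {
			continue
		}
		ms.KeyCount++
		ms.ValCount++
		ms.KeyBytes += int64(len(e.key))
		ms.ValBytes += int64(len(e.val))
	}
	return ms
}

func main() {
	data := []kv{
		{[]byte("a"), []byte("1")},
		{[]byte("b"), []byte("2")},
		{[]byte("c"), []byte("3")},
	}
	fmt.Printf("%+v\n", computeStats(data, "a", "b")) // stats for the left half only
}
```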
This is mostly a semi-organized brain dump. I think there are two things we should do: first, make the trigger a lot faster, and second (less important once the first is done), make the split allow more concurrent commands.
Take the stats computation out of the split trigger.
We could accept that the stats are not accurate. Then we could recompute the stats before running the split trigger, simply pass MVCCStats in, and mark the halves as needing a recomputation (we have this machinery already). The upside is that this is easy; the downside is that we rely on MVCCStats' accuracy for rebalancing and potentially other decisions.
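A rough sketch of what this first option could look like, using hypothetical `mvccStats` and `splitTriggerArgs` types (the real proto fields differ): the trigger takes LHS stats computed outside the latch-holding path, derives the RHS by subtraction, and flags both halves as containing estimates so a later recomputation can correct them.

```go
package main

import "fmt"

type mvccStats struct {
	LiveBytes         int64
	ContainsEstimates bool // marks the stats as possibly inaccurate (illustrative field)
}

type splitTriggerArgs struct {
	SplitKey          string
	PreSplitLeftStats mvccStats // computed before evaluation, without holding latches
}

// applySplitTrigger derives the RHS stats by subtraction instead of scanning,
// and marks both halves so a background job can recompute exact values later.
func applySplitTrigger(total mvccStats, args splitTriggerArgs) (lhs, rhs mvccStats) {
	lhs = args.PreSplitLeftStats
	rhs = mvccStats{LiveBytes: total.LiveBytes - lhs.LiveBytes}
	lhs.ContainsEstimates, rhs.ContainsEstimates = true, true
	return lhs, rhs
}

func main() {
	lhs, rhs := applySplitTrigger(
		mvccStats{LiveBytes: 1000},
		splitTriggerArgs{SplitKey: "m", PreSplitLeftStats: mvccStats{LiveBytes: 400}},
	)
	fmt.Println(lhs, rhs)
}
```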
Or we could do more work and make the LHS/RHS stats available in the split trigger. The basic idea is to trigger a stats recomputation for the LHS with a custom Raft command, and maintain the MVCCStats deltas of updates to the LHS and RHS separately. This can be done without declaring any keys (i.e. without making anything wait). Then, once the base computation is done, commit the split (supplying it with the base computation, and letting it access the accumulated LHS/RHS deltas from the Range state).
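A hedged sketch of the delta-tracking idea, with illustrative types rather than anything in the codebase: while the base recomputation runs, each write's stats delta is binned by which side of the split key it lands on, so the eventual trigger only has to fold the accumulated deltas into the checkpoint.

```go
package main

import (
	"fmt"
	"sync"
)

type statsDelta struct{ KeyCount, KeyBytes int64 }

func (d *statsDelta) add(o statsDelta) {
	d.KeyCount += o.KeyCount
	d.KeyBytes += o.KeyBytes
}

// pendingSplit bins each write's stats delta by which side of the split key it
// falls on; commands spanning the split key are the complication noted below.
type pendingSplit struct {
	mu          sync.Mutex
	splitKey    string
	left, right statsDelta
}

func (p *pendingSplit) recordWrite(key string, d statsDelta) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if key < p.splitKey {
		p.left.add(d)
	} else {
		p.right.add(d)
	}
}

func main() {
	p := &pendingSplit{splitKey: "m"}
	p.recordWrite("a", statsDelta{KeyCount: 1, KeyBytes: 10})
	p.recordWrite("z", statsDelta{KeyCount: 1, KeyBytes: 20})
	fmt.Printf("left=%+v right=%+v\n", p.left, p.right)
}
```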
The main complication with this is that commands can overlap the split key (and so can't be binned into LHS/RHS). I think there's something here where we can make splits three-phase:
1. Update the meta records and insert the records that the split would, but with a flag that indicates that this is just an "accounting split". The idea is to make our routing layer not send requests that overlap the split point (I think adding two booleans `provisional_{left,right}` is enough; see the sketch after this list). This also updates the range descriptor with the split key so that the range henceforth rejects commands overlapping the split. The downside is that various components that read meta records and make decisions based on them must know about the flag. This is hopefully not too onerous if RangeLookup is made aware of this complication.
2. Run the stats checkpoint.
3. Run the final split transaction, which updates the meta writes (making them non-provisional) and runs the split trigger (which now uses the checkpoint).
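The sketch referenced in step 1: a toy range descriptor carrying `provisional_{left,right}`-style flags and a routing check that refuses requests straddling the provisional split point. The field and function names are made up for illustration; a real implementation would live in the descriptor proto and DistSender.

```go
package main

import (
	"errors"
	"fmt"
)

type rangeDescriptor struct {
	StartKey, EndKey string
	SplitKey         string // provisional split point, empty if none
	ProvisionalLeft  bool
	ProvisionalRight bool
}

// routeRequest rejects requests that overlap a provisional split point so that
// every command can be binned cleanly into an LHS or RHS stats delta.
func routeRequest(desc rangeDescriptor, start, end string) error {
	if desc.SplitKey != "" && start < desc.SplitKey && end > desc.SplitKey {
		return errors.New("request overlaps provisional split point; send to each half separately")
	}
	return nil
}

func main() {
	desc := rangeDescriptor{StartKey: "a", EndKey: "z", SplitKey: "m", ProvisionalLeft: true, ProvisionalRight: true}
	fmt.Println(routeRequest(desc, "c", "p")) // straddles "m": rejected
	fmt.Println(routeRequest(desc, "c", "f")) // entirely on the LHS: ok
}
```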
As an alternative to 1) we could also instruct the range to keep an RHS delta as it evaluates commands. This might be easier overall, but it requires some DistSender-like functionality to be made available to the storage layer.
Declare fewer keys.
At the time of writing, the split declares a lot of keys:
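(The code excerpt that originally followed is not reproduced here. As a stand-in, below is a rough illustration of the kind of spans involved, namely addressing records for both halves plus the range's data, which is what forces other commands on the range to wait. It is not the actual `declareKeysEndTransaction`, and the key layout is simplified.)

```go
package main

import "fmt"

type span struct{ Key, EndKey string }

// declaredSpansForSplit lists illustrative latch spans for a split at splitKey.
func declaredSpansForSplit(startKey, splitKey, endKey string) []span {
	return []span{
		{Key: "meta2/" + splitKey},      // addressing record for the new LHS (illustrative layout)
		{Key: "meta2/" + endKey},        // addressing record for the RHS (illustrative layout)
		{Key: startKey, EndKey: endKey}, // the range's data, which blocks concurrent commands
	}
}

func main() {
	for _, s := range declaredSpansForSplit("a", "m", "z") {
		fmt.Printf("%+v\n", s)
	}
}
```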
PS: I think the comment in `declareKeysEndTransaction` on the above in the code is inaccurate. @bdarnell I'm sure I'm missing some subtleties here.

Epic CRDB-34215