perf: Introduce sublevels in L0 #609
Comments
The writeup, outside of the motivation section, is incomplete. I want to expand the implementation section a lot more based on code reading and observations, but this is a start.
I suggest not bothering with an
Do we have a list of additional workloads to try, beyond the 10 node TPCC import that I ran with? It would be easy to run them with the existing prototype.
@sumeerbhola It would be possible to build a mapping of old interval indices to new interval indices, and then go around the data structure and update all intervals, right? That way you still maintain the density, and the rest of the algorithm (after the translation) still works as you'd expect. The reason why I want to preserve the instance of the data structure is to prevent our view of "flush split keys" from changing too significantly every time there's a new revision. Correct me if my mental model is incorrect, but I can see a small "shift" (in any direction) in flush split keys being pretty detrimental to the point of having those keys in the first place, since flushes after such a shift would have SSTs that bridge the gaps induced by the older split keys, greatly increasing the surface area of compactions. It would be preferable to add or remove keys on a one-off basis to maintain the target sizes, instead of recalculating it all from scratch without any preserved state of the past world.
My mental model of how the flush split keys evolve is hazy. I can definitely see them shifting after a flush or L0 compaction, but I'm not picturing how significant that shifting will be. I think it is worthwhile to gather some evidence first as to whether this is a real problem or not. It is also possible that trying to incrementally maintain the flush split keys causes problems itself.
You will have to recalculate the state from scratch when opening a DB. Incremental maintenance of a data structure can be a lot more challenging than rebuilding it from scratch whenever there is a change in the underlying data. Consider a balanced tree vs a sorted array. There be dragons with the incremental maintenance, and since we'll need to be able to rebuild from scratch anyways, my preference would be to start with the rebuild from scratch approach, get it working and rock solid, and then add the incremental maintenance if it proves necessary.
Yes. I was just giving that as an example -- there is more: e.g. each

Regarding stability of "flush split keys", I may be misunderstanding what you are suggesting, but an incremental data structure does nothing there, since it behaves as if it was constructed from scratch. Also, stability is easier to achieve by just using the same split keys until, say, N% of the files in L0 have changed -- i.e., keep a copy of the split keys somewhere. But can you try to construct an example where all files have non-coincident end points, to see what bad things happen when split keys keep shifting slightly -- I am unable to see the problem with that (the biasing towards the triangular shape of compaction picking should prevent issues). Also, there is a tradeoff between stability and responding quickly to a change in key distribution -- a cold part of the interval space may now be receiving a lot of data (we see that with ingestion).
Yeah, the complexity will increase pretty substantially for every bit that'll need updating. I'll can the translation idea for now. Thanks for elaborating on the challenges with this approach, really appreciated. I'll also go ahead and update the issue description.

Regarding flush split key stability, I'll try to set up a benchmark scenario where this is problematic. The issue I saw with the current approach was that the flush split keys were calculated in a left-to-right iteration with no incorporation of past split keys. We could just go the route where we preserve past split keys unless there's a significant change, and rebuild the rest of the data structure at every manifest revision anyway. Either way, this discussion doesn't block my work anytime soon. I'm going to polish up the current approach, but this is food for thought for later on.

Here's a simplified scenario I was thinking of. Assume
For simplicity we can assume each sublevel depicted corresponds to one flush. (In practice, sstables `h..n` and `p..y` will go into sublevel 2, not 3, but that doesn't make a difference here.) Note that the first two flushes had split keys

Now, it's increasingly difficult to pick a base compaction of stack height 3 without incorporating most of L0 into it, negating the main benefit of flush split keys. In a hypothetical scenario where you do account for past split keys, the algorithm would be smart enough to just introduce

What do you think? Let me know if this is not worth worrying about. I'll try to see if it comes up in practice at all. Again, this discussion isn't a high priority at all, but it's something that popped up in my mind.
The trouble with such heuristics is that it gets tricky to tune things like the "okay for now" -- it can very easily get worse. More importantly, this example is an L0 that can be represented as a "thin rectangle" -- the number of files is not significantly greater than the number of sublevels. And it is an easy "thin rectangle" since there are few files -- picking up 20 L0 files in an L0 compaction is still a fast enough L0=>Lbase compaction. My experimental observation is that sublevels, and the associated compaction algorithms, cause a backlog in L0 to get shaped into a "fat rectangle", say 20 sublevels and 1000 files, so finding compaction concurrency despite shifting split points is not typically a problem.
I realized I never really linked PRs here, so here are some of the larger ones that are a part of this change:

Merging all of the above and compiling it with Cockroach with this patch to enable sublevel compactions, flush splits, and sublevel-driven
#614, #670, #675 have been merged, and sublevel compactions have been turned on in Cockroach as of cockroachdb/cockroach#50371.
This issue falls under the general umbrella of compaction improvements suggested in #552.
Motivation
The LSM tree is made up of multiple levels, and SSTables within a level are not allowed to overlap one another. This invariant is relaxed for the topmost level (L0), where SSTables are allowed to overlap in key ranges (and as of #591, seqnum ranges as well). The side effect of this is that all SSTables in L0 must be merged during read time, as each SSTable contributes one unit to read amplification.
If L0 gets a lot of SSTables, and these SSTables are particularly wide in key range, they end up overlapping with many other L0 SSTables as well as with many SSTables in LBase (the next non-empty level "below" L0). This results in very wide and slow compactions (as all overlapping L0 and LBase SSTables must be compacted together). In addition, the opportunity for concurrent L0 -> LBase compactions is reduced.
These situations are observed during large IMPORT operations, where L0 SSTables are created faster than they can be drained by L0 -> LBase compactions (and subsequent compactions). This increases read amplification and increases latency across the board until the LSM tree goes back towards the more "natural" shape of having ~90% of its bytes in the bottommost level.
This would be a simple example of an LSM tree with multiple overlapping L0 SSTables.
Assume that newer L0 SSTables are shown above older ones.
This issue introduces a design that has already been prototyped and experimented with in #563. If L0 is organized into sublevels, where the regular level invariants hold within a sublevel (no key range overlaps), and the sublevels are ordered so that "newer" sublevels (defined as sublevels with higher indexes) shadow keys in older sublevels, we are able to reduce read amplification by using `levelIters` for each sublevel. The above example could then be written as:

This has reduced read amplification by 1, as `a..f` and `m..z` can occupy the same sublevel. L0's contribution to read amplification can therefore be defined as the greatest height of the stack of overlapping L0 SSTables (arranged in sublevels) at any given key. This will always be less than or equal to the number of sublevels.

However, we still have an issue with wide SSTables and compactions. If the SST in sublevel 1 is chosen for compaction, we inevitably end up choosing all of L0 and LBase in the same compaction. To keep compactions shorter and improve their concurrency, we introduce flush split keys. At flush time, each SSTable would be partitioned into multiple SSTables at some defined user keys known as flush split keys. More on how these flush split keys can be picked is under Implementation design decisions.
In the above case, let's assume `m` and `r` are flush split keys. This results in this LSM tree:

Such an LSM tree would have the same amount of read amplification as we saw earlier, while also allowing multiple, smaller L0 -> LBase compactions to proceed in parallel. In particular, 3 parallel L0 -> LBase compactions can proceed in this example: all the L0 and L2 SSTables in the key range `a..l`, all the ones in `m..q`, and all the ones in `r..z`.

This vertical rectangular shape of compactions is what we want to find, since it maximizes concurrency and effectiveness. Effectiveness (or compaction score) can be defined as the number of files reduced; therefore, the key interval that is the "tallest" would be the most effective one to compact. Consider this example:
In this case, compacting all SSTs in `m..q`, starting from the `m..o` SST in sublevel 4 downwards, would be the most effective choice while still being narrow, since it would reduce stack depth. This compaction could single-handedly eliminate sublevels 3 and 4, assuming it's the only operation taking place.

Implementation design decisions
Sublevels can be deduced from a set of L0 SSTs by iterating through them from oldest to newest and, for each file, finding the highest sublevel assigned to any overlapping L0 SST (if any) and assigning the file that sublevel plus 1.
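A minimal Go sketch of that assignment, assuming a hypothetical `fileMeta` type and user-key comparison function (neither is the actual Pebble API); files are visited from oldest to newest, and each lands one sublevel above the highest overlapping file seen so far:

```go
// Hypothetical types for illustration only; the real implementation lives in
// the prototype in #563 and uses Pebble's own file metadata.
type Compare func(a, b []byte) int

type fileMeta struct {
	smallest, largest []byte // user key bounds of the SSTable
	subLevel          int    // assigned sublevel; 0 is the oldest
}

// assignSublevels places each L0 file one sublevel above the highest sublevel
// of any already-placed file whose key range overlaps it.
func assignSublevels(filesOldestFirst []*fileMeta, cmp Compare) {
	var placed []*fileMeta
	for _, f := range filesOldestFirst {
		f.subLevel = 0
		for _, g := range placed {
			// Two files overlap if neither ends before the other begins.
			overlaps := cmp(f.smallest, g.largest) <= 0 && cmp(g.smallest, f.largest) <= 0
			if overlaps && g.subLevel+1 > f.subLevel {
				f.subLevel = g.subLevel + 1
			}
		}
		placed = append(placed, f)
	}
}
```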
To reduce the number of key comparisons, an ordered list of interval keys could be produced, as done in #563. These would be the union set of all SSTable boundary keys. After this is produced, all subsequent operations around SSTable boundary comparisons could just deal with index comparisons with respect to the ordered list, instead of having to do more extensive key comparisons.
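For example, a rough sketch of building that ordered key list (reusing the hypothetical types from the sketch above plus the standard `sort` package); once boundaries are expressed as indices into this slice, overlap checks become integer comparisons:

```go
// buildIntervalKeys returns the sorted, de-duplicated union of all SSTable
// boundary keys.
func buildIntervalKeys(files []*fileMeta, cmp Compare) [][]byte {
	var keys [][]byte
	for _, f := range files {
		keys = append(keys, f.smallest, f.largest)
	}
	sort.Slice(keys, func(i, j int) bool { return cmp(keys[i], keys[j]) < 0 })
	out := keys[:0]
	for i, k := range keys {
		if i == 0 || cmp(k, out[len(out)-1]) != 0 {
			out = append(out, k) // keep only the first of each run of equal keys
		}
	}
	return out
}

// intervalIndex maps an SSTable boundary key to its index in the ordered list,
// so later comparisons can be done on ints instead of user keys.
func intervalIndex(keys [][]byte, key []byte, cmp Compare) int {
	return sort.Search(len(keys), func(i int) bool { return cmp(keys[i], key) >= 0 })
}
```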
Compactions can be chosen by picking seed files out of tall intervals (intervals in which many SSTables overlap). For base compactions, we'd iterate downward (towards the 0 sublevel), picking all overlapping SSTables along the way, and for intra-L0 compactions, we'd proceed in the other direction (towards the highest indexed sublevel). These compactions can optionally be expanded in the other direction (higher indexed sublevels for base compactions, lower indexed ones for intra-L0) by progressively adding SSTables in that interval until we can't add one due to concurrent compactions.
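A simplified sketch of that base-compaction sweep, again using the hypothetical types above (the prototype in #563 additionally works over interval indices, respects files already marked for concurrent compactions, and scores candidates). Starting from a seed file, it walks toward sublevel 0 and keeps collecting files until the accumulated key range stops growing:

```go
// pickBaseCompaction gathers the seed file plus every file in lower sublevels
// that transitively overlaps the accumulated key range. sublevels[i] holds the
// files assigned to sublevel i.
func pickBaseCompaction(seed *fileMeta, sublevels [][]*fileMeta, cmp Compare) []*fileMeta {
	picked := map[*fileMeta]bool{seed: true}
	lo, hi := seed.smallest, seed.largest
	// Repeat the downward sweep until the key range stops expanding, so files
	// that only start overlapping after an expansion are still included.
	for expanded := true; expanded; {
		expanded = false
		for sl := seed.subLevel - 1; sl >= 0; sl-- {
			for _, f := range sublevels[sl] {
				if picked[f] || cmp(f.largest, lo) < 0 || cmp(f.smallest, hi) > 0 {
					continue // already taken, or no overlap with the range
				}
				picked[f] = true
				if cmp(f.smallest, lo) < 0 {
					lo, expanded = f.smallest, true
				}
				if cmp(f.largest, hi) > 0 {
					hi, expanded = f.largest, true
				}
			}
		}
	}
	files := make([]*fileMeta, 0, len(picked))
	for f := range picked {
		files = append(files, f)
	}
	return files
}
```

An intra-L0 compaction would be picked the same way, except the sweep proceeds upward from the seed toward the highest-indexed sublevel.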
Flush split keys could be chosen by iterating through the interval keys in order, from lowest to highest, keeping track of the bytes in files observed since the last flush split key (with an estimated byte count for any files we're only partially through), and publishing a split key when that byte count exceeds a specified threshold.
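As an example, a rough sketch under the assumption that we already have an estimate of the bytes falling between each pair of consecutive interval keys (the hypothetical `bytesBetween` slice below); a split key is published whenever the bytes accumulated since the previous split key cross the threshold:

```go
// flushSplitKeys walks the ordered interval keys left to right, accumulating
// the estimated bytes between consecutive keys, and emits a split key whenever
// the accumulation since the last split key reaches flushSplitBytes.
// bytesBetween[i] estimates the bytes of L0 data between intervalKeys[i] and
// intervalKeys[i+1], so len(bytesBetween) == len(intervalKeys)-1.
func flushSplitKeys(intervalKeys [][]byte, bytesBetween []uint64, flushSplitBytes uint64) [][]byte {
	var splits [][]byte
	var acc uint64
	for i := 1; i < len(intervalKeys); i++ {
		acc += bytesBetween[i-1]
		if acc >= flushSplitBytes {
			splits = append(splits, intervalKeys[i])
			acc = 0
		}
	}
	return splits
}
```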
Currently the prototype in #563 recreates the entire sublevel data structure after any addition or deletion to the L0 tables. There's room for performance improvement by leveraging the map of added/deleted files in a `BulkVersionEdit` to modify the sublevel data structure instead of recreating it from scratch. These updates would be required:

- Adding/deleting from the set of ordered keys, then going around and updating all indexes in the data structure to the new indexes.
- Updating all booleans denoting ongoing compactions, based on any new info about ongoing compactions.
- Recalculating all flush split keys. It would be more efficient to add and remove flush split keys on a one-off basis instead of calculating them all from scratch, resulting in fewer overlapping SSTables.
Remaining challenges