-
Notifications
You must be signed in to change notification settings - Fork 478
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
compaction: don't split outputs within a user key #1470
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 15 files reviewed, 2 unresolved discussions
compaction.go, line 161 at r2 (raw file):
// serve as an *exclusive* end boundary truncation point. If we used > then, // we may have already added a key with the user key `g.limit` to the // previous sstable.
We never hit this with L0 outputs because if key == nil && splitL0Outputs
, we flush all of the tombstones to the current sstable.
compaction.go, line 2477 at r2 (raw file):
// (like the grandparent limit is), so that we're guaranteed // that splitterSuggestion is greater than the previous user key // added to the sstable.
Is this possible? When the sublevels return an L0 limit x
is x
the first key that we don't want to overlap, or the last key of the interval that we may overlap?
If we can rework it such that x is the first key that we don't want our current output to overlap with and enforce it as such, then it's safe to use splitKey = x
even when we we've exhausted point keys, because we'll know the previous user key must have been < x
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love this cleanup / consolidation. Did a first pass and left some comment
Reviewed 15 of 15 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: all files reviewed, 10 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)
-- commits, line 2 at r2:
Can probably link #734 to this too, yeah? Would this close that issue completely?
-- commits, line 4 at r2:
nit: s/is hit/hits/
compaction.go, line 858 at r2 (raw file):
// level N. See maxGrandparentOverlapBytes. // // TODO(peter): Stopping compaction output in the middle of a user-key creates
TODO--
nice!
compaction_picker.go, line 381 at r2 (raw file):
// sstables. We could perform a migration, re-compacting any sstables with split // user keys, which would allow us to remove atomic compaction unit expansion // code.
Seems reasonable. Worth opening a ticket that we can track this against?
docs/range_deletions.md, line 3 at r2 (raw file):
# Range Deletions TODO: The following explanation of range deletions does not take into account
Is this like a follow-up thing? Could that section be removed here?
internal/keyspan/fragmenter.go, line 286 at r2 (raw file):
// // The span [a,f) does not cover the key f. func (f *Fragmenter) FlushTo(key []byte) {
Given we're removing this method, is worth rebranding the existing method as FlushTo
or just Flush
, now that there's no ambiguity?
testdata/manual_compaction, line 447 at r2 (raw file):
---- 2: 000007:[a#4,SET-a#4,SET]
The comment mentions sstable (8) and sstable (9). That didn't seem accurate?
Also, we no longer seem to be adjusting the seqnum of the RANGEDEL.
testdata/manual_compaction, line 579 at r2 (raw file):
2: 000007:[a#3,SET-c#72057594037927935,RANGEDEL] 000008:[c#3,RANGEDEL-e#72057594037927935,RANGEDEL]
The test comment above mentions "sstable 8". And are we still tweaking the key kind?
testdata/manual_compaction, line 614 at r2 (raw file):
---- 2: 000006:[a#3,SET-e#72057594037927935,RANGEDEL]
Is the comment block for this test still accurate? We are no longer "cutting", instead outputting a single table.
testdata/manual_compaction_set_with_del, line 95 at r2 (raw file):
---- 2: 000008:[a#3,SETWITHDEL-c#72057594037927935,RANGEDEL]
The copy pasta in this file from the previous one probably means "ditto" for all my previous comments :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 12 of 15 files reviewed, 5 unresolved discussions (waiting on @itsbilal, @jbowens, @nicktrav, and @sumeerbhola)
Previously, nicktrav (Nick Travers) wrote…
Can probably link #734 to this too, yeah? Would this close that issue completely?
Good call, done.
docs/range_deletions.md, line 3 at r2 (raw file):
Previously, nicktrav (Nick Travers) wrote…
Is this like a follow-up thing? Could that section be removed here?
Yeah, I think much of the discussion of how range deletions work in this document should be updated. We probably to also want to preserve some of the historical discussion of how they used to work, at least for as long as we're supporting stores that may contain them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r3, all commit messages.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @jbowens, @nicktrav, and @sumeerbhola)
compaction.go, line 2477 at r2 (raw file):
Previously, jbowens (Jackson Owens) wrote…
Is this possible? When the sublevels return an L0 limit
x
isx
the first key that we don't want to overlap, or the last key of the interval that we may overlap?If we can rework it such that x is the first key that we don't want our current output to overlap with and enforce it as such, then it's safe to use
splitKey = x
even when we we've exhausted point keys, because we'll know the previous user key must have been <x
.
We should change the L0 limit splitter to also return splitNow/true if key >= limit (currently it's just key > limit). That's actually more in line with how we want L0 limits to work; we want limit
to not be in the left-side sstable. We didn't address this well in the past because a lot of this code had the expectation baked in that we only cut the output once you exceed limit
.
compaction.go, line 197 at r3 (raw file):
// beginning of the first file. n := len(g.ve.NewFiles) g.limit = g.c.findGrandparentLimit(g.ve.NewFiles[n-1].Meta.Largest.UserKey)
Given the simpler semantics of output splitting now, maybe we can replace this line with this line from below:
g.limit = g.c.findGrandparentLimit(g.c.rangeDelFrag.Start())
Thoughts?
compaction.go, line 2289 at r3 (raw file):
// equal to the previous table's largest key user key, and // the previous table's largest key is not exclusive. This // violates the invariant that tables are key-space
Nice!
internal/keyspan/fragmenter.go, line 286 at r2 (raw file):
Previously, nicktrav (Nick Travers) wrote…
Given we're removing this method, is worth rebranding the existing method as
FlushTo
or justFlush
, now that there's no ambiguity?
I think it's still a good idea to keep it TruncateAndFlushTo
, just to make it clear that we're also truncating range tombstones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 15 files at r1, 1 of 3 files at r3, all commit messages.
Reviewable status: all files reviewed, 12 unresolved discussions (waiting on @jbowens and @nicktrav)
compaction.go, line 161 at r2 (raw file):
Previously, jbowens (Jackson Owens) wrote…
We never hit this with L0 outputs because if
key == nil && splitL0Outputs
, we flush all of the tombstones to the current sstable.
Isn't that a deficiency of our L0 splitting scheme?
compaction.go, line 2477 at r2 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
We should change the L0 limit splitter to also return splitNow/true if key >= limit (currently it's just key > limit). That's actually more in line with how we want L0 limits to work; we want
limit
to not be in the left-side sstable. We didn't address this well in the past because a lot of this code had the expectation baked in that we only cut the output once you exceedlimit
.
This will unify the two cases, and cause L0 outputs to also split tombstones across many sstables (even if we've run out of point keys), yes? If so, that's what I was hoping we were going to get with this simplification.
compaction.go, line 168 at r3 (raw file):
} func (g *grandparentLimitSplitter) onNewOutput(key *InternalKey) []byte {
l0LimitSplitter.onNewOutput
is simple because it doesn't need to be worry about a user key spanning sstables. I was hoping this PR would make grandparentLimitSplitter
be essentially identical to that except that it uses findGrandparentLimit
. We could even combine the two splitters and parameterize it with a func. Am I missing something?
compaction.go, line 871 at r3 (raw file):
// this function for how this is used. if overlappedBytes > c.maxOverlapBytes && c.cmp(start, f.Smallest.UserKey) < 0 { return f.Smallest.UserKey
why this change to use f.Smallest
? Also, we've included this sstable in the overlappedBytes
when comparing with the threshold, and now we are not allowing the compaction sstable to overlap with this. That seems wrong.
Maybe this is a consequence of the >=
change in shouldSplitBefore
but I don't see how.
compaction.go, line 2309 at r3 (raw file):
// purposes of bounds computation -- the current sstable will end up with a Largest key // of c#7,1 so the range tombstone in the current file will be able to delete c#7. if d.cmp(writerMeta.LargestRangeDel.UserKey, splitKey) >= 0 {
I think this comment and code is obsolete. But we should verify that d.cmp(writerMeta.LargestRangeDel.UserKey, splitKey) <= 0
.
compaction.go, line 2361 at r3 (raw file):
// L0 splits) L0 limits, before wrapping them in an splitterGroup. outputSplitters := []compactionOutputSplitter{ &fileSizeSplitter{maxFileSize: c.maxOutputFileSize},
we could also have wrapper only the fileSizeSplitter
using the userKeyChangeSplitter
, like we used to, yes?
Would that be "cheaper" since we would need to compare with unsafePrevUserKey
only when that splitter was proposing a split?
compaction.go, line 2443 at r3 (raw file):
// The next key either has a greater user key than the previous // key, or if not, the previous key must not have had a zero // sequence number.
Can we now remove the "or if not ..." part?
compaction.go, line 2448 at r3 (raw file):
// grandparent's smallest key may be less than the current key. // Splitting at the current key will cause this output to overlap // a potentially unbounded number of grandparents.
I can't remember the details regarding this TODO -- can you remind me?
During a compaction, if the current sstable hits the file size limit, defer finishing the sstable if the next sstable would share a user key. This is the current behavior of flushes, and this change brings parity between the two. This change is motivated by introduction of range keys (see cockroachdb#1339). This ensures we can always cleanly truncate range keys that span range-key boundaries. This commit also removes (keyspan.Fragmenter).FlushTo. Now that we prohibit splitting sstables in the middle of a user key, the Fragmenter's FlushTo function is unnecessary. Compactions and flushes always use the TruncateAndFlushTo variant. This change required a tweak to the way grandparent limits are applied, in order to switch the grandparent splitter's comparison into a >= comparsion. This was necessary due to the shift in interpreting `splitterSuggestion`s as exclusive boundaries. Close cockroachdb#734.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 13 of 15 files reviewed, 9 unresolved discussions (waiting on @itsbilal, @nicktrav, and @sumeerbhola)
compaction.go, line 161 at r2 (raw file):
Previously, sumeerbhola wrote…
Isn't that a deficiency of our L0 splitting scheme?
Yeah, it is.
compaction.go, line 168 at r3 (raw file):
Previously, sumeerbhola wrote…
l0LimitSplitter.onNewOutput
is simple because it doesn't need to be worry about a user key spanning sstables. I was hoping this PR would makegrandparentLimitSplitter
be essentially identical to that except that it usesfindGrandparentLimit
. We could even combine the two splitters and parameterize it with a func. Am I missing something?
Nice, that's great.
compaction.go, line 871 at r3 (raw file):
Previously, sumeerbhola wrote…
why this change to use
f.Smallest
? Also, we've included this sstable in theoverlappedBytes
when comparing with the threshold, and now we are not allowing the compaction sstable to overlap with this. That seems wrong.
Maybe this is a consequence of the>=
change inshouldSplitBefore
but I don't see how.
Isn't the overlappedBytes
behavior consistent with the description above?
// such large overlaps because they translate into large compactions. The
// current heuristic stops output of a table if the addition of another key
// would cause the table to overlap more than 10x the target file size at
// level N. See maxGrandparentOverlapBytes.
If overlapping f
would increase overlappedBytes
to be > c.maxOverlapBytes
, then we don't want to overlap f
, and we should return f.Smallest.UserKey
.
The change to f.Smallest
did fall out out of the >=
change. When a grandparent's Largest
is inclusive (not a RANGEDEL
sentinel), the >=
would cause us to put a key equal to the Largest
in the next sstable output. That means the next output sstable would also overlap the grandparent that set the limit, unnecessarily.
The switch to using f.Smallest
avoids all the inclusivity/exclusivity considerations, since f.Smallest
is always inclusive, which plays well with >= limit
splitting logic.
compaction.go, line 2309 at r3 (raw file):
Previously, sumeerbhola wrote…
I think this comment and code is obsolete. But we should verify that
d.cmp(writerMeta.LargestRangeDel.UserKey, splitKey) <= 0
.
Nice, done.
compaction.go, line 2361 at r3 (raw file):
Previously, sumeerbhola wrote…
we could also have wrapper only the
fileSizeSplitter
using theuserKeyChangeSplitter
, like we used to, yes?
Would that be "cheaper" since we would need to compare withunsafePrevUserKey
only when that splitter was proposing a split?
Done
compaction.go, line 2443 at r3 (raw file):
Previously, sumeerbhola wrote…
Can we now remove the "or if not ..." part?
Done.
compaction.go, line 2448 at r3 (raw file):
Previously, sumeerbhola wrote…
I can't remember the details regarding this TODO -- can you remind me?
This is talking about a scenario like: The grandparent limit is b
and key
is x
and there's an existing broad range tombstone in the fragmenter with bounds [a,x)
. The grandparent splitter requests a split before x
. We fall into this case, and use x
as the split boundary which creates an sstable that extends well beyond the grandparent limit of b
.
I think to fix this, we could use splitKey = min(key.UserKey, splitterSuggestion)
.
compaction_picker.go, line 381 at r2 (raw file):
Previously, nicktrav (Nick Travers) wrote…
Seems reasonable. Worth opening a ticket that we can track this against?
Yeah, I'll create one once this PR merges.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r3, 2 of 2 files at r4, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @sumeerbhola)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 12 of 15 files reviewed, 7 unresolved discussions (waiting on @itsbilal, @nicktrav, and @sumeerbhola)
compaction.go, line 2448 at r3 (raw file):
Previously, jbowens (Jackson Owens) wrote…
This is talking about a scenario like: The grandparent limit is
b
andkey
isx
and there's an existing broad range tombstone in the fragmenter with bounds[a,x)
. The grandparent splitter requests a split beforex
. We fall into this case, and usex
as the split boundary which creates an sstable that extends well beyond the grandparent limit ofb
.I think to fix this, we could use
splitKey = min(key.UserKey, splitterSuggestion)
.
Added a second commit that fixes this and simplifies this code to remove the switch
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 3 files at r5.
Reviewable status: 13 of 15 files reviewed, 8 unresolved discussions (waiting on @itsbilal, @jbowens, @nicktrav, and @sumeerbhola)
compaction.go, line 871 at r3 (raw file):
If overlapping f would increase overlappedBytes to be > c.maxOverlapBytes, then we don't want to overlap f, and we should return f.Smallest.UserKey.
If the file sizes are roughly S and the target file size is 10S, then on average we were splitting at 10.5S earlier and now we are splitting at 9.5S. I agree it doesn't matter much what we do.
But what if this file happens to be huge for whatever reason (ingested file; many seqnums for a user key which prevented a compaction from stopping at the size guideline) and is > 10S by itself. Seems safer to include it and use Largest. Also, will using Smallest potentially get us stuck at this smallest key, say k. We return k, so stop writing to the current sstable when we reach key k. Then ask for a grandparent limit with k, which again returns k.
Thanks for clarifying how this interacts with >= comparison elsewhere. Can you add a detailed code comment here -- this is very easy to miss. Unless there is a correctness issue (the "get us stuck" part above), I'm fine with this change.
compaction.go, line 2448 at r3 (raw file):
Previously, jbowens (Jackson Owens) wrote…
Added a second commit that fixes this and simplifies this code to remove the
switch
.
Great. Can you add a code comment with this example to motivate why we are choosing the smaller.
compaction.go, line 179 at r5 (raw file):
// // a.SET.3 // (L0 limit at b)
nit: L0 limit is stale
If the compaction output splitters suggest a user key less than the next user key, use the the splitters' suggestion as the split key. This avoids arbitrary overlap with grandparents if the next key is far beyond the grandparent limit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 12 of 15 files reviewed, 7 unresolved discussions (waiting on @itsbilal, @nicktrav, and @sumeerbhola)
compaction.go, line 871 at r3 (raw file):
Previously, sumeerbhola wrote…
If overlapping f would increase overlappedBytes to be > c.maxOverlapBytes, then we don't want to overlap f, and we should return f.Smallest.UserKey.
If the file sizes are roughly S and the target file size is 10S, then on average we were splitting at 10.5S earlier and now we are splitting at 9.5S. I agree it doesn't matter much what we do.
But what if this file happens to be huge for whatever reason (ingested file; many seqnums for a user key which prevented a compaction from stopping at the size guideline) and is > 10S by itself. Seems safer to include it and use Largest. Also, will using Smallest potentially get us stuck at this smallest key, say k. We return k, so stop writing to the current sstable when we reach key k. Then ask for a grandparent limit with k, which again returns k.Thanks for clarifying how this interacts with >= comparison elsewhere. Can you add a detailed code comment here -- this is very easy to miss. Unless there is a correctness issue (the "get us stuck" part above), I'm fine with this change.
I think we're okay, because of the existing requirement that findGrandparentLimit
always return a key greater than start
. I believe that requirement was added to handle a similar problem of being 'stuck' but with the Largest
.
I added a comment explaining the rationale for using Smallest
, and also adjusted the logic enforcing that we return a key > start
so that we avoid the key comparison once we've seen a file with a Smallest > start
.
compaction.go, line 2448 at r3 (raw file):
Previously, sumeerbhola wrote…
Great. Can you add a code comment with this example to motivate why we are choosing the smaller.
Done.
compaction.go, line 179 at r5 (raw file):
Previously, sumeerbhola wrote…
nit: L0 limit is stale
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r6.
Reviewable status: 13 of 15 files reviewed, 7 unresolved discussions (waiting on @itsbilal, @nicktrav, and @sumeerbhola)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTRs!
Reviewable status: 13 of 15 files reviewed, 7 unresolved discussions (waiting on @itsbilal, @nicktrav, and @sumeerbhola)
Add a new format major version FormatSplitUserKeys that guarantees all sstables within the table form their own atomic compaction unit. Previous versions of Pebble (before cockroachdb#1470) and RocksDB allowed keys with identical user keys to be split across multiple tables within a level. This produced a set of files known as an 'atomic compaction unit' that must be compacted together to preserve the LSM invariant. This change adds a new format major version that can't be committed until all such atomic compaction units have been recompacted to ensure that within a level, all versions of the same user key are within the same sstable. When ratcheting the format major version to FormatSplitUserKeys, Pebble will look for atomic compaction units and schedule compactions, one-at-a-time, to rewrite these multi-file atomic compaction units into multiple single-file atomic compaction units. This migration will allow future versions of Pebble to remove code necessary to handle these atomic compaction units. Close cockroachdb#1495.
Add a new format major version FormatSplitUserKeys that guarantees all sstables within the table form their own atomic compaction unit. Previous versions of Pebble (before cockroachdb#1470) and RocksDB allowed keys with identical user keys to be split across multiple tables within a level. This produced a set of files known as an 'atomic compaction unit' that must be compacted together to preserve the LSM invariant. This change adds a new format major version that can't be committed until all such atomic compaction units have been recompacted to ensure that within a level, all versions of the same user key are within the same sstable. When ratcheting the format major version to FormatSplitUserKeys, Pebble will look for atomic compaction units and schedule compactions, one-at-a-time, to rewrite these multi-file atomic compaction units into multiple single-file atomic compaction units. This migration will allow future versions of Pebble to remove code necessary to handle these atomic compaction units. Close cockroachdb#1495.
Add a new format major version FormatSplitUserKeys that guarantees all sstables within the table form their own atomic compaction unit. Previous versions of Pebble (before cockroachdb#1470) and RocksDB allowed keys with identical user keys to be split across multiple tables within a level. This produced a set of files known as an 'atomic compaction unit' that must be compacted together to preserve the LSM invariant. This change adds a new format major version that can't be committed until all such atomic compaction units have been recompacted to ensure that within a level, all versions of the same user key are within the same sstable. When ratcheting the format major version to FormatSplitUserKeys, Pebble will look for atomic compaction units and schedule compactions, one-at-a-time, to rewrite these multi-file atomic compaction units into multiple single-file atomic compaction units. This migration will allow future versions of Pebble to remove code necessary to handle these atomic compaction units. Close cockroachdb#1495.
Add a new format major version FormatSplitUserKeys that guarantees all sstables within the table form their own atomic compaction unit. Previous versions of Pebble (before cockroachdb#1470) and RocksDB allowed keys with identical user keys to be split across multiple tables within a level. This produced a set of files known as an 'atomic compaction unit' that must be compacted together to preserve the LSM invariant. This change adds a new format major version that can't be committed until all such atomic compaction units have been recompacted to ensure that within levels 1-6, all versions of the same user key are within the same sstable. When ratcheting the format major version to FormatSplitUserKeys, Pebble will look for atomic compaction units and schedule compactions, one-at-a-time, to rewrite these multi-file atomic compaction units into multiple single-file atomic compaction units. This migration will allow future versions of Pebble to remove code necessary to handle these atomic compaction units. Close cockroachdb#1495.
Add a new format major version FormatSplitUserKeys that guarantees all sstables within the table form their own atomic compaction unit. Previous versions of Pebble (before cockroachdb#1470) and RocksDB allowed keys with identical user keys to be split across multiple tables within a level. This produced a set of files known as an 'atomic compaction unit' that must be compacted together to preserve the LSM invariant. This change adds a new format major version that can't be committed until all such atomic compaction units have been recompacted to ensure that within levels 1-6, all versions of the same user key are within the same sstable. When ratcheting the format major version to FormatSplitUserKeys, Pebble will look for atomic compaction units and schedule compactions, one-at-a-time, to rewrite these multi-file atomic compaction units into multiple single-file atomic compaction units. Before this new format major version is committed, Pebble will also opportunistically schedule compactions to compact these files in the background when no other compactions may be scheduled. A new field on `pebble.Metrics` exposes the count of such files, populated at Open and whenever one of these compactions is attempted. This migration will allow future versions of Pebble to remove code necessary to handle these atomic compaction units. Close cockroachdb#1495.
Add two new format major versions FormatSplitUserKeysMarked and FormatMarkCompacted that together may be used to guarantee that all sstables within the table form their own atomic compaction unit. Previous versions of Pebble (before cockroachdb#1470) and RocksDB allowed keys with identical user keys to be split across multiple tables within a level. This produced a set of files known as an 'atomic compaction unit' that must be compacted together to preserve the LSM invariant. The first format major version FormatSplitUserKeysMarked may be used to guarantee that every file that is member to a multi-file atomic compaction unit is marked for compaction. The 'marked-for-compaction' file metadata field is currently unused by Pebble, but is repurposed here to record the intent to recompact these files with split user keys. If ratcheting to FormatSplitUserKeysMarked discovers files with split user keys, it marks them for compaction and then rotates the manifest to ensure that the updated file metadata is persisted durably. This commit introduces a new rewrite compaction type. During compaction picking if no other productive compaction is picked, the compaction picker looks for files marked for compaction. It uses a manifest.Annotator to avoid a linear search through the file metadata. If a marked file exists, it picks a compaction that outputs into the file's existing level, pulling in its atomic compaction unit if necessary. The second format major version FormatMarkCompacted is used to guarantee that no files that are marked for compaction exist. This may be used in a subequent CockroachDB release (22.2) to ensure that all files marked for compaction by the FormatSplitUserKeysMarked format major version have been compacted away. Ratcheting to this format major version blocks until all the marked files are compacted. Together these format major versions will allow us to remove code necessary to handle these atomic compaction units, when we increase the minimum format major version supported by Pebble. Close cockroachdb#1495.
Add two new format major versions FormatSplitUserKeysMarked and FormatMarkCompacted that together may be used to guarantee that all sstables within the table form their own atomic compaction unit. Previous versions of Pebble (before cockroachdb#1470) and RocksDB allowed keys with identical user keys to be split across multiple tables within a level. This produced a set of files known as an 'atomic compaction unit' that must be compacted together to preserve the LSM invariant. The first format major version FormatSplitUserKeysMarked may be used to guarantee that every file that is member to a multi-file atomic compaction unit is marked for compaction. The 'marked-for-compaction' file metadata field is currently unused by Pebble, but is repurposed here to record the intent to recompact these files with split user keys. If ratcheting to FormatSplitUserKeysMarked discovers files with split user keys, it marks them for compaction and then rotates the manifest to ensure that the updated file metadata is persisted durably. This commit introduces a new rewrite compaction type. During compaction picking if no other productive compaction is picked, the compaction picker looks for files marked for compaction. It uses a manifest.Annotator to avoid a linear search through the file metadata. If a marked file exists, it picks a compaction that outputs into the file's existing level, pulling in its atomic compaction unit if necessary. The second format major version FormatMarkCompacted is used to guarantee that no files that are marked for compaction exist. This may be used in a subequent CockroachDB release (22.2) to ensure that all files marked for compaction by the FormatSplitUserKeysMarked format major version have been compacted away. Ratcheting to this format major version blocks until all the marked files are compacted. Together these format major versions will allow us to remove code necessary to handle these atomic compaction units, when we increase the minimum format major version supported by Pebble. Close cockroachdb#1495.
Add two new format major versions FormatSplitUserKeysMarked and FormatMarkCompacted that together may be used to guarantee that all sstables within the table form their own atomic compaction unit. Previous versions of Pebble (before cockroachdb#1470) and RocksDB allowed keys with identical user keys to be split across multiple tables within a level. This produced a set of files known as an 'atomic compaction unit' that must be compacted together to preserve the LSM invariant. The first format major version FormatSplitUserKeysMarked may be used to guarantee that every file that is member to a multi-file atomic compaction unit is marked for compaction. The 'marked-for-compaction' file metadata field is currently unused by Pebble, but is repurposed here to record the intent to recompact these files with split user keys. If ratcheting to FormatSplitUserKeysMarked discovers files with split user keys, it marks them for compaction and then rotates the manifest to ensure that the updated file metadata is persisted durably. This commit introduces a new rewrite compaction type. During compaction picking if no other productive compaction is picked, the compaction picker looks for files marked for compaction. It uses a manifest.Annotator to avoid a linear search through the file metadata. If a marked file exists, it picks a compaction that outputs into the file's existing level, pulling in its atomic compaction unit if necessary. The second format major version FormatMarkCompacted is used to guarantee that no files that are marked for compaction exist. This may be used in a subequent CockroachDB release (22.2) to ensure that all files marked for compaction by the FormatSplitUserKeysMarked format major version have been compacted away. Ratcheting to this format major version blocks until all the marked files are compacted. Together these format major versions will allow us to remove code necessary to handle these atomic compaction units, when we increase the minimum format major version supported by Pebble. Close cockroachdb#1495.
Add two new format major versions FormatSplitUserKeysMarked and FormatMarkCompacted that together may be used to guarantee that all sstables within the table form their own atomic compaction unit. Previous versions of Pebble (before cockroachdb#1470) and RocksDB allowed keys with identical user keys to be split across multiple tables within a level. This produced a set of files known as an 'atomic compaction unit' that must be compacted together to preserve the LSM invariant. The first format major version FormatSplitUserKeysMarked may be used to guarantee that every file that is member of a multi-file atomic compaction unit is marked for compaction. The 'marked-for-compaction' file metadata field is currently unused by Pebble, but is repurposed here to record the intent to recompact these files with split user keys. If ratcheting to FormatSplitUserKeysMarked discovers files with split user keys, it marks them for compaction and then rotates the manifest to ensure that the updated file metadata is persisted durably. This commit introduces a new rewrite compaction type. During compaction picking if no other productive compaction is picked, the compaction picker looks for files marked for compaction. It uses a manifest.Annotator to avoid a linear search through the file metadata. If a marked file exists, it picks a compaction that outputs into the file's existing level, pulling in its atomic compaction unit if necessary. The second format major version FormatMarkCompacted is used to guarantee that no files that are marked for compaction exist. This may be used in a subequent CockroachDB release (22.2) to ensure that all files marked for compaction by the FormatSplitUserKeysMarked format major version have been compacted away. Ratcheting to this format major version blocks until all the marked files are compacted. Together these format major versions will allow us to remove code necessary to handle these atomic compaction units, when we increase the minimum format major version supported by Pebble. Close cockroachdb#1495.
Add two new format major versions FormatSplitUserKeysMarked and FormatMarkCompacted that together may be used to guarantee that all sstables within the table form their own atomic compaction unit. Previous versions of Pebble (before cockroachdb#1470) and RocksDB allowed keys with identical user keys to be split across multiple tables within a level. This produced a set of files known as an 'atomic compaction unit' that must be compacted together to preserve the LSM invariant. The first format major version FormatSplitUserKeysMarked may be used to guarantee that every file that is member of a multi-file atomic compaction unit is marked for compaction. The 'marked-for-compaction' file metadata field is currently unused by Pebble, but is repurposed here to record the intent to recompact these files with split user keys. If ratcheting to FormatSplitUserKeysMarked discovers files with split user keys, it marks them for compaction and then rotates the manifest to ensure that the updated file metadata is persisted durably. This commit introduces a new rewrite compaction type. During compaction picking if no other productive compaction is picked, the compaction picker looks for files marked for compaction. It uses a manifest.Annotator to avoid a linear search through the file metadata. If a marked file exists, it picks a compaction that outputs into the file's existing level, pulling in its atomic compaction unit if necessary. The second format major version FormatMarkCompacted is used to guarantee that no files that are marked for compaction exist. This may be used in a subequent CockroachDB release (22.2) to ensure that all files marked for compaction by the FormatSplitUserKeysMarked format major version have been compacted away. Ratcheting to this format major version blocks until all the marked files are compacted. Together these format major versions will allow us to remove code necessary to handle these atomic compaction units, when we increase the minimum format major version supported by Pebble. Close cockroachdb#1495.
Add two new format major versions FormatSplitUserKeysMarked and FormatMarkCompacted that together may be used to guarantee that all sstables within the table form their own atomic compaction unit. Previous versions of Pebble (before cockroachdb#1470) and RocksDB allowed keys with identical user keys to be split across multiple tables within a level. This produced a set of files known as an 'atomic compaction unit' that must be compacted together to preserve the LSM invariant. The first format major version FormatSplitUserKeysMarked may be used to guarantee that every file that is member of a multi-file atomic compaction unit is marked for compaction. The 'marked-for-compaction' file metadata field is currently unused by Pebble, but is repurposed here to record the intent to recompact these files with split user keys. If ratcheting to FormatSplitUserKeysMarked discovers files with split user keys, it marks them for compaction and then rotates the manifest to ensure that the updated file metadata is persisted durably. This commit introduces a new rewrite compaction type. During compaction picking if no other productive compaction is picked, the compaction picker looks for files marked for compaction. It uses a manifest.Annotator to avoid a linear search through the file metadata. If a marked file exists, it picks a compaction that outputs into the file's existing level, pulling in its atomic compaction unit if necessary. The second format major version FormatMarkCompacted is used to guarantee that no files that are marked for compaction exist. This may be used in a subequent CockroachDB release (22.2) to ensure that all files marked for compaction by the FormatSplitUserKeysMarked format major version have been compacted away. Ratcheting to this format major version blocks until all the marked files are compacted. Together these format major versions will allow us to remove code necessary to handle these atomic compaction units, when we increase the minimum format major version supported by Pebble. Close cockroachdb#1495.
Add two new format major versions FormatSplitUserKeysMarked and FormatMarkCompacted that together may be used to guarantee that all sstables within the table form their own atomic compaction unit. Previous versions of Pebble (before #1470) and RocksDB allowed keys with identical user keys to be split across multiple tables within a level. This produced a set of files known as an 'atomic compaction unit' that must be compacted together to preserve the LSM invariant. The first format major version FormatSplitUserKeysMarked may be used to guarantee that every file that is member of a multi-file atomic compaction unit is marked for compaction. The 'marked-for-compaction' file metadata field is currently unused by Pebble, but is repurposed here to record the intent to recompact these files with split user keys. If ratcheting to FormatSplitUserKeysMarked discovers files with split user keys, it marks them for compaction and then rotates the manifest to ensure that the updated file metadata is persisted durably. This commit introduces a new rewrite compaction type. During compaction picking if no other productive compaction is picked, the compaction picker looks for files marked for compaction. It uses a manifest.Annotator to avoid a linear search through the file metadata. If a marked file exists, it picks a compaction that outputs into the file's existing level, pulling in its atomic compaction unit if necessary. The second format major version FormatMarkCompacted is used to guarantee that no files that are marked for compaction exist. This may be used in a subequent CockroachDB release (22.2) to ensure that all files marked for compaction by the FormatSplitUserKeysMarked format major version have been compacted away. Ratcheting to this format major version blocks until all the marked files are compacted. Together these format major versions will allow us to remove code necessary to handle these atomic compaction units, when we increase the minimum format major version supported by Pebble. Close #1495.
During a compaction, if the current sstable hits the file size limit, defer
finishing the sstable if the next sstable would share a user key. This is the
current behavior of flushes, and this change brings parity between the two.
This change is motivated by introduction of range keys (see #1339). This
ensures we can always cleanly truncate range keys that span sstable boundaries.
This commit also removes (keyspan.Fragmenter).FlushTo. Now that we prohibit
splitting sstables in the middle of a user key, the Fragmenter's FlushTo
function is unnecessary. Compactions and flushes always use the
TruncateAndFlushTo variant.
This change required a tweak to the way grandparent limits are applied, in
order to switch the grandparent splitter's comparison into a >= comparsion.
This was necessary due to the shift in interpreting
splitterSuggestion
s asexclusive boundaries.
Close #734.