Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

compaction: don't split outputs within a user key #1470

Merged
merged 2 commits into from
Feb 3, 2022

Conversation

jbowens
Copy link
Collaborator

@jbowens jbowens commented Jan 24, 2022

During a compaction, if the current sstable hits the file size limit, defer
finishing the sstable if the next sstable would share a user key. This is the
current behavior of flushes, and this change brings parity between the two.

This change is motivated by introduction of range keys (see #1339). This
ensures we can always cleanly truncate range keys that span sstable boundaries.

This commit also removes (keyspan.Fragmenter).FlushTo. Now that we prohibit
splitting sstables in the middle of a user key, the Fragmenter's FlushTo
function is unnecessary. Compactions and flushes always use the
TruncateAndFlushTo variant.

This change required a tweak to the way grandparent limits are applied, in
order to switch the grandparent splitter's comparison into a >= comparsion.
This was necessary due to the shift in interpreting splitterSuggestions as
exclusive boundaries.

Close #734.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 15 files reviewed, 2 unresolved discussions


compaction.go, line 161 at r2 (raw file):

	// serve as an *exclusive* end boundary truncation point. If we used > then,
	// we may have already added a key with the user key `g.limit` to the
	// previous sstable.

We never hit this with L0 outputs because if key == nil && splitL0Outputs, we flush all of the tombstones to the current sstable.


compaction.go, line 2477 at r2 (raw file):

			//    (like the grandparent limit is), so that we're guaranteed
			//    that splitterSuggestion is greater than the previous user key
			//    added to the sstable.

Is this possible? When the sublevels return an L0 limit x is x the first key that we don't want to overlap, or the last key of the interval that we may overlap?

If we can rework it such that x is the first key that we don't want our current output to overlap with and enforce it as such, then it's safe to use splitKey = x even when we we've exhausted point keys, because we'll know the previous user key must have been < x.

@jbowens jbowens requested review from a team, itsbilal and sumeerbhola January 25, 2022 16:58
Copy link
Contributor

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love this cleanup / consolidation. Did a first pass and left some comment

Reviewed 15 of 15 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: all files reviewed, 10 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)


-- commits, line 2 at r2:
Can probably link #734 to this too, yeah? Would this close that issue completely?


-- commits, line 4 at r2:
nit: s/is hit/hits/


compaction.go, line 858 at r2 (raw file):

// level N. See maxGrandparentOverlapBytes.
//
// TODO(peter): Stopping compaction output in the middle of a user-key creates

TODO-- nice!


compaction_picker.go, line 381 at r2 (raw file):

// sstables. We could perform a migration, re-compacting any sstables with split
// user keys, which would allow us to remove atomic compaction unit expansion
// code.

Seems reasonable. Worth opening a ticket that we can track this against?


docs/range_deletions.md, line 3 at r2 (raw file):

# Range Deletions

TODO: The following explanation of range deletions does not take into account

Is this like a follow-up thing? Could that section be removed here?


internal/keyspan/fragmenter.go, line 286 at r2 (raw file):

//
// The span [a,f) does not cover the key f.
func (f *Fragmenter) FlushTo(key []byte) {

Given we're removing this method, is worth rebranding the existing method as FlushTo or just Flush, now that there's no ambiguity?


testdata/manual_compaction, line 447 at r2 (raw file):

----
2:
  000007:[a#4,SET-a#4,SET]

The comment mentions sstable (8) and sstable (9). That didn't seem accurate?

Also, we no longer seem to be adjusting the seqnum of the RANGEDEL.


testdata/manual_compaction, line 579 at r2 (raw file):

2:
  000007:[a#3,SET-c#72057594037927935,RANGEDEL]
  000008:[c#3,RANGEDEL-e#72057594037927935,RANGEDEL]

The test comment above mentions "sstable 8". And are we still tweaking the key kind?


testdata/manual_compaction, line 614 at r2 (raw file):

----
2:
  000006:[a#3,SET-e#72057594037927935,RANGEDEL]

Is the comment block for this test still accurate? We are no longer "cutting", instead outputting a single table.


testdata/manual_compaction_set_with_del, line 95 at r2 (raw file):

----
2:
  000008:[a#3,SETWITHDEL-c#72057594037927935,RANGEDEL]

The copy pasta in this file from the previous one probably means "ditto" for all my previous comments :(

Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 12 of 15 files reviewed, 5 unresolved discussions (waiting on @itsbilal, @jbowens, @nicktrav, and @sumeerbhola)


-- commits, line 2 at r2:

Previously, nicktrav (Nick Travers) wrote…

Can probably link #734 to this too, yeah? Would this close that issue completely?

Good call, done.


docs/range_deletions.md, line 3 at r2 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Is this like a follow-up thing? Could that section be removed here?

Yeah, I think much of the discussion of how range deletions work in this document should be updated. We probably to also want to preserve some of the historical discussion of how they used to work, at least for as long as we're supporting stores that may contain them.

Copy link
Contributor

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: really nice cleanup!

Reviewed 3 of 3 files at r3, all commit messages.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @jbowens, @nicktrav, and @sumeerbhola)


compaction.go, line 2477 at r2 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Is this possible? When the sublevels return an L0 limit x is x the first key that we don't want to overlap, or the last key of the interval that we may overlap?

If we can rework it such that x is the first key that we don't want our current output to overlap with and enforce it as such, then it's safe to use splitKey = x even when we we've exhausted point keys, because we'll know the previous user key must have been < x.

We should change the L0 limit splitter to also return splitNow/true if key >= limit (currently it's just key > limit). That's actually more in line with how we want L0 limits to work; we want limit to not be in the left-side sstable. We didn't address this well in the past because a lot of this code had the expectation baked in that we only cut the output once you exceed limit.


compaction.go, line 197 at r3 (raw file):

		// beginning of the first file.
		n := len(g.ve.NewFiles)
		g.limit = g.c.findGrandparentLimit(g.ve.NewFiles[n-1].Meta.Largest.UserKey)

Given the simpler semantics of output splitting now, maybe we can replace this line with this line from below:

g.limit = g.c.findGrandparentLimit(g.c.rangeDelFrag.Start())

Thoughts?


compaction.go, line 2289 at r3 (raw file):

					// equal to the previous table's largest key user key, and
					// the previous table's largest key is not exclusive. This
					// violates the invariant that tables are key-space

Nice!


internal/keyspan/fragmenter.go, line 286 at r2 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Given we're removing this method, is worth rebranding the existing method as FlushTo or just Flush, now that there's no ambiguity?

I think it's still a good idea to keep it TruncateAndFlushTo, just to make it clear that we're also truncating range tombstones.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 15 files at r1, 1 of 3 files at r3, all commit messages.
Reviewable status: all files reviewed, 12 unresolved discussions (waiting on @jbowens and @nicktrav)


compaction.go, line 161 at r2 (raw file):

Previously, jbowens (Jackson Owens) wrote…

We never hit this with L0 outputs because if key == nil && splitL0Outputs, we flush all of the tombstones to the current sstable.

Isn't that a deficiency of our L0 splitting scheme?


compaction.go, line 2477 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

We should change the L0 limit splitter to also return splitNow/true if key >= limit (currently it's just key > limit). That's actually more in line with how we want L0 limits to work; we want limit to not be in the left-side sstable. We didn't address this well in the past because a lot of this code had the expectation baked in that we only cut the output once you exceed limit.

This will unify the two cases, and cause L0 outputs to also split tombstones across many sstables (even if we've run out of point keys), yes? If so, that's what I was hoping we were going to get with this simplification.


compaction.go, line 168 at r3 (raw file):

}

func (g *grandparentLimitSplitter) onNewOutput(key *InternalKey) []byte {

l0LimitSplitter.onNewOutput is simple because it doesn't need to be worry about a user key spanning sstables. I was hoping this PR would make grandparentLimitSplitter be essentially identical to that except that it uses findGrandparentLimit. We could even combine the two splitters and parameterize it with a func. Am I missing something?


compaction.go, line 871 at r3 (raw file):

		// this function for how this is used.
		if overlappedBytes > c.maxOverlapBytes && c.cmp(start, f.Smallest.UserKey) < 0 {
			return f.Smallest.UserKey

why this change to use f.Smallest? Also, we've included this sstable in the overlappedBytes when comparing with the threshold, and now we are not allowing the compaction sstable to overlap with this. That seems wrong.
Maybe this is a consequence of the >= change in shouldSplitBefore but I don't see how.


compaction.go, line 2309 at r3 (raw file):

			// purposes of bounds computation -- the current sstable will end up with a Largest key
			// of c#7,1 so the range tombstone in the current file will be able to delete c#7.
			if d.cmp(writerMeta.LargestRangeDel.UserKey, splitKey) >= 0 {

I think this comment and code is obsolete. But we should verify that d.cmp(writerMeta.LargestRangeDel.UserKey, splitKey) <= 0.


compaction.go, line 2361 at r3 (raw file):

	// L0 splits) L0 limits, before wrapping them in an splitterGroup.
	outputSplitters := []compactionOutputSplitter{
		&fileSizeSplitter{maxFileSize: c.maxOutputFileSize},

we could also have wrapper only the fileSizeSplitter using the userKeyChangeSplitter, like we used to, yes?
Would that be "cheaper" since we would need to compare with unsafePrevUserKey only when that splitter was proposing a split?


compaction.go, line 2443 at r3 (raw file):

			// The next key either has a greater user key than the previous
			// key, or if not, the previous key must not have had a zero
			// sequence number.

Can we now remove the "or if not ..." part?


compaction.go, line 2448 at r3 (raw file):

			// grandparent's smallest key may be less than the current key.
			// Splitting at the current key will cause this output to overlap
			// a potentially unbounded number of grandparents.

I can't remember the details regarding this TODO -- can you remind me?

During a compaction, if the current sstable hits the file size limit, defer
finishing the sstable if the next sstable would share a user key. This is the
current behavior of flushes, and this change brings parity between the two.

This change is motivated by introduction of range keys (see cockroachdb#1339). This
ensures we can always cleanly truncate range keys that span range-key boundaries.

This commit also removes (keyspan.Fragmenter).FlushTo. Now that we prohibit
splitting sstables in the middle of a user key, the Fragmenter's FlushTo
function is unnecessary. Compactions and flushes always use the
TruncateAndFlushTo variant.

This change required a tweak to the way grandparent limits are applied, in
order to switch the grandparent splitter's comparison into a >= comparsion.
This was necessary due to the shift in interpreting `splitterSuggestion`s as
exclusive boundaries.

Close cockroachdb#734.
Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 13 of 15 files reviewed, 9 unresolved discussions (waiting on @itsbilal, @nicktrav, and @sumeerbhola)


compaction.go, line 161 at r2 (raw file):

Previously, sumeerbhola wrote…

Isn't that a deficiency of our L0 splitting scheme?

Yeah, it is.


compaction.go, line 168 at r3 (raw file):

Previously, sumeerbhola wrote…

l0LimitSplitter.onNewOutput is simple because it doesn't need to be worry about a user key spanning sstables. I was hoping this PR would make grandparentLimitSplitter be essentially identical to that except that it uses findGrandparentLimit. We could even combine the two splitters and parameterize it with a func. Am I missing something?

Nice, that's great.


compaction.go, line 871 at r3 (raw file):

Previously, sumeerbhola wrote…

why this change to use f.Smallest? Also, we've included this sstable in the overlappedBytes when comparing with the threshold, and now we are not allowing the compaction sstable to overlap with this. That seems wrong.
Maybe this is a consequence of the >= change in shouldSplitBefore but I don't see how.

Isn't the overlappedBytes behavior consistent with the description above?

// such large overlaps because they translate into large compactions. The
// current heuristic stops output of a table if the addition of another key
// would cause the table to overlap more than 10x the target file size at
// level N. See maxGrandparentOverlapBytes.

If overlapping f would increase overlappedBytes to be > c.maxOverlapBytes, then we don't want to overlap f, and we should return f.Smallest.UserKey.


The change to f.Smallest did fall out out of the >= change. When a grandparent's Largest is inclusive (not a RANGEDEL sentinel), the >= would cause us to put a key equal to the Largest in the next sstable output. That means the next output sstable would also overlap the grandparent that set the limit, unnecessarily.

The switch to using f.Smallest avoids all the inclusivity/exclusivity considerations, since f.Smallest is always inclusive, which plays well with >= limit splitting logic.


compaction.go, line 2309 at r3 (raw file):

Previously, sumeerbhola wrote…

I think this comment and code is obsolete. But we should verify that d.cmp(writerMeta.LargestRangeDel.UserKey, splitKey) <= 0.

Nice, done.


compaction.go, line 2361 at r3 (raw file):

Previously, sumeerbhola wrote…

we could also have wrapper only the fileSizeSplitter using the userKeyChangeSplitter, like we used to, yes?
Would that be "cheaper" since we would need to compare with unsafePrevUserKey only when that splitter was proposing a split?

Done


compaction.go, line 2443 at r3 (raw file):

Previously, sumeerbhola wrote…

Can we now remove the "or if not ..." part?

Done.


compaction.go, line 2448 at r3 (raw file):

Previously, sumeerbhola wrote…

I can't remember the details regarding this TODO -- can you remind me?

This is talking about a scenario like: The grandparent limit is b and key is x and there's an existing broad range tombstone in the fragmenter with bounds [a,x). The grandparent splitter requests a split before x. We fall into this case, and use x as the split boundary which creates an sstable that extends well beyond the grandparent limit of b.

I think to fix this, we could use splitKey = min(key.UserKey, splitterSuggestion).


compaction_picker.go, line 381 at r2 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Seems reasonable. Worth opening a ticket that we can track this against?

Yeah, I'll create one once this PR merges.

Copy link
Contributor

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did another pass. :lgtm:

Reviewed 3 of 3 files at r3, 2 of 2 files at r4, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @sumeerbhola)

@nicktrav nicktrav mentioned this pull request Jan 31, 2022
29 tasks
Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 12 of 15 files reviewed, 7 unresolved discussions (waiting on @itsbilal, @nicktrav, and @sumeerbhola)


compaction.go, line 2448 at r3 (raw file):

Previously, jbowens (Jackson Owens) wrote…

This is talking about a scenario like: The grandparent limit is b and key is x and there's an existing broad range tombstone in the fragmenter with bounds [a,x). The grandparent splitter requests a split before x. We fall into this case, and use x as the split boundary which creates an sstable that extends well beyond the grandparent limit of b.

I think to fix this, we could use splitKey = min(key.UserKey, splitterSuggestion).

Added a second commit that fixes this and simplifies this code to remove the switch.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 3 files at r5.
Reviewable status: 13 of 15 files reviewed, 8 unresolved discussions (waiting on @itsbilal, @jbowens, @nicktrav, and @sumeerbhola)


compaction.go, line 871 at r3 (raw file):

If overlapping f would increase overlappedBytes to be > c.maxOverlapBytes, then we don't want to overlap f, and we should return f.Smallest.UserKey.

If the file sizes are roughly S and the target file size is 10S, then on average we were splitting at 10.5S earlier and now we are splitting at 9.5S. I agree it doesn't matter much what we do.
But what if this file happens to be huge for whatever reason (ingested file; many seqnums for a user key which prevented a compaction from stopping at the size guideline) and is > 10S by itself. Seems safer to include it and use Largest. Also, will using Smallest potentially get us stuck at this smallest key, say k. We return k, so stop writing to the current sstable when we reach key k. Then ask for a grandparent limit with k, which again returns k.

Thanks for clarifying how this interacts with >= comparison elsewhere. Can you add a detailed code comment here -- this is very easy to miss. Unless there is a correctness issue (the "get us stuck" part above), I'm fine with this change.


compaction.go, line 2448 at r3 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Added a second commit that fixes this and simplifies this code to remove the switch.

Great. Can you add a code comment with this example to motivate why we are choosing the smaller.


compaction.go, line 179 at r5 (raw file):

		//
		// a.SET.3
		// (L0 limit at b)

nit: L0 limit is stale

If the compaction output splitters suggest a user key less than the next user
key, use the the splitters' suggestion as the split key. This avoids arbitrary
overlap with grandparents if the next key is far beyond the grandparent limit.
Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 12 of 15 files reviewed, 7 unresolved discussions (waiting on @itsbilal, @nicktrav, and @sumeerbhola)


compaction.go, line 871 at r3 (raw file):

Previously, sumeerbhola wrote…

If overlapping f would increase overlappedBytes to be > c.maxOverlapBytes, then we don't want to overlap f, and we should return f.Smallest.UserKey.

If the file sizes are roughly S and the target file size is 10S, then on average we were splitting at 10.5S earlier and now we are splitting at 9.5S. I agree it doesn't matter much what we do.
But what if this file happens to be huge for whatever reason (ingested file; many seqnums for a user key which prevented a compaction from stopping at the size guideline) and is > 10S by itself. Seems safer to include it and use Largest. Also, will using Smallest potentially get us stuck at this smallest key, say k. We return k, so stop writing to the current sstable when we reach key k. Then ask for a grandparent limit with k, which again returns k.

Thanks for clarifying how this interacts with >= comparison elsewhere. Can you add a detailed code comment here -- this is very easy to miss. Unless there is a correctness issue (the "get us stuck" part above), I'm fine with this change.

I think we're okay, because of the existing requirement that findGrandparentLimit always return a key greater than start. I believe that requirement was added to handle a similar problem of being 'stuck' but with the Largest.

I added a comment explaining the rationale for using Smallest, and also adjusted the logic enforcing that we return a key > start so that we avoid the key comparison once we've seen a file with a Smallest > start.


compaction.go, line 2448 at r3 (raw file):

Previously, sumeerbhola wrote…

Great. Can you add a code comment with this example to motivate why we are choosing the smaller.

Done.


compaction.go, line 179 at r5 (raw file):

Previously, sumeerbhola wrote…

nit: L0 limit is stale

Done.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm_strong:

Reviewed 1 of 1 files at r6.
Reviewable status: 13 of 15 files reviewed, 7 unresolved discussions (waiting on @itsbilal, @nicktrav, and @sumeerbhola)

Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs!

Reviewable status: 13 of 15 files reviewed, 7 unresolved discussions (waiting on @itsbilal, @nicktrav, and @sumeerbhola)

jbowens added a commit to jbowens/pebble that referenced this pull request Feb 25, 2022
Add a new format major version FormatSplitUserKeys that guarantees all sstables
within the table form their own atomic compaction unit. Previous versions of
Pebble (before cockroachdb#1470) and RocksDB allowed keys with identical user keys to be
split across multiple tables within a level. This produced a set of files known
as an 'atomic compaction unit' that must be compacted together to preserve the
LSM invariant.

This change adds a new format major version that can't be committed until all
such atomic compaction units have been recompacted to ensure that within a
level, all versions of the same user key are within the same sstable. When
ratcheting the format major version to  FormatSplitUserKeys, Pebble will look
for atomic compaction units and schedule compactions, one-at-a-time, to rewrite
these multi-file atomic compaction units into multiple single-file atomic
compaction units.

This migration will allow future versions of Pebble to remove code necessary to
handle these atomic compaction units.

Close cockroachdb#1495.
jbowens added a commit to jbowens/pebble that referenced this pull request Feb 25, 2022
Add a new format major version FormatSplitUserKeys that guarantees all sstables
within the table form their own atomic compaction unit. Previous versions of
Pebble (before cockroachdb#1470) and RocksDB allowed keys with identical user keys to be
split across multiple tables within a level. This produced a set of files known
as an 'atomic compaction unit' that must be compacted together to preserve the
LSM invariant.

This change adds a new format major version that can't be committed until all
such atomic compaction units have been recompacted to ensure that within a
level, all versions of the same user key are within the same sstable. When
ratcheting the format major version to  FormatSplitUserKeys, Pebble will look
for atomic compaction units and schedule compactions, one-at-a-time, to rewrite
these multi-file atomic compaction units into multiple single-file atomic
compaction units.

This migration will allow future versions of Pebble to remove code necessary to
handle these atomic compaction units.

Close cockroachdb#1495.
jbowens added a commit to jbowens/pebble that referenced this pull request Feb 25, 2022
Add a new format major version FormatSplitUserKeys that guarantees all sstables
within the table form their own atomic compaction unit. Previous versions of
Pebble (before cockroachdb#1470) and RocksDB allowed keys with identical user keys to be
split across multiple tables within a level. This produced a set of files known
as an 'atomic compaction unit' that must be compacted together to preserve the
LSM invariant.

This change adds a new format major version that can't be committed until all
such atomic compaction units have been recompacted to ensure that within a
level, all versions of the same user key are within the same sstable. When
ratcheting the format major version to  FormatSplitUserKeys, Pebble will look
for atomic compaction units and schedule compactions, one-at-a-time, to rewrite
these multi-file atomic compaction units into multiple single-file atomic
compaction units.

This migration will allow future versions of Pebble to remove code necessary to
handle these atomic compaction units.

Close cockroachdb#1495.
jbowens added a commit to jbowens/pebble that referenced this pull request Feb 28, 2022
Add a new format major version FormatSplitUserKeys that guarantees all sstables
within the table form their own atomic compaction unit. Previous versions of
Pebble (before cockroachdb#1470) and RocksDB allowed keys with identical user keys to be
split across multiple tables within a level. This produced a set of files known
as an 'atomic compaction unit' that must be compacted together to preserve the
LSM invariant.

This change adds a new format major version that can't be committed until all
such atomic compaction units have been recompacted to ensure that within levels
1-6, all versions of the same user key are within the same sstable. When
ratcheting the format major version to  FormatSplitUserKeys, Pebble will look
for atomic compaction units and schedule compactions, one-at-a-time, to rewrite
these multi-file atomic compaction units into multiple single-file atomic
compaction units.

This migration will allow future versions of Pebble to remove code necessary to
handle these atomic compaction units.

Close cockroachdb#1495.
jbowens added a commit to jbowens/pebble that referenced this pull request Feb 28, 2022
Add a new format major version FormatSplitUserKeys that guarantees all sstables
within the table form their own atomic compaction unit. Previous versions of
Pebble (before cockroachdb#1470) and RocksDB allowed keys with identical user keys to be
split across multiple tables within a level. This produced a set of files known
as an 'atomic compaction unit' that must be compacted together to preserve the
LSM invariant.

This change adds a new format major version that can't be committed until all
such atomic compaction units have been recompacted to ensure that within levels
1-6, all versions of the same user key are within the same sstable. When
ratcheting the format major version to  FormatSplitUserKeys, Pebble will look
for atomic compaction units and schedule compactions, one-at-a-time, to rewrite
these multi-file atomic compaction units into multiple single-file atomic
compaction units.

Before this new format major version is committed, Pebble will also
opportunistically schedule compactions to compact these files in the background
when no other compactions may be scheduled. A new field on `pebble.Metrics`
exposes the count of such files, populated at Open and whenever one of these
compactions is attempted.

This migration will allow future versions of Pebble to remove code necessary to
handle these atomic compaction units.

Close cockroachdb#1495.
jbowens added a commit to jbowens/pebble that referenced this pull request Mar 3, 2022
Add two new format major versions FormatSplitUserKeysMarked and
FormatMarkCompacted that together may be used to guarantee that all sstables
within the table form their own atomic compaction unit. Previous versions of
Pebble (before cockroachdb#1470) and RocksDB allowed keys with identical user keys to be
split across multiple tables within a level. This produced a set of files known
as an 'atomic compaction unit' that must be compacted together to preserve the
LSM invariant.

The first format major version FormatSplitUserKeysMarked may be used to
guarantee that every file that is member to a multi-file atomic compaction unit
is marked for compaction. The 'marked-for-compaction' file metadata field is
currently unused by Pebble, but is repurposed here to record the intent to
recompact these files with split user keys. If ratcheting to
FormatSplitUserKeysMarked discovers files with split user keys, it marks them
for compaction and then rotates the manifest to ensure that the updated file
metadata is persisted durably.

This commit introduces a new rewrite compaction type. During compaction picking
if no other productive compaction is picked, the compaction picker looks for
files marked for compaction. It uses a manifest.Annotator to avoid a linear
search through the file metadata. If a marked file exists, it picks a
compaction that outputs into the file's existing level, pulling in its atomic
compaction unit if necessary.

The second format major version FormatMarkCompacted is used to guarantee that
no files that are marked for compaction exist. This may be used in a subequent
CockroachDB release (22.2) to ensure that all files marked for compaction by
the FormatSplitUserKeysMarked format major version have been compacted away.
Ratcheting to this format major version blocks until all the marked files are
compacted.

Together these format major versions will allow us to remove code necessary to
handle these atomic compaction units, when we increase the minimum format major
version supported by Pebble.

Close cockroachdb#1495.
jbowens added a commit to jbowens/pebble that referenced this pull request Mar 7, 2022
Add two new format major versions FormatSplitUserKeysMarked and
FormatMarkCompacted that together may be used to guarantee that all sstables
within the table form their own atomic compaction unit. Previous versions of
Pebble (before cockroachdb#1470) and RocksDB allowed keys with identical user keys to be
split across multiple tables within a level. This produced a set of files known
as an 'atomic compaction unit' that must be compacted together to preserve the
LSM invariant.

The first format major version FormatSplitUserKeysMarked may be used to
guarantee that every file that is member to a multi-file atomic compaction unit
is marked for compaction. The 'marked-for-compaction' file metadata field is
currently unused by Pebble, but is repurposed here to record the intent to
recompact these files with split user keys. If ratcheting to
FormatSplitUserKeysMarked discovers files with split user keys, it marks them
for compaction and then rotates the manifest to ensure that the updated file
metadata is persisted durably.

This commit introduces a new rewrite compaction type. During compaction picking
if no other productive compaction is picked, the compaction picker looks for
files marked for compaction. It uses a manifest.Annotator to avoid a linear
search through the file metadata. If a marked file exists, it picks a
compaction that outputs into the file's existing level, pulling in its atomic
compaction unit if necessary.

The second format major version FormatMarkCompacted is used to guarantee that
no files that are marked for compaction exist. This may be used in a subequent
CockroachDB release (22.2) to ensure that all files marked for compaction by
the FormatSplitUserKeysMarked format major version have been compacted away.
Ratcheting to this format major version blocks until all the marked files are
compacted.

Together these format major versions will allow us to remove code necessary to
handle these atomic compaction units, when we increase the minimum format major
version supported by Pebble.

Close cockroachdb#1495.
jbowens added a commit to jbowens/pebble that referenced this pull request Mar 7, 2022
Add two new format major versions FormatSplitUserKeysMarked and
FormatMarkCompacted that together may be used to guarantee that all sstables
within the table form their own atomic compaction unit. Previous versions of
Pebble (before cockroachdb#1470) and RocksDB allowed keys with identical user keys to be
split across multiple tables within a level. This produced a set of files known
as an 'atomic compaction unit' that must be compacted together to preserve the
LSM invariant.

The first format major version FormatSplitUserKeysMarked may be used to
guarantee that every file that is member to a multi-file atomic compaction unit
is marked for compaction. The 'marked-for-compaction' file metadata field is
currently unused by Pebble, but is repurposed here to record the intent to
recompact these files with split user keys. If ratcheting to
FormatSplitUserKeysMarked discovers files with split user keys, it marks them
for compaction and then rotates the manifest to ensure that the updated file
metadata is persisted durably.

This commit introduces a new rewrite compaction type. During compaction picking
if no other productive compaction is picked, the compaction picker looks for
files marked for compaction. It uses a manifest.Annotator to avoid a linear
search through the file metadata. If a marked file exists, it picks a
compaction that outputs into the file's existing level, pulling in its atomic
compaction unit if necessary.

The second format major version FormatMarkCompacted is used to guarantee that
no files that are marked for compaction exist. This may be used in a subequent
CockroachDB release (22.2) to ensure that all files marked for compaction by
the FormatSplitUserKeysMarked format major version have been compacted away.
Ratcheting to this format major version blocks until all the marked files are
compacted.

Together these format major versions will allow us to remove code necessary to
handle these atomic compaction units, when we increase the minimum format major
version supported by Pebble.

Close cockroachdb#1495.
jbowens added a commit to jbowens/pebble that referenced this pull request Mar 8, 2022
Add two new format major versions FormatSplitUserKeysMarked and
FormatMarkCompacted that together may be used to guarantee that all sstables
within the table form their own atomic compaction unit. Previous versions of
Pebble (before cockroachdb#1470) and RocksDB allowed keys with identical user keys to be
split across multiple tables within a level. This produced a set of files known
as an 'atomic compaction unit' that must be compacted together to preserve the
LSM invariant.

The first format major version FormatSplitUserKeysMarked may be used to
guarantee that every file that is member of a multi-file atomic compaction unit
is marked for compaction. The 'marked-for-compaction' file metadata field is
currently unused by Pebble, but is repurposed here to record the intent to
recompact these files with split user keys. If ratcheting to
FormatSplitUserKeysMarked discovers files with split user keys, it marks them
for compaction and then rotates the manifest to ensure that the updated file
metadata is persisted durably.

This commit introduces a new rewrite compaction type. During compaction picking
if no other productive compaction is picked, the compaction picker looks for
files marked for compaction. It uses a manifest.Annotator to avoid a linear
search through the file metadata. If a marked file exists, it picks a
compaction that outputs into the file's existing level, pulling in its atomic
compaction unit if necessary.

The second format major version FormatMarkCompacted is used to guarantee that
no files that are marked for compaction exist. This may be used in a subequent
CockroachDB release (22.2) to ensure that all files marked for compaction by
the FormatSplitUserKeysMarked format major version have been compacted away.
Ratcheting to this format major version blocks until all the marked files are
compacted.

Together these format major versions will allow us to remove code necessary to
handle these atomic compaction units, when we increase the minimum format major
version supported by Pebble.

Close cockroachdb#1495.
jbowens added a commit to jbowens/pebble that referenced this pull request Mar 8, 2022
Add two new format major versions FormatSplitUserKeysMarked and
FormatMarkCompacted that together may be used to guarantee that all sstables
within the table form their own atomic compaction unit. Previous versions of
Pebble (before cockroachdb#1470) and RocksDB allowed keys with identical user keys to be
split across multiple tables within a level. This produced a set of files known
as an 'atomic compaction unit' that must be compacted together to preserve the
LSM invariant.

The first format major version FormatSplitUserKeysMarked may be used to
guarantee that every file that is member of a multi-file atomic compaction unit
is marked for compaction. The 'marked-for-compaction' file metadata field is
currently unused by Pebble, but is repurposed here to record the intent to
recompact these files with split user keys. If ratcheting to
FormatSplitUserKeysMarked discovers files with split user keys, it marks them
for compaction and then rotates the manifest to ensure that the updated file
metadata is persisted durably.

This commit introduces a new rewrite compaction type. During compaction picking
if no other productive compaction is picked, the compaction picker looks for
files marked for compaction. It uses a manifest.Annotator to avoid a linear
search through the file metadata. If a marked file exists, it picks a
compaction that outputs into the file's existing level, pulling in its atomic
compaction unit if necessary.

The second format major version FormatMarkCompacted is used to guarantee that
no files that are marked for compaction exist. This may be used in a subequent
CockroachDB release (22.2) to ensure that all files marked for compaction by
the FormatSplitUserKeysMarked format major version have been compacted away.
Ratcheting to this format major version blocks until all the marked files are
compacted.

Together these format major versions will allow us to remove code necessary to
handle these atomic compaction units, when we increase the minimum format major
version supported by Pebble.

Close cockroachdb#1495.
jbowens added a commit to jbowens/pebble that referenced this pull request Mar 10, 2022
Add two new format major versions FormatSplitUserKeysMarked and
FormatMarkCompacted that together may be used to guarantee that all sstables
within the table form their own atomic compaction unit. Previous versions of
Pebble (before cockroachdb#1470) and RocksDB allowed keys with identical user keys to be
split across multiple tables within a level. This produced a set of files known
as an 'atomic compaction unit' that must be compacted together to preserve the
LSM invariant.

The first format major version FormatSplitUserKeysMarked may be used to
guarantee that every file that is member of a multi-file atomic compaction unit
is marked for compaction. The 'marked-for-compaction' file metadata field is
currently unused by Pebble, but is repurposed here to record the intent to
recompact these files with split user keys. If ratcheting to
FormatSplitUserKeysMarked discovers files with split user keys, it marks them
for compaction and then rotates the manifest to ensure that the updated file
metadata is persisted durably.

This commit introduces a new rewrite compaction type. During compaction picking
if no other productive compaction is picked, the compaction picker looks for
files marked for compaction. It uses a manifest.Annotator to avoid a linear
search through the file metadata. If a marked file exists, it picks a
compaction that outputs into the file's existing level, pulling in its atomic
compaction unit if necessary.

The second format major version FormatMarkCompacted is used to guarantee that
no files that are marked for compaction exist. This may be used in a subequent
CockroachDB release (22.2) to ensure that all files marked for compaction by
the FormatSplitUserKeysMarked format major version have been compacted away.
Ratcheting to this format major version blocks until all the marked files are
compacted.

Together these format major versions will allow us to remove code necessary to
handle these atomic compaction units, when we increase the minimum format major
version supported by Pebble.

Close cockroachdb#1495.
jbowens added a commit that referenced this pull request Mar 10, 2022
Add two new format major versions FormatSplitUserKeysMarked and
FormatMarkCompacted that together may be used to guarantee that all sstables
within the table form their own atomic compaction unit. Previous versions of
Pebble (before #1470) and RocksDB allowed keys with identical user keys to be
split across multiple tables within a level. This produced a set of files known
as an 'atomic compaction unit' that must be compacted together to preserve the
LSM invariant.

The first format major version FormatSplitUserKeysMarked may be used to
guarantee that every file that is member of a multi-file atomic compaction unit
is marked for compaction. The 'marked-for-compaction' file metadata field is
currently unused by Pebble, but is repurposed here to record the intent to
recompact these files with split user keys. If ratcheting to
FormatSplitUserKeysMarked discovers files with split user keys, it marks them
for compaction and then rotates the manifest to ensure that the updated file
metadata is persisted durably.

This commit introduces a new rewrite compaction type. During compaction picking
if no other productive compaction is picked, the compaction picker looks for
files marked for compaction. It uses a manifest.Annotator to avoid a linear
search through the file metadata. If a marked file exists, it picks a
compaction that outputs into the file's existing level, pulling in its atomic
compaction unit if necessary.

The second format major version FormatMarkCompacted is used to guarantee that
no files that are marked for compaction exist. This may be used in a subequent
CockroachDB release (22.2) to ensure that all files marked for compaction by
the FormatSplitUserKeysMarked format major version have been compacted away.
Ratcheting to this format major version blocks until all the marked files are
compacted.

Together these format major versions will allow us to remove code necessary to
handle these atomic compaction units, when we increase the minimum format major
version supported by Pebble.

Close #1495.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

compaction: Don't split user keys during compactions
5 participants