db: add migration to rewrite atomic compaction units #1537

jbowens · 2022-02-25T21:17:25Z

Add two new format major versions FormatSplitUserKeysMarked and
FormatMarkCompacted that together may be used to guarantee that all sstables
within the table form their own atomic compaction unit. Previous versions of
Pebble (before #1470) and RocksDB allowed keys with identical user keys to be
split across multiple tables within a level. This produced a set of files known
as an 'atomic compaction unit' that must be compacted together to preserve the
LSM invariant.

The first format major version FormatSplitUserKeysMarked may be used to
guarantee that every file that is member to a multi-file atomic compaction unit
is marked for compaction. The 'marked-for-compaction' file metadata field is
currently unused by Pebble, but is repurposed here to record the intent to
recompact these files with split user keys. If ratcheting to
FormatSplitUserKeysMarked discovers files with split user keys, it marks them
for compaction and then rotates the manifest to ensure that the updated file
metadata is persisted durably.

This commit introduces a new rewrite compaction type. During compaction picking
if no other productive compaction is picked, the compaction picker looks for
files marked for compaction. It uses a manifest.Annotator to avoid a linear
search through the file metadata. If a marked file exists, it picks a
compaction that outputs into the file's existing level, pulling in its atomic
compaction unit if necessary.

The second format major version FormatMarkCompacted is used to guarantee that
no files that are marked for compaction exist. This may be used in a subequent
CockroachDB release (22.2) to ensure that all files marked for compaction by
the FormatSplitUserKeysMarked format major version have been compacted away.
Ratcheting to this format major version blocks until all the marked files are
compacted.

Together these format major versions will allow us to remove code necessary to
handle these atomic compaction units, when we increase the minimum format major
version supported by Pebble.

Close #1495.

cockroach-teamcity · 2022-02-25T21:17:31Z

This change is

jbowens

@sumeerbhola — Thoughts on sneaking something like this in during stability period? If we add a migration like this to 22.1, then we don't have to worry about rangedels with bounds narrower than user keys when we're refactoring the FragmentIterator.

Reviewable status: 0 of 19 files reviewed, all discussions resolved

nicktrav

Reviewable status: 0 of 19 files reviewed, all discussions resolved (waiting on @nicktrav)

format_major_version.go, line 78 at r1 (raw file):

	// level. Ratcheting to the FormatSplitUserKeys version will fail if that
	// guarantee is not met.
	FormatSplitUserKeys

Nice. This was going to be my initial drive-by comment - the ordering of the migrations, given we don't want Pebblev2 in Cockroach for 22.1. 👍 👍

Now it makes me wonder - if block properties aren't yet released, could this jump back another version and use Rocksdbv2, given the table format didn't really change for this feature. It's most likely a no-op anyway, as 22.1 will have block properties, so Pebble will need to ratchet to at least Pebblev1.

jbowens

I'm making the assumption here that there are few of these multi-file atomic compaction units, and so it's okay to synchronously perform the migration during the upgrade. Alternatively, we could schedule these compactions in the background when there's no other compactions to be picked, and only add the format major version gating in 22.2.

Reviewable status: 0 of 19 files reviewed, all discussions resolved (waiting on @itsbilal, @nicktrav, and @sumeerbhola)

sumeerbhola

Alternatively, we could schedule these compactions in the background when there's no other compactions to be picked, and only add the format major version gating in 22.2.

Strong +1 to this. Plus we could export a metric showing how many remaining files, so that in 22.2 if the code blocks on doing these compactions, we can look at the metric on that node prior to the upgrade and understand why. Or the operator could upgrade to 22.2 unless it drops to 0 (imagining someone doing a 21.2 => 22.1 => 22.2 in rapid succession).

Reviewable status: 0 of 19 files reviewed, all discussions resolved (waiting on @itsbilal and @nicktrav)

sumeerbhola

So I am happy with such background behavior being added during the 22.1 stability.

Reviewable status: 0 of 19 files reviewed, all discussions resolved (waiting on @itsbilal and @nicktrav)

jbowens

Updated to pursue these compactions in the background, maintain a count of these files exposed via Metrics(), and avoid scanning for these files if there are known to be none. I kept the synchronous migration code, because we can just not ratchet the format major version in 22.1 and add that in 22.2.

Reviewable status: 0 of 22 files reviewed, all discussions resolved (waiting on @itsbilal and @nicktrav)

jbowens

Reviewable status: 0 of 22 files reviewed, 1 unresolved discussion (waiting on @itsbilal and @nicktrav)

compaction_picker.go, line 47 at r5 (raw file):

	pickManual(env compactionEnv, manual *manualCompaction) (c *pickedCompaction, retryLater bool)
	pickElisionOnlyCompaction(env compactionEnv) (pc *pickedCompaction)
	pickRewriteCompaction(env compactionEnv, level int, file manifest.LevelFile) (pc *pickedCompaction)

We could use these 'rewrite' compactions to implement cockroachdb/cockroach#74804 for sstables.

nicktrav

Just nits.

Reviewed 13 of 19 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, 9 of 9 files at r5, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @itsbilal and @jbowens)

compaction_picker.go, line 1620 at r5 (raw file):

// of such files (may double-count an atomic compaction unit) and it returns the
// level and LevelFile for one such file.
func findSplitUserKey(opts *Options, vers *version) (count, level int, file manifest.LevelFile, ok bool) {

Perhaps this is me needlessly seeking out a (premature) optimization - this function is called repeatedly and ends up re-scanning the levels in the same order looking for compaction units.

That said, the only alternatives I could come up while writing this comment seemed complicated (e.g. I was thinking of some kind of lazy iterator over atomic compaction units, but would hold onto a readstate, which is blergh).

Maybe it's a non-issue in practice as searching for these units happens last in order of compaction types, and the metadata is all in-mem.

compaction_picker.go, line 1622 at r5 (raw file):

func findSplitUserKey(opts *Options, vers *version) (count, level int, file manifest.LevelFile, ok bool) {
	equal := opts.equal()
	for l := numLevels - 1; l > 0; l-- {

Is there a specific reason for starting at L6 and working our way to L1, versus the other way around?

format_major_version.go, line 181 at r5 (raw file):

	FormatSplitUserKeys: func(d *DB) error {
		// Before finalizing the format major version, rewrite any sstables that
		// form multi-file atomic compaction units.

nit: maybe mention that we're guaranteed to hold the d.mu here, and thus safe to call the method that relies on the lock being held?

open.go, line 459 at r5 (raw file):

	// might be files that do not form their own atomic compaction unit.
	// Initialize the count of such files.
	// TODO(jackson): Remove this after CockroachDB 22.2.

Nice. Accompanying ticket / issue, perhaps?

Thinking about it some more, it seems like the migration code has to live on in perpetuity, given that we allow a Pebble DB to be opened at any old version and we'll chug though all the migrations in order - maybe that's fine, but just wanted to mention it, as maybe we have a plan for such thing.

testdata/ingest_load, line 92 at r5 (raw file):


# Loading tables at an unsupported table format results in an error.
# Write a table at version 7 (Pebble,v2) into a DB at version 6 (Pebble,v1).

nit: did this need to change? It just needs to be any two version that straddles a table format change. Same comment in testdata/rangekeys.

testdata/split_user_key_migration, line 1 at r5 (raw file):

define

Nice test.

sumeerbhola

Reviewed 2 of 19 files at r1, 4 of 9 files at r5.
Reviewable status: all files reviewed, 11 unresolved discussions (waiting on @itsbilal and @jbowens)

compaction_picker.go, line 47 at r5 (raw file):

Previously, jbowens (Jackson Owens) wrote…

We could use these 'rewrite' compactions to implement cockroachdb/cockroach#74804 for sstables.

sounds good

compaction_picker.go, line 1075 at r5 (raw file):

	// units" within the code). Rewrite them in-place.
	if *env.nonatomicFileCount > 0 {
		count, level, file, ok := findSplitUserKey(p.opts, p.vers)

how expensive is it to do this repeatedly. Should we mark the FileMetadata with a bool so do we don't repeat?

compaction_picker.go, line 1081 at r5 (raw file):

		// nonatomicFileCount because compaction-picking is performed while
		// holding DB.mu.
		*env.nonatomicFileCount = count

Is the nonatomicFileCount stale once a compaction finishes, until there are few enough compaction candidates and we fall through to here?
Can we cheaply strengthen the staleness bound e.g. by tracking the non-atomic state in FileMetadata and adjusting the count when applying the VersionEdit?

compaction_picker.go, line 1194 at r5 (raw file):

// compaction outputs files to the same level as the input level.
func (p *compactionPickerByScore) pickRewriteCompaction(env compactionEnv, level int, file manifest.LevelFile) (pc *pickedCompaction) {
	// Find this file's atomic compaction unit.

are we ensuring somewhere that level is not 0?

jbowens

Reworked this to reuse the previously unused FileMetdata.MarkedForCompaction field that's persisted in the manifest. There's now two format major versions. The first guarantees that all non-atomic compaction units are marked for compaction durably. The second guarantees that there are no files marked for compaction.

To avoid too much during compaction picking, a count of the files marked for compaction is included on the Version and updated when applying VersionEdits. The compaction picker can avoid ever looking for rewrite compactions if this count is zero. Additionally, even if the count is positive, the compaction picker uses a manifest b-tree annotation to find the file to compact, avoiding linear scans across the LSM every time it picks a rewrite compaction. This optimization probably doesn't matter much here, since we expect these split user keys to be very rare, but it would likely matter if we use these rewrite compactions for cockroachdb/cockroach#74804 .

Reviewable status: 3 of 28 files reviewed, 7 unresolved discussions (waiting on @itsbilal, @jbowens, @nicktrav, and @sumeerbhola)

compaction_picker.go, line 1075 at r5 (raw file):

Previously, sumeerbhola wrote…

how expensive is it to do this repeatedly. Should we mark the FileMetadata with a bool so do we don't repeat?

It's not cheap. Even if we mark a bool on the FileMetadata, the scan is O(# sstables), which is never great. One thought is that we could reuse the f.MarkedForCompaction bool (which does get persisted to the manifest) which ensures that we only need to do this traversal once, but requires some extra coordination.

compaction_picker.go, line 1081 at r5 (raw file):

Previously, sumeerbhola wrote…

Is the nonatomicFileCount stale once a compaction finishes, until there are few enough compaction candidates and we fall through to here?
Can we cheaply strengthen the staleness bound e.g. by tracking the non-atomic state in FileMetadata and adjusting the count when applying the VersionEdit?

Done.

compaction_picker.go, line 1194 at r5 (raw file):

Previously, sumeerbhola wrote…

are we ensuring somewhere that level is not 0?

findSplitUserKey only looks at levels 1...6.

compaction_picker.go, line 1620 at r5 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Perhaps this is me needlessly seeking out a (premature) optimization - this function is called repeatedly and ends up re-scanning the levels in the same order looking for compaction units.

That said, the only alternatives I could come up while writing this comment seemed complicated (e.g. I was thinking of some kind of lazy iterator over atomic compaction units, but would hold onto a readstate, which is blergh).

Maybe it's a non-issue in practice as searching for these units happens last in order of compaction types, and the metadata is all in-mem.

Reworked to avoid duplicate scans by a) persisting which files should be compacted in the existing MarkedForCompaction field that's persisted in the manifest and b) using a b-tree annotation to ensure that don't need to consistently perform linear scans of a level's file metadata looking for files marked for compaction.

compaction_picker.go, line 1622 at r5 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Is there a specific reason for starting at L6 and working our way to L1, versus the other way around?

Just that files in L6 are more likely to be older and so more likely to have non-atomic compaction units.

format_major_version.go, line 181 at r5 (raw file):

Previously, nicktrav (Nick Travers) wrote…

nit: maybe mention that we're guaranteed to hold the d.mu here, and thus safe to call the method that relies on the lock being held?

Done.

open.go, line 459 at r5 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Nice. Accompanying ticket / issue, perhaps?

Thinking about it some more, it seems like the migration code has to live on in perpetuity, given that we allow a Pebble DB to be opened at any old version and we'll chug though all the migrations in order - maybe that's fine, but just wanted to mention it, as maybe we have a plan for such thing.

I was imagining at some point we'd increase the minimum format major version supported by the master branch of Pebble. So non-CockroachDB users with existing databases would need to ratchet the format major version using a non-master SHA before upgrading to master.

jbowens

Reviewable status: 3 of 28 files reviewed, 7 unresolved discussions (waiting on @itsbilal, @jbowens, @nicktrav, and @sumeerbhola)

compaction_picker.go, line 1075 at r5 (raw file):

It's not cheap. Even if we mark a bool on the FileMetadata, the scan is O(# sstables), which is never great. One thought is that we could reuse the f.MarkedForCompaction bool (which does get persisted to the manifest) which ensures that we only need to do this traversal once, but requires some extra coordination.

This comment is also a stale draft comment. I implemented the MarkedForCompaction change I referenced here.

compaction_picker.go, line 1194 at r5 (raw file):

findSplitUserKey only looks at levels 1...6.

This was a stale draft comment. This was reworked so that pickRewriteCompaction can handle L0 files, which is necessary in case an L0 file is marked for compaction from RocksDB/old Pebble versions.

jbowens

Reviewable status: 3 of 28 files reviewed, 8 unresolved discussions (waiting on @itsbilal, @jbowens, @nicktrav, and @sumeerbhola)

compaction_picker.go, line 1248 at r6 (raw file):

			continue
		}
		lf := p.vers.Levels[l].Find(p.opts.Comparer.Compare, candidate)

TODO: This needs to be fixed. .Find only works for L1...6 not L0. I'll fix this and add a test for a L0 rewrite compaction.

nicktrav

This looks great - thanks for the extra effort to use make use of the annotations and the persistence.

I did a pass of the more recent changes and it looks good. A few small comments I'll wait to hear back on. But otherwise LGTM (still).

Reviewed 20 of 25 files at r6, 5 of 7 files at r7, 2 of 2 files at r8, all commit messages.
Reviewable status: all files reviewed, 16 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)

-- commits, line 13 at r8:
nit: ... every file that is a member of a ...

-- commits, line 15 at r8:

but is repurposed here

No objection, just curious if we're closing the door here to using that field in the future if we're repurposing it here for this specific type of migration that happens once.

compaction.go, line 1451 at r5 (raw file):

	d.mu.versions.incrementCompactionBytes(-c.bytesWritten)

	info.TotalDuration = d.timeNow().Sub(startTime)

Did this removal creep in from a rebase?

compaction_picker.go, line 1183 at r8 (raw file):

		return accum
	}
	// If we haven't accumulated an eligible file yet, or f's LargestSeqNum is

nit: the first part of this comment (before the comma) pertains to this first check, and the second pertains to the next set of type-conversions and checks. Was confused reading it. Suggest splitting the comment?

format_major_version.go, line 75 at r8 (raw file):

	FormatBlockPropertyCollector
	// FormatSplitUserKeysMarked is a format major version that guarantees that
	// all files the share user keys with neighbors are marked for compaction in

nit: ... all files that share ...

format_major_version.go, line 76 at r8 (raw file):

	// FormatSplitUserKeysMarked is a format major version that guarantees that
	// all files the share user keys with neighbors are marked for compaction in
	// the manifest. Ratcheting to FormatSplitUserKeysMarked will block until

When we talk about "blocking" here (and below), I take it we're talking about just blocking ratcheting, and not blocking progress elsewhere in the DB (i.e. holding locks). Worth clarifying?

format_major_version.go, line 379 at r8 (raw file):

	// database lock while we're scanning the file metadata.
	//
	// If we find a file to mark, we'll re-acquire the mutex before marking it,

Given that we're dropping and re-acquiring the lock, is there any issue with a race where the version passed into this function is different to the version of the DB that exists when the we log the version edit in the calling function (above)?

I'm thinking of a situation where version v1 is passed in, we find a set of files to mark for compaction / annotate on one or more levels of v1. However, above us, the version moved forward to v2 (>v1), so we're annotating a stale version. When we exit this function, we return true to say that we marked some files on v1, triggering a version edit to be logged, but this will persist v2. Is it possible that we'd lose the annotations in that case?

I'm probably missing some invariant that guarantees that this is safe. Can we bolster the comment with this proof in that case?

internal/manifest/btree.go, line 593 at r8 (raw file):

	if !n.leaf {
		for i := int16(0); i <= n.count; i++ {
			n.children[i].invalidateAnnotation(a)

I take it the recursion here is safe to perform - i.e. bounded depth, etc. Or can Go optimize this as a tail call (I couldn't find a good answer in the quick googling I just did).

internal/manifest/level_metadata.go, line 74 at r8 (raw file):

// key-sorted (eg, non-L0).
func (lm *LevelMetadata) Find(cmp base.Compare, m *FileMetadata) *LevelFile {
	// TODO(jackson): Add an assertion that lm is key-sorted.

Did we change something elsewhere makes this TODO about lm being key-sortable obsolete?

internal/manifest/level_metadata.go, line 107 at r8 (raw file):

// InvalidateAnnotation clears any cached annotations defined by Annotator. The
// Annotator is used as the key for pre-calculated values, so equal Annotators

Is "equal" here taken to mean equality of the underlying dynamic type (i.e. same instance)?

testdata/split_user_key_migration, line 56 at r8 (raw file):

006

# Ensure the files previously marked for compaction are still marked for

Worth doing something like a DB close / open here too to prove that we don't lose the marked files? Not sure it really tests persistence, just that it's recomputed.

sumeerbhola

Reviewed 9 of 25 files at r6, 1 of 7 files at r7, all commit messages.
Reviewable status: all files reviewed, 21 unresolved discussions (waiting on @itsbilal, @jbowens, @nicktrav, and @sumeerbhola)

-- commits, line 15 at r8:

Previously, nicktrav (Nick Travers) wrote…

but is repurposed here

No objection, just curious if we're closing the door here to using that field in the future if we're repurposing it here for this specific type of migration that happens once.

I suppose we could reuse it post-FormatMarkCompacted.

compaction_picker.go, line 1185 at r8 (raw file):

	// If we haven't accumulated an eligible file yet, or f's LargestSeqNum is
	// less than the accumulated file's, use f.
	if accum == nil {

nit: can we factor out

// REQUIRES: f is non-nil, and f.MarkedForCompaction=true.
func mergeHelper(f *FileMetadata, dst interface{}) (interface{}, bool)

and then do

func (a markedForCompactionAnnotator) Accumulate(
	f *fileMetadata, dst interface{},
) (interface{}, bool) {
	if !f.MarkedForCompaction {
		// Not marked for compaction; return dst.
		return dst, true
	}
        return mergeHelper(f, dst)
}

func (a markedForCompactionAnnotator) Merge(v interface{}, accum interface{}) interface{} {
	if v == nil {
		return accum
	}
        accum, _ = mergeHelper(v.(*fileMetadata), accum)
        return accum
}

compaction_picker.go, line 1253 at r8 (raw file):

		}

		inputs := lf.Slice()

How about a comment like the following:
// L0 files generated by a flush have never been split such that adjacent files can contain the same user key. So we do not need to rewrite an atomic compaction unit for L0. Note that there is nothing preventing two different flushes from producing files that are non-overlapping from an InternalKey perspective, but span the same user key. However, such files cannot be in the same L0 sublevel, since each sublevel requires non-overlapping user keys (unlike other levels).

format_major_version.go, line 190 at r8 (raw file):

	FormatSplitUserKeysMarked: func(d *DB) error {
		// Mark any unmarked files with split-user keys. Note all format major
		// versions migrations are invoked with DB.mu locked.

Should we move the formatVers out of DB.mu and give it its own mutex? I am worried about hiccups in normal writes if we hold this lock for too long.
With that change we could have the implementation acquire and release DB.mu periodically, say after scanning each level to mark, and we would not be risking that a second migration could start concurrently.

format_major_version.go, line 320 at r8 (raw file):

		d.maybeScheduleCompactionPicker(func(picker compactionPicker, env compactionEnv) *pickedCompaction {
			return picker.pickRewriteCompaction(env)
		})

How will a compaction actually run while this method holds onto DB.mu?

format_major_version.go, line 424 at r8 (raw file):

		// If we marked any files for compaction, clear the compaction-picking
		// annotation that caches files marked-for-compaction, as it's now
		// out-of-date.

I don't understand what is out of date? Could you elaborate?

jbowens

Reviewable status: 27 of 30 files reviewed, 14 unresolved discussions (waiting on @itsbilal, @nicktrav, and @sumeerbhola)

-- commits, line 15 at r8:

Previously, sumeerbhola wrote…

I suppose we could reuse it post-FormatMarkCompacted.

Yeah, that's right. I think we're also free to use it before then as a means of triggering explicit compaction of files (eg, non-inplace rewrite compactions) if we want to.

I was imagining it could be reused for cockroachdb/cockroach#74804 as well.

compaction.go, line 1451 at r5 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Did this removal creep in from a rebase?

What removal? I'm not sure if I'm just navigating reviewable poorly, but there's no diff here.

compaction_picker.go, line 1185 at r8 (raw file):

Previously, sumeerbhola wrote…

nit: can we factor out

// REQUIRES: f is non-nil, and f.MarkedForCompaction=true.
func mergeHelper(f *FileMetadata, dst interface{}) (interface{}, bool)

and then do

func (a markedForCompactionAnnotator) Accumulate(
	f *fileMetadata, dst interface{},
) (interface{}, bool) {
	if !f.MarkedForCompaction {
		// Not marked for compaction; return dst.
		return dst, true
	}
        return mergeHelper(f, dst)
}

func (a markedForCompactionAnnotator) Merge(v interface{}, accum interface{}) interface{} {
	if v == nil {
		return accum
	}
        accum, _ = mergeHelper(v.(*fileMetadata), accum)
        return accum
}

done.

compaction_picker.go, line 1253 at r8 (raw file):

Previously, sumeerbhola wrote…

How about a comment like the following:
// L0 files generated by a flush have never been split such that adjacent files can contain the same user key. So we do not need to rewrite an atomic compaction unit for L0. Note that there is nothing preventing two different flushes from producing files that are non-overlapping from an InternalKey perspective, but span the same user key. However, such files cannot be in the same L0 sublevel, since each sublevel requires non-overlapping user keys (unlike other levels).

Done.

format_major_version.go, line 190 at r8 (raw file):

Previously, sumeerbhola wrote…

Should we move the formatVers out of DB.mu and give it its own mutex? I am worried about hiccups in normal writes if we hold this lock for too long.
With that change we could have the implementation acquire and release DB.mu periodically, say after scanning each level to mark, and we would not be risking that a second migration could start concurrently.

I'm wary of a second mutex. We need to read the format version during normal database operation, typically when we already have d.mu held, like starting a compaction. If we have a second mutex, we need to dictate a new lock ordering, which is easy to forget.

I'd prefer we dictate that format major version migrations must be careful with holding the mutex for long-running operations, like we do elsewhere in the DB. In this case, we don't do much with the mutex because we drop it before scanning all the file metadata. I don't think even scanning all the file metadata is that expensive. Before the introduction of the b-tree, we would perform O(n) key comparisons during compaction picking while picking a file within a level (and RocksDB still performs these O(# sstables) key comparisons during its compaction picking).

Can prevent a second migration from running concurrently with a d.mu.formatVers.ratcheting bool. I'll add that here.

format_major_version.go, line 320 at r8 (raw file):

Previously, sumeerbhola wrote…

How will a compaction actually run while this method holds onto DB.mu?

We wait on the d.mu.compact.cond condition variable, which unlocks d.mu while it waits. Added a comment there.

format_major_version.go, line 379 at r8 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Given that we're dropping and re-acquiring the lock, is there any issue with a race where the version passed into this function is different to the version of the DB that exists when the we log the version edit in the calling function (above)?

I'm thinking of a situation where version v1 is passed in, we find a set of files to mark for compaction / annotate on one or more levels of v1. However, above us, the version moved forward to v2 (>v1), so we're annotating a stale version. When we exit this function, we return true to say that we marked some files on v1, triggering a version edit to be logged, but this will persist v2. Is it possible that we'd lose the annotations in that case?

I'm probably missing some invariant that guarantees that this is safe. Can we bolster the comment with this proof in that case?

You're right, there was a race here. Marking files as marked on the *FileMetadata is okay, because if the file metadata still exists in the LSM, both the new and the old versions hold pointers to the same FileMetadata struct. The vers.Stats.MarkedForCompaction counts however would not be right.

I updated this to collect all the split user-key files, and then re-find them in the new version once we reacquire the mutex.

format_major_version.go, line 424 at r8 (raw file):

Previously, sumeerbhola wrote…

I don't understand what is out of date? Could you elaborate?

Elaborated the comment, lemme know if still unclear

internal/manifest/btree.go, line 593 at r8 (raw file):

Previously, nicktrav (Nick Travers) wrote…

I take it the recursion here is safe to perform - i.e. bounded depth, etc. Or can Go optimize this as a tail call (I couldn't find a good answer in the quick googling I just did).

It's bounded by the depth of the b-tree (which is log₁₆(# files in the level))

internal/manifest/level_metadata.go, line 74 at r8 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Did we change something elsewhere makes this TODO about lm being key-sortable obsolete?

The change to this function removed that requirement.

Level 0 files are not key-sortable, so previously Find only worked with L1+ because it assumed lm is key-sorted.

internal/manifest/level_metadata.go, line 107 at r8 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Is "equal" here taken to mean equality of the underlying dynamic type (i.e. same instance)?

Yes, although in practice all of B-Tree annotator implementations are empty struct{}s, implementing Annotator implemented on the non-pointer receiver, so it's really just a type comparison.

testdata/split_user_key_migration, line 56 at r8 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Worth doing something like a DB close / open here too to prove that we don't lose the marked files? Not sure it really tests persistence, just that it's recomputed.

I might be misunderstanding, but just above this is a reopen. The above format-major-version call is only reading the current format major version.

nicktrav

All my comments have been addressed. This looks great!

Reviewed 3 of 3 files at r9, 11 of 11 files at r10, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @itsbilal and @sumeerbhola)

compaction.go, line 1451 at r5 (raw file):

Previously, jbowens (Jackson Owens) wrote…

What removal? I'm not sure if I'm just navigating reviewable poorly, but there's no diff here.

Weird. At r5 there was a removal, but it's gone now. Disregard.

testdata/split_user_key_migration, line 56 at r8 (raw file):

Previously, jbowens (Jackson Owens) wrote…

I might be misunderstanding, but just above this is a reopen. The above format-major-version call is only reading the current format major version.

Urgh - reviewable was hiding it. Thanks.

jbowens

@sumeerbhola any more comments?

Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @itsbilal and @sumeerbhola)

sumeerbhola

(I didn't look at the tests since @nicktrav was doing a careful review)

Reviewed 1 of 25 files at r6, 2 of 3 files at r9, 1 of 11 files at r10, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)

db.go, line 326 at r10 (raw file):

			// currently in the process of ratcheting the format major version
			// to vers + 1. As a part of ratcheting the format major version,
			// migrations may may drop and re-acquire the mutex.

may drop

Add two new format major versions FormatSplitUserKeysMarked and FormatMarkCompacted that together may be used to guarantee that all sstables within the table form their own atomic compaction unit. Previous versions of Pebble (before cockroachdb#1470) and RocksDB allowed keys with identical user keys to be split across multiple tables within a level. This produced a set of files known as an 'atomic compaction unit' that must be compacted together to preserve the LSM invariant. The first format major version FormatSplitUserKeysMarked may be used to guarantee that every file that is member of a multi-file atomic compaction unit is marked for compaction. The 'marked-for-compaction' file metadata field is currently unused by Pebble, but is repurposed here to record the intent to recompact these files with split user keys. If ratcheting to FormatSplitUserKeysMarked discovers files with split user keys, it marks them for compaction and then rotates the manifest to ensure that the updated file metadata is persisted durably. This commit introduces a new rewrite compaction type. During compaction picking if no other productive compaction is picked, the compaction picker looks for files marked for compaction. It uses a manifest.Annotator to avoid a linear search through the file metadata. If a marked file exists, it picks a compaction that outputs into the file's existing level, pulling in its atomic compaction unit if necessary. The second format major version FormatMarkCompacted is used to guarantee that no files that are marked for compaction exist. This may be used in a subequent CockroachDB release (22.2) to ensure that all files marked for compaction by the FormatSplitUserKeysMarked format major version have been compacted away. Ratcheting to this format major version blocks until all the marked files are compacted. Together these format major versions will allow us to remove code necessary to handle these atomic compaction units, when we increase the minimum format major version supported by Pebble. Close cockroachdb#1495.

jbowens

TFTRs!

Reviewable status: 28 of 30 files reviewed, 8 unresolved discussions (waiting on @itsbilal, @nicktrav, and @sumeerbhola)

db.go, line 326 at r10 (raw file):

Previously, sumeerbhola wrote…

may drop

Done.

jbowens

TFTRs!

Reviewable status: 28 of 30 files reviewed, 8 unresolved discussions (waiting on @itsbilal, @nicktrav, and @sumeerbhola)

db.go, line 326 at r10 (raw file):

Previously, sumeerbhola wrote…

may drop

Done.

jbowens force-pushed the rewrite-split-userkeys branch from 6bc9a06 to e5c53c6 Compare February 25, 2022 21:19

jbowens commented Feb 25, 2022

View reviewed changes

jbowens force-pushed the rewrite-split-userkeys branch from e5c53c6 to 859312a Compare February 25, 2022 21:26

nicktrav reviewed Feb 25, 2022

View reviewed changes

jbowens force-pushed the rewrite-split-userkeys branch 5 times, most recently from d09b411 to fb3e90d Compare February 28, 2022 16:11

jbowens requested review from itsbilal, nicktrav, sumeerbhola and a team February 28, 2022 16:21

jbowens commented Feb 28, 2022

View reviewed changes

sumeerbhola reviewed Feb 28, 2022

View reviewed changes

jbowens force-pushed the rewrite-split-userkeys branch from fb3e90d to e4bbcf2 Compare February 28, 2022 22:12

jbowens commented Feb 28, 2022

View reviewed changes

nicktrav approved these changes Mar 1, 2022

View reviewed changes

sumeerbhola requested changes Mar 1, 2022

View reviewed changes

jbowens mentioned this pull request Mar 2, 2022

db: use L0 target file size, grandparent limit for intra L0 compactions #1556

Merged

jbowens force-pushed the rewrite-split-userkeys branch 3 times, most recently from 659fb4d to 59caac4 Compare March 3, 2022 23:03

jbowens commented Mar 3, 2022

View reviewed changes

jbowens force-pushed the rewrite-split-userkeys branch from 59caac4 to 718b28f Compare March 7, 2022 17:39

jbowens force-pushed the rewrite-split-userkeys branch from 718b28f to 268f7c3 Compare March 7, 2022 18:10

nicktrav reviewed Mar 7, 2022

View reviewed changes

sumeerbhola requested changes Mar 8, 2022

View reviewed changes

jbowens force-pushed the rewrite-split-userkeys branch from 268f7c3 to 50bf8ef Compare March 8, 2022 17:36

jbowens commented Mar 8, 2022

View reviewed changes

jbowens force-pushed the rewrite-split-userkeys branch 3 times, most recently from 7fd4e3e to 7dc3d8d Compare March 8, 2022 17:52

nicktrav approved these changes Mar 8, 2022

View reviewed changes

jbowens commented Mar 9, 2022

View reviewed changes

sumeerbhola approved these changes Mar 9, 2022

View reviewed changes

jbowens force-pushed the rewrite-split-userkeys branch from 7dc3d8d to 16bccdb Compare March 10, 2022 00:33

jbowens commented Mar 10, 2022

View reviewed changes

jbowens merged commit 93d4722 into cockroachdb:master Mar 10, 2022

jbowens deleted the rewrite-split-userkeys branch March 10, 2022 01:10

nicktrav mentioned this pull request Jun 9, 2022

storage: add time-bound iterator handling for range keys cockroachdb/cockroach#82596

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

db: add migration to rewrite atomic compaction units #1537

db: add migration to rewrite atomic compaction units #1537

jbowens commented Feb 25, 2022 •

edited

Loading

cockroach-teamcity commented Feb 25, 2022

jbowens left a comment

nicktrav left a comment

jbowens left a comment

sumeerbhola left a comment

sumeerbhola left a comment

jbowens left a comment

jbowens left a comment

nicktrav left a comment

sumeerbhola left a comment

jbowens left a comment

jbowens left a comment

jbowens left a comment

nicktrav left a comment

sumeerbhola left a comment

jbowens left a comment

nicktrav left a comment

jbowens left a comment

sumeerbhola left a comment

jbowens left a comment

jbowens left a comment

db: add migration to rewrite atomic compaction units #1537

db: add migration to rewrite atomic compaction units #1537

Conversation

jbowens commented Feb 25, 2022 • edited Loading

cockroach-teamcity commented Feb 25, 2022

jbowens left a comment

Choose a reason for hiding this comment

nicktrav left a comment

Choose a reason for hiding this comment

jbowens left a comment

Choose a reason for hiding this comment

sumeerbhola left a comment

Choose a reason for hiding this comment

sumeerbhola left a comment

Choose a reason for hiding this comment

jbowens left a comment

Choose a reason for hiding this comment

jbowens left a comment

Choose a reason for hiding this comment

nicktrav left a comment

Choose a reason for hiding this comment

sumeerbhola left a comment

Choose a reason for hiding this comment

jbowens left a comment

Choose a reason for hiding this comment

jbowens left a comment

Choose a reason for hiding this comment

jbowens left a comment

Choose a reason for hiding this comment

nicktrav left a comment

Choose a reason for hiding this comment

sumeerbhola left a comment

Choose a reason for hiding this comment

jbowens left a comment

Choose a reason for hiding this comment

nicktrav left a comment

Choose a reason for hiding this comment

jbowens left a comment

Choose a reason for hiding this comment

sumeerbhola left a comment

Choose a reason for hiding this comment

jbowens left a comment

Choose a reason for hiding this comment

jbowens left a comment

Choose a reason for hiding this comment

jbowens commented Feb 25, 2022 •

edited

Loading