db: add migration to rewrite atomic compaction units #1537
Conversation
Force-pushed from 6bc9a06 to e5c53c6
@sumeerbhola — Thoughts on sneaking something like this in during stability period? If we add a migration like this to 22.1, then we don't have to worry about rangedels with bounds narrower than user keys when we're refactoring the FragmentIterator.
Reviewable status: 0 of 19 files reviewed, all discussions resolved
Force-pushed from e5c53c6 to 859312a
Reviewable status: 0 of 19 files reviewed, all discussions resolved (waiting on @nicktrav)
format_major_version.go, line 78 at r1 (raw file):
// level. Ratcheting to the FormatSplitUserKeys version will fail if that
// guarantee is not met.
FormatSplitUserKeys
Nice. This was going to be my initial drive-by comment - the ordering of the migrations, given we don't want Pebblev2 in Cockroach for 22.1. 👍 👍
Now it makes me wonder - if block properties aren't yet released, could this jump back another version and use Rocksdbv2, given the table format didn't really change for this feature? It's most likely a no-op anyway, as 22.1 will have block properties, so Pebble will need to ratchet to at least Pebblev1.
Force-pushed from d09b411 to fb3e90d
I'm making the assumption here that there are few of these multi-file atomic compaction units, and so it's okay to synchronously perform the migration during the upgrade. Alternatively, we could schedule these compactions in the background when there's no other compactions to be picked, and only add the format major version gating in 22.2.
Reviewable status: 0 of 19 files reviewed, all discussions resolved (waiting on @itsbilal, @nicktrav, and @sumeerbhola)
Alternatively, we could schedule these compactions in the background when there's no other compactions to be picked, and only add the format major version gating in 22.2.
Strong +1 to this. Plus we could export a metric showing how many files remain, so that in 22.2, if the code blocks on doing these compactions, we can look at the metric on that node prior to the upgrade and understand why. Or the operator could hold off upgrading to 22.2 until it drops to 0 (imagining someone doing a 21.2 => 22.1 => 22.2 in rapid succession).
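To sketch the operator flow (hypothetically; remainingMarkedFiles is an assumed accessor for the proposed metric, not a real Pebble API):

package main

import "time"

// Hypothetical sketch of the suggestion above: poll a gauge of files
// still awaiting rewrite and hold the 22.2 upgrade until it reaches
// zero. remainingMarkedFiles is an assumed accessor, not a real
// Pebble metric.
func waitForRewrites(remainingMarkedFiles func() int64, poll time.Duration) {
	for remainingMarkedFiles() > 0 {
		time.Sleep(poll)
	}
}

func main() {
	n := int64(3)
	waitForRewrites(func() int64 { n--; return n }, time.Millisecond)
}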
Reviewable status: 0 of 19 files reviewed, all discussions resolved (waiting on @itsbilal and @nicktrav)
So I am happy with such background behavior being added during the 22.1 stability.
Reviewable status: 0 of 19 files reviewed, all discussions resolved (waiting on @itsbilal and @nicktrav)
Force-pushed from fb3e90d to e4bbcf2
Updated to pursue these compactions in the background, maintain a count of these files exposed via Metrics(), and avoid scanning for these files if there are known to be none. I kept the synchronous migration code, because we can just not ratchet the format major version in 22.1 and add that in 22.2.
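As a hedged illustration of this gating: Pebble's Open and RatchetFormatMajorVersion entry points exist in this era, and the constant matches the name this PR eventually settles on, but treat the flow and directory name as illustrative.

package main

import (
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	// 22.1: open the DB without ratcheting; the migration code ships
	// but is not invoked.
	db, err := pebble.Open("demo-db", &pebble.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// 22.2: explicitly ratchet the format major version, invoking the
	// migration.
	if err := db.RatchetFormatMajorVersion(pebble.FormatSplitUserKeysMarked); err != nil {
		log.Fatal(err)
	}
}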
Reviewable status: 0 of 22 files reviewed, all discussions resolved (waiting on @itsbilal and @nicktrav)
Reviewable status: 0 of 22 files reviewed, 1 unresolved discussion (waiting on @itsbilal and @nicktrav)
compaction_picker.go, line 47 at r5 (raw file):
pickManual(env compactionEnv, manual *manualCompaction) (c *pickedCompaction, retryLater bool)
pickElisionOnlyCompaction(env compactionEnv) (pc *pickedCompaction)
pickRewriteCompaction(env compactionEnv, level int, file manifest.LevelFile) (pc *pickedCompaction)
We could use these 'rewrite' compactions to implement cockroachdb/cockroach#74804 for sstables.
Reviewed 13 of 19 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, 9 of 9 files at r5, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @itsbilal and @jbowens)
compaction_picker.go, line 1620 at r5 (raw file):
// of such files (may double-count an atomic compaction unit) and it returns the
// level and LevelFile for one such file.
func findSplitUserKey(opts *Options, vers *version) (count, level int, file manifest.LevelFile, ok bool) {
Perhaps this is me needlessly seeking out a (premature) optimization - this function is called repeatedly and ends up re-scanning the levels in the same order looking for compaction units.
That said, the only alternatives I could come up with while writing this comment seemed complicated (e.g. I was thinking of some kind of lazy iterator over atomic compaction units, but it would hold onto a readstate, which is blergh).
Maybe it's a non-issue in practice as searching for these units happens last in order of compaction types, and the metadata is all in-mem.
compaction_picker.go, line 1622 at r5 (raw file):
func findSplitUserKey(opts *Options, vers *version) (count, level int, file manifest.LevelFile, ok bool) {
	equal := opts.equal()
	for l := numLevels - 1; l > 0; l-- {
Is there a specific reason for starting at L6 and working our way to L1, versus the other way around?
format_major_version.go, line 181 at r5 (raw file):
FormatSplitUserKeys: func(d *DB) error {
	// Before finalizing the format major version, rewrite any sstables that
	// form multi-file atomic compaction units.
nit: maybe mention that we're guaranteed to hold d.mu here, and thus safe to call the method that relies on the lock being held?
open.go, line 459 at r5 (raw file):
// might be files that do not form their own atomic compaction unit.
// Initialize the count of such files.
// TODO(jackson): Remove this after CockroachDB 22.2.
Nice. Accompanying ticket / issue, perhaps?
Thinking about it some more, it seems like the migration code has to live on in perpetuity, given that we allow a Pebble DB to be opened at any old version and we'll chug through all the migrations in order - maybe that's fine, but just wanted to mention it, as maybe we have a plan for such a thing.
testdata/ingest_load, line 92 at r5 (raw file):
# Loading tables at an unsupported table format results in an error.
# Write a table at version 7 (Pebble,v2) into a DB at version 6 (Pebble,v1).
nit: did this need to change? It just needs to be any two versions that straddle a table format change. Same comment in testdata/rangekeys.
testdata/split_user_key_migration, line 1 at r5 (raw file):
define
Nice test.
Reviewed 2 of 19 files at r1, 4 of 9 files at r5.
Reviewable status: all files reviewed, 11 unresolved discussions (waiting on @itsbilal and @jbowens)
compaction_picker.go, line 47 at r5 (raw file):
Previously, jbowens (Jackson Owens) wrote…
We could use these 'rewrite' compactions to implement cockroachdb/cockroach#74804 for sstables.
sounds good
compaction_picker.go, line 1075 at r5 (raw file):
// units" within the code). Rewrite them in-place. if *env.nonatomicFileCount > 0 { count, level, file, ok := findSplitUserKey(p.opts, p.vers)
How expensive is it to do this repeatedly? Should we mark the FileMetadata with a bool so we don't repeat?
compaction_picker.go, line 1081 at r5 (raw file):
// nonatomicFileCount because compaction-picking is performed while
// holding DB.mu.
*env.nonatomicFileCount = count
Is the nonatomicFileCount stale once a compaction finishes, until there are few enough compaction candidates and we fall through to here?
Can we cheaply strengthen the staleness bound e.g. by tracking the non-atomic state in FileMetadata and adjusting the count when applying the VersionEdit?
compaction_picker.go, line 1194 at r5 (raw file):
// compaction outputs files to the same level as the input level.
func (p *compactionPickerByScore) pickRewriteCompaction(env compactionEnv, level int, file manifest.LevelFile) (pc *pickedCompaction) {
	// Find this file's atomic compaction unit.
are we ensuring somewhere that level is not 0?
Force-pushed from 659fb4d to 59caac4
Reworked this to reuse the previously unused FileMetadata.MarkedForCompaction field that's persisted in the manifest. There are now two format major versions. The first guarantees that all non-atomic compaction units are marked for compaction durably. The second guarantees that there are no files marked for compaction.
To avoid too much work during compaction picking, a count of the files marked for compaction is included on the Version and updated when applying VersionEdits. The compaction picker can avoid ever looking for rewrite compactions if this count is zero. Additionally, even if the count is positive, the compaction picker uses a manifest b-tree annotation to find the file to compact, avoiding linear scans across the LSM every time it picks a rewrite compaction. This optimization probably doesn't matter much here, since we expect these split user keys to be very rare, but it would likely matter if we use these rewrite compactions for cockroachdb/cockroach#74804.
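As a rough sketch of the annotation pattern (the Zero/Accumulate/Merge shape mirrors the manifest.Annotator interface discussed below, but the types here are simplified stand-ins, and the real annotator yields a marked file rather than a count):

package main

import "fmt"

// fileMetadata is a simplified stand-in for manifest.FileMetadata.
type fileMetadata struct {
	MarkedForCompaction bool
}

// markedForCompactionAnnotator counts marked files per b-tree subtree.
// Cached at interior nodes, such an annotation lets the picker descend
// only into subtrees with a non-zero count, finding a marked file in
// O(log n) rather than scanning the level.
type markedForCompactionAnnotator struct{}

func (markedForCompactionAnnotator) Zero(interface{}) interface{} {
	return new(int)
}

func (markedForCompactionAnnotator) Accumulate(
	f *fileMetadata, dst interface{},
) (interface{}, bool) {
	count := dst.(*int)
	if f.MarkedForCompaction {
		*count++
	}
	return count, true
}

func (markedForCompactionAnnotator) Merge(v, accum interface{}) interface{} {
	if v == nil {
		return accum
	}
	*accum.(*int) += *v.(*int)
	return accum
}

func main() {
	a := markedForCompactionAnnotator{}
	acc := a.Zero(nil)
	files := []*fileMetadata{{MarkedForCompaction: true}, {}, {MarkedForCompaction: true}}
	for _, f := range files {
		acc, _ = a.Accumulate(f, acc)
	}
	fmt.Println(*acc.(*int)) // 2
}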
Reviewable status: 3 of 28 files reviewed, 7 unresolved discussions (waiting on @itsbilal, @jbowens, @nicktrav, and @sumeerbhola)
compaction_picker.go, line 1075 at r5 (raw file):
Previously, sumeerbhola wrote…
How expensive is it to do this repeatedly? Should we mark the FileMetadata with a bool so we don't repeat?
It's not cheap. Even if we mark a bool on the FileMetadata, the scan is O(# sstables), which is never great. One thought is that we could reuse the f.MarkedForCompaction bool (which does get persisted to the manifest) which ensures that we only need to do this traversal once, but requires some extra coordination.
compaction_picker.go, line 1081 at r5 (raw file):
Previously, sumeerbhola wrote…
Is the nonatomicFileCount stale once a compaction finishes, until there are few enough compaction candidates and we fall through to here?
Can we cheaply strengthen the staleness bound e.g. by tracking the non-atomic state in FileMetadata and adjusting the count when applying the VersionEdit?
Done.
compaction_picker.go, line 1194 at r5 (raw file):
Previously, sumeerbhola wrote…
are we ensuring somewhere that level is not 0?
findSplitUserKey only looks at levels 1...6.
compaction_picker.go, line 1620 at r5 (raw file):
Previously, nicktrav (Nick Travers) wrote…
Perhaps this is me needlessly seeking out a (premature) optimization - this function is called repeatedly and ends up re-scanning the levels in the same order looking for compaction units.
That said, the only alternatives I could come up with while writing this comment seemed complicated (e.g. I was thinking of some kind of lazy iterator over atomic compaction units, but it would hold onto a readstate, which is blergh).
Maybe it's a non-issue in practice as searching for these units happens last in order of compaction types, and the metadata is all in-mem.
Reworked to avoid duplicate scans by a) persisting which files should be compacted in the existing MarkedForCompaction field that's persisted in the manifest and b) using a b-tree annotation to ensure we don't need to repeatedly perform linear scans of a level's file metadata looking for files marked for compaction.
compaction_picker.go, line 1622 at r5 (raw file):
Previously, nicktrav (Nick Travers) wrote…
Is there a specific reason for starting at L6 and working our way to L1, versus the other way around?
Just that files in L6 are more likely to be older and so more likely to have non-atomic compaction units.
format_major_version.go, line 181 at r5 (raw file):
Previously, nicktrav (Nick Travers) wrote…
nit: maybe mention that we're guaranteed to hold d.mu here, and thus safe to call the method that relies on the lock being held?
Done.
open.go, line 459 at r5 (raw file):
Previously, nicktrav (Nick Travers) wrote…
Nice. Accompanying ticket / issue, perhaps?
Thinking about it some more, it seems like the migration code has to live on in perpetuity, given that we allow a Pebble DB to be opened at any old version and we'll chug through all the migrations in order - maybe that's fine, but just wanted to mention it, as maybe we have a plan for such a thing.
I was imagining at some point we'd increase the minimum format major version supported by the master branch of Pebble. So non-CockroachDB users with existing databases would need to ratchet the format major version using a non-master SHA before upgrading to master.
Reviewable status: 3 of 28 files reviewed, 7 unresolved discussions (waiting on @itsbilal, @jbowens, @nicktrav, and @sumeerbhola)
compaction_picker.go, line 1075 at r5 (raw file):
It's not cheap. Even if we mark a bool on the FileMetadata, the scan is O(# sstables), which is never great. One thought is that we could reuse the f.MarkedForCompaction bool (which does get persisted to the manifest) which ensures that we only need to do this traversal once, but requires some extra coordination.
This comment is also a stale draft comment. I implemented the MarkedForCompaction change I referenced here.
compaction_picker.go, line 1194 at r5 (raw file):
findSplitUserKey only looks at levels 1...6.
This was a stale draft comment. This was reworked so that pickRewriteCompaction can handle L0 files, which is necessary in case an L0 file is marked for compaction from RocksDB/old Pebble versions.
Reviewable status: 3 of 28 files reviewed, 8 unresolved discussions (waiting on @itsbilal, @jbowens, @nicktrav, and @sumeerbhola)
compaction_picker.go, line 1248 at r6 (raw file):
	continue
}
lf := p.vers.Levels[l].Find(p.opts.Comparer.Compare, candidate)
TODO: This needs to be fixed. .Find only works for L1...6, not L0. I'll fix this and add a test for an L0 rewrite compaction.
Force-pushed from 59caac4 to 718b28f
Force-pushed from 718b28f to 268f7c3
This looks great - thanks for the extra effort to make use of the annotations and the persistence.
I did a pass of the more recent changes and it looks good. A few small comments I'll wait to hear back on. But otherwise LGTM (still).
Reviewed 20 of 25 files at r6, 5 of 7 files at r7, 2 of 2 files at r8, all commit messages.
Reviewable status: all files reviewed, 16 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)
-- commits, line 13 at r8:
nit: ... every file that is a member of a ...
but is repurposed here
No objection, just curious if we're closing the door here to using that field in the future if we're repurposing it here for this specific type of migration that happens once.
compaction.go, line 1451 at r5 (raw file):
d.mu.versions.incrementCompactionBytes(-c.bytesWritten)
info.TotalDuration = d.timeNow().Sub(startTime)
Did this removal creep in from a rebase?
compaction_picker.go, line 1183 at r8 (raw file):
	return accum
}
// If we haven't accumulated an eligible file yet, or f's LargestSeqNum is
nit: the first part of this comment (before the comma) pertains to this first check, and the second pertains to the next set of type-conversions and checks. Was confused reading it. Suggest splitting the comment?
format_major_version.go, line 75 at r8 (raw file):
FormatBlockPropertyCollector
// FormatSplitUserKeysMarked is a format major version that guarantees that
// all files the share user keys with neighbors are marked for compaction in
nit: ... all files that share ...
format_major_version.go, line 76 at r8 (raw file):
// FormatSplitUserKeysMarked is a format major version that guarantees that
// all files the share user keys with neighbors are marked for compaction in
// the manifest. Ratcheting to FormatSplitUserKeysMarked will block until
When we talk about "blocking" here (and below), I take it we're talking about just blocking ratcheting, and not blocking progress elsewhere in the DB (i.e. holding locks). Worth clarifying?
format_major_version.go, line 379 at r8 (raw file):
// database lock while we're scanning the file metadata.
//
// If we find a file to mark, we'll re-acquire the mutex before marking it,
Given that we're dropping and re-acquiring the lock, is there any issue with a race where the version passed into this function is different to the version of the DB that exists when we log the version edit in the calling function (above)?
I'm thinking of a situation where version v1 is passed in, we find a set of files to mark for compaction / annotate on one or more levels of v1. However, above us, the version moved forward to v2 (> v1), so we're annotating a stale version. When we exit this function, we return true to say that we marked some files on v1, triggering a version edit to be logged, but this will persist v2. Is it possible that we'd lose the annotations in that case?
I'm probably missing some invariant that guarantees that this is safe. Can we bolster the comment with this proof in that case?
internal/manifest/btree.go, line 593 at r8 (raw file):
if !n.leaf {
	for i := int16(0); i <= n.count; i++ {
		n.children[i].invalidateAnnotation(a)
I take it the recursion here is safe to perform - i.e. bounded depth, etc. Or can Go optimize this as a tail call (I couldn't find a good answer in the quick googling I just did).
internal/manifest/level_metadata.go, line 74 at r8 (raw file):
// key-sorted (eg, non-L0).
func (lm *LevelMetadata) Find(cmp base.Compare, m *FileMetadata) *LevelFile {
	// TODO(jackson): Add an assertion that lm is key-sorted.
Did we change something elsewhere that makes this TODO about lm being key-sorted obsolete?
internal/manifest/level_metadata.go, line 107 at r8 (raw file):
// InvalidateAnnotation clears any cached annotations defined by Annotator. The
// Annotator is used as the key for pre-calculated values, so equal Annotators
Is "equal" here taken to mean equality of the underlying dynamic type (i.e. same instance)?
testdata/split_user_key_migration, line 56 at r8 (raw file):
006
# Ensure the files previously marked for compaction are still marked for
Worth doing something like a DB close / open here too to prove that we don't lose the marked files? Not sure it really tests persistence, just that it's recomputed.
Reviewed 9 of 25 files at r6, 1 of 7 files at r7, all commit messages.
Reviewable status: all files reviewed, 21 unresolved discussions (waiting on @itsbilal, @jbowens, @nicktrav, and @sumeerbhola)
Previously, nicktrav (Nick Travers) wrote…
but is repurposed here
No objection, just curious if we're closing the door here to using that field in the future if we're repurposing it here for this specific type of migration that happens once.
I suppose we could reuse it post-FormatMarkCompacted.
compaction_picker.go, line 1185 at r8 (raw file):
// If we haven't accumulated an eligible file yet, or f's LargestSeqNum is
// less than the accumulated file's, use f.
if accum == nil {
nit: can we factor out
// REQUIRES: f is non-nil, and f.MarkedForCompaction=true.
func mergeHelper(f *FileMetadata, dst interface{}) (interface{}, bool)
and then do
func (a markedForCompactionAnnotator) Accumulate(
	f *fileMetadata, dst interface{},
) (interface{}, bool) {
	if !f.MarkedForCompaction {
		// Not marked for compaction; return dst.
		return dst, true
	}
	return mergeHelper(f, dst)
}

func (a markedForCompactionAnnotator) Merge(v interface{}, accum interface{}) interface{} {
	if v == nil {
		return accum
	}
	accum, _ = mergeHelper(v.(*fileMetadata), accum)
	return accum
}
compaction_picker.go, line 1253 at r8 (raw file):
}
inputs := lf.Slice()
How about a comment like the following:
// L0 files generated by a flush have never been split such that adjacent
// files can contain the same user key. So we do not need to rewrite an
// atomic compaction unit for L0. Note that there is nothing preventing two
// different flushes from producing files that are non-overlapping from an
// InternalKey perspective, but span the same user key. However, such files
// cannot be in the same L0 sublevel, since each sublevel requires
// non-overlapping user keys (unlike other levels).
format_major_version.go, line 190 at r8 (raw file):
FormatSplitUserKeysMarked: func(d *DB) error {
	// Mark any unmarked files with split-user keys. Note all format major
	// versions migrations are invoked with DB.mu locked.
Should we move the formatVers out of DB.mu and give it its own mutex? I am worried about hiccups in normal writes if we hold this lock for too long.
With that change we could have the implementation acquire and release DB.mu periodically, say after scanning each level to mark, and we would not be risking that a second migration could start concurrently.
format_major_version.go, line 320 at r8 (raw file):
d.maybeScheduleCompactionPicker(func(picker compactionPicker, env compactionEnv) *pickedCompaction {
	return picker.pickRewriteCompaction(env)
})
How will a compaction actually run while this method holds onto DB.mu?
format_major_version.go, line 424 at r8 (raw file):
// If we marked any files for compaction, clear the compaction-picking
// annotation that caches files marked-for-compaction, as it's now
// out-of-date.
I don't understand what is out of date? Could you elaborate?
Force-pushed from 268f7c3 to 50bf8ef
Reviewable status: 27 of 30 files reviewed, 14 unresolved discussions (waiting on @itsbilal, @nicktrav, and @sumeerbhola)
Previously, sumeerbhola wrote…
I suppose we could reuse it post-FormatMarkCompacted.
Yeah, that's right. I think we're also free to use it before then as a means of triggering explicit compaction of files (eg, non-inplace rewrite compactions) if we want to.
I was imagining it could be reused for cockroachdb/cockroach#74804 as well.
compaction.go, line 1451 at r5 (raw file):
Previously, nicktrav (Nick Travers) wrote…
Did this removal creep in from a rebase?
What removal? I'm not sure if I'm just navigating reviewable poorly, but there's no diff here.
compaction_picker.go, line 1185 at r8 (raw file):
Previously, sumeerbhola wrote…
nit: can we factor out
// REQUIRES: f is non-nil, and f.MarkedForCompaction=true.
func mergeHelper(f *FileMetadata, dst interface{}) (interface{}, bool)
and then do
func (a markedForCompactionAnnotator) Accumulate(
	f *fileMetadata, dst interface{},
) (interface{}, bool) {
	if !f.MarkedForCompaction {
		// Not marked for compaction; return dst.
		return dst, true
	}
	return mergeHelper(f, dst)
}

func (a markedForCompactionAnnotator) Merge(v interface{}, accum interface{}) interface{} {
	if v == nil {
		return accum
	}
	accum, _ = mergeHelper(v.(*fileMetadata), accum)
	return accum
}
Done.
compaction_picker.go, line 1253 at r8 (raw file):
Previously, sumeerbhola wrote…
How about a comment like the following:
// L0 files generated by a flush have never been split such that adjacent
// files can contain the same user key. So we do not need to rewrite an
// atomic compaction unit for L0. Note that there is nothing preventing two
// different flushes from producing files that are non-overlapping from an
// InternalKey perspective, but span the same user key. However, such files
// cannot be in the same L0 sublevel, since each sublevel requires
// non-overlapping user keys (unlike other levels).
Done.
format_major_version.go, line 190 at r8 (raw file):
Previously, sumeerbhola wrote…
Should we move the formatVers out of DB.mu and give it its own mutex? I am worried about hiccups in normal writes if we hold this lock for too long.
With that change we could have the implementation acquire and release DB.mu periodically, say after scanning each level to mark, and we would not be risking that a second migration could start concurrently.
I'm wary of a second mutex. We need to read the format version during normal database operation, typically when we already have d.mu held, like starting a compaction. If we have a second mutex, we need to dictate a new lock ordering, which is easy to forget.
I'd prefer we dictate that format major version migrations must be careful with holding the mutex for long-running operations, like we do elsewhere in the DB. In this case, we don't do much with the mutex because we drop it before scanning all the file metadata. I don't think even scanning all the file metadata is that expensive. Before the introduction of the b-tree, we would perform O(n) key comparisons during compaction picking while picking a file within a level (and RocksDB still performs these O(# sstables) key comparisons during its compaction picking).
We can prevent a second migration from running concurrently with a d.mu.formatVers.ratcheting bool. I'll add that here.
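A hedged sketch of that guard, with an illustrative struct layout rather than Pebble's actual DB definition:

package main

import (
	"errors"
	"fmt"
	"sync"
)

// DB is an illustrative stand-in for Pebble's DB struct.
type DB struct {
	mu struct {
		sync.Mutex
		formatVers struct {
			vers       uint64
			ratcheting bool
		}
	}
}

// ratchetLocked requires d.mu to be held. Because migrations may drop
// and re-acquire d.mu, the boolean prevents a second ratchet from
// starting while the first is mid-migration.
func (d *DB) ratchetLocked(target uint64, migrate func(*DB)) error {
	if d.mu.formatVers.ratcheting {
		return errors.New("format major version ratchet already in progress")
	}
	d.mu.formatVers.ratcheting = true
	defer func() { d.mu.formatVers.ratcheting = false }()
	for d.mu.formatVers.vers < target {
		migrate(d) // may unlock and re-lock d.mu internally
		d.mu.formatVers.vers++
	}
	return nil
}

func main() {
	d := &DB{}
	d.mu.Lock()
	defer d.mu.Unlock()
	fmt.Println(d.ratchetLocked(2, func(*DB) {})) // <nil>
}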
format_major_version.go, line 320 at r8 (raw file):
Previously, sumeerbhola wrote…
How will a compaction actually run while this method holds onto DB.mu?
We wait on the d.mu.compact.cond condition variable, which unlocks d.mu while it waits. Added a comment there.
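For anyone unfamiliar with the pattern, a small self-contained demonstration of the property this relies on: sync.Cond.Wait atomically releases the associated mutex while parked and re-acquires it before returning, so goroutines that need the mutex (e.g. ones finishing compactions) can make progress.

package main

import (
	"fmt"
	"sync"
)

func main() {
	var mu sync.Mutex
	cond := sync.NewCond(&mu)
	done := false

	go func() {
		mu.Lock() // acquirable because Wait released mu
		done = true
		mu.Unlock()
		cond.Broadcast()
	}()

	mu.Lock()
	for !done {
		cond.Wait() // unlocks mu while parked, re-locks before returning
	}
	mu.Unlock()
	fmt.Println("compactions complete")
}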
format_major_version.go, line 379 at r8 (raw file):
Previously, nicktrav (Nick Travers) wrote…
Given that we're dropping and re-acquiring the lock, is there any issue with a race where the version passed into this function is different to the version of the DB that exists when we log the version edit in the calling function (above)?
I'm thinking of a situation where version v1 is passed in, we find a set of files to mark for compaction / annotate on one or more levels of v1. However, above us, the version moved forward to v2 (> v1), so we're annotating a stale version. When we exit this function, we return true to say that we marked some files on v1, triggering a version edit to be logged, but this will persist v2. Is it possible that we'd lose the annotations in that case?
I'm probably missing some invariant that guarantees that this is safe. Can we bolster the comment with this proof in that case?
You're right, there was a race here. Marking files as marked on the *FileMetadata is okay, because if the file metadata still exists in the LSM, both the new and the old versions hold pointers to the same FileMetadata struct. The vers.Stats.MarkedForCompaction counts however would not be right.
I updated this to collect all the split user-key files, and then re-find them in the new version once we reacquire the mutex.
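To make the fix concrete, a minimal self-contained sketch (all types are illustrative stand-ins for Pebble's manifest structures, not its actual API):

package main

import "fmt"

// fileMeta and version are illustrative stand-ins.
type fileMeta struct {
	num                 int
	markedForCompaction bool
}

type version struct {
	files map[int]*fileMeta // keyed by file number
}

// markCandidates re-resolves candidates that were collected against a
// possibly stale version: only files still present in the current
// version are marked and counted, so the persisted count matches what
// the latest version actually contains.
func markCandidates(cur *version, candidates []*fileMeta) (marked int) {
	for _, f := range candidates {
		if _, ok := cur.files[f.num]; ok {
			f.markedForCompaction = true
			marked++
		}
	}
	return marked
}

func main() {
	f1 := &fileMeta{num: 1}
	f2 := &fileMeta{num: 2}
	stale := &version{files: map[int]*fileMeta{1: f1, 2: f2}}

	// Scan the (possibly stale) version with the mutex dropped.
	var candidates []*fileMeta
	for _, f := range stale.files {
		candidates = append(candidates, f)
	}

	// Meanwhile a compaction removed file 2; the current version holds
	// only file 1. Re-resolve under the mutex before counting.
	cur := &version{files: map[int]*fileMeta{1: f1}}
	fmt.Println(markCandidates(cur, candidates)) // 1
}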
format_major_version.go, line 424 at r8 (raw file):
Previously, sumeerbhola wrote…
I don't understand what is out of date? Could you elaborate?
Elaborated the comment; lemme know if it's still unclear.
internal/manifest/btree.go, line 593 at r8 (raw file):
Previously, nicktrav (Nick Travers) wrote…
I take it the recursion here is safe to perform - i.e. bounded depth, etc. Or can Go optimize this as a tail call (I couldn't find a good answer in the quick googling I just did).
It's bounded by the depth of the b-tree, which is log16(# files in the level); for example, a level with a million files gives a depth of about five.
internal/manifest/level_metadata.go, line 74 at r8 (raw file):
Previously, nicktrav (Nick Travers) wrote…
Did we change something elsewhere that makes this TODO about lm being key-sorted obsolete?
The change to this function removed that requirement.
Level 0 files are not key-sortable, so previously Find only worked with L1+ because it assumed lm is key-sorted.
internal/manifest/level_metadata.go, line 107 at r8 (raw file):
Previously, nicktrav (Nick Travers) wrote…
Is "equal" here taken to mean equality of the underlying dynamic type (i.e. same instance)?
Yes, although in practice all of the B-Tree annotator implementations are empty struct{}s, implementing Annotator on the non-pointer receiver, so it's really just a type comparison.
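A tiny self-contained demonstration of that point: values of an empty struct type compare equal whenever their dynamic types match, so comparing annotators as interface values is effectively a type comparison.

package main

import "fmt"

type markedAnnotator struct{}
type sizeAnnotator struct{}

func main() {
	var a, b interface{} = markedAnnotator{}, markedAnnotator{}
	var c interface{} = sizeAnnotator{}
	fmt.Println(a == b) // true: same dynamic type, zero-sized values
	fmt.Println(a == c) // false: different dynamic types
}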
testdata/split_user_key_migration, line 56 at r8 (raw file):
Previously, nicktrav (Nick Travers) wrote…
Worth doing something like a DB close / open here too to prove that we don't lose the marked files? Not sure it really tests persistence, just that it's recomputed.
I might be misunderstanding, but just above this is a reopen. The above format-major-version call is only reading the current format major version.
Force-pushed from 7fd4e3e to 7dc3d8d
All my comments have been addressed. This looks great!
Reviewed 3 of 3 files at r9, 11 of 11 files at r10, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @itsbilal and @sumeerbhola)
compaction.go, line 1451 at r5 (raw file):
Previously, jbowens (Jackson Owens) wrote…
What removal? I'm not sure if I'm just navigating reviewable poorly, but there's no diff here.
Weird. At r5 there was a removal, but it's gone now. Disregard.
testdata/split_user_key_migration, line 56 at r8 (raw file):
Previously, jbowens (Jackson Owens) wrote…
I might be misunderstanding, but just above this is a reopen. The above format-major-version call is only reading the current format major version.
Urgh - reviewable was hiding it. Thanks.
@sumeerbhola any more comments?
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @itsbilal and @sumeerbhola)
(I didn't look at the tests since @nicktrav was doing a careful review)
Reviewed 1 of 25 files at r6, 2 of 3 files at r9, 1 of 11 files at r10, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)
db.go, line 326 at r10 (raw file):
// currently in the process of ratcheting the format major version
// to vers + 1. As a part of ratcheting the format major version,
// migrations may may drop and re-acquire the mutex.
may drop
Force-pushed from 7dc3d8d to 16bccdb
TFTRs!
Reviewable status: 28 of 30 files reviewed, 8 unresolved discussions (waiting on @itsbilal, @nicktrav, and @sumeerbhola)
db.go, line 326 at r10 (raw file):
Previously, sumeerbhola wrote…
may drop
Done.
Add two new format major versions FormatSplitUserKeysMarked and
FormatMarkCompacted that together may be used to guarantee that all sstables
within the LSM form their own atomic compaction unit. Previous versions of
Pebble (before #1470) and RocksDB allowed keys with identical user keys to be
split across multiple tables within a level. This produced a set of files known
as an 'atomic compaction unit' that must be compacted together to preserve the
LSM invariant.
The first format major version FormatSplitUserKeysMarked may be used to
guarantee that every file that is a member of a multi-file atomic compaction unit
is marked for compaction. The 'marked-for-compaction' file metadata field is
currently unused by Pebble, but is repurposed here to record the intent to
recompact these files with split user keys. If ratcheting to
FormatSplitUserKeysMarked discovers files with split user keys, it marks them
for compaction and then rotates the manifest to ensure that the updated file
metadata is persisted durably.
This commit introduces a new rewrite compaction type. During compaction picking,
if no other productive compaction is picked, the compaction picker looks for
files marked for compaction. It uses a manifest.Annotator to avoid a linear
search through the file metadata. If a marked file exists, it picks a
compaction that outputs into the file's existing level, pulling in its atomic
compaction unit if necessary.
The second format major version FormatMarkCompacted is used to guarantee that
no files that are marked for compaction exist. This may be used in a subsequent
CockroachDB release (22.2) to ensure that all files marked for compaction by
the FormatSplitUserKeysMarked format major version have been compacted away.
Ratcheting to this format major version blocks until all the marked files are
compacted.
Together these format major versions will allow us to remove code necessary to
handle these atomic compaction units, when we increase the minimum format major
version supported by Pebble.
Close #1495.