Prevent sstable boundaries from expanding during compaction #67
Conversation
This change is so small that I feel I'm missing something. I believe RocksDB achieves the same invariant by truncating range tombstones at read time during compaction. For Pebble I think that truncation is unnecessary as we get implicit truncation to sstable boundaries in
Don't you want "can't be larger"?
compaction.go
Outdated
// Determine the largest key from the inputs in order to maintain the
// invariant that the outputs from a compaction have bounds that are within
// the bounds of the input. This is necessary because range tombstones are
// not truncated to sstable boundaries. A compaction could this result in the
thus
compaction.go
Outdated
// Bound the largest key of the sstable by the largest key of the
// compaction inputs. See the comment on largestInputKey for why this is
// necessary.
if db.InternalCompare(d.cmp, meta.largest, largestInputKey) > 0 {
What allows meta.largest
to exceed largestInputKey
in the first place? Is it always a range tombstone?
Maybe the comment on largestInputKey could elucidate this more and explain that naively we'd end up with the tombstone which reaches outside of the sstable creating a wider sstable as a result, but we don't want that.
Will try writing some test cases. The cases I'm worried about are when a compaction contains multiple atomic compaction units (in one level). This case (facebook/rocksdb#4432 (comment)) is close to what I have in mind but needs to be modified. The final L5->L6 couldn't cause the problem anymore due to how you mentioned we stop paying attention to a file's range tombstones when that file is closed. But I am imagining moving everything up a level, so that final step becomes an L4->L5. At that time the range tombstone (and its containing file) should expand such that a later L5->L6 could delete keys newer than the range tombstone.
Yeah, there may be a problem here. What if L1 also contained an sstable with the range [d-d)? We'd generate a new sstable in L2 with the bounds [a-d). I'm AFK for a bit, but I can't see why the approach in this PR would work.
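[Editor's note: to make the failure mode under discussion concrete, here is a hypothetical picture of the expansion; the keys and files are invented for illustration and are not the actual test case from this PR.]

```
L1:  sst1 [a, c)   contains RANGEDEL [a, d)   (tombstone extends past sst1's bounds)
L2:  sst2 [b, c)

Compacting sst1+sst2 into L2 without truncating the tombstone writes
RANGEDEL [a, d) into the output, so the output sstable's bounds become
[a, d) even though no input file's bounds reached d. A later compaction
of that wider sstable can then apply the tombstone over keys it was
never positioned above.
```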
The range tombstones for compaction are loaded by a specially configured levelIter. I bet we could wrap the rangeDelIter to truncate the tombstones to atomic compaction unit boundaries. This would be almost entirely contained in compaction.newInputIter. Let me try that and see how it works.
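[Editor's note: a minimal sketch of that wrapping idea follows. The Span and spanIter types are simplified stand-ins invented for this sketch, not Pebble's actual rangeDelIter interface.]

```go
package rangedeltrunc

import "bytes"

// Span is a simplified stand-in for a fragmented range tombstone
// deleting user keys in [Start, End).
type Span struct {
	Start, End []byte
}

// spanIter is the minimal iterator surface this sketch needs to wrap.
type spanIter interface {
	Next() *Span // returns nil once exhausted
}

// truncatingIter clips every tombstone it yields to upperBound, which
// stands in for the atomic compaction unit's upper boundary.
type truncatingIter struct {
	iter       spanIter
	upperBound []byte
}

func (t *truncatingIter) Next() *Span {
	for {
		s := t.iter.Next()
		if s == nil {
			return nil
		}
		// The tombstone lies entirely at or beyond the bound: drop it.
		if bytes.Compare(s.Start, t.upperBound) >= 0 {
			continue
		}
		// The tombstone straddles the bound: clip its end.
		if bytes.Compare(s.End, t.upperBound) > 0 {
			return &Span{Start: s.Start, End: t.upperBound}
		}
		return s
	}
}
```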
@ajkr You are right. Here is a slight variation on the test case present in this PR (which was copied from #27) which fails:
That final compaction should be producing L2 sstables with the bounds
Thanks, I was also just trying the same thing you suggested: ajkr@4646421. (Side note: if one accidentally uses
BTW, I've already written code that pushes atomic compaction unit boundaries down to the specially configured level iterator. I just hadn't decided yet where to do the actual truncation. Here it is in case any of it is useful: ajkr@ba9e6a5
Force-pushed from 1d15831 to c805f3a
Heh, I didn't see your messages until just now. I just updated this commit to do the truncation to the atomic compaction unit upper bound by wrapping the rangeDelIter used during compaction. Do we have to truncate both the upper and lower bound, or just the upper bound? I think the lower bound truncation comes for free by the way tombstones are fragmented during compaction. Let me take a look at your commits.
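[Editor's note: if the lower bound turns out to need explicit truncation as well, the wrapper sketched earlier extends symmetrically. This reuses the hypothetical Span type from that sketch.]

```go
// clampLower clips a tombstone's start key to lowerBound; the
// symmetric counterpart to the upper-bound truncation sketched above.
func clampLower(s *Span, lowerBound []byte) *Span {
	if s == nil || bytes.Compare(s.End, lowerBound) <= 0 {
		// The tombstone lies entirely below the bound: drop it.
		return nil
	}
	if bytes.Compare(s.Start, lowerBound) < 0 {
		return &Span{Start: lowerBound, End: s.End}
	}
	return s
}
```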
LGTM.
compaction.go
Outdated
if rangeDelIter != nil {
	// Truncate the range tombstones returned by the iterator to the upper
	// bound of the atomic compaction unit.
	if upperBound := c.atomicCompactionUnitUpperBound(f); upperBound != nil {
Cool, I didn't think of computing this as the compaction is ongoing.
Yeah. We could precompute this in newInputIter and store the results in a map keyed by *fileMetadata, but that seems like overkill when compactions usually involve only a few inputs.
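[Editor's note: for reference, the map-based alternative being set aside would look roughly like the following. The types are simplified stand-ins and atomicCompactionUnitUpperBound is stubbed; only the shape of the precompute matters here.]

```go
package precompute

// fileMetadata and compaction are simplified stand-ins for Pebble's
// types; the field layout is an assumption for illustration only.
type fileMetadata struct {
	smallest, largest []byte
}

type compaction struct {
	// inputs[0] and inputs[1] hold the two input levels of the compaction.
	inputs [2][]fileMetadata
}

// atomicCompactionUnitUpperBound is stubbed so the sketch stands
// alone; the real computation is discussed elsewhere in this PR.
func (c *compaction) atomicCompactionUnitUpperBound(f *fileMetadata) []byte {
	return f.largest // placeholder, not the real logic
}

// precomputeUpperBounds computes each input file's upper bound once,
// e.g. in newInputIter, and caches it in a map keyed by *fileMetadata.
func precomputeUpperBounds(c *compaction) map[*fileMetadata][]byte {
	bounds := make(map[*fileMetadata][]byte)
	for i := range c.inputs {
		for j := range c.inputs[i] {
			f := &c.inputs[i][j]
			bounds[f] = c.atomicCompactionUnitUpperBound(f)
		}
	}
	return bounds
}
```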
@petermattis I have a vague memory that old versions of RocksDB don't even truncate their range tombstones if they extend to the left of the file, even though they could because truncating to the start key doesn't lose any information about whether the start key is covered. So maybe. It'd be hard to write a test because the input data would have to be generated by old RocksDB.
Ok. I think it is good to be paranoid here and truncate to the lower bound as well. I'll be AFK the rest of the afternoon and most of tomorrow. Let me know if you want to take over this PR. Otherwise I'll polish it off when I find time.
compaction.go
Outdated
if f == &files[j] {
	upperBound := f.largest.UserKey
	for j = j + 1; j < len(files); j++ {
		cur := &files[j-1]
Oh yeah, it was also clever to operate just on the compaction input files, unlike my implementation, which copied expandInputs in that it referenced files in version.
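[Editor's note: pulling the quoted fragment together, the boundary walk might look like the sketch below. Keys here are plain user keys, whereas the real code compares internal keys, so treat this as the shape of the loop rather than the actual implementation.]

```go
package atomicunit

import "bytes"

// fileMetadata is a simplified stand-in for Pebble's file metadata.
type fileMetadata struct {
	smallest, largest []byte
}

// atomicCompactionUnitUpperBound walks forward through a level's
// sorted input files, extending the upper bound while consecutive
// files share a boundary user key: such files form a single atomic
// compaction unit and must be compacted together.
func atomicCompactionUnitUpperBound(files []fileMetadata, f *fileMetadata) []byte {
	for j := range files {
		if f != &files[j] {
			continue
		}
		upperBound := f.largest
		for j++; j < len(files); j++ {
			cur := &files[j-1]
			next := &files[j]
			// A gap between cur and next ends the atomic unit.
			if bytes.Compare(cur.largest, next.smallest) < 0 {
				break
			}
			upperBound = next.largest
		}
		return upperBound
	}
	return nil
}
```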
During compaction, truncate range tombstones at read time to atomic compaction unit boundaries. Truncating to atomic compaction unit boundaries enforces the invariant that the boundaries of the output of a compaction cannot be larger than the boundaries of the input. It also enforces the tighter invariant that range tombstones do not enlarge beyond the boundaries of their atomic compaction unit during compaction. This is necessary to prevent a range tombstone from expanding after a compaction to cover a newer key at a lower level in the LSM.

Fixes #26
Closes #27
Force-pushed from c805f3a to 4b58d9c
@ajkr I've updated this PR and replaced the explicit bounding of the output sstable's largest key with truncation of the range tombstones during compaction.
Reviewed 1 of 3 files at r2, 8 of 8 files at r3.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @petermattis)
testdata/compaction_atomic_unit_bounds, line 1 at r3 (raw file):
define
Thanks for the good test coverage.
TFTR!
Enforce the invariant that the boundaries of the output of a compaction
cannot be larger than the boundaries of the input. This fixes a problem
where a compaction can cause a range tombstone to expand to cover a
newer key at a lower level in the LSM.
Fixes #26
Closes #27