Prevent sstable boundaries from expanding during compaction #67

Merged: petermattis merged 1 commit into master from pmattis/compaction-bounds on Apr 17, 2019

Conversation

petermattis (Collaborator):

Enforce the invariant that the boundaries of the output of a compaction
can be larger than the boundaries of the input. This fixes a problem
where a compaction can cause a range tombstone to expand to cover a
newer key at a lower level in the LSM.

Fixes #26
Closes #27

petermattis requested a review from ajkr, April 6, 2019 15:13

@petermattis (Collaborator, Author):

This change is so small that I feel I'm missing something. I believe RocksDB achieves the same invariant by truncating range tombstones at read time during compaction. For Pebble I think that truncation is unnecessary as we get implicit truncation to sstable boundaries in mergingIter. The part we were getting wrong is allowing the sstable boundaries to expand during compaction. Please give this careful scrutiny. As we're both well aware: range tombstones have super subtle failure scenarios.
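
In sketch form, the invariant being enforced here, with bare user keys standing in for Pebble's internal keys (which also carry a sequence number and kind):

```go
package sketch

import "bytes"

// clampLargest sketches the idea in this PR: an output sstable's largest
// key is clamped so it never exceeds the largest key among the compaction
// inputs. An untruncated range tombstone ending past the last point key
// could otherwise push the output boundary beyond the input bounds.
func clampLargest(outputLargest, largestInputKey []byte) []byte {
	if bytes.Compare(outputLargest, largestInputKey) > 0 {
		return largestInputKey
	}
	return outputLargest
}
```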

@tbg (Member) left a comment:

Don't you want "can't be larger"?

compaction.go Outdated
// Determine the largest key from the inputs in order to maintain the
// invariant that the outputs from a compaction have bounds that are within
// the bounds of the input. This is necessary because range tombstones are
// not truncated to sstable boundaries. A compaction could this result in the
tbg (Member):

thus

compaction.go Outdated
// Bound the largest key of the sstable by the largest key of the
// compaction inputs. See the comment on largestInputKey for why this is
// necessary.
if db.InternalCompare(d.cmp, meta.largest, largestInputKey) > 0 {
tbg (Member):

What allows meta.largest to exceed largestInputKey in the first place? Is it always a range tombstone?

tbg (Member):

Maybe the comment on largestInputKey could elucidate this more and explain that, naively, a tombstone reaching outside the sstable would create a wider sstable as a result, but we don't want that.
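
To illustrate the scenario behind these questions: a table's largest boundary is derived from both its last point key and the end of its last range tombstone, so an untruncated tombstone is what lets meta.largest exceed largestInputKey. A simplified sketch, assuming bare user keys:

```go
package sketch

import "bytes"

// largestBound returns the larger of the last point key written to a table
// and the (untruncated) end key of its last range tombstone. It is the
// tombstone end, reaching past every point key, that can push the boundary
// beyond largestInputKey.
func largestBound(lastPointKey, lastTombstoneEnd []byte) []byte {
	if bytes.Compare(lastTombstoneEnd, lastPointKey) > 0 {
		return lastTombstoneEnd
	}
	return lastPointKey
}
```

For instance, a table whose last point key is c but which contains the tombstone a-e gets the boundary e, which is exactly how a c-e table arises in the reproduction later in this thread.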

@ajkr (Contributor) commented Apr 6, 2019:

Will try writing some test cases. The cases I'm worried about are when a compaction contains multiple atomic compaction units (in one level). That could be caused by compaction::grow() on the input level, or simply by the chosen input files spanning multiple output-level compaction units. In those cases, with this PR I believe we can still write out files whose endpoints and range tombstones extend outside their original compaction unit.

This case (facebook/rocksdb#4432 (comment)) is close to what I have in mind but needs to be modified. The final L5->L6 couldn't cause the problem anymore because, as you mentioned, we stop paying attention to a file's range tombstones once that file is closed. But I am imagining moving everything up a level, so that final step becomes an L4->L5. At that point the range tombstone (and its containing file) would expand such that a later L5->L6 could delete keys newer than the range tombstone.
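
For context, a simplified sketch of the "atomic compaction unit" notion used here: neighboring files in a level that share a boundary user key (because a fragmented, untruncated range tombstone straddles them) must be compacted together. Pebble's actual computation compares internal-key boundaries; this version uses bare user keys:

```go
package sketch

import "bytes"

// fileBounds is a stand-in for a file's smallest/largest user keys; the
// largest may come from a range tombstone end rather than a point key.
type fileBounds struct {
	smallest, largest []byte
}

// atomicUnit expands the file at index i (0 <= i < len(files), files sorted
// by key) left and right to include any neighbors that share a boundary
// user key, which happens when an untruncated range tombstone was
// fragmented across them.
func atomicUnit(files []fileBounds, i int) (lo, hi int) {
	lo, hi = i, i
	for lo > 0 && bytes.Equal(files[lo-1].largest, files[lo].smallest) {
		lo--
	}
	for hi+1 < len(files) && bytes.Equal(files[hi].largest, files[hi+1].smallest) {
		hi++
	}
	return lo, hi
}
```

Under this definition, the a-c and c-e tables in the failing test below form a single unit that a compaction must not pick apart.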

@petermattis (Collaborator, Author):

Yeah, there may be a problem here. What if L1 also contained an sstable with the range [d-d)? We’d generate a new sstable in L2 with the bounds [a-d). I’m AFK for a bit, but I can’t see why the approach in this PR would work.

@petermattis (Collaborator, Author):

The range tombstones for compaction are loaded by a specially configured levelIter. I bet we could wrap the rangeDelIter to truncate the tombstones to atomic compaction unit boundaries. This would be almost entirely contained in compaction.newInputIter. Let me try that and see how it works.
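
A sketch of that wrapping idea, with hypothetical simplified types in place of the real rangeDelIter, and upper-bound clamping only:

```go
package sketch

import "bytes"

// span is a stand-in for a fragmented range tombstone covering the
// user-key interval [start, end).
type span struct{ start, end []byte }

// spanIter is a stand-in for the compaction's rangeDelIter.
type spanIter interface {
	next() (span, bool)
}

// truncatingIter wraps the tombstone iterator for one input file and clamps
// each tombstone's end key to the upper bound of the file's atomic
// compaction unit, so tombstones cannot expand the output bounds.
type truncatingIter struct {
	iter       spanIter
	upperBound []byte // nil when no clamping is needed
}

func (it *truncatingIter) next() (span, bool) {
	s, ok := it.iter.next()
	if !ok {
		return span{}, false
	}
	if it.upperBound != nil && bytes.Compare(s.end, it.upperBound) > 0 {
		s.end = it.upperBound
	}
	return s, true
}
```

Clamping only ever shrinks a tombstone, so no key that was previously visible can become covered.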

@petermattis (Collaborator, Author):

@ajkr You are right. Here is a slight variation on the test case in this PR (which was copied from #27) that fails:

define target-file-sizes=(100, 1) snapshots=(1)
L0
  a.RANGEDEL.1:e
L0
  a.SET.2:v
L0
  c.SET.3:v
L0
  f.SET.4:v
L2
  d.SET.0:v
----
mem: 1
0: a-e a-a c-c f-f
2: d-d

compact a-b
----
0: f-f
1: a-c c-e
2: d-d

compact d-e
----
0: f-f
1: a-c
3: c-d d-e

get seq=4
c
----
v

compact f-f L0
----
1: a-c f-f
3: c-d d-e

compact a-f L1
----
2: a-e e-f
3: c-d d-e

get seq=4
c
----
v

That final compaction should be producing L2 sstables with the bounds a-c and f-f.

ajkr added a commit to ajkr/pebble that referenced this pull request Apr 6, 2019
@ajkr (Contributor) commented Apr 6, 2019:

Thanks, I was also just trying the same thing you suggested: ajkr@4646421. (Side note: if one accidentally uses --- instead of ---- in the data-driven tests it's kind of hard to debug.)

@ajkr (Contributor) commented Apr 6, 2019:

> The range tombstones for compaction are loaded by a specially configured levelIter. I bet we could wrap the rangeDelIter to truncate the tombstones to atomic compaction unit boundaries. This would be almost entirely contained in compaction.newInputIter. Let me try that and see how it works.

BTW, I've already written code that pushes atomic compaction unit boundaries down to the specially configured level iterator. I just hadn't decided yet where to do the actual truncation. Here it is in case any of it is useful: ajkr@ba9e6a5

@petermattis (Collaborator, Author):

Heh, I didn't see your messages until just now. I just updated this commit to do the truncation to the atomic compaction unit upper bound by wrapping the rangeDelIter used during compaction. Do we have to truncate both the upper and lower bound, or just the upper bound? I think the lower bound truncation comes for free by the way tombstones are fragmented during compaction.

Let me take a look at your commits.
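
A sketch of why the lower bound may come for free: fragmentation splits a tombstone at output-file boundaries, so the right-hand fragment already begins exactly at the boundary and cannot reach left of the file that contains it (simplified, hypothetical types):

```go
package sketch

import "bytes"

// span covers the user-key interval [start, end).
type span struct{ start, end []byte }

// splitAt fragments a tombstone at an output-file boundary, as the
// compaction's fragmenter does. Note the right-hand fragment starts at
// the boundary key, which is why explicit lower-bound truncation may be
// unnecessary.
func splitAt(t span, boundary []byte) (left, right span, split bool) {
	if bytes.Compare(boundary, t.start) <= 0 || bytes.Compare(boundary, t.end) >= 0 {
		return t, span{}, false // boundary outside the tombstone
	}
	return span{t.start, boundary}, span{boundary, t.end}, true
}
```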

@ajkr (Contributor) left a comment:

LGTM.

compaction.go Outdated
if rangeDelIter != nil {
// Truncate the range tombstones returned by the iterator to the upper
// bound of the atomic compaction unit.
if upperBound := c.atomicCompactionUnitUpperBound(f); upperBound != nil {
ajkr (Contributor):

Cool, I didn't think of computing this as the compaction is ongoing.

petermattis (Collaborator, Author):

Yeah. We could precompute this in newInputIter and store the results in a map keyed by *fileMetadata, but that seems like overkill when compactions usually involve only a few inputs.
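
The precomputed alternative would look roughly like this (a hypothetical sketch; the PR deliberately computes the bound on demand instead):

```go
package sketch

// fileMetadata is a stand-in for Pebble's per-sstable metadata.
type fileMetadata struct {
	smallest, largest []byte
}

// precomputeUpperBounds eagerly computes each input file's atomic-
// compaction-unit upper bound once, into a map keyed by *fileMetadata,
// instead of deriving it per file as the compaction proceeds.
func precomputeUpperBounds(
	inputs []*fileMetadata,
	upperBoundOf func(*fileMetadata) []byte,
) map[*fileMetadata][]byte {
	m := make(map[*fileMetadata][]byte, len(inputs))
	for _, f := range inputs {
		m[f] = upperBoundOf(f)
	}
	return m
}
```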

@ajkr (Contributor) commented Apr 6, 2019:

@petermattis I have a vague memory that old versions of RocksDB don't even truncate their range tombstones if they extend to the left of the file, even though they could because truncating to the start key doesn't lose any information about whether the start key is covered. So maybe. It'd be hard to write a test because the input data would have to be generated by old RocksDB.

@petermattis (Collaborator, Author):

> I have a vague memory that old versions of RocksDB don't even truncate their range tombstones if they extend to the left of the file, even though they could because truncating to the start key doesn't lose any information about whether the start key is covered. So maybe. It'd be hard to write a test because the input data would have to be generated by old RocksDB.

Ok. I think it is good to be paranoid here and truncate to the lowerBound as well. Re testing: we can generate an sstable and then whack it into the LSM in a similar manner to what runDBDefineCmd does.

I'll be AFK the rest of the afternoon and most of tomorrow. Let me know if you want to take over this PR. Otherwise I'll polish it off when I find time.

compaction.go Outdated
if f == &files[j] {
	upperBound := f.largest.UserKey
	for j = j + 1; j < len(files); j++ {
		cur := &files[j-1]
@ajkr (Contributor) commented Apr 6, 2019:

Oh yeah, it was also clever to operate just on the compaction input files unlike my implementation, which copied expandInputs in that it referenced files in version.

During compaction, truncate range tombstones at read time to atomic
compaction unit boundaries. Truncating to atomic compaction unit
boundaries enforces the invariant that the boundaries of the output of a
compaction cannot be larger than the boundaries of the input. It also
enforces a tighter invariant regarding range tombstones not enlarging
beyond the boundaries of their atomic compaction unit during
compaction. This is necessary to prevent a range tombstone from
expanding after a compaction to cover a newer key at a lower level in
the LSM.

Fixes #26
Closes #27
@petermattis (Collaborator, Author):

@ajkr I've updated this PR and replaced truncatingRangeDelIter with a rangedel.Truncate method. This approach is more obviously correct, was somewhat easier to test exhaustively, and supports truncation to both the lower and upper boundaries. I've also added a test for the atomic compaction unit boundaries computation. PTAL.
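
In simplified form, the per-tombstone effect of such a Truncate; the real rangedel.Truncate operates on Pebble's fragmented internal-key tombstones rather than bare user keys:

```go
package sketch

import "bytes"

// span covers the user-key interval [start, end).
type span struct{ start, end []byte }

// truncate clamps a tombstone to [lower, upper); a nil bound leaves that
// side unclamped. It reports whether any part of the tombstone survives,
// so callers can drop fragments that become empty.
func truncate(t span, lower, upper []byte) (span, bool) {
	if lower != nil && bytes.Compare(t.start, lower) < 0 {
		t.start = lower
	}
	if upper != nil && bytes.Compare(t.end, upper) > 0 {
		t.end = upper
	}
	return t, bytes.Compare(t.start, t.end) < 0
}
```

Because truncation only ever shrinks tombstones, the behavior is straightforward to verify exhaustively over small key alphabets.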

@ajkr (Contributor) left a comment:

:lgtm_strong:

Reviewed 1 of 3 files at r2, 8 of 8 files at r3.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @petermattis)


testdata/compaction_atomic_unit_bounds, line 1 at r3 (raw file):

define

Thanks for the good test coverage.

@petermattis (Collaborator, Author):

TFTR!

petermattis merged commit 75b1248 into master, Apr 17, 2019
petermattis deleted the pmattis/compaction-bounds branch, April 17, 2019 21:39