Prevent sstable boundaries from expanding during compaction #67
Conversation
This change is so small that I feel I'm missing something. I believe RocksDB achieves the same invariant by truncating range tombstones at read time during compaction. For Pebble I think that truncation is unnecessary as we get implicit truncation to sstable boundaries in
Don't you want "can't be larger"?
compaction.go
Outdated
// Determine the largest key from the inputs in order to maintain the
// invariant that the outputs from a compaction have bounds that are within
// the bounds of the input. This is necessary because range tombstones are
// not truncated to sstable boundaries. A compaction could this result in the
thus
compaction.go
Outdated
// Bound the largest key of the sstable by the largest key of the
// compaction inputs. See the comment on largestInputKey for why this is
// necessary.
if db.InternalCompare(d.cmp, meta.largest, largestInputKey) > 0 {
What allows meta.largest
to exceed largestInputKey
in the first place? Is it always a range tombstone?
Maybe the comment on largestInputKey could elucidate this more and explain that naively we'd end up with the tombstone which reaches outside of the sstable creating a wider sstable as a result, but we don't want that.
Will try writing some test cases. The cases I'm worried about are when a compaction contains multiple atomic compaction units (in one level). This case (facebook/rocksdb#4432 (comment)) is close to what I have in mind but needs to be modified. The final L5->L6 couldn't cause the problem anymore due to how you mentioned we stop paying attention to a file's range tombstones when that file is closed. But I am imagining moving everything up a level, so that final step becomes an L4->L5. At that time the range tombstone (and its containing file) should expand such that a later L5->L6 could delete keys newer than the range tombstone.
Yeah, there may be a problem here. What if L1 also contained an sstable with the range [d-d)? We'd generate a new sstable in L2 with the bounds [a-d). I'm AFK for a bit, but I can't see why the approach in this PR would work.
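[Editor's note: to make the failure mode under discussion concrete, here is a hypothetical picture of the expansion; the keys and files are invented for illustration and are not the actual test case from this PR.]

```
L1:  sst1 [a, c)   contains RANGEDEL [a, d)   (tombstone extends past sst1's bounds)
L2:  sst2 [b, c)

Compacting sst1+sst2 into L2 without truncating the tombstone writes
RANGEDEL [a, d) into the output, so the output sstable's bounds become
[a, d) even though no input file's bounds reached d. A later compaction
of that wider sstable can then apply the tombstone over keys it was
never positioned above.
```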
The range tombstones for compaction are loaded by a specially configured levelIter. I bet we could wrap the rangeDelIter to truncate the tombstones to atomic compaction unit boundaries. This would be almost entirely contained in compaction.newInputIter. Let me try that and see how it works.
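[Editor's note: a minimal sketch of that wrapping idea follows. The Span and spanIter types are simplified stand-ins invented for this sketch, not Pebble's actual rangeDelIter interface.]

```go
package rangedeltrunc

import "bytes"

// Span is a simplified stand-in for a fragmented range tombstone
// deleting user keys in [Start, End).
type Span struct {
	Start, End []byte
}

// spanIter is the minimal iterator surface this sketch needs to wrap.
type spanIter interface {
	Next() *Span // returns nil once exhausted
}

// truncatingIter clips every tombstone it yields to upperBound, which
// stands in for the atomic compaction unit's upper boundary.
type truncatingIter struct {
	iter       spanIter
	upperBound []byte
}

func (t *truncatingIter) Next() *Span {
	for {
		s := t.iter.Next()
		if s == nil {
			return nil
		}
		// The tombstone lies entirely at or beyond the bound: drop it.
		if bytes.Compare(s.Start, t.upperBound) >= 0 {
			continue
		}
		// The tombstone straddles the bound: clip its end.
		if bytes.Compare(s.End, t.upperBound) > 0 {
			return &Span{Start: s.Start, End: t.upperBound}
		}
		return s
	}
}
```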
@ajkr You are right. Here is a slight variation on the test case present in this PR (which was copied from #27) which fails:
That final compaction should be producing L2 sstables with the bounds
Thanks, I was also just trying the same thing you suggested: ajkr@4646421. (Side note: if one accidentally uses
BTW, I've already written code that pushes atomic compaction unit boundaries down to the specially configured level iterator. I just hadn't decided yet where to do the actual truncation. Here it is in case any of it is useful: ajkr@ba9e6a5
Force-pushed from 1d15831 to c805f3a
Heh, I didn't see your messages until just now. I just updated this commit to do the truncation to the atomic compaction unit upper bound by wrapping the rangeDelIter used during compaction. Do we have to truncate both the upper and lower bound, or just the upper bound? I think the lower bound truncation comes for free by the way tombstones are fragmented during compaction. Let me take a look at your commits.
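[Editor's note: if the lower bound turns out to need explicit truncation as well, the wrapper sketched earlier extends symmetrically. This reuses the hypothetical Span type from that sketch.]

```go
// clampLower clips a tombstone's start key to lowerBound; the
// symmetric counterpart to the upper-bound truncation sketched above.
func clampLower(s *Span, lowerBound []byte) *Span {
	if s == nil || bytes.Compare(s.End, lowerBound) <= 0 {
		// The tombstone lies entirely below the bound: drop it.
		return nil
	}
	if bytes.Compare(s.Start, lowerBound) < 0 {
		return &Span{Start: lowerBound, End: s.End}
	}
	return s
}
```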
LGTM.
compaction.go
Outdated
if rangeDelIter != nil {
	// Truncate the range tombstones returned by the iterator to the upper
	// bound of the atomic compaction unit.
	if upperBound := c.atomicCompactionUnitUpperBound(f); upperBound != nil {
Cool, I didn't think of computing this as the compaction is ongoing.
Yeah. We could precompute this in newInputIter and store the results in a map keyed by *fileMetadata, but that seems like overkill when compactions usually involve only a few inputs.
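[Editor's note: for reference, the map-based alternative being set aside would look roughly like the following. The types are simplified stand-ins and atomicCompactionUnitUpperBound is stubbed; only the shape of the precompute matters here.]

```go
package precompute

// fileMetadata and compaction are simplified stand-ins for Pebble's
// types; the field layout is an assumption for illustration only.
type fileMetadata struct {
	smallest, largest []byte
}

type compaction struct {
	// inputs[0] and inputs[1] hold the two input levels of the compaction.
	inputs [2][]fileMetadata
}

// atomicCompactionUnitUpperBound is stubbed so the sketch stands
// alone; the real computation is discussed elsewhere in this PR.
func (c *compaction) atomicCompactionUnitUpperBound(f *fileMetadata) []byte {
	return f.largest // placeholder, not the real logic
}

// precomputeUpperBounds computes each input file's upper bound once,
// e.g. in newInputIter, and caches it in a map keyed by *fileMetadata.
func precomputeUpperBounds(c *compaction) map[*fileMetadata][]byte {
	bounds := make(map[*fileMetadata][]byte)
	for i := range c.inputs {
		for j := range c.inputs[i] {
			f := &c.inputs[i][j]
			bounds[f] = c.atomicCompactionUnitUpperBound(f)
		}
	}
	return bounds
}
```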
@petermattis I have a vague memory that old versions of RocksDB don't even truncate their range tombstones if they extend to the left of the file, even though they could because truncating to the start key doesn't lose any information about whether the start key is covered. So maybe. It'd be hard to write a test because the input data would have to be generated by old RocksDB.
Ok. I think it is good to be paranoid here and truncate to the lower bound as well. I'll be AFK the rest of the afternoon and most of tomorrow. Let me know if you want to take over this PR. Otherwise I'll polish it off when I find time.
compaction.go
Outdated
if f == &files[j] {
	upperBound := f.largest.UserKey
	for j = j + 1; j < len(files); j++ {
		cur := &files[j-1]
Oh yeah, it was also clever to operate just on the compaction input files, unlike my implementation, which copied expandInputs in that it referenced files in version.
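[Editor's note: pulling the quoted fragment together, the boundary walk might look like the sketch below. Keys here are plain user keys, whereas the real code compares internal keys, so treat this as the shape of the loop rather than the actual implementation.]

```go
package atomicunit

import "bytes"

// fileMetadata is a simplified stand-in for Pebble's file metadata.
type fileMetadata struct {
	smallest, largest []byte
}

// atomicCompactionUnitUpperBound walks forward through a level's
// sorted input files, extending the upper bound while consecutive
// files share a boundary user key: such files form a single atomic
// compaction unit and must be compacted together.
func atomicCompactionUnitUpperBound(files []fileMetadata, f *fileMetadata) []byte {
	for j := range files {
		if f != &files[j] {
			continue
		}
		upperBound := f.largest
		for j++; j < len(files); j++ {
			cur := &files[j-1]
			next := &files[j]
			// A gap between cur and next ends the atomic unit.
			if bytes.Compare(cur.largest, next.smallest) < 0 {
				break
			}
			upperBound = next.largest
		}
		return upperBound
	}
	return nil
}
```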
During compaction, truncate range tombstones at read time to atomic compaction unit boundaries. Truncating to atomic compaction unit boundaries enforces the invariant that the boundaries of the output of a compaction cannot be larger than the boundaries of the input. It also enforces the tighter invariant that range tombstones do not enlarge beyond the boundaries of their atomic compaction unit during compaction. This is necessary to prevent a range tombstone from expanding after a compaction to cover a newer key at a lower level in the LSM.

Fixes #26
Closes #27
Force-pushed from c805f3a to 4b58d9c
@ajkr I've updated this PR and replaced the explicit bounding of the output sstable's largest key with truncation of the range tombstones during compaction.
Reviewed 1 of 3 files at r2, 8 of 8 files at r3.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @petermattis)
testdata/compaction_atomic_unit_bounds, line 1 at r3 (raw file):
define
Thanks for the good test coverage.
TFTR!
Enforce the invariant that the boundaries of the output of a compaction
cannot be larger than the boundaries of the input. This fixes a problem
where a compaction can cause a range tombstone to expand to cover a
newer key at a lower level in the LSM.
Fixes #26
Closes #27