Truncate range tombstones by leveraging InternalKeys #4432

Closed

Conversation

@abhimadan (Contributor) commented Sep 27, 2018:

Summary: To more accurately truncate range tombstones at SST boundaries,
we now represent them in RangeDelAggregator using InternalKeys, which
are end-key-exclusive as they were before this change.

During compaction, "atomic compaction unit boundaries" (the range of
keys contained in neighbouring and overlapping SSTs) are propagated down
to RangeDelAggregator to truncate range tombstones at those boundaries
instead. See #4432 (comment) and #4432 (comment)
for motivating examples.
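For illustration only, here is a minimal sketch of the truncation idea using hypothetical names (the real code keeps the endpoints as InternalKeys precisely because the unit's largest key is inclusive while a tombstone's end key is exclusive, so a user key straddling two SSTs needs the extra sequence-number precision):

#include <string>

struct UserKeyRange {
  std::string start;  // inclusive
  std::string end;    // exclusive
};

// Clamp a range tombstone to an atomic compaction unit's [smallest, largest]
// user-key span. A tombstone lying entirely outside the unit collapses to an
// empty range (start >= end) and is effectively dropped.
UserKeyRange TruncateToUnit(UserKeyRange tombstone,
                            const std::string& unit_smallest,
                            const std::string& unit_largest) {
  if (tombstone.start < unit_smallest) {
    tombstone.start = unit_smallest;
  }
  if (tombstone.end > unit_largest) {
    tombstone.end = unit_largest;
  }
  return tombstone;
}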

Test Plan: make -j64 check ; make -j64 range_del_aggregator_bench && ./range_del_aggregator_bench (before and after PR)

Before:

=========================
Results: 
=========================
AddTombstones:           578.669 us
ShouldDelete (first):    0.32449 us

After:

=========================
Results:
=========================
AddTombstones:           649.471 us
ShouldDelete (first):    0.29454 us

So using InternalKeyComparator results in a slight perf regression.

@@ -40,6 +40,18 @@ enum class RangeDelPositioningMode {
kBinarySearch,
};

// TODO: think about whether this is the right thing to do here, or if having a completely different type
// works better
struct TruncatedRangeTombstone : public RangeTombstone {
@ajkr (Contributor) commented Sep 27, 2018:

I think we'll want this to contain two ParsedInternalKeys to avoid copying during AddTombstones.

edit: the same goes for other places like ShouldDelete's argument and the collapsed range tombstone map's key. Basically I think a good strategy to eliminate copying is to use ParsedInternalKey everywhere.
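For illustration, a hedged sketch of the copy-free shape being suggested (member names are hypothetical; ParsedInternalKey and SequenceNumber come from db/dbformat.h, and a ParsedInternalKey holds a user-key Slice, a sequence number, and a value type, so no user-key strings need to be copied):

#include "db/dbformat.h"

namespace rocksdb {

// Hypothetical shape only: both endpoints stored as ParsedInternalKeys that
// alias the underlying sstable data. As noted below, the value-type field
// would simply go unused.
struct TruncatedRangeTombstone {
  ParsedInternalKey start_key_;  // inclusive
  ParsedInternalKey end_key_;    // exclusive
  SequenceNumber seq_ = 0;
};

}  // namespace rocksdb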

Contributor:

Ditto what @ajkr said. I was imagining we'd use ParsedInternalKey. The kind field won't be used, but that is ok.

@ajkr (Contributor) commented Sep 27, 2018:

also let's verify no perf regression compared to before this PR, and include results in "Test Plan". Here are a few commands I've used before:

  • build db_bench:
$ DEBUG_LEVEL=0 make -j64 db_bench
  • benchmark setup:
$ ./db_bench -db=/dev/shm/5k-rts -use_existing_db=false -benchmarks=filluniquerandom -write_buffer_size=1048576 -compression_type=lz4 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -value_size=112 -key_size=16 -block_size=4096 -level_compaction_dynamic_level_bytes=true -num=5000000 -max_background_jobs=12 -benchmark_write_rate_limit=20971520 -range_tombstone_width=100 -writes_per_range_tombstone=100 -max_num_range_tombstones=50000 -bloom_bits=8
  • benchmark random reads:
$ ./db_bench -db=/dev/shm/5k-rts/ -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=5000000 -reads=100000 -threads=32
  • benchmark iteration:
$ ./db_bench -db=/dev/shm/5k-rts/ -use_existing_db=true -benchmarks=seekrandom -seek_nexts=10 -disable_auto_compactions=true -num=5000000 -reads=100000 -threads=32

edit: or just use the microbenchmark tool, idk how I forgot about that.

- tombstone.start_key_ = iter_->first;
- tombstone.end_key_ = std::next(iter_)->first;
+ tombstone.start_key_ = ExtractUserKey(iter_->first);
+ tombstone.end_key_ = ExtractUserKey(std::next(iter_)->first);
Contributor:

This seems suspicious to me. Internally, the CollapsedRangeDelMap represents tombstones with internal-keys, but here you're truncating the tombstone again to be represented with user-keys. That will cause the start-key to cover more of the key space and end-key to cover less of it. Is this a problem? I'm not sure. I think this interface is only used when building an sstable and during compactions. During compactions, a truncated tombstone will only occur when the tombstone straddles two sstables. If the two sstables are an atomic unit (straddled by a user key) then the tombstone from the second sstable should blend with this one. If it is only the tombstone which straddles the sstable, then the truncation will have occurred on the tombstone sentinel key and everything should be ok as well. I think. This deserves additional thinking, testing, and commentary.

Contributor:

Thanks for making us think about this more. We think it'd be a problem in cases like this:

  • SST1 at L1 contains Puts at a@50, b@40, and b@30
  • SST2 at L1 contains Puts at b@20 and c@10
  • A DeleteRange from a to c at seqnum 45 is stored (untruncated) in SST1 and SST2
  • SST1 and SST2 are picked for L1->L2 compaction, along with some L2 files.
  • The first output is SST3, which ends at b@40. At this point the CompactionIterator's L1 LevelIterator is positioned at b@30, which is still in SST1. That means SST2 hasn't been opened, so its range tombstones haven't been added to the RangeDelAggregator. Then we would simply be writing out a range tombstone with end_key b to SST3, which would not cover b@40, causing that key to reappear.

Contributor:

Now we are learning the appeal of passing the atomic compaction unit boundaries as smallest and largest keys for AddTombstones :(.

Contributor:

I suppose we can combine the two approaches. Use internal-keys for the endpoints of range tombstones in RangeDelAggregator and pass the compaction boundaries as the largest and smallest keys to AddTombstones when creating the RangeDelAggregator used for compaction. Looking at the code, it appears to be a bit irritating to plumb the compaction boundaries down to the AddTombstones call, but CompactionJob creates the RangeDelAggregator and it could call a constructor or some other method that says "use these boundaries for truncation and ignore the boundaries passed to AddTombstones".
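Roughly the shape of the override being described, as a hedged sketch (class and method names are hypothetical, not the code that was ultimately written):

#include <utility>

#include "db/dbformat.h"

namespace rocksdb {

// Hypothetical helper: CompactionJob installs compaction-wide truncation
// bounds once, and the aggregator prefers them over the per-file bounds
// handed to each AddTombstones call.
class TruncationBoundsOverride {
 public:
  void Set(const InternalKey& smallest, const InternalKey& largest) {
    smallest_ = smallest;
    largest_ = largest;
    has_override_ = true;
  }

  // Returns the bounds that tombstones from one file should be clamped to.
  std::pair<const InternalKey*, const InternalKey*> Pick(
      const InternalKey* file_smallest,
      const InternalKey* file_largest) const {
    return has_override_ ? std::make_pair(&smallest_, &largest_)
                         : std::make_pair(file_smallest, file_largest);
  }

 private:
  InternalKey smallest_;
  InternalKey largest_;
  bool has_override_ = false;
};

}  // namespace rocksdb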

@ajkr (Contributor) commented Sep 28, 2018:

@petermattis - yup, we were planning a hybrid approach, too. Should've written it here, sorry.

I also thought a bit about having overrides in RangeDelAggregator. My feeling is the boundaries we pass to AddTombstones should be specific to the file's level, though. The high-level concern with compaction-wide overrides is that when the compaction's output level extends the compaction's range far to the right, it could cause range tombstones from the input level to extend arbitrarily rightwards after compaction happens. I should come up with a specific example though.

@ajkr (Contributor) commented Sep 28, 2018:

Actually, now I am seeing problems even with using the level's compaction boundaries as smallest/largest. Will try to make this more concrete after dinner :p, but the high-level problem is when a level's files do not all have overlapping endpoints, then clamping based on atomic unit boundaries would give tighter tombstones than clamping based on compaction level boundaries.

@ajkr (Contributor) commented Sep 28, 2018:

Example as promised:

  • LSM has 7 levels, so L6 is bottommost
  • L5 has two SSTs: SST1 has Puts from a to c. SST2 has Puts from d to f. A range tombstone from b to e is stored in both SSTs. Note SST1's largest_key is extended to d@kMaxSeqnum due to this range tombstone.
  • L6 has one SST: SST3 has Puts from b to f.
  • L5 needs compaction. An L5->L6 compaction is chosen. Let's say SST2 is chosen from the input level, so the overlapping file chosen from the output level is SST3. Since this is a compaction to the bottommost level, the keys from SST2 in the range [d, e) that are newer than the range tombstone will be written out with seqnum zero.
  • The L5->L6 compaction finishes, creating SST4 spanning b to f.
  • New data comes in the range [d, f], making its way down through memtable, L0, ..., L5. Now we have SST5 spanning d to f in L5.
  • L5 needs compaction again. An L5->L6 compaction is chosen. This time, let's say SST1 is chosen. It overlaps with SST4 in L6. Now we have an optimization that goes back to the input level and expands the compaction to include any other files that are fully contained in the compaction's key-range. The compaction's key-range so far is [a, f], and SST5's key-range is [d, f], so SST5 is pulled in.
  • The compaction runs, the keys in SST4 in the range [d, e) that were originally newer than the range tombstone get nuked, and now we have data loss.

Sorry, @abhimadan, for advocating solving this problem in RangeDelAggregator -- I wasn't able to foresee (and am still quite surprised by) all these intricate edge cases. My feeling now is our best hope is Peter's original proposal of precomputing atomic compaction unit boundaries, and using those for AddTombstones's smallest/largest. What do you think?

@abhimadan (Contributor, Author):

Wow, that's a complex edge case. Thanks for catching that!

Actually, I'm going to propose another hybrid approach here: rather than precomputing atomic compaction boundaries, we compute them at compaction time. Then we won't have to worry about extra bookkeeping in FileMetaData. After these boundaries are computed, plumbing them to AddTombstones is quite similar to the work I've already done on using "full" compaction boundaries. While we could potentially cache them in FileMetaData as suggested, I'd argue that's not necessary: the files in the input level will no longer exist after the compaction, and the files in the output level will have changed, so their FileMetaData would need to be updated accordingly anyway.

I like this idea since it's simple and (hopefully) cheap to compute on every compaction: we can just do this work while we iterate through the FileMetaData in a level. (On the other hand, it's quite late at the time of writing and I may have missed something important.) Thoughts?
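A hedged sketch of what that per-level computation might look like (names are hypothetical and this is not the final diff, but it mirrors the idea that neighbouring files belong to one atomic unit when a user key straddles their shared boundary):

#include <vector>

#include "db/version_edit.h"
#include "rocksdb/comparator.h"
#include "rocksdb/slice.h"

namespace rocksdb {

struct AtomicUnitBoundary {
  Slice smallest_user_key;
  Slice largest_user_key;
};

// Walk one level's sorted files once, merging a file into the current unit
// whenever its first user key equals the previous file's last user key
// (i.e. a point key or straddling range tombstone ties the files together).
std::vector<AtomicUnitBoundary> ComputeAtomicUnits(
    const std::vector<FileMetaData*>& files, const Comparator* ucmp) {
  std::vector<AtomicUnitBoundary> units;
  for (const FileMetaData* f : files) {
    Slice start = f->smallest.user_key();
    Slice end = f->largest.user_key();
    if (!units.empty() &&
        ucmp->Compare(units.back().largest_user_key, start) == 0) {
      units.back().largest_user_key = end;  // extend the current unit
    } else {
      units.push_back({start, end});
    }
  }
  return units;
}

}  // namespace rocksdb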

Summary: To more accurately truncate range tombstones at SST boundaries,
we now represent them in RangeDelAggregator using InternalKeys, which
are end-key-exclusive as they were before this change.

During compaction, "atomic compaction unit boundaries" (the range of
keys contained in neighbouring and overlapping SSTs) are propagated down
to RangeDelAggregator to truncate range tombstones at those boundaries
instead. See
facebook#4432 (comment)
and facebook#4432 (comment)
for motivating examples.

Test Plan: make -j64 check
@abhimadan force-pushed the range-del-agg-trunc-fix-intkey branch from e6c4c08 to f502dd6 on October 1, 2018 20:39
@abhimadan (Contributor, Author) commented:

I've revamped this PR after all the discussion above to avoid extraneous string copying, and to be more generous with L1+ tombstone truncation during compactions. I want to write tests for the atomic compaction unit boundary logic, but at least RangeDelAggregator should have more comprehensive tests now.

@ajkr (Contributor) left a comment:

Nice work! I can't think of any problems with this approach.

db/compaction.cc Outdated
const auto* f = inputs[i].files[j];
const Slice& start_user_key = f->smallest.user_key();
const Slice& end_user_key = f->largest.user_key();
if (first_atomic_idx == j) {
Contributor:

This can only happen when j == 0, right?

return true;
} else {
Contributor:

maybe the else { was accidentally added?

@abhimadan (Contributor, Author):

Yup, this was part of some debug code that I removed (but I forgot to remove this extraneous else clause). Removed.

db/compaction.cc Outdated
// First file in an atomic compaction unit.
cur_boundary.smallest = start_user_key;
cur_boundary.largest = end_user_key;
} else if (ucmp->Compare(cur_boundary.largest, start_user_key) == 0 &&
Contributor:

Do you think it's possible to reuse sstableKeyCompare here?

@abhimadan (Contributor, Author):

Yes it is possible! Thanks for pointing that out.

end_key.user_key = parsed_largest.user_key;
if (parsed_largest.sequence != kMaxSequenceNumber) {
// The same user key straddles two adjacent sstables. To make sure we
// can truncate to a range that includes the largest range tombstone,
Contributor:

do you mean truncate to a range that includes the largest point-key?

}
}
}

RangeDelAggregator range_del_agg(icmp, {} /* snapshots */,
false /* collapse_deletions */);
AddTombstones(&range_del_agg, range_dels_in);
for (const auto& args : all_args) {
AddTombstones(&range_del_agg, args.tombstones);
Contributor:

should we also pass smallest and largest here?

@facebook-github-bot left a comment:

abhimadan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

If SSTs overlap, we need these tombstones to cover keys in the current SST.
RangeDelAggregator's truncation will deal with extraneous tombstones added this
way by truncating them to be empty.

@abhimadan deleted the range-del-agg-trunc-fix-intkey branch on October 9, 2018 22:35
@ajkr (Contributor) commented Oct 10, 2018:

I forgot something - we should mention the bug fix in HISTORY.md.

// An UncollapsedRangeDelMap is quick to create but slow to answer ShouldDelete
// queries.
class UncollapsedRangeDelMap : public RangeDelMap {
typedef std::multiset<RangeTombstone, TombstoneStartKeyComparator> Rep;
typedef std::multiset<TruncatedRangeTombstone, TombstoneStartKeyComparator>
Contributor:

rocksdb-contrun-unity is complaining about TombstoneStartKeyComparator being defined in an anonymous namespace but used here. Is there anything special about this implementation of TombstoneStartKeyComparator that makes you want to keep it in an anonymous namespace?
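For context, a generic illustration of why unity builds dislike this pattern (this is the general hazard, not necessarily the exact failure rocksdb-contrun-unity hit): a unity build concatenates many .cc files into one translation unit, so anonymous-namespace definitions that were private to separate files suddenly share a single scope.

// file_a.cc (illustrative)
namespace {
struct TombstoneStartKeyComparator { /* ... */ };
}  // namespace

// file_b.cc (illustrative)
namespace {
struct TombstoneStartKeyComparator { /* ... */ };  // fine when built separately
}  // namespace

// In a unity build, both files are compiled as one .cc, so the two
// definitions land in the same anonymous namespace and the compiler reports a
// redefinition or shadowing error. Moving the comparator out of the anonymous
// namespace (e.g. into the class or a named detail namespace) avoids the
// collision.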

@abhimadan (Contributor, Author):

I was getting a class shadowing warning from the linter, so I added an anonymous namespace in this class (but now that you mention this warning, I think I only saw the warning for ParsedInternalKeyComparator and not for TombstoneStartKeyComparator...). I'll make a new PR to address this (and update HISTORY.md).

abhimadan added a commit to abhimadan/rocksdb that referenced this pull request Oct 10, 2018
Follow-up to facebook#4432.

Test Plan: make unity_test
benesch pushed a commit to cockroachdb/rocksdb that referenced this pull request Oct 10, 2018
Summary:
To more accurately truncate range tombstones at SST boundaries,
we now represent them in RangeDelAggregator using InternalKeys, which
are end-key-exclusive as they were before this change.

During compaction, "atomic compaction unit boundaries" (the range of
keys contained in neighbouring and overlapping SSTs) are propagated down
to RangeDelAggregator to truncate range tombstones at those boundaries
instead. See facebook#4432 (comment) and facebook#4432 (comment)
for motivating examples.
Pull Request resolved: facebook#4432

Differential Revision: D10263952

Pulled By: abhimadan

fbshipit-source-id: 2fe85ff8a02b3a6a2de2edfe708012797a7bd579
facebook-github-bot pushed a commit that referenced this pull request Oct 10, 2018
Summary:
Follow-up to #4432.
Pull Request resolved: #4479

Differential Revision: D10304151

Pulled By: abhimadan

fbshipit-source-id: 3608b95c324702ca26791f95cb26dae1d49efbe7
petermattis added a commit to cockroachdb/rocksdb that referenced this pull request Oct 17, 2018
// Tombstones starting at upper_bound or later only need to be included
// in the next table. Break because subsequent tombstones will start
// even later.
ucmp->Compare(*upper_bound, tombstone.start_key_) < 0) {
Contributor:

@abhimadan @ajkr This change broke one of the tests in CockroachDB. The test sets up a scenario where there are 4 sstables:

0: "a000000000" - "c000010000"
6: "a000000000" - "a000009999"
6: "b000000000" - "b000009999"
6: "c000000000" - "c000009999"

The L0 sstable has a single key at a000000000 and a range deletion from [c000000000,c000010000). The test then performs a manual compaction from [c000000000,c000010000) (i.e. the same keys as the range deletion). The first step in the manual compaction compacts the L0 sstable to L5 which results in two sstables at L5:

#12: a000000000#1,1 c000000000#72057594037927935,15
#13: c000000000#3,0 c000010000#72057594037927935,15

This is where the problem occurs for the test. Prior to this line of code changing, the compaction generated L5 sstables that looked like:

#12: a000000000#1,1 a000000000#1,1
#13: c000000000#3,0 c000010000#72057594037927935,15

It looks like the range tombstone is being included in sstable #12 unnecessarily. I see that this change was made in order to fix the problem tested by KeyAtOverlappingEndpointReappears, though it looks to be pessimistic. In my test, upper_bound is being set to the min-key in the following table.

Contributor:

@petermattis - thanks for pointing this out. Just to confirm, the problem here is performance (we store an extra range tombstone and unnecessarily extend an SST file), not correctness, right?

Understood what you meant about comparing against next file's start key being pessimistic. Let's see if there's a way to compare against the current file's end key instead.

Contributor:

On second thought, then we'd need additional logic to handle tombstones fully in the gap between files. Will think a bit more.

Contributor:

Yes, the problem is performance, not correctness. In my test, when #12 is compacted from L5 to L6, it will compact against 2 sstables in L6 instead of 1. I think we can generate scenarios which produce arbitrarily large compactions due to this.

Contributor:

OK, we can probably revert this change, and fix KeyAtOverlappingEndpointReappears by having an additional condition that tombstones we skip should start strictly after the file's last point key.
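Loosely, the kind of condition being proposed, as a hedged sketch (helper and parameter names are hypothetical, and this is not necessarily what #4592 implements):

#include "rocksdb/comparator.h"
#include "rocksdb/slice.h"

namespace rocksdb {

// Defer a tombstone to the next output table only if it starts at or after
// upper_bound AND strictly after the last point key written to this table;
// otherwise keep it here so the table's own point keys stay covered.
bool ShouldDeferTombstone(const Comparator* ucmp, const Slice& tombstone_start,
                          const Slice* upper_bound,
                          const Slice* last_point_key_in_table) {
  return upper_bound != nullptr &&
         ucmp->Compare(*upper_bound, tombstone_start) <= 0 &&
         last_point_key_in_table != nullptr &&
         ucmp->Compare(*last_point_key_in_table, tombstone_start) < 0;
}

}  // namespace rocksdb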

Contributor:

proposed fix: #4592. BTW, if your test exercises RocksDB APIs directly, I'd be interested in copying it :P.

petermattis pushed a commit to cockroachdb/rocksdb that referenced this pull request Dec 5, 2018
Summary:
To more accurately truncate range tombstones at SST boundaries,
we now represent them in RangeDelAggregator using InternalKeys, which
are end-key-exclusive as they were before this change.

During compaction, "atomic compaction unit boundaries" (the range of
keys contained in neighbouring and overlapping SSTs) are propagated down
to RangeDelAggregator to truncate range tombstones at those boundaries
instead. See facebook#4432 (comment) and facebook#4432 (comment)
for motivating examples.
Pull Request resolved: facebook#4432

Differential Revision: D10263952

Pulled By: abhimadan

fbshipit-source-id: 2fe85ff8a02b3a6a2de2edfe708012797a7bd579
@ajkr mentioned this pull request on Jun 26, 2019.