Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storageccl: scan all versions of changed key in MVCCIncrementalIterator #45785

Merged
merged 2 commits into from
Mar 9, 2020

Conversation

pbardea
Copy link
Contributor

@pbardea pbardea commented Mar 6, 2020

This PR re-introduces #45163. It needs to be merged after #45759.

This latest implementation of MVCCIncrementalIterator updates its use of
the time-bound iterator to only be used to skip over keys that were not
modified at all within the time bound limits.

Previously, the incremental iterator primarily used the time-bound
iterator and then used a non-time-bound iterator (sanity_iter) to check
that meta keys were properly read. See #43799 for discussion around why
this may not be sufficient.

This commit updates the implementation to use the time-bound iterator
more conservatively. The 2 iterators that it now uses are iter (a
non-time-bound iterator and timeBoundIter). The main iteration is now
done with iter. When iter sees a new key, the timeBoundIter is
incremented and if it goes further than iter, iter is Seeked
forward to the meta key of the new key. This means that iter should
visit every version of every key that timeBoundIter visits.

It should be noted that if many adjacent keys were modified, the above
logic tries to avoid to do a Seek to the next key as a Seek is ~an
order of magnitude more expensive than a Next.

In cases where a lot of keys were modified, we expect to do ~double the
work since 2 iterators would need to seek to every key. However, the
expected use case for the MVCCIncrementalIterator is such that the
vast majority of the data is excluded from the range (as should be the
case for incremental backups), since there is potentially years worth of
data that can be skipped over, versus hours worth of data to include.

Release note: None

@pbardea pbardea requested review from miretskiy and sumeerbhola March 6, 2020 07:53
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@pbardea
Copy link
Contributor Author

pbardea commented Mar 6, 2020

Note that this PR differs slightly from the first implementation:

diff --git a/c-deps/libroach/incremental_iterator.cc b/c-deps/libroach/incremental_iterator.cc
index 837f2bffa8..b613dfbad3 100644
--- a/c-deps/libroach/incremental_iterator.cc
+++ b/c-deps/libroach/incremental_iterator.cc
@@ -79,7 +79,7 @@ void DBIncrementalIterator::maybeSkipKeys() {
     return;
   }
   rocksdb::Slice iter_key;
-  if(!extractKey(time_bound_iter->rep->key(), &iter_key)) {
+  if(!extractKey(iter->rep->key(), &iter_key)) {
     return;
   }

This should be followed up with another test that ensures that we're actually skipping some keys with this optimization.

Copy link
Contributor

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having problems submitting my pr due to flaky test -- so make sure you wait for that pr to hit the master.

// This means that for the incremental iterator to perform a Next or NextKey
// will require only 1 extra NextKey invocation while they remain in
// lockstep.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think this comment belongs outside this if statement?
I mean it talks about fastpath (iter_key == tbi_key).

Also, I think it's easier to read if you use e.g. "iter_key == tbi_key" in the comment as opposed to specifics (Compare)

// range tombstones. In this case we can SeekGE() the TBI to the main iterator.

// NB: Seek() is expensive, and this is only acceptable since we expect
// this case to rarely occur.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this important note belongs right at the beginning of this if statement -- it explains why we call Next, as opposed to seeking right away.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy, @pbardea, and @sumeerbhola)


c-deps/libroach/incremental_iterator.cc, line 105 at r2 (raw file):

Do you think this comment belongs outside this if statement?
I mean it talks about fastpath (iter_key == tbi_key).

I think there are three cases (if we have any incremental backup metrics we should keep a counter for each. Or maybe just log?):

  • tbi was one behind the iter: The next call on tbi will make them equal. This is the fast path mentioned here. I think this comment could be elevated just before the call to DBIterNext(time_bound_iter.get(), true /* skip_current_key_versions */); since it motivates calling next instead of seek. This could be common if most keys are modified or the modifications are clustered in the key space.
  • the next call on tbi moves it far ahead of iter. This will be the common case when modifications are sparse in the key space. This may turn out to be the real common case -- in which case the fast path did not help.
  • the next call on tbi keeps it behind the iter (should be very rare).

Copy link
Contributor Author

@pbardea pbardea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update on this issue: I've moved the comments around based on the comments, but I am still seeing the libroach assertion failing on this branch. Investigating further.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)


c-deps/libroach/incremental_iterator.cc, line 113 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

I feel like this important note belongs right at the beginning of this if statement -- it explains why we call Next, as opposed to seeking right away.

Done.

@pbardea pbardea force-pushed the tbi branch 2 times, most recently from 9b5e934 to d117fce Compare March 9, 2020 13:19
Copy link
Contributor

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @pbardea)


c-deps/libroach/incremental_iterator.h, line 22 at r3 (raw file):

which is used as an oracle to test DBIncrementalIterator.

Is it going to be used only as an oracle? Or, eventually, it will be the iterator to be used w/ pebble?


c-deps/libroach/incremental_iterator.cc, line 49 at r4 (raw file):

// boolean indicating whether t1 is less than t2.
bool DBIncrementalIterator::legacyTimestampIsLess(const cockroach::util::hlc::LegacyTimestamp& t1,
                                                  const cockroach::util::hlc::LegacyTimestamp& t2) {

might wrapping these long lines? c++ usually tries to stay under 80 cols


c-deps/libroach/incremental_iterator.cc, line 56 at r4 (raw file):

// extractKey extracts the key portion of the mvcc_key and put it in key. It
// returns a validity indicator.
WARN_UNUSED_RESULT bool DBIncrementalIterator::extractKey(rocksdb::Slice mvcc_key, rocksdb::Slice *key) {

wrap?


c-deps/libroach/incremental_iterator.cc, line 250 at r4 (raw file):

    }
    const rocksdb::Slice tbi_key(time_bound_iter->rep->key());
    const rocksdb::Slice iter_key(ToSlice(key.key));

This line deserves an "NB: comment" ...

Copy link
Contributor Author

@pbardea pbardea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)


c-deps/libroach/incremental_iterator.h, line 22 at r3 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…
which is used as an oracle to test DBIncrementalIterator.

Is it going to be used only as an oracle? Or, eventually, it will be the iterator to be used w/ pebble?

It seems reasonable that it'll be used as the iterator when we move over to Pebble. I added a note which mentions that this iterator may be useful when Pebble becomes the default store.


c-deps/libroach/incremental_iterator.cc, line 49 at r4 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

might wrapping these long lines? c++ usually tries to stay under 80 cols

Done. I ran the formatter (https://github.com/cockroachdb/cockroach/blob/master/c-deps/libroach/README.md). I saw that there were formatting changes to be made to other files that I didn't touch - it may be worth doing another PR with those changes.


c-deps/libroach/incremental_iterator.cc, line 56 at r4 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

wrap?

Same as previous comment.


c-deps/libroach/incremental_iterator.cc, line 250 at r4 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

This line deserves an "NB: comment" ...

Done.

pbardea added 2 commits March 9, 2020 11:52
This latest implementation of MVCCIncrementalIterator updates its use of
the time-bound iterator to only be used to skip over keys that were not
modified at all within the time bound limits.

Previously, the incremental iterator primarily used the time-bound
iterator and then used a non-time-bound iterator (sanity_iter) to check
that meta keys were properly read. See cockroachdb#43799 for discussion around why
this may not be sufficient.

This commit updates the implementation to use the time-bound iterator
more conservatively. The 2 iterators that it now uses are `iter` (a
non-time-bound iterator and `timeBoundIter`). The main iteration is now
done with `iter`. When `iter` sees a new key, the `timeBoundIter` is
incremented and if it goes further than `iter`, `iter` is `Seek`ed
forward to the meta key of the new key. This means that `iter` should
visit every version of every key that `timeBoundIter` visits.

It should be noted that if many adjacent keys were modified, the above
logic tries to avoid to do a `Seek` to the next key as a `Seek` is ~an
order of magnitude more expensive than a `Next`.

In cases where a lot of keys were modified, we expect to do ~double the
work since 2 iterators would need to seek to every key. However, the
expected use case for the MVCCIncrementalIterator is such that the
_vast_ majority of the data is excluded from the range (as should be the
case for incremental backups), since there is potentially years worth of
data that can be skipped over, versus hours worth of data to include.

Release note: None
These tests exercise the Go MVCCIncrementalIterator which is no longer
used directly in CockroachDB. However, the MVCCIncrementalIterator is
used as an oracle to test the C++ implementation. Therefore, it would
still be beneficial to have these tests present as long as we plan to
keep these 2 implementations in sync.

Release note: None
Copy link
Contributor Author

@pbardea pbardea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)


c-deps/libroach/incremental_iterator.cc, line 105 at r2 (raw file):

Previously, sumeerbhola wrote…

Do you think this comment belongs outside this if statement?
I mean it talks about fastpath (iter_key == tbi_key).

I think there are three cases (if we have any incremental backup metrics we should keep a counter for each. Or maybe just log?):

  • tbi was one behind the iter: The next call on tbi will make them equal. This is the fast path mentioned here. I think this comment could be elevated just before the call to DBIterNext(time_bound_iter.get(), true /* skip_current_key_versions */); since it motivates calling next instead of seek. This could be common if most keys are modified or the modifications are clustered in the key space.
  • the next call on tbi moves it far ahead of iter. This will be the common case when modifications are sparse in the key space. This may turn out to be the real common case -- in which case the fast path did not help.
  • the next call on tbi keeps it behind the iter (should be very rare).

Updated the comments based on these suggestions.

Copy link
Contributor

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)

@pbardea pbardea merged commit 6932a18 into cockroachdb:master Mar 9, 2020
@pbardea pbardea deleted the tbi branch March 9, 2020 18:04
dt added a commit to dt/cockroach that referenced this pull request Mar 14, 2020
Within the current BACKUP/RESTORE feature offering, the way to reduce
RPO is to backup more often. For this to work for more frequent
durations, the cost of backing up what changed must scale with the size
what changed, as opposed to the total data size. This is exactly what
the time-bound iterator optimization is supposed to deliver -- by
recording the span of timestamps that appear in any given SSTable and
then installing a filter to only open sstables that contain relevant
times, we can ignore irrelevant data very cheaply.

However over 2017 and 2018, we encountered a few tricky correctness
issues in the interaction of the time-bound iterator and our MVCC logic
and handling intents, and generally lost confidence in this code,
ultimately disabling its use in the scans used by BACKUP. However cockroachdb#45785
resolved those concerns by ensuring that every key emitted by the
IncrementalIterator was actually read by a normal, non-TBI iterator even
when TBI is used. Thus it is now believed to be safe to re-enable TBI
for incremental backups.

Release note (enterprise change): Incremental BACKUP can quickly skip of unchanged data making frequent incremental BACKUPs 10-100x faster depending on data-size and frequency.

Release justification: low-risk and high-impact.
craig bot pushed a commit that referenced this pull request Mar 16, 2020
46108: backup: re-enable fast incremental BACKUP via TBI r=dt a=dt

Within the current BACKUP/RESTORE feature offering, the way to reduce
RPO is to backup more often. For this to work for more frequent
durations, the cost of backing up what changed must scale with the size
what changed, as opposed to the total data size. This is exactly what
the time-bound iterator optimization is supposed to deliver -- by
recording the span of timestamps that appear in any given SSTable and
then installing a filter to only open sstables that contain relevant
times, we can ignore irrelevant data very cheaply.

However over 2017 and 2018, we encountered a few tricky correctness
issues in the interaction of the time-bound iterator and our MVCC logic
and handling intents, and generally lost confidence in this code,
ultimately disabling its use in the scans used by BACKUP. However #45785
resolved those concerns by ensuring that every key emitted by the
IncrementalIterator was actually read by a normal, non-TBI iterator even
when TBI is used. Thus it is now believed to be safe to re-enable TBI
for incremental backups.

Closes #43799.

Release note (enterprise change): Incremental BACKUP can quickly skip of unchanged data making frequent incremental BACKUPs 10-100x faster depending on data-size and frequency.

Release justification: low-risk and high-impact.

Co-authored-by: David Taylor <[email protected]>
dt added a commit to dt/cockroach that referenced this pull request Mar 21, 2020
Within the current BACKUP/RESTORE feature offering, the way to reduce
RPO is to backup more often. For this to work for more frequent
durations, the cost of backing up what changed must scale with the size
what changed, as opposed to the total data size. This is exactly what
the time-bound iterator optimization is supposed to deliver -- by
recording the span of timestamps that appear in any given SSTable and
then installing a filter to only open sstables that contain relevant
times, we can ignore irrelevant data very cheaply.

However over 2017 and 2018, we encountered a few tricky correctness
issues in the interaction of the time-bound iterator and our MVCC logic
and handling intents, and generally lost confidence in this code,
ultimately disabling its use in the scans used by BACKUP. However cockroachdb#45785
resolved those concerns by ensuring that every key emitted by the
IncrementalIterator was actually read by a normal, non-TBI iterator even
when TBI is used. Thus it is now believed to be safe to re-enable TBI
for incremental backups.

Release note (enterprise change): Incremental BACKUP can quickly skip of unchanged data making frequent incremental BACKUPs 10-100x faster depending on data-size and frequency.

Release justification: low-risk and high-impact.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants