-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storageccl: scan all versions of changed key in MVCCIncrementalIterator #45785
Conversation
Note that this PR differs slightly from the first implementation:
This should be followed up with another test that ensures that we're actually skipping some keys with this optimization. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm having problems submitting my pr due to flaky test -- so make sure you wait for that pr to hit the master.
// This means that for the incremental iterator to perform a Next or NextKey | ||
// will require only 1 extra NextKey invocation while they remain in | ||
// lockstep. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think this comment belongs outside this if statement?
I mean it talks about fastpath (iter_key == tbi_key).
Also, I think it's easier to read if you use e.g. "iter_key == tbi_key" in the comment as opposed to specifics (Compare)
// range tombstones. In this case we can SeekGE() the TBI to the main iterator. | ||
|
||
// NB: Seek() is expensive, and this is only acceptable since we expect | ||
// this case to rarely occur. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this important note belongs right at the beginning of this if statement -- it explains why we call Next, as opposed to seeking right away.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy, @pbardea, and @sumeerbhola)
c-deps/libroach/incremental_iterator.cc, line 105 at r2 (raw file):
Do you think this comment belongs outside this if statement?
I mean it talks about fastpath (iter_key == tbi_key).
I think there are three cases (if we have any incremental backup metrics we should keep a counter for each. Or maybe just log?):
- tbi was one behind the iter: The next call on tbi will make them equal. This is the fast path mentioned here. I think this comment could be elevated just before the call to
DBIterNext(time_bound_iter.get(), true /* skip_current_key_versions */);
since it motivates calling next instead of seek. This could be common if most keys are modified or the modifications are clustered in the key space. - the next call on tbi moves it far ahead of iter. This will be the common case when modifications are sparse in the key space. This may turn out to be the real common case -- in which case the fast path did not help.
- the next call on tbi keeps it behind the iter (should be very rare).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update on this issue: I've moved the comments around based on the comments, but I am still seeing the libroach assertion failing on this branch. Investigating further.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)
c-deps/libroach/incremental_iterator.cc, line 113 at r2 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
I feel like this important note belongs right at the beginning of this if statement -- it explains why we call Next, as opposed to seeking right away.
Done.
9b5e934
to
d117fce
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @pbardea)
c-deps/libroach/incremental_iterator.h, line 22 at r3 (raw file):
which is used as an oracle to test DBIncrementalIterator.
Is it going to be used only as an oracle? Or, eventually, it will be the iterator to be used w/ pebble?
c-deps/libroach/incremental_iterator.cc, line 49 at r4 (raw file):
// boolean indicating whether t1 is less than t2. bool DBIncrementalIterator::legacyTimestampIsLess(const cockroach::util::hlc::LegacyTimestamp& t1, const cockroach::util::hlc::LegacyTimestamp& t2) {
might wrapping these long lines? c++ usually tries to stay under 80 cols
c-deps/libroach/incremental_iterator.cc, line 56 at r4 (raw file):
// extractKey extracts the key portion of the mvcc_key and put it in key. It // returns a validity indicator. WARN_UNUSED_RESULT bool DBIncrementalIterator::extractKey(rocksdb::Slice mvcc_key, rocksdb::Slice *key) {
wrap?
c-deps/libroach/incremental_iterator.cc, line 250 at r4 (raw file):
} const rocksdb::Slice tbi_key(time_bound_iter->rep->key()); const rocksdb::Slice iter_key(ToSlice(key.key));
This line deserves an "NB: comment" ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)
c-deps/libroach/incremental_iterator.h, line 22 at r3 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
which is used as an oracle to test DBIncrementalIterator.
Is it going to be used only as an oracle? Or, eventually, it will be the iterator to be used w/ pebble?
It seems reasonable that it'll be used as the iterator when we move over to Pebble. I added a note which mentions that this iterator may be useful when Pebble becomes the default store.
c-deps/libroach/incremental_iterator.cc, line 49 at r4 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
might wrapping these long lines? c++ usually tries to stay under 80 cols
Done. I ran the formatter (https://github.com/cockroachdb/cockroach/blob/master/c-deps/libroach/README.md). I saw that there were formatting changes to be made to other files that I didn't touch - it may be worth doing another PR with those changes.
c-deps/libroach/incremental_iterator.cc, line 56 at r4 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
wrap?
Same as previous comment.
c-deps/libroach/incremental_iterator.cc, line 250 at r4 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
This line deserves an "NB: comment" ...
Done.
This latest implementation of MVCCIncrementalIterator updates its use of the time-bound iterator to only be used to skip over keys that were not modified at all within the time bound limits. Previously, the incremental iterator primarily used the time-bound iterator and then used a non-time-bound iterator (sanity_iter) to check that meta keys were properly read. See cockroachdb#43799 for discussion around why this may not be sufficient. This commit updates the implementation to use the time-bound iterator more conservatively. The 2 iterators that it now uses are `iter` (a non-time-bound iterator and `timeBoundIter`). The main iteration is now done with `iter`. When `iter` sees a new key, the `timeBoundIter` is incremented and if it goes further than `iter`, `iter` is `Seek`ed forward to the meta key of the new key. This means that `iter` should visit every version of every key that `timeBoundIter` visits. It should be noted that if many adjacent keys were modified, the above logic tries to avoid to do a `Seek` to the next key as a `Seek` is ~an order of magnitude more expensive than a `Next`. In cases where a lot of keys were modified, we expect to do ~double the work since 2 iterators would need to seek to every key. However, the expected use case for the MVCCIncrementalIterator is such that the _vast_ majority of the data is excluded from the range (as should be the case for incremental backups), since there is potentially years worth of data that can be skipped over, versus hours worth of data to include. Release note: None
These tests exercise the Go MVCCIncrementalIterator which is no longer used directly in CockroachDB. However, the MVCCIncrementalIterator is used as an oracle to test the C++ implementation. Therefore, it would still be beneficial to have these tests present as long as we plan to keep these 2 implementations in sync. Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)
c-deps/libroach/incremental_iterator.cc, line 105 at r2 (raw file):
Previously, sumeerbhola wrote…
Do you think this comment belongs outside this if statement?
I mean it talks about fastpath (iter_key == tbi_key).I think there are three cases (if we have any incremental backup metrics we should keep a counter for each. Or maybe just log?):
- tbi was one behind the iter: The next call on tbi will make them equal. This is the fast path mentioned here. I think this comment could be elevated just before the call to
DBIterNext(time_bound_iter.get(), true /* skip_current_key_versions */);
since it motivates calling next instead of seek. This could be common if most keys are modified or the modifications are clustered in the key space.- the next call on tbi moves it far ahead of iter. This will be the common case when modifications are sparse in the key space. This may turn out to be the real common case -- in which case the fast path did not help.
- the next call on tbi keeps it behind the iter (should be very rare).
Updated the comments based on these suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)
Within the current BACKUP/RESTORE feature offering, the way to reduce RPO is to backup more often. For this to work for more frequent durations, the cost of backing up what changed must scale with the size what changed, as opposed to the total data size. This is exactly what the time-bound iterator optimization is supposed to deliver -- by recording the span of timestamps that appear in any given SSTable and then installing a filter to only open sstables that contain relevant times, we can ignore irrelevant data very cheaply. However over 2017 and 2018, we encountered a few tricky correctness issues in the interaction of the time-bound iterator and our MVCC logic and handling intents, and generally lost confidence in this code, ultimately disabling its use in the scans used by BACKUP. However cockroachdb#45785 resolved those concerns by ensuring that every key emitted by the IncrementalIterator was actually read by a normal, non-TBI iterator even when TBI is used. Thus it is now believed to be safe to re-enable TBI for incremental backups. Release note (enterprise change): Incremental BACKUP can quickly skip of unchanged data making frequent incremental BACKUPs 10-100x faster depending on data-size and frequency. Release justification: low-risk and high-impact.
46108: backup: re-enable fast incremental BACKUP via TBI r=dt a=dt Within the current BACKUP/RESTORE feature offering, the way to reduce RPO is to backup more often. For this to work for more frequent durations, the cost of backing up what changed must scale with the size what changed, as opposed to the total data size. This is exactly what the time-bound iterator optimization is supposed to deliver -- by recording the span of timestamps that appear in any given SSTable and then installing a filter to only open sstables that contain relevant times, we can ignore irrelevant data very cheaply. However over 2017 and 2018, we encountered a few tricky correctness issues in the interaction of the time-bound iterator and our MVCC logic and handling intents, and generally lost confidence in this code, ultimately disabling its use in the scans used by BACKUP. However #45785 resolved those concerns by ensuring that every key emitted by the IncrementalIterator was actually read by a normal, non-TBI iterator even when TBI is used. Thus it is now believed to be safe to re-enable TBI for incremental backups. Closes #43799. Release note (enterprise change): Incremental BACKUP can quickly skip of unchanged data making frequent incremental BACKUPs 10-100x faster depending on data-size and frequency. Release justification: low-risk and high-impact. Co-authored-by: David Taylor <[email protected]>
Within the current BACKUP/RESTORE feature offering, the way to reduce RPO is to backup more often. For this to work for more frequent durations, the cost of backing up what changed must scale with the size what changed, as opposed to the total data size. This is exactly what the time-bound iterator optimization is supposed to deliver -- by recording the span of timestamps that appear in any given SSTable and then installing a filter to only open sstables that contain relevant times, we can ignore irrelevant data very cheaply. However over 2017 and 2018, we encountered a few tricky correctness issues in the interaction of the time-bound iterator and our MVCC logic and handling intents, and generally lost confidence in this code, ultimately disabling its use in the scans used by BACKUP. However cockroachdb#45785 resolved those concerns by ensuring that every key emitted by the IncrementalIterator was actually read by a normal, non-TBI iterator even when TBI is used. Thus it is now believed to be safe to re-enable TBI for incremental backups. Release note (enterprise change): Incremental BACKUP can quickly skip of unchanged data making frequent incremental BACKUPs 10-100x faster depending on data-size and frequency. Release justification: low-risk and high-impact.
This PR re-introduces #45163. It needs to be merged after #45759.
This latest implementation of MVCCIncrementalIterator updates its use of
the time-bound iterator to only be used to skip over keys that were not
modified at all within the time bound limits.
Previously, the incremental iterator primarily used the time-bound
iterator and then used a non-time-bound iterator (sanity_iter) to check
that meta keys were properly read. See #43799 for discussion around why
this may not be sufficient.
This commit updates the implementation to use the time-bound iterator
more conservatively. The 2 iterators that it now uses are
iter
(anon-time-bound iterator and
timeBoundIter
). The main iteration is nowdone with
iter
. Wheniter
sees a new key, thetimeBoundIter
isincremented and if it goes further than
iter
,iter
isSeek
edforward to the meta key of the new key. This means that
iter
shouldvisit every version of every key that
timeBoundIter
visits.It should be noted that if many adjacent keys were modified, the above
logic tries to avoid to do a
Seek
to the next key as aSeek
is ~anorder of magnitude more expensive than a
Next
.In cases where a lot of keys were modified, we expect to do ~double the
work since 2 iterators would need to seek to every key. However, the
expected use case for the MVCCIncrementalIterator is such that the
vast majority of the data is excluded from the range (as should be the
case for incremental backups), since there is potentially years worth of
data that can be skipped over, versus hours worth of data to include.
Release note: None