-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage,libroach: update MVCCIncrementalIterator to look at every updated key #45163
Conversation
523d394
to
2f43ce5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry about the delayed review
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @pbardea and @sumeerbhola)
pkg/storage/engine/mvcc_incremental_iterator.go, line 56 at r2 (raw file):
// fields used for a workaround for a bug in the time-bound iterator // (#28358)
can you update the comment to also mention #43799
pkg/storage/engine/mvcc_incremental_iterator.go, line 85 at r2 (raw file):
var timeBoundIter Iterator if !opts.IterOptions.MinTimestampHint.IsEmpty() && !opts.IterOptions.MaxTimestampHint.IsEmpty() { // It is necessary for correctness that sanityIter be created before iter.
I don't think the ordering matters anymore.
Also, I think it would be good if there was a file level comment justifying the correctness, say some revised version of the following:
When using (t_s, t_e], say there is a version (committed or provisional) k@t where t is in that interval, that is visible to iter
. All sstables containing k@t will be included in timeBoundIter
. Note that there may be multiple sequence numbers for the key k@t
at the storage layer, say k@t#n1
, k@t#n2
, ... where n1 > n2 ... some of which may be deleted, but the latest sequence number will be visible using iter
(since not being visible would be a contradiction of the initial assumption). Since there is no delete across all sstables that deletes k@t#n1
, there is no delete in the subset of sstables used by timeBoundIter
that deletes k@t#n1
, so the timeBoundIter will see k@t
.
pkg/storage/engine/mvcc_incremental_iterator.go, line 117 at r2 (raw file):
func (i *MVCCIncrementalIterator) SeekGE(startKey MVCCKey) { if i.timeBoundIter != nil { i.timeBoundIter.SeekGE(startKey)
this needs to call i.timeBoundIter.Valid()
before calling i.timeBoundIter.Key()
pkg/storage/engine/mvcc_incremental_iterator.go, line 144 at r2 (raw file):
// key. func (i *MVCCIncrementalIterator) Next() { i.iter.Next()
the old code had iter
as the one constrained by the time bounds, and the new code has timeBoundIter
in that role. I think both Next()
and NextKey()
need to operate on timeBoundIter
and then call advance()
.
Given that MVCCIncrementalIterator
is trying to work with both time-bounds and non-time bounds maybe the original structuring of using iter
as either the time bounded one or not was the easier structure. Though I am wondering why it has to work in both cases. Can we just return a raw engine.Iterator
from NewMVCCIncrementalIterator()
if there are no time bounds? We could document as a code comment that the engine.Iterator
returned from this "new" function only supports a subset of the engine.Iterator
interface or explicitly make MVCCIncremetalIterator
an interface. Is it the special handling of intents that prevents this? If so, it would be good to document more completely the behavioral contract of this iterator.
pkg/storage/engine/mvcc_incremental_iterator.go, line 189 at r2 (raw file):
// Advance // Pre: TBI and Iter are pointing to the same key, but perhaps different versions.
I don't see how this precondition is satisfied by the call sites.
I think the precondition that we desire is that TBI is at the next key we could possibly return, but we have to validate using Iter and keep moving forward if the validation fails.
34f120d
to
159b5d9
Compare
836b509
to
d3dab9a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the first look! I've fixed up some of the algorithm issues and added more documentation to explain what the iterator is doing and what guarantees it makes. A lot of this was based on your first pass, so thanks for that review.
I think that this is RFAL. I've also request @miretskiy take a look to double check my C++. :)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @sumeerbhola)
pkg/storage/engine/mvcc_incremental_iterator.go, line 56 at r2 (raw file):
Previously, sumeerbhola wrote…
can you update the comment to also mention #43799
Done.
pkg/storage/engine/mvcc_incremental_iterator.go, line 85 at r2 (raw file):
Previously, sumeerbhola wrote…
I don't think the ordering matters anymore.
Also, I think it would be good if there was a file level comment justifying the correctness, say some revised version of the following:
When using (t_s, t_e], say there is a version (committed or provisional) k@t where t is in that interval, that is visible to
iter
. All sstables containing k@t will be included intimeBoundIter
. Note that there may be multiple sequence numbers for the keyk@t
at the storage layer, sayk@t#n1
,k@t#n2
, ... where n1 > n2 ... some of which may be deleted, but the latest sequence number will be visible usingiter
(since not being visible would be a contradiction of the initial assumption). Since there is no delete across all sstables that deletesk@t#n1
, there is no delete in the subset of sstables used bytimeBoundIter
that deletesk@t#n1
, so the timeBoundIter will seek@t
.
Done.
pkg/storage/engine/mvcc_incremental_iterator.go, line 117 at r2 (raw file):
Previously, sumeerbhola wrote…
this needs to call
i.timeBoundIter.Valid()
before callingi.timeBoundIter.Key()
Done.
pkg/storage/engine/mvcc_incremental_iterator.go, line 144 at r2 (raw file):
Previously, sumeerbhola wrote…
the old code had
iter
as the one constrained by the time bounds, and the new code hastimeBoundIter
in that role. I think bothNext()
andNextKey()
need to operate ontimeBoundIter
and then calladvance()
.Given that
MVCCIncrementalIterator
is trying to work with both time-bounds and non-time bounds maybe the original structuring of usingiter
as either the time bounded one or not was the easier structure. Though I am wondering why it has to work in both cases. Can we just return a rawengine.Iterator
fromNewMVCCIncrementalIterator()
if there are no time bounds? We could document as a code comment that theengine.Iterator
returned from this "new" function only supports a subset of theengine.Iterator
interface or explicitly makeMVCCIncremetalIterator
an interface. Is it the special handling of intents that prevents this? If so, it would be good to document more completely the behavioral contract of this iterator.
So, I believe that only iter
is needed to be explicitly incremented before the advance function. The time bound iterator, as I attempted to explain in the updated comments, is only updated in maybeSkipKeys
. In particular, I expect that iter
be incremented by either Next
or NextKey
exactly once between invocations of maybeSkipKeys
. iter
will either be incremented in Next
or NextKey
, or at the end of advance
in the case that we end up on a key outside the desired time bounds. The timeBoundIter
is only ever advanced by NextKey
(and so only visits one version of any key) when iter
gets ahead of it.
MVCCIncrementalIterator
behaves similarly to an engine.Iterator
with the exception that the MVCCIncrementalIterator will return an error if it encounters an inline value or an intent within the given time bounds. I've added comments to note that these errors are returned from MVCCIncrementalIterator.
pkg/storage/engine/mvcc_incremental_iterator.go, line 189 at r2 (raw file):
Previously, sumeerbhola wrote…
I don't see how this precondition is satisfied by the call sites.
I think the precondition that we desire is that TBI is at the next key we could possibly return, but we have to validate using Iter and keep moving forward if the validation fails.
Yes, this precondition wasn't quite correct and was left over from working through the initial implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy, @pbardea, and @sumeerbhola)
c-deps/libroach/incremental_iterator.h, line 32 at r3 (raw file):
large amount of keys
large number of keys?
c-deps/libroach/incremental_iterator.cc, line 28 at r3 (raw file):
write_intent(write_intent) { time_bound_iter.reset(nullptr);
Not needed. If you prefer to be explicit, feel free to initialize inside argument list..
Or, if you want to be explicit, initialize it above (before "engine(engine)") or set
time_bound_iter = nullptr right in the header file.
c-deps/libroach/incremental_iterator.cc, line 56 at r3 (raw file):
} // getKey returns the
we would never know what getKey returns... :)
c-deps/libroach/incremental_iterator.cc, line 82 at r3 (raw file):
// possibly different timestamp. void DBIncrementalIterator::maybeSkipKeys() { if (time_bound_iter != nullptr) {
I would do
if (tbi == nullptr) return
and de-indent the rest of the code.
c-deps/libroach/incremental_iterator.cc, line 86 at r3 (raw file):
// keys. std::string tbi_key = getKey(time_bound_iter->rep->key()); std::string iter_key = getKey(iter->rep->key());
It appears that getKey operates only on the rocksdb slices returned by the underlying rocks iterators. As such, I think it makes sense to have getKey helper return a slice instead of making
string keys. Besides the issues of unnecessary copies, strings are really bad at handling data
which might contain \0 characters (so I'm not even sure about the correctness of the code using strings instead of slices)
c-deps/libroach/incremental_iterator.cc, line 92 at r3 (raw file):
Quoted 4 lines of code…
DBIterNext(time_bound_iter.get(), true /* skip_current_key_versions */); if (!valid) { return; }
This doesn't look right to me
We call (our) DBIterNext helper, which happens to return a value which we ignore.
Then we check for !valid. Clearly, the underlying iterator might have some issues, but those are never reflected in the "valid" field of this incremental iterator.
It's kinda confusing, particularly, since we check for time_bound_iter->rep->Valid below.
What's also a bit confusing is that on one hand we call our helper (DBIterNext), but on the other hand we peak inside internal representation of the underlying iterator to see if the operation succeeded.
I think we should do one thing or the other... but not both.
I think we should do something like:
auto res = DBIterNext(...)
if (!res.valid) {
// handle error
status = res.status
valid = false
return
}
c-deps/libroach/incremental_iterator.cc, line 112 at r3 (raw file):
Quoted 4 lines of code…
// same as the main iterator, we may be able to skip over a large group // of keys. The main iterator is seeked to the TBI in hopes that many // keys were skipped. Note that a Seek is an order of magnitude more // expensive than a Next call.
tabs?
c-deps/libroach/incremental_iterator.cc, line 116 at r3 (raw file):
!iter_rep->status().ok()) {
any benefit to this vs
iter_rep->Valid() check?
c-deps/libroach/incremental_iterator.cc, line 224 at r3 (raw file):
DBIterState DBIncrementalIterator::seek(DBKey key) { if (time_bound_iter != nullptr) { DBIterSeek(time_bound_iter.get(), key);
Same comment as above:
auto res = DBIterSeek(...)
if (!res.valid) {
// handle error
}
c-deps/libroach/incremental_iterator.cc, line 233 at r3 (raw file):
const rocksdb::Slice iter_key(key.key.data, key.key.len); if (tbi_key.compare(iter_key) > 0) { key = ToDBKey(rocksdb::Slice(tbi_key));
tbi_key is already a slice -- no need to convert.
87c4c60
to
c618471
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy, @pbardea, and @sumeerbhola)
c-deps/libroach/incremental_iterator.h, line 32 at r3 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
large amount of keys
large number of keys?
Done.
c-deps/libroach/incremental_iterator.cc, line 28 at r3 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
Not needed. If you prefer to be explicit, feel free to initialize inside argument list..
Or, if you want to be explicit, initialize it above (before "engine(engine)") or set
time_bound_iter = nullptr right in the header file.
Done. Moved to header.
c-deps/libroach/incremental_iterator.cc, line 56 at r3 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
we would never know what getKey returns... :)
It's a secret. ;)
Updated.
c-deps/libroach/incremental_iterator.cc, line 82 at r3 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
I would do
if (tbi == nullptr) returnand de-indent the rest of the code.
Done.
c-deps/libroach/incremental_iterator.cc, line 86 at r3 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
It appears that getKey operates only on the rocksdb slices returned by the underlying rocks iterators. As such, I think it makes sense to have getKey helper return a slice instead of making
string keys. Besides the issues of unnecessary copies, strings are really bad at handling data
which might contain \0 characters (so I'm not even sure about the correctness of the code using strings instead of slices)
Done, there should be no more keys as strings.
c-deps/libroach/incremental_iterator.cc, line 92 at r3 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
DBIterNext(time_bound_iter.get(), true /* skip_current_key_versions */); if (!valid) { return; }
This doesn't look right to me
We call (our) DBIterNext helper, which happens to return a value which we ignore.
Then we check for !valid. Clearly, the underlying iterator might have some issues, but those are never reflected in the "valid" field of this incremental iterator.It's kinda confusing, particularly, since we check for time_bound_iter->rep->Valid below.
What's also a bit confusing is that on one hand we call our helper (DBIterNext), but on the other hand we peak inside internal representation of the underlying iterator to see if the operation succeeded.
I think we should do one thing or the other... but not both.
I think we should do something like:
auto res = DBIterNext(...)
if (!res.valid) {
// handle error
status = res.status
valid = false
return
}
Agreed. I did a pass to clean up our error checking throughout this file.
c-deps/libroach/incremental_iterator.cc, line 112 at r3 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
// same as the main iterator, we may be able to skip over a large group // of keys. The main iterator is seeked to the TBI in hopes that many // keys were skipped. Note that a Seek is an order of magnitude more // expensive than a Next call.
tabs?
Oops, this is why one should not mix IDEs. Fixed.
c-deps/libroach/incremental_iterator.cc, line 116 at r3 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
!iter_rep->status().ok()) {
any benefit to this vs
iter_rep->Valid() check?
Updated in error checking audit.
c-deps/libroach/incremental_iterator.cc, line 233 at r3 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
tbi_key is already a slice -- no need to convert.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @sumeerbhola)
c-deps/libroach/incremental_iterator.cc, line 224 at r3 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
Same comment as above:
auto res = DBIterSeek(...)
if (!res.valid) {
// handle error
}
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 2 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy, @pbardea, and @sumeerbhola)
c-deps/libroach/incremental_iterator.cc, line 233 at r3 (raw file):
y is already a slice -- no need to convert.
Can you pass tbi_key directly to ToDBKey?
key = ToDBKey(tbi_key)?
c-deps/libroach/incremental_iterator.cc, line 75 at r5 (raw file):
rocksdb::Slice DBIncrementalIterator::getKey(rocksdb::Slice mvcc_key) {
Looking more at getKey() calls, I think we should change its signature to:
bool DBIncrementalIterator::extractKey(rocksdb::Slice mvcc_key, rocksdb::Slice *key) WARN_UNUSED_RESULT
Pass the argument "*key' down to the SplitKey, and return valid indicator.
Basically, we're making multiple calls to getKey() which mutates internal validity state;
and we never check on that.
f0ef78d
to
e011e4d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 2 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy, @pbardea, and @sumeerbhola)
pkg/storage/engine/mvcc_incremental_iterator.go, line 106 at r5 (raw file):
var iter Iterator var timeBoundIter Iterator if !opts.IterOptions.MinTimestampHint.IsEmpty() && !opts.IterOptions.MaxTimestampHint.IsEmpty() {
the MinTimestampHint
and MaxTimestampHint
need to be consistent with StartTime
and EndTime
, yes? This structure is odd, since it looks like the same information is being provided twice -- can you add a code comment.
pkg/storage/engine/mvcc_incremental_iterator.go, line 141 at r5 (raw file):
// Seek to the first version of the key that TBI provided. startKey = MakeMVCCMetadataKey(tbiKey) }
I am assuming the startKey
that SeekGE
is called with must be a metadata key, otherwise the else case would also need to construct a metadata key (since we need to notice intents). If yes, can you add a comment?
pkg/storage/engine/mvcc_incremental_iterator.go, line 229 at r5 (raw file):
// advance advances the main iterator until it is referencing a key within // (start_time, end_time]. It may return an error if it encounters an
nit: "return an error" is not quite accurate since this function does not have a return value.
pkg/storage/engine/mvcc_incremental_iterator.go, line 180 at r6 (raw file):
// // Pre: The TBI should be pointing to either the same key or the key prior // to the same key as the main iterator. This is enforced by a combination
Did you mean to say that the TBI is at a key <= main iterator? The phrasing "the key prior" suggests that they only differ by a one call to NextKey()
?
pkg/storage/engine/mvcc_incremental_iterator.go, line 189 at r6 (raw file):
// possibly different timestamp. func (i *MVCCIncrementalIterator) maybeSkipKeys() { if i.timeBoundIter != nil {
can you structure this as follows to reduce the block nesting
if i.timeBoundIter == nil {
return
}
tbiKey := ...
iterKey := ...
if iterKey.Compare(tbiKey) <= 0 {
return
}
...
pkg/storage/engine/mvcc_incremental_iterator.go, line 209 at r6 (raw file):
// This means that for the incremental iterator to perform a Next or // NextKey in this case, it will require only 1 extra NextKey call to the // TBI.
I think there are two things that can break this lockstep behavior:
- when the TBI is seeing phantom
MVCCKey.Key
s, i.e., keys that are completely absent from the main iterator -- this can happen rarely due to aborted transactions and keys that have been subsumed due to range tombstones. - when the TBI has fewer keys than the the main iterator, which will be quite common.
This code doesn't handle the first case (and I think we need test coverage for it) -- it corresponds to iterKey.Compare(tbiKey) > 0
. Since this will be rare I think we should simply call i.timeBoundIter.SeekGE(MakeMVCCMetadataKey(iterKey))
. The code here is still correct, but just wasteful in causing the main iterator to iterate over keys that are not in the TBI.
pkg/storage/engine/mvcc_incremental_iterator.go, line 233 at r6 (raw file):
func (i *MVCCIncrementalIterator) advance() { for { if !i.valid {
I think the invariant is that advance()
is called when i.valid
is true, though i.iter.Valid()
may be true or false. In which case I don't think this if-block is needed. The only thing that will make i.valid
false is maybeSkipKeys()
and that is followed by an if-block that returns.
pkg/storage/engine/mvcc_incremental_iterator.go, line 280 at r6 (raw file):
i.iter.Next() continue }
can you add a comment like the following since the use of i.meta
is a bit confusing
// Else, i.meta contains an MVCC value and not an intent. This is handled next.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @sumeerbhola)
c-deps/libroach/incremental_iterator.cc, line 233 at r3 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
y is already a slice -- no need to convert.
Can you pass tbi_key directly to ToDBKey?key = ToDBKey(tbi_key)?
Oops, should be fixed now.
c-deps/libroach/incremental_iterator.cc, line 75 at r5 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
rocksdb::Slice DBIncrementalIterator::getKey(rocksdb::Slice mvcc_key) {
Looking more at getKey() calls, I think we should change its signature to:
bool DBIncrementalIterator::extractKey(rocksdb::Slice mvcc_key, rocksdb::Slice *key) WARN_UNUSED_RESULT
Pass the argument "*key' down to the SplitKey, and return valid indicator.
Basically, we're making multiple calls to getKey() which mutates internal validity state;
and we never check on that.
Ah, nice - updated!
pkg/storage/engine/mvcc_incremental_iterator.go, line 106 at r5 (raw file):
Previously, sumeerbhola wrote…
the
MinTimestampHint
andMaxTimestampHint
need to be consistent withStartTime
andEndTime
, yes? This structure is odd, since it looks like the same information is being provided twice -- can you add a code comment.
I believe that the requirement is that the timestamp hint range should not be more restrictive than the range between StartTime
and EndTime
, though I would expect callers in practice to have these consistent.
pkg/storage/engine/mvcc_incremental_iterator.go, line 141 at r5 (raw file):
Previously, sumeerbhola wrote…
I am assuming the
startKey
thatSeekGE
is called with must be a metadata key, otherwise the else case would also need to construct a metadata key (since we need to notice intents). If yes, can you add a comment?
Yep, I've added a comment.
pkg/storage/engine/mvcc_incremental_iterator.go, line 229 at r5 (raw file):
Previously, sumeerbhola wrote…
nit: "return an error" is not quite accurate since this function does not have a return value.
Done.
pkg/storage/engine/mvcc_incremental_iterator.go, line 180 at r6 (raw file):
Previously, sumeerbhola wrote…
Did you mean to say that the TBI is at a key <= main iterator? The phrasing "the key prior" suggests that they only differ by a one call to
NextKey()
?
Yep, updated the comment on the C++ side, but not on the go side. Updated here as well now.
pkg/storage/engine/mvcc_incremental_iterator.go, line 189 at r6 (raw file):
Previously, sumeerbhola wrote…
can you structure this as follows to reduce the block nesting
if i.timeBoundIter == nil { return } tbiKey := ... iterKey := ... if iterKey.Compare(tbiKey) <= 0 { return } ...
Done.
pkg/storage/engine/mvcc_incremental_iterator.go, line 209 at r6 (raw file):
Previously, sumeerbhola wrote…
I think there are two things that can break this lockstep behavior:
- when the TBI is seeing phantom
MVCCKey.Key
s, i.e., keys that are completely absent from the main iterator -- this can happen rarely due to aborted transactions and keys that have been subsumed due to range tombstones.- when the TBI has fewer keys than the the main iterator, which will be quite common.
This code doesn't handle the first case (and I think we need test coverage for it) -- it corresponds to
iterKey.Compare(tbiKey) > 0
. Since this will be rare I think we should simply calli.timeBoundIter.SeekGE(MakeMVCCMetadataKey(iterKey))
. The code here is still correct, but just wasteful in causing the main iterator to iterate over keys that are not in the TBI.
Ah, I was wondering if the iterKey > tbiKey
case was possible and didn't consider phantom keys. Thanks for pointing that out! Will follow this up with a test.
pkg/storage/engine/mvcc_incremental_iterator.go, line 233 at r6 (raw file):
Previously, sumeerbhola wrote…
I think the invariant is that
advance()
is called wheni.valid
is true, thoughi.iter.Valid()
may be true or false. In which case I don't think this if-block is needed. The only thing that will makei.valid
false ismaybeSkipKeys()
and that is followed by an if-block that returns.
@miretskiy's review of the C++ side led to a re-working of the error handling there. I've updated the go implementation to mirror how we handle errors there closer. This adds a bit more verbosity here, but I think it's more readable. I remove this check at the start of the for
and explicitly check after every invocation which may make the iterator invalid. I think this is clearer since each error handling check is right after the code which may have caused the error.
pkg/storage/engine/mvcc_incremental_iterator.go, line 280 at r6 (raw file):
Previously, sumeerbhola wrote…
can you add a comment like the following since the use of
i.meta
is a bit confusing// Else, i.meta contains an MVCC value and not an intent. This is handled next.
Done. I was also a bit confused by the naming of i.meta
so in addition to adding a comment here, I also added a comment to the meta
property.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @sumeerbhola)
pkg/storage/engine/mvcc_incremental_iterator.go, line 209 at r6 (raw file):
Previously, pbardea (Paul Bardea) wrote…
Ah, I was wondering if the
iterKey > tbiKey
case was possible and didn't consider phantom keys. Thanks for pointing that out! Will follow this up with a test.
Since it's not a correctness concern, I'd actually prefer getting this version in as soon as possible so we can start stressing it sooner. I am interested in eventually following up with a test though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy, @pbardea, and @sumeerbhola)
a discussion (no related file):
Giving my LGTM for C++ -- i'll let @sumeerbhola handle go stuff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 2 files at r6, 3 of 5 files at r7, 1 of 1 files at r8.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy, @pbardea, and @sumeerbhola)
pkg/storage/engine/mvcc_incremental_iterator.go, line 225 at r8 (raw file):
// This means that for the incremental iterator to perform a Next or NextKey // will require only 1 extra NextKey invocation while they remain in // lockstep.
btw, for performance experiments you may want to keep stats like a counter of the fast path and slow path -- unless modified keys are adjacent, the slow path will be more common.
(Not suggesting that you add that to this PR)
pkg/storage/engine/mvcc_incremental_iterator.go, line 227 at r8 (raw file):
// lockstep. if iterKey.Compare(tbiKey) > 0 {
can you restructure this as
cmp := iterKey.Compare(tbiKey)
if cmp > 0 {
...
}
if cmp < 0 {
...
}
to avoid repeated string comparisons.
pkg/storage/engine/mvcc_incremental_iterator_test.go, line 756 at r8 (raw file):
// and the second is the MVCC metadata entry (i.e. an intent). The next // SSTable contains the provisional value for the intent. The effect is that // the metadata entry is separated from the entry it is metadata for.
how is this different from TestMVCCIncrementalIteratorIntentStraddlesSStables
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy, @pbardea, and @sumeerbhola)
pkg/storage/engine/mvcc_incremental_iterator.go, line 225 at r8 (raw file):
Previously, sumeerbhola wrote…
btw, for performance experiments you may want to keep stats like a counter of the fast path and slow path -- unless modified keys are adjacent, the slow path will be more common.
(Not suggesting that you add that to this PR)
Ack.
pkg/storage/engine/mvcc_incremental_iterator.go, line 227 at r8 (raw file):
Previously, sumeerbhola wrote…
can you restructure this as
cmp := iterKey.Compare(tbiKey) if cmp > 0 { ... } if cmp < 0 { ... }
to avoid repeated string comparisons.
Done.
pkg/storage/engine/mvcc_incremental_iterator_test.go, line 756 at r8 (raw file):
Previously, sumeerbhola wrote…
how is this different from
TestMVCCIncrementalIteratorIntentStraddlesSStables
?
Removed. Accidentally added when I was playing around with the idea of testing the phantom key case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 4 files at r9.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @miretskiy, @pbardea, and @sumeerbhola)
14cbc56
to
ee76b9e
Compare
This latest implementation of MVCCIncrementalIterator updates its use of the time-bound iterator to only be used to skip over keys that were not modified at all within the time bound limits. Previously, the incremental iterator primarily used the time-bound iterator and then used a non-time-bound iterator (sanity_iter) to check that meta keys were properly read. See cockroachdb#43799 for discussion around why this may not be sufficient. This commit updates the implementation to use the time-bound iterator more conservatively. The 2 iterators that it now uses are `iter` (a non-time-bound iterator and `timeBoundIter`). The main iteration is now done with `iter`. When `iter` sees a new key, the `timeBoundIter` is incremented and if it goes further than `iter`, `iter` is `Seek`ed forward to the meta key of the new key. This means that `iter` should visit every version of every key that `timeBoundIter` visits. It should be noted that if many adjacent keys were modified, the above logic tries to avoid to do a `Seek` to the next key as a `Seek` is ~an order of magnitude more expensive than a `Next`. In cases where a lot of keys were modified, we expect to do ~double the work since 2 iterators would need to seek to every key. However, the expected use case for the MVCCIncrementalIterator is such that the _vast_ majority of the data is excluded from the range (as should be the case for incremental backups), since there is potentially years worth of data that can be skipped over, versus hours worth of data to include. Release note: None
These tests exercise the Go MVCCIncrementalIterator which is no longer used directly in CockroachDB. However, the MVCCIncrementalIterator is used as an oracle to test the C++ implementation. Therefore, it would still be beneficial to have these tests present as long as we plan to keep these 2 implementations in sync. Release note: None
Thanks for the reviews! |
Build succeeded |
45143: ui: Release notes signup r=dhartunian a=koorosh Resolves: #43912 Depends on: #44744, #44821, #44856 - [x] WIP. Current branch has two branches merged which haven't been approved/merged in master branch yet. - [x] rebase branch to remove merge commits from branches other than master. Add Release notes signup form to Cluster Overview page right after page title. Release Notes signup view is created in `src/views/dashboard` directory because it will be the main page where we display this view. And Cluster Overview page is a temporary place while Dashboard view doesn't exist. These changes integrate three main parts: ReleaseNotesForm, AlertNotification component and custom analytics saga. 45426: coldata: minor tweak of flat bytes r=yuzefovich a=yuzefovich This commit changes `maybeBackfillOffsets` to update `maxSetIndex` accordingly (this might be a minor performance improvement). In a sense, when we're backfilling offsets, we *are* setting indices to point to empty `[]byte` slices. Also, the logic for `Set` method is slightly refactored. Release note: None 45455: clusterversion: significantly rework cluster version handling r=tbg a=irfansharif We split off ClusterVersion out of pkg/settings/cluster into a dedicated pkg/clusterversion. This is to pave the way for #39182 where we introduce long running version migration infrastructure. We then introduce clusterversion.Handle as a read only view to the active cluster version and this binary's version details. We similarly fold in the actual global cluster version setting into pkg/clusterversion, and enforce all external usage to go through clusterversion.Handle. We can also hide away external usage of the baked in cluster.Binary{,MinimumSupported}Version constants. Instead we have the implementation of clusterversion.Handle allow for tests to override binary and minimum supported versions as needed. Along the way we clean up Settings.Initialized, as it is not really used. We document all the various "versions" in play ("binary version", "minimum supported version", "active version") and change usage of what was previously "serverVersion" to simply be "binaryVersion", because that's what it is. We also clean up the Settings constructors into Test/Non-Test types that differ in cluster version setting initialization behaviour. --- For reviewers: It's probably easiest to start with what pkg/settings/cluster/cluster_settings.go looks like, then following into pkg/clusterversion/cluster_version.go and then pkg/clusterversion/setting.go. --- I still don't like the following detail about our pkg structure: - pkg/settings/cluster depends on it's "parent" pkg, pkg/settings - pkg/settings/cluster depends pkg/clusterversion, which in turn depends on pkg/settings Naively, pkg/settings/cluster should be a top level package, but I'm not making that change in this PR. For now I've renamed the settings.go file to cluster_settings.go. Release note: None 45535: Revert "storage,libroach: update MVCCIncrementalIterator to look at every updated key" r=pbardea a=pbardea Reverts #45163 To stop the errors we're seeing on #45524. Will investigate further once it's off master. Co-authored-by: Andrii Vorobiov <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]> Co-authored-by: irfan sharif <[email protected]> Co-authored-by: Daniel Harrison <[email protected]> Co-authored-by: Paul Bardea <[email protected]>
This latest implementation of MVCCIncrementalIterator updates its use of
the time-bound iterator to only be used to skip over keys that were not
modified at all within the time bound limits.
Previously, the incremental iterator primarily used the time-bound
iterator and then used a non-time-bound iterator (sanity_iter) to check
that meta keys were properly read. See #43799 for discussion around why
this may not be sufficient.
This commit updates the implementation to use the time-bound iterator
more conservatively. The 2 iterators that it now uses are
iter
(anon-time-bound iterator and
timeBoundIter
). The main iteration is nowdone with
iter
. Wheniter
sees a new key, thetimeBoundIter
isincremented and if it goes further than
iter
,iter
isSeek
edforward to the meta key of the new key. This means that
iter
shouldvisit every version of every key that
timeBoundIter
visits.It should be noted that if many adjacent keys were modified, the above
logic tries to avoid to do a
Seek
to the next key as aSeek
is ~anorder of magnitude more expensive than a
Next
.In cases where a lot of keys were modified, we expect to do ~double the
work since 2 iterators would need to seek to every key. However, the
expected use case for the MVCCIncrementalIterator is such that the
vast majority of the data is excluded from the range (as should be the
case for incremental backups), since there is potentially years worth of
data that can be skipped over, versus hours worth of data to include.
Release note: None