-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: add MVCCStats
for range keys
#78085
storage: add MVCCStats
for range keys
#78085
Conversation
cfb0b56
to
33c04d4
Compare
b050b4a
to
bb5aa54
Compare
cbba19c
to
c75f6ff
Compare
bb5aa54
to
ba8eb78
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some nits and questions mostly otherwise looks sensible.
ba8eb78
to
34e6ab0
Compare
c75f6ff
to
1d48763
Compare
34e6ab0
to
4798ae2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 61 of 61 files at r8, 6 of 6 files at r9, 19 of 19 files at r10, 16 of 16 files at r11, 2 of 2 files at r12, 2 of 2 files at r13, 2 of 2 files at r14, 6 of 6 files at r15, 32 of 32 files at r16, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, @jbowens, and @sumeerbhola)
pkg/kv/kvserver/batcheval/cmd_delete_range.go, line 119 at r16 (raw file):
startKey, endKey, rangeStart, rangeEnd roachpb.Key, ) (roachpb.Key, roachpb.Key) { const prevKeyLength = 8192
Is there any significance to this const? Worth a comment?
2097ac5
to
3ad25e4
Compare
1d48763
to
f6d2aca
Compare
3ad25e4
to
6a1756c
Compare
0a1a054
to
acc5569
Compare
Allright, this bad boy is unblocked and ready for final review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
flushing some questions
Reviewed 13 of 61 files at r8, 1 of 16 files at r11, 1 of 27 files at r55, 10 of 44 files at r56, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, @jbowens, @nicktrav, and @sumeerbhola)
pkg/kv/kvserver/replica_raft.go
line 2152 at r56 (raw file):
prefixEnd := prefix.PrefixEnd() iter := reader.NewMVCCIterator(storage.MVCCKeyIterKind, storage.IterOptions{ KeyTypes: storage.IterKeyTypePointsAndRanges,
there won't be mvcc delete ranges in the raft log, right?
pkg/kv/kvserver/batcheval/cmd_end_transaction.go
line 1316 at r56 (raw file):
// Calculate the RHS adjustment, which turns out to be equivalent to the stats // contribution of the range key fragmentation. The naïve calculation would be // rhs.EncodedSize() - (keyLen(rhs.EndKey) - keyLen(lhs.EndKey))
should this say
rhs.EncodedSize() - (keyLen(rhs.StartKey) - keyLen(lhs.EndKey))
?
pkg/kv/kvserver/batcheval/cmd_truncate_log.go
line 124 at r56 (raw file):
LowerBound: start, UpperBound: end, })
same question about raft log having range keys
pkg/kv/kvserver/batcheval/split_stats_helper.go
line 34 at r56 (raw file):
// AbsPostSplit{Left,Right}Fn, and so we don't expect ContainsEstimates to be // set in them. The choice of which side to scan is controlled by ScanRightFirst. // - DeltaRangeKey: the stats delta that must be added to the non-computed
nit: this is referenced as DeltaRangeKeyRight
down below, but is defined as DeltaRangeKey
here. should they be unified?
pkg/storage/engine.go
line 268 at r56 (raw file):
// This isn't currently done in order to do spanset assertions on it, but this // could be better solved by checking the iterator bounds in NewMVCCIterator // and requiring callers to set them appropriately.
🎉 This will allow us to drop an end-key comparison in ComputeStatsForRange
: #41899
pkg/storage/mvcc.go
line 902 at r56 (raw file):
which is not affected by MVCC range tombstones
can you explain this? A MVCC Range tombstone at t2
that covers the key k@t1
does cause k@t1
to contribute to GCByesAge
from t2
on, just like a point tombstone would, right?
is this saying that a MVCC range tombstone that's shadowed at prefix k
doesn't itself apply to prefix k
's GCBytesAge
?
do MVCC range tombstones themselves that are shadowed ever contribute to GCBytesAge
? Eg, a MVCCDeleteRange([a,z)@t10)
and a MVCCDeleteRange([a,z)@t3)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, @jbowens, @nicktrav, and @sumeerbhola)
pkg/kv/kvserver/replica_raft.go
line 2152 at r56 (raw file):
Previously, jbowens (Jackson Owens) wrote…
there won't be mvcc delete ranges in the raft log, right?
No, shouldn't be. I figured we might as well enable range keys for all stats calculations, since I assumed the overall overhead would be negligible and better safe than sorry, but we can be more selective about it too.
pkg/kv/kvserver/batcheval/cmd_end_transaction.go
line 1316 at r56 (raw file):
Previously, jbowens (Jackson Owens) wrote…
should this say
rhs.EncodedSize() - (keyLen(rhs.StartKey) - keyLen(lhs.EndKey))
?
No -- that would just be 0, because rhs.StartKey == lhs.EndKey
after the split.
What we're trying to adjust for here is really the change in the LHS' end key, from before the split to after the split. Before the split, the range key's end key way rhs.EndKey
, while post-split it's lhs.EndKey
. But as luck would have it, we can express rhs.EndKey
in terms of rhs.StartKey
via EncodedSize()
, eliminating the dependence on rhs.EndKey
.
pkg/kv/kvserver/batcheval/split_stats_helper.go
line 34 at r56 (raw file):
Previously, jbowens (Jackson Owens) wrote…
nit: this is referenced as
DeltaRangeKeyRight
down below, but is defined asDeltaRangeKey
here. should they be unified?
Ah, thanks. Yeah, it used to be the case that we always scanned the LHS and then used that to compute the RHS, but some time ago we made it possible to scan either side (since the RHS would often be empty). Forgot to update the names everywhere.
pkg/storage/engine.go
line 268 at r56 (raw file):
Previously, jbowens (Jackson Owens) wrote…
🎉 This will allow us to drop an end-key comparison in
ComputeStatsForRange
: #41899
Ah, nice -- definitely seems worthwhile.
pkg/storage/mvcc.go
line 902 at r56 (raw file):
Previously, jbowens (Jackson Owens) wrote…
which is not affected by MVCC range tombstones
can you explain this? A MVCC Range tombstone at
t2
that covers the keyk@t1
does causek@t1
to contribute toGCByesAge
fromt2
on, just like a point tombstone would, right?is this saying that a MVCC range tombstone that's shadowed at prefix
k
doesn't itself apply to prefixk
'sGCBytesAge
?do MVCC range tombstones themselves that are shadowed ever contribute to
GCBytesAge
? Eg, aMVCCDeleteRange([a,z)@t10)
and aMVCCDeleteRange([a,z)@t3)
?
This really only comes up when there are multiple tombstones above a value. With point keys, the prefix contribution to GCBytesAge
only starts at the last tombstone. So with a non-tombstone value k@1
with tombstones at k@2
, k@3
, the GCBytesAge
contribution of the prefix starts accumulating from k@3
, since GCing below that won't affect the prefix contribution to KeyBytes
. However, if we instead had range tombstones [a-z)@2
and [a-z)@3
then the point key ceases to exist at @2
, and starts accumulating GCBytesAge
from there, because if we GCed up to @2
then we would remove all k
point keys, so it would no longer contribute to KeyBytes
.
Range keys contribute to GCBytesAge
too, but only for the range key as a whole, and per fragment stack. This works in much the same way as point keys: the prefix/key contribution to GCBytesAge
begins at the latest range key in the stack, which is where the range key ceases to exist (if it is a range tombstone, which it currently always is). The key contribution of range keys and point keys to stats are independent (they contribute to KeyBytes
and RangeKeyBytes
separately), and thus the key's contribution to GCBytesAge
is also mostly independent, except for range tombstones determining the visibility of point keys.
acc5569
to
1293722
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 27 files at r55, 27 of 44 files at r56, 1 of 1 files at r57, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, @jbowens, @nicktrav, and @sumeerbhola)
pkg/storage/mvcc.go
line 902 at r56 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
This really only comes up when there are multiple tombstones above a value. With point keys, the prefix contribution to
GCBytesAge
only starts at the last tombstone. So with a non-tombstone valuek@1
with tombstones atk@2
,k@3
, theGCBytesAge
contribution of the prefix starts accumulating fromk@3
, since GCing below that won't affect the prefix contribution toKeyBytes
. However, if we instead had range tombstones[a-z)@2
and[a-z)@3
then the point key ceases to exist at@2
, and starts accumulatingGCBytesAge
from there, because if we GCed up to@2
then we would remove allk
point keys, so it would no longer contribute toKeyBytes
.Range keys contribute to
GCBytesAge
too, but only for the range key as a whole, and per fragment stack. This works in much the same way as point keys: the prefix/key contribution toGCBytesAge
begins at the latest range key in the stack, which is where the range key ceases to exist (if it is a range tombstone, which it currently always is). The key contribution of range keys and point keys to stats are independent (they contribute toKeyBytes
andRangeKeyBytes
separately), and thus the key's contribution toGCBytesAge
is also mostly independent, except for range tombstones determining the visibility of point keys.
Ah, thanks, that makes a lot of sense.
pkg/storage/mvcc.go
line 2686 at r57 (raw file):
case 1: // fragment fragmentRangeKeys(iter.RangeKeys(), startKey) case 0: // merge
To avoid the duplication of this defragmenting logic, could we create an iterator that reads through a batch containing the new range-key iterator? If the SeekLT(startKey)
lands on a range key whose end boundary extends beyond startKey
, the addition of the new range key merged two fragments. If the surfaced range key has an end boundary equal to startKey
, the addition of the range key fragmented a key at the left bound.
Release note: None
This patch adds `Key.Prevish()`, which returns a previous key in lexicographical sort order. This is needed to expand a latch span leftwards to peek at any left-adjacent range keys. It is impossible to find the exact immediate predecessor of a key, because it can have an infinite number of `0xff` bytes at the end, so this returns the nearest previous key right-padded with `0xff` up to the given length. It is still possible for an infinite number of keys to exist between `Key` and `Key.Prevish()`, as keys have unbounded length. Release note: None
1293722
to
f86f168
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, @jbowens, @nicktrav, and @sumeerbhola)
pkg/storage/mvcc.go
line 2686 at r57 (raw file):
Previously, jbowens (Jackson Owens) wrote…
To avoid the duplication of this defragmenting logic, could we create an iterator that reads through a batch containing the new range-key iterator? If the
SeekLT(startKey)
lands on a range key whose end boundary extends beyondstartKey
, the addition of the new range key merged two fragments. If the surfaced range key has an end boundary equal tostartKey
, the addition of the range key fragmented a key at the left bound.
That might work here, but we're going to need similar logic in a bunch of other places where I don't think that's viable, e.g. during AddSSTable
, ClearRange
, RevertRange
, as well as during splits/merges. That said, this code is rather ugly. The plan is to consolidate it with logic elsewhere and clean it up once we've implemented stats handling everywhere at the KV layer. I added a TODO explaining this. Wdyt?
This patch adds `MVCCStats` tracking for range keys. Four new fields are added to `MVCCStats`: * `RangeKeyCount`: the number of (fragmented) range keys, not counting historical versions. * `RangeKeyBytes`: the logical encoded byte size of all range keys. The latest version contributes the encoded key bounds, and all versions contribute encoded timestamps. Unlike point keys, which for historical reasons use a fixed-size timestamp contribution, this uses the actual variable-length timestamp size. * `RangeValCount`: the number of (fragmented) range key versions. * `RangeValBytes`: the encoded size of all range key values across all versions. The same value can be stored in multiple range keys due to fragmentation, which will be counted separately. Even though all range keys are currently MVCC range tombstones with no value, the `MVCCValueHeader` contribution can be non-zero due to e.g. a local timestamp. `ComputeStatsForRange()` has been extended to calculate the above quantities, and additionally account for range tombstones themselves in `GCBytesAge` along with their effect on point keys. All relevant call sites have been updated to surface range keys for the MVCC iterators passed to `ComputeStatsForRange()`. Most MVCC operations have been updated to correctly account for MVCC range tombstones, e.g. during point key writes and intent resolution. KV APIs are not yet updated, this will be addressed later. Range key stats are also adjusted during range splits and merges, which will split and merge any range keys that straddle the split key. This requires a single range key seek to the left and right of the split key during these operations. Release note: None
f86f168
to
8fe2362
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 37 files at r25, 4 of 60 files at r26, 11 of 31 files at r52, 16 of 27 files at r55, 7 of 44 files at r56.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, @jbowens, @nicktrav, and @sumeerbhola)
pkg/storage/mvcc.go
line 2686 at r57 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
That might work here, but we're going to need similar logic in a bunch of other places where I don't think that's viable, e.g. during
AddSSTable
,ClearRange
,RevertRange
, as well as during splits/merges. That said, this code is rather ugly. The plan is to consolidate it with logic elsewhere and clean it up once we've implemented stats handling everywhere at the KV layer. I added a TODO explaining this. Wdyt?
Yeah, fine by me. When we get there, we might consider exposing what we need from Pebble if it'll help contain the complexity.
TFTR! bors r=jbowens |
Build failed (retrying...): |
Build failed (retrying...): |
78085: storage: add `MVCCStats` for range keys r=jbowens a=erikgrinaker **storage: export some MVCC key encoding functions** Release note: None **roachpb: add `Key.Prevish()` to find a previous key** This patch adds `Key.Prevish()`, which returns a previous key in lexicographical sort order. This is needed to expand a latch span leftwards to peek at any left-adjacent range keys. It is impossible to find the exact immediate predecessor of a key, because it can have an infinite number of `0xff` bytes at the end, so this returns the nearest previous key right-padded with `0xff` up to the given length. It is still possible for an infinite number of keys to exist between `Key` and `Key.Prevish()`, as keys have unbounded length. Release note: None **storage: add `MVCCStats` for range keys** This patch adds `MVCCStats` tracking for range keys. Four new fields are added to `MVCCStats`: * `RangeKeyCount`: the number of (fragmented) range keys, not counting historical versions. * `RangeKeyBytes`: the logical encoded byte size of all range keys. The latest version contributes the encoded key bounds, and all versions contribute encoded timestamps. Unlike point keys, which for historical reasons use a fixed-size timestamp contribution, this uses the actual variable-length timestamp size. * `RangeValCount`: the number of (fragmented) range key versions. * `RangeValBytes`: the encoded size of all range key values across all versions. The same value can be stored in multiple range keys due to fragmentation, which will be counted separately. Even though all range keys are currently MVCC range tombstones with no value, the `MVCCValueHeader` contribution can be non-zero due to e.g. a local timestamp. `ComputeStatsForRange()` has been extended to calculate the above quantities, and additionally account for range tombstones themselves in `GCBytesAge` along with their effect on point keys. All relevant call sites have been updated to surface range keys for the MVCC iterators passed to `ComputeStatsForRange()`. Most MVCC operations have been updated to correctly account for MVCC range tombstones, e.g. during point key writes and intent resolution. KV APIs are not yet updated, this will be addressed later. Range key stats are also adjusted during range splits and merges, which will split and merge any range keys that straddle the split key. This requires a single range key seek to the left and right of the split key during these operations. Touches #70412. Release note: None 82384: sql: reuse the slice of RequestUnion objects between fetches r=yuzefovich a=yuzefovich This commit teaches `txnKVFetcher` and `txnKVStreamer` to reuse the same slice of `RequestUnion` objects between different fetches. It is now extremely easy to do given the recent refactor. We do perform memory accounting for this slice (against a memory account bound to an unlimited memory monitor). Additionally, a similar optimization is applied to how resume requests are populated by the Streamer. Addresses: #82160. Release note: None Co-authored-by: Erik Grinaker <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]>
Build failed (retrying...): |
Build failed: |
bors retry |
Build succeeded: |
storage: export some MVCC key encoding functions
Release note: None
roachpb: add
Key.Prevish()
to find a previous keyThis patch adds
Key.Prevish()
, which returns a previous key inlexicographical sort order. This is needed to expand a latch span
leftwards to peek at any left-adjacent range keys.
It is impossible to find the exact immediate predecessor of a key,
because it can have an infinite number of
0xff
bytes at the end, sothis returns the nearest previous key right-padded with
0xff
up to thegiven length. It is still possible for an infinite number of keys to
exist between
Key
andKey.Prevish()
, as keys have unbounded length.Release note: None
storage: add
MVCCStats
for range keysThis patch adds
MVCCStats
tracking for range keys. Four new fields areadded to
MVCCStats
:RangeKeyCount
: the number of (fragmented) range keys, not countinghistorical versions.
RangeKeyBytes
: the logical encoded byte size of all range keys.The latest version contributes the encoded key bounds, and all
versions contribute encoded timestamps. Unlike point keys, which for
historical reasons use a fixed-size timestamp contribution, this uses
the actual variable-length timestamp size.
RangeValCount
: the number of (fragmented) range key versions.RangeValBytes
: the encoded size of all range key values acrossall versions. The same value can be stored in multiple range keys
due to fragmentation, which will be counted separately. Even though
all range keys are currently MVCC range tombstones with no value, the
MVCCValueHeader
contribution can be non-zero due to e.g. a localtimestamp.
ComputeStatsForRange()
has been extended to calculate the abovequantities, and additionally account for range tombstones themselves in
GCBytesAge
along with their effect on point keys. All relevant callsites have been updated to surface range keys for the MVCC iterators
passed to
ComputeStatsForRange()
.Most MVCC operations have been updated to correctly account for MVCC
range tombstones, e.g. during point key writes and intent resolution. KV
APIs are not yet updated, this will be addressed later.
Range key stats are also adjusted during range splits and merges, which
will split and merge any range keys that straddle the split key. This
requires a single range key seek to the left and right of the split key
during these operations.
Touches #70412.
Release note: None