Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: add MVCCStats for range keys #78085

Merged
merged 3 commits into from
Jun 28, 2022

Conversation

erikgrinaker
Copy link
Contributor

@erikgrinaker erikgrinaker commented Mar 18, 2022

storage: export some MVCC key encoding functions

Release note: None

roachpb: add Key.Prevish() to find a previous key

This patch adds Key.Prevish(), which returns a previous key in
lexicographical sort order. This is needed to expand a latch span
leftwards to peek at any left-adjacent range keys.

It is impossible to find the exact immediate predecessor of a key,
because it can have an infinite number of 0xff bytes at the end, so
this returns the nearest previous key right-padded with 0xff up to the
given length. It is still possible for an infinite number of keys to
exist between Key and Key.Prevish(), as keys have unbounded length.

Release note: None

storage: add MVCCStats for range keys

This patch adds MVCCStats tracking for range keys. Four new fields are
added to MVCCStats:

  • RangeKeyCount: the number of (fragmented) range keys, not counting
    historical versions.

  • RangeKeyBytes: the logical encoded byte size of all range keys.
    The latest version contributes the encoded key bounds, and all
    versions contribute encoded timestamps. Unlike point keys, which for
    historical reasons use a fixed-size timestamp contribution, this uses
    the actual variable-length timestamp size.

  • RangeValCount: the number of (fragmented) range key versions.

  • RangeValBytes: the encoded size of all range key values across
    all versions. The same value can be stored in multiple range keys
    due to fragmentation, which will be counted separately. Even though
    all range keys are currently MVCC range tombstones with no value, the
    MVCCValueHeader contribution can be non-zero due to e.g. a local
    timestamp.

ComputeStatsForRange() has been extended to calculate the above
quantities, and additionally account for range tombstones themselves in
GCBytesAge along with their effect on point keys. All relevant call
sites have been updated to surface range keys for the MVCC iterators
passed to ComputeStatsForRange().

Most MVCC operations have been updated to correctly account for MVCC
range tombstones, e.g. during point key writes and intent resolution. KV
APIs are not yet updated, this will be addressed later.

Range key stats are also adjusted during range splits and merges, which
will split and merge any range keys that straddle the split key. This
requires a single range key seek to the left and right of the split key
during these operations.

Touches #70412.

Release note: None

@erikgrinaker erikgrinaker requested a review from aliher1911 March 18, 2022 15:51
@erikgrinaker erikgrinaker requested review from a team as code owners March 18, 2022 15:51
@erikgrinaker erikgrinaker self-assigned this Mar 18, 2022
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-stats branch 3 times, most recently from cfb0b56 to 33c04d4 Compare March 18, 2022 19:57
@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones branch 2 times, most recently from b050b4a to bb5aa54 Compare March 24, 2022 11:39
@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-stats branch 2 times, most recently from cbba19c to c75f6ff Compare March 26, 2022 18:27
@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones branch from bb5aa54 to ba8eb78 Compare March 30, 2022 12:13
Copy link
Contributor

@aliher1911 aliher1911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nits and questions mostly otherwise looks sensible.

pkg/storage/mvcc_stats_test.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/batcheval/cmd_end_transaction.go Outdated Show resolved Hide resolved
pkg/storage/engine.go Outdated Show resolved Hide resolved
pkg/storage/mvcc_history_test.go Outdated Show resolved Hide resolved
@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones branch from ba8eb78 to 34e6ab0 Compare April 2, 2022 17:10
@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-stats branch from c75f6ff to 1d48763 Compare April 3, 2022 17:01
@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones branch from 34e6ab0 to 4798ae2 Compare April 4, 2022 13:43
Copy link
Collaborator

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 61 of 61 files at r8, 6 of 6 files at r9, 19 of 19 files at r10, 16 of 16 files at r11, 2 of 2 files at r12, 2 of 2 files at r13, 2 of 2 files at r14, 6 of 6 files at r15, 32 of 32 files at r16, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, @jbowens, and @sumeerbhola)


pkg/kv/kvserver/batcheval/cmd_delete_range.go, line 119 at r16 (raw file):

	startKey, endKey, rangeStart, rangeEnd roachpb.Key,
) (roachpb.Key, roachpb.Key) {
	const prevKeyLength = 8192

Is there any significance to this const? Worth a comment?

@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones branch 5 times, most recently from 2097ac5 to 3ad25e4 Compare April 11, 2022 20:36
@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-stats branch from 1d48763 to f6d2aca Compare April 12, 2022 09:48
@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones branch from 3ad25e4 to 6a1756c Compare April 15, 2022 09:41
@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-stats branch from 0a1a054 to acc5569 Compare June 15, 2022 20:53
@erikgrinaker erikgrinaker marked this pull request as ready for review June 15, 2022 20:53
@erikgrinaker
Copy link
Contributor Author

Allright, this bad boy is unblocked and ready for final review.

Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flushing some questions

Reviewed 13 of 61 files at r8, 1 of 16 files at r11, 1 of 27 files at r55, 10 of 44 files at r56, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, @jbowens, @nicktrav, and @sumeerbhola)


pkg/kv/kvserver/replica_raft.go line 2152 at r56 (raw file):

	prefixEnd := prefix.PrefixEnd()
	iter := reader.NewMVCCIterator(storage.MVCCKeyIterKind, storage.IterOptions{
		KeyTypes:   storage.IterKeyTypePointsAndRanges,

there won't be mvcc delete ranges in the raft log, right?


pkg/kv/kvserver/batcheval/cmd_end_transaction.go line 1316 at r56 (raw file):

	// Calculate the RHS adjustment, which turns out to be equivalent to the stats
	// contribution of the range key fragmentation. The naïve calculation would be
	// rhs.EncodedSize() - (keyLen(rhs.EndKey) - keyLen(lhs.EndKey))

should this say

 rhs.EncodedSize() - (keyLen(rhs.StartKey) - keyLen(lhs.EndKey))

?


pkg/kv/kvserver/batcheval/cmd_truncate_log.go line 124 at r56 (raw file):

		LowerBound: start,
		UpperBound: end,
	})

same question about raft log having range keys


pkg/kv/kvserver/batcheval/split_stats_helper.go line 34 at r56 (raw file):

//   AbsPostSplit{Left,Right}Fn, and so we don't expect ContainsEstimates to be
//   set in them. The choice of which side to scan is controlled by ScanRightFirst.
// - DeltaRangeKey: the stats delta that must be added to the non-computed

nit: this is referenced as DeltaRangeKeyRight down below, but is defined as DeltaRangeKey here. should they be unified?


pkg/storage/engine.go line 268 at r56 (raw file):

	// This isn't currently done in order to do spanset assertions on it, but this
	// could be better solved by checking the iterator bounds in NewMVCCIterator
	// and requiring callers to set them appropriately.

🎉 This will allow us to drop an end-key comparison in ComputeStatsForRange: #41899


pkg/storage/mvcc.go line 902 at r56 (raw file):

which is not affected by MVCC range tombstones

can you explain this? A MVCC Range tombstone at t2 that covers the key k@t1 does cause k@t1 to contribute to GCByesAge from t2 on, just like a point tombstone would, right?

is this saying that a MVCC range tombstone that's shadowed at prefix k doesn't itself apply to prefix k's GCBytesAge?

do MVCC range tombstones themselves that are shadowed ever contribute to GCBytesAge? Eg, a MVCCDeleteRange([a,z)@t10) and a MVCCDeleteRange([a,z)@t3)?

Copy link
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, @jbowens, @nicktrav, and @sumeerbhola)


pkg/kv/kvserver/replica_raft.go line 2152 at r56 (raw file):

Previously, jbowens (Jackson Owens) wrote…

there won't be mvcc delete ranges in the raft log, right?

No, shouldn't be. I figured we might as well enable range keys for all stats calculations, since I assumed the overall overhead would be negligible and better safe than sorry, but we can be more selective about it too.


pkg/kv/kvserver/batcheval/cmd_end_transaction.go line 1316 at r56 (raw file):

Previously, jbowens (Jackson Owens) wrote…

should this say

 rhs.EncodedSize() - (keyLen(rhs.StartKey) - keyLen(lhs.EndKey))

?

No -- that would just be 0, because rhs.StartKey == lhs.EndKey after the split.

What we're trying to adjust for here is really the change in the LHS' end key, from before the split to after the split. Before the split, the range key's end key way rhs.EndKey, while post-split it's lhs.EndKey. But as luck would have it, we can express rhs.EndKey in terms of rhs.StartKey via EncodedSize(), eliminating the dependence on rhs.EndKey.


pkg/kv/kvserver/batcheval/split_stats_helper.go line 34 at r56 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: this is referenced as DeltaRangeKeyRight down below, but is defined as DeltaRangeKey here. should they be unified?

Ah, thanks. Yeah, it used to be the case that we always scanned the LHS and then used that to compute the RHS, but some time ago we made it possible to scan either side (since the RHS would often be empty). Forgot to update the names everywhere.


pkg/storage/engine.go line 268 at r56 (raw file):

Previously, jbowens (Jackson Owens) wrote…

🎉 This will allow us to drop an end-key comparison in ComputeStatsForRange: #41899

Ah, nice -- definitely seems worthwhile.


pkg/storage/mvcc.go line 902 at r56 (raw file):

Previously, jbowens (Jackson Owens) wrote…

which is not affected by MVCC range tombstones

can you explain this? A MVCC Range tombstone at t2 that covers the key k@t1 does cause k@t1 to contribute to GCByesAge from t2 on, just like a point tombstone would, right?

is this saying that a MVCC range tombstone that's shadowed at prefix k doesn't itself apply to prefix k's GCBytesAge?

do MVCC range tombstones themselves that are shadowed ever contribute to GCBytesAge? Eg, a MVCCDeleteRange([a,z)@t10) and a MVCCDeleteRange([a,z)@t3)?

This really only comes up when there are multiple tombstones above a value. With point keys, the prefix contribution to GCBytesAge only starts at the last tombstone. So with a non-tombstone value k@1 with tombstones at k@2, k@3, the GCBytesAge contribution of the prefix starts accumulating from k@3, since GCing below that won't affect the prefix contribution to KeyBytes. However, if we instead had range tombstones [a-z)@2 and [a-z)@3 then the point key ceases to exist at @2, and starts accumulating GCBytesAge from there, because if we GCed up to @2 then we would remove all k point keys, so it would no longer contribute to KeyBytes.

Range keys contribute to GCBytesAge too, but only for the range key as a whole, and per fragment stack. This works in much the same way as point keys: the prefix/key contribution to GCBytesAge begins at the latest range key in the stack, which is where the range key ceases to exist (if it is a range tombstone, which it currently always is). The key contribution of range keys and point keys to stats are independent (they contribute to KeyBytes and RangeKeyBytes separately), and thus the key's contribution to GCBytesAge is also mostly independent, except for range tombstones determining the visibility of point keys.

@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-stats branch from acc5569 to 1293722 Compare June 17, 2022 16:25
Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 27 files at r55, 27 of 44 files at r56, 1 of 1 files at r57, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, @jbowens, @nicktrav, and @sumeerbhola)


pkg/storage/mvcc.go line 902 at r56 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

This really only comes up when there are multiple tombstones above a value. With point keys, the prefix contribution to GCBytesAge only starts at the last tombstone. So with a non-tombstone value k@1 with tombstones at k@2, k@3, the GCBytesAge contribution of the prefix starts accumulating from k@3, since GCing below that won't affect the prefix contribution to KeyBytes. However, if we instead had range tombstones [a-z)@2 and [a-z)@3 then the point key ceases to exist at @2, and starts accumulating GCBytesAge from there, because if we GCed up to @2 then we would remove all k point keys, so it would no longer contribute to KeyBytes.

Range keys contribute to GCBytesAge too, but only for the range key as a whole, and per fragment stack. This works in much the same way as point keys: the prefix/key contribution to GCBytesAge begins at the latest range key in the stack, which is where the range key ceases to exist (if it is a range tombstone, which it currently always is). The key contribution of range keys and point keys to stats are independent (they contribute to KeyBytes and RangeKeyBytes separately), and thus the key's contribution to GCBytesAge is also mostly independent, except for range tombstones determining the visibility of point keys.

Ah, thanks, that makes a lot of sense.


pkg/storage/mvcc.go line 2686 at r57 (raw file):

				case 1: // fragment
					fragmentRangeKeys(iter.RangeKeys(), startKey)
				case 0: // merge

To avoid the duplication of this defragmenting logic, could we create an iterator that reads through a batch containing the new range-key iterator? If the SeekLT(startKey) lands on a range key whose end boundary extends beyond startKey, the addition of the new range key merged two fragments. If the surfaced range key has an end boundary equal to startKey, the addition of the range key fragmented a key at the left bound.

This patch adds `Key.Prevish()`, which returns a previous key in
lexicographical sort order. This is needed to expand a latch span
leftwards to peek at any left-adjacent range keys.

It is impossible to find the exact immediate predecessor of a key,
because it can have an infinite number of `0xff` bytes at the end, so
this returns the nearest previous key right-padded with `0xff` up to the
given length. It is still possible for an infinite number of keys to
exist between `Key` and `Key.Prevish()`, as keys have unbounded length.

Release note: None
@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-stats branch from 1293722 to f86f168 Compare June 26, 2022 17:34
Copy link
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, @jbowens, @nicktrav, and @sumeerbhola)


pkg/storage/mvcc.go line 2686 at r57 (raw file):

Previously, jbowens (Jackson Owens) wrote…

To avoid the duplication of this defragmenting logic, could we create an iterator that reads through a batch containing the new range-key iterator? If the SeekLT(startKey) lands on a range key whose end boundary extends beyond startKey, the addition of the new range key merged two fragments. If the surfaced range key has an end boundary equal to startKey, the addition of the range key fragmented a key at the left bound.

That might work here, but we're going to need similar logic in a bunch of other places where I don't think that's viable, e.g. during AddSSTable, ClearRange, RevertRange, as well as during splits/merges. That said, this code is rather ugly. The plan is to consolidate it with logic elsewhere and clean it up once we've implemented stats handling everywhere at the KV layer. I added a TODO explaining this. Wdyt?

This patch adds `MVCCStats` tracking for range keys. Four new fields are
added to `MVCCStats`:

* `RangeKeyCount`: the number of (fragmented) range keys, not counting
  historical versions.

* `RangeKeyBytes`: the logical encoded byte size of all range keys.
  The latest version contributes the encoded key bounds, and all
  versions contribute encoded timestamps. Unlike point keys, which for
  historical reasons use a fixed-size timestamp contribution, this uses
  the actual variable-length timestamp size.

* `RangeValCount`: the number of (fragmented) range key versions.

* `RangeValBytes`: the encoded size of all range key values across
  all versions. The same value can be stored in multiple range keys
  due to fragmentation, which will be counted separately. Even though
  all range keys are currently MVCC range tombstones with no value, the
  `MVCCValueHeader` contribution can be non-zero due to e.g. a local
  timestamp.

`ComputeStatsForRange()` has been extended to calculate the above
quantities, and additionally account for range tombstones themselves in
`GCBytesAge` along with their effect on point keys. All relevant call
sites have been updated to surface range keys for the MVCC iterators
passed to `ComputeStatsForRange()`.

Most MVCC operations have been updated to correctly account for MVCC
range tombstones, e.g. during point key writes and intent resolution. KV
APIs are not yet updated, this will be addressed later.

Range key stats are also adjusted during range splits and merges, which
will split and merge any range keys that straddle the split key. This
requires a single range key seek to the left and right of the split key
during these operations.

Release note: None
Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 37 files at r25, 4 of 60 files at r26, 11 of 31 files at r52, 16 of 27 files at r55, 7 of 44 files at r56.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, @jbowens, @nicktrav, and @sumeerbhola)


pkg/storage/mvcc.go line 2686 at r57 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

That might work here, but we're going to need similar logic in a bunch of other places where I don't think that's viable, e.g. during AddSSTable, ClearRange, RevertRange, as well as during splits/merges. That said, this code is rather ugly. The plan is to consolidate it with logic elsewhere and clean it up once we've implemented stats handling everywhere at the KV layer. I added a TODO explaining this. Wdyt?

Yeah, fine by me. When we get there, we might consider exposing what we need from Pebble if it'll help contain the complexity.

@erikgrinaker
Copy link
Contributor Author

TFTR!

bors r=jbowens

@craig
Copy link
Contributor

craig bot commented Jun 27, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Jun 27, 2022

Build failed (retrying...):

craig bot pushed a commit that referenced this pull request Jun 28, 2022
78085: storage: add `MVCCStats` for range keys r=jbowens a=erikgrinaker

**storage: export some MVCC key encoding functions**

Release note: None

**roachpb: add `Key.Prevish()` to find a previous key**

This patch adds `Key.Prevish()`, which returns a previous key in
lexicographical sort order. This is needed to expand a latch span
leftwards to peek at any left-adjacent range keys.

It is impossible to find the exact immediate predecessor of a key,
because it can have an infinite number of `0xff` bytes at the end, so
this returns the nearest previous key right-padded with `0xff` up to the
given length. It is still possible for an infinite number of keys to
exist between `Key` and `Key.Prevish()`, as keys have unbounded length.

Release note: None

**storage: add `MVCCStats` for range keys**

This patch adds `MVCCStats` tracking for range keys. Four new fields are
added to `MVCCStats`:

* `RangeKeyCount`: the number of (fragmented) range keys, not counting
  historical versions.

* `RangeKeyBytes`: the logical encoded byte size of all range keys.
  The latest version contributes the encoded key bounds, and all
  versions contribute encoded timestamps. Unlike point keys, which for
  historical reasons use a fixed-size timestamp contribution, this uses
  the actual variable-length timestamp size.

* `RangeValCount`: the number of (fragmented) range key versions.

* `RangeValBytes`: the encoded size of all range key values across
  all versions. The same value can be stored in multiple range keys
  due to fragmentation, which will be counted separately. Even though
  all range keys are currently MVCC range tombstones with no value, the
  `MVCCValueHeader` contribution can be non-zero due to e.g. a local
  timestamp.

`ComputeStatsForRange()` has been extended to calculate the above
quantities, and additionally account for range tombstones themselves in
`GCBytesAge` along with their effect on point keys. All relevant call
sites have been updated to surface range keys for the MVCC iterators
passed to `ComputeStatsForRange()`.

Most MVCC operations have been updated to correctly account for MVCC
range tombstones, e.g. during point key writes and intent resolution. KV
APIs are not yet updated, this will be addressed later.

Range key stats are also adjusted during range splits and merges, which
will split and merge any range keys that straddle the split key. This
requires a single range key seek to the left and right of the split key
during these operations.

Touches #70412.

Release note: None

82384: sql: reuse the slice of RequestUnion objects between fetches r=yuzefovich a=yuzefovich

This commit teaches `txnKVFetcher` and `txnKVStreamer` to reuse the same
slice of `RequestUnion` objects between different fetches. It is now
extremely easy to do given the recent refactor. We do perform memory
accounting for this slice (against a memory account bound to an
unlimited memory monitor). Additionally, a similar optimization is
applied to how resume requests are populated by the Streamer.

Addresses: #82160.

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
@craig
Copy link
Contributor

craig bot commented Jun 28, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Jun 28, 2022

Build failed:

@erikgrinaker
Copy link
Contributor Author

bors retry

@craig
Copy link
Contributor

craig bot commented Jun 28, 2022

Build succeeded:

@craig craig bot merged commit 443c4d5 into cockroachdb:master Jun 28, 2022
@erikgrinaker erikgrinaker deleted the mvcc-range-tombstones-stats branch June 28, 2022 10:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants