Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

db: avoid unnecessary key comparisons during iteration #1840

Merged
merged 1 commit into from
Aug 2, 2022

Conversation

jbowens
Copy link
Collaborator

@jbowens jbowens commented Aug 1, 2022

Use a trailer comparison to avoid a key comparison during forward iteration
when the next key has a trailer ≥ than the previous key.

In Cockroach this is expected to help particularly during scans of over
ingested sstables: eg, replicas that were ingested through a snapshot, or
sstables ingested through AddSSTable.

name                                                          old time/op    new time/op    delta
MVCCScan_Pebble/rows=1/versions=1/valueSize=64-10               3.80µs ± 0%    3.68µs ± 0%  -3.16%  (p=0.000 n=10+10)
MVCCScan_Pebble/rows=100/versions=1/valueSize=64-10             22.1µs ± 1%    21.0µs ± 0%  -4.79%  (p=0.000 n=10+10)
MVCCScan_Pebble/rows=10000/versions=1/valueSize=64-10           1.38ms ± 1%    1.31ms ± 0%  -4.95%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=false/versions=1/valueSize=8-10            3.85µs ± 1%    3.77µs ± 1%  -2.07%  (p=0.000 n=10+9)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8-10           4.30µs ± 1%    4.24µs ± 1%  -1.33%  (p=0.000 n=9+9)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8-10          8.19µs ± 2%    8.06µs ± 2%  -1.60%  (p=0.023 n=10+10)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8-10             1.54µs ± 0%    1.54µs ± 1%    ~     (p=0.673 n=9+10)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8-10            2.09µs ± 1%    2.08µs ± 1%    ~     (p=0.366 n=9+10)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8-10           5.14µs ± 3%    5.08µs ± 1%  -1.11%  (p=0.015 n=10+10)

name                                                             old time/op  new time/op  delta
IteratorScan/keys=100,r-amp=1,key-types=points-only-10           5.87µs ± 1%  5.48µs ± 2%  -6.59%  (p=0.008 n=5+5)
IteratorScan/keys=100,r-amp=1,key-types=points-and-ranges-10     6.09µs ± 3%  5.62µs ± 0%  -7.79%  (p=0.008 n=5+5)
IteratorScan/keys=100,r-amp=3,key-types=points-only-10           10.2µs ± 1%   9.9µs ± 1%  -2.55%  (p=0.008 n=5+5)
IteratorScan/keys=100,r-amp=3,key-types=points-and-ranges-10     10.5µs ± 1%  10.3µs ± 1%  -1.51%  (p=0.016 n=5+5)
IteratorScan/keys=100,r-amp=7,key-types=points-only-10           15.1µs ± 1%  14.9µs ± 1%  -1.22%  (p=0.008 n=5+5)
IteratorScan/keys=100,r-amp=7,key-types=points-and-ranges-10     15.5µs ± 1%  15.0µs ± 1%  -3.03%  (p=0.008 n=5+5)
IteratorScan/keys=100,r-amp=10,key-types=points-only-10          18.7µs ± 2%  17.8µs ± 1%  -5.12%  (p=0.008 n=5+5)
IteratorScan/keys=100,r-amp=10,key-types=points-and-ranges-10    18.9µs ± 1%  18.1µs ± 0%  -3.93%  (p=0.008 n=5+5)
IteratorScan/keys=1000,r-amp=1,key-types=points-only-10          45.1µs ± 2%  41.2µs ± 0%  -8.49%  (p=0.008 n=5+5)
IteratorScan/keys=1000,r-amp=1,key-types=points-and-ranges-10    47.1µs ± 1%  43.2µs ± 0%  -8.29%  (p=0.008 n=5+5)
IteratorScan/keys=1000,r-amp=3,key-types=points-only-10          77.7µs ± 0%  74.2µs ± 0%  -4.49%  (p=0.008 n=5+5)
IteratorScan/keys=1000,r-amp=3,key-types=points-and-ranges-10    80.8µs ± 2%  76.2µs ± 0%  -5.76%  (p=0.008 n=5+5)
IteratorScan/keys=1000,r-amp=7,key-types=points-only-10           105µs ± 1%   100µs ± 0%  -4.89%  (p=0.008 n=5+5)
IteratorScan/keys=1000,r-amp=7,key-types=points-and-ranges-10     108µs ± 1%   103µs ± 0%  -5.09%  (p=0.008 n=5+5)
IteratorScan/keys=1000,r-amp=10,key-types=points-only-10          118µs ± 0%   114µs ± 0%  -4.15%  (p=0.008 n=5+5)
IteratorScan/keys=1000,r-amp=10,key-types=points-and-ranges-10    120µs ± 1%   115µs ± 0%  -3.89%  (p=0.008 n=5+5)
IteratorScan/keys=10000,r-amp=1,key-types=points-only-10          424µs ± 1%   392µs ± 0%  -7.70%  (p=0.008 n=5+5)
IteratorScan/keys=10000,r-amp=1,key-types=points-and-ranges-10    441µs ± 2%   408µs ± 0%  -7.46%  (p=0.008 n=5+5)
IteratorScan/keys=10000,r-amp=3,key-types=points-only-10          729µs ± 2%   685µs ± 0%  -6.15%  (p=0.008 n=5+5)
IteratorScan/keys=10000,r-amp=3,key-types=points-and-ranges-10    741µs ± 2%   704µs ± 0%  -5.07%  (p=0.008 n=5+5)
IteratorScan/keys=10000,r-amp=7,key-types=points-only-10          968µs ± 3%   917µs ± 0%  -5.26%  (p=0.008 n=5+5)
IteratorScan/keys=10000,r-amp=7,key-types=points-and-ranges-10   1.01ms ± 2%  0.94ms ± 0%  -6.45%  (p=0.008 n=5+5)
IteratorScan/keys=10000,r-amp=10,key-types=points-only-10        1.08ms ± 3%  1.02ms ± 0%  -5.55%  (p=0.008 n=5+5)
IteratorScan/keys=10000,r-amp=10,key-types=points-and-ranges-10  1.09ms ± 2%  1.04ms ± 0%  -4.82%  (p=0.008 n=5+5)

Use a trailer comparison to avoid a key comparison during forward iteration
when the next key has a trailer >= than the previous key.

In Cockroach this is expected to help particularly during scans of over
ingested sstables: eg, replicas that were ingested through a snapshot, or
sstables ingested through AddSSTable.

```
name                                                          old time/op    new time/op    delta
MVCCScan_Pebble/rows=1/versions=1/valueSize=64-10               3.80µs ± 0%    3.68µs ± 0%  -3.16%  (p=0.000 n=10+10)
MVCCScan_Pebble/rows=100/versions=1/valueSize=64-10             22.1µs ± 1%    21.0µs ± 0%  -4.79%  (p=0.000 n=10+10)
MVCCScan_Pebble/rows=10000/versions=1/valueSize=64-10           1.38ms ± 1%    1.31ms ± 0%  -4.95%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=false/versions=1/valueSize=8-10            3.85µs ± 1%    3.77µs ± 1%  -2.07%  (p=0.000 n=10+9)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8-10           4.30µs ± 1%    4.24µs ± 1%  -1.33%  (p=0.000 n=9+9)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8-10          8.19µs ± 2%    8.06µs ± 2%  -1.60%  (p=0.023 n=10+10)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8-10             1.54µs ± 0%    1.54µs ± 1%    ~     (p=0.673 n=9+10)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8-10            2.09µs ± 1%    2.08µs ± 1%    ~     (p=0.366 n=9+10)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8-10           5.14µs ± 3%    5.08µs ± 1%  -1.11%  (p=0.015 n=10+10)

name                                                             old time/op  new time/op  delta
IteratorScan/keys=100,r-amp=1,key-types=points-only-10           5.87µs ± 1%  5.48µs ± 2%  -6.59%  (p=0.008 n=5+5)
IteratorScan/keys=100,r-amp=1,key-types=points-and-ranges-10     6.09µs ± 3%  5.62µs ± 0%  -7.79%  (p=0.008 n=5+5)
IteratorScan/keys=100,r-amp=3,key-types=points-only-10           10.2µs ± 1%   9.9µs ± 1%  -2.55%  (p=0.008 n=5+5)
IteratorScan/keys=100,r-amp=3,key-types=points-and-ranges-10     10.5µs ± 1%  10.3µs ± 1%  -1.51%  (p=0.016 n=5+5)
IteratorScan/keys=100,r-amp=7,key-types=points-only-10           15.1µs ± 1%  14.9µs ± 1%  -1.22%  (p=0.008 n=5+5)
IteratorScan/keys=100,r-amp=7,key-types=points-and-ranges-10     15.5µs ± 1%  15.0µs ± 1%  -3.03%  (p=0.008 n=5+5)
IteratorScan/keys=100,r-amp=10,key-types=points-only-10          18.7µs ± 2%  17.8µs ± 1%  -5.12%  (p=0.008 n=5+5)
IteratorScan/keys=100,r-amp=10,key-types=points-and-ranges-10    18.9µs ± 1%  18.1µs ± 0%  -3.93%  (p=0.008 n=5+5)
IteratorScan/keys=1000,r-amp=1,key-types=points-only-10          45.1µs ± 2%  41.2µs ± 0%  -8.49%  (p=0.008 n=5+5)
IteratorScan/keys=1000,r-amp=1,key-types=points-and-ranges-10    47.1µs ± 1%  43.2µs ± 0%  -8.29%  (p=0.008 n=5+5)
IteratorScan/keys=1000,r-amp=3,key-types=points-only-10          77.7µs ± 0%  74.2µs ± 0%  -4.49%  (p=0.008 n=5+5)
IteratorScan/keys=1000,r-amp=3,key-types=points-and-ranges-10    80.8µs ± 2%  76.2µs ± 0%  -5.76%  (p=0.008 n=5+5)
IteratorScan/keys=1000,r-amp=7,key-types=points-only-10           105µs ± 1%   100µs ± 0%  -4.89%  (p=0.008 n=5+5)
IteratorScan/keys=1000,r-amp=7,key-types=points-and-ranges-10     108µs ± 1%   103µs ± 0%  -5.09%  (p=0.008 n=5+5)
IteratorScan/keys=1000,r-amp=10,key-types=points-only-10          118µs ± 0%   114µs ± 0%  -4.15%  (p=0.008 n=5+5)
IteratorScan/keys=1000,r-amp=10,key-types=points-and-ranges-10    120µs ± 1%   115µs ± 0%  -3.89%  (p=0.008 n=5+5)
IteratorScan/keys=10000,r-amp=1,key-types=points-only-10          424µs ± 1%   392µs ± 0%  -7.70%  (p=0.008 n=5+5)
IteratorScan/keys=10000,r-amp=1,key-types=points-and-ranges-10    441µs ± 2%   408µs ± 0%  -7.46%  (p=0.008 n=5+5)
IteratorScan/keys=10000,r-amp=3,key-types=points-only-10          729µs ± 2%   685µs ± 0%  -6.15%  (p=0.008 n=5+5)
IteratorScan/keys=10000,r-amp=3,key-types=points-and-ranges-10    741µs ± 2%   704µs ± 0%  -5.07%  (p=0.008 n=5+5)
IteratorScan/keys=10000,r-amp=7,key-types=points-only-10          968µs ± 3%   917µs ± 0%  -5.26%  (p=0.008 n=5+5)
IteratorScan/keys=10000,r-amp=7,key-types=points-and-ranges-10   1.01ms ± 2%  0.94ms ± 0%  -6.45%  (p=0.008 n=5+5)
IteratorScan/keys=10000,r-amp=10,key-types=points-only-10        1.08ms ± 3%  1.02ms ± 0%  -5.55%  (p=0.008 n=5+5)
IteratorScan/keys=10000,r-amp=10,key-types=points-and-ranges-10  1.09ms ± 2%  1.04ms ± 0%  -4.82%  (p=0.008 n=5+5)
```
@jbowens jbowens requested review from erikgrinaker and a team August 1, 2022 18:39
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice!

Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: 0 of 2 files reviewed, all discussions resolved

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 2 files at r1.
Reviewable status: 1 of 2 files reviewed, 1 unresolved discussion


iterator.go line 568 at r1 (raw file):

		// distributed writes. We expect it to trigger very frequently when
		// iterating through ingested sstables, which contain keys that all have
		// the same sequence number.

nice!

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 2 files reviewed, 1 unresolved discussion


iterator.go line 568 at r1 (raw file):

Previously, sumeerbhola wrote…

nice!

Can pebbleIterator.NextKey do something similar by comparing the hlc.Timestamp and if it has not decreased, avoid doing p.UnsafeKey().Key.Equal(p.keyBuf)? And something similar in pebbleMVCCScanner for nextKey, prevKey and avoid the bytes.Equal(p.curUnsafeKey.Key, p.keyBuf).

Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 2 files reviewed, 1 unresolved discussion


iterator.go line 568 at r1 (raw file):

Previously, sumeerbhola wrote…

Can pebbleIterator.NextKey do something similar by comparing the hlc.Timestamp and if it has not decreased, avoid doing p.UnsafeKey().Key.Equal(p.keyBuf)? And something similar in pebbleMVCCScanner for nextKey, prevKey and avoid the bytes.Equal(p.curUnsafeKey.Key, p.keyBuf).

This is a really good idea, I'll give it a shot.

Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 2 files reviewed, 1 unresolved discussion


iterator.go line 568 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

This is a really good idea, I'll give it a shot.

Unfortunately not as trivial as it sounds. We need to take intents into account, and these can both cause timestamps at 0 which incurs an extra comparison (because they need to be considered ordered before values) , as well as multiple values at the same timestamp which precludes equality comparisons (intent record and provisional value).

After taking the above cases into account, an initial MVCCScan benchmark did not show any significant difference. That could possibly be an artifact of the benchmark timestamp distribution though, and there may be other benchmarks that benefit more.

In any case, this seems worth pursuing, but I've written up an issue for now: cockroachdb/cockroach#85610

Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 2 files reviewed, 1 unresolved discussion


iterator.go line 568 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Unfortunately not as trivial as it sounds. We need to take intents into account, and these can both cause timestamps at 0 which incurs an extra comparison (because they need to be considered ordered before values) , as well as multiple values at the same timestamp which precludes equality comparisons (intent record and provisional value).

After taking the above cases into account, an initial MVCCScan benchmark did not show any significant difference. That could possibly be an artifact of the benchmark timestamp distribution though, and there may be other benchmarks that benefit more.

In any case, this seems worth pursuing, but I've written up an issue for now: cockroachdb/cockroach#85610

Yeah, that's brilliant.

I think we should try to do what we can in a new pebble.Iterator NextPrefix method, eg through Compare-ing the Split suffixes. A NextPrefix method would give us the opportunity to perform Pebble-level optimizations (like the one drafted in #1387).

I think we could also elide the user key comparison performed here in nextUserKey in the NextPrefix call by only performing prefix comparisons.

Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 2 files reviewed, 1 unresolved discussion


iterator.go line 568 at r1 (raw file):

I think we should try to do what we can in a new pebble.Iterator NextPrefix method

Yeah, that seems worthwhile.

That could possibly be an artifact of the benchmark timestamp distribution though

This is definitely the case: the MVCC scan benchmark writes all keys at the same timestamp (as long as we don't specify more than one version), which means we're not hitting the timestamp comparison at all (we can't on equality due to intent collisions).

@jbowens
Copy link
Collaborator Author

jbowens commented Aug 5, 2022

we can't on equality due to intent collisions

but we could still do equality if we do it beneath the intent interleaving, right?

@erikgrinaker
Copy link
Contributor

Yeah, but only on NextKey(). I think we'd get more savings in pebbleMVCCScanner.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants