-
Notifications
You must be signed in to change notification settings - Fork 472
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
db: avoid unnecessary key comparisons during iteration #1840
Conversation
Use a trailer comparison to avoid a key comparison during forward iteration when the next key has a trailer >= than the previous key. In Cockroach this is expected to help particularly during scans of over ingested sstables: eg, replicas that were ingested through a snapshot, or sstables ingested through AddSSTable. ``` name old time/op new time/op delta MVCCScan_Pebble/rows=1/versions=1/valueSize=64-10 3.80µs ± 0% 3.68µs ± 0% -3.16% (p=0.000 n=10+10) MVCCScan_Pebble/rows=100/versions=1/valueSize=64-10 22.1µs ± 1% 21.0µs ± 0% -4.79% (p=0.000 n=10+10) MVCCScan_Pebble/rows=10000/versions=1/valueSize=64-10 1.38ms ± 1% 1.31ms ± 0% -4.95% (p=0.000 n=10+10) MVCCGet_Pebble/batch=false/versions=1/valueSize=8-10 3.85µs ± 1% 3.77µs ± 1% -2.07% (p=0.000 n=10+9) MVCCGet_Pebble/batch=false/versions=10/valueSize=8-10 4.30µs ± 1% 4.24µs ± 1% -1.33% (p=0.000 n=9+9) MVCCGet_Pebble/batch=false/versions=100/valueSize=8-10 8.19µs ± 2% 8.06µs ± 2% -1.60% (p=0.023 n=10+10) MVCCGet_Pebble/batch=true/versions=1/valueSize=8-10 1.54µs ± 0% 1.54µs ± 1% ~ (p=0.673 n=9+10) MVCCGet_Pebble/batch=true/versions=10/valueSize=8-10 2.09µs ± 1% 2.08µs ± 1% ~ (p=0.366 n=9+10) MVCCGet_Pebble/batch=true/versions=100/valueSize=8-10 5.14µs ± 3% 5.08µs ± 1% -1.11% (p=0.015 n=10+10) name old time/op new time/op delta IteratorScan/keys=100,r-amp=1,key-types=points-only-10 5.87µs ± 1% 5.48µs ± 2% -6.59% (p=0.008 n=5+5) IteratorScan/keys=100,r-amp=1,key-types=points-and-ranges-10 6.09µs ± 3% 5.62µs ± 0% -7.79% (p=0.008 n=5+5) IteratorScan/keys=100,r-amp=3,key-types=points-only-10 10.2µs ± 1% 9.9µs ± 1% -2.55% (p=0.008 n=5+5) IteratorScan/keys=100,r-amp=3,key-types=points-and-ranges-10 10.5µs ± 1% 10.3µs ± 1% -1.51% (p=0.016 n=5+5) IteratorScan/keys=100,r-amp=7,key-types=points-only-10 15.1µs ± 1% 14.9µs ± 1% -1.22% (p=0.008 n=5+5) IteratorScan/keys=100,r-amp=7,key-types=points-and-ranges-10 15.5µs ± 1% 15.0µs ± 1% -3.03% (p=0.008 n=5+5) IteratorScan/keys=100,r-amp=10,key-types=points-only-10 18.7µs ± 2% 17.8µs ± 1% -5.12% (p=0.008 n=5+5) IteratorScan/keys=100,r-amp=10,key-types=points-and-ranges-10 18.9µs ± 1% 18.1µs ± 0% -3.93% (p=0.008 n=5+5) IteratorScan/keys=1000,r-amp=1,key-types=points-only-10 45.1µs ± 2% 41.2µs ± 0% -8.49% (p=0.008 n=5+5) IteratorScan/keys=1000,r-amp=1,key-types=points-and-ranges-10 47.1µs ± 1% 43.2µs ± 0% -8.29% (p=0.008 n=5+5) IteratorScan/keys=1000,r-amp=3,key-types=points-only-10 77.7µs ± 0% 74.2µs ± 0% -4.49% (p=0.008 n=5+5) IteratorScan/keys=1000,r-amp=3,key-types=points-and-ranges-10 80.8µs ± 2% 76.2µs ± 0% -5.76% (p=0.008 n=5+5) IteratorScan/keys=1000,r-amp=7,key-types=points-only-10 105µs ± 1% 100µs ± 0% -4.89% (p=0.008 n=5+5) IteratorScan/keys=1000,r-amp=7,key-types=points-and-ranges-10 108µs ± 1% 103µs ± 0% -5.09% (p=0.008 n=5+5) IteratorScan/keys=1000,r-amp=10,key-types=points-only-10 118µs ± 0% 114µs ± 0% -4.15% (p=0.008 n=5+5) IteratorScan/keys=1000,r-amp=10,key-types=points-and-ranges-10 120µs ± 1% 115µs ± 0% -3.89% (p=0.008 n=5+5) IteratorScan/keys=10000,r-amp=1,key-types=points-only-10 424µs ± 1% 392µs ± 0% -7.70% (p=0.008 n=5+5) IteratorScan/keys=10000,r-amp=1,key-types=points-and-ranges-10 441µs ± 2% 408µs ± 0% -7.46% (p=0.008 n=5+5) IteratorScan/keys=10000,r-amp=3,key-types=points-only-10 729µs ± 2% 685µs ± 0% -6.15% (p=0.008 n=5+5) IteratorScan/keys=10000,r-amp=3,key-types=points-and-ranges-10 741µs ± 2% 704µs ± 0% -5.07% (p=0.008 n=5+5) IteratorScan/keys=10000,r-amp=7,key-types=points-only-10 968µs ± 3% 917µs ± 0% -5.26% (p=0.008 n=5+5) IteratorScan/keys=10000,r-amp=7,key-types=points-and-ranges-10 1.01ms ± 2% 0.94ms ± 0% -6.45% (p=0.008 n=5+5) IteratorScan/keys=10000,r-amp=10,key-types=points-only-10 1.08ms ± 3% 1.02ms ± 0% -5.55% (p=0.008 n=5+5) IteratorScan/keys=10000,r-amp=10,key-types=points-and-ranges-10 1.09ms ± 2% 1.04ms ± 0% -4.82% (p=0.008 n=5+5) ```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
Reviewable status: 0 of 2 files reviewed, all discussions resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 2 files at r1.
Reviewable status: 1 of 2 files reviewed, 1 unresolved discussion
iterator.go
line 568 at r1 (raw file):
// distributed writes. We expect it to trigger very frequently when // iterating through ingested sstables, which contain keys that all have // the same sequence number.
nice!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 of 2 files reviewed, 1 unresolved discussion
iterator.go
line 568 at r1 (raw file):
Previously, sumeerbhola wrote…
nice!
Can pebbleIterator.NextKey
do something similar by comparing the hlc.Timestamp
and if it has not decreased, avoid doing p.UnsafeKey().Key.Equal(p.keyBuf)
? And something similar in pebbleMVCCScanner
for nextKey
, prevKey
and avoid the bytes.Equal(p.curUnsafeKey.Key, p.keyBuf)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 of 2 files reviewed, 1 unresolved discussion
iterator.go
line 568 at r1 (raw file):
Previously, sumeerbhola wrote…
Can
pebbleIterator.NextKey
do something similar by comparing thehlc.Timestamp
and if it has not decreased, avoid doingp.UnsafeKey().Key.Equal(p.keyBuf)
? And something similar inpebbleMVCCScanner
fornextKey
,prevKey
and avoid thebytes.Equal(p.curUnsafeKey.Key, p.keyBuf)
.
This is a really good idea, I'll give it a shot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 of 2 files reviewed, 1 unresolved discussion
iterator.go
line 568 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
This is a really good idea, I'll give it a shot.
Unfortunately not as trivial as it sounds. We need to take intents into account, and these can both cause timestamps at 0 which incurs an extra comparison (because they need to be considered ordered before values) , as well as multiple values at the same timestamp which precludes equality comparisons (intent record and provisional value).
After taking the above cases into account, an initial MVCCScan
benchmark did not show any significant difference. That could possibly be an artifact of the benchmark timestamp distribution though, and there may be other benchmarks that benefit more.
In any case, this seems worth pursuing, but I've written up an issue for now: cockroachdb/cockroach#85610
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 of 2 files reviewed, 1 unresolved discussion
iterator.go
line 568 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Unfortunately not as trivial as it sounds. We need to take intents into account, and these can both cause timestamps at 0 which incurs an extra comparison (because they need to be considered ordered before values) , as well as multiple values at the same timestamp which precludes equality comparisons (intent record and provisional value).
After taking the above cases into account, an initial
MVCCScan
benchmark did not show any significant difference. That could possibly be an artifact of the benchmark timestamp distribution though, and there may be other benchmarks that benefit more.In any case, this seems worth pursuing, but I've written up an issue for now: cockroachdb/cockroach#85610
Yeah, that's brilliant.
I think we should try to do what we can in a new pebble.Iterator
NextPrefix
method, eg through Compare
-ing the Split
suffixes. A NextPrefix
method would give us the opportunity to perform Pebble-level optimizations (like the one drafted in #1387).
I think we could also elide the user key comparison performed here in nextUserKey
in the NextPrefix
call by only performing prefix comparisons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 of 2 files reviewed, 1 unresolved discussion
iterator.go
line 568 at r1 (raw file):
I think we should try to do what we can in a new pebble.Iterator NextPrefix method
Yeah, that seems worthwhile.
That could possibly be an artifact of the benchmark timestamp distribution though
This is definitely the case: the MVCC scan benchmark writes all keys at the same timestamp (as long as we don't specify more than one version), which means we're not hitting the timestamp comparison at all (we can't on equality due to intent collisions).
but we could still do equality if we do it beneath the intent interleaving, right? |
Yeah, but only on NextKey(). I think we'd get more savings in pebbleMVCCScanner. |
Use a trailer comparison to avoid a key comparison during forward iteration
when the next key has a trailer ≥ than the previous key.
In Cockroach this is expected to help particularly during scans of over
ingested sstables: eg, replicas that were ingested through a snapshot, or
sstables ingested through AddSSTable.