storage: optimize new SST iterators #83051

erikgrinaker · 2022-06-18T10:22:09Z

The new SST iterators in #82799, e.g. NewPebbleSSTMemIterator(), are much slower than the existing NewMemSSTIterator(), around 80% or so. Some of this is due to CRDB-side overhead, in particular the use of pebbleIterator and VerifyingMVCCIterator, but the majority appears to be in Pebble via the new NewExternalIter() SST iterator. We need to optimize these.

This can be seen with BenchmarkSSTIterator, by changing NewMemSSTIterator() to NewPebbleMemSSTIterator() in runSSTIterator() on top of #82799.

Many of the optimizations that address #82559 and #83049 would likely apply here as well.

Jira issue: CRDB-16822

Epic CRDB-2624

The text was updated successfully, but these errors were encountered:

blathers-crl · 2022-06-18T10:22:10Z

cc @cockroachdb/replication

jbowens · 2022-07-26T15:08:59Z

I expect most of the slowdown comes from the fact that the Pebble iterator resolves duplicates. This requires copying every key and an extra key comparison.

erikgrinaker · 2022-07-26T16:05:06Z

Oof, yeah, that isn't great. The probability of duplicates is very low though, I wonder if we could do something cheaper and then backtrack if we find a possible match.

This is purely a refactor. The only prod refactor is in the SSTBatcher, where it calls ComputeStatsForRange. This should not have a significant perf hit as the iterator only surfaces point keys and the bulk of sst_batcher comput time is spent in doing ingestion. Currently the pebbleIterator is slightly slower than the old iterator, but optimizations are coming soon (cockroachdb#83051). Release note: none

Previously BACKUP would not back up range tombstones. With this patch, BACKUPs with revision_history will backup range tombstones. Non-revision history backups are not affected by this diff because MVCCExportToSST filters all tombstones out of the backup already. Specifically, this patch replaces the iterators used in the backup_processor with the pebbleIterator, which has baked in range key support. This refactor introduces a 5% regression in backup runtime, even when the backup has no range keys, though cockroachdb#83051 hopes to address this gap. See details below on the benchmark experiment. At this point a backup with range keys is restorable, thanks to cockroachdb#84214. Note that the restore codebase still touches iterators that are not range key aware. This is not a problem because restored data does not have range keys, nor do the empty ranges restore dumps data into. These iterators (e.g. in SSTBatcher and in CheckSSTConflicts) will be updated when cockroachdb#70428 gets fixed. Fixes cockroachdb#71155 Release note: none To benchmark this diff, the following commands were used on the following sha a5ccdc3, with and without this commit, over three trials: ``` roachprod create -n 5 --gce-machine-type=n2-standard-16 $CLUSTER roachprod put $CLUSTER [build] cockroach roachprod wipe $CLUSTER; roachprod start $CLUSTER; roachprod run $CLUSTER:1 -- "./cockroach workload init bank --rows 1000000000" roachprod sql $CLUSTER:1 -- -e "BACKUP INTO 'gs://somebucket/michael-rangkey?AUTH=implicit'" ``` The backup on the control binary took on average 478 seconds with a stdev of 13 seconds, while the backup with the treatment binary took on average 499 seconds with stddev of 8 seconds.

84875: backupccl: handle range keys in BACKUP r=erikgrinaker a=msbutler Previously BACKUP would not back up range tombstones. With this patch, BACKUPs with revision_history will backup range tombstones. Non-revision history backups are not affected by this diff because MVCCExportToSST filters all tombstones out of the backup already. Specifically, this patch replaces the iterators used in the backup_processor with the pebbleIterator, which has baked in range key support. This refactor introduces a 5% regression in backup runtime, even when the backup has no range keys, though #83051 hopes to address this gap. See details below on the benchmark experiment. At this point a backup with range keys is restorable, thanks to #84214. Note that the restore codebase still touches iterators that are not range key aware. This is not a problem because restored data does not have range keys, nor do the empty ranges restore dumps data into. These iterators (e.g. in SSTBatcher and in CheckSSTConflicts) will be updated when #70428 gets fixed. Fixes #71155 Release note: none To benchmark this diff, the following commands were used on the following sha a5ccdc3, with and without this commit, over three trials: ``` roachprod create -n 5 --gce-machine-type=n2-standard-16 $CLUSTER roachprod put $CLUSTER [build] cockroach roachprod wipe $CLUSTER; roachprod start $CLUSTER; roachprod run $CLUSTER:1 -- "./cockroach workload init bank --rows 1000000000" roachprod sql $CLUSTER:1 -- -e "BACKUP INTO 'gs://somebucket/michael-rangkey?AUTH=implicit'" ``` The backup on the control binary took on average 478 seconds with a stdev of 13 seconds, while the backup with the treatment binary took on average 499 seconds with stddev of 8 seconds. 84883: kvserver: add server-side transaction retry metrics r=arulajmani a=arulajmani This patch adds a few new metrics to track successful/failed server-side transaction retries. Specifically, whenever we attempt to retry a read or write batch or run into a read within uncertainty interval error, we increment specific counters indicating if the retry was successful or not. Release note: None 85074: upgrades: add checkpointing for `raftAppliedIndexTermMigration` r=irfansharif a=erikgrinaker Forward-port of #84909, for posterity. ---- The `raftAppliedIndexTermMigration` upgrade migration could be unreliable. It iterates over all ranges and runs a `Migrate` request which must be applied on all replicas. However, if any ranges merge or replicas are unavailable, the migration fails and starts over from the beginning. In large clusters with many ranges, this meant that it might never complete. This patch makes the upgrade more robust, by retrying each `Migrate` request 5 times, and checkpointing the progress after every fifth batch (1000 ranges), allowing resumption on failure. At some point this should be made part of the migration infrastructure. NB: This fix was initially submitted for 22.1, and even though the migration will be removed for 22.2, it is forward-ported for posterity. Release note: None 85086: eval: stop ignoring all ResolveOIDFromOID errors r=ajwerner a=rafiss fixes #84448 The decision about whether an error is safe to ignore is made at the place where the error is created/returned. This way, the callers don't need to be aware of any new error codes that the implementation may start returning in the future. Release note (bug fix): Fixed incorrect error handling that could cause casts to OID types to fail in some cases. Co-authored-by: Michael Butler <[email protected]> Co-authored-by: Arul Ajmani <[email protected]> Co-authored-by: Erik Grinaker <[email protected]> Co-authored-by: Rafi Shamim <[email protected]>

85068: bulk: replace old SSTIterators with PebbleSSTIterator in bulk stack r=erikgrinaker a=msbutler This is purely a refactor. The only prod refactor is in the SSTBatcher, where it calls ComputeStatsForRange. This should not have a significant perf hit as the iterator only surfaces point keys and the bulk of sst_batcher comput time is spent in doing ingestion. Currently the pebbleIterator is slightly slower than the old iterator, but optimizations are coming soon (#83051). Release note: none Co-authored-by: Michael Butler <[email protected]>

jbowens · 2022-08-01T19:57:29Z

@msbutler what structure do we have in the backups? Is it correct that we within each individual full backup and each individual backup we have an ordered series of nonoverlapping files? I think we should also be able to refactor the interface so that we can propagate this information to the Pebble iterator, and it can create num-backups levels in the merging iterator instead of num-files levels.

msbutler · 2022-08-01T20:29:13Z

The set of files produced by an individual backup may have overlapping key spans, ~~however there will never be overlapping keys across files (i.e. if a@2 is in file foo, then a@anyTimestamp will be in file foo)~~. CORRECTION: A key's versions can get split across different backup SSTs.

jbowens · 2022-08-02T12:37:44Z

I think we'll want to change the interface to accept a [][]sstable.ReadableFile, where each subslice must not contain overlapping point keys.

This will allow us to organize each backup into its own 'level', by implementing a simple level iterator that chains all the internal point key iterators within a backup. There's a bit of a complication that pebble.Iterator supports seeking, but we don't the bounds of these files. We don't require seeking for this use case, but it's a poor UX to expose a *pebble.Iterator that's only intended for forward scans. I think we could support seeking without too much work by binary searching among the level's files, First()-ing each to retrieve its smallest point key for comparison.

@nicktrav — did you want to pick this up after the min-table version work, or something else?

msbutler · 2022-08-02T12:48:06Z

@stevendanna I just thought of one wrinkle to my answer to Jackson's question above. Could you verify this? If the backup retries, we might write several SSTs that contain the same versioned key, however, only one of those SST's will ever get check pointed. So, from RESTORE's perspective, this still holds:

The set of files produced by an individual backup may have overlapping key spans, and a key's versions can get split across different backup ssts.

nicktrav · 2022-08-02T13:51:56Z

@nicktrav — did you want to pick this up after the min-table version work, or something else?

SGTM.

itsbilal · 2022-08-12T19:34:32Z

@jbowens Just to clarify, we'd still need the pausing behaviour for the simpler point-key levelIter right (i.e. the one that operates on observed file bounds and not FileMetadata and only needs to support forward iteration)? Since we could still have range deletes in one fake level that apply to points in another level.

Also on that note, we'd need a similar simpler levelIter for range keys, right? Or is it sufficient to still expose all the range key iterators to its merging iter, with the hope that there aren't many files with range keys?

jbowens · 2022-08-12T20:35:25Z

Just to clarify, we'd still need the pausing behaviour for the simpler point-key levelIter right (i.e. the one that operates on observed file bounds and not FileMetadata and only needs to support forward iteration)? Since we could still have range deletes in one fake level that apply to points in another level.

Yeah, I guess so. It's not needed for the intended use case of backups, but we should try to keep NewExternalIter very general, even if some of our optimizations are not general.

Also on that note, we'd need a similar simpler levelIter for range keys, right? Or is it sufficient to still expose all the range key iterators to its merging iter, with the hope that there aren't many files with range keys?

I think that's sufficient since range keys should be relatively rare, and I think range keys are allowed to overlap between backup files from the same backup.

erikgrinaker · 2022-08-15T16:13:03Z

Recent benchmarks comparing the old sstIterator vs. the new pebbleIterator based one:

name                                      old time/op  new time/op  delta
SSTIterator/X/keys=1/verify=false-24      39.8µs ± 1%  45.5µs ± 4%  +14.35%  (p=0.000 n=9+10)
SSTIterator/X/keys=1/verify=true-24       39.8µs ± 1%  44.1µs ± 4%  +10.68%  (p=0.000 n=9+10)
SSTIterator/X/keys=100/verify=false-24    47.8µs ± 0%  57.1µs ± 1%  +19.56%  (p=0.000 n=8+10)
SSTIterator/X/keys=100/verify=true-24     50.3µs ± 1%  61.9µs ± 1%  +22.97%  (p=0.000 n=9+10)
SSTIterator/X/keys=10000/verify=false-24   850µs ± 0%  1498µs ± 1%  +76.28%  (p=0.000 n=8+9)
SSTIterator/X/keys=10000/verify=true-24   1.07ms ± 1%  1.92ms ± 0%  +80.40%  (p=0.000 n=10+9)

Currently, we create a new merging iterator level for each file passed into NewExternalIter. This is unnecessary for most use-cases of creating ExternalIters around lots of sstables, as we can externally guarantee that many of those sstables won't have overlapping points with each other. We can have the caller pass this knowledge by specifying a [][]sstable.ReadableFile where each sub-slice obeys level invariants for files within it, and is also already sorted by user keys. This change makes the interface change to allow for the above optimization, and also adds a `simpleLevelIter` that implements forward iteration within a single "level". For files that don't contain range deletes, we shove all the point iters into one `simpleLevelIter`, greatly reducing merging iterator levels and speeding up its operations by a lot. Fixes cockroachdb/cockroach#83051.

86259: storage: implement `RangeKeyChanged()` for `MVCCIncrementalIterator` r=tbg a=erikgrinaker **storage: use `RangeKeyChanged()` in `MVCCIncrementalIterator`** This patch uses `RangeKeyChanged()` to detect changes to range keys in `MVCCIncrementalIterator`. There are no functional changes. Some quick benchmarks, using catchup scans: ``` name old time/op new time/op delta CatchUpScan/mixed-case/withDiff=true/perc=0.00/numRangeKeys=0-24 538ms ± 1% 535ms ± 1% ~ (p=0.211 n=10+9) CatchUpScan/mixed-case/withDiff=true/perc=0.00/numRangeKeys=1-24 690ms ± 0% 590ms ± 1% -14.56% (p=0.000 n=9+10) CatchUpScan/mixed-case/withDiff=true/perc=0.00/numRangeKeys=100-24 743ms ± 1% 646ms ± 1% -13.13% (p=0.000 n=9+9) CatchUpScan/mixed-case/withDiff=false/perc=0.00/numRangeKeys=0-24 794ms ± 1% 794ms ± 0% ~ (p=0.579 n=10+10) CatchUpScan/mixed-case/withDiff=false/perc=0.00/numRangeKeys=1-24 966ms ± 0% 911ms ± 1% -5.72% (p=0.000 n=10+10) CatchUpScan/mixed-case/withDiff=false/perc=0.00/numRangeKeys=100-24 974ms ± 0% 920ms ± 0% -5.51% (p=0.000 n=10+10) ``` Release justification: bug fixes and low-risk updates to new functionality. Release note: None **storage: implement `RangeKeyChanged()` for `MVCCIncrementalIterator`** This patch implements `RangeKeyChanged()` for `MVCCIncrementalIterator`. It only fires if the time bound range keys change, not if a `Next(Key)IgnoringTime()` operation reveals additional range keys. Resolves #86105. Release justification: bug fixes and low-risk updates to new functionality Release note: None **storage: add `MVCCIncrementalIterator.RangeKeysIgnoringTime()`** This patch changes `MVCCIncrementalIterator` range key behavior following a `Next(Key)IgnoringTime()` call. Previously, range key methods would then expose unfiltered range keys. Now, the standard range key methods only apply to filtered range keys, and an additional `RangeKeysIgnoringTime()` method provides access to unfiltered range keys. This implies that if such a call steps onto a range key that's entirely outside of the time bounds then: * `HasPointAndRange()` will return `false`,`false` if on a bare range key. * `RangeKeyChanged()` will not fire, unless stepping off of a range key within the time bounds. * `RangeBounds()` and `RangeKeys()` will return empty results. This is done to avoid conditional range key handling following these calls, except for the exact sites where the caller is interested in the unfiltered range keys. Release justification: bug fixes and low-risk updates to new functionality Release note: None 86440: storage: use concrete `pebbleIterator` in `intentInterleavingIter` r=tbg a=erikgrinaker **storage: tweak `unsafeMVCCIterator` construction** Release justification: non-production code changes Release note: None **storage: inline some `intentInterleavingIter` methods** This patch splits up `maybeSkipIntentRangeKeys()` and `maybeSuppressRangeKeyChanged()` to allow for mid-stack inlining. I doubt that the gains are as large as these microbenchmarks claim, and there's a fair bit of variance between runs, but it can't hurt. ``` name old time/op new time/op delta MVCCScan_Pebble/rows=1/versions=1/valueSize=64/numRangeKeys=0-24 5.37µs ± 2% 5.43µs ± 2% +1.13% (p=0.041 n=10+10) MVCCScan_Pebble/rows=100/versions=1/valueSize=64/numRangeKeys=0-24 38.2µs ± 2% 38.2µs ± 2% ~ (p=0.971 n=10+10) MVCCScan_Pebble/rows=10000/versions=1/valueSize=64/numRangeKeys=0-24 2.79ms ± 2% 2.71ms ± 2% -2.59% (p=0.000 n=10+10) MVCCReverseScan_Pebble/rows=1/versions=1/valueSize=64/numRangeKeys=0-24 5.99µs ± 1% 5.99µs ± 2% ~ (p=0.541 n=10+10) MVCCReverseScan_Pebble/rows=100/versions=1/valueSize=64/numRangeKeys=0-24 51.7µs ± 3% 52.1µs ± 1% ~ (p=0.631 n=10+10) MVCCReverseScan_Pebble/rows=10000/versions=1/valueSize=64/numRangeKeys=0-24 3.88ms ± 2% 3.87ms ± 1% ~ (p=0.897 n=10+8) MVCCComputeStats_Pebble/valueSize=32/numRangeKeys=0-24 158ms ± 1% 155ms ± 1% -2.34% (p=0.000 n=10+9) ``` Touches #82559. Release justification: bug fixes and low-risk updates to new functionality Release note: None **storage: use concrete `pebbleIterator` in `intentInterleavingIter`** Since `intentInterleavingIter` always constructs the underlying iterators from the given reader, and these readers always construct `*pebbleIterator`, it can use the concrete type rather than interfaces for both iterators. This avoids dynamic dispatch, yielding a decent performance improvement. Unfortunately, this requires disabling `unsafeMVCCIterator` inside `intentInterleavingIter`. This wasn't always enabled anyway, since it was omitted both for the engine iterator and when using an engine directly (which doesn't have consistent iterators). ``` name old time/op new time/op delta MVCCScan_Pebble/rows=1/versions=1/valueSize=64/numRangeKeys=0-24 5.43µs ± 2% 5.45µs ± 2% ~ (p=0.566 n=10+10) MVCCScan_Pebble/rows=100/versions=1/valueSize=64/numRangeKeys=0-24 38.2µs ± 2% 37.0µs ± 1% -3.02% (p=0.000 n=10+10) MVCCScan_Pebble/rows=10000/versions=1/valueSize=64/numRangeKeys=0-24 2.71ms ± 2% 2.66ms ± 1% -1.83% (p=0.000 n=10+9) MVCCReverseScan_Pebble/rows=1/versions=1/valueSize=64/numRangeKeys=0-24 5.99µs ± 2% 5.86µs ± 2% -2.15% (p=0.000 n=10+10) MVCCReverseScan_Pebble/rows=100/versions=1/valueSize=64/numRangeKeys=0-24 52.1µs ± 1% 50.2µs ± 2% -3.77% (p=0.000 n=10+10) MVCCReverseScan_Pebble/rows=10000/versions=1/valueSize=64/numRangeKeys=0-24 3.87ms ± 1% 3.83ms ± 1% -1.26% (p=0.000 n=8+10) MVCCComputeStats_Pebble/valueSize=32/numRangeKeys=0-24 155ms ± 1% 155ms ± 1% ~ (p=0.842 n=9+10) ``` Touches #82559. Release justification: low risk, high benefit changes to existing functionality Release note: None 86478: storage: use concrete `pebbleIterator` in `verifyingIterator` r=tbg a=erikgrinaker **storage: add Pebble SST iterator benchmarks** Release justification: non-production code changes Release note: None **storage: use concrete `pebbleIterator` in `verifyingIterator`** Gives a slight performance boost, since it avoid dynamic dispatch: ``` name old time/op new time/op delta SSTIterator/keys=1/variant=pebble/verify=true-24 42.8µs ± 1% 42.4µs ± 1% -0.79% (p=0.043 n=10+10) SSTIterator/keys=100/variant=pebble/verify=true-24 61.8µs ± 1% 60.7µs ± 1% -1.64% (p=0.000 n=10+10) SSTIterator/keys=10000/variant=pebble/verify=true-24 1.91ms ± 0% 1.88ms ± 0% -1.79% (p=0.000 n=10+10) ``` An attempt was also made at using `RangeKeyChanged()` instead of `HasPointAndRange()`, but this had no effect. Touches #83051. Release justification: bug fixes and low-risk updates to new functionality Release note: None 86513: storage: use `RangeKeyChanged` in `ReadAsOfIterator` r=tbg a=erikgrinaker This patch uses `RangeKeyChanged()` to detect range keys in `ReadAsOfIterator`, and caches them to improve performance. It also fixes a bug where the iterator would fail to detect tombstones with a non-empty `MVCCValueHeader`. Resolves #84714. Release justification: bug fixes and low-risk updates to new functionality Release note: None 86514: storage: use `RangeKeyChanged()` in `MVCCDeleteRangeUsingTombstone()` r=tbg a=erikgrinaker Release justification: bug fixes and low-risk updates to new functionality Release note: None 86529: storage: omit unnecessary range key calls during intent resolution r=tbg a=erikgrinaker Release justification: bug fixes and low-risk updates to new functionality Release note: None Co-authored-by: Erik Grinaker <[email protected]>

blathers-crl bot added the A-kv-replication Relating to Raft, consensus, and coordination. label Jun 18, 2022

nicktrav added the T-storage Storage Team label Jul 13, 2022

msbutler mentioned this issue Jul 26, 2022

bulk: replace old SSTIterators with PebbleSSTIterator in bulk stack #85068

Merged

msbutler mentioned this issue Jul 26, 2022

backupccl: handle range keys in BACKUP #84875

Merged

nicktrav self-assigned this Aug 2, 2022

itsbilal assigned itsbilal and unassigned nicktrav Aug 8, 2022

exalate-issue-sync bot removed the T-kv-replication label Aug 12, 2022

itsbilal mentioned this issue Aug 17, 2022

*: add simpleLevelIterator, reduce merging levels in external iter cockroachdb/pebble#1905

Merged

itsbilal closed this as completed in cockroachdb/pebble#1905 Aug 19, 2022

erikgrinaker mentioned this issue Aug 19, 2022

storage: use concrete pebbleIterator in verifyingIterator #86478

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

storage: optimize new SST iterators #83051

storage: optimize new SST iterators #83051

erikgrinaker commented Jun 18, 2022 •

edited

Loading

blathers-crl bot commented Jun 18, 2022

jbowens commented Jul 26, 2022 •

edited

Loading

erikgrinaker commented Jul 26, 2022

jbowens commented Aug 1, 2022

msbutler commented Aug 1, 2022 •

edited

Loading

jbowens commented Aug 2, 2022

msbutler commented Aug 2, 2022 •

edited

Loading

nicktrav commented Aug 2, 2022

itsbilal commented Aug 12, 2022

jbowens commented Aug 12, 2022

erikgrinaker commented Aug 15, 2022

storage: optimize new SST iterators #83051

storage: optimize new SST iterators #83051

Comments

erikgrinaker commented Jun 18, 2022 • edited Loading

blathers-crl bot commented Jun 18, 2022

jbowens commented Jul 26, 2022 • edited Loading

erikgrinaker commented Jul 26, 2022

jbowens commented Aug 1, 2022

msbutler commented Aug 1, 2022 • edited Loading

jbowens commented Aug 2, 2022

msbutler commented Aug 2, 2022 • edited Loading

nicktrav commented Aug 2, 2022

itsbilal commented Aug 12, 2022

jbowens commented Aug 12, 2022

erikgrinaker commented Aug 15, 2022

erikgrinaker commented Jun 18, 2022 •

edited

Loading

jbowens commented Jul 26, 2022 •

edited

Loading

msbutler commented Aug 1, 2022 •

edited

Loading

msbutler commented Aug 2, 2022 •

edited

Loading