-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: optimize new SST iterators #83051
Comments
cc @cockroachdb/replication |
I expect most of the slowdown comes from the fact that the Pebble iterator resolves duplicates. This requires copying every key and an extra key comparison. |
Oof, yeah, that isn't great. The probability of duplicates is very low though, I wonder if we could do something cheaper and then backtrack if we find a possible match. |
This is purely a refactor. The only prod refactor is in the SSTBatcher, where it calls ComputeStatsForRange. This should not have a significant perf hit as the iterator only surfaces point keys and the bulk of sst_batcher comput time is spent in doing ingestion. Currently the pebbleIterator is slightly slower than the old iterator, but optimizations are coming soon (cockroachdb#83051). Release note: none
Previously BACKUP would not back up range tombstones. With this patch, BACKUPs with revision_history will backup range tombstones. Non-revision history backups are not affected by this diff because MVCCExportToSST filters all tombstones out of the backup already. Specifically, this patch replaces the iterators used in the backup_processor with the pebbleIterator, which has baked in range key support. This refactor introduces a 5% regression in backup runtime, even when the backup has no range keys, though cockroachdb#83051 hopes to address this gap. See details below on the benchmark experiment. At this point a backup with range keys is restorable, thanks to cockroachdb#84214. Note that the restore codebase still touches iterators that are not range key aware. This is not a problem because restored data does not have range keys, nor do the empty ranges restore dumps data into. These iterators (e.g. in SSTBatcher and in CheckSSTConflicts) will be updated when cockroachdb#70428 gets fixed. Fixes cockroachdb#71155 Release note: none To benchmark this diff, the following commands were used on the following sha a5ccdc3, with and without this commit, over three trials: ``` roachprod create -n 5 --gce-machine-type=n2-standard-16 $CLUSTER roachprod put $CLUSTER [build] cockroach roachprod wipe $CLUSTER; roachprod start $CLUSTER; roachprod run $CLUSTER:1 -- "./cockroach workload init bank --rows 1000000000" roachprod sql $CLUSTER:1 -- -e "BACKUP INTO 'gs://somebucket/michael-rangkey?AUTH=implicit'" ``` The backup on the control binary took on average 478 seconds with a stdev of 13 seconds, while the backup with the treatment binary took on average 499 seconds with stddev of 8 seconds.
84875: backupccl: handle range keys in BACKUP r=erikgrinaker a=msbutler Previously BACKUP would not back up range tombstones. With this patch, BACKUPs with revision_history will backup range tombstones. Non-revision history backups are not affected by this diff because MVCCExportToSST filters all tombstones out of the backup already. Specifically, this patch replaces the iterators used in the backup_processor with the pebbleIterator, which has baked in range key support. This refactor introduces a 5% regression in backup runtime, even when the backup has no range keys, though #83051 hopes to address this gap. See details below on the benchmark experiment. At this point a backup with range keys is restorable, thanks to #84214. Note that the restore codebase still touches iterators that are not range key aware. This is not a problem because restored data does not have range keys, nor do the empty ranges restore dumps data into. These iterators (e.g. in SSTBatcher and in CheckSSTConflicts) will be updated when #70428 gets fixed. Fixes #71155 Release note: none To benchmark this diff, the following commands were used on the following sha a5ccdc3, with and without this commit, over three trials: ``` roachprod create -n 5 --gce-machine-type=n2-standard-16 $CLUSTER roachprod put $CLUSTER [build] cockroach roachprod wipe $CLUSTER; roachprod start $CLUSTER; roachprod run $CLUSTER:1 -- "./cockroach workload init bank --rows 1000000000" roachprod sql $CLUSTER:1 -- -e "BACKUP INTO 'gs://somebucket/michael-rangkey?AUTH=implicit'" ``` The backup on the control binary took on average 478 seconds with a stdev of 13 seconds, while the backup with the treatment binary took on average 499 seconds with stddev of 8 seconds. 84883: kvserver: add server-side transaction retry metrics r=arulajmani a=arulajmani This patch adds a few new metrics to track successful/failed server-side transaction retries. Specifically, whenever we attempt to retry a read or write batch or run into a read within uncertainty interval error, we increment specific counters indicating if the retry was successful or not. Release note: None 85074: upgrades: add checkpointing for `raftAppliedIndexTermMigration` r=irfansharif a=erikgrinaker Forward-port of #84909, for posterity. ---- The `raftAppliedIndexTermMigration` upgrade migration could be unreliable. It iterates over all ranges and runs a `Migrate` request which must be applied on all replicas. However, if any ranges merge or replicas are unavailable, the migration fails and starts over from the beginning. In large clusters with many ranges, this meant that it might never complete. This patch makes the upgrade more robust, by retrying each `Migrate` request 5 times, and checkpointing the progress after every fifth batch (1000 ranges), allowing resumption on failure. At some point this should be made part of the migration infrastructure. NB: This fix was initially submitted for 22.1, and even though the migration will be removed for 22.2, it is forward-ported for posterity. Release note: None 85086: eval: stop ignoring all ResolveOIDFromOID errors r=ajwerner a=rafiss fixes #84448 The decision about whether an error is safe to ignore is made at the place where the error is created/returned. This way, the callers don't need to be aware of any new error codes that the implementation may start returning in the future. Release note (bug fix): Fixed incorrect error handling that could cause casts to OID types to fail in some cases. Co-authored-by: Michael Butler <[email protected]> Co-authored-by: Arul Ajmani <[email protected]> Co-authored-by: Erik Grinaker <[email protected]> Co-authored-by: Rafi Shamim <[email protected]>
85068: bulk: replace old SSTIterators with PebbleSSTIterator in bulk stack r=erikgrinaker a=msbutler This is purely a refactor. The only prod refactor is in the SSTBatcher, where it calls ComputeStatsForRange. This should not have a significant perf hit as the iterator only surfaces point keys and the bulk of sst_batcher comput time is spent in doing ingestion. Currently the pebbleIterator is slightly slower than the old iterator, but optimizations are coming soon (#83051). Release note: none Co-authored-by: Michael Butler <[email protected]>
@msbutler what structure do we have in the backups? Is it correct that we within each individual full backup and each individual backup we have an ordered series of nonoverlapping files? I think we should also be able to refactor the interface so that we can propagate this information to the Pebble iterator, and it can create num-backups levels in the merging iterator instead of num-files levels. |
The set of files produced by an individual backup may have overlapping key spans, |
I think we'll want to change the interface to accept a This will allow us to organize each backup into its own 'level', by implementing a simple level iterator that chains all the internal point key iterators within a backup. There's a bit of a complication that @nicktrav — did you want to pick this up after the min-table version work, or something else? |
@stevendanna I just thought of one wrinkle to my answer to Jackson's question above. Could you verify this? If the backup retries, we might write several SSTs that contain the same versioned key, however, only one of those SST's will ever get check pointed. So, from RESTORE's perspective, this still holds:
|
SGTM. |
@jbowens Just to clarify, we'd still need the pausing behaviour for the simpler point-key levelIter right (i.e. the one that operates on observed file bounds and not Also on that note, we'd need a similar simpler levelIter for range keys, right? Or is it sufficient to still expose all the range key iterators to its merging iter, with the hope that there aren't many files with range keys? |
Yeah, I guess so. It's not needed for the intended use case of backups, but we should try to keep NewExternalIter very general, even if some of our optimizations are not general.
I think that's sufficient since range keys should be relatively rare, and I think range keys are allowed to overlap between backup files from the same backup. |
Recent benchmarks comparing the old
|
Currently, we create a new merging iterator level for each file passed into NewExternalIter. This is unnecessary for most use-cases of creating ExternalIters around lots of sstables, as we can externally guarantee that many of those sstables won't have overlapping points with each other. We can have the caller pass this knowledge by specifying a [][]sstable.ReadableFile where each sub-slice obeys level invariants for files within it, and is also already sorted by user keys. This change makes the interface change to allow for the above optimization, and also adds a `simpleLevelIter` that implements forward iteration within a single "level". For files that don't contain range deletes, we shove all the point iters into one `simpleLevelIter`, greatly reducing merging iterator levels and speeding up its operations by a lot. Fixes cockroachdb/cockroach#83051.
Currently, we create a new merging iterator level for each file passed into NewExternalIter. This is unnecessary for most use-cases of creating ExternalIters around lots of sstables, as we can externally guarantee that many of those sstables won't have overlapping points with each other. We can have the caller pass this knowledge by specifying a [][]sstable.ReadableFile where each sub-slice obeys level invariants for files within it, and is also already sorted by user keys. This change makes the interface change to allow for the above optimization, and also adds a `simpleLevelIter` that implements forward iteration within a single "level". For files that don't contain range deletes, we shove all the point iters into one `simpleLevelIter`, greatly reducing merging iterator levels and speeding up its operations by a lot. Fixes cockroachdb/cockroach#83051.
Currently, we create a new merging iterator level for each file passed into NewExternalIter. This is unnecessary for most use-cases of creating ExternalIters around lots of sstables, as we can externally guarantee that many of those sstables won't have overlapping points with each other. We can have the caller pass this knowledge by specifying a [][]sstable.ReadableFile where each sub-slice obeys level invariants for files within it, and is also already sorted by user keys. This change makes the interface change to allow for the above optimization, and also adds a `simpleLevelIter` that implements forward iteration within a single "level". For files that don't contain range deletes, we shove all the point iters into one `simpleLevelIter`, greatly reducing merging iterator levels and speeding up its operations by a lot. Fixes cockroachdb/cockroach#83051.
Currently, we create a new merging iterator level for each file passed into NewExternalIter. This is unnecessary for most use-cases of creating ExternalIters around lots of sstables, as we can externally guarantee that many of those sstables won't have overlapping points with each other. We can have the caller pass this knowledge by specifying a [][]sstable.ReadableFile where each sub-slice obeys level invariants for files within it, and is also already sorted by user keys. This change makes the interface change to allow for the above optimization, and also adds a `simpleLevelIter` that implements forward iteration within a single "level". For files that don't contain range deletes, we shove all the point iters into one `simpleLevelIter`, greatly reducing merging iterator levels and speeding up its operations by a lot. Fixes cockroachdb/cockroach#83051.
Currently, we create a new merging iterator level for each file passed into NewExternalIter. This is unnecessary for most use-cases of creating ExternalIters around lots of sstables, as we can externally guarantee that many of those sstables won't have overlapping points with each other. We can have the caller pass this knowledge by specifying a [][]sstable.ReadableFile where each sub-slice obeys level invariants for files within it, and is also already sorted by user keys. This change makes the interface change to allow for the above optimization, and also adds a `simpleLevelIter` that implements forward iteration within a single "level". For files that don't contain range deletes, we shove all the point iters into one `simpleLevelIter`, greatly reducing merging iterator levels and speeding up its operations by a lot. Fixes cockroachdb/cockroach#83051.
86259: storage: implement `RangeKeyChanged()` for `MVCCIncrementalIterator` r=tbg a=erikgrinaker **storage: use `RangeKeyChanged()` in `MVCCIncrementalIterator`** This patch uses `RangeKeyChanged()` to detect changes to range keys in `MVCCIncrementalIterator`. There are no functional changes. Some quick benchmarks, using catchup scans: ``` name old time/op new time/op delta CatchUpScan/mixed-case/withDiff=true/perc=0.00/numRangeKeys=0-24 538ms ± 1% 535ms ± 1% ~ (p=0.211 n=10+9) CatchUpScan/mixed-case/withDiff=true/perc=0.00/numRangeKeys=1-24 690ms ± 0% 590ms ± 1% -14.56% (p=0.000 n=9+10) CatchUpScan/mixed-case/withDiff=true/perc=0.00/numRangeKeys=100-24 743ms ± 1% 646ms ± 1% -13.13% (p=0.000 n=9+9) CatchUpScan/mixed-case/withDiff=false/perc=0.00/numRangeKeys=0-24 794ms ± 1% 794ms ± 0% ~ (p=0.579 n=10+10) CatchUpScan/mixed-case/withDiff=false/perc=0.00/numRangeKeys=1-24 966ms ± 0% 911ms ± 1% -5.72% (p=0.000 n=10+10) CatchUpScan/mixed-case/withDiff=false/perc=0.00/numRangeKeys=100-24 974ms ± 0% 920ms ± 0% -5.51% (p=0.000 n=10+10) ``` Release justification: bug fixes and low-risk updates to new functionality. Release note: None **storage: implement `RangeKeyChanged()` for `MVCCIncrementalIterator`** This patch implements `RangeKeyChanged()` for `MVCCIncrementalIterator`. It only fires if the time bound range keys change, not if a `Next(Key)IgnoringTime()` operation reveals additional range keys. Resolves #86105. Release justification: bug fixes and low-risk updates to new functionality Release note: None **storage: add `MVCCIncrementalIterator.RangeKeysIgnoringTime()`** This patch changes `MVCCIncrementalIterator` range key behavior following a `Next(Key)IgnoringTime()` call. Previously, range key methods would then expose unfiltered range keys. Now, the standard range key methods only apply to filtered range keys, and an additional `RangeKeysIgnoringTime()` method provides access to unfiltered range keys. This implies that if such a call steps onto a range key that's entirely outside of the time bounds then: * `HasPointAndRange()` will return `false`,`false` if on a bare range key. * `RangeKeyChanged()` will not fire, unless stepping off of a range key within the time bounds. * `RangeBounds()` and `RangeKeys()` will return empty results. This is done to avoid conditional range key handling following these calls, except for the exact sites where the caller is interested in the unfiltered range keys. Release justification: bug fixes and low-risk updates to new functionality Release note: None 86440: storage: use concrete `pebbleIterator` in `intentInterleavingIter` r=tbg a=erikgrinaker **storage: tweak `unsafeMVCCIterator` construction** Release justification: non-production code changes Release note: None **storage: inline some `intentInterleavingIter` methods** This patch splits up `maybeSkipIntentRangeKeys()` and `maybeSuppressRangeKeyChanged()` to allow for mid-stack inlining. I doubt that the gains are as large as these microbenchmarks claim, and there's a fair bit of variance between runs, but it can't hurt. ``` name old time/op new time/op delta MVCCScan_Pebble/rows=1/versions=1/valueSize=64/numRangeKeys=0-24 5.37µs ± 2% 5.43µs ± 2% +1.13% (p=0.041 n=10+10) MVCCScan_Pebble/rows=100/versions=1/valueSize=64/numRangeKeys=0-24 38.2µs ± 2% 38.2µs ± 2% ~ (p=0.971 n=10+10) MVCCScan_Pebble/rows=10000/versions=1/valueSize=64/numRangeKeys=0-24 2.79ms ± 2% 2.71ms ± 2% -2.59% (p=0.000 n=10+10) MVCCReverseScan_Pebble/rows=1/versions=1/valueSize=64/numRangeKeys=0-24 5.99µs ± 1% 5.99µs ± 2% ~ (p=0.541 n=10+10) MVCCReverseScan_Pebble/rows=100/versions=1/valueSize=64/numRangeKeys=0-24 51.7µs ± 3% 52.1µs ± 1% ~ (p=0.631 n=10+10) MVCCReverseScan_Pebble/rows=10000/versions=1/valueSize=64/numRangeKeys=0-24 3.88ms ± 2% 3.87ms ± 1% ~ (p=0.897 n=10+8) MVCCComputeStats_Pebble/valueSize=32/numRangeKeys=0-24 158ms ± 1% 155ms ± 1% -2.34% (p=0.000 n=10+9) ``` Touches #82559. Release justification: bug fixes and low-risk updates to new functionality Release note: None **storage: use concrete `pebbleIterator` in `intentInterleavingIter`** Since `intentInterleavingIter` always constructs the underlying iterators from the given reader, and these readers always construct `*pebbleIterator`, it can use the concrete type rather than interfaces for both iterators. This avoids dynamic dispatch, yielding a decent performance improvement. Unfortunately, this requires disabling `unsafeMVCCIterator` inside `intentInterleavingIter`. This wasn't always enabled anyway, since it was omitted both for the engine iterator and when using an engine directly (which doesn't have consistent iterators). ``` name old time/op new time/op delta MVCCScan_Pebble/rows=1/versions=1/valueSize=64/numRangeKeys=0-24 5.43µs ± 2% 5.45µs ± 2% ~ (p=0.566 n=10+10) MVCCScan_Pebble/rows=100/versions=1/valueSize=64/numRangeKeys=0-24 38.2µs ± 2% 37.0µs ± 1% -3.02% (p=0.000 n=10+10) MVCCScan_Pebble/rows=10000/versions=1/valueSize=64/numRangeKeys=0-24 2.71ms ± 2% 2.66ms ± 1% -1.83% (p=0.000 n=10+9) MVCCReverseScan_Pebble/rows=1/versions=1/valueSize=64/numRangeKeys=0-24 5.99µs ± 2% 5.86µs ± 2% -2.15% (p=0.000 n=10+10) MVCCReverseScan_Pebble/rows=100/versions=1/valueSize=64/numRangeKeys=0-24 52.1µs ± 1% 50.2µs ± 2% -3.77% (p=0.000 n=10+10) MVCCReverseScan_Pebble/rows=10000/versions=1/valueSize=64/numRangeKeys=0-24 3.87ms ± 1% 3.83ms ± 1% -1.26% (p=0.000 n=8+10) MVCCComputeStats_Pebble/valueSize=32/numRangeKeys=0-24 155ms ± 1% 155ms ± 1% ~ (p=0.842 n=9+10) ``` Touches #82559. Release justification: low risk, high benefit changes to existing functionality Release note: None 86478: storage: use concrete `pebbleIterator` in `verifyingIterator` r=tbg a=erikgrinaker **storage: add Pebble SST iterator benchmarks** Release justification: non-production code changes Release note: None **storage: use concrete `pebbleIterator` in `verifyingIterator`** Gives a slight performance boost, since it avoid dynamic dispatch: ``` name old time/op new time/op delta SSTIterator/keys=1/variant=pebble/verify=true-24 42.8µs ± 1% 42.4µs ± 1% -0.79% (p=0.043 n=10+10) SSTIterator/keys=100/variant=pebble/verify=true-24 61.8µs ± 1% 60.7µs ± 1% -1.64% (p=0.000 n=10+10) SSTIterator/keys=10000/variant=pebble/verify=true-24 1.91ms ± 0% 1.88ms ± 0% -1.79% (p=0.000 n=10+10) ``` An attempt was also made at using `RangeKeyChanged()` instead of `HasPointAndRange()`, but this had no effect. Touches #83051. Release justification: bug fixes and low-risk updates to new functionality Release note: None 86513: storage: use `RangeKeyChanged` in `ReadAsOfIterator` r=tbg a=erikgrinaker This patch uses `RangeKeyChanged()` to detect range keys in `ReadAsOfIterator`, and caches them to improve performance. It also fixes a bug where the iterator would fail to detect tombstones with a non-empty `MVCCValueHeader`. Resolves #84714. Release justification: bug fixes and low-risk updates to new functionality Release note: None 86514: storage: use `RangeKeyChanged()` in `MVCCDeleteRangeUsingTombstone()` r=tbg a=erikgrinaker Release justification: bug fixes and low-risk updates to new functionality Release note: None 86529: storage: omit unnecessary range key calls during intent resolution r=tbg a=erikgrinaker Release justification: bug fixes and low-risk updates to new functionality Release note: None Co-authored-by: Erik Grinaker <[email protected]>
The new SST iterators in #82799, e.g.
NewPebbleSSTMemIterator()
, are much slower than the existingNewMemSSTIterator()
, around 80% or so. Some of this is due to CRDB-side overhead, in particular the use ofpebbleIterator
andVerifyingMVCCIterator
, but the majority appears to be in Pebble via the newNewExternalIter()
SST iterator. We need to optimize these.This can be seen with
BenchmarkSSTIterator
, by changingNewMemSSTIterator()
toNewPebbleMemSSTIterator()
inrunSSTIterator()
on top of #82799.Many of the optimizations that address #82559 and #83049 would likely apply here as well.
Jira issue: CRDB-16822
Epic CRDB-2624
The text was updated successfully, but these errors were encountered: