Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: investigate slow ScanRequest evaluation #96361

Closed
sumeerbhola opened this issue Feb 1, 2023 · 3 comments
Closed

storage: investigate slow ScanRequest evaluation #96361

sumeerbhola opened this issue Feb 1, 2023 · 3 comments
Assignees
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. T-storage Storage Team

Comments

@sumeerbhola
Copy link
Collaborator

sumeerbhola commented Feb 1, 2023

In a production cluster there were some slow ScanRequests that took ~95ms, with trace:
94.349ms event:scan stats: stepped 674 times (22806 internal); seeked 51782 times (25902 internal); block-bytes: (total 2.6 MiB, cached 2.6 MiB); points: (count 48.706 k, key-bytes 2.6 MiB, value-bytes 1.3 MiB, tombstoned: 0 ) ranges: (count 0 ), (contained-points 0 , skipped-points 0 )

This was a single ScanRequest in the BatchRequest and returned resp=num_keys:26165 num_bytes:2851274 from the KV layer.

The range had "val_count": 3233223 and "live_count": 3160356, so the fraction of garbage is low.
(the details of this incident are in https://github.com/cockroachlabs/support/issues/2033)

The hypothesis for the excessive seeks is that (in this v22.2 cluster), we are getting unlucky with the pebbleMVCCScanner.itersBeforeSeek logic (that no longer exists on master, since it uses the Pebble implemented NextPrefix). More details:

  • I think pebbleMVCCScanner.itersBeforeSeek logic is flawed in that if it gets unlucky and encounters many versions for the initial keys, it can reduce this value down to 0 and then use SeekGE for the rest of the iteration.
  • But using SeekGE should also be fine since we have multiple optimizations inside Pebble (a) the one that avoid seeking the lock table iterator, (b) (for the mvcc part of the key space) the monotonic SeekGE calls will result in a try-seek-using-next optimization at the sstable level. (b) will still show as a high number of internal seeks in the stats, that we see, but they should be very cheap seeks.
    • @jbowens pointed out that we do need to seek within the manifest.LevelIterator on v22.2, which we have optimized away on master, which can be costly.
  • There are val_count 3233223 - live_count 3160356 = ~70K old versions in this range. And we are scanning and retrieving ~26K keys. So it is plausible that we are getting unlucky and itersBeforeSeek is dropping to 0, but it seems unlikely that our SeekGE optimizations at the Pebble level would be ineffective.

More hypothesizing: stepped 674 times (22806 internal); seeked 51782 times (25902 internal)
The ~2x the number of seeks as the keys retrieved is explained by the intentInterleavingIter which needs to seek the lock-table iter and the mvcc-key-space iter. The latter will translate into an internal seek since the iterator is positioned at the preceding key. The former does not translate into an internal seek, due to:

  • Consider that intents have been resolved but most of the SingleDelete and Set pairs have not yet met in the LSM (for these ~26K keys).
  • When doing SeekGEWithLimit the first time, the lock-table iter will do a seek, then step over the SingleDelete, Set pair for the key that is within the limit and get to the next SingleDelete and stop because of the limit. The next SeekGEWithLimit is to the userkey of the SingleDelete so will not do an internal seek and step over the SingleDelete, Set (2 internal next calls) and then stop due to the limit. This behavior would explain the low value of 674 interface steps vs the high value of 22806 internal steps.

Next steps:

  • Try to reproduce this itersBeforeSeek behavior in a microbenchmark.
  • Compare benchmark results for v22.1, v22.2 and master and understand differences and if there any optimizations we can do.

Jira issue: CRDB-24086

@sumeerbhola sumeerbhola added A-storage Relating to our storage engine (Pebble) on-disk storage. T-storage Storage Team labels Feb 1, 2023
@blathers-crl
Copy link

blathers-crl bot commented Feb 1, 2023

Hi @sumeerbhola, I've guessed the C-ategory of your issue and suitably labeled it. Please re-label if inaccurate.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl blathers-crl bot added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Feb 1, 2023
@sumeerbhola sumeerbhola self-assigned this Feb 1, 2023
@sumeerbhola
Copy link
Collaborator Author

Benchmark with 7 levels (these are L0 sub-levels, but these are similar to normal levels in using LevelIter) each with one file. 1000 roachpb.Keys with the first 10 keys having 6 versions, and the remaining with 1 version. Each key is Put using a transactional write, so an intent is written too. The lowest level has only the intents and corresponding provisional value. The next higher level has the intent resolution of the next lower level, and its own Puts (i.e., intents and provisional value). This means all SingleDelete, Set pairs for the intents are separated into 2 files (in 2 levels). And the first 10 keys with 6 versions have each of their versions in separate files (in separate levels). Each iteration is a MVCCScan over all these keys reading at a timestamp higher than the latest version.

v22.2

stats: (interface (dir, seek, step): (fwd, 2002, 30), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 1007, 2115), (rev, 0, 0)),
(internal-stats: (block-bytes: (total 15 K, cached 15 K)), (points: (count 3.1 K, key-bytes 88 K, value-bytes 60 K, tombstoned: 0))
BenchmarkMVCCScannerWithIntentsAndVersions-10    	    1706	    698974 ns/op	  134703 B/op	      32 allocs/op

Note that we are mostly seeking at the interface level, due to the first 10 keys defeating pebbleMVCCScanner.itersBeforeSeek. The internal stats are mostly seeks for the mvcc-key-space iter and 2 nexts per interface Seek for the intent-iter.

  • These are 8 byte keys and tiny values, so ~0.7ms for 1000 keys, would translate to 17.5ms for ~25K keys (choosing the same number of keys in the original production issue). EngineKey{Compare,Equal} are 15% of the CPU because of these short keys, so that is something that can easily get inflated in a real setting.
  • Of the 91% in intentInterleavingIter.SeekGE, SeekEngineKeyGEWithLimit is ~27% (over the lock-table) and SeekGE (over the mvcc-key-space) is 56%.
    • In the latter, levelIter.findFileGE is ~25%. This is with only 1 file in the level! In a real LSM with 1000s of files in a level, this will multiply by a logarithmic factor of the number of files.

Screenshot 2023-02-01 at 3 06 12 PM

@sumeerbhola
Copy link
Collaborator Author

sumeerbhola commented Feb 1, 2023

v22.1 is 2x faster! This is because it does not seek despite this attempt to defeat pebbleMVCCScanner.itersBeforeSeek.

stats: (interface (dir, seek, step): (fwd, 22, 2019), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 22, 3110), (rev, 0, 0)),
(internal-stats: (block-bytes: (total 15 K, cached 15 K)), (points: (count 3.1 K, key-bytes 88 K, value-bytes 60 K, tombstoned: 0))
BenchmarkMVCCScannerWithIntentsAndVersions-10    	    4000	    356587 ns/op	  133987 B/op	      29 allocs/op

Looks like we "temporarily" changed the code in b258e82#diff-8a8fc3cd44bc071821e87442be0871b30214293e94c435c2a21368ad8c3af3bfL486 and forgot to change it back

Screenshot 2023-02-01 at 3 42 33 PM

sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Feb 1, 2023
This was supposed to be temporarily removed in
cockroachdb@b258e82
but was never restored.
It accounts for a 2x slowdown in a benchmark where a scan encounters a
few keys at the beginning of the scan with > 5 versions, that cause
itersBeforeSeek to drop to 0, and then for the remaining 1000s of keys
with only 1 versions it uses SeekGE instead of Next.

Informs cockroachdb#96361

Also see cockroachlabs/support#2033

Epic: none

Release note: None
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Feb 1, 2023
…canner.itersBeforeSeek

This commit restores the lower-bound of 1 on
pebbleMVCCScanner.itersBeforeSeek. This has no effect on the added
benchmark since on master (unlike v22.2) we use Pebble's Iterator.NextPrefix
for the common case of stepping to the next roachpb.Key. itersBeforeSeek
continues to be used for seeking to a particular version and for reverse
scans.

The added benchmark has 7 levels and resolved intents where both the Set
and SingleDelete of the intent are present in Pebble. It tries to trick
pebbleMVCCScanner with having keys with many versions in the beginning
of the scan. Benchmark results:

BenchmarkMVCCScannerWithIntentsAndVersions-10    	4000	    316177 ns/op	  133119 B/op	      28 allocs/op

stats: (interface (dir, seek, step): (fwd, 2, 1999), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 2, 3110), (rev, 0, 0)),
(internal-stats: (block-bytes: (total 15 K, cached 15 K)), (points: (count 3.1 K, key-bytes 88 K, value-bytes 60 K, tombstoned 0)))

Informs cockroachdb#96361

Epic: none

Release note: None
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Feb 1, 2023
This was supposed to be temporarily removed in
cockroachdb@b258e82
but was never restored.
It accounts for a 2x slowdown in a benchmark where a scan encounters a
few keys at the beginning of the scan with > 5 versions, that cause
itersBeforeSeek to drop to 0, and then for the remaining 1000s of keys
with only 1 versions it uses SeekGE instead of Next.

Informs cockroachdb#96361

Also see cockroachlabs/support#2033

Epic: none

Release note: None
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Feb 2, 2023
…canner.itersBeforeSeek

This commit restores the lower-bound of 1 on
pebbleMVCCScanner.itersBeforeSeek. This has no effect on the added
benchmark since on master (unlike v22.2) we use Pebble's Iterator.NextPrefix
for the common case of stepping to the next roachpb.Key. itersBeforeSeek
continues to be used for seeking to a particular version and for reverse
scans.

The added benchmark has 7 levels and resolved intents where both the Set
and SingleDelete of the intent are present in Pebble. It tries to trick
pebbleMVCCScanner with having keys with many versions in the beginning
of the scan. Benchmark results:

BenchmarkMVCCScannerWithIntentsAndVersions-10    	4000	    316177 ns/op	  133119 B/op	      28 allocs/op

stats: (interface (dir, seek, step): (fwd, 2, 1999), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 2, 3110), (rev, 0, 0)),
(internal-stats: (block-bytes: (total 15 K, cached 15 K)), (points: (count 3.1 K, key-bytes 88 K, value-bytes 60 K, tombstoned 0)))

Informs cockroachdb#96361

Epic: none

Release note: None
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Feb 2, 2023
This was supposed to be temporarily removed in
cockroachdb@b258e82
but was never restored.
It accounts for a 2x slowdown in a benchmark where a scan encounters a
few keys at the beginning of the scan with > 5 versions, that cause
itersBeforeSeek to drop to 0, and then for the remaining 1000s of keys
with only 1 versions it uses SeekGE instead of Next.

Informs cockroachdb#96361

Also see cockroachlabs/support#2033

Epic: none

Release note: None
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Feb 2, 2023
This was supposed to be temporarily removed in
cockroachdb@b258e82
but was never restored.
It accounts for a 2x slowdown in a benchmark where a scan encounters a
few keys at the beginning of the scan with > 5 versions, that cause
itersBeforeSeek to drop to 0, and then for the remaining 1000s of keys
with only 1 versions it uses SeekGE instead of Next.

Informs cockroachdb#96361

Also see cockroachlabs/support#2033

Epic: none

Release note: None
craig bot pushed a commit that referenced this issue Feb 3, 2023
96380: storage: add scan benchmark with resolved intents and fix pebbleMVCCS… r=sumeerbhola a=sumeerbhola

…canner.itersBeforeSeek

This commit restores the lower-bound of 1 on
pebbleMVCCScanner.itersBeforeSeek. This has no effect on the added benchmark since on master (unlike v22.2) we use Pebble's Iterator.NextPrefix for the common case of stepping to the next roachpb.Key. itersBeforeSeek continues to be used for seeking to a particular version and for reverse scans.

The added benchmark has 7 levels and resolved intents where both the Set and SingleDelete of the intent are present in Pebble. It tries to trick pebbleMVCCScanner with having keys with many versions in the beginning of the scan. Benchmark results:

```
BenchmarkMVCCScannerWithIntentsAndVersions-10    	4000	    316177 ns/op	  133119 B/op	      28 allocs/op

stats: (interface (dir, seek, step): (fwd, 2, 1999), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 2, 3110), (rev, 0, 0)), (internal-stats: (block-bytes: (total 15 K, cached 15 K)), (points: (count 3.1 K, key-bytes 88 K, value-bytes 60 K, tombstoned 0)))
```
Informs #96361

Epic: none

Release note: None

Co-authored-by: sumeerbhola <[email protected]>
@exalate-issue-sync exalate-issue-sync bot removed the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Feb 3, 2023
@jbowens jbowens moved this to Done in [Deprecated] Storage Jun 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. T-storage Storage Team
Projects
No open projects
Archived in project
Development

No branches or pull requests

1 participant