
db: defer loading L0 range key blocks during iterator construction #3004

Merged: 2 commits into cockroachdb:master from the rkl0 branch on Oct 20, 2023

Conversation

@jbowens (Collaborator) commented Oct 17, 2023

Previously, construction of an iterator over range keys that found sstables
containing range keys within L0 performed I/O to load the range key blocks
during iterator construction. This was less efficient: If the iterator
ultimately didn't need to read the keyspace overlapping the sstables containing
range keys, the block loads were unnecessary.

More significantly, if the I/O failed during iterator construction, the
resulting iterator was unusable. It would always error with the original error
returned by the failed block load. This is a deviation from iterator error
handling across the rest of the iterator stack, which allows an Iterator to be
re-seeked to clear the current iterator error.

Resolves #2994.
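
Roughly, the shape of the change looks like the following minimal sketch (hypothetical stand-in types, not Pebble's actual code: `lazyRangeKeyIter`, `rangeKeyBlock`, and the `load` callback are illustrative only). The block read moves out of construction and into the first positioning call, and a failed load is retried on the next seek:

```go
package main

import "fmt"

// rangeKeyBlock stands in for a decoded sstable range-key block.
type rangeKeyBlock struct{ spans []string }

// lazyRangeKeyIter defers the block load until the first positioning call.
type lazyRangeKeyIter struct {
	load  func() (*rangeKeyBlock, error) // performs the deferred I/O
	block *rangeKeyBlock
	err   error
}

// SeekGE loads the block on demand. A previously failed load is retried on
// the next seek, mirroring how the rest of the iterator stack lets a re-seek
// clear the current iterator error.
func (it *lazyRangeKeyIter) SeekGE(key string) (string, bool) {
	if it.block == nil {
		var b *rangeKeyBlock
		b, it.err = it.load()
		if it.err != nil {
			return "", false
		}
		it.block = b
	}
	it.err = nil
	for _, s := range it.block.spans {
		if s >= key {
			return s, true
		}
	}
	return "", false
}

func (it *lazyRangeKeyIter) Error() error { return it.err }

func main() {
	it := &lazyRangeKeyIter{
		// No I/O happens at construction; the load runs on the first SeekGE.
		load: func() (*rangeKeyBlock, error) {
			return &rangeKeyBlock{spans: []string{"a", "m", "z"}}, nil
		},
	}
	if k, ok := it.SeekGE("b"); ok {
		fmt.Println("found", k)
	}
}
```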

Add a predicate that evaluates to true only for (vfs.File).ReadAt operations at
the provided offset. This allows datadriven tests to inject errors into
specific block loads if they know the layout of the sstable.
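
A minimal sketch of what such a predicate might look like (illustrative names only; `op`, `opKind`, and `readAtOffset` are not Pebble's errorfs API):

```go
package main

import (
	"errors"
	"fmt"
)

type opKind int

const (
	opReadAt opKind = iota
	opWrite
)

// op describes a single file operation seen by an error-injecting FS wrapper.
type op struct {
	kind   opKind
	offset int64
}

// predicate reports whether an error should be injected for an operation.
type predicate func(op) bool

// readAtOffset matches only ReadAt calls at exactly the given offset, so a
// test that knows the sstable layout can target one specific block load.
func readAtOffset(offset int64) predicate {
	return func(o op) bool { return o.kind == opReadAt && o.offset == offset }
}

var errInjected = errors.New("injected error")

func main() {
	p := readAtOffset(4096)
	for _, o := range []op{
		{kind: opReadAt, offset: 0},    // some other block: not injected
		{kind: opReadAt, offset: 4096}, // the targeted block load: injected
		{kind: opWrite, offset: 4096},  // not a ReadAt: not injected
	} {
		if p(o) {
			fmt.Printf("ReadAt at offset %d: %v\n", o.offset, errInjected)
		}
	}
}
```
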
@jbowens jbowens requested review from a team and RaduBerinde October 17, 2023 15:55
@cockroach-teamcity (Member) commented

This change is Reviewable

@jbowens jbowens marked this pull request as draft October 17, 2023 16:01
@jbowens jbowens marked this pull request as ready for review October 17, 2023 18:24
@jbowens jbowens requested a review from sumeerbhola October 18, 2023 17:33
@sumeerbhola (Collaborator) left a comment

:lgtm:

Reviewed 3 of 3 files at r1, 4 of 4 files at r2, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @jbowens and @RaduBerinde)


range_keys.go line 70 at r2 (raw file):

	// through the fileMetadata to determine that. Since L0's file count should
	// not significantly exceed ~1000 files (see L0CompactionFileThreshold),
	// this should be okay.

I wonder if we can quickly verify the cost of this by running kv0 and taking a CPU profile.
We need kv0 with a block size that both overloads the LSM and doesn't run very cold wrt CPU. Maybe a block size of 100 with very high client concurrency? It may need some experimentation. Then we may have ~400 files in L0 with ~10 sub-levels, and the MVCCPuts are creating an iterator that needs to read range keys.


testdata/iter_histories/errors line 94 at r2 (raw file):

----

combined-iter

loving this simple-to-read but sophisticated error injection test

@jbowens (Collaborator, Author) left a comment

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @RaduBerinde and @sumeerbhola)


range_keys.go line 70 at r2 (raw file):

Previously, sumeerbhola wrote…

I wonder if we can quickly verify the cost of this by running kv0 and taking a CPU profile.
We need kv0 with a block size that both overloads the LSM and doesn't run very cold wrt CPU. Maybe a block size of 100 with very high client concurrency? It may need some experimentation. Then we may have ~400 files in L0 with ~10 sub-levels, and the MVCCPuts are creating an iterator that needs to read range keys.

It seems tricky because we also need range keys to exist within L0. And there need to exist range keys within sstables that overlap the put key (in order to defeat the lazy-combined iterator optimization). Another option is to add one keyspan.LevelIter per L0 file containing a range key (ignoring the sublevels structure entirely) by iterating through current.RangeKeyLevels. This becomes O(files containing range keys) which should be significantly less, at the cost of more levels in the iterator stack.
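
A rough sketch of that alternative's shape, with hypothetical `fileMetadata` and `spanIter` types standing in for the real manifest and keyspan types:

```go
package main

import "fmt"

type fileMetadata struct{ num int }

// spanIter stands in for a keyspan.FragmentIterator over a single file's
// range-key block, constructed without performing any I/O up front.
type spanIter struct{ file *fileMetadata }

// buildPerFileIters creates one range-key iterator per L0 file known to
// contain range keys (the caller passes that subset, e.g. from a
// RangeKeyLevels-style structure), ignoring the sublevel structure entirely.
// The cost is O(files containing range keys), at the price of one extra
// level in the merged iterator stack per such file.
func buildPerFileIters(rangeKeyFiles []*fileMetadata) []*spanIter {
	iters := make([]*spanIter, 0, len(rangeKeyFiles))
	for _, f := range rangeKeyFiles {
		iters = append(iters, &spanIter{file: f})
	}
	return iters
}

func main() {
	files := []*fileMetadata{{num: 12}, {num: 47}, {num: 93}}
	fmt.Println("range-key levels added:", len(buildPerFileIters(files)))
}
```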

@sumeerbhola (Collaborator) left a comment

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @jbowens and @RaduBerinde)


range_keys.go line 70 at r2 (raw file):

Previously, jbowens wrote…

This becomes O(files containing range keys) which should be significantly less, at the cost of more levels in the iterator stack.

It's no more than the levels we currently have on master, except that we'd be now wrapping each of those in a level iter, yes?

@jbowens (Collaborator, Author) left a comment

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @RaduBerinde and @sumeerbhola)


range_keys.go line 70 at r2 (raw file):

Previously, sumeerbhola wrote…

This becomes O(files containing range keys) which should be significantly less, at the cost of more levels in the iterator stack.

It's no more than the levels we currently have on master, except that we'd be now wrapping each of those in a level iter, yes?

Yeah, that's right. I gave this a try but unfortunately it's not viable today because the keyspan.LevelIter requires that the underlying manifest.LevelIterator be sorted by user keys, even if it's bounded to contain a single file. The manifest.LevelIterator's bounding is implemented by first seeking within the b-tree, and then adjusting to the bounds. If the underlying b-tree is sorted by sequence numbers, it can return incorrect results.

Some additional alternatives:

  • implement a special keyspan.FragmentIterator that takes a single fileMetadata, deferring the I/O until a relevant seek call is made.
  • adjust the L0Sublevels code to construct a parallel b-tree containing only files with range keys for each sublevel; we don't need to actually recompute the L0 sublevels as if these range-key files are the only files in L0 to get the optimal read-amp (a rough sketch of this idea follows below).
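
A very rough sketch of the second alternative, again with hypothetical stand-in types (the real change would live alongside the L0Sublevels construction): the filtering work happens once, when the version's L0 organization is computed, so iterator construction only consults the per-sublevel range-key files:

```go
package main

import "fmt"

type fileMetadata struct {
	num          int
	sublevel     int
	hasRangeKeys bool
}

// l0RangeKeySublevels is a parallel, per-sublevel view holding only the files
// that contain range keys. It is built once when the L0 organization is
// computed, rather than on every iterator construction.
type l0RangeKeySublevels struct {
	files [][]*fileMetadata // indexed by sublevel
}

func buildL0RangeKeySublevels(l0Files []*fileMetadata, numSublevels int) *l0RangeKeySublevels {
	s := &l0RangeKeySublevels{files: make([][]*fileMetadata, numSublevels)}
	for _, f := range l0Files {
		if f.hasRangeKeys {
			s.files[f.sublevel] = append(s.files[f.sublevel], f)
		}
	}
	return s
}

func main() {
	l0 := []*fileMetadata{
		{num: 1, sublevel: 0, hasRangeKeys: true},
		{num: 2, sublevel: 0},
		{num: 3, sublevel: 1, hasRangeKeys: true},
	}
	s := buildL0RangeKeySublevels(l0, 2)
	for lvl, fs := range s.files {
		fmt.Printf("sublevel %d: %d range-key files\n", lvl, len(fs))
	}
}
```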

@jbowens (Collaborator, Author) commented Oct 20, 2023

TFTR!

Merging this, and filed #3007 for following up on potentially avoiding the O(# files in L0) iteration.

@jbowens jbowens merged commit babd592 into cockroachdb:master Oct 20, 2023
11 checks passed
@jbowens jbowens deleted the rkl0 branch October 20, 2023 22:19
Successfully merging this pull request may close these issues.

db: iterator creation is fallible