
db: defer loading L0 range key blocks during iterator construction #3004

Merged: 2 commits into cockroachdb:master from the rkl0 branch on Oct 20, 2023

Conversation

@jbowens (Collaborator) commented Oct 17, 2023

Previously, construction of an iterator over range keys that found sstables
containing range keys within L0 performed I/O to load the range key blocks
during iterator construction. This was less efficient: If the iterator
ultimately didn't need to read the keyspace overlapping the sstables containing
range keys, the block loads were unnecessary.

More significantly, if the I/O failed during iterator construction, the
resulting iterator was unusable. It would always error with the original error
returned by the failed block load. This is a deviation from iterator error
handling across the rest of the iterator stack, which allows an Iterator to be
re-seeked to clear the current iterator error.

Resolves #2994.
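
Roughly, the shape of the change looks like the following minimal sketch (hypothetical stand-in types, not Pebble's actual code: `lazyRangeKeyIter`, `rangeKeyBlock`, and the `load` callback are illustrative only). The block read moves out of construction and into the first positioning call, and a failed load is retried on the next seek:

```go
package main

import "fmt"

// rangeKeyBlock stands in for a decoded sstable range-key block.
type rangeKeyBlock struct{ spans []string }

// lazyRangeKeyIter defers the block load until the first positioning call.
type lazyRangeKeyIter struct {
	load  func() (*rangeKeyBlock, error) // performs the deferred I/O
	block *rangeKeyBlock
	err   error
}

// SeekGE loads the block on demand. A previously failed load is retried on
// the next seek, mirroring how the rest of the iterator stack lets a re-seek
// clear the current iterator error.
func (it *lazyRangeKeyIter) SeekGE(key string) (string, bool) {
	if it.block == nil {
		var b *rangeKeyBlock
		b, it.err = it.load()
		if it.err != nil {
			return "", false
		}
		it.block = b
	}
	it.err = nil
	for _, s := range it.block.spans {
		if s >= key {
			return s, true
		}
	}
	return "", false
}

func (it *lazyRangeKeyIter) Error() error { return it.err }

func main() {
	it := &lazyRangeKeyIter{
		// No I/O happens at construction; the load runs on the first SeekGE.
		load: func() (*rangeKeyBlock, error) {
			return &rangeKeyBlock{spans: []string{"a", "m", "z"}}, nil
		},
	}
	if k, ok := it.SeekGE("b"); ok {
		fmt.Println("found", k)
	}
}
```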

Add a predicate that evaluates to true only for (vfs.File).ReadAt operations at
the provided offset. This allows datadriven tests to inject errors into
specific block loads if they know the layout of the sstable.
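
A minimal sketch of what such a predicate might look like (illustrative names only; `op`, `opKind`, and `readAtOffset` are not Pebble's errorfs API):

```go
package main

import (
	"errors"
	"fmt"
)

type opKind int

const (
	opReadAt opKind = iota
	opWrite
)

// op describes a single file operation seen by an error-injecting FS wrapper.
type op struct {
	kind   opKind
	offset int64
}

// predicate reports whether an error should be injected for an operation.
type predicate func(op) bool

// readAtOffset matches only ReadAt calls at exactly the given offset, so a
// test that knows the sstable layout can target one specific block load.
func readAtOffset(offset int64) predicate {
	return func(o op) bool { return o.kind == opReadAt && o.offset == offset }
}

var errInjected = errors.New("injected error")

func main() {
	p := readAtOffset(4096)
	for _, o := range []op{
		{kind: opReadAt, offset: 0},    // some other block: not injected
		{kind: opReadAt, offset: 4096}, // the targeted block load: injected
		{kind: opWrite, offset: 4096},  // not a ReadAt: not injected
	} {
		if p(o) {
			fmt.Printf("ReadAt at offset %d: %v\n", o.offset, errInjected)
		}
	}
}
```
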
@jbowens jbowens requested review from a team and RaduBerinde October 17, 2023 15:55
@cockroach-teamcity (Member) commented

This change is Reviewable

@jbowens jbowens marked this pull request as draft October 17, 2023 16:01
@jbowens jbowens marked this pull request as ready for review October 17, 2023 18:24
@jbowens jbowens requested a review from sumeerbhola October 18, 2023 17:33
@sumeerbhola (Collaborator) left a comment

:lgtm:

Reviewed 3 of 3 files at r1, 4 of 4 files at r2, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @jbowens and @RaduBerinde)


range_keys.go line 70 at r2 (raw file):

	// through the fileMetadata to determine that. Since L0's file count should
	// not significantly exceed ~1000 files (see L0CompactionFileThreshold),
	// this should be okay.

I wonder if we can quickly verify the cost of this by running kv0 and taking a CPU profile.
We need kv0 with a block size that both overloads the LSM and doesn't run very cold wrt CPU. Maybe a block size of 100 with very high client concurrency? It may need some experimentation. Then we may have ~400 files in L0 with ~10 sub-levels, and the MVCCPuts are creating an iterator that needs to read range keys.


testdata/iter_histories/errors line 94 at r2 (raw file):

----

combined-iter

loving this simple-to-read but sophisticated error injection test

@jbowens (Collaborator, Author) left a comment

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @RaduBerinde and @sumeerbhola)


range_keys.go line 70 at r2 (raw file):

Previously, sumeerbhola wrote…

I wonder if we can quickly verify the cost of this by running kv0 and taking a CPU profile.
We need kv0 with a block size that both overloads the LSM and doesn't run very cold wrt CPU. Maybe a block size of 100 with very high client concurrency? It may need some experimentation. Then we may have ~400 files in L0 with ~10 sub-levels, and the MVCCPuts are creating an iterator that needs to read range keys.

It seems tricky because we also need range keys to exist within L0. And there need to exist range keys within sstables that overlap the put key (in order to defeat the lazy-combined iterator optimization). Another option is to add one keyspan.LevelIter per L0 file containing a range key (ignoring the sublevels structure entirely) by iterating through current.RangeKeyLevels. This becomes O(files containing range keys) which should be significantly less, at the cost of more levels in the iterator stack.
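
A rough sketch of that alternative's shape, with hypothetical `fileMetadata` and `spanIter` types standing in for the real manifest and keyspan types:

```go
package main

import "fmt"

type fileMetadata struct{ num int }

// spanIter stands in for a keyspan.FragmentIterator over a single file's
// range-key block, constructed without performing any I/O up front.
type spanIter struct{ file *fileMetadata }

// buildPerFileIters creates one range-key iterator per L0 file known to
// contain range keys (the caller passes that subset, e.g. from a
// RangeKeyLevels-style structure), ignoring the sublevel structure entirely.
// The cost is O(files containing range keys), at the price of one extra
// level in the merged iterator stack per such file.
func buildPerFileIters(rangeKeyFiles []*fileMetadata) []*spanIter {
	iters := make([]*spanIter, 0, len(rangeKeyFiles))
	for _, f := range rangeKeyFiles {
		iters = append(iters, &spanIter{file: f})
	}
	return iters
}

func main() {
	files := []*fileMetadata{{num: 12}, {num: 47}, {num: 93}}
	fmt.Println("range-key levels added:", len(buildPerFileIters(files)))
}
```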

@sumeerbhola (Collaborator) left a comment

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @jbowens and @RaduBerinde)


range_keys.go line 70 at r2 (raw file):

Previously, jbowens wrote…

This becomes O(files containing range keys) which should be significantly less, at the cost of more levels in the iterator stack.

It's no more than the levels we currently have on master, except that we'd be now wrapping each of those in a level iter, yes?

@jbowens (Collaborator, Author) left a comment

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @RaduBerinde and @sumeerbhola)


range_keys.go line 70 at r2 (raw file):

Previously, sumeerbhola wrote…

This becomes O(files containing range keys) which should be significantly less, at the cost of more levels in the iterator stack.

It's no more than the levels we currently have on master, except that we'd be now wrapping each of those in a level iter, yes?

Yeah, that's right. I gave this a try but unfortunately it's not viable today because the keyspan.LevelIter requires that the underlying manifest.LevelIterator be sorted by user keys, even if it's bounded to contain a single file. The manifest.LevelIterator's bounding is implemented by first seeking within the b-tree, and then adjusting to the bounds. If the underlying b-tree is sorted by sequence numbers, it can return incorrect results.

Some additional alternatives:

  • implement a special keyspan.FragmentIterator that takes a single fileMetadata, deferring the I/O until a relevant seek call is made.
  • adjust the L0Sublevels code to construct a parallel b-tree containing only files with range keys for each sublevel; we don't need to actually recompute the L0 sublevels as if these range-key files are the only files in L0 to get the optimal read-amp (a rough sketch of this idea follows below).
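
A very rough sketch of the second alternative, again with hypothetical stand-in types (the real change would live alongside the L0Sublevels construction): the filtering work happens once, when the version's L0 organization is computed, so iterator construction only consults the per-sublevel range-key files:

```go
package main

import "fmt"

type fileMetadata struct {
	num          int
	sublevel     int
	hasRangeKeys bool
}

// l0RangeKeySublevels is a parallel, per-sublevel view holding only the files
// that contain range keys. It is built once when the L0 organization is
// computed, rather than on every iterator construction.
type l0RangeKeySublevels struct {
	files [][]*fileMetadata // indexed by sublevel
}

func buildL0RangeKeySublevels(l0Files []*fileMetadata, numSublevels int) *l0RangeKeySublevels {
	s := &l0RangeKeySublevels{files: make([][]*fileMetadata, numSublevels)}
	for _, f := range l0Files {
		if f.hasRangeKeys {
			s.files[f.sublevel] = append(s.files[f.sublevel], f)
		}
	}
	return s
}

func main() {
	l0 := []*fileMetadata{
		{num: 1, sublevel: 0, hasRangeKeys: true},
		{num: 2, sublevel: 0},
		{num: 3, sublevel: 1, hasRangeKeys: true},
	}
	s := buildL0RangeKeySublevels(l0, 2)
	for lvl, fs := range s.files {
		fmt.Printf("sublevel %d: %d range-key files\n", lvl, len(fs))
	}
}
```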

@jbowens (Collaborator, Author) commented Oct 20, 2023

TFTR!

Merging this, and filed #3007 for following up on potentially avoiding the O(# files in L0) iteration.

@jbowens jbowens merged commit babd592 into cockroachdb:master Oct 20, 2023
11 checks passed
@jbowens jbowens deleted the rkl0 branch October 20, 2023 22:19
Successfully merging this pull request may close these issues.

db: iterator creation is fallible