db: SeekPrefixGE lazy positioning, or GetPrefix #2002

Open
jbowens opened this issue Oct 11, 2022 · 4 comments
@jbowens
Collaborator

jbowens commented Oct 11, 2022

I have not thought through the design space here in detail, but it seems possible to use MVCC metadata about sstables (e.g., the computed block properties) to avoid reading files that contain only older versions of a key during a SeekPrefixGE. The goal would be to reduce block reads during MVCCGets, making MVCCGet's performance profile more similar to a Pebble Get's.

A design, just to serve as an illustrative example:

  • the sstable iterator's SeekPrefixGE returns a synthetic <prefix>@<sstable's max mvcc timestamp> key if the bloom filter indicates the file may contain the key
  • the merging iterator is updated to skip past these synthetic keys (much like it does today for range tombstones)
  • nexting an sstable iterator last positioned by SeekPrefixGE actually performs the seek
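
The steps above could be sketched roughly as follows. This is a hypothetical illustration, not Pebble's actual API: `lazyIter`, `sliceIter`, and `mayContain` are invented names standing in for the sstable iterator, its data blocks, and the bloom filter check.

```go
package main

import "fmt"

// innerIter stands in for a real sstable iterator whose seek loads blocks.
type innerIter interface {
	SeekPrefixGE(prefix string) (string, bool)
	Next() (string, bool)
}

// sliceIter is a toy iterator over sorted keys.
type sliceIter struct {
	keys []string
	i    int
}

func (s *sliceIter) SeekPrefixGE(prefix string) (string, bool) {
	for j, k := range s.keys {
		if len(k) >= len(prefix) && k[:len(prefix)] == prefix {
			s.i = j
			return k, true
		}
	}
	return "", false
}

func (s *sliceIter) Next() (string, bool) {
	s.i++
	if s.i < len(s.keys) {
		return s.keys[s.i], true
	}
	return "", false
}

// lazyIter defers the real (block-loading) seek until the merging
// iterator actually asks for a real key.
type lazyIter struct {
	inner       innerIter
	maxTS       int                      // sstable's max MVCC timestamp (block property)
	mayContain  func(prefix string) bool // bloom filter check (stubbed here)
	pendingSeek string                   // prefix of a deferred seek, "" if none
}

// SeekPrefixGE returns a synthetic "<prefix>@<maxTS>" key without touching
// data blocks, provided the filter says the prefix may be present.
func (it *lazyIter) SeekPrefixGE(prefix string) (string, bool) {
	if !it.mayContain(prefix) {
		return "", false
	}
	it.pendingSeek = prefix
	return fmt.Sprintf("%s@%d", prefix, it.maxTS), true
}

// Next performs the deferred seek the first time it is called after a
// lazy SeekPrefixGE; afterwards it forwards to the inner iterator.
func (it *lazyIter) Next() (string, bool) {
	if p := it.pendingSeek; p != "" {
		it.pendingSeek = ""
		return it.inner.SeekPrefixGE(p)
	}
	return it.inner.Next()
}

func main() {
	it := &lazyIter{
		inner:      &sliceIter{keys: []string{"k@7", "k@3"}},
		maxTS:      9,
		mayContain: func(string) bool { return true },
	}
	k, _ := it.SeekPrefixGE("k")
	fmt.Println(k) // synthetic key: k@9 (no block read performed)
	k, _ = it.Next()
	fmt.Println(k) // deferred seek happened: k@7
}
```

In this sketch, a table whose synthetic key never reaches the top of the merging heap never has its deferred seek executed, which is where the block-read savings would come from.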

If we elevated MVCC timestamps into the *fileMetadata, it seems like we could even avoid some of the bloom filter reads and table loads.
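
One hedged sketch of what that elevation might buy: if the max MVCC timestamp is available on the file metadata (the field and helper names below are hypothetical, not Pebble's), a seek that has already surfaced a real version can discard whole tables without any I/O.

```go
package main

import "fmt"

// fileMetadata is a stand-in for Pebble's *fileMetadata with the max MVCC
// timestamp lifted out of the sstable's block properties, so it is
// available without a table load. Hypothetical sketch only.
type fileMetadata struct {
	maxMVCCTimestamp int
}

// canSkipTableLoad reports whether a SeekPrefixGE that has already
// surfaced a real version at surfacedTS can skip this table entirely:
// every version the table holds is at least as old, so it can never rise
// to the top of the merging heap. (Tie-breaking by level is elided.)
func canSkipTableLoad(m *fileMetadata, surfacedTS int) bool {
	return m.maxMVCCTimestamp <= surfacedTS
}

func main() {
	old := &fileMetadata{maxMVCCTimestamp: 5}
	fresh := &fileMetadata{maxMVCCTimestamp: 9}
	fmt.Println(canSkipTableLoad(old, 7))   // true: no filter read or table load needed
	fmt.Println(canSkipTableLoad(fresh, 7)) // false: this table must be consulted
}
```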

Somewhat related to #2182.

Jira issue: PEBBLE-142

@jbowens
Collaborator Author

jbowens commented Feb 15, 2023

Added the consider-23.2 label because this and avoiding bloom filter block reads (by elevating the mvcc props to the fileMetadata) seem especially beneficial with disaggregated storage, where each new table read requires reading from remote storage.

@sumeerbhola
Collaborator

I'm a bit confused by

the sstable iterator SeekPrefixGE returns a synthetic <prefix>@<sstable's max mvcc timestamp> key if the bloom filter indicates the file may contain the key

since I thought this is trying to avoid reading the bloom filter of that file. Maybe I am misreading the sentence.
Is the idea here that many levels will have files that overlap with the prefix, but the max MVCC timestamp in most of these files will be smaller than the actual largest MVCC timestamp for that prefix, so these files will never rise to the top of the heap and so we will never need to read their bloom filters? Nice idea! It should help when the version being read was written recently, which I believe is more common for workloads with huge tables with lots of old data (which are exactly the ones where the bloom filter is more likely to miss the cache).

@jbowens
Collaborator Author

jbowens commented Feb 15, 2023

since I thought this is trying to avoid reading the bloom filter of that file.

Ah yeah, when I started writing the issue I was thinking we'd perform the bloom filter check before returning the synthetic key and we'd only be saving the block loads on the tables that the filter fails to exclude, but I think it makes more sense to reverse the order to try to avoid the bloom filter check as well.

Is the idea here that many levels will have files that overlap with the prefix, but the max MVCC timestamp in most of these files will be smaller than the actual largest MVCC timestamp for that prefix, so these files will never rise to the top of the heap and so we will never need to read their bloom filters?

Yeah, that's right.

@jbowens
Collaborator Author

jbowens commented Jan 24, 2024

This may be helpful with #3230 for workloads that are frequently reading and updating recently-written keys. It would shift the working set of data towards the higher levels of the LSM (which are also smaller), making it more likely that a read finds that everything it needs is already in either the block cache or OS page cache.

Status: 24.2 candidates

2 participants