storage: add MVCC range tombstone handling in scans and gets #82045
Conversation
Force-pushed from 1b7a50f to c2c3606.
@jbowens We'll need to optimize the null path here (no range keys). Here are the latest benchmarks against the parent of this PR:
Most of this seems to be in Pebble. Here are a couple of profiles, along with a profile diff graph showing where much of the time is spent. I also tried using the latest Pebble.
Appreciate you pulling at this. There is also some optimization work needed in
Thanks @erikgrinaker — In this "null" case, are there still range-key clear tombstones? (eg, from #82041)
Shouldn't be, no -- we're setting up a new engine for the benchmark.
I tried removing the entire range-key iterator stack and the interleaving iterator, just to measure their overhead in this noop case. It looks like they have a 5-8.5% delta. I think there's definitely some performance we can claw back by optimizing the interleaving iterator and range-key iterator.
Compared with this PR's parent SHA, the commit without the interleaving iter still has a slowdown:
With a batch, it appears the slowdown is exclusively in iterator construction. Without a batch, there's the iterator-construction slowdown plus some additional slowdown that scales with the number of Gets. Going to keep digging.
Force-pushed from c2c3606 to 682df9d.
Marking this as ready for review, since the functional blockers have merged, but it still needs optimization work.
it looks like some additional slowdown that scales with the number of Gets.
Ah, I think this is because `MVCCGet` without a batch needs to always initialize an iterator, whereas when the reader is the same `pebbleBatch` each time, the `pebbleBatch` holds the iterators it creates and can reuse them.
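To make the distinction concrete, here is a minimal Go sketch of the reuse pattern being described. It is not CRDB's actual `pebbleBatch` code; the `cachingReader`, `iterator`, and `newIterator` names are hypothetical.

```go
// Package sketch illustrates why reads through the same batch are cheaper:
// the batch-like reader caches the iterator it builds and reuses it for
// subsequent gets, while a standalone get pays construction on every call.
package sketch

type iterator interface {
	SeekGE(key []byte) bool
	Close()
}

type cachingReader struct {
	newIterator func() iterator
	cached      iterator // reused across gets until the reader is closed
}

func (r *cachingReader) get(key []byte) bool {
	if r.cached == nil {
		r.cached = r.newIterator() // pay the construction cost only once
	}
	return r.cached.SeekGE(key)
}

// standaloneGet constructs and closes a fresh iterator for every lookup.
func standaloneGet(newIterator func() iterator, key []byte) bool {
	it := newIterator()
	defer it.Close()
	return it.SeekGE(key)
}
```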
I've merged some Pebble optimizations that help (included in #82736), especially in the `MVCCGet_Pebble/batch=true` case. Here are the benchmarks of this branch (an old commit from before you rebased) versus its parent:
```
name                                                    old time/op  new time/op  delta
MVCCGet_Pebble/batch=false/versions=1/valueSize=8-10    3.47µs ± 1%  3.87µs ± 1%  +11.39%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8-10   4.22µs ± 3%  4.66µs ± 1%  +10.37%  (p=0.000 n=10+9)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8-10  9.32µs ± 2%  9.75µs ± 3%   +4.54%  (p=0.000 n=10+9)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8-10     2.18µs ± 1%  2.25µs ± 3%   +3.30%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8-10    3.01µs ± 4%  3.17µs ± 1%   +5.22%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8-10   7.46µs ± 5%  7.15µs ± 2%   -4.10%  (p=0.000 n=10+10)
```
The iterator construction slowdown is still very much present and, unfortunately, is going to be challenging to reduce. The current design of combined iteration constructs two iterator stacks, the point iterator stack and the range key iterator stack, gluing them together with the interleaving iterator. Just initializing the range key iterator stack's various internal iterators (eg, setting fields, etc) adds a slowdown. It's also certainly going to get worse with persistence (just merged! 🎉 in cockroachdb/pebble@ae99f4f12f).
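For readers unfamiliar with the interleaving step, here is a simplified, hypothetical Go sketch of what "gluing together" a point-key iterator and a range-key iterator looks like. This is not Pebble's actual `interleavingIter`; the `simpleIter` interface and its `Peek`/`Next` methods are assumptions for illustration.

```go
// Package sketch shows interleaved iteration over two ascending iterators:
// the interleaver surfaces whichever head key sorts first.
package sketch

type simpleIter interface {
	// Peek returns the current key without advancing; ok is false when
	// the iterator is exhausted.
	Peek() (key string, ok bool)
	Next()
}

type interleaver struct {
	points, rangeKeys simpleIter
}

// Next returns the next key in combined order, preferring the point key on
// ties so a range-key start does not shadow a point key at the same position.
func (i *interleaver) Next() (string, bool) {
	pk, pok := i.points.Peek()
	rk, rok := i.rangeKeys.Peek()
	switch {
	case !pok && !rok:
		return "", false
	case pok && (!rok || pk <= rk):
		i.points.Next()
		return pk, true
	default:
		i.rangeKeys.Next()
		return rk, true
	}
}
```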
There are two approaches I can see to removing the iterator construction overhead:
- Lazily construct the combined iterator: as the point iterator iterates through the LSM, if it ever encounters a file that contains range keys, it bubbles that knowledge up. The `pebble.Iterator` then constructs the range key iterator and restructures itself to initialize the combined iterator in the same spot (see the sketch after this list).
- Rework combined iteration to dynamically update the range key merging iterator's levels as files are opened and closed. This would likely be pretty gnarly.
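A rough sketch of the first approach, assuming an interface in which the point iterator can report whether it has opened a file containing range keys. This is not Pebble's implementation; `pointIter`, `rangeKeyIter`, `SawFileWithRangeKeys`, and `newRangeKeyIter` are all hypothetical names.

```go
// Package sketch lazily upgrades from point-only iteration to combined
// iteration the first time a file containing range keys is observed.
package sketch

type pointIter interface {
	Next() (key []byte, ok bool)
	// SawFileWithRangeKeys reports whether iteration has opened an sstable
	// whose metadata indicates it contains range keys.
	SawFileWithRangeKeys() bool
}

type rangeKeyIter interface {
	SeekGE(key []byte)
}

type lazyCombinedIter struct {
	points          pointIter
	newRangeKeyIter func() rangeKeyIter
	rangeKeys       rangeKeyIter // nil until a range-key file is observed
}

// Next advances the point iterator and, the first time a file containing
// range keys is observed, constructs the range-key iterator stack and
// positions it at the current key so interleaving can begin from there.
func (i *lazyCombinedIter) Next() ([]byte, bool) {
	key, ok := i.points.Next()
	if ok && i.rangeKeys == nil && i.points.SawFileWithRangeKeys() {
		i.rangeKeys = i.newRangeKeyIter()
		i.rangeKeys.SeekGE(key)
	}
	return key, ok
}
```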
cc @sumeerbhola if you have thoughts ^
Luckily, breather week is a nice span of heads-down time.
Reviewed 2 of 5 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @jbowens, and @sumeerbhola)
I've merged some Pebble optimizations that help (included in #82736), especially in the MVCCGet_Pebble/batch=true case.
Awesome, thanks for the improvements! The `batch=true` case is getting to the point where I think we can merge this to `master`. `batch=false` still needs some work, but once we land the latest optimizations and persistence in CRDB I'll run some end-to-end benchmarks to look at the overall perf impact. I'll see if we can reclaim a couple of percent in CRDB too.
Lazily construct the combined iterator
This seems like a good first stab. It's possible that we can more aggressively reuse iterators in CRDB too, e.g. by pooling them between batches or something.
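As a sketch of what pooling iterators between batches could look like (this is not CRDB's actual API; `pooledIter`, `getIter`, and `putIter` are hypothetical), a `sync.Pool` keeps constructed iterators around for reuse instead of discarding them when a batch closes:

```go
// Package sketch pools iterators across batches so construction cost is paid
// only when the pool is empty.
package sketch

import "sync"

type pooledIter struct {
	// In a real implementation this would wrap an engine iterator.
}

// reset clears any cached state before the iterator is handed out again.
func (it *pooledIter) reset() {}

var iterPool = sync.Pool{
	New: func() any { return &pooledIter{} },
}

// getIter returns a reusable iterator, constructing one only if none is pooled.
func getIter() *pooledIter {
	it := iterPool.Get().(*pooledIter)
	it.reset()
	return it
}

// putIter returns the iterator to the pool for the next batch.
func putIter(it *pooledIter) {
	iterPool.Put(it)
}
```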
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @jbowens, and @sumeerbhola)
Force-pushed from 682df9d to 8c874b4.
Reviewed 28 of 28 files at r6, 2 of 11 files at r7, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aliher1911, @jbowens, and @sumeerbhola)
TFTR! Going to let this sit until we bump Pebble and do a kv95 benchmark run, but I think we're probably close enough to baseline that we can merge.
Force-pushed from 2126b67 to 8504e23.
Thanks for bumping Pebble! I ran a couple of kv95 benchmark runs.
This branch (rebased):
I think we have to merge this as-is to unblock other work, but we'll need to claw much of this back later.
Force-pushed from 8504e23 to be87e94.
Microbenchmarks are still pretty bad though. I got back a few percent of the construction cost by explicitly embedding a
Did a quick experiment to see if lazily switching to the
Compared to
This shouldn't be a huge amount of work; I'll give it a shot before merging this. How much work would it be to implement lazy construction in Pebble, @jbowens?
I've started working on it, and I think it shouldn't be too much work to get something functional. I'm a little worried about finding a design that isn't complicated and a burden to maintain. I think I should be able to put up a PR with something good enough for now soon, and we can try to refactor to lighten the complexity afterwards. I'm sure the addition of persistence caused a step backwards in performance, because it adds more work to iterator construction. cockroachdb/pebble#1771 should've clawed back most or all of that, but it didn't make the Pebble bump on master last week.
Looks like the gains are restricted to the
I tried running some benchmarks with this branch with Pebble at cockroachdb/pebble@20e506c and compared it to its parent with the same Pebble SHA:
The get regression is inching downwards, but the scan numbers are much, much worse than previous numbers. Diffing profiles for
It appears that we're performing more seeks than before, because the CPU time is elevated throughout
These results seem off to me. I tried these myself after rebasing this branch onto
The diff isn't showing any seek changes either:
I've found benchmarking with Bazel to be highly unreliable, since these benchmarks pre-create a dataset which is then stored inside some sort of Bazel working directory that changes with the build. Running the IDE and other software will also interfere with benchmarks. I've resorted to using
Force-pushed from f915137 to 76c4b76.
I've updated the PR to only initialize a
An MVCC scan in `TestMVCCHistories` would show an incorrect key for the intents, using the scan start key rather than the intent key. Furthermore, intents are listed before the scan results, but this was not made clear by the formatting, which could cause readers to believe they were emitted in an incorrect order. Release note: None
Force-pushed from 76c4b76 to aa7c9bb.
Force-pushed from aa7c9bb to 6733f95.
Ran a couple of quick benchmarks. This PR:
TFTR! bors r=jbowens
Build failed:
bors retry
Build failed:
Groan. Third time's the charm. bors retry
Build failed:
This is getting ridiculous. bors retry
Build succeeded:
This patch adds MVCC range tombstone handling for scans and gets. In the
basic case, this simply means that point keys below an MVCC range
tombstone are not visible.
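As a concrete illustration of this basic visibility rule, here is a small, hypothetical Go sketch (not CRDB code): it ignores intents, multiple versions per key, and the tombstone synthesis described below, and only checks whether a single point version is hidden by a covering MVCC range tombstone at a read timestamp.

```go
// Package sketch models the simplest visibility rule: a point version is
// invisible to a read if a range tombstone with a higher timestamp (but at or
// below the read timestamp) covers its key.
package sketch

type rangeTombstone struct {
	start, end string // key span [start, end)
	ts         int64  // tombstone timestamp
}

// pointVisible reports whether the point version (key, pointTS) is visible to
// a read at readTS given a set of MVCC range tombstones.
func pointVisible(key string, pointTS, readTS int64, tombs []rangeTombstone) bool {
	if pointTS > readTS {
		return false // version is newer than the read timestamp
	}
	for _, t := range tombs {
		covers := key >= t.start && key < t.end
		if covers && t.ts > pointTS && t.ts <= readTS {
			return false // deleted by a range tombstone below the read timestamp
		}
	}
	return true
}
```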
When tombstones are requested by the caller, the MVCC range tombstones
themselves are never exposed, to avoid having to explicitly handle these
throughout the codebase. Instead, synthetic MVCC point tombstones are
emitted at the start of MVCC range tombstones and wherever they overlap
a point key (above and below). Additionally, point gets return synthetic
point tombstones if they overlap an MVCC range tombstone even if no
existing point key exists. This is based on `pointSynthesizingIter`, which
avoids additional logic in `pebbleMVCCScanner`.
Synthetic MVCC point tombstones emitted for MVCC range tombstones are
not stable, nor are they fully deterministic. For example, the start key
will be truncated by iterator bounds, so an `MVCCScan` over a given key
span may see a synthetic point tombstone at its start (if it overlaps an
MVCC range tombstone), but this will not be emitted if a broader span is
used (a different point tombstone will be emitted instead). Similarly, a
CRDB range split/merge will split/merge MVCC range tombstones, changing
which point tombstones are emitted. Furthermore, `MVCCGet` will
synthesize an MVCC point tombstone if it overlaps an MVCC range
tombstone and there is no existing point key there, while an `MVCCScan`
will not emit these. Callers must take care not to rely on such
semantics for MVCC tombstones. Existing callers have been audited to
ensure they are not affected.
Point tombstone synthesis must be enabled even when the caller has not
requested tombstones, because they must always be taken into account for
conflict/uncertainty checks. However, in these cases we enable range key
masking below the read timestamp, omitting any covered points since
these are no longer needed.
Touches #70412.
Release note: None