
storage: add MVCC range tombstone handling in scans and gets #82045

Merged
2 commits, Jun 26, 2022

Conversation

erikgrinaker
Contributor

@erikgrinaker erikgrinaker commented May 29, 2022

This patch adds MVCC range tombstone handling for scans and gets. In the
basic case, this simply means that point keys below an MVCC range
tombstone are not visible.

When tombstones are requested by the caller, the MVCC range tombstones
themselves are never exposed, to avoid having to explicitly handle these
throughout the codebase. Instead, synthetic MVCC point tombstones are
emitted at the start of MVCC range tombstones and wherever they overlap
a point key (above and below). Additionally, point gets return synthetic
point tombstones if they overlap an MVCC range tombstone even if no
point key exists there. This is based on `pointSynthesizingIter`,
which avoids additional logic in `pebbleMVCCScanner`.

Synthetic MVCC point tombstones emitted for MVCC range tombstones are
not stable, nor are they fully deterministic. For example, the start key
will be truncated by iterator bounds, so an `MVCCScan` over a given key
span may see a synthetic point tombstone at its start (if it overlaps an
MVCC range tombstone), but this will not be emitted if a broader span is
used (a different point tombstone will be emitted instead). Similarly, a
CRDB range split/merge will split/merge MVCC range tombstones, changing
which point tombstones are emitted. Furthermore, `MVCCGet` will
synthesize an MVCC point tombstone if it overlaps an MVCC range
tombstone and there is no existing point key there, while an `MVCCScan`
will not emit these. Callers must take care not to rely on such
semantics for MVCC tombstones. Existing callers have been audited to
ensure they are not affected.

Point tombstone synthesis must be enabled even when the caller has not
requested tombstones, because they must always be taken into account for
conflict/uncertainty checks. However, in these cases we enable range key
masking below the read timestamp, omitting any covered points since
these are no longer needed.
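
For intuition, here's a minimal, self-contained Go sketch of the visibility rule described above. All types here are simplified hypothetical stand-ins (string keys, integer timestamps), not the actual storage APIs:

```go
package main

import "fmt"

// Simplified stand-ins for the real MVCC types.
type rangeTombstone struct {
	start, end string // key span [start, end)
	ts         int
}

type pointKey struct {
	key string
	ts  int
}

// covered reports whether a point key is shadowed by a range tombstone:
// the tombstone must overlap the key, sit at a higher timestamp than the
// point, and be at or below the read timestamp.
func covered(p pointKey, rt rangeTombstone, readTS int) bool {
	inSpan := p.key >= rt.start && p.key < rt.end
	return inSpan && rt.ts > p.ts && rt.ts <= readTS
}

func main() {
	rt := rangeTombstone{start: "a", end: "z", ts: 5}
	p := pointKey{key: "b", ts: 3}

	// At readTS=7 the tombstone covers the point, so the point is not
	// visible; a get that requests tombstones would instead see a
	// synthetic point tombstone at ("b", 5).
	fmt.Println(covered(p, rt, 7)) // true
	// At readTS=4 the tombstone is in the read's future, so the point
	// written at ts=3 remains visible.
	fmt.Println(covered(p, rt, 4)) // false
}
```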

Touches #70412.

Release note: None


@erikgrinaker
Contributor Author

erikgrinaker commented Jun 4, 2022

@jbowens We'll need to optimize the null path here (no range keys). Here are the latest benchmarks against the parent of this PR:

name                                                           old time/op    new time/op    delta
MVCCScan_Pebble/rows=1/versions=1/valueSize=64-24                4.69µs ± 1%    5.54µs ± 0%  +18.06%  (p=0.000 n=10+8)
MVCCScan_Pebble/rows=1/versions=10/valueSize=64-24               6.45µs ± 1%    7.60µs ± 0%  +17.84%  (p=0.000 n=9+9)
MVCCScan_Pebble/rows=100/versions=1/valueSize=64-24              35.4µs ± 1%    41.1µs ± 1%  +16.15%  (p=0.000 n=9+10)
MVCCScan_Pebble/rows=100/versions=10/valueSize=64-24              113µs ± 0%     135µs ± 1%  +19.64%  (p=0.000 n=8+10)
MVCCScan_Pebble/rows=10000/versions=1/valueSize=64-24            2.57ms ± 1%    2.99ms ± 1%  +16.55%  (p=0.000 n=10+10)
MVCCScan_Pebble/rows=10000/versions=10/valueSize=64-24           9.62ms ± 1%   11.68ms ± 2%  +21.41%  (p=0.000 n=9+10)
MVCCReverseScan_Pebble/rows=1/versions=1/valueSize=64-24         5.18µs ± 1%    5.93µs ± 1%  +14.37%  (p=0.000 n=9+10)
MVCCReverseScan_Pebble/rows=1/versions=10/valueSize=64-24        8.94µs ± 1%   10.54µs ± 1%  +17.87%  (p=0.000 n=10+10)
MVCCReverseScan_Pebble/rows=100/versions=1/valueSize=64-24       47.3µs ± 1%    56.2µs ± 0%  +18.69%  (p=0.000 n=10+9)
MVCCReverseScan_Pebble/rows=100/versions=10/valueSize=64-24       320µs ± 1%     397µs ± 1%  +24.03%  (p=0.000 n=10+10)
MVCCReverseScan_Pebble/rows=10000/versions=1/valueSize=64-24     3.78ms ± 1%    4.48ms ± 1%  +18.49%  (p=0.000 n=10+10)
MVCCReverseScan_Pebble/rows=10000/versions=10/valueSize=64-24    30.6ms ± 4%    37.3ms ± 3%  +21.76%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=false/versions=1/valueSize=8-24             4.53µs ± 0%    5.22µs ± 1%  +15.35%  (p=0.000 n=9+10)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8-24            5.54µs ± 1%    6.33µs ± 1%  +14.19%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8-24           14.0µs ± 1%    15.0µs ± 3%   +7.47%  (p=0.000 n=8+10)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8-24              2.77µs ± 0%    3.39µs ± 1%  +22.31%  (p=0.000 n=8+10)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8-24             3.96µs ± 1%    4.77µs ± 1%  +20.51%  (p=0.000 n=9+10)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8-24            11.1µs ± 3%    12.4µs ± 4%  +12.06%  (p=0.000 n=10+10)

Most of this seems to be in Pebble. Here are a couple of profiles, along with a profile diff graph showing much of it in pebble.InterleavingIter, if I'm reading this right (there's also a fair bit in SetOptions):

[screenshot: CPU profile diff graph, 2022-06-04]

I also tried using the latest Pebble master, which shows a modest improvement over the current PR, but still pretty far from where we need to be:

MVCCGet_Pebble/batch=false/versions=1/valueSize=8-24      5.22µs ± 1%    5.20µs ± 1%    ~     (p=0.128 n=10+10)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8-24     6.33µs ± 1%    6.34µs ± 1%    ~     (p=0.271 n=10+10)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8-24    15.0µs ± 3%    15.0µs ± 3%    ~     (p=0.529 n=10+10)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8-24       3.39µs ± 1%    3.32µs ± 1%  -1.90%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8-24      4.77µs ± 1%    4.63µs ± 1%  -2.87%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8-24     12.4µs ± 4%    12.0µs ± 4%  -3.38%  (p=0.011 n=10+10)

Appreciate you pulling at this. There is also some optimization work needed in pointSynthesizingIter, but it's a far smaller contribution so I'll hold off until we've optimized Pebble. Let me know if I can do anything to help.

@jbowens
Collaborator

jbowens commented Jun 6, 2022

Thanks @erikgrinaker — In this "null" case, are there still range-key clear tombstones? (eg, from #82041)

@erikgrinaker
Contributor Author

Shouldn't be, no -- we're setting up a new engine for the benchmark.

@jbowens
Collaborator

jbowens commented Jun 7, 2022

I tried removing the entire range-key iterator stack and the interleaving iterator, just to try to measure its overhead in this noop case. It looks like it has a 5-8.5% delta. I think there's definitely some performance we can claw back through optimizing the interleaving iterator and range-key iterator.

name                                                    old time/op    new time/op    delta
MVCCGet_Pebble/batch=false/versions=1/valueSize=8-10      4.06µs ± 1%    3.85µs ± 2%  -5.08%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8-10     4.86µs ± 4%    4.57µs ± 2%  -6.02%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8-10    10.0µs ± 2%     9.7µs ± 2%  -2.52%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8-10       2.50µs ± 1%    2.28µs ± 1%  -8.54%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8-10      3.38µs ± 2%    3.13µs ± 2%  -7.34%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8-10     8.03µs ± 4%    7.45µs ± 1%  -7.27%  (p=0.000 n=10+10)

name                                                    old speed      new speed      delta
MVCCGet_Pebble/batch=false/versions=1/valueSize=8-10    1.97MB/s ± 0%  2.08MB/s ± 2%  +5.20%  (p=0.000 n=7+10)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8-10   1.65MB/s ± 4%  1.75MB/s ± 2%  +6.37%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8-10   803kB/s ± 1%   823kB/s ± 2%  +2.49%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8-10     3.20MB/s ± 1%  3.50MB/s ± 1%  +9.30%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8-10    2.36MB/s ± 1%  2.55MB/s ± 2%  +8.07%  (p=0.000 n=9+10)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8-10   1.00MB/s ± 2%  1.07MB/s ± 0%  +7.62%  (p=0.000 n=9+10)

Compared with this PR's parent SHA, the commit without the interleaving iter still has a slowdown:

name                                                    old time/op    new time/op    delta
MVCCGet_Pebble/batch=false/versions=1/valueSize=8-10      3.47µs ± 1%    3.85µs ± 2%  +10.97%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8-10     4.22µs ± 3%    4.57µs ± 2%   +8.08%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8-10    9.32µs ± 2%    9.72µs ± 2%   +4.26%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8-10       2.18µs ± 1%    2.28µs ± 1%   +4.94%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8-10      3.01µs ± 4%    3.13µs ± 2%   +4.06%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8-10     7.46µs ± 5%    7.45µs ± 1%     ~     (p=1.000 n=10+10)

name                                                    old speed      new speed      delta
MVCCGet_Pebble/batch=false/versions=1/valueSize=8-10    2.31MB/s ± 1%  2.08MB/s ± 2%   -9.93%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8-10   1.90MB/s ± 3%  1.75MB/s ± 2%   -7.49%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8-10   860kB/s ± 2%   823kB/s ± 2%   -4.30%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8-10     3.68MB/s ± 1%  3.50MB/s ± 1%   -4.73%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8-10    2.66MB/s ± 4%  2.55MB/s ± 2%   -3.98%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8-10   1.07MB/s ± 5%  1.07MB/s ± 0%     ~     (p=0.862 n=10+10)

With a batch, the slowdown appears to be exclusively in iterator construction.

Without a batch, there's the iterator construction slowdown plus some additional slowdown that scales with the number of Gets.

Going to keep digging.

@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-scan branch from c2c3606 to 682df9d on June 10, 2022 07:38
@erikgrinaker erikgrinaker marked this pull request as ready for review June 10, 2022 07:38
@erikgrinaker erikgrinaker requested review from a team as code owners June 10, 2022 07:38
@erikgrinaker
Contributor Author

erikgrinaker commented Jun 10, 2022

Marking this as ready for review, since the functional blockers have merged, but it still needs optimization work.

Collaborator

@jbowens jbowens left a comment


it looks like some additional slowdown that scales with the number of Gets.

Ah, I think this is because MVCCGet without a batch needs to always initialize an iterator, whereas when the reader is the same pebbleBatch each time, the pebbleBatch holds the iterators it creates and can reuse them.
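
The reuse pattern is roughly the following; all names here are hypothetical stand-ins for sketching purposes, not the actual pebbleBatch code:

```go
package main

import "fmt"

// iterOptions and iterator are fake, minimal types; the point is the
// caching pattern, not the real API.
type iterOptions struct{ lower, upper string }

type iterator struct{ opts iterOptions }

func (it *iterator) setOptions(o iterOptions) { it.opts = o }

var constructions int

// newIterator stands in for building the full (expensive) iterator stack.
func newIterator(o iterOptions) *iterator {
	constructions++
	return &iterator{opts: o}
}

// batchReader caches the iterator it creates, so repeated Gets pay only
// a cheap reconfiguration rather than a fresh construction.
type batchReader struct{ cached *iterator }

func (b *batchReader) iter(o iterOptions) *iterator {
	if b.cached == nil {
		b.cached = newIterator(o)
	} else {
		b.cached.setOptions(o)
	}
	return b.cached
}

func main() {
	b := &batchReader{}
	for i := 0; i < 100; i++ {
		_ = b.iter(iterOptions{lower: "a", upper: "z"})
	}
	fmt.Println("constructions:", constructions) // 1: a hundred Gets, one construction
}
```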

I've merged some Pebble optimizations that help (included in #82736), especially in the MVCCGet_Pebble/batch=true case. Here are the benchmarks of this branch (an old commit from before you rebased) versus its parent:

name                                                    old time/op    new time/op    delta
MVCCGet_Pebble/batch=false/versions=1/valueSize=8-10      3.47µs ± 1%    3.87µs ± 1%  +11.39%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8-10     4.22µs ± 3%    4.66µs ± 1%  +10.37%  (p=0.000 n=10+9)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8-10    9.32µs ± 2%    9.75µs ± 3%   +4.54%  (p=0.000 n=10+9)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8-10       2.18µs ± 1%    2.25µs ± 3%   +3.30%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8-10      3.01µs ± 4%    3.17µs ± 1%   +5.22%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8-10     7.46µs ± 5%    7.15µs ± 2%   -4.10%  (p=0.000 n=10+10)

The iterator construction slowdown is still very present and, unfortunately, is going to be challenging to reduce. The current design of combined iteration constructs two iterator stacks, the point iterator stack and the range key iterator stack, and glues them together with the interleaving iterator. Just initializing the range key iterator stack's various internal iterators (e.g., setting fields) adds a slowdown. It's also certainly going to get worse with persistence (just merged! 🎉 in cockroachdb/pebble@ae99f4f12f).

There are two approaches I can see to removing the iterator construction overhead:

  1. Lazily construct the combined iterator: As the point iterator iterates through the LSM, if it ever encounters a file that contains range keys, it bubbles that knowledge up. The pebble.Iterator constructs the range key iterator and restructures itself to initialize the combined iterator in the same spot (see the sketch after this list).
  2. Rework combined iteration to dynamically update the range key merging iterator's levels as files are opened and closed. This likely would be pretty gnarly.
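
A rough sketch of option 1, with entirely hypothetical types (the real change would live inside pebble.Iterator):

```go
package main

import "fmt"

// pointIter fakes the point iterator stack; next reports, alongside the
// key, whether it just opened a file containing range keys.
type pointIter struct{ pos int }

func (p *pointIter) next() (string, bool) {
	p.pos++
	return fmt.Sprintf("k%03d", p.pos), p.pos == 3 // pretend file 3 has range keys
}

// lazyIter defers building the range-key stack and the interleaving
// iterator until the point iterator bubbles up a range-key sighting.
type lazyIter struct {
	points   *pointIter
	combined bool // range-key stack + interleaving constructed?
}

func (it *lazyIter) next() string {
	key, sawRangeKeys := it.points.next()
	if sawRangeKeys && !it.combined {
		// In the real design, this is where the range-key iterator stack
		// would be constructed and the iterator restructured into the
		// combined (interleaving) form; here we just flag it.
		it.combined = true
	}
	return key
}

func main() {
	it := &lazyIter{points: &pointIter{}}
	for i := 0; i < 5; i++ {
		fmt.Println(it.next(), "combined:", it.combined)
	}
}
```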

cc @sumeerbhola if you have thoughts ^

Luckily breather week is a nice span of heads down time.

Reviewed 2 of 5 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @jbowens, and @sumeerbhola)

Contributor Author

@erikgrinaker erikgrinaker left a comment


I've merged some Pebble optimizations that help (included in #82736), especially in the MVCCGet_Pebble/batch=true case.

Awesome, thanks for the improvements! The batch=true case is getting to the point where I think we can merge this to master. batch=false still needs some work, but once we land the latest optimizations and persistence in CRDB, I'll run some end-to-end benchmarks to look at the overall perf impact. I'll see if we can reclaim a couple of percent in CRDB too.

Lazily construct the combined iterator

This seems like a good first stab. It's possible that we can more aggressively reuse iterators in CRDB too, e.g. by pooling them between batches or something.
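
For the pooling idea, a minimal sync.Pool sketch; iterState is a hypothetical stand-in for whatever per-iterator allocation is expensive:

```go
package main

import (
	"fmt"
	"sync"
)

// iterState is a hypothetical stand-in for reusable iterator state.
type iterState struct{ buf []byte }

var iterPool = sync.Pool{
	New: func() interface{} { return &iterState{buf: make([]byte, 0, 4096)} },
}

// withIter borrows an iterState from the pool, resets it, and returns it
// afterwards, amortizing the allocation across batches.
func withIter(f func(*iterState)) {
	it := iterPool.Get().(*iterState)
	defer iterPool.Put(it)
	it.buf = it.buf[:0] // reset before reuse
	f(it)
}

func main() {
	withIter(func(it *iterState) { fmt.Println("cap:", cap(it.buf)) })
	withIter(func(it *iterState) { fmt.Println("cap:", cap(it.buf)) }) // reuses the allocation
}
```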

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @jbowens, and @sumeerbhola)

@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-scan branch from 682df9d to 8c874b4 on June 15, 2022 21:01
Collaborator

@jbowens jbowens left a comment


:lgtm:

Reviewed 28 of 28 files at r6, 2 of 11 files at r7, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aliher1911, @jbowens, and @sumeerbhola)

@erikgrinaker
Contributor Author

TFTR! Going to let this sit until we bump Pebble and do a kv95 benchmark run, but I think we're probably close enough to baseline that we can merge.

@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-scan branch 2 times, most recently from 2126b67 to 8504e23 on June 17, 2022 16:28
@erikgrinaker
Contributor Author

Thanks for bumping Pebble! I ran a couple of kv95 benchmarks on a 3-node 32-core cluster, which show a ~5% regression:

master:

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  300.0s        0       33952270       113173.7      1.5      1.1      3.9     10.0    125.8  read

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  300.0s        0        1785616         5952.0      3.5      2.9      7.9     14.2     62.9  write

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
  300.0s        0       35737886       119125.7      1.6      1.2      4.5     10.5    125.8

This branch (rebased):

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  300.0s        0       32376678       107921.9      1.6      1.2      4.1     10.0    117.4  read

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  300.0s        0        1706696         5689.0      3.5      2.9      7.9     14.2     75.5  write

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
  300.0s        0       34083374       113610.9      1.7      1.2      4.7     10.5    117.4

I think we have to merge this as-is to unblock other work, but we'll need to claw much of this back later.

@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-scan branch from 8504e23 to be87e94 on June 17, 2022 20:23
@erikgrinaker
Contributor Author

Microbenchmarks are still pretty bad, though. I got back a few percent of the construction cost by explicitly embedding a *pointSynthesizingIter in pebbleMVCCScanner and pooling them together, which also reduced some of the interface overhead. But we're still looking at ~20%. In the small-count cases, this is mostly due to range key iterator construction. In the large-count cases, it seems to be pretty evenly split between range key handling in Pebble and pointSynthesizingIter overhead. Will try to pull at this a bit more tomorrow.

name                                                    old time/op    new time/op    delta
MVCCScan_Pebble/rows=1/versions=1/valueSize=64-24         4.51µs ± 1%    5.18µs ± 3%  +14.89%  (p=0.000 n=9+10)
MVCCScan_Pebble/rows=100/versions=1/valueSize=64-24       33.3µs ± 1%    39.5µs ± 1%  +18.54%  (p=0.000 n=9+9)
MVCCScan_Pebble/rows=10000/versions=1/valueSize=64-24     2.44ms ± 1%    3.02ms ± 1%  +23.66%  (p=0.000 n=9+10)
MVCCGet_Pebble/batch=false/versions=1/valueSize=8-24      4.39µs ± 1%    5.03µs ± 2%  +14.48%  (p=0.000 n=9+10)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8-24     5.29µs ± 0%    6.09µs ± 3%  +15.13%  (p=0.000 n=8+9)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8-24    12.5µs ± 3%    14.3µs ± 5%  +14.43%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8-24       2.65µs ± 2%    2.89µs ± 2%   +8.87%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8-24      3.67µs ± 1%    4.00µs ± 2%   +9.09%  (p=0.000 n=10+9)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8-24     9.75µs ± 4%   10.33µs ± 6%   +5.89%  (p=0.000 n=10+10)

@erikgrinaker
Contributor Author

Did a quick experiment to see if lazily switching to the pointSynthesizingIter would be worthwhile. This simply enables IterKeyTypePointsAndRanges and does a HasPointAndRange() call in the hot path. The improvement over this PR's current state isn't huge, but it's something:

name                                                    old time/op    new time/op    delta
MVCCScan_Pebble/rows=1/versions=1/valueSize=64-24         5.18µs ± 3%    5.05µs ± 1%  -2.43%  (p=0.000 n=10+9)
MVCCScan_Pebble/rows=100/versions=1/valueSize=64-24       39.5µs ± 1%    37.6µs ± 1%  -4.79%  (p=0.000 n=9+10)
MVCCScan_Pebble/rows=10000/versions=1/valueSize=64-24     3.02ms ± 1%    2.83ms ± 1%  -6.21%  (p=0.000 n=10+8)
MVCCGet_Pebble/batch=false/versions=1/valueSize=8-24      5.03µs ± 2%    4.87µs ± 2%  -3.18%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8-24     6.09µs ± 3%    5.96µs ± 4%    ~     (p=0.063 n=9+9)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8-24    14.3µs ± 5%    14.4µs ± 5%    ~     (p=0.739 n=10+10)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8-24       2.89µs ± 2%    2.80µs ± 2%  -3.14%  (p=0.000 n=10+9)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8-24      4.00µs ± 2%    3.90µs ± 2%  -2.57%  (p=0.001 n=9+10)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8-24     10.3µs ± 6%    10.1µs ± 6%    ~     (p=0.143 n=10+10)

Compared to master it's still a hefty penalty:

name                                                    old time/op    new time/op    delta
MVCCScan_Pebble/rows=1/versions=1/valueSize=64-24         4.51µs ± 1%    5.05µs ± 1%  +12.10%  (p=0.000 n=9+9)
MVCCScan_Pebble/rows=100/versions=1/valueSize=64-24       33.3µs ± 1%    37.6µs ± 1%  +12.86%  (p=0.000 n=9+10)
MVCCScan_Pebble/rows=10000/versions=1/valueSize=64-24     2.44ms ± 1%    2.83ms ± 1%  +15.98%  (p=0.000 n=9+8)
MVCCGet_Pebble/batch=false/versions=1/valueSize=8-24      4.39µs ± 1%    4.87µs ± 2%  +10.84%  (p=0.000 n=9+10)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8-24     5.29µs ± 0%    5.96µs ± 4%  +12.61%  (p=0.000 n=8+9)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8-24    12.5µs ± 3%    14.4µs ± 5%  +15.00%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8-24       2.65µs ± 2%    2.80µs ± 2%   +5.45%  (p=0.000 n=10+9)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8-24      3.67µs ± 1%    3.90µs ± 2%   +6.29%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8-24     9.75µs ± 4%   10.06µs ± 6%   +3.17%  (p=0.018 n=10+10)

This shouldn't be a huge amount of work; I'll give it a shot before merging this. How much work would it be to implement lazy construction in Pebble, @jbowens?
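
In sketch form, the hot-path check from the experiment above looks roughly like this; fakeIter is a made-up stand-in, though HasPointAndRange mirrors the real Pebble method named above:

```go
package main

import "fmt"

// fakeIter is a hypothetical iterator; only HasPointAndRange mirrors a
// real Pebble method name.
type fakeIter struct{ i int }

func (f *fakeIter) Next() bool { f.i++; return f.i <= 5 }

func (f *fakeIter) HasPointAndRange() (hasPoint, hasRange bool) {
	return true, f.i == 4 // pretend position 4 overlaps a range key
}

func main() {
	it := &fakeIter{}
	synthesizing := false
	for it.Next() {
		// One branch per step until a range key is actually seen; only
		// then would a pointSynthesizingIter be constructed and positioned
		// at the current key. Scans that never see a range key never pay
		// more than this check.
		if _, hasRange := it.HasPointAndRange(); hasRange && !synthesizing {
			synthesizing = true
		}
		fmt.Println("pos", it.i, "synthesizing:", synthesizing)
	}
}
```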

@jbowens
Collaborator

jbowens commented Jun 18, 2022

How much work would it be to implement lazy construction in Pebble

I've started working on it, and I think it shouldn't be too much work to get something functional. I'm a little worried about finding a design that isn't complicated and a burden to maintain. I should be able to put up a PR soon with something good enough for now, and we can refactor to lighten the complexity afterwards.

I'm sure the addition of persistence caused a step backwards on performance, because it adds more work to iterator construction. cockroachdb/pebble#1771 should've clawed back most or all of that, but it didn't make the Pebble bump on master last week.

@jbowens
Collaborator

jbowens commented Jun 20, 2022

cockroachdb/pebble#1771 should've clawed back most or all of that

Looks like the gains are restricted to the batch=false/versions=100/valueSize=8 case. Delta compared with Pebble master:

name                                                    old time/op    new time/op    delta
MVCCGet_Pebble/batch=false/versions=1/valueSize=8-10      3.90µs ± 5%    3.84µs ± 4%    ~     (p=0.062 n=20+19)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8-10     4.62µs ± 4%    4.65µs ± 4%    ~     (p=0.376 n=19+20)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8-10    10.3µs ± 6%     9.5µs ±10%  -8.36%  (p=0.000 n=20+20)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8-10       2.19µs ± 5%    2.21µs ± 5%    ~     (p=0.172 n=20+20)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8-10      3.05µs ± 7%    3.08µs ± 4%    ~     (p=0.560 n=20+20)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8-10     7.21µs ± 4%    7.28µs ± 7%    ~     (p=0.525 n=20+20)

@jbowens
Collaborator

jbowens commented Jun 22, 2022

I tried running some benchmarks with this branch with Pebble at cockroachdb/pebble@20e506c and compared it to its parent with the same Pebble SHA:

name                                                     old time/op    new time/op    delta
MVCCGet_Pebble/batch=false/versions=1/valueSize=8-24       5.75µs ± 1%    6.32µs ± 1%    +9.78%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8-24      7.27µs ± 1%    7.88µs ± 1%    +8.45%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8-24     18.0µs ± 6%    19.1µs ± 5%    +5.93%  (p=0.001 n=10+10)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8-24        3.59µs ± 1%    3.75µs ± 2%    +4.48%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8-24       5.26µs ± 1%    5.54µs ± 1%    +5.29%  (p=0.000 n=10+9)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8-24      14.2µs ± 4%    14.7µs ± 7%    +4.10%  (p=0.001 n=9+10)
MVCCScan_Pebble/rows=1/versions=1/valueSize=64-24          5.84µs ± 2%    7.59µs ± 3%   +29.90%  (p=0.000 n=9+10)
MVCCScan_Pebble/rows=1/versions=100/valueSize=64-24        17.6µs ± 1%    17.1µs ± 8%      ~     (p=0.138 n=10+10)
MVCCScan_Pebble/rows=100/versions=1/valueSize=64-24        51.4µs ± 2%   118.6µs ± 1%  +130.83%  (p=0.000 n=10+10)
MVCCScan_Pebble/rows=100/versions=100/valueSize=64-24       451µs ± 1%     485µs ± 1%    +7.35%  (p=0.000 n=8+10)
MVCCScan_Pebble/rows=10000/versions=1/valueSize=64-24      3.30ms ± 4%    9.80ms ± 1%  +196.64%  (p=0.000 n=10+9)
MVCCScan_Pebble/rows=10000/versions=100/valueSize=64-24    40.6ms ± 6%    43.1ms ± 7%    +6.40%  (p=0.002 n=10+10)

The get regression is inching downwards, but the scan numbers are much, much worse than previous numbers.

Diffing profiles for BenchmarkMVCCScan_Pebble/rows=10000/versions=1/valueSize=64:

      flat  flat%   sum%        cum   cum%
    1100ms  5.03%  5.03%     2360ms 10.80%  github.com/cockroachdb/cockroach/pkg/storage.EngineKeyCompare
    1080ms  4.94%  9.97%    10560ms 48.31%  github.com/cockroachdb/pebble.(*Iterator).SeekGEWithLimit
     910ms  4.16% 14.14%      910ms  4.16%  cmpbody
    -720ms  3.29% 10.84%    -1310ms  5.99%  github.com/cockroachdb/cockroach/pkg/storage.decodeExtendedMVCCValue
     650ms  2.97% 13.82%    15800ms 72.28%  github.com/cockroachdb/cockroach/pkg/storage.(*intentInterleavingIter).SeekGE
    -610ms  2.79% 11.02%    -1870ms  8.55%  github.com/cockroachdb/cockroach/pkg/storage.(*pebbleMVCCScanner).decodeCurrentValueExtended
    -580ms  2.65%  8.37%    -2260ms 10.34%  github.com/cockroachdb/pebble.(*mergingIter).nextEntry
    -440ms  2.01%  6.36%     -470ms  2.15%  github.com/cockroachdb/cockroach/pkg/storage.(*pebbleMVCCScanner).updateCurrent
     420ms  1.92%  8.28%     1010ms  4.62%  github.com/cockroachdb/cockroach/pkg/util/encoding.encodeBytesAscendingWithoutTerminatorOrPrefix

It appears that we're performing more seeks than previously, because the CPU time is elevated throughout (*pebble.Iterator).SeekGEWithLimit.

@erikgrinaker
Contributor Author

erikgrinaker commented Jun 24, 2022

I tried running some benchmarks with this branch with Pebble at cockroachdb/pebble@20e506c and compared it to its parent with the same Pebble SHA:

These results seem off to me. I tried these myself after rebasing this branch onto master (which is at cockroachdb/pebble@20e506c), and the results were consistent with my previous results, although it does seem like we've shaved off a few percentage points (🎉):

name                                                    old time/op    new time/op    delta
MVCCScan_Pebble/rows=1/versions=1/valueSize=64-24         4.65µs ± 1%    5.24µs ± 1%  +12.54%  (p=0.000 n=10+9)
MVCCScan_Pebble/rows=100/versions=1/valueSize=64-24       34.5µs ± 1%    39.6µs ± 1%  +14.93%  (p=0.000 n=10+10)
MVCCScan_Pebble/rows=10000/versions=1/valueSize=64-24     2.52ms ± 1%    3.01ms ± 1%  +19.72%  (p=0.000 n=10+9)
MVCCGet_Pebble/batch=false/versions=1/valueSize=8-24      4.30µs ± 1%    4.91µs ± 1%  +14.08%  (p=0.000 n=9+9)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8-24     5.23µs ± 1%    5.94µs ± 2%  +13.45%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8-24    13.0µs ± 5%    13.9µs ± 4%   +7.10%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8-24       2.62µs ± 1%    2.89µs ± 1%  +10.27%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8-24      3.70µs ± 1%    4.06µs ± 1%   +9.61%  (p=0.000 n=9+10)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8-24     10.0µs ± 3%    10.3µs ± 5%   +2.97%  (p=0.017 n=10+10)

The diff isn't showing any seek changes either:

      flat  flat%   sum%        cum   cum%
     -70ms  2.83%  2.83%      -70ms  2.83%  runtime.madvise
      70ms  2.83%     0%       70ms  2.83%  runtime.pageIndexOf
     -50ms  2.02%  2.02%      -40ms  1.62%  github.com/cockroachdb/cockroach/pkg/storage/enginepb.ScanDecodeKeyValue
     -50ms  2.02%  4.05%      -50ms  2.02%  github.com/cockroachdb/pebble.(*mergingIter).findNextEntry
     -40ms  1.62%  5.67%      -40ms  1.62%  github.com/cockroachdb/pebble.(*Iterator).setRangeKey
      40ms  1.62%  4.05%       50ms  2.02%  github.com/cockroachdb/pebble/sstable.(*blockIter).readEntry
     -40ms  1.62%  5.67%      -40ms  1.62%  runtime.(*lfstack).push
      40ms  1.62%  4.05%       40ms  1.62%  runtime.futex
     -30ms  1.21%  5.26%       10ms   0.4%  github.com/cockroachdb/cockroach/pkg/storage.(*pebbleMVCCScanner).addAndAdvance
      30ms  1.21%  4.05%       90ms  3.64%  github.com/cockroachdb/cockroach/pkg/storage.(*pebbleMVCCScanner).iterNext

I've found benchmarking with Bazel to be highly unreliable, since these benchmarks pre-create a dataset which is then stored inside some sort of Bazel working directory that changes with the build. Running the IDE and other software will also interfere with benchmarks. I've resorted to using `make bench` on a gceworker to get consistent results, e.g.:

$ make bench PKG=./pkg/storage BENCHES='^BenchmarkMVCC(Get|Scan)_Pebble$' TESTFLAGS="-v -count 10" | grep BenchmarkMVCC >bench.txt
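
For reference, the before/after tables in this thread are benchstat output (golang.org/x/perf/cmd/benchstat): write the baseline run to old.txt and the patched run to new.txt with the command above, then compare:

$ benchstat old.txt new.txt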

@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-scan branch 2 times, most recently from f915137 to 76c4b76 on June 25, 2022 18:23
@erikgrinaker
Contributor Author

erikgrinaker commented Jun 25, 2022

I've updated the PR to only initialize a pointSynthesizingIter when we encounter a range key. This gave a pretty nice improvement in the no-range-key case, so we're now at about a 10% regression overall. I think that's getting close enough to merge, but I'll run some kv95 benchmarks tomorrow to check the end-to-end impact, and I need to look into a couple of test anomalies too.

name                                                    old time/op    new time/op    delta
MVCCScan_Pebble/rows=1/versions=1/valueSize=64-24         4.61µs ± 1%    5.16µs ± 2%  +12.02%  (p=0.000 n=9+10)
MVCCScan_Pebble/rows=100/versions=1/valueSize=64-24       34.4µs ± 1%    38.1µs ± 1%  +10.78%  (p=0.000 n=10+10)
MVCCScan_Pebble/rows=10000/versions=1/valueSize=64-24     2.50ms ± 1%    2.85ms ± 1%  +14.02%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=false/versions=1/valueSize=8-24      4.40µs ± 2%    4.86µs ± 2%  +10.52%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8-24     5.30µs ± 1%    5.80µs ± 1%   +9.39%  (p=0.000 n=10+9)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8-24    12.9µs ± 4%    13.8µs ± 6%   +7.07%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8-24       2.64µs ± 1%    2.83µs ± 2%   +7.24%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8-24      3.75µs ± 1%    3.96µs ± 1%   +5.59%  (p=0.000 n=9+10)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8-24     9.91µs ± 1%   10.34µs ± 5%   +4.27%  (p=0.000 n=8+10)

An MVCC scan in `TestMVCCHistories` would show an incorrect key for the
intents, using the scan start key rather than the intent key.
Furthermore, intents are listed before the scan results, but this was
not made clear by the formatting, which could cause readers to believe
they were emitted in an incorrect order.

Release note: None
@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-scan branch from 76c4b76 to aa7c9bb on June 26, 2022 10:09
@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-scan branch from aa7c9bb to 6733f95 on June 26, 2022 10:40
@erikgrinaker
Contributor Author

erikgrinaker commented Jun 26, 2022

Ran a couple of quick kv95/enc=false/nodes=3/cpu=32 benchmarks; we're now down to a ~1.5% regression. I expect we'll see a larger regression on some benchmarks, but this is good enough to merge for now. Thanks for all your help with this so far, @jbowens! 🎉

This PR:

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  180.0s        0       20437450       113540.9      1.5      1.1      3.9     10.0    167.8  read

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  180.0s        0        1075955         5977.5      4.1      3.5      8.9     14.7    130.0  write

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
  180.0s        0       21513405       119518.4      1.6      1.1      4.7     10.5    167.8  

master:

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  180.0s        0       20758955       115327.0      1.5      1.1      3.9     10.0    218.1  read

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  180.0s        0        1091871         6065.9      3.7      3.1      8.1     14.2    130.0  write

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
  180.0s        0       21850826       121392.9      1.6      1.1      4.7     10.0    218.1  

@erikgrinaker
Contributor Author

TFTR!

bors r=jbowens

@craig
Contributor

craig bot commented Jun 26, 2022

Build failed:

@erikgrinaker
Contributor Author

bors retry

@craig
Contributor

craig bot commented Jun 26, 2022

Build failed:

@erikgrinaker
Contributor Author

Groan. Third time's the charm.

bors retry

@craig
Contributor

craig bot commented Jun 26, 2022

Build failed:

@erikgrinaker
Contributor Author

This is getting ridiculous.

bors retry

@craig
Contributor

craig bot commented Jun 26, 2022

Build succeeded:

@craig craig bot merged commit 460ce6a into cockroachdb:master Jun 26, 2022
@erikgrinaker erikgrinaker deleted the mvcc-range-tombstones-scan branch June 26, 2022 17:16