
storage: add MVCC range tombstone handling in scans and gets #82045

Merged
2 commits, Jun 26, 2022

Conversation

erikgrinaker
Contributor

@erikgrinaker erikgrinaker commented May 29, 2022

This patch adds MVCC range tombstone handling for scans and gets. In the
basic case, this simply means that point keys below an MVCC range
tombstone are not visible.

When tombstones are requested by the caller, the MVCC range tombstones
themselves are never exposed, to avoid having to explicitly handle these
throughout the codebase. Instead, synthetic MVCC point tombstones are
emitted at the start of MVCC range tombstones and wherever they overlap
a point key (above and below). Additionally, point gets return synthetic
point tombstones if they overlap an MVCC range tombstone even if no
point key exists there. This is based on `pointSynthesizingIter`,
which avoids additional logic in `pebbleMVCCScanner`.

Synthetic MVCC point tombstones emitted for MVCC range tombstones are
not stable, nor are they fully deterministic. For example, the start key
will be truncated by iterator bounds, so an `MVCCScan` over a given key
span may see a synthetic point tombstone at its start (if it overlaps an
MVCC range tombstone), but this will not be emitted if a broader span is
used (a different point tombstone will be emitted instead). Similarly, a
CRDB range split/merge will split/merge MVCC range tombstones, changing
which point tombstones are emitted. Furthermore, `MVCCGet` will
synthesize an MVCC point tombstone if it overlaps an MVCC range
tombstone and there is no existing point key there, while an `MVCCScan`
will not emit these. Callers must take care not to rely on such
semantics for MVCC tombstones. Existing callers have been audited to
ensure they are not affected.

Point tombstone synthesis must be enabled even when the caller has not
requested tombstones, because they must always be taken into account for
conflict/uncertainty checks. However, in these cases we enable range key
masking below the read timestamp, omitting any covered points since
these are no longer needed.
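
For intuition, here's a minimal, self-contained Go sketch of the visibility rule described above. All types here are simplified hypothetical stand-ins (string keys, integer timestamps), not the actual storage APIs:

```go
package main

import "fmt"

// Simplified stand-ins for the real MVCC types.
type rangeTombstone struct {
	start, end string // key span [start, end)
	ts         int
}

type pointKey struct {
	key string
	ts  int
}

// covered reports whether a point key is shadowed by a range tombstone:
// the tombstone must overlap the key, sit at a higher timestamp than the
// point, and be at or below the read timestamp.
func covered(p pointKey, rt rangeTombstone, readTS int) bool {
	inSpan := p.key >= rt.start && p.key < rt.end
	return inSpan && rt.ts > p.ts && rt.ts <= readTS
}

func main() {
	rt := rangeTombstone{start: "a", end: "z", ts: 5}
	p := pointKey{key: "b", ts: 3}

	// At readTS=7 the tombstone covers the point, so the point is not
	// visible; a get that requests tombstones would instead see a
	// synthetic point tombstone at ("b", 5).
	fmt.Println(covered(p, rt, 7)) // true
	// At readTS=4 the tombstone is in the read's future, so the point
	// written at ts=3 remains visible.
	fmt.Println(covered(p, rt, 4)) // false
}
```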

Touches #70412.

Release note: None


@erikgrinaker
Contributor Author

erikgrinaker commented Jun 4, 2022

@jbowens We'll need to optimize the null path here (no range keys). Here are the latest benchmarks against the parent of this PR:

name                                                           old time/op    new time/op    delta
MVCCScan_Pebble/rows=1/versions=1/valueSize=64-24                4.69µs ± 1%    5.54µs ± 0%  +18.06%  (p=0.000 n=10+8)
MVCCScan_Pebble/rows=1/versions=10/valueSize=64-24               6.45µs ± 1%    7.60µs ± 0%  +17.84%  (p=0.000 n=9+9)
MVCCScan_Pebble/rows=100/versions=1/valueSize=64-24              35.4µs ± 1%    41.1µs ± 1%  +16.15%  (p=0.000 n=9+10)
MVCCScan_Pebble/rows=100/versions=10/valueSize=64-24              113µs ± 0%     135µs ± 1%  +19.64%  (p=0.000 n=8+10)
MVCCScan_Pebble/rows=10000/versions=1/valueSize=64-24            2.57ms ± 1%    2.99ms ± 1%  +16.55%  (p=0.000 n=10+10)
MVCCScan_Pebble/rows=10000/versions=10/valueSize=64-24           9.62ms ± 1%   11.68ms ± 2%  +21.41%  (p=0.000 n=9+10)
MVCCReverseScan_Pebble/rows=1/versions=1/valueSize=64-24         5.18µs ± 1%    5.93µs ± 1%  +14.37%  (p=0.000 n=9+10)
MVCCReverseScan_Pebble/rows=1/versions=10/valueSize=64-24        8.94µs ± 1%   10.54µs ± 1%  +17.87%  (p=0.000 n=10+10)
MVCCReverseScan_Pebble/rows=100/versions=1/valueSize=64-24       47.3µs ± 1%    56.2µs ± 0%  +18.69%  (p=0.000 n=10+9)
MVCCReverseScan_Pebble/rows=100/versions=10/valueSize=64-24       320µs ± 1%     397µs ± 1%  +24.03%  (p=0.000 n=10+10)
MVCCReverseScan_Pebble/rows=10000/versions=1/valueSize=64-24     3.78ms ± 1%    4.48ms ± 1%  +18.49%  (p=0.000 n=10+10)
MVCCReverseScan_Pebble/rows=10000/versions=10/valueSize=64-24    30.6ms ± 4%    37.3ms ± 3%  +21.76%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=false/versions=1/valueSize=8-24             4.53µs ± 0%    5.22µs ± 1%  +15.35%  (p=0.000 n=9+10)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8-24            5.54µs ± 1%    6.33µs ± 1%  +14.19%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8-24           14.0µs ± 1%    15.0µs ± 3%   +7.47%  (p=0.000 n=8+10)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8-24              2.77µs ± 0%    3.39µs ± 1%  +22.31%  (p=0.000 n=8+10)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8-24             3.96µs ± 1%    4.77µs ± 1%  +20.51%  (p=0.000 n=9+10)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8-24            11.1µs ± 3%    12.4µs ± 4%  +12.06%  (p=0.000 n=10+10)

Most of this seems to be in Pebble. Here are a couple of profiles, along with a profile diff graph showing much of it in pebble.InterleavingIter, if I'm reading this right (there's also a fair bit in SetOptions):

[screenshot: CPU profile diff graph, 2022-06-04]

I also tried using the latest Pebble master, which shows a modest improvement over the current PR, but still pretty far from where we need to be:

MVCCGet_Pebble/batch=false/versions=1/valueSize=8-24      5.22µs ± 1%    5.20µs ± 1%    ~     (p=0.128 n=10+10)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8-24     6.33µs ± 1%    6.34µs ± 1%    ~     (p=0.271 n=10+10)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8-24    15.0µs ± 3%    15.0µs ± 3%    ~     (p=0.529 n=10+10)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8-24       3.39µs ± 1%    3.32µs ± 1%  -1.90%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8-24      4.77µs ± 1%    4.63µs ± 1%  -2.87%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8-24     12.4µs ± 4%    12.0µs ± 4%  -3.38%  (p=0.011 n=10+10)

Appreciate you pulling at this. There is also some optimization work needed in pointSynthesizingIter, but it's a far smaller contribution so I'll hold off until we've optimized Pebble. Let me know if I can do anything to help.

@jbowens
Collaborator

jbowens commented Jun 6, 2022

Thanks @erikgrinaker — In this "null" case, are there still range-key clear tombstones? (eg, from #82041)

@erikgrinaker
Contributor Author

Shouldn't be, no -- we're setting up a new engine for the benchmark.

@jbowens
Collaborator

jbowens commented Jun 7, 2022

I tried removing the entire range-key iterator stack and the interleaving iterator, just to try to measure its overhead in this noop case. It looks like it has a 5-8.5% delta. I think there's definitely some performance we can claw back through optimizing the interleaving iterator and range-key iterator.

name                                                    old time/op    new time/op    delta
MVCCGet_Pebble/batch=false/versions=1/valueSize=8-10      4.06µs ± 1%    3.85µs ± 2%  -5.08%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8-10     4.86µs ± 4%    4.57µs ± 2%  -6.02%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8-10    10.0µs ± 2%     9.7µs ± 2%  -2.52%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8-10       2.50µs ± 1%    2.28µs ± 1%  -8.54%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8-10      3.38µs ± 2%    3.13µs ± 2%  -7.34%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8-10     8.03µs ± 4%    7.45µs ± 1%  -7.27%  (p=0.000 n=10+10)

name                                                    old speed      new speed      delta
MVCCGet_Pebble/batch=false/versions=1/valueSize=8-10    1.97MB/s ± 0%  2.08MB/s ± 2%  +5.20%  (p=0.000 n=7+10)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8-10   1.65MB/s ± 4%  1.75MB/s ± 2%  +6.37%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8-10   803kB/s ± 1%   823kB/s ± 2%  +2.49%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8-10     3.20MB/s ± 1%  3.50MB/s ± 1%  +9.30%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8-10    2.36MB/s ± 1%  2.55MB/s ± 2%  +8.07%  (p=0.000 n=9+10)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8-10   1.00MB/s ± 2%  1.07MB/s ± 0%  +7.62%  (p=0.000 n=9+10)

Compared with this PR's parent SHA, the commit without the interleaving iter still has a slowdown:

name                                                    old time/op    new time/op    delta
MVCCGet_Pebble/batch=false/versions=1/valueSize=8-10      3.47µs ± 1%    3.85µs ± 2%  +10.97%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8-10     4.22µs ± 3%    4.57µs ± 2%   +8.08%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8-10    9.32µs ± 2%    9.72µs ± 2%   +4.26%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8-10       2.18µs ± 1%    2.28µs ± 1%   +4.94%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8-10      3.01µs ± 4%    3.13µs ± 2%   +4.06%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8-10     7.46µs ± 5%    7.45µs ± 1%     ~     (p=1.000 n=10+10)

name                                                    old speed      new speed      delta
MVCCGet_Pebble/batch=false/versions=1/valueSize=8-10    2.31MB/s ± 1%  2.08MB/s ± 2%   -9.93%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8-10   1.90MB/s ± 3%  1.75MB/s ± 2%   -7.49%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8-10   860kB/s ± 2%   823kB/s ± 2%   -4.30%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8-10     3.68MB/s ± 1%  3.50MB/s ± 1%   -4.73%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8-10    2.66MB/s ± 4%  2.55MB/s ± 2%   -3.98%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8-10   1.07MB/s ± 5%  1.07MB/s ± 0%     ~     (p=0.862 n=10+10)

With a batch, the slowdown appears to be exclusively in iterator construction.

Without a batch, there's the iterator construction slowdown plus some additional slowdown that scales with the number of Gets.

Going to keep digging.

@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-scan branch from c2c3606 to 682df9d on June 10, 2022 07:38
@erikgrinaker erikgrinaker marked this pull request as ready for review June 10, 2022 07:38
@erikgrinaker erikgrinaker requested review from a team as code owners June 10, 2022 07:38
@erikgrinaker
Contributor Author

erikgrinaker commented Jun 10, 2022

Marking this as ready for review, since the functional blockers have merged, but it still needs optimization work.

Collaborator

@jbowens jbowens left a comment


it looks like some additional slowdown that scales with the number of Gets.

Ah, I think this is because MVCCGet without a batch needs to always initialize an iterator, whereas when the reader is the same pebbleBatch each time, the pebbleBatch holds the iterators it creates and can reuse them.
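
The reuse pattern is roughly the following; all names here are hypothetical stand-ins for sketching purposes, not the actual pebbleBatch code:

```go
package main

import "fmt"

// iterOptions and iterator are fake, minimal types; the point is the
// caching pattern, not the real API.
type iterOptions struct{ lower, upper string }

type iterator struct{ opts iterOptions }

func (it *iterator) setOptions(o iterOptions) { it.opts = o }

var constructions int

// newIterator stands in for building the full (expensive) iterator stack.
func newIterator(o iterOptions) *iterator {
	constructions++
	return &iterator{opts: o}
}

// batchReader caches the iterator it creates, so repeated Gets pay only
// a cheap reconfiguration rather than a fresh construction.
type batchReader struct{ cached *iterator }

func (b *batchReader) iter(o iterOptions) *iterator {
	if b.cached == nil {
		b.cached = newIterator(o)
	} else {
		b.cached.setOptions(o)
	}
	return b.cached
}

func main() {
	b := &batchReader{}
	for i := 0; i < 100; i++ {
		_ = b.iter(iterOptions{lower: "a", upper: "z"})
	}
	fmt.Println("constructions:", constructions) // 1: a hundred Gets, one construction
}
```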

I've merged some Pebble optimizations that help (included in #82736), especially in the MVCCGet_Pebble/batch=true case. Here are the benchmarks of this branch (an old commit from before you rebased) versus its parent:

name                                                    old time/op    new time/op    delta
MVCCGet_Pebble/batch=false/versions=1/valueSize=8-10      3.47µs ± 1%    3.87µs ± 1%  +11.39%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8-10     4.22µs ± 3%    4.66µs ± 1%  +10.37%  (p=0.000 n=10+9)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8-10    9.32µs ± 2%    9.75µs ± 3%   +4.54%  (p=0.000 n=10+9)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8-10       2.18µs ± 1%    2.25µs ± 3%   +3.30%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8-10      3.01µs ± 4%    3.17µs ± 1%   +5.22%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8-10     7.46µs ± 5%    7.15µs ± 2%   -4.10%  (p=0.000 n=10+10)

The iterator construction slowdown is still very present and, unfortunately, is going to be challenging to reduce. The current design of combined iteration constructs two iterator stacks, the point iterator stack and the range key iterator stack, and glues them together with the interleaving iterator. Just initializing the range key iterator stack's various internal iterators (e.g., setting fields) adds a slowdown. It's also certainly going to get worse with persistence (just merged! 🎉 in cockroachdb/pebble@ae99f4f12f).

There are two approaches I can see to removing the iterator construction overhead:

  1. Lazily construct the combined iterator: As the point iterator iterates through the LSM, if it ever encounters a file that contains range keys, it bubbles that knowledge up. The pebble.Iterator constructs the range key iterator and restructures itself to initialize the combined iterator in the same spot (see the sketch after this list).
  2. Rework combined iteration to dynamically update the range key merging iterator's levels as files are opened and closed. This likely would be pretty gnarly.
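
A rough sketch of option 1, with entirely hypothetical types (the real change would live inside pebble.Iterator):

```go
package main

import "fmt"

// pointIter fakes the point iterator stack; next reports, alongside the
// key, whether it just opened a file containing range keys.
type pointIter struct{ pos int }

func (p *pointIter) next() (string, bool) {
	p.pos++
	return fmt.Sprintf("k%03d", p.pos), p.pos == 3 // pretend file 3 has range keys
}

// lazyIter defers building the range-key stack and the interleaving
// iterator until the point iterator bubbles up a range-key sighting.
type lazyIter struct {
	points   *pointIter
	combined bool // range-key stack + interleaving constructed?
}

func (it *lazyIter) next() string {
	key, sawRangeKeys := it.points.next()
	if sawRangeKeys && !it.combined {
		// In the real design, this is where the range-key iterator stack
		// would be constructed and the iterator restructured into the
		// combined (interleaving) form; here we just flag it.
		it.combined = true
	}
	return key
}

func main() {
	it := &lazyIter{points: &pointIter{}}
	for i := 0; i < 5; i++ {
		fmt.Println(it.next(), "combined:", it.combined)
	}
}
```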

cc @sumeerbhola if you have thoughts ^

Luckily breather week is a nice span of heads down time.

Reviewed 2 of 5 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @jbowens, and @sumeerbhola)

Contributor Author

@erikgrinaker erikgrinaker left a comment


I've merged some Pebble optimizations that help (included in #82736), especially in the MVCCGet_Pebble/batch=true case.

Awesome, thanks for the improvements! The batch=true case is getting to the point where I think we can merge this to master. batch=false still needs some work, but once we land the latest optimizations and persistence in CRDB, I'll run some end-to-end benchmarks to look at the overall perf impact. I'll see if we can reclaim a couple of percent in CRDB too.

Lazily construct the combined iterator

This seems like a good first stab. It's possible that we can more aggressively reuse iterators in CRDB too, e.g. by pooling them between batches or something.
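
For the pooling idea, a minimal sync.Pool sketch; iterState is a hypothetical stand-in for whatever per-iterator allocation is expensive:

```go
package main

import (
	"fmt"
	"sync"
)

// iterState is a hypothetical stand-in for reusable iterator state.
type iterState struct{ buf []byte }

var iterPool = sync.Pool{
	New: func() interface{} { return &iterState{buf: make([]byte, 0, 4096)} },
}

// withIter borrows an iterState from the pool, resets it, and returns it
// afterwards, amortizing the allocation across batches.
func withIter(f func(*iterState)) {
	it := iterPool.Get().(*iterState)
	defer iterPool.Put(it)
	it.buf = it.buf[:0] // reset before reuse
	f(it)
}

func main() {
	withIter(func(it *iterState) { fmt.Println("cap:", cap(it.buf)) })
	withIter(func(it *iterState) { fmt.Println("cap:", cap(it.buf)) }) // reuses the allocation
}
```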

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @jbowens, and @sumeerbhola)

@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-scan branch from 682df9d to 8c874b4 on June 15, 2022 21:01
Collaborator

@jbowens jbowens left a comment


:lgtm:

Reviewed 28 of 28 files at r6, 2 of 11 files at r7, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aliher1911, @jbowens, and @sumeerbhola)

@erikgrinaker
Contributor Author

TFTR! Going to let this sit until we bump Pebble and do a kv95 benchmark run, but I think we're probably close enough to baseline that we can merge.

@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-scan branch 2 times, most recently from 2126b67 to 8504e23 on June 17, 2022 16:28
@erikgrinaker
Contributor Author

Thanks for bumping Pebble! I ran a couple of kv95 benchmarks on a 3-node 32-core cluster, which show a ~5% regression:

master:

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  300.0s        0       33952270       113173.7      1.5      1.1      3.9     10.0    125.8  read

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  300.0s        0        1785616         5952.0      3.5      2.9      7.9     14.2     62.9  write

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
  300.0s        0       35737886       119125.7      1.6      1.2      4.5     10.5    125.8

This branch (rebased):

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  300.0s        0       32376678       107921.9      1.6      1.2      4.1     10.0    117.4  read

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  300.0s        0        1706696         5689.0      3.5      2.9      7.9     14.2     75.5  write

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
  300.0s        0       34083374       113610.9      1.7      1.2      4.7     10.5    117.4

I think we have to merge this as-is to unblock other work, but we'll need to claw much of this back later.

@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-scan branch from 8504e23 to be87e94 on June 17, 2022 20:23
@erikgrinaker
Contributor Author

Microbenchmarks are still pretty bad, though. I got back a few percent of the construction cost by explicitly embedding a *pointSynthesizingIter in pebbleMVCCScanner and pooling them together, which also reduced some of the interface overhead. But we're still looking at ~20%. In the small-count cases, this is mostly due to range key iterator construction. In the large-count cases, it seems to be pretty evenly split between range key handling in Pebble and pointSynthesizingIter overhead. Will try to pull at this a bit more tomorrow.

name                                                    old time/op    new time/op    delta
MVCCScan_Pebble/rows=1/versions=1/valueSize=64-24         4.51µs ± 1%    5.18µs ± 3%  +14.89%  (p=0.000 n=9+10)
MVCCScan_Pebble/rows=100/versions=1/valueSize=64-24       33.3µs ± 1%    39.5µs ± 1%  +18.54%  (p=0.000 n=9+9)
MVCCScan_Pebble/rows=10000/versions=1/valueSize=64-24     2.44ms ± 1%    3.02ms ± 1%  +23.66%  (p=0.000 n=9+10)
MVCCGet_Pebble/batch=false/versions=1/valueSize=8-24      4.39µs ± 1%    5.03µs ± 2%  +14.48%  (p=0.000 n=9+10)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8-24     5.29µs ± 0%    6.09µs ± 3%  +15.13%  (p=0.000 n=8+9)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8-24    12.5µs ± 3%    14.3µs ± 5%  +14.43%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8-24       2.65µs ± 2%    2.89µs ± 2%   +8.87%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8-24      3.67µs ± 1%    4.00µs ± 2%   +9.09%  (p=0.000 n=10+9)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8-24     9.75µs ± 4%   10.33µs ± 6%   +5.89%  (p=0.000 n=10+10)

@erikgrinaker
Contributor Author

Did a quick experiment to see if lazily switching to the pointSynthesizingIter would be worthwhile. This simply enables IterKeyTypePointsAndRanges and does a HasPointAndRange() call in the hot path. The improvement over this PR's current state isn't huge, but it's something:

name                                                    old time/op    new time/op    delta
MVCCScan_Pebble/rows=1/versions=1/valueSize=64-24         5.18µs ± 3%    5.05µs ± 1%  -2.43%  (p=0.000 n=10+9)
MVCCScan_Pebble/rows=100/versions=1/valueSize=64-24       39.5µs ± 1%    37.6µs ± 1%  -4.79%  (p=0.000 n=9+10)
MVCCScan_Pebble/rows=10000/versions=1/valueSize=64-24     3.02ms ± 1%    2.83ms ± 1%  -6.21%  (p=0.000 n=10+8)
MVCCGet_Pebble/batch=false/versions=1/valueSize=8-24      5.03µs ± 2%    4.87µs ± 2%  -3.18%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8-24     6.09µs ± 3%    5.96µs ± 4%    ~     (p=0.063 n=9+9)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8-24    14.3µs ± 5%    14.4µs ± 5%    ~     (p=0.739 n=10+10)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8-24       2.89µs ± 2%    2.80µs ± 2%  -3.14%  (p=0.000 n=10+9)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8-24      4.00µs ± 2%    3.90µs ± 2%  -2.57%  (p=0.001 n=9+10)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8-24     10.3µs ± 6%    10.1µs ± 6%    ~     (p=0.143 n=10+10)

Compared to master it's still a hefty penalty:

name                                                    old time/op    new time/op    delta
MVCCScan_Pebble/rows=1/versions=1/valueSize=64-24         4.51µs ± 1%    5.05µs ± 1%  +12.10%  (p=0.000 n=9+9)
MVCCScan_Pebble/rows=100/versions=1/valueSize=64-24       33.3µs ± 1%    37.6µs ± 1%  +12.86%  (p=0.000 n=9+10)
MVCCScan_Pebble/rows=10000/versions=1/valueSize=64-24     2.44ms ± 1%    2.83ms ± 1%  +15.98%  (p=0.000 n=9+8)
MVCCGet_Pebble/batch=false/versions=1/valueSize=8-24      4.39µs ± 1%    4.87µs ± 2%  +10.84%  (p=0.000 n=9+10)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8-24     5.29µs ± 0%    5.96µs ± 4%  +12.61%  (p=0.000 n=8+9)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8-24    12.5µs ± 3%    14.4µs ± 5%  +15.00%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8-24       2.65µs ± 2%    2.80µs ± 2%   +5.45%  (p=0.000 n=10+9)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8-24      3.67µs ± 1%    3.90µs ± 2%   +6.29%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8-24     9.75µs ± 4%   10.06µs ± 6%   +3.17%  (p=0.018 n=10+10)

This shouldn't be a huge amount of work; I'll give it a shot before merging this. How much work would it be to implement lazy construction in Pebble, @jbowens?
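
In sketch form, the hot-path check from the experiment above looks roughly like this; fakeIter is a made-up stand-in, though HasPointAndRange mirrors the real Pebble method named above:

```go
package main

import "fmt"

// fakeIter is a hypothetical iterator; only HasPointAndRange mirrors a
// real Pebble method name.
type fakeIter struct{ i int }

func (f *fakeIter) Next() bool { f.i++; return f.i <= 5 }

func (f *fakeIter) HasPointAndRange() (hasPoint, hasRange bool) {
	return true, f.i == 4 // pretend position 4 overlaps a range key
}

func main() {
	it := &fakeIter{}
	synthesizing := false
	for it.Next() {
		// One branch per step until a range key is actually seen; only
		// then would a pointSynthesizingIter be constructed and positioned
		// at the current key. Scans that never see a range key never pay
		// more than this check.
		if _, hasRange := it.HasPointAndRange(); hasRange && !synthesizing {
			synthesizing = true
		}
		fmt.Println("pos", it.i, "synthesizing:", synthesizing)
	}
}
```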

@jbowens
Collaborator

jbowens commented Jun 18, 2022

How much work would it be to implement lazy construction in Pebble

I've started working on it, and I think it shouldn't be too much work to get something functional. I'm a little worried about finding a design that isn't complicated and a burden to maintain. I should be able to put up a PR soon with something good enough for now, and we can refactor to lighten the complexity afterwards.

I'm sure the addition of persistence caused a step backwards on performance, because it adds more work to iterator construction. cockroachdb/pebble#1771 should've clawed back most or all of that, but it didn't make the Pebble bump on master last week.

@jbowens
Collaborator

jbowens commented Jun 20, 2022

cockroachdb/pebble#1771 should've clawed back most or all of that

Looks like the gains are restricted to the batch=false/versions=100/valueSize=8 case. Delta compared with Pebble master:

name                                                    old time/op    new time/op    delta
MVCCGet_Pebble/batch=false/versions=1/valueSize=8-10      3.90µs ± 5%    3.84µs ± 4%    ~     (p=0.062 n=20+19)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8-10     4.62µs ± 4%    4.65µs ± 4%    ~     (p=0.376 n=19+20)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8-10    10.3µs ± 6%     9.5µs ±10%  -8.36%  (p=0.000 n=20+20)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8-10       2.19µs ± 5%    2.21µs ± 5%    ~     (p=0.172 n=20+20)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8-10      3.05µs ± 7%    3.08µs ± 4%    ~     (p=0.560 n=20+20)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8-10     7.21µs ± 4%    7.28µs ± 7%    ~     (p=0.525 n=20+20)

@jbowens
Collaborator

jbowens commented Jun 22, 2022

I tried running some benchmarks with this branch with Pebble at cockroachdb/pebble@20e506c and compared it to its parent with the same Pebble SHA:

name                                                     old time/op    new time/op    delta
MVCCGet_Pebble/batch=false/versions=1/valueSize=8-24       5.75µs ± 1%    6.32µs ± 1%    +9.78%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8-24      7.27µs ± 1%    7.88µs ± 1%    +8.45%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8-24     18.0µs ± 6%    19.1µs ± 5%    +5.93%  (p=0.001 n=10+10)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8-24        3.59µs ± 1%    3.75µs ± 2%    +4.48%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8-24       5.26µs ± 1%    5.54µs ± 1%    +5.29%  (p=0.000 n=10+9)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8-24      14.2µs ± 4%    14.7µs ± 7%    +4.10%  (p=0.001 n=9+10)
MVCCScan_Pebble/rows=1/versions=1/valueSize=64-24          5.84µs ± 2%    7.59µs ± 3%   +29.90%  (p=0.000 n=9+10)
MVCCScan_Pebble/rows=1/versions=100/valueSize=64-24        17.6µs ± 1%    17.1µs ± 8%      ~     (p=0.138 n=10+10)
MVCCScan_Pebble/rows=100/versions=1/valueSize=64-24        51.4µs ± 2%   118.6µs ± 1%  +130.83%  (p=0.000 n=10+10)
MVCCScan_Pebble/rows=100/versions=100/valueSize=64-24       451µs ± 1%     485µs ± 1%    +7.35%  (p=0.000 n=8+10)
MVCCScan_Pebble/rows=10000/versions=1/valueSize=64-24      3.30ms ± 4%    9.80ms ± 1%  +196.64%  (p=0.000 n=10+9)
MVCCScan_Pebble/rows=10000/versions=100/valueSize=64-24    40.6ms ± 6%    43.1ms ± 7%    +6.40%  (p=0.002 n=10+10)

The get regression is inching downwards, but the scan numbers are much, much worse than previous numbers.

Diffing profiles for BenchmarkMVCCScan_Pebble/rows=10000/versions=1/valueSize=64:

      flat  flat%   sum%        cum   cum%
    1100ms  5.03%  5.03%     2360ms 10.80%  github.com/cockroachdb/cockroach/pkg/storage.EngineKeyCompare
    1080ms  4.94%  9.97%    10560ms 48.31%  github.com/cockroachdb/pebble.(*Iterator).SeekGEWithLimit
     910ms  4.16% 14.14%      910ms  4.16%  cmpbody
    -720ms  3.29% 10.84%    -1310ms  5.99%  github.com/cockroachdb/cockroach/pkg/storage.decodeExtendedMVCCValue
     650ms  2.97% 13.82%    15800ms 72.28%  github.com/cockroachdb/cockroach/pkg/storage.(*intentInterleavingIter).SeekGE
    -610ms  2.79% 11.02%    -1870ms  8.55%  github.com/cockroachdb/cockroach/pkg/storage.(*pebbleMVCCScanner).decodeCurrentValueExtended
    -580ms  2.65%  8.37%    -2260ms 10.34%  github.com/cockroachdb/pebble.(*mergingIter).nextEntry
    -440ms  2.01%  6.36%     -470ms  2.15%  github.com/cockroachdb/cockroach/pkg/storage.(*pebbleMVCCScanner).updateCurrent
     420ms  1.92%  8.28%     1010ms  4.62%  github.com/cockroachdb/cockroach/pkg/util/encoding.encodeBytesAscendingWithoutTerminatorOrPrefix

It appears that we're performing more seeks than previously, because the CPU time is elevated throughout (*pebble.Iterator).SeekGEWithLimit.

@erikgrinaker
Contributor Author

erikgrinaker commented Jun 24, 2022

I tried running some benchmarks with this branch with Pebble at cockroachdb/pebble@20e506c and compared it to its parent with the same Pebble SHA:

These results seem off to me. I tried these myself after rebasing this branch onto master (which is at cockroachdb/pebble@20e506c), and the results were consistent with my previous results, although it does seem like we've shaved off a few percentage points (🎉):

name                                                    old time/op    new time/op    delta
MVCCScan_Pebble/rows=1/versions=1/valueSize=64-24         4.65µs ± 1%    5.24µs ± 1%  +12.54%  (p=0.000 n=10+9)
MVCCScan_Pebble/rows=100/versions=1/valueSize=64-24       34.5µs ± 1%    39.6µs ± 1%  +14.93%  (p=0.000 n=10+10)
MVCCScan_Pebble/rows=10000/versions=1/valueSize=64-24     2.52ms ± 1%    3.01ms ± 1%  +19.72%  (p=0.000 n=10+9)
MVCCGet_Pebble/batch=false/versions=1/valueSize=8-24      4.30µs ± 1%    4.91µs ± 1%  +14.08%  (p=0.000 n=9+9)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8-24     5.23µs ± 1%    5.94µs ± 2%  +13.45%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8-24    13.0µs ± 5%    13.9µs ± 4%   +7.10%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8-24       2.62µs ± 1%    2.89µs ± 1%  +10.27%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8-24      3.70µs ± 1%    4.06µs ± 1%   +9.61%  (p=0.000 n=9+10)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8-24     10.0µs ± 3%    10.3µs ± 5%   +2.97%  (p=0.017 n=10+10)

The diff isn't showing any seek changes either:

      flat  flat%   sum%        cum   cum%
     -70ms  2.83%  2.83%      -70ms  2.83%  runtime.madvise
      70ms  2.83%     0%       70ms  2.83%  runtime.pageIndexOf
     -50ms  2.02%  2.02%      -40ms  1.62%  github.com/cockroachdb/cockroach/pkg/storage/enginepb.ScanDecodeKeyValue
     -50ms  2.02%  4.05%      -50ms  2.02%  github.com/cockroachdb/pebble.(*mergingIter).findNextEntry
     -40ms  1.62%  5.67%      -40ms  1.62%  github.com/cockroachdb/pebble.(*Iterator).setRangeKey
      40ms  1.62%  4.05%       50ms  2.02%  github.com/cockroachdb/pebble/sstable.(*blockIter).readEntry
     -40ms  1.62%  5.67%      -40ms  1.62%  runtime.(*lfstack).push
      40ms  1.62%  4.05%       40ms  1.62%  runtime.futex
     -30ms  1.21%  5.26%       10ms   0.4%  github.com/cockroachdb/cockroach/pkg/storage.(*pebbleMVCCScanner).addAndAdvance
      30ms  1.21%  4.05%       90ms  3.64%  github.com/cockroachdb/cockroach/pkg/storage.(*pebbleMVCCScanner).iterNext

I've found benchmarking with Bazel to be highly unreliable, since these benchmarks pre-create a dataset which is then stored inside some sort of Bazel working directory that changes with the build. Running the IDE and other software will also interfere with benchmarks. I've resorted to using `make bench` on a gceworker to get consistent results, e.g.:

$ make bench PKG=./pkg/storage BENCHES='^BenchmarkMVCC(Get|Scan)_Pebble$' TESTFLAGS="-v -count 10" | grep BenchmarkMVCC >bench.txt
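
For reference, the before/after tables in this thread are benchstat output (golang.org/x/perf/cmd/benchstat): write the baseline run to old.txt and the patched run to new.txt with the command above, then compare:

$ benchstat old.txt new.txt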

@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-scan branch 2 times, most recently from f915137 to 76c4b76 on June 25, 2022 18:23
@erikgrinaker
Contributor Author

erikgrinaker commented Jun 25, 2022

I've updated the PR to only initialize a pointSynthesizingIter when we encounter a range key. This gave a pretty nice improvement in the no-range-key case, so we're now at about a 10% regression overall. I think that's getting close enough to merge, but I'll run some kv95 benchmarks tomorrow to check the end-to-end impact, and I need to look into a couple of test anomalies too.

name                                                    old time/op    new time/op    delta
MVCCScan_Pebble/rows=1/versions=1/valueSize=64-24         4.61µs ± 1%    5.16µs ± 2%  +12.02%  (p=0.000 n=9+10)
MVCCScan_Pebble/rows=100/versions=1/valueSize=64-24       34.4µs ± 1%    38.1µs ± 1%  +10.78%  (p=0.000 n=10+10)
MVCCScan_Pebble/rows=10000/versions=1/valueSize=64-24     2.50ms ± 1%    2.85ms ± 1%  +14.02%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=false/versions=1/valueSize=8-24      4.40µs ± 2%    4.86µs ± 2%  +10.52%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8-24     5.30µs ± 1%    5.80µs ± 1%   +9.39%  (p=0.000 n=10+9)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8-24    12.9µs ± 4%    13.8µs ± 6%   +7.07%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8-24       2.64µs ± 1%    2.83µs ± 2%   +7.24%  (p=0.000 n=10+10)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8-24      3.75µs ± 1%    3.96µs ± 1%   +5.59%  (p=0.000 n=9+10)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8-24     9.91µs ± 1%   10.34µs ± 5%   +4.27%  (p=0.000 n=8+10)

An MVCC scan in `TestMVCCHistories` would show an incorrect key for the
intents, using the scan start key rather than the intent key.
Furthermore, intents are listed before the scan results, but this was
not made clear by the formatting, which could cause readers to believe
they were emitted in an incorrect order.

Release note: None
@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-scan branch from 76c4b76 to aa7c9bb on June 26, 2022 10:09
@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-scan branch from aa7c9bb to 6733f95 on June 26, 2022 10:40
@erikgrinaker
Contributor Author

erikgrinaker commented Jun 26, 2022

Ran a couple of quick kv95/enc=false/nodes=3/cpu=32 benchmarks; we're now down to a ~1.5% regression. I expect we'll see a larger regression on some benchmarks, but this is good enough to merge for now. Thanks for all your help with this so far, @jbowens! 🎉

This PR:

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  180.0s        0       20437450       113540.9      1.5      1.1      3.9     10.0    167.8  read

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  180.0s        0        1075955         5977.5      4.1      3.5      8.9     14.7    130.0  write

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
  180.0s        0       21513405       119518.4      1.6      1.1      4.7     10.5    167.8  

master:

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  180.0s        0       20758955       115327.0      1.5      1.1      3.9     10.0    218.1  read

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  180.0s        0        1091871         6065.9      3.7      3.1      8.1     14.2    130.0  write

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
  180.0s        0       21850826       121392.9      1.6      1.1      4.7     10.0    218.1  

@erikgrinaker
Contributor Author

TFTR!

bors r=jbowens

@craig
Contributor

craig bot commented Jun 26, 2022

Build failed:

@erikgrinaker
Contributor Author

bors retry

@craig
Contributor

craig bot commented Jun 26, 2022

Build failed:

@erikgrinaker
Contributor Author

Groan. Third time's the charm.

bors retry

@craig
Contributor

craig bot commented Jun 26, 2022

Build failed:

@erikgrinaker
Contributor Author

This is getting ridiculous.

bors retry

@craig
Contributor

craig bot commented Jun 26, 2022

Build succeeded:

@craig craig bot merged commit 460ce6a into cockroachdb:master Jun 26, 2022
@erikgrinaker erikgrinaker deleted the mvcc-range-tombstones-scan branch June 26, 2022 17:16