bench: tpccbench regression between Feb 12th and 13th #62078
@sumeerbhola looking at the code changes between these two days, the big ones that stand out to me are the lockTable changes (cee3bcf and 59bda0d). In the first commit, we began reading and writing to the separated lock table. In the second commit, we turned off writing to the separated lock table, but continued reading from it. I know you've looked at various performance profiles and landed a few optimizations over the past few weeks, but I'm not sure whether you've made a ruling on whether these changes were or were not responsible for this top-level regression. Would you be able to help make a call on that? Can we temporarily disable reading from the separated lock table and compare that to what we're seeing each night?
(for other cockroach labs folks) the previous discussion and profile is in https://cockroachlabs.slack.com/archives/C4X2J0RH6/p1615478361020000?thread_ts=1615397803.018300&cid=C4X2J0RH6
Yes, I can try that.
The high latency develops about 3min into the run, after the 5min ramp-up is done.
After:
A comparable run with 2700 warehouses with the fake iter:
(ignore the different elapsed -- I was fiddling with it) Note the lower latency with the fake iter, even though the efficiency numbers are comparable. Perhaps what is happening is that we are too close to the edge at 83% utilization (where one node got to 88%), so we could tip over into badness. Here is a screenshot of transaction restarts with master:
My next step will be to see if there is any scope for further optimizing

And I'm skeptical about the value of the tpccbench warehouse count metric, since probing so close to the edge means small changes in cpu can have a disproportionate impact on the warehouse count.
The roachperf dashboard now shows a rise to 2815. Anyone know what changed there?
It's interesting that the same benchmark on GCE just stopped working after Feb 14: https://roachperf.crdb.dev/?filter=&view=tpccbench%2Fnodes%3D3%2Fcpu%3D16&tab=gce Is this related to the TPCC VM overload issues you've been seeing @irfansharif?
Yup, exactly. Looks like #62039 is still not good enough (#62145). I'll try to resuscitate this test after I'm through other tracing-related optimizations (#62118). I haven't tried it, but I bet
Sumeer, if you're able to find another benchmark that doesn't go so "close to the edge" like TPC-C does, one that also shows a regression between Feb 12th-13th, that might be a path of lesser resistance.
I noticed there are 3316 replicas per node. I wonder if reducing the number of replicas could result in more batching.
I found one thing. SeekPrefixGE, when it fails to match the bloom filter, can needlessly skip to the next file, which is a wasted cost and will defeat the seek-avoidance optimization when a batch needs to do another SeekPrefixGE. One can see it in the seekEmptyFileForward in the following profile -- this happens mainly for SeekEngineKeyGE, since that is used for the locks and usually there isn't any lock. This seems to be almost half of the 5% we pay in SeekEngineKeyGE.
When SeekPrefixGE on the underlying file returns false due to a bloom filter non-match, levelIter would skip to the next file. This is wasteful if the upper bound of the file is beyond the prefix. Additionally, it defeats the optimization for sparse key spaces like CockroachDB's lock table, where we try to reuse the current position of the iterator -- by skipping to the next file the subsequent SeekPrefixGE will again need to reload the previous file. This behavior was first noticed when diagnosing tpcc slowness in CockroachDB, where almost half the overhead of seeking in the lock table could be attributed to this (see cockroachdb/cockroach#62078 for details). The benchmark numbers for bloom=true/with-tombstone=false are the ones intended to benefit from this change.

```
name                                                                        old time/op  new time/op  delta
IteratorSeqSeekPrefixGENotFound/skip=1/bloom=false/with-tombstone=false-16  441ns ± 9%   445ns ± 7%     ~      (p=0.332 n=19+20)
IteratorSeqSeekPrefixGENotFound/skip=1/bloom=false/with-tombstone=true-16   299ns ± 3%   300ns ± 3%     ~      (p=0.455 n=20+20)
IteratorSeqSeekPrefixGENotFound/skip=1/bloom=true/with-tombstone=false-16   3.73µs ± 8%  0.82µs ± 2%  -78.02%  (p=0.000 n=20+16)
IteratorSeqSeekPrefixGENotFound/skip=1/bloom=true/with-tombstone=true-16    1.78µs ±73%  1.21µs ± 7%  -32.15%  (p=0.000 n=20+20)
IteratorSeqSeekPrefixGENotFound/skip=2/bloom=false/with-tombstone=false-16  484ns ±27%   427ns ± 2%   -11.83%  (p=0.000 n=19+19)
IteratorSeqSeekPrefixGENotFound/skip=2/bloom=false/with-tombstone=true-16   320ns ± 7%   300ns ± 3%    -6.11%  (p=0.000 n=16+19)
IteratorSeqSeekPrefixGENotFound/skip=2/bloom=true/with-tombstone=false-16   5.07µs ±41%  0.82µs ± 2%  -83.84%  (p=0.000 n=20+18)
IteratorSeqSeekPrefixGENotFound/skip=2/bloom=true/with-tombstone=true-16    1.76µs ±37%  1.21µs ± 9%  -30.92%  (p=0.000 n=20+20)
IteratorSeqSeekPrefixGENotFound/skip=4/bloom=false/with-tombstone=false-16  439ns ± 4%   436ns ± 6%     ~      (p=0.109 n=20+20)
IteratorSeqSeekPrefixGENotFound/skip=4/bloom=false/with-tombstone=true-16   435ns ±29%   307ns ± 5%   -29.40%  (p=0.000 n=20+19)
IteratorSeqSeekPrefixGENotFound/skip=4/bloom=true/with-tombstone=false-16   5.63µs ±19%  0.82µs ± 2%  -85.40%  (p=0.000 n=20+19)
IteratorSeqSeekPrefixGENotFound/skip=4/bloom=true/with-tombstone=true-16    1.87µs ±36%  1.24µs ± 8%  -33.66%  (p=0.000 n=20+20)

name                                                                        old alloc/op  new alloc/op  delta
IteratorSeqSeekPrefixGENotFound/skip=1/bloom=false/with-tombstone=false-16  0.00B         0.00B         ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=1/bloom=false/with-tombstone=true-16   0.00B         0.00B         ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=1/bloom=true/with-tombstone=false-16   0.00B         0.00B         ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=1/bloom=true/with-tombstone=true-16    271B ± 0%     271B ± 0%     ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=2/bloom=false/with-tombstone=false-16  0.00B         0.00B         ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=2/bloom=false/with-tombstone=true-16   0.00B         0.00B         ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=2/bloom=true/with-tombstone=false-16   0.00B         0.00B         ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=2/bloom=true/with-tombstone=true-16    271B ± 0%     271B ± 0%     ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=4/bloom=false/with-tombstone=false-16  0.00B         0.00B         ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=4/bloom=false/with-tombstone=true-16   0.00B         0.00B         ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=4/bloom=true/with-tombstone=false-16   0.00B         0.00B         ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=4/bloom=true/with-tombstone=true-16    271B ± 0%     271B ± 0%     ~  (all equal)

name                                                                        old allocs/op  new allocs/op  delta
IteratorSeqSeekPrefixGENotFound/skip=1/bloom=false/with-tombstone=false-16  0.00           0.00           ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=1/bloom=false/with-tombstone=true-16   0.00           0.00           ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=1/bloom=true/with-tombstone=false-16   0.00           0.00           ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=1/bloom=true/with-tombstone=true-16    1.00 ± 0%      1.00 ± 0%      ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=2/bloom=false/with-tombstone=false-16  0.00           0.00           ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=2/bloom=false/with-tombstone=true-16   0.00           0.00           ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=2/bloom=true/with-tombstone=false-16   0.00           0.00           ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=2/bloom=true/with-tombstone=true-16    1.00 ± 0%      1.00 ± 0%      ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=4/bloom=false/with-tombstone=false-16  0.00           0.00           ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=4/bloom=false/with-tombstone=true-16   0.00           0.00           ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=4/bloom=true/with-tombstone=false-16   0.00           0.00           ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=4/bloom=true/with-tombstone=true-16    1.00 ± 0%      1.00 ± 0%      ~  (all equal)
```
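For intuition, here is a rough Go sketch of the corrected levelIter behavior described in that commit message. The types and names are simplified and hypothetical, not the actual Pebble code; the point is only the control flow around a bloom filter non-match:

```go
package main

import "bytes"

// Simplified, hypothetical stand-ins for Pebble's internals.
type sstFile struct {
	upperBound []byte                   // largest key in the file
	mayContain func(prefix []byte) bool // bloom filter check
	seekGE     func(key []byte) bool    // positions within the file
}

type levelIter struct {
	files []*sstFile
	idx   int
}

// seekPrefixGE sketches the fix: on a bloom-filter non-match, advance to the
// next file only when the prefix lies beyond the current file's upper bound.
// Otherwise report "not found" without moving, so a subsequent SeekPrefixGE
// can reuse the already-loaded file instead of reloading it.
func (l *levelIter) seekPrefixGE(prefix, key []byte) bool {
	f := l.files[l.idx]
	if f.mayContain(prefix) {
		return f.seekGE(key)
	}
	if bytes.Compare(prefix, f.upperBound) <= 0 {
		// The prefix is covered by this file's bounds; skipping ahead
		// would be wasted work and would defeat position reuse.
		return false
	}
	l.idx++ // the prefix is past this file, so moving on is genuinely needed
	return l.idx < len(l.files) && l.files[l.idx].seekGE(key)
}

func main() {}
```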
The 2700 warehouse run with this SeekPrefixGE improvement (cockroachdb/pebble#1091) looks much better in terms of latency.
The cpu utilization is lower and the relative cost of SeekEngineKeyGE (compared to the profile screenshot in the previous comment) is reduced.
@nvanbenschoten @petermattis Thoughts?
Thank you for digging into this, Sumeer. Your analysis throughout this issue has been awe-inspiring. If landing cockroachdb/pebble#1091 in this release significantly reduces the cost we pay to scan the lock table and helps close the gap on TPC-C, and we do intend to keep the empty lock table scan, then I'd like to see it get in. This will set us up well for the release and for furthering the lock table project in the next release. That said, while I've followed along with your changes in Pebble, I'm not the right person to gauge their potential risk.
FYI, I did a few
Seems like we took a ~4% hit with separated intents enabled. Disabling them again didn't make much difference, and the Pebble fix only recovered about 1/3 of the drop.
@angelapwen and @stevendanna are going to start a targeted bisection into this, notes (internal) here |
Before we resolve this issue, I think we need to explicitly state:
Looking at other layers to get offsetting improvements doesn't feel satisfying to me. That's because I was hoping/expecting to do those improvements (like #57223) to improve our performance, not just to claw our way back to where we were before. The perf drop we see in this issue will put a permanent "cap" on our max performance, and we should make sure all the right stakeholders have agreed it's the right thing to do (have the PMs weighed in?).
For tpccbench/nodes=3/cpu=16, it looks like relative to the 20.2 release we went from ~1620 to ~2100 warehouses (gce), and ~2500 to ~2800 (aws), so we didn't actually go down at all. (cc @erikgrinaker to make sure I'm not saying things that are wrong for release-21.1) The regression we last discussed is on kv95, which is essentially a contrived point-selects-only workload that is very sensitive. I would argue it isn't overly representative of anything realistic users might do, especially not in the regime at which these numbers are recorded (near full system utilization). I was planning to bring this up in the next release-triage meeting on Monday, so I'll keep this issue open to make it pop up on the spreadsheet. I'm not sure which PM you would consult with or how they could make any kind of informed determination here, but feel free to rope someone in.
I think this is because the older TPCC benchmarks were capped at lower warehouse counts. We've had a hard time getting a good signal from TPCC runs, as individual runs vary quite a lot. If necessary we can do a full set of comparison runs, but it'll be fairly time-consuming.
I agree, kv95 is extremely narrow, and basically the worst case for measuring overhead in the query path. I'll do a suite of all six YCSB workloads for 20.2, 21.1, and 21.1 without separated intents, which should give us a more varied picture.
If this is necessary, we have to nominate someone else to do it. This group has done more than its fair share of the work, and we need to get back to actually fixing issues.
Results from a few YCSB and kv95 benchmarks on GCE, doing 5 runs each and taking the median. Commit hashes:
Details of YCSB workloads here: https://github.com/brianfrankcooper/YCSB/wiki/Core-Workloads

Numbers are cumulative ops/s. The first delta is the regression from 20.2 to 21.1, the second delta is the improvement by disabling the lock table -- all relative to the 20.2 baseline.
There is a bit of uncertainty here, e.g.

Even so, it's clear that we currently have a regression of about 5% or more. Looks like lock table reads make up about a third of this, and the rest is likely due to vectorized execution and tracing, as well as other minor regressions.

The roachperf graphs of performance over time may also be of interest: https://roachperf.crdb.dev/
We need to run the

Is there anything to do to get a good determination of where we stand on tpcc vs 20.2? Tobi's comment (#62078 (comment)) suggests that tpcc performance might have improved, but we don't have a good stable signal from tpccbench. Is anyone concerned enough to advocate putting in the elbow grease to get a proper tpcc comparison? Or will we be satisfied with

PS: Rather than taking the median of 5 runs, in the past I've done 10 runs on each workload and then transformed the result output into go-bench format using the following script:

```bash
#!/bin/bash

if [ "$#" -ne 1 ]; then
  echo "usage: $0 <artifacts-dir>"
  exit 1
fi

for file in $(find $1 -name test.log -o -name '*ycsb.log' -o -name '*kv.log'); do
  name=$(dirname $(dirname $(realpath --relative-to=$1 $file)))
  grep -h -A1 __result $file \
    | grep -v '^--$' | grep -v __result | \
    awk "{printf \"Benchmark$name 1 %s ops/sec %s p50 %s p95 %s p99\n\", \$4, \$6, \$7, \$8}"
done
```

You can then use
These numbers were taken after that was backported (#62676). The YCSB numbers used to be far worse.
Anecdotally, we've seen numbers that fit well with the kv95 and ycsb regression range (5-10%). Would be happy to apply some elbow grease if we think it's worth the effort.
Nice, would be useful to have this checked in somewhere.
You are correct about cockroachdb/pebble#1098, but there is also cockroachdb/pebble#1107. I'm not sure if that latter PR moves the needle on the CRDB-level benchmarks, though.
The tpcc numbers fit within that range?
I agree. Care to pick up this ball?
Right, ok. I can do a few quick runs and see if it moves the needle.
I didn't do any TPCC work myself, but I believe we'd seen numbers that had recovered to within 10% of the original baseline on individual runs. They're rather noisy though, so that may or may not be accurate.
Will do.
Did 5 runs of the following benchmarks at
The results are within the error margin of the previous 21.1 results, so it doesn't appear as if cockroachdb/pebble#1107 had any significant effect.
@erikgrinaker Did you use
Definitely notice that variation per run. The above is with the delta-test disabled. If you use the default
So
No, in order to compare with the previous numbers I figured I shouldn't change up the methodology for this run.
Thanks for doing another set of runs, that's awesome! Definitely like the
Absolutely. These numbers came on the tail end of a regression investigation that covered a five-month timeframe and a ~35% regression. We had to cover a lot of ground, and were looking for large deltas, so a 5-run median (to ignore outliers) seemed appropriate. I agree that once we're zooming in on more specific comparisons we need to be more rigorous; I suppose I was just used to doing things a certain way at that point. I am fairly confident that we're looking at a ~5% regression overall though (for some definition of overall), and the roachperf graphs give that impression as well.

Using averages for these sample sizes tends to skew results downwards, since outliers tend to be negative (it's much more likely that a cluster anomaly causes a large slowdown than a large speedup). Might be worth discarding the outer results in either direction.
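As a sketch of that last suggestion, a symmetric trimmed mean that discards the outer results could look like this in Go (the run numbers below are hypothetical, for illustration only):

```go
package main

import (
	"fmt"
	"sort"
)

// trimmedMean sorts the samples, drops k results from each end, and averages
// the rest. With k=1 on a 5-run sample this discards the single best and
// worst runs, limiting the pull of one-off cluster anomalies (which, as
// noted above, are usually slowdowns rather than speedups).
func trimmedMean(samples []float64, k int) float64 {
	if len(samples) <= 2*k {
		panic("not enough samples to trim")
	}
	s := append([]float64(nil), samples...)
	sort.Float64s(s)
	s = s[k : len(s)-k]
	var sum float64
	for _, v := range s {
		sum += v
	}
	return sum / float64(len(s))
}

func main() {
	// Hypothetical ops/sec from 5 runs, with one slow outlier.
	runs := []float64{30100, 29800, 30250, 21400, 29950}
	fmt.Printf("trimmed mean: %.0f ops/sec\n", trimmedMean(runs, 1))
}
```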
Ack. Makes complete sense to do fewer runs when there are large discrepancies that are being looked for. We're likely going to do this sort of comparison again in 6 months. Perhaps it is worthwhile to create a script or playbook to describe how the testing should be done.
I believe |
We started one, will amend it tomorrow.
Indeed, nifty!
We'll do this on a continuous basis going forward, though there are still details being ironed out on how that is organized.
This commit introduces object pooling for `pebbleReadOnly` allocation avoidance. I found this to be important both because it avoids the initial `pebbleReadOnly` allocation, but also because it allows the memory recycling inside of each `pebbleIterator` owned by a `pebbleReadOnly` to work correctly. I found this while running a few microbenchmarks and noticing that the lockTable's calls to `intentInterleavingReader` were resulting in a large number of heap allocations in `(*pebbleReadOnly).NewEngineIterator`. This was because `lowerBoundBuf` and `upperBoundBuf` were always nil and so each append (all 4 of them) in `(*pebbleIterator).init` was causing an allocation.

```
name                          old time/op    new time/op    delta
KV/Scan/Native/rows=1-16      30.9µs ± 4%    29.9µs ± 6%    -3.29%  (p=0.000 n=20+20)
KV/Scan/Native/rows=100-16    54.2µs ± 4%    52.7µs ± 5%    -2.84%  (p=0.002 n=20+20)
KV/Scan/Native/rows=10-16     34.0µs ± 3%    33.1µs ± 6%    -2.64%  (p=0.001 n=20+20)
KV/Scan/Native/rows=1000-16    253µs ± 5%     255µs ± 5%      ~     (p=0.659 n=20+20)
KV/Scan/Native/rows=10000-16  2.16ms ± 4%    2.14ms ± 3%      ~     (p=0.072 n=20+20)

name                          old alloc/op   new alloc/op   delta
KV/Scan/Native/rows=1-16      8.69kB ± 0%    7.49kB ± 0%   -13.79%  (p=0.000 n=20+19)
KV/Scan/Native/rows=10-16     10.1kB ± 0%     8.9kB ± 0%   -11.87%  (p=0.000 n=20+18)
KV/Scan/Native/rows=100-16    22.7kB ± 0%    21.5kB ± 0%    -5.29%  (p=0.000 n=17+19)
KV/Scan/Native/rows=1000-16    174kB ± 0%     172kB ± 0%    -0.66%  (p=0.000 n=19+19)
KV/Scan/Native/rows=10000-16  1.51MB ± 0%    1.51MB ± 0%    -0.05%  (p=0.000 n=16+19)

name                          old allocs/op  new allocs/op  delta
KV/Scan/Native/rows=1-16        71.0 ± 0%      62.0 ± 0%   -12.68%  (p=0.000 n=20+20)
KV/Scan/Native/rows=10-16       75.0 ± 0%      66.0 ± 0%   -12.00%  (p=0.000 n=20+19)
KV/Scan/Native/rows=100-16      79.0 ± 0%      70.0 ± 0%   -11.39%  (p=0.000 n=19+19)
KV/Scan/Native/rows=1000-16     87.8 ± 1%      79.0 ± 0%    -9.97%  (p=0.000 n=20+16)
KV/Scan/Native/rows=10000-16     113 ± 2%       103 ± 2%    -8.19%  (p=0.000 n=17+19)
```

We may want to consider this as a candidate to backport to release-21.1, because the lack of pooling here was even more detrimental with the separated lockTable, which creates a separate EngineIterator. So this may have a small impact on cockroachdb#62078.

Release note (performance improvement): A series of heap allocations performed when serving read-only queries are now avoided.
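For intuition, a minimal Go sketch of the pooling-plus-buffer-recycling pattern that commit describes (the names here are simplified and hypothetical, not the actual CockroachDB types):

```go
package main

import (
	"fmt"
	"sync"
)

// readOnly is a stand-in for a pooled reader. A recycled object keeps its
// bound buffers, so the appends that set iterator bounds reuse existing
// capacity instead of allocating on every use.
type readOnly struct {
	lowerBoundBuf []byte
	upperBoundBuf []byte
}

var readOnlyPool = sync.Pool{
	New: func() interface{} { return &readOnly{} },
}

// setBounds copies the bounds into recycled buffers. Appending to buf[:0]
// reuses the backing array once it has grown; starting from a nil buffer
// (no pooling) would allocate on every append.
func (r *readOnly) setBounds(lower, upper []byte) {
	r.lowerBoundBuf = append(r.lowerBoundBuf[:0], lower...)
	r.upperBoundBuf = append(r.upperBoundBuf[:0], upper...)
}

func main() {
	r := readOnlyPool.Get().(*readOnly)
	r.setBounds([]byte("a"), []byte("z"))
	fmt.Printf("bounds: %s..%s\n", r.lowerBoundBuf, r.upperBoundBuf)
	readOnlyPool.Put(r) // return for reuse; buffers keep their capacity
}
```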
We discussed this yesterday in the release triage meeting and decided that we would eat the hit, as it occurs on synthetic workloads that are unlikely to be representative of real workloads. We were also reluctant to pull either separated intents or vectorized execution, and noted that there are caching improvements coming in the 21.2 cycle that will more than make up for the lost ground.
Ref: https://roachperf.crdb.dev/?filter=&view=tpccbench%2Fnodes%3D3%2Fcpu%3D16&tab=aws
Between Feb 12th and Feb 13th, we saw a drop in max throughput of 16% on tpccbench running on AWS. We should determine the cause of this and resolve it.

The SHA before the drop was e9e3721 and after the drop was ba1a144.