bench: tpccbench regression between Feb 12th and 13th #62078
@sumeerbhola looking at the code changes between these two days, the big ones that stand out to me are the lockTable changes (cee3bcf and 59bda0d). In the first commit, we began reading and writing to the separated lock table. In the second commit, we turned off writing to the separated lock table, but continued reading from it. I know you've looked at various performance profiles and landed a few optimizations over the past few weeks, but I'm not sure whether you've made a ruling on whether these changes were or were not responsible for this top-level regression. Would you be able to help make a call on that? Can we temporarily disable reading from the separated lock table and compare that to what we're seeing each night?
(for other cockroach labs folks) the previous discussion and profile is in https://cockroachlabs.slack.com/archives/C4X2J0RH6/p1615478361020000?thread_ts=1615397803.018300&cid=C4X2J0RH6
Yes, I can try that.
The high latency develops about 3min into the run, after the 5min ramp-up is done.
After:
A comparable run with 2700 warehouses with the fake iter:
(ignore the different elapsed -- I was fiddling with it) Note the lower latency with the fake iter, even though the efficiency numbers are comparable. Perhaps what is happening is that we are too close to the edge at 83% utilization (where one node got to 88%), so we could tip over into badness. Here is a screenshot of transaction restarts with master:
My next step will be to see if there is any scope for further optimizing

And I'm skeptical about the value of the tpccbench warehouse count metric, since probing so close to the edge means small changes in cpu can have a disproportionate impact on the warehouse count.
The roachperf dashboard now shows a rise to 2815. Anyone know what changed there?
It's interesting that the same benchmark on GCE just stopped working after Feb 14: https://roachperf.crdb.dev/?filter=&view=tpccbench%2Fnodes%3D3%2Fcpu%3D16&tab=gce Is this related to the TPCC VM overload issues you've been seeing @irfansharif?
Yup, exactly. Looks like #62039 is still not good enough (#62145). I'll try to resuscitate this test after I'm through other tracing-related optimizations (#62118). I haven't tried it, but I bet
Sumeer, if you're able to find another benchmark that doesn't go so "close to the edge" like TPC-C does, one that also shows a regression between Feb 12th-13th, that might be a path of lesser resistance.
I noticed there are 3316 replicas per node. I wonder if reducing the number of replicas could result in more batching.
I found one thing. SeekPrefixGE, when it fails to match the bloom filter, can needlessly skip to the next file, which is a wasted cost and will defeat the seek-avoidance optimization when a batch needs to do another SeekPrefixGE. One can see it in the seekEmptyFileForward in the following profile -- this happens mainly for SeekEngineKeyGE, since that is used for the locks and usually there isn't any lock. This seems to be almost half of the 5% we pay in SeekEngineKeyGE.
When SeekPrefixGE on the underlying file returns false due to a bloom filter non-match, levelIter would skip to the next file. This is wasteful if the upper bound of the file is beyond the prefix. Additionally, it defeats the optimization for sparse key spaces like CockroachDB's lock table, where we try to reuse the current position of the iterator -- by skipping to the next file the subsequent SeekPrefixGE will again need to reload the previous file. This behavior was first noticed when diagnosing tpcc slowness in CockroachDB, where almost half the overhead of seeking in the lock table could be attributed to this (see cockroachdb/cockroach#62078 for details). The benchmark numbers for bloom=true/with-tombstone=false are the ones intended to benefit from this change.

```
name                                                                        old time/op  new time/op  delta
IteratorSeqSeekPrefixGENotFound/skip=1/bloom=false/with-tombstone=false-16  441ns ± 9%   445ns ± 7%     ~      (p=0.332 n=19+20)
IteratorSeqSeekPrefixGENotFound/skip=1/bloom=false/with-tombstone=true-16   299ns ± 3%   300ns ± 3%     ~      (p=0.455 n=20+20)
IteratorSeqSeekPrefixGENotFound/skip=1/bloom=true/with-tombstone=false-16   3.73µs ± 8%  0.82µs ± 2%  -78.02%  (p=0.000 n=20+16)
IteratorSeqSeekPrefixGENotFound/skip=1/bloom=true/with-tombstone=true-16    1.78µs ±73%  1.21µs ± 7%  -32.15%  (p=0.000 n=20+20)
IteratorSeqSeekPrefixGENotFound/skip=2/bloom=false/with-tombstone=false-16  484ns ±27%   427ns ± 2%   -11.83%  (p=0.000 n=19+19)
IteratorSeqSeekPrefixGENotFound/skip=2/bloom=false/with-tombstone=true-16   320ns ± 7%   300ns ± 3%    -6.11%  (p=0.000 n=16+19)
IteratorSeqSeekPrefixGENotFound/skip=2/bloom=true/with-tombstone=false-16   5.07µs ±41%  0.82µs ± 2%  -83.84%  (p=0.000 n=20+18)
IteratorSeqSeekPrefixGENotFound/skip=2/bloom=true/with-tombstone=true-16    1.76µs ±37%  1.21µs ± 9%  -30.92%  (p=0.000 n=20+20)
IteratorSeqSeekPrefixGENotFound/skip=4/bloom=false/with-tombstone=false-16  439ns ± 4%   436ns ± 6%     ~      (p=0.109 n=20+20)
IteratorSeqSeekPrefixGENotFound/skip=4/bloom=false/with-tombstone=true-16   435ns ±29%   307ns ± 5%   -29.40%  (p=0.000 n=20+19)
IteratorSeqSeekPrefixGENotFound/skip=4/bloom=true/with-tombstone=false-16   5.63µs ±19%  0.82µs ± 2%  -85.40%  (p=0.000 n=20+19)
IteratorSeqSeekPrefixGENotFound/skip=4/bloom=true/with-tombstone=true-16    1.87µs ±36%  1.24µs ± 8%  -33.66%  (p=0.000 n=20+20)

name                                                                        old alloc/op  new alloc/op  delta
IteratorSeqSeekPrefixGENotFound/skip=1/bloom=false/with-tombstone=false-16  0.00B         0.00B         ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=1/bloom=false/with-tombstone=true-16   0.00B         0.00B         ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=1/bloom=true/with-tombstone=false-16   0.00B         0.00B         ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=1/bloom=true/with-tombstone=true-16    271B ± 0%     271B ± 0%     ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=2/bloom=false/with-tombstone=false-16  0.00B         0.00B         ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=2/bloom=false/with-tombstone=true-16   0.00B         0.00B         ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=2/bloom=true/with-tombstone=false-16   0.00B         0.00B         ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=2/bloom=true/with-tombstone=true-16    271B ± 0%     271B ± 0%     ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=4/bloom=false/with-tombstone=false-16  0.00B         0.00B         ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=4/bloom=false/with-tombstone=true-16   0.00B         0.00B         ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=4/bloom=true/with-tombstone=false-16   0.00B         0.00B         ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=4/bloom=true/with-tombstone=true-16    271B ± 0%     271B ± 0%     ~  (all equal)

name                                                                        old allocs/op  new allocs/op  delta
IteratorSeqSeekPrefixGENotFound/skip=1/bloom=false/with-tombstone=false-16  0.00           0.00           ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=1/bloom=false/with-tombstone=true-16   0.00           0.00           ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=1/bloom=true/with-tombstone=false-16   0.00           0.00           ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=1/bloom=true/with-tombstone=true-16    1.00 ± 0%      1.00 ± 0%      ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=2/bloom=false/with-tombstone=false-16  0.00           0.00           ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=2/bloom=false/with-tombstone=true-16   0.00           0.00           ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=2/bloom=true/with-tombstone=false-16   0.00           0.00           ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=2/bloom=true/with-tombstone=true-16    1.00 ± 0%      1.00 ± 0%      ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=4/bloom=false/with-tombstone=false-16  0.00           0.00           ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=4/bloom=false/with-tombstone=true-16   0.00           0.00           ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=4/bloom=true/with-tombstone=false-16   0.00           0.00           ~  (all equal)
IteratorSeqSeekPrefixGENotFound/skip=4/bloom=true/with-tombstone=true-16    1.00 ± 0%      1.00 ± 0%      ~  (all equal)
```
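For intuition, here is a rough Go sketch of the corrected levelIter behavior described in that commit message. The types and names are simplified and hypothetical, not the actual Pebble code; the point is only the control flow around a bloom filter non-match:

```go
package main

import "bytes"

// Simplified, hypothetical stand-ins for Pebble's internals.
type sstFile struct {
	upperBound []byte                   // largest key in the file
	mayContain func(prefix []byte) bool // bloom filter check
	seekGE     func(key []byte) bool    // positions within the file
}

type levelIter struct {
	files []*sstFile
	idx   int
}

// seekPrefixGE sketches the fix: on a bloom-filter non-match, advance to the
// next file only when the prefix lies beyond the current file's upper bound.
// Otherwise report "not found" without moving, so a subsequent SeekPrefixGE
// can reuse the already-loaded file instead of reloading it.
func (l *levelIter) seekPrefixGE(prefix, key []byte) bool {
	f := l.files[l.idx]
	if f.mayContain(prefix) {
		return f.seekGE(key)
	}
	if bytes.Compare(prefix, f.upperBound) <= 0 {
		// The prefix is covered by this file's bounds; skipping ahead
		// would be wasted work and would defeat position reuse.
		return false
	}
	l.idx++ // the prefix is past this file, so moving on is genuinely needed
	return l.idx < len(l.files) && l.files[l.idx].seekGE(key)
}

func main() {}
```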
The 2700 warehouse run with this SeekPrefixGE improvement (cockroachdb/pebble#1091) looks much better in terms of latency.
The cpu utilization is lower and the relative cost of SeekEngineKeyGE (compared to the profile screenshot in the previous comment) is reduced.
@nvanbenschoten @petermattis Thoughts?
Thank you for digging into this, Sumeer. Your analysis throughout this issue has been awe-inspiring. If landing cockroachdb/pebble#1091 in this release significantly reduces the cost we pay to scan the lock table and helps close the gap on TPC-C, and we do intend to keep the empty lock table scan, then I'd like to see it get in. This will set us up well for the release and for furthering the lock table project in the next release. That said, while I've followed along with your changes in Pebble, I'm not the right person to gauge their potential risk.
FYI, I did a few
Seems like we took a ~4% hit with separated intents enabled. Disabling them again didn't make much difference, and the Pebble fix only recovered about 1/3 of the drop.
@angelapwen and @stevendanna are going to start a targeted bisection into this, notes (internal) here |
Before we resolve this issue, I think we need to explicitly state:
Looking at other layers to get offsetting improvements doesn't feel satisfying to me. That's because I was hoping/expecting to do those improvements (like #57223) to improve our performance, not just to claw our way back to where we were before. The perf drop we see in this issue will put a permanent "cap" on our max performance, and we should make sure all the right stakeholders have agreed it's the right thing to do (have the PMs weighed in?).
For tpccbench/nodes=3/cpu=16, it looks like relative to the 20.2 release we went from ~1620 to ~2100 warehouses (gce), and ~2500 to ~2800 (aws), so we didn't actually go down at all. (cc @erikgrinaker to make sure I'm not saying things that are wrong for release-21.1) The regression we last discussed is on kv95, which is essentially a contrived point-selects-only workload that is very sensitive. I would argue it isn't overly representative of anything realistic users might do, especially not in the regime at which these numbers are recorded (near full system utilization). I was planning to bring this up in the next release-triage meeting on Monday, so I'll keep this issue open to make it pop up on the spreadsheet. I'm not sure which PM you would consult with or how they could make any kind of informed determination here, but feel free to rope someone in.
I think this is because the older TPCC benchmarks were capped at lower warehouse counts. We've had a hard time getting a good signal from TPCC runs, as individual runs vary quite a lot. If necessary we can do a full set of comparison runs, but it'll be fairly time-consuming.
I agree, kv95 is extremely narrow, and basically the worst case for measuring overhead in the query path. I'll do a suite of all six YCSB workloads for 20.2, 21.1, and 21.1 without separated intents, which should give us a more varied picture.
If this is necessary, we have to nominate someone else to do it. This group has done more than its fair share of the work, and we need to get back to actually fixing issues.
Results from a few YCSB and kv95 benchmarks on GCE, doing 5 runs each and taking the median. Commit hashes:
Details of YCSB workloads here: https://github.com/brianfrankcooper/YCSB/wiki/Core-Workloads

Numbers are cumulative ops/s. The first delta is the regression from 20.2 to 21.1, the second delta is the improvement by disabling the lock table -- all relative to the 20.2 baseline.
There is a bit of uncertainty here, e.g.

Even so, it's clear that we currently have a regression of about 5% or more. Looks like lock table reads make up about a third of this, and the rest is likely due to vectorized execution and tracing, as well as other minor regressions.

The roachperf graphs of performance over time may also be of interest: https://roachperf.crdb.dev/
We need to run the

Is there anything to do to get a good determination of where we stand on tpcc vs 20.2? Tobi's comment (#62078 (comment)) suggests that tpcc performance might have improved, but we don't have a good stable signal from tpccbench. Is anyone concerned enough to advocate putting in the elbow grease to get a proper tpcc comparison? Or will we be satisfied with

PS: Rather than taking the median of 5 runs, in the past I've done 10 runs on each workload and then transformed the result output into go-bench format using the following script:

```bash
#!/bin/bash

if [ "$#" -ne 1 ]; then
  echo "usage: $0 <artifacts-dir>"
  exit 1
fi

for file in $(find $1 -name test.log -o -name '*ycsb.log' -o -name '*kv.log'); do
  name=$(dirname $(dirname $(realpath --relative-to=$1 $file)))
  grep -h -A1 __result $file \
    | grep -v '^--$' | grep -v __result | \
    awk "{printf \"Benchmark$name 1 %s ops/sec %s p50 %s p95 %s p99\n\", \$4, \$6, \$7, \$8}"
done
```

You can then use
These numbers were taken after that was backported (#62676). The YCSB numbers used to be far worse.
Anecdotally, we've seen numbers that fit well with the kv95 and ycsb regression range (5-10%). Would be happy to apply some elbow grease if we think it's worth the effort.
Nice, would be useful to have this checked in somewhere.
You are correct about cockroachdb/pebble#1098, but there is also cockroachdb/pebble#1107. I'm not sure if that latter PR moves the needle on the CRDB-level benchmarks, though.
The tpcc numbers fit within that range?
I agree. Care to pick up this ball?
Right, ok. I can do a few quick runs and see if it moves the needle.
I didn't do any TPCC work myself, but I believe we'd seen numbers that had recovered to within 10% of the original baseline on individual runs. They're rather noisy though, so that may or may not be accurate.
Will do.
Did 5 runs of the following benchmarks at
The results are within the error margin of the previous 21.1 results, so it doesn't appear as if cockroachdb/pebble#1107 had any significant effect.
@erikgrinaker Did you use
Definitely notice that variation per run. The above is with the delta-test disabled. If you use the default
So
No, in order to compare with the previous numbers I figured I shouldn't change up the methodology for this run.
Thanks for doing another set of runs, that's awesome! Definitely like the
Absolutely. These numbers came on the tail end of a regression investigation that covered a five-month timeframe and a ~35% regression. We had to cover a lot of ground, and were looking for large deltas, so a 5-run median (to ignore outliers) seemed appropriate. I agree that once we're zooming in on more specific comparisons we need to be more rigorous; I suppose I was just used to doing things a certain way at that point. I am fairly confident that we're looking at a ~5% regression overall though (for some definition of overall), and the roachperf graphs give that impression as well.

Using averages for these sample sizes tends to skew results downwards, since outliers tend to be negative (it's much more likely that a cluster anomaly causes a large slowdown than a large speedup). Might be worth discarding the outer results in either direction.
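As a sketch of that last suggestion, a symmetric trimmed mean that discards the outer results could look like this in Go (the run numbers below are hypothetical, for illustration only):

```go
package main

import (
	"fmt"
	"sort"
)

// trimmedMean sorts the samples, drops k results from each end, and averages
// the rest. With k=1 on a 5-run sample this discards the single best and
// worst runs, limiting the pull of one-off cluster anomalies (which, as
// noted above, are usually slowdowns rather than speedups).
func trimmedMean(samples []float64, k int) float64 {
	if len(samples) <= 2*k {
		panic("not enough samples to trim")
	}
	s := append([]float64(nil), samples...)
	sort.Float64s(s)
	s = s[k : len(s)-k]
	var sum float64
	for _, v := range s {
		sum += v
	}
	return sum / float64(len(s))
}

func main() {
	// Hypothetical ops/sec from 5 runs, with one slow outlier.
	runs := []float64{30100, 29800, 30250, 21400, 29950}
	fmt.Printf("trimmed mean: %.0f ops/sec\n", trimmedMean(runs, 1))
}
```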
Ack. Makes complete sense to do fewer runs when there are large discrepancies that are being looked for. We're likely going to do this sort of comparison again in 6 months. Perhaps it is worthwhile to create a script or playbook to describe how the testing should be done.
I believe |
We started one, will amend it tomorrow.
Indeed, nifty!
We'll do this on a continuous basis going forward, though there are still details being ironed out on how that is organized.
This commit introduces object pooling for `pebbleReadOnly` allocation avoidance. I found this to be important both because it avoids the initial `pebbleReadOnly` allocation, but also because it allows the memory recycling inside of each `pebbleIterator` owned by a `pebbleReadOnly` to work correctly. I found this while running a few microbenchmarks and noticing that the lockTable's calls to `intentInterleavingReader` were resulting in a large number of heap allocations in `(*pebbleReadOnly).NewEngineIterator`. This was because `lowerBoundBuf` and `upperBoundBuf` were always nil and so each append (all 4 of them) in `(*pebbleIterator).init` was causing an allocation.

```
name                          old time/op    new time/op    delta
KV/Scan/Native/rows=1-16      30.9µs ± 4%    29.9µs ± 6%    -3.29%  (p=0.000 n=20+20)
KV/Scan/Native/rows=100-16    54.2µs ± 4%    52.7µs ± 5%    -2.84%  (p=0.002 n=20+20)
KV/Scan/Native/rows=10-16     34.0µs ± 3%    33.1µs ± 6%    -2.64%  (p=0.001 n=20+20)
KV/Scan/Native/rows=1000-16    253µs ± 5%     255µs ± 5%      ~     (p=0.659 n=20+20)
KV/Scan/Native/rows=10000-16  2.16ms ± 4%    2.14ms ± 3%      ~     (p=0.072 n=20+20)

name                          old alloc/op   new alloc/op   delta
KV/Scan/Native/rows=1-16      8.69kB ± 0%    7.49kB ± 0%   -13.79%  (p=0.000 n=20+19)
KV/Scan/Native/rows=10-16     10.1kB ± 0%     8.9kB ± 0%   -11.87%  (p=0.000 n=20+18)
KV/Scan/Native/rows=100-16    22.7kB ± 0%    21.5kB ± 0%    -5.29%  (p=0.000 n=17+19)
KV/Scan/Native/rows=1000-16    174kB ± 0%     172kB ± 0%    -0.66%  (p=0.000 n=19+19)
KV/Scan/Native/rows=10000-16  1.51MB ± 0%    1.51MB ± 0%    -0.05%  (p=0.000 n=16+19)

name                          old allocs/op  new allocs/op  delta
KV/Scan/Native/rows=1-16        71.0 ± 0%      62.0 ± 0%   -12.68%  (p=0.000 n=20+20)
KV/Scan/Native/rows=10-16       75.0 ± 0%      66.0 ± 0%   -12.00%  (p=0.000 n=20+19)
KV/Scan/Native/rows=100-16      79.0 ± 0%      70.0 ± 0%   -11.39%  (p=0.000 n=19+19)
KV/Scan/Native/rows=1000-16     87.8 ± 1%      79.0 ± 0%    -9.97%  (p=0.000 n=20+16)
KV/Scan/Native/rows=10000-16     113 ± 2%       103 ± 2%    -8.19%  (p=0.000 n=17+19)
```

We may want to consider this as a candidate to backport to release-21.1, because the lack of pooling here was even more detrimental with the separated lockTable, which creates a separate EngineIterator. So this may have a small impact on cockroachdb#62078.

Release note (performance improvement): A series of heap allocations performed when serving read-only queries are now avoided.
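For intuition, a minimal Go sketch of the pooling-plus-buffer-recycling pattern that commit describes (the names here are simplified and hypothetical, not the actual CockroachDB types):

```go
package main

import (
	"fmt"
	"sync"
)

// readOnly is a stand-in for a pooled reader. A recycled object keeps its
// bound buffers, so the appends that set iterator bounds reuse existing
// capacity instead of allocating on every use.
type readOnly struct {
	lowerBoundBuf []byte
	upperBoundBuf []byte
}

var readOnlyPool = sync.Pool{
	New: func() interface{} { return &readOnly{} },
}

// setBounds copies the bounds into recycled buffers. Appending to buf[:0]
// reuses the backing array once it has grown; starting from a nil buffer
// (no pooling) would allocate on every append.
func (r *readOnly) setBounds(lower, upper []byte) {
	r.lowerBoundBuf = append(r.lowerBoundBuf[:0], lower...)
	r.upperBoundBuf = append(r.upperBoundBuf[:0], upper...)
}

func main() {
	r := readOnlyPool.Get().(*readOnly)
	r.setBounds([]byte("a"), []byte("z"))
	fmt.Printf("bounds: %s..%s\n", r.lowerBoundBuf, r.upperBoundBuf)
	readOnlyPool.Put(r) // return for reuse; buffers keep their capacity
}
```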
We discussed this yesterday in the release triage meeting and decided that we would eat the hit, as it occurs on synthetic workloads that are unlikely to be representative of real workloads. We were also reluctant to pull either separated intents or vectorized execution, and noted that there are caching improvements coming in the 21.2 cycle that will more than make up for the lost ground.
Ref: https://roachperf.crdb.dev/?filter=&view=tpccbench%2Fnodes%3D3%2Fcpu%3D16&tab=aws
Between Feb 12th and Feb 13th, we saw a drop in max throughput of 16% on tpccbench running on AWS. We should determine the cause of this and resolve it.

The SHA before the drop was e9e3721 and after the drop was ba1a144.