-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
release: 21.1 microbenchmark regression investigation and sign-off #62212
Comments
…he fields in protos This commit removes the `XXX_NoUnkeyedLiteral`, `XXX_unrecognized`, and `XXX_sizecache` auto-generated fields from generated protobuf structs in the raft package. This was done for all of the same reasons CockroachDB removed the generation of these fields in cockroachdb/cockroach#38404. They come with very limited advantages but moderate disadvantages. `XXX_NoUnkeyedLiteral` and `XXX_sizecache` were only enabled recently in cc7b4fa, and this appears to have been unintentional. Meanwhile, `XXX_unrecognized` has been around for longer and has arguably more reason to stay because it can assist with forwards compatibility. However, any real mixed-version upgrade story for this package is mostly untold at this point, and keeping this field seems just as likely to cause unexpected bugs (e.g. a field was propagated but not updated correctly when passed through an old version) as it seems to fix real issues, so it also doesn't warrant its cost. This reduces the in-memory representation size of all Raft protos. Notably, it reduces the memory size of an `Entry` proto from *80 bytes* to *48 bytes* and the memory size of a `Message` proto from *392 bytes* to *264 bytes*. Both of these structs are used frequently, and often in slices, where this wasted space really starts to add up. This was motivated by a regression in microbenchmarks in CockroachDB due to cc7b4fa, which was caught in cockroachdb/cockroach#62212.
The benchmark was not using the subtest's `b.N`, so it was getting bogus results. This confused cockroachdb#62212, which saw a regression from 0.000008611111111 nanoseconds to 0.0000101 nanoseconds. With this fix, the benchmark no longer shows any movement over the past release.
…he fields in protos This commit removes the `XXX_NoUnkeyedLiteral`, `XXX_unrecognized`, and `XXX_sizecache` auto-generated fields from generated protobuf structs in the raft package. This was done for all of the same reasons CockroachDB removed the generation of these fields in cockroachdb/cockroach#38404. They come with very limited advantages but moderate disadvantages. `XXX_NoUnkeyedLiteral` and `XXX_sizecache` were only enabled recently in cc7b4fa, and this appears to have been unintentional. Meanwhile, `XXX_unrecognized` has been around for longer and has arguably more reason to stay because it can assist with forwards compatibility. However, any real mixed-version upgrade story for this package is mostly untold at this point, and keeping this field seems just as likely to cause unexpected bugs (e.g. a field was propagated but not updated correctly when passed through an old version) as it seems to fix real issues, so it also doesn't warrant its cost. This reduces the in-memory representation size of all Raft protos. Notably, it reduces the memory size of an `Entry` proto from *80 bytes* to *48 bytes* and the memory size of a `Message` proto from *392 bytes* to *264 bytes*. Both of these structs are used frequently, and often in slices, where this wasted space really starts to add up. This was motivated by a regression in microbenchmarks in CockroachDB due to cc7b4fa, which was caught in cockroachdb/cockroach#62212.
Recent benchmark results revealed a performance regression in TableReader caused by the recent introduction of the descriptor builder pattern which deep-copies descpb messages using protoutil.Clone. See issue cockroachdb#62212 and commit 15ee387. This commit adds an unsafe table descriptor builder which is used with table descriptors in execinfrapb messages. This unsafe builder does not perform a deep copy, however in this particular context this isn't a problem, because the descpb.TableDescriptor only exists as a means to serialize table descriptors across nodes in a read-only fashion and is known to not be used in any other capacity. Release note: None
Recent benchmark results revealed a performance regression in TableReader caused by the recent introduction of the descriptor builder pattern which deep-copies descpb messages using protoutil.Clone. See issue cockroachdb#62212 and commit 15ee387. This commit adds an unsafe table descriptor constructor which is used with table descriptors in execinfrapb messages. This unsafe constructor does not perform a deep copy. In this particular context this isn't a problem, because the table descriptor proto only exists as a means to serialize table descriptors across nodes in a read-only fashion and is known to not be used in any other capacity. Release note: None
PSA comment: the bisect might be incorrect (not sure on the conditions though) - for example, for |
PSA: for folks looking into the benchmark regressions, I’d recommend explicitly specifying |
62350: tracing: elide logtag creation for noop span r=erikgrinaker a=tbg Release note: None 62381: cli: update locality help to refer to region/zone r=rmloveland a=rmloveland Fixes #62375. Previously, the help output for `--locality` and `--locality-addr` referred to "data centers" as part of their usage explanations. This needed to change because we expect the majority of users to be running CockroachDB on a cloud provider that primarily uses terms like 'region' and 'availability zone'. To address this, we updated the help strings to use those terms, rather than 'datacenter', etc. Release note (cli): Updated the output of `--locality` and `--locality-addr` flags to use terms that match cloud provider names for things such as 'region' and 'zone'. 62497: server: refactor StartTenant() to pkg/server/tenant.go r=tbg a=Azhng Previously, StartTenant() function resided in pkg/server/testserver.go. However, since this function is not solely used for testing purposes, it is moved to its own file. Release note: None 62513: colexecjoin: optimize the initialization of the merge joiner r=yuzefovich a=yuzefovich Whenever the merge joiner doesn't know whether the group continues into the next batch, it appends all tuples belonging to the group currently being built to the "buffered group". Internally, that append requires us that we perform a deep copy of the relevant tuples from the input batch into a scratch space since `Enqueue` method only supports full-batch operation. Previously, we would always allocate the scratch batch in `Init` with maximum possible size; however, this is too much when the input data sets are small. This commit switches the allocation to be lazy and utilizes our `ResetMaybeReallocate` method of dynamic batch sizing. This shows up somewhat noticeably, for example, on `SQL/MultinodeCockroach/InterleavedSelect` benchmark (which is using the merge join since the interleaved reader joiner has been removed). Addresses: #62212. Release note: None Co-authored-by: Tobias Grieger <[email protected]> Co-authored-by: Rich Loveland <[email protected]> Co-authored-by: Azhng <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]>
62388: sql: add tabledesc.NewUnsafeImmutable constructor r=postamar a=postamar sql: add tabledesc.NewUnsafeImmutable constructor Recent benchmark results revealed a performance regression in TableReader caused by the recent introduction of the descriptor builder pattern which deep-copies descpb messages using protoutil.Clone. See issue #62212 and commit 15ee387. This commit adds an unsafe table descriptor constructor which is used with table descriptors in execinfrapb messages. This unsafe constructor does not perform a deep copy. In this particular context this isn't a problem, because the table descriptor proto only exists as a means to serialize table descriptors across nodes in a read-only fashion and is known to not be used in any other capacity. Release note: None 62500: teamcity: copy roachtest weekly script to build/ r=rail a=rickystewart This new script is just the content of the "custom script" of the Cockroach > Nightlies > Roachtest Weekly build configuration; but with a reference to `make` replaced by `mkrelease`. (See https://github.com/cockroachdb/dev-inf/issues/347.) Release note: None Co-authored-by: Marius Posta <[email protected]> Co-authored-by: Ricky Stewart <[email protected]>
Recent benchmark results revealed a performance regression in TableReader caused by the recent introduction of the descriptor builder pattern which deep-copies descpb messages using protoutil.Clone. See issue cockroachdb#62212 and commit 15ee387. This commit adds an unsafe table descriptor constructor which is used with table descriptors in execinfrapb messages. This unsafe constructor does not perform a deep copy. In this particular context this isn't a problem, because the table descriptor proto only exists as a means to serialize table descriptors across nodes in a read-only fashion and is known to not be used in any other capacity. Release note: None
Recent benchmark results revealed a performance regression in TableReader caused by the recent introduction of the descriptor builder pattern which deep-copies descpb messages using protoutil.Clone. See issue cockroachdb#62212 and commit 15ee387. This commit adds an unsafe table descriptor constructor which is used with table descriptors in execinfrapb messages. This unsafe constructor does not perform a deep copy. In this particular context this isn't a problem, because the table descriptor proto only exists as a means to serialize table descriptors across nodes in a read-only fashion and is known to not be used in any other capacity. Release note: None
62294: util/interval: fix BenchmarkBTreeDeleteInsertCloneEachTime r=nvanbenschoten a=nvanbenschoten The benchmark was not using the subtest's `b.N`, so it was getting bogus results. This confused #62212, which saw a regression from 0.000008611111111 nanoseconds to 0.0000101 nanoseconds. With this fix, the benchmark no longer shows any movement over the past release and produces much more realistic results: ``` cpu: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz BenchmarkBTreeDeleteInsertCloneEachTime/count=16/reset=false-16 2367748 492.7 ns/op BenchmarkBTreeDeleteInsertCloneEachTime/count=16/reset=true-16 6090852 191.7 ns/op BenchmarkBTreeDeleteInsertCloneEachTime/count=128/reset=false-16 991534 1102 ns/op BenchmarkBTreeDeleteInsertCloneEachTime/count=128/reset=true-16 2328866 508.7 ns/op BenchmarkBTreeDeleteInsertCloneEachTime/count=1024/reset=false-16 550702 2088 ns/op BenchmarkBTreeDeleteInsertCloneEachTime/count=1024/reset=true-16 1209684 1054 ns/op BenchmarkBTreeDeleteInsertCloneEachTime/count=8192/reset=false-16 463558 2176 ns/op BenchmarkBTreeDeleteInsertCloneEachTime/count=8192/reset=true-16 724923 1542 ns/op BenchmarkBTreeDeleteInsertCloneEachTime/count=65536/reset=false-16 264100 4833 ns/op BenchmarkBTreeDeleteInsertCloneEachTime/count=65536/reset=true-16 429841 2467 ns/op ``` Co-authored-by: Nathan VanBenschoten <[email protected]>
This commit picks up etcd-io/etcd#12790, which resolves a regression observed in cockroachdb#62212. Picks up the following four commits: - etcd-io/etcd@87258ef - etcd-io/etcd@2c2456b - etcd-io/etcd@ebc0174 - etcd-io/etcd@e51c697 ``` name old time/op new time/op delta EntryCache-16 114µs ± 7% 96µs ±12% -15.29% (p=0.000 n=19+19) EntryCacheClearTo-16 37.7µs ± 4% 34.9µs ± 4% -7.55% (p=0.000 n=17+18) name old alloc/op new alloc/op delta EntryCacheClearTo-16 1.28kB ± 0% 0.77kB ± 0% -40.00% (p=0.000 n=20+20) EntryCache-16 83.3kB ± 0% 50.0kB ± 0% -39.95% (p=0.000 n=19+20) name old allocs/op new allocs/op delta EntryCache-16 3.00 ± 0% 3.00 ± 0% ~ (all equal) EntryCacheClearTo-16 1.00 ± 0% 1.00 ± 0% ~ (all equal) ```
62819: go.mod: bump etcd/raft to pick up proto size improvements r=tbg a=nvanbenschoten This commit picks up etcd-io/etcd#12790, which resolves a regression observed in #62212. Picks up the following four commits, each of which I've reviewed for safety: - etcd-io/etcd@87258ef - etcd-io/etcd@2c2456b - etcd-io/etcd@ebc0174 - etcd-io/etcd@e51c697 ``` name old time/op new time/op delta EntryCache-16 114µs ± 7% 96µs ±12% -15.29% (p=0.000 n=19+19) EntryCacheClearTo-16 37.7µs ± 4% 34.9µs ± 4% -7.55% (p=0.000 n=17+18) name old alloc/op new alloc/op delta EntryCacheClearTo-16 1.28kB ± 0% 0.77kB ± 0% -40.00% (p=0.000 n=20+20) EntryCache-16 83.3kB ± 0% 50.0kB ± 0% -39.95% (p=0.000 n=19+20) name old allocs/op new allocs/op delta EntryCache-16 3.00 ± 0% 3.00 ± 0% ~ (all equal) EntryCacheClearTo-16 1.00 ± 0% 1.00 ± 0% ~ (all equal) ``` Co-authored-by: Nathan VanBenschoten <[email protected]>
@sumeerbhola I want to bring to your attention that cee3bcf is the cause of the several regressions in the microbenchmarks of SQL package. For example, here is the impact of that commit on
I know that you have been looking into addressing the impact of that change, and I want to call out that it has impact outside of simple reads like KV95 workload. |
I looked into As @rytaft pointed out, bisecting this regression is actually quite hard because the benchmarks are affected by changes in multiple layers. I arrived at a couple (but I'm sure there are others) commits that had some impact:
|
61948: roachtest/cdc: update status when watching changefeed latency r=andreimatei a=andreimatei When watching the latency, we were logging every second. This patch reduces the frequency to 1/10s and makes the test also update its status so that the current latency high-water mark is visible in the web UI. Release note: None 62166: sql: add roach test / workload for testing connection latencies r=rafiss a=RichardJCai Add roach test and workload test for testing connection latencies. Release note: None Making it a private test right now (not available through `cockroach workload`) Need to also add it to teamcity and make sure the results are reported in roachperf, will open PR there. Resolves #59394 62287: cdc: Improve avro encoder performance. r=miretskiy a=miretskiy Add avro encoder benchmarks. Avoid expensive allocations (maps) when encoding datums. Improve encoder performance by ~40%, and significantly reduce memory allocations per op. Release Notes: None 62883: sqlsmith: add support for ALTER TYPE ... DROP VALUE r=rafiss a=arulajmani Now that `ALTER TYPE ... DROP VALUE` is a thing, let's add it to SQL Smith. Closes #62812 Release note: None 62922: roachtest/tpccbench: don't require workload binary r=andreimatei a=andreimatei tpccbench was copying around the `workload` binary, but it then doesn't use it; it uses `cockroach workload ...`. This patch removes the copying. Release note: None 62925: colexec: optimize CASE, AND, and OR projection operators r=yuzefovich a=yuzefovich Previously, in CASE, AND, and OR projection operators we would always allocate internal selection vectors of maximum capacity in the constructor. This is inefficient when the query processes only small data sets. This commit makes the allocations lazy and of tight sizes based on the batch lengths. Addresses: #62212 Release note: None 62926: bench,sql: add logScope to some benchmarks r=yuzefovich a=yuzefovich This commits adds `defer log.Scope(b).Close(b)` to some benchmarks in `bench`, `colexec`, and `rowexec` packages. Note that in some cases this might not matter, but I didn't think through each carefully since it shouldn't do any harm. Release note: None 62962: sql: fix bug in close of ieSyncResultChannel r=yuzefovich a=ajwerner When finishing sending on the ieSyncResultChannel we do not need to and should not close the channel used to unblock the sender. Doing so can result in a panic if the reader is concurrently trying to unblock the sender. We could close that channel but there's no obvious reason to. Fixes #62939 Release note: None Co-authored-by: Andrei Matei <[email protected]> Co-authored-by: richardjcai <[email protected]> Co-authored-by: Yevgeniy Miretskiy <[email protected]> Co-authored-by: arulajmani <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]> Co-authored-by: Andrew Werner <[email protected]>
@yuzefovich thanks for bringing this up. I ran this benchmark on master with some instrumentation and looked at profiles. There are a few things to note:
|
63020: colfetcher: use the limit hint to size the batch r=yuzefovich a=yuzefovich We recently merged a couple of commits to take advantage of the optimizer-driven estimated row count that will be output by the cfetcher and of the limit hint when available. However, one scenario was overlooked - if there is no estimate but the limit hint is available, we don't use the latter to size the original batch, and now we do. As an additional optimization this commit refactors the initial state of the state machine so that we didn't allocate the output batch in the constructor of the ColBatchScan but do it only on the very first call to `NextBatch`. This was prompted by the fact that the limit hint is currently only set in `StartScan` which is called in `ColBatchScan.Init`, so without some changes, in the case when the limit hint can be used to size the batch we would allocate a redundant batch of size one in `NewColBatchScan`. I also think that the new initial state of the state machine makes a bit more sense to me. Additionally, this commit generalizes the logic to keep track of the limit hint as "remaining rows to be fetched." Addresses: #62212. Release note: None (this is a follow-up to a recently merged commit that had a release note) Co-authored-by: Yahor Yuzefovich <[email protected]>
This commit picks up etcd-io/etcd#12790, which resolves a regression observed in cockroachdb#62212. Picks up the following four commits: - etcd-io/etcd@87258ef - etcd-io/etcd@2c2456b - etcd-io/etcd@ebc0174 - etcd-io/etcd@e51c697 ``` name old time/op new time/op delta EntryCache-16 114µs ± 7% 96µs ±12% -15.29% (p=0.000 n=19+19) EntryCacheClearTo-16 37.7µs ± 4% 34.9µs ± 4% -7.55% (p=0.000 n=17+18) name old alloc/op new alloc/op delta EntryCacheClearTo-16 1.28kB ± 0% 0.77kB ± 0% -40.00% (p=0.000 n=20+20) EntryCache-16 83.3kB ± 0% 50.0kB ± 0% -39.95% (p=0.000 n=19+20) name old allocs/op new allocs/op delta EntryCache-16 3.00 ± 0% 3.00 ± 0% ~ (all equal) EntryCacheClearTo-16 1.00 ± 0% 1.00 ± 0% ~ (all equal) ```
Pretty much all major regressions have been investigated, most have been fixed, and I don't think anyone is planning on looking more into them, so I'm closing this issue. Thanks everyone who joined in on the fun! |
This time, split out roughly by team! Of the 568 Go microbenchmarks in the
cockroachdb/cockroach
repository, a large handful have statistically significant regressions between therelease-20.2
branch and themaster
branch.We should determine the cause of each of these regressions and decide on whether they are deliberate or accidental and warrant further investigation.
Most of this process was automated with benchdiff. This tool can also be helpful to explore specific regressions here using a command like:
Or, to automate the process of bisecting a regression, the tool can be used with a command like:
SQL
Packages:
Spreadsheet
Regressions (check to sign off):
KV/Update/SQL
, @nvanbenschoten, bisected to 5910ab4, test changed, unclear why allocs/op changed, but 👍Distinct/Ordered/...
, @yuzefovich, bisected to 76968ff, the benchmark changed slightly (the source incurs more allocations now), yet the throughput has actually improvedDistinct/Unordered/...
, @yuzefovich signed off because the regression is expected and is only on a small subset of cases; bisected on the first try to 76968ff, on the second try to d9f7bcb. The true cause is actually 11e6312.TableReader/rows=16
, @postamar + @ajwerner, bisected to 15ee387, this one looks real and serious, over a 25% reduction in throughput. After 15ee387 is addressed, there still remains 10% hit or so because of cee3bcf which is discussed elsewhere. @yuzefovich signed offRouter/MIRROR
, not an actual regression, the difference is explained byTAGS=crdb_test
being added if the benchmark binary built viamake test
Append/bytes/AppendWithSel/NullProbability=0.2
, can't reproduceMergeJoiner
, not reproducibleJoinReader/reqOrdering=false
@yuzefovich signed off, same asTableReader/rows=16
above.LogicalProjOp/OR,useSel=true,hasNulls=true
, @yuzefovich signed off since this benchmark is known to be a bit flaky. The absolute speed is still very high, not worth digging in.RowChannelPipeline/length=4
, no regression, same asRouter/MIRROR
aboveSortAll/rows=16
, can't reproduceAllSpooler
, can't reproduceSQL/MultinodeCockroach/InterleavedSelect/count=10
,@yuzefovich saw 20% hit, out of which 6% or so can be attributed to using the vectorized engine, and the remaining difference likely has the same root cause asthe significant increase in allocs is due to the vectorized engine and is expected, but the usage of vec engine doesn't actually reduce the throughput. Still there is about 6-8% hit when comparing against 20.2. @yuzefovich signed off since it doesn't seem worth looking into.TableReader/rows=16
Select2/Cockroach
@yuzefovich is investigating. The vectorized engine is responsible for 29% hit in throughput.The perf hit is reduced to 13% by colexec: optimize CASE, AND, and OR projection operators #62925, the remaining difference is caused by the fact the CASE, AND/OR logical operations have higher overhead in the vectorized engine, and this benchmark is basically the worst case for it. @yuzefovich signed off.VirtualTableQueries/select_crdb_internal.tables_with_1_fk
@rafiss investigated and found no culprit. since this test is new in 21.1 (but got backported to 20.2) it's not that easy to bisect. also, theroundtrips
stat did decrease, which is more important since this is I/O bound not CPU/mem bound.EndToEnd/tpcc-new-order/EndToEnd
@rytaft bisected to sql: decrease vectorize_row_count_threshold to 0 #55713, "decrease vectorize_row_count_threshold to 0".IsDatumEqual
@rytaft investigated, and this is not a regression. If anything, looks like a small improvement.WideTableIgnoreColumns/Cockroach
, actually an improvement, likely same asRouter/MIRROR
aboveScan/Cockroach/count=1000/limit=10
@yuzefovich signed off. The 2x or so hit is explained by the switch to the vectorized engine by default for everything, and it is mitigated substantially by sql: hint scan batch size by expected row count #62282. The remaining difference is likely explained bymitigated by colfetcher: use the limit hint to size the batch #63020TableReader/rows=16
above.Planning
@rytaft investigated, at least half of the regression is due to using the vectorized engine. Couldn't pin the remainder of the regression to a single commit.VecSkipScan/Bench_scan_with_skip
, actually an improvement, likely same asRouter/MIRROR
aboveAlterTableAddCheckConstraint/alter_table_add_3_check_constraints
NameResolution/MultinodeCockroach
@rafiss investigated; bisected to sql: decrease vectorize_row_count_threshold to 0 #55713 (vectorize_row_count_threshold=0). seems expected, and also doesn't matter since the test is supposed to be testing name resolution.SQL/Cockroach/InsertFK/count=1000/nFks=1
@RaduBerinde bisected to *: configure the use of separated intents in a cluster #59829, more info in there.Phases/kv-read-const
@mgartner bisected to opt: use safer struct initialization pattern #58386, mitigated partially by opt: speed up new init pattern #62767IndexConstraints
@mgartner bisected to opt/idxconstraint: recognize stored column expressions #55867 and opt: use safer struct initialization pattern #58386, regression is minimal and unavoidabletime/op
,alloc/op
,allocs/op
, andspeed
.roundtrips
looks good though!Bulk IO
Packages:
Spreadsheet
Regressions (check to sign off):
None, though the two most important packages here are skipped.
KV
Packages:
Spreadsheet
Regressions (check to sign off):
BenchmarkEntryCache
: @nvanbenschoten, bisected to 6e4c3ae, fix in raft: disable XXX_NoUnkeyedLiteral, XXX_unrecognized, and XXX_sizecache fields in protos etcd-io/etcd#12790BenchmarkEntryCacheClearTo
: @nvanbenschoten, bisected to 6e4c3ae, fix in raft: disable XXX_NoUnkeyedLiteral, XXX_unrecognized, and XXX_sizecache fields in protos etcd-io/etcd#12790SystemConfigDeltaFilter
: @nvanbenschoten, bisected to 836f39c,ZoneConfig
memory size increased, 👍Storage
Packages:
Spreadsheet
Regressions (check to sign off):
PebbleMapIteration
ScanDecodeKeyValue/getTs=true
: @nvanbenschoten, bisected to 11abdda, 1ns difference due to new switch case inDecodeKey
, 👍Misc
Packages:
Spreadsheet
Regressions (check to sign off):
BenchmarkHeader
: @knz, bisected to a74b36aBTreeDeleteInsertCloneEachTime/count=65536/reset=false
: @nvanbenschoten, broken benchmark, fixed in util/interval: fix BenchmarkBTreeDeleteInsertCloneEachTime #62294ParseTimestampComparison/2003-06-12_04:05:06.789_America/New_York/ParseInLocation
: @otan, bisected to 90efaafThe text was updated successfully, but these errors were encountered: