release-20.2: roachtest/tpcc: don't scatter on each tpccbench search iteration #58051

nvanbenschoten · 2020-12-17T23:02:01Z

Backport 1/1 commits from #58014.

/cc @cockroachdb/release

Fixes #48255.
Fixes #53443.
Fixes #54258.
Fixes #54570.
Fixes #55599.
Fixes #55688.
Fixes #55817.
Fixes #55939.
Fixes #56996.
Fixes #57062.
Fixes #57864.

This needs to be backported to release-20.1 and release-20.2.

In #55688 (comment),
we saw that the failures to create load generators in tpccbench were due to
long-running SCATTER operations. These operations weren't stuck, but were very
slow due to the amount of data being moved and the 2MiB/s limit on snapshots. In
hindsight, this should have been expected, as scatter has the potential to
rebalance data and was being run on datasets on the order of 100s of GBs or even
TBs in size.

But this alone did not explain why we used to see this issue infrequently and
only recently began seeing it regularly. We determined that the most likely
cause of the recent regression is #56942. That PR fixed a race condition in
tpcc's scatterRanges function which often resulted in 9 scatters of the
warehouse table instead of 1 scatter of each table in the database. So before
#56942, we were often (but not always, due to the racy nature of the bug)
avoiding the scatter on all but the dataset's smallest table. After #56942, we
were always scattering all 9 tables in the dataset, leading to much larger
rebalancing.

To address these issues, this commit removes the per-iteration scattering in
tpccbench. Scattering on each search iteration was a misguided decision. It
wasn't needed because we already scatter once during dataset import (with a
higher kv.snapshot_rebalance.max_rate). It was also disruptive, as it had the
potential to slow down the test significantly and cause issues like the one we
are fixing here.

With this change, I've seen tpccbench/nodes=6/cpu=16/multi-az go from failing
6 out of 10 times to succeeding 10 out of 10 times. This change appears to have
no impact on performance.

cockroach-teamcity · 2020-12-17T23:02:09Z

nvanbenschoten requested a review from ajwerner December 17, 2020 23:02

nvanbenschoten merged commit 5793419 into cockroachdb:release-20.2 Dec 21, 2020

This was referenced Dec 21, 2020

roachtest: tpccbench/nodes=9/cpu=4/chaos/partition failed #58063

Closed

roachtest: tpccbench/nodes=9/cpu=4/multi-region failed #58071

Closed

roachtest: tpccbench/nodes=6/cpu=16/multi-az failed #58083

Closed

nvanbenschoten deleted the backport20.2-58014 branch December 31, 2020 06:11
