
roachtest: tpccbench/nodes=9/cpu=4/chaos/partition failed #57864

Closed
cockroach-teamcity opened this issue Dec 12, 2020 · 1 comment · Fixed by #58014
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.

Comments

@cockroach-teamcity
Member

(roachtest).tpccbench/nodes=9/cpu=4/chaos/partition failed on master@960b4cfc54c8b78df56e62f07d7a07b986ceacff:

		  -- stack trace:
		  | main.runTPCCBench.func3.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:824
		  | [...repeated from below...]
		Wraps: (6) error running tpcc load generator
		Wraps: (7) attached stack trace
		  -- stack trace:
		  | main.(*cluster).RunE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2292
		  | main.runTPCCBench.func3.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:821
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2626
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (8) output in run_151328.514_n10_workload_run_tpcc
		Wraps: (9) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2515122-1607756763-21-n10cpu4:10 -- ./workload run tpcc --warehouses=2000 --active-warehouses=1210 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s --partitions=3 --method=simple {pgurl:10} --histograms=perf/warehouses=1210/stats.json returned
		  | stderr:
		  | <... some data truncated by circular buffer; go to artifacts for details ...>
		  | R TABLE customer SCATTER": driver: bad connection
		  | W201212 15:34:35.725321 1 workload/cli/run.go:368  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE customer SCATTER": driver: bad connection
		  | W201212 15:37:37.603978 1 workload/cli/run.go:368  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE order_line SCATTER": driver: bad connection
		  | W201212 15:37:42.368579 1 workload/cli/run.go:368  retrying after error while creating load: failed to initialize the load generator: could not determine if tables are partitioned: EOF
		  | W201212 15:40:44.334699 1 workload/cli/run.go:368  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE order_line SCATTER": driver: bad connection
		  | W201212 15:43:46.424375 1 workload/cli/run.go:368  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE customer SCATTER": driver: bad connection
		  | W201212 15:46:48.695266 1 workload/cli/run.go:368  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE stock SCATTER": driver: bad connection
		  | W201212 15:49:50.607559 1 workload/cli/run.go:368  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE customer SCATTER": driver: bad connection
		  | W201212 15:52:52.618407 1 workload/cli/run.go:368  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE stock SCATTER": driver: bad connection
		  | W201212 15:55:54.594818 1 workload/cli/run.go:368  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE order_line SCATTER": driver: bad connection
		  | W201212 15:58:56.655218 1 workload/cli/run.go:368  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE customer SCATTER": driver: bad connection
		  | W201212 16:01:58.607541 1 workload/cli/run.go:368  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE stock SCATTER": driver: bad connection
		  | W201212 16:05:00.172228 1 workload/cli/run.go:368  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE stock SCATTER": driver: bad connection
		  | W201212 16:08:02.375768 1 workload/cli/run.go:368  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE history SCATTER": EOF
		  | W201212 16:11:04.296977 1 workload/cli/run.go:368  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE customer SCATTER": driver: bad connection
		  | E201212 16:14:04.384111 1 workload/cli/run.go:382  Attempt to create load generator failed. It's been more than 1h0m0s since we started trying to create the load generator so we're giving up. Last failure: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE customer SCATTER": driver: bad connection
		  | Error: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE customer SCATTER": driver: bad connection
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 10. Command with error:
		  |   | ```
		  |   | ./workload run tpcc --warehouses=2000 --active-warehouses=1210 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s --partitions=3 --method=simple {pgurl:10} --histograms=perf/warehouses=1210/stats.json
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (10) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *withstack.withStack (8) *errutil.withPrefix (9) *main.withCommandDetails (10) *exec.ExitError


Artifacts: /tpccbench/nodes=9/cpu=4/chaos/partition
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Dec 12, 2020
@cockroach-teamcity
Member Author

(roachtest).tpccbench/nodes=9/cpu=4/chaos/partition failed on master@de91557c0634ab3797356d445ba39e37d45d8205:

		  -- stack trace:
		  | main.runTPCCBench.func3.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:824
		  | [...repeated from below...]
		Wraps: (6) error running tpcc load generator
		Wraps: (7) attached stack trace
		  -- stack trace:
		  | main.(*cluster).RunE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2292
		  | main.runTPCCBench.func3.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:821
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2626
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (8) output in run_141150.475_n10_workload_run_tpcc
		Wraps: (9) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2522960-1608102201-21-n10cpu4:10 -- ./workload run tpcc --warehouses=2000 --active-warehouses=1050 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s --partitions=3 --method=simple {pgurl:10} --histograms=perf/warehouses=1050/stats.json returned
		  | stderr:
		  | <... some data truncated by circular buffer; go to artifacts for details ...>
		  | ER": driver: bad connection
		  | W201216 14:32:57.601218 1 workload/cli/run.go:368  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE customer SCATTER": driver: bad connection
		  | W201216 14:35:59.488653 1 workload/cli/run.go:368  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE customer SCATTER": driver: bad connection
		  | W201216 14:39:01.338487 1 workload/cli/run.go:368  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE stock SCATTER": EOF
		  | W201216 14:42:03.244921 1 workload/cli/run.go:368  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE stock SCATTER": driver: bad connection
		  | W201216 14:45:05.333069 1 workload/cli/run.go:368  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE order_line SCATTER": driver: bad connection
		  | W201216 14:48:07.606648 1 workload/cli/run.go:368  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE order_line SCATTER": driver: bad connection
		  | W201216 14:51:09.515731 1 workload/cli/run.go:368  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE order_line SCATTER": driver: bad connection
		  | W201216 14:54:11.533041 1 workload/cli/run.go:368  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE customer SCATTER": driver: bad connection
		  | W201216 14:57:13.506214 1 workload/cli/run.go:368  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE customer SCATTER": driver: bad connection
		  | W201216 15:00:15.565252 1 workload/cli/run.go:368  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE customer SCATTER": driver: bad connection
		  | W201216 15:03:17.522853 1 workload/cli/run.go:368  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE order_line SCATTER": driver: bad connection
		  | W201216 15:05:59.187737 1 workload/cli/run.go:368  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE stock SCATTER": driver: bad connection
		  | W201216 15:08:12.706118 1 workload/cli/run.go:368  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE stock SCATTER": driver: bad connection
		  | W201216 15:11:14.614153 1 workload/cli/run.go:368  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE district SCATTER": EOF
		  | E201216 15:14:14.701554 1 workload/cli/run.go:382  Attempt to create load generator failed. It's been more than 1h0m0s since we started trying to create the load generator so we're giving up. Last failure: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE stock SCATTER": driver: bad connection
		  | Error: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE stock SCATTER": driver: bad connection
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 10. Command with error:
		  |   | ```
		  |   | ./workload run tpcc --warehouses=2000 --active-warehouses=1050 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s --partitions=3 --method=simple {pgurl:10} --histograms=perf/warehouses=1050/stats.json
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (10) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *withstack.withStack (8) *errutil.withPrefix (9) *main.withCommandDetails (10) *exec.ExitError


craig bot pushed a commit that referenced this issue Dec 17, 2020
58014: roachtest/tpcc: don't scatter on each tpccbench search iteration r=ajwerner a=nvanbenschoten

Fixes #48255.
Fixes #53443.
Fixes #54258.
Fixes #54570.
Fixes #55599.
Fixes #55688.
Fixes #55817.
Fixes #55939.
Fixes #56996.
Fixes #57062.
Fixes #57864.

This needs to be backported to `release-20.1` and `release-20.2`

In #55688 (comment),
we saw that the failures to create load generators in tpccbench were due to
long-running SCATTER operations. These operations weren't stuck, but were very
slow due to the amount of data being moved and the 2MiB/s limit on snapshots. In
hindsight, this should have been expected, as scatter has the potential to
rebalance data and was being run on datasets on the order of 100s of GBs or even
TBs in size.

But this alone did not explain why we used to see this issue infrequently and
only recently began seeing it regularly. We determined that the most likely
reason why this has recently gotten worse is because of #56942. That PR fixed a
race condition in tpcc's `scatterRanges` function which often resulted in 9
scatters of the `warehouse` table instead of 1 scatter of each table in the
database. So before this PR, we were often (but not always, due to the racy
nature of the bug) avoiding the scatter on all but the dataset's smallest table.
After this PR, we were always scattering all 9 tables in the dataset, leading to
much larger rebalancing.

To address these issues, this commit removes the per-iteration scattering in
tpccbench. Scattering on each search iteration was a misguided decision. It
wasn't needed because we already scatter once during dataset import (with a
higher `kv.snapshot_rebalance.max_rate`). It was also disruptive as it had the
potential to slow down the test significantly and cause issues like the one we
are fixing here.
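The import-time setup the message refers to can be sketched as follows. This is illustrative, not the test's actual code path: `kv.snapshot_rebalance.max_rate` is the cluster setting named above, `ALTER TABLE ... SCATTER` is the statement from the logs, and the rate value is a placeholder.

```sql
-- One-time scatter at import, with snapshots temporarily un-throttled
-- (value illustrative; the default limit cited above was 2MiB/s).
SET CLUSTER SETTING kv.snapshot_rebalance.max_rate = '64MiB';
ALTER TABLE tpcc.warehouse SCATTER;
-- ...repeat once per table, then restore the setting.
```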

With this change, I've seen `tpccbench/nodes=6/cpu=16/multi-az` go from failing
6 out of 10 times to succeeding 10 out of 10 times. This change appears to have
no impact on performance.

Co-authored-by: Nathan VanBenschoten <[email protected]>
craig bot pushed a commit that referenced this issue Dec 17, 2020
58014: roachtest/tpcc: don't scatter on each tpccbench search iteration r=nvanbenschoten a=nvanbenschoten

(same commit message as above)
@craig craig bot closed this as completed in #58014 Dec 17, 2020
@craig craig bot closed this as completed in 9dc433d Dec 17, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Dec 17, 2020
(same commit message as above, with cross-repo issue references)
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Dec 17, 2020
(same commit message as above, with cross-repo issue references)