
roachtest: tpccbench/nodes=6/cpu=16/multi-az failed #55817

Closed
cockroach-teamcity opened this issue Oct 21, 2020 · 24 comments · Fixed by #58014
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Milestone: 20.2

Comments

@cockroach-teamcity

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@d67a35edddabcdd18954196a5e20bfd2a55a27e4:

		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2642
		  | main.runTPCCBench.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:851
		  | github.com/cockroachdb/cockroach/pkg/util/search.searchWithSearcher
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:43
		  | github.com/cockroachdb/cockroach/pkg/util/search.(*lineSearcher).Search
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:173
		  | main.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:759
		  | main.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:576
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:755
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 6: dead
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

	cluster.go:1657,context.go:140,cluster.go:1646,test_runner.go:836: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2381239-1603261938-66-n7cpu16-geo --oneshot --ignore-empty-nodes: exit status 1 7: skipped
		6: dead
		2: 24945
		4: 25872
		5: 26108
		3: 26172
		1: 27300
		Error: UNCLASSIFIED_PROBLEM: 6: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1143
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:267
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/[email protected]/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/[email protected]/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/[email protected]/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1839
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (3) 6: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError

Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity added branch-release-20.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Oct 21, 2020
@cockroach-teamcity added this to the 20.2 milestone Oct 21, 2020
@nvanbenschoten

This looks a lot like what we've been seeing recently in #55599 on the release-20.1 branch.

We see that the node was OOM killed:

[14624.526749] Out of memory: Kill process 25185 (cockroach) score 953 or sacrifice child
[14624.535067] Killed process 25185 (cockroach) total-vm:22608332kB, anon-rss:14043276kB, file-rss:0kB, shmem-rss:0kB
[14625.487363] oom_reaper: reaped process 25185 (cockroach), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

We see from the logs that the system has 14 GiB of memory available, so an anon-rss of 14043276 kB (about 13.4 GiB) effectively exhausts it:

I201021 15:11:40.614092 1 server/config.go:433 ⋮ system total memory: ‹14 GiB›

But strangely, we don't see anywhere near this much memory in use in the memory profiles (taken 2 minutes before the OOM), the memory stats (last captured 1 minute before the OOM), or the runtime stats log (last reported about 10 seconds before the OOM). In the memory stats, we see about 1.6 GiB allocated in Go ("Alloc": 1743265512) and about 3.9 GiB allocated in CGo by Pebble ("CGoAllocatedBytes": 4180412592). This agrees with the last runtime stats log we see:

I201021 15:24:11.937855 558 server/status/runtime.go:522 ⋮ [n6] runtime stats: 7.1 GiB RSS, 5413 goroutines, 1.5 GiB/409 MiB/2.3 GiB GO alloc/idle/total, 3.9 GiB/4.8 GiB CGO alloc/total, 36039.6 CGO/sec, 1030.9/94.8 %(u/s)time, 0.0 %gc (5x), 135 MiB/116 MiB (r/w)net

So within those final 10 seconds, there must have been a large burst of heap allocation to push us over the 14 GiB limit.
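
As an aside on where these numbers come from: the Go-side figures (like "Alloc": 1743265512) are read from runtime.MemStats, while the CGo figure is tracked separately by the C allocator and is not part of MemStats. A minimal sketch of sampling the same Go-side split, with field choices that approximate (but are not guaranteed to match) the server's runtime stats line:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	var m runtime.MemStats
	// ReadMemStats briefly stops the world and snapshots the Go
	// runtime's heap accounting.
	runtime.ReadMemStats(&m)

	gib := func(b uint64) float64 { return float64(b) / (1 << 30) }

	// Alloc is live heap bytes ("GO alloc" in the stats line);
	// HeapIdle-HeapReleased approximates "idle"; HeapSys approximates
	// "total". The CGo numbers come from the C allocator, not MemStats.
	fmt.Printf("GO alloc=%.1f GiB idle=%.1f GiB total=%.1f GiB\n",
		gib(m.Alloc), gib(m.HeapIdle-m.HeapReleased), gib(m.HeapSys))
}
```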

One thing that's apparent from the logs is that the cluster is not healthy during this run: many requests take minutes to complete, and it's not clear why. Most of the related log lines come from the DistSender and from contention handling, so one possibility is that transactions deadlocked somewhere. But the workload logs do not support this, as throughput never stalls during the lifetime of the workload. So the source of the cluster's unhappiness remains unclear.

One unsubstantiated theory for how this could lead to an OOM is that we seem to be very loud in the logs once these slow requests complete. The node had 5413 active goroutines a few seconds before the OOM; perhaps the slow requests all completed within some short interval and each performed verbose logging in response.

@nvanbenschoten

The next thing to look at is the goroutine dumps, to determine where goroutines were piling up while the cluster was unhealthy.
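
A quick way to grab such a dump from a live node is the standard Go pprof endpoint on the server's HTTP port; a small sketch (the address is a placeholder, and this assumes the HTTP endpoint is reachable):

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// debug=2 asks the pprof handler for full, unabridged stacks for
	// every goroutine. Replace localhost:8080 with the node's actual
	// HTTP address.
	resp, err := http.Get("http://localhost:8080/debug/pprof/goroutine?debug=2")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Stream the dump to stdout so it can be redirected to a file.
	if _, err := io.Copy(os.Stdout, resp.Body); err != nil {
		log.Fatal(err)
	}
}
```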

@cockroach-teamcity

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@93b4f0405144660ac8dd4f24f1a588c09dcc3814:

		  -- stack trace:
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2698
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.runTPCCBench.func3.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:834
		  | [...repeated from below...]
		Wraps: (6) error running tpcc load generator
		Wraps: (7) attached stack trace
		  -- stack trace:
		  | main.(*cluster).RunE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2290
		  | main.runTPCCBench.func3.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:831
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2632
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (8) output in run_140214.202_n7_workload_run_tpcc
		Wraps: (9) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2412993-1604387756-66-n7cpu16-geo:7 -- ./workload run tpcc --warehouses=5000 --active-warehouses=2875 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s {pgurl:1-6} --histograms=perf/warehouses=2875/stats.json returned
		  | stderr:
		  | I201103 14:02:15.822295 1 workload/cli/run.go:356  creating load generator...
		  | E201103 17:30:35.625699 1 workload/cli/run.go:379  Attempt to create load generator failed. It's been more than 1h0m0s since we started trying to create the load generator so we're giving up. Last failure: failed to initialize the load generator: preparing 
		  | 		UPDATE district
		  | 		SET d_next_o_id = d_next_o_id + 1
		  | 		WHERE d_w_id = $1 AND d_id = $2
		  | 		RETURNING d_tax, d_next_o_id: context deadline exceeded
		  | Error: failed to initialize the load generator: preparing 
		  | 		UPDATE district
		  | 		SET d_next_o_id = d_next_o_id + 1
		  | 		WHERE d_w_id = $1 AND d_id = $2
		  | 		RETURNING d_tax, d_next_o_id: context deadline exceeded
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 7. Command with error:
		  |   | ```
		  |   | ./workload run tpcc --warehouses=5000 --active-warehouses=2875 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s {pgurl:1-6} --histograms=perf/warehouses=2875/stats.json
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		  | Initializing 5750 connections...
		  | Initializing 28750 workers and preparing statements...
		Wraps: (10) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *withstack.withStack (8) *errutil.withPrefix (9) *main.withCommandDetails (10) *exec.ExitError

Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@7f10df809db5076075b6ec63bc744b62109ee459:

		  -- stack trace:
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2698
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.runTPCCBench.func3.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:834
		  | [...repeated from below...]
		Wraps: (6) error running tpcc load generator
		Wraps: (7) attached stack trace
		  -- stack trace:
		  | main.(*cluster).RunE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2290
		  | main.runTPCCBench.func3.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:831
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2632
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (8) output in run_144654.587_n7_workload_run_tpcc
		Wraps: (9) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2425022-1604733173-63-n7cpu16-geo:7 -- ./workload run tpcc --warehouses=5000 --active-warehouses=3175 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s {pgurl:1-6} --histograms=perf/warehouses=3175/stats.json returned
		  | stderr:
		  | I201107 14:46:56.190518 1 workload/cli/run.go:356  creating load generator...
		  | E201107 18:54:30.674656 1 workload/cli/run.go:379  Attempt to create load generator failed. It's been more than 1h0m0s since we started trying to create the load generator so we're giving up. Last failure: failed to initialize the load generator: preparing 
		  | 		UPDATE district
		  | 		SET d_next_o_id = d_next_o_id + 1
		  | 		WHERE d_w_id = $1 AND d_id = $2
		  | 		RETURNING d_tax, d_next_o_id: context deadline exceeded
		  | Error: failed to initialize the load generator: preparing 
		  | 		UPDATE district
		  | 		SET d_next_o_id = d_next_o_id + 1
		  | 		WHERE d_w_id = $1 AND d_id = $2
		  | 		RETURNING d_tax, d_next_o_id: context deadline exceeded
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 7. Command with error:
		  |   | ```
		  |   | ./workload run tpcc --warehouses=5000 --active-warehouses=3175 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s {pgurl:1-6} --histograms=perf/warehouses=3175/stats.json
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		  | Initializing 6350 connections...
		  | Initializing 31750 workers and preparing statements...
		Wraps: (10) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *withstack.withStack (8) *errutil.withPrefix (9) *main.withCommandDetails (10) *exec.ExitError

Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@0884565b5005096ad7ac713a89f4f6d70a4e2406:

		  -- stack trace:
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2698
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.runTPCCBench.func3.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:834
		  | [...repeated from below...]
		Wraps: (6) error running tpcc load generator
		Wraps: (7) attached stack trace
		  -- stack trace:
		  | main.(*cluster).RunE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2290
		  | main.runTPCCBench.func3.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:831
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2632
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (8) output in run_133220.520_n7_workload_run_tpcc
		Wraps: (9) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2439965-1605251372-65-n7cpu16-geo:7 -- ./workload run tpcc --warehouses=5000 --active-warehouses=2675 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s {pgurl:1-6} --histograms=perf/warehouses=2675/stats.json returned
		  | stderr:
		  | I201113 13:32:22.023743 1 workload/cli/run.go:356  creating load generator...
		  | E201113 17:43:17.662218 1 workload/cli/run.go:379  Attempt to create load generator failed. It's been more than 1h0m0s since we started trying to create the load generator so we're giving up. Last failure: failed to initialize the load generator: preparing 
		  | 		UPDATE district
		  | 		SET d_next_o_id = d_next_o_id + 1
		  | 		WHERE d_w_id = $1 AND d_id = $2
		  | 		RETURNING d_tax, d_next_o_id: context deadline exceeded
		  | Error: failed to initialize the load generator: preparing 
		  | 		UPDATE district
		  | 		SET d_next_o_id = d_next_o_id + 1
		  | 		WHERE d_w_id = $1 AND d_id = $2
		  | 		RETURNING d_tax, d_next_o_id: context deadline exceeded
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 7. Command with error:
		  |   | ```
		  |   | ./workload run tpcc --warehouses=5000 --active-warehouses=2675 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s {pgurl:1-6} --histograms=perf/warehouses=2675/stats.json
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		  | Initializing 5350 connections...
		  | Initializing 26750 workers and preparing statements...
		Wraps: (10) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *withstack.withStack (8) *errutil.withPrefix (9) *main.withCommandDetails (10) *exec.ExitError

Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@36822fee607976556f265c683da0eaffcfdfec8b:

		  -- stack trace:
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2698
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.runTPCCBench.func3.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:834
		  | [...repeated from below...]
		Wraps: (6) error running tpcc load generator
		Wraps: (7) attached stack trace
		  -- stack trace:
		  | main.(*cluster).RunE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2290
		  | main.runTPCCBench.func3.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:831
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2632
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (8) output in run_140143.823_n7_workload_run_tpcc
		Wraps: (9) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2444358-1605424077-63-n7cpu16-geo:7 -- ./workload run tpcc --warehouses=5000 --active-warehouses=2875 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s {pgurl:1-6} --histograms=perf/warehouses=2875/stats.json returned
		  | stderr:
		  | I201115 14:01:45.310543 1 workload/cli/run.go:356  creating load generator...
		  | E201115 18:01:06.879608 1 workload/cli/run.go:379  Attempt to create load generator failed. It's been more than 1h0m0s since we started trying to create the load generator so we're giving up. Last failure: failed to initialize the load generator: preparing 
		  | 		UPDATE district
		  | 		SET d_next_o_id = d_next_o_id + 1
		  | 		WHERE d_w_id = $1 AND d_id = $2
		  | 		RETURNING d_tax, d_next_o_id: context deadline exceeded
		  | Error: failed to initialize the load generator: preparing 
		  | 		UPDATE district
		  | 		SET d_next_o_id = d_next_o_id + 1
		  | 		WHERE d_w_id = $1 AND d_id = $2
		  | 		RETURNING d_tax, d_next_o_id: context deadline exceeded
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 7. Command with error:
		  |   | ```
		  |   | ./workload run tpcc --warehouses=5000 --active-warehouses=2875 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s {pgurl:1-6} --histograms=perf/warehouses=2875/stats.json
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		  | Initializing 5750 connections...
		  | Initializing 28750 workers and preparing statements...
		Wraps: (10) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *withstack.withStack (8) *errutil.withPrefix (9) *main.withCommandDetails (10) *exec.ExitError

Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@36822fee607976556f265c683da0eaffcfdfec8b:

		  -- stack trace:
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2698
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.runTPCCBench.func3.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:834
		  | [...repeated from below...]
		Wraps: (6) error running tpcc load generator
		Wraps: (7) attached stack trace
		  -- stack trace:
		  | main.(*cluster).RunE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2290
		  | main.runTPCCBench.func3.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:831
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2632
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (8) output in run_035558.862_n7_workload_run_tpcc
		Wraps: (9) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2445552-1605475705-64-n7cpu16-geo:7 -- ./workload run tpcc --warehouses=5000 --active-warehouses=2675 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s {pgurl:1-6} --histograms=perf/warehouses=2675/stats.json returned
		  | stderr:
		  | I201116 03:56:00.364332 1 workload/cli/run.go:356  creating load generator...
		  | E201116 07:49:10.168251 1 workload/cli/run.go:379  Attempt to create load generator failed. It's been more than 1h0m0s since we started trying to create the load generator so we're giving up. Last failure: failed to initialize the load generator: preparing 
		  | 		UPDATE district
		  | 		SET d_next_o_id = d_next_o_id + 1
		  | 		WHERE d_w_id = $1 AND d_id = $2
		  | 		RETURNING d_tax, d_next_o_id: context deadline exceeded
		  | Error: failed to initialize the load generator: preparing 
		  | 		UPDATE district
		  | 		SET d_next_o_id = d_next_o_id + 1
		  | 		WHERE d_w_id = $1 AND d_id = $2
		  | 		RETURNING d_tax, d_next_o_id: context deadline exceeded
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 7. Command with error:
		  |   | ```
		  |   | ./workload run tpcc --warehouses=5000 --active-warehouses=2675 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s {pgurl:1-6} --histograms=perf/warehouses=2675/stats.json
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		  | Initializing 5350 connections...
		  | Initializing 26750 workers and preparing statements...
		Wraps: (10) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *withstack.withStack (8) *errutil.withPrefix (9) *main.withCommandDetails (10) *exec.ExitError

Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@4aabe9625bcaf74d6c25bfc3c88cfe7662cb2d80:

		  -- stack trace:
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2698
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.runTPCCBench.func3.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:834
		  | [...repeated from below...]
		Wraps: (6) error running tpcc load generator
		Wraps: (7) attached stack trace
		  -- stack trace:
		  | main.(*cluster).RunE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2290
		  | main.runTPCCBench.func3.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:831
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2632
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (8) output in run_131541.843_n7_workload_run_tpcc
		Wraps: (9) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2453753-1605683603-64-n7cpu16-geo:7 -- ./workload run tpcc --warehouses=5000 --active-warehouses=2575 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s {pgurl:1-6} --histograms=perf/warehouses=2575/stats.json returned
		  | stderr:
		  | I201118 13:15:43.327765 1 workload/cli/run.go:356  creating load generator...
		  | E201118 18:40:39.361061 1 workload/cli/run.go:379  Attempt to create load generator failed. It's been more than 1h0m0s since we started trying to create the load generator so we're giving up. Last failure: failed to initialize the load generator: preparing 
		  | 		UPDATE district
		  | 		SET d_next_o_id = d_next_o_id + 1
		  | 		WHERE d_w_id = $1 AND d_id = $2
		  | 		RETURNING d_tax, d_next_o_id: context deadline exceeded
		  | Error: failed to initialize the load generator: preparing 
		  | 		UPDATE district
		  | 		SET d_next_o_id = d_next_o_id + 1
		  | 		WHERE d_w_id = $1 AND d_id = $2
		  | 		RETURNING d_tax, d_next_o_id: context deadline exceeded
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 7. Command with error:
		  |   | ```
		  |   | ./workload run tpcc --warehouses=5000 --active-warehouses=2575 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s {pgurl:1-6} --histograms=perf/warehouses=2575/stats.json
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		  | Initializing 5150 connections...
		  | Initializing 25750 workers and preparing statements...
		Wraps: (10) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *withstack.withStack (8) *errutil.withPrefix (9) *main.withCommandDetails (10) *exec.ExitError

Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@abd8a5e802a06822843870e4a358aa3cd789a3c9:

		  -- stack trace:
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2698
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.runTPCCBench.func3.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:834
		  | [...repeated from below...]
		Wraps: (6) error running tpcc load generator
		Wraps: (7) attached stack trace
		  -- stack trace:
		  | main.(*cluster).RunE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2290
		  | main.runTPCCBench.func3.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:831
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2632
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (8) output in run_135108.706_n7_workload_run_tpcc
		Wraps: (9) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2460657-1605856304-68-n7cpu16-geo:7 -- ./workload run tpcc --warehouses=5000 --active-warehouses=2875 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s {pgurl:1-6} --histograms=perf/warehouses=2875/stats.json returned
		  | stderr:
		  | I201120 13:51:10.223046 1 workload/cli/run.go:356  creating load generator...
		  | E201120 15:27:08.300582 1 workload/cli/run.go:379  Attempt to create load generator failed. It's been more than 1h0m0s since we started trying to create the load generator so we're giving up. Last failure: failed to initialize the load generator: preparing 
		  | 		UPDATE district
		  | 		SET d_next_o_id = d_next_o_id + 1
		  | 		WHERE d_w_id = $1 AND d_id = $2
		  | 		RETURNING d_tax, d_next_o_id: context deadline exceeded
		  | Error: failed to initialize the load generator: preparing 
		  | 		UPDATE district
		  | 		SET d_next_o_id = d_next_o_id + 1
		  | 		WHERE d_w_id = $1 AND d_id = $2
		  | 		RETURNING d_tax, d_next_o_id: context deadline exceeded
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 7. Command with error:
		  |   | ```
		  |   | ./workload run tpcc --warehouses=5000 --active-warehouses=2875 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s {pgurl:1-6} --histograms=perf/warehouses=2875/stats.json
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		  | Initializing 5750 connections...
		  | Initializing 28750 workers and preparing statements...
		Wraps: (10) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *withstack.withStack (8) *errutil.withPrefix (9) *main.withCommandDetails (10) *exec.ExitError

Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@8729c06ed8e3baba67ab5651588b00e015e696ee:

		  | runtime.futexsleep(0xc000c80148, 0x0, 0xffffffffffffffff)
		  | 	/usr/local/go/src/runtime/os_linux.go:45 +0x46 fp=0x7f7d25ffacb8 sp=0x7f7d25ffac68 pc=0x44c4a6
		  | runtime.notesleep(0xc000c80148)
		  | 	/usr/local/go/src/runtime/lock_futex.go:159 +0x9f fp=0x7f7d25ffacf0 sp=0x7f7d25ffacb8 pc=0x424dff
		  | runtime.stopm()
		  | 	/usr/local/go/src/runtime/proc.go:1918 +0xc5 fp=0x7f7d25ffad18 sp=0x7f7d25ffacf0 pc=0x456cc5
		  | runtime.findrunnable(0xc000065800, 0x0)
		  | 	/usr/local/go/src/runtime/proc.go:2479 +0xa7f fp=0x7f7d25ffae00 sp=0x7f7d25ffad18 pc=0x45849f
		  | runtime.schedule()
		  | 	/usr/local/go/src/runtime/proc.go:2677 +0x2d7 fp=0x7f7d25ffae68 sp=0x7f7d25ffae00 pc=0x459117
		  | runtime.park_m(0xc0003d2780)
		  | 	/usr/local/go/src/runtime/proc.go:2845 +0x9d fp=0x7f7d25ffae98 sp=0x7f7d25ffae68 pc=0x4596bd
		  | runtime.mcall(0x800000)
		  | 	/usr/local/go/src/runtime/asm_amd64.s:318 +0x5b fp=0x7f7d25ffaea8 sp=0x7f7d25ffae98 pc=0x486ddb
		  | rax    0xca
		  | rbx    0xc000c80000
		  | rcx    0x48aa63
		  | rdx    0x0
		  | rdi    0xc000c80148
		  | rsi    0x80
		  | rbp    0x7f7d25ffaca8
		  | rsp    0x7f7d25ffac60
		  | r8     0x0
		  | r9     0x0
		  | r10    0x0
		  | r11    0x286
		  | r12    0x3
		  | r13    0xc00050ac00
		  | r14    0x0
		  | r15    0x0
		  | rip    0x48aa61
		  | rflags 0x286
		  | cs     0x33
		  | fs     0x0
		  | gs     0x0
		  | bash: line 1:  9405 Aborted                 (core dumped) bash -c "./workload run tpcc --warehouses=5000 --active-warehouses=3875 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' --histograms=perf/warehouses=3875/stats.json"
		  | Error: COMMAND_PROBLEM: exit status 134
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 7. Command with error:
		  |   | ```
		  |   | ./workload run tpcc --warehouses=5000 --active-warehouses=3875 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s {pgurl:1-6} --histograms=perf/warehouses=3875/stats.json
		  |   | ```
		  | Wraps: (3) exit status 134
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (10) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *withstack.withStack (8) *errutil.withPrefix (9) *main.withCommandDetails (10) *exec.ExitError

	test_runner.go:814: test timed out (10h0m0s)

Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@17c55e166b24fb7a6e27c81fcd68b05c7a9b9849:

		  | runtime.futexsleep(0xc000d00148, 0x0, 0xffffffffffffffff)
		  | 	/usr/local/go/src/runtime/os_linux.go:45 +0x46 fp=0x7f84297f9cb8 sp=0x7f84297f9c68 pc=0x44c4a6
		  | runtime.notesleep(0xc000d00148)
		  | 	/usr/local/go/src/runtime/lock_futex.go:159 +0x9f fp=0x7f84297f9cf0 sp=0x7f84297f9cb8 pc=0x424dff
		  | runtime.stopm()
		  | 	/usr/local/go/src/runtime/proc.go:1918 +0xc5 fp=0x7f84297f9d18 sp=0x7f84297f9cf0 pc=0x456cc5
		  | runtime.findrunnable(0xc000057000, 0x0)
		  | 	/usr/local/go/src/runtime/proc.go:2479 +0xa7f fp=0x7f84297f9e00 sp=0x7f84297f9d18 pc=0x45849f
		  | runtime.schedule()
		  | 	/usr/local/go/src/runtime/proc.go:2677 +0x2d7 fp=0x7f84297f9e68 sp=0x7f84297f9e00 pc=0x459117
		  | runtime.park_m(0xc000102900)
		  | 	/usr/local/go/src/runtime/proc.go:2845 +0x9d fp=0x7f84297f9e98 sp=0x7f84297f9e68 pc=0x4596bd
		  | runtime.mcall(0x800000)
		  | 	/usr/local/go/src/runtime/asm_amd64.s:318 +0x5b fp=0x7f84297f9ea8 sp=0x7f84297f9e98 pc=0x486ddb
		  | rax    0xca
		  | rbx    0xc000d00000
		  | rcx    0x48aa63
		  | rdx    0x0
		  | rdi    0xc000d00148
		  | rsi    0x80
		  | rbp    0x7f84297f9ca8
		  | rsp    0x7f84297f9c60
		  | r8     0x0
		  | r9     0x0
		  | r10    0x0
		  | r11    0x286
		  | r12    0x3
		  | r13    0xc000802c00
		  | r14    0x0
		  | r15    0x0
		  | rip    0x48aa61
		  | rflags 0x286
		  | cs     0x33
		  | fs     0x0
		  | gs     0x0
		  | bash: line 1:  6671 Aborted                 (core dumped) bash -c "./workload run tpcc --warehouses=5000 --active-warehouses=2675 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' --histograms=perf/warehouses=2675/stats.json"
		  | Error: COMMAND_PROBLEM: exit status 134
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 7. Command with error:
		  |   | ```
		  |   | ./workload run tpcc --warehouses=5000 --active-warehouses=2675 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s {pgurl:1-6} --histograms=perf/warehouses=2675/stats.json
		  |   | ```
		  | Wraps: (3) exit status 134
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (10) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *withstack.withStack (8) *errutil.withPrefix (9) *main.withCommandDetails (10) *exec.ExitError

	test_runner.go:814: test timed out (10h0m0s)

Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@a3063889b89770d9ecca6d7b83f19998de5bfc7a:

		  -- stack trace:
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2698
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.runTPCCBench.func3.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:834
		  | [...repeated from below...]
		Wraps: (6) error running tpcc load generator
		Wraps: (7) attached stack trace
		  -- stack trace:
		  | main.(*cluster).RunE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2290
		  | main.runTPCCBench.func3.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:831
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2632
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (8) output in run_132401.465_n7_workload_run_tpcc
		Wraps: (9) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2471900-1606288069-67-n7cpu16-geo:7 -- ./workload run tpcc --warehouses=5000 --active-warehouses=2675 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s {pgurl:1-6} --histograms=perf/warehouses=2675/stats.json returned
		  | stderr:
		  | I201125 13:24:02.934335 1 workload/cli/run.go:356  creating load generator...
		  | E201125 21:04:30.325084 1 workload/cli/run.go:379  Attempt to create load generator failed. It's been more than 1h0m0s since we started trying to create the load generator so we're giving up. Last failure: failed to initialize the load generator: preparing 
		  | 		UPDATE district
		  | 		SET d_next_o_id = d_next_o_id + 1
		  | 		WHERE d_w_id = $1 AND d_id = $2
		  | 		RETURNING d_tax, d_next_o_id: context deadline exceeded
		  | Error: failed to initialize the load generator: preparing 
		  | 		UPDATE district
		  | 		SET d_next_o_id = d_next_o_id + 1
		  | 		WHERE d_w_id = $1 AND d_id = $2
		  | 		RETURNING d_tax, d_next_o_id: context deadline exceeded
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 7. Command with error:
		  |   | ```
		  |   | ./workload run tpcc --warehouses=5000 --active-warehouses=2675 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s {pgurl:1-6} --histograms=perf/warehouses=2675/stats.json
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		  | Initializing 5350 connections...
		  | Initializing 26750 workers and preparing statements...
		Wraps: (10) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *withstack.withStack (8) *errutil.withPrefix (9) *main.withCommandDetails (10) *exec.ExitError

Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@d9cf6cc8d0bc86468df889755426bafeadea7e36:

		  -- stack trace:
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2698
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.runTPCCBench.func3.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:834
		  | [...repeated from below...]
		Wraps: (6) error running tpcc load generator
		Wraps: (7) attached stack trace
		  -- stack trace:
		  | main.(*cluster).RunE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2290
		  | main.runTPCCBench.func3.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:831
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2632
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (8) output in run_132628.264_n7_workload_run_tpcc
		Wraps: (9) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2475252-1606375053-66-n7cpu16-geo:7 -- ./workload run tpcc --warehouses=5000 --active-warehouses=2675 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s {pgurl:1-6} --histograms=perf/warehouses=2675/stats.json returned
		  | stderr:
		  | I201126 13:26:29.737321 1 workload/cli/run.go:356  creating load generator...
		  | E201126 21:28:38.362002 1 workload/cli/run.go:379  Attempt to create load generator failed. It's been more than 1h0m0s since we started trying to create the load generator so we're giving up. Last failure: failed to initialize the load generator: preparing 
		  | 		UPDATE district
		  | 		SET d_next_o_id = d_next_o_id + 1
		  | 		WHERE d_w_id = $1 AND d_id = $2
		  | 		RETURNING d_tax, d_next_o_id: context deadline exceeded
		  | Error: failed to initialize the load generator: preparing 
		  | 		UPDATE district
		  | 		SET d_next_o_id = d_next_o_id + 1
		  | 		WHERE d_w_id = $1 AND d_id = $2
		  | 		RETURNING d_tax, d_next_o_id: context deadline exceeded
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 7. Command with error:
		  |   | ```
		  |   | ./workload run tpcc --warehouses=5000 --active-warehouses=2675 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s {pgurl:1-6} --histograms=perf/warehouses=2675/stats.json
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		  | Initializing 5350 connections...
		  | Initializing 26750 workers and preparing statements...
		Wraps: (10) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *withstack.withStack (8) *errutil.withPrefix (9) *main.withCommandDetails (10) *exec.ExitError

Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash
powered by pkg/cmd/internal/issues

@nvanbenschoten

Failing in the same way as #55688. @rafiss, have you had a chance to look into that failure? If so, were you able to isolate it at all?

@cockroach-teamcity

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@745f1a1f5a603fb03357790e416e2560aabcc4c3:

		  | runtime.futexsleep(0xc0004d5548, 0x0, 0xffffffffffffffff)
		  | 	/usr/local/go/src/runtime/os_linux.go:45 +0x46 fp=0x7ff65b7fdcb8 sp=0x7ff65b7fdc68 pc=0x44c4a6
		  | runtime.notesleep(0xc0004d5548)
		  | 	/usr/local/go/src/runtime/lock_futex.go:159 +0x9f fp=0x7ff65b7fdcf0 sp=0x7ff65b7fdcb8 pc=0x424dff
		  | runtime.stopm()
		  | 	/usr/local/go/src/runtime/proc.go:1918 +0xc5 fp=0x7ff65b7fdd18 sp=0x7ff65b7fdcf0 pc=0x456cc5
		  | runtime.findrunnable(0xc000066000, 0x0)
		  | 	/usr/local/go/src/runtime/proc.go:2479 +0xa7f fp=0x7ff65b7fde00 sp=0x7ff65b7fdd18 pc=0x45849f
		  | runtime.schedule()
		  | 	/usr/local/go/src/runtime/proc.go:2677 +0x2d7 fp=0x7ff65b7fde68 sp=0x7ff65b7fde00 pc=0x459117
		  | runtime.park_m(0xc0004e8300)
		  | 	/usr/local/go/src/runtime/proc.go:2845 +0x9d fp=0x7ff65b7fde98 sp=0x7ff65b7fde68 pc=0x4596bd
		  | runtime.mcall(0x800000)
		  | 	/usr/local/go/src/runtime/asm_amd64.s:318 +0x5b fp=0x7ff65b7fdea8 sp=0x7ff65b7fde98 pc=0x486ddb
		  | rax    0xca
		  | rbx    0xc0004d5400
		  | rcx    0x48aa63
		  | rdx    0x0
		  | rdi    0xc0004d5548
		  | rsi    0x80
		  | rbp    0x7ff65b7fdca8
		  | rsp    0x7ff65b7fdc60
		  | r8     0x0
		  | r9     0x0
		  | r10    0x0
		  | r11    0x286
		  | r12    0x3
		  | r13    0xc000583200
		  | r14    0x0
		  | r15    0x0
		  | rip    0x48aa61
		  | rflags 0x286
		  | cs     0x33
		  | fs     0x0
		  | gs     0x0
		  | bash: line 1:  6840 Aborted                 (core dumped) bash -c "./workload run tpcc --warehouses=5000 --active-warehouses=2875 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' --histograms=perf/warehouses=2875/stats.json"
		  | Error: COMMAND_PROBLEM: exit status 134
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 7. Command with error:
		  |   | ```
		  |   | ./workload run tpcc --warehouses=5000 --active-warehouses=2875 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s {pgurl:1-6} --histograms=perf/warehouses=2875/stats.json
		  |   | ```
		  | Wraps: (3) exit status 134
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (10) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *withstack.withStack (8) *errutil.withPrefix (9) *main.withCommandDetails (10) *exec.ExitError

	test_runner.go:814: test timed out (10h0m0s)

Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash
powered by pkg/cmd/internal/issues

@rafiss

rafiss commented Dec 3, 2020

No, I have not yet been able to look into it. But it seems both #55688 and this issue are now failing because of some other problem that causes a crash.

/home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2488192-1606892428-67-n7cpu16-geo:7 -- ./workload run tpcc --warehouses=5000 --active-warehouses=2575 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s {pgurl:1-6} --histograms=perf/warehouses=2575/stats.json
I201202 12:56:50.968009 1 workload/cli/run.go:357 ⋮ creating load generator...
SIGSEGV: segmentation violation

...

goroutine 1 [semacquire, 514 minutes]:
runtime.gopark(0x3addaa8, 0x5d7dfe0, 0xc000d31912, 0x4)
	/usr/local/go/src/runtime/proc.go:306 +0xe5 fp=0xc000577010 sp=0xc000576ff0 pc=0x452fa5
runtime.goparkunlock(...)
	/usr/local/go/src/runtime/proc.go:312
runtime.semacquire1(0xc000d159c0, 0xc00103f000, 0x1, 0x0)
	/usr/local/go/src/runtime/sema.go:144 +0x1c5 fp=0xc000577078 sp=0xc000577010 pc=0x463ac5
sync.runtime_Semacquire(0xc000d159c0)
	/usr/local/go/src/runtime/sema.go:56 +0x45 fp=0xc0005770a8 sp=0xc000577078 pc=0x4849a5
sync.(*WaitGroup).Wait(0xc000d159b8)
	/usr/local/go/src/sync/waitgroup.go:130 +0x65 fp=0xc0005770d0 sp=0xc0005770a8 pc=0x4944a5
golang.org/x/sync/errgroup.(*Group).Wait(0xc000d159b0, 0xc000d1b160, 0xc00103f148)
	/home/agent/work/.go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:40 +0x31 fp=0xc0005770e8 sp=0xc0005770d0 pc=0xbdf351
github.com/cockroachdb/cockroach/pkg/workload/tpcc.scatterRanges(0xc000fac270, 0x0, 0x5)
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/workload/tpcc/ddls.go:219 +0x285 fp=0xc0005771f8 sp=0xc0005770e8 pc=0xdb4045
github.com/cockroachdb/cockroach/pkg/workload/tpcc.(*tpcc).partitionAndScatterWithDB(0xc000583e00, 0xc000fac270, 0xc0000d6780, 0x1cd)
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/workload/tpcc/tpcc.go:741 +0x65 fp=0xc000577268 sp=0xc0005771f8 pc=0xdcc2c5
github.com/cockroachdb/cockroach/pkg/workload/tpcc.(*tpcc).partitionAndScatter(0xc000583e00, 0xc000d491e0, 0x6, 0xd, 0x0, 0x0)
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/workload/tpcc/tpcc.go:720 +0xf4 fp=0xc0005772c8 sp=0xc000577268 pc=0xdcc1f4
github.com/cockroachdb/cockroach/pkg/workload/tpcc.(*tpcc).Ops(0xc000583e00, 0x3d91e80, 0xc000d30300, 0xc000d491e0, 0x6, 0xd, 0xc000f4c040, 0x0, 0x0, 0x0, ...)
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/workload/tpcc/tpcc.go:585 +0x8d fp=0xc000577560 sp=0xc0005772c8 pc=0xdcb1ed
github.com/cockroachdb/cockroach/pkg/workload/cli.runRun.func1(0x7fa86c6415f8, 0xc000583e00, 0xc000d491e0, 0x6, 0xd, 0xc000f4c040, 0xc000f4c080, 0x3d91e80, 0xc000d30300, 0x1, ...)
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/workload/cli/run.go:367 +0x1e4 fp=0xc0005776d0 sp=0xc000577560 pc=0xdad524
github.com/cockroachdb/cockroach/pkg/workload/cli.runRun(0x3d73640, 0xc000583e00, 0xc000d491e0, 0x6, 0xd, 0x1f42a82, 0x4, 0x0, 0x0)
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/workload/cli/run.go:384 +0x5ef fp=0xc000577c30 sp=0xc0005776d0 pc=0xdaa84f
github.com/cockroachdb/cockroach/pkg/workload/cli.CmdHelper.func1(0xc000f6c780, 0xc000d491e0, 0x6, 0xd, 0xc000f6e800, 0x1f4252a)
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/workload/cli/run.go:184 +0x246 fp=0xc000577d28 sp=0xc000577c30 pc=0xdacf66
github.com/cockroachdb/cockroach/pkg/workload/cli.HandleErrs.func1(0xc000f6c780, 0xc000d491e0, 0x6, 0xd)
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/workload/cli/cli.go:74 +0x5d fp=0xc000577da8 sp=0xc000577d28 pc=0xdac11d

@cockroach-teamcity

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@a40588bf7114e16bf7e54f3b8dc64788a04225bc:

		  |
		  | goroutine 0 [idle]:
		  | runtime.epollwait(0x7fa800000004, 0x7fa854ff8708, 0x3e700000080, 0x0, 0x3e7, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
		  | 	/usr/local/go/src/runtime/sys_linux_amd64.s:725 +0x20 fp=0x7fa854ff86c0 sp=0x7fa854ff86b8 pc=0x48ac80
		  | runtime.netpoll(0x3b91832f, 0x216396681f01)
		  | 	/usr/local/go/src/runtime/netpoll_epoll.go:126 +0x92 fp=0x7fa854ff8d18 sp=0x7fa854ff86c0 pc=0x44c172
		  | runtime.findrunnable(0xc00004a800, 0x0)
		  | 	/usr/local/go/src/runtime/proc.go:2442 +0x785 fp=0x7fa854ff8e00 sp=0x7fa854ff8d18 pc=0x4581a5
		  | runtime.schedule()
		  | 	/usr/local/go/src/runtime/proc.go:2677 +0x2d7 fp=0x7fa854ff8e68 sp=0x7fa854ff8e00 pc=0x459117
		  | runtime.park_m(0xc000583980)
		  | 	/usr/local/go/src/runtime/proc.go:2845 +0x9d fp=0x7fa854ff8e98 sp=0x7fa854ff8e68 pc=0x4596bd
		  | runtime.mcall(0x800000)
		  | 	/usr/local/go/src/runtime/asm_amd64.s:318 +0x5b fp=0x7fa854ff8ea8 sp=0x7fa854ff8e98 pc=0x486ddb
		  | rax    0xfffffffffffffffc
		  | rbx    0x3e7
		  | rcx    0x48ac80
		  | rdx    0x80
		  | rdi    0x4
		  | rsi    0x7fa854ff8708
		  | rbp    0x7fa854ff8d08
		  | rsp    0x7fa854ff86b8
		  | r8     0x0
		  | r9     0x0
		  | r10    0x3e7
		  | r11    0x246
		  | r12    0x3
		  | r13    0xc000702780
		  | r14    0x0
		  | r15    0xc0000a1860
		  | rip    0x48ac80
		  | rflags 0x246
		  | cs     0x33
		  | fs     0x0
		  | gs     0x0
		  | bash: line 1:  6596 Aborted                 (core dumped) bash -c "./workload run tpcc --warehouses=5000 --active-warehouses=2675 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' --histograms=perf/warehouses=2675/stats.json"
		  | Error: COMMAND_PROBLEM: exit status 134
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 7. Command with error:
		  |   | ```
		  |   | ./workload run tpcc --warehouses=5000 --active-warehouses=2675 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s {pgurl:1-6} --histograms=perf/warehouses=2675/stats.json
		  |   | ```
		  | Wraps: (3) exit status 134
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (10) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *withstack.withStack (8) *errutil.withPrefix (9) *main.withCommandDetails (10) *exec.ExitError

	test_runner.go:814: test timed out (10h0m0s)

Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash
powered by pkg/cmd/internal/issues

@rafiss

rafiss commented Dec 3, 2020

Looking at the latest failure in #55817 (comment)

logs/3.unredacted/cockroach.log shows this happening thousands of times:

I201203 21:37:49.002792 967649 kv/kvserver/raft_snapshot_queue.go:129 ⋮ [n3,raftsnapshot,s3,r1144/2:‹/Table/60/1/24{09/"{…-31/"|…}›] skipping snapshot; replica is likely a learner in the process of being added: (n6,s6):27LEARNER

Then there are a couple hundred of these:

W201203 21:50:23.323730 11571 kv/kvserver/replica_command.go:2437 ⋮ [n3,s3,r5220/1:‹/Table/53/1/{598/52…-600/92…}›] failed to scatter lease to s1: context canceled
(1) context canceled
Error types: (1) *errors.errorString
I201203 21:50:23.324807 10909 kv/kvserver/replica_command.go:1046 ⋮ [n3,s3,r2423/2:‹/Table/55/1/12{76/9/…-84/5/…}›] could not promote ‹[n6,s6]› to voter, rolling back: ‹snapshot failed: rpc error: code = Canceled desc = context canceled›
W201203 21:50:23.324662 10342 kv/kvclient/kvcoord/dist_sender.go:1499 ⋮ [n3,client=‹10.128.1.74:44688›,hostnossl,user=root] slow range RPC: ‹have been waiting 30851.68s (1 attempts) for RPC AdminScatter [/Table/55/1/965/4/2566,/Table/55/1/972/10/2372) to r864:/Table/55/1/9{65/4/2566-72/10/2372} [(n5,s5):1, (n3,s3):2, (n2,s2):3, next=4, gen=72, sticky=1607000019.459262326,0]; resp: (err: <nil>), *roachpb.AdminScatterResponse›

There are also a few of these:

E201203 20:39:07.192010 854235 kv/kvserver/queue.go:1087 ⋮ [n3,merge,s3,r65/2:‹/Table/60{-/1/29/"\…}›] ‹operation "merge queue process replica 65" timed out after 6m20s›
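
Counts like "thousands" and "a couple hundred" can be tallied mechanically from the log; a small sketch, with the file path and substrings taken from the lines quoted above:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("logs/3.unredacted/cockroach.log")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Substrings from the log lines quoted above.
	patterns := []string{
		"skipping snapshot; replica is likely a learner",
		"failed to scatter lease",
		"merge queue process replica",
	}
	counts := make(map[string]int)

	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 0, 1<<20), 1<<20) // some log lines are very long
	for sc.Scan() {
		for _, p := range patterns {
			if strings.Contains(sc.Text(), p) {
				counts[p]++
			}
		}
	}
	if err := sc.Err(); err != nil {
		log.Fatal(err)
	}
	for _, p := range patterns {
		fmt.Printf("%7d  %s\n", counts[p], p)
	}
}
```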

@cockroach-teamcity

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@4fced82f7e6dad8e3ed67ea2154582f3c4188340:

		  -- stack trace:
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2698
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.runTPCCBench.func3.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:834
		  | [...repeated from below...]
		Wraps: (6) error running tpcc load generator
		Wraps: (7) attached stack trace
		  -- stack trace:
		  | main.(*cluster).RunE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2290
		  | main.runTPCCBench.func3.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:831
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2632
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (8) output in run_125433.309_n7_workload_run_tpcc
		Wraps: (9) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2498186-1607152081-64-n7cpu16-geo:7 -- ./workload run tpcc --warehouses=5000 --active-warehouses=2575 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s {pgurl:1-6} --histograms=perf/warehouses=2575/stats.json returned
		  | stderr:
		  | I201205 12:54:34.816552 1 workload/cli/run.go:356  creating load generator...
		  | E201205 15:11:58.668136 1 workload/cli/run.go:379  Attempt to create load generator failed. It's been more than 1h0m0s since we started trying to create the load generator so we're giving up. Last failure: failed to initialize the load generator: preparing 
		  | 		UPDATE district
		  | 		SET d_next_o_id = d_next_o_id + 1
		  | 		WHERE d_w_id = $1 AND d_id = $2
		  | 		RETURNING d_tax, d_next_o_id: context deadline exceeded
		  | Error: failed to initialize the load generator: preparing 
		  | 		UPDATE district
		  | 		SET d_next_o_id = d_next_o_id + 1
		  | 		WHERE d_w_id = $1 AND d_id = $2
		  | 		RETURNING d_tax, d_next_o_id: context deadline exceeded
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 7. Command with error:
		  |   | ```
		  |   | ./workload run tpcc --warehouses=5000 --active-warehouses=2575 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s {pgurl:1-6} --histograms=perf/warehouses=2575/stats.json
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		  | Initializing 5150 connections...
		  | Initializing 25750 workers and preparing statements...
		Wraps: (10) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *withstack.withStack (8) *errutil.withPrefix (9) *main.withCommandDetails (10) *exec.ExitError


Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@c15afb605a18772689caf46c3dd74c4aca33badd:

		  -- stack trace:
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2698
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.runTPCCBench.func3.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:834
		  | [...repeated from below...]
		Wraps: (6) error running tpcc load generator
		Wraps: (7) attached stack trace
		  -- stack trace:
		  | main.(*cluster).RunE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2290
		  | main.runTPCCBench.func3.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:831
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2632
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (8) output in run_130805.151_n7_workload_run_tpcc
		Wraps: (9) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2500126-1607324473-68-n7cpu16-geo:7 -- ./workload run tpcc --warehouses=5000 --active-warehouses=2675 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s {pgurl:1-6} --histograms=perf/warehouses=2675/stats.json returned
		  | stderr:
		  | I201207 13:08:06.646590 1 workload/cli/run.go:356  creating load generator...
		  | E201207 18:44:35.082311 1 workload/cli/run.go:379  Attempt to create load generator failed. It's been more than 1h0m0s since we started trying to create the load generator so we're giving up. Last failure: failed to initialize the load generator: preparing 
		  | 		UPDATE district
		  | 		SET d_next_o_id = d_next_o_id + 1
		  | 		WHERE d_w_id = $1 AND d_id = $2
		  | 		RETURNING d_tax, d_next_o_id: context deadline exceeded
		  | Error: failed to initialize the load generator: preparing 
		  | 		UPDATE district
		  | 		SET d_next_o_id = d_next_o_id + 1
		  | 		WHERE d_w_id = $1 AND d_id = $2
		  | 		RETURNING d_tax, d_next_o_id: context deadline exceeded
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 7. Command with error:
		  |   | ```
		  |   | ./workload run tpcc --warehouses=5000 --active-warehouses=2675 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s {pgurl:1-6} --histograms=perf/warehouses=2675/stats.json
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		  | Initializing 5350 connections...
		  | Initializing 26750 workers and preparing statements...
		Wraps: (10) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *withstack.withStack (8) *errutil.withPrefix (9) *main.withCommandDetails (10) *exec.ExitError


Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@nvanbenschoten removed the release-blocker label Dec 7, 2020
@cockroach-teamcity

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@89dd2d14c29787378c434ed54937757ef5d9877c:

		  | runtime.futexsleep(0xc000c0a148, 0x0, 0xffffffffffffffff)
		  | 	/usr/local/go/src/runtime/os_linux.go:45 +0x46 fp=0x7f177d7f9cb8 sp=0x7f177d7f9c68 pc=0x44c4a6
		  | runtime.notesleep(0xc000c0a148)
		  | 	/usr/local/go/src/runtime/lock_futex.go:159 +0x9f fp=0x7f177d7f9cf0 sp=0x7f177d7f9cb8 pc=0x424dff
		  | runtime.stopm()
		  | 	/usr/local/go/src/runtime/proc.go:1918 +0xc5 fp=0x7f177d7f9d18 sp=0x7f177d7f9cf0 pc=0x456cc5
		  | runtime.findrunnable(0xc00004f800, 0x0)
		  | 	/usr/local/go/src/runtime/proc.go:2479 +0xa7f fp=0x7f177d7f9e00 sp=0x7f177d7f9d18 pc=0x45849f
		  | runtime.schedule()
		  | 	/usr/local/go/src/runtime/proc.go:2677 +0x2d7 fp=0x7f177d7f9e68 sp=0x7f177d7f9e00 pc=0x459117
		  | runtime.park_m(0xc000102a80)
		  | 	/usr/local/go/src/runtime/proc.go:2845 +0x9d fp=0x7f177d7f9e98 sp=0x7f177d7f9e68 pc=0x4596bd
		  | runtime.mcall(0x800000)
		  | 	/usr/local/go/src/runtime/asm_amd64.s:318 +0x5b fp=0x7f177d7f9ea8 sp=0x7f177d7f9e98 pc=0x486ddb
		  | rax    0xca
		  | rbx    0xc000c0a000
		  | rcx    0x48aa63
		  | rdx    0x0
		  | rdi    0xc000c0a148
		  | rsi    0x80
		  | rbp    0x7f177d7f9ca8
		  | rsp    0x7f177d7f9c60
		  | r8     0x0
		  | r9     0x0
		  | r10    0x0
		  | r11    0x286
		  | r12    0x3
		  | r13    0xc0002c8900
		  | r14    0x0
		  | r15    0x0
		  | rip    0x48aa61
		  | rflags 0x286
		  | cs     0x33
		  | fs     0x0
		  | gs     0x0
		  | bash: line 1:  8597 Aborted                 (core dumped) bash -c "./workload run tpcc --warehouses=5000 --active-warehouses=3875 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' --histograms=perf/warehouses=3875/stats.json"
		  | Error: COMMAND_PROBLEM: exit status 134
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 7. Command with error:
		  |   | ```
		  |   | ./workload run tpcc --warehouses=5000 --active-warehouses=3875 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s {pgurl:1-6} --histograms=perf/warehouses=3875/stats.json
		  |   | ```
		  | Wraps: (3) exit status 134
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (10) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *withstack.withStack (8) *errutil.withPrefix (9) *main.withCommandDetails (10) *exec.ExitError

	test_runner.go:814: test timed out (10h0m0s)


Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@0de15384d560d1b4f09315d88204b6b2d7dfb32c:

		  | runtime.futexsleep(0xc000c08148, 0x0, 0xffffffffffffffff)
		  | 	/usr/local/go/src/runtime/os_linux.go:45 +0x46 fp=0x7f1359ffacb8 sp=0x7f1359ffac68 pc=0x44c4a6
		  | runtime.notesleep(0xc000c08148)
		  | 	/usr/local/go/src/runtime/lock_futex.go:159 +0x9f fp=0x7f1359ffacf0 sp=0x7f1359ffacb8 pc=0x424dff
		  | runtime.stopm()
		  | 	/usr/local/go/src/runtime/proc.go:1918 +0xc5 fp=0x7f1359ffad18 sp=0x7f1359ffacf0 pc=0x456cc5
		  | runtime.findrunnable(0xc00006b000, 0x0)
		  | 	/usr/local/go/src/runtime/proc.go:2479 +0xa7f fp=0x7f1359ffae00 sp=0x7f1359ffad18 pc=0x45849f
		  | runtime.schedule()
		  | 	/usr/local/go/src/runtime/proc.go:2677 +0x2d7 fp=0x7f1359ffae68 sp=0x7f1359ffae00 pc=0x459117
		  | runtime.park_m(0xc0002c6780)
		  | 	/usr/local/go/src/runtime/proc.go:2845 +0x9d fp=0x7f1359ffae98 sp=0x7f1359ffae68 pc=0x4596bd
		  | runtime.mcall(0x800000)
		  | 	/usr/local/go/src/runtime/asm_amd64.s:318 +0x5b fp=0x7f1359ffaea8 sp=0x7f1359ffae98 pc=0x486ddb
		  | rax    0xca
		  | rbx    0xc000c08000
		  | rcx    0x48aa63
		  | rdx    0x0
		  | rdi    0xc000c08148
		  | rsi    0x80
		  | rbp    0x7f1359ffaca8
		  | rsp    0x7f1359ffac60
		  | r8     0x0
		  | r9     0x0
		  | r10    0x0
		  | r11    0x286
		  | r12    0x3
		  | r13    0xc000582780
		  | r14    0x0
		  | r15    0x0
		  | rip    0x48aa61
		  | rflags 0x286
		  | cs     0x33
		  | fs     0x0
		  | gs     0x0
		  | bash: line 1:  7004 Aborted                 (core dumped) bash -c "./workload run tpcc --warehouses=5000 --active-warehouses=2675 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' --histograms=perf/warehouses=2675/stats.json"
		  | Error: COMMAND_PROBLEM: exit status 134
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 7. Command with error:
		  |   | ```
		  |   | ./workload run tpcc --warehouses=5000 --active-warehouses=2675 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s {pgurl:1-6} --histograms=perf/warehouses=2675/stats.json
		  |   | ```
		  | Wraps: (3) exit status 134
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (10) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *withstack.withStack (8) *errutil.withPrefix (9) *main.withCommandDetails (10) *exec.ExitError

	test_runner.go:814: test timed out (10h0m0s)


Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@82e9f59f73098e3f6d5a0684526216df456df690:

		  -- stack trace:
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2698
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.runTPCCBench.func3.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:834
		  | [...repeated from below...]
		Wraps: (6) error running tpcc load generator
		Wraps: (7) attached stack trace
		  -- stack trace:
		  | main.(*cluster).RunE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2290
		  | main.runTPCCBench.func3.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:831
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2632
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (8) output in run_130836.502_n7_workload_run_tpcc
		Wraps: (9) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2515143-1607757097-66-n7cpu16-geo:7 -- ./workload run tpcc --warehouses=5000 --active-warehouses=2575 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s {pgurl:1-6} --histograms=perf/warehouses=2575/stats.json returned
		  | stderr:
		  | I201212 13:08:37.991719 1 workload/cli/run.go:356  creating load generator...
		  | E201212 19:06:39.190854 1 workload/cli/run.go:379  Attempt to create load generator failed. It's been more than 1h0m0s since we started trying to create the load generator so we're giving up. Last failure: failed to initialize the load generator: preparing 
		  | 		UPDATE district
		  | 		SET d_next_o_id = d_next_o_id + 1
		  | 		WHERE d_w_id = $1 AND d_id = $2
		  | 		RETURNING d_tax, d_next_o_id: context deadline exceeded
		  | Error: failed to initialize the load generator: preparing 
		  | 		UPDATE district
		  | 		SET d_next_o_id = d_next_o_id + 1
		  | 		WHERE d_w_id = $1 AND d_id = $2
		  | 		RETURNING d_tax, d_next_o_id: context deadline exceeded
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 7. Command with error:
		  |   | ```
		  |   | ./workload run tpcc --warehouses=5000 --active-warehouses=2575 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s {pgurl:1-6} --histograms=perf/warehouses=2575/stats.json
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		  | Initializing 5150 connections...
		  | Initializing 25750 workers and preparing statements...
		Wraps: (10) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *withstack.withStack (8) *errutil.withPrefix (9) *main.withCommandDetails (10) *exec.ExitError


Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@2308ed0493a3f0b77f1e0412a6e18a4ebe0fb7d2:

		  | runtime.futexsleep(0xc000f00548, 0x0, 0xffffffffffffffff)
		  | 	/usr/local/go/src/runtime/os_linux.go:45 +0x46 fp=0x7fb42effccb8 sp=0x7fb42effcc68 pc=0x44c4a6
		  | runtime.notesleep(0xc000f00548)
		  | 	/usr/local/go/src/runtime/lock_futex.go:159 +0x9f fp=0x7fb42effccf0 sp=0x7fb42effccb8 pc=0x424dff
		  | runtime.stopm()
		  | 	/usr/local/go/src/runtime/proc.go:1924 +0xc5 fp=0x7fb42effcd18 sp=0x7fb42effccf0 pc=0x456cc5
		  | runtime.findrunnable(0xc00006d000, 0x0)
		  | 	/usr/local/go/src/runtime/proc.go:2485 +0xa7f fp=0x7fb42effce00 sp=0x7fb42effcd18 pc=0x45849f
		  | runtime.schedule()
		  | 	/usr/local/go/src/runtime/proc.go:2683 +0x2d7 fp=0x7fb42effce68 sp=0x7fb42effce00 pc=0x459117
		  | runtime.park_m(0xc0004aa300)
		  | 	/usr/local/go/src/runtime/proc.go:2851 +0x9d fp=0x7fb42effce98 sp=0x7fb42effce68 pc=0x4596bd
		  | runtime.mcall(0x800000)
		  | 	/usr/local/go/src/runtime/asm_amd64.s:318 +0x5b fp=0x7fb42effcea8 sp=0x7fb42effce98 pc=0x486ddb
		  | rax    0xca
		  | rbx    0xc000f00400
		  | rcx    0x48aa63
		  | rdx    0x0
		  | rdi    0xc000f00548
		  | rsi    0x80
		  | rbp    0x7fb42effcca8
		  | rsp    0x7fb42effcc60
		  | r8     0x0
		  | r9     0x0
		  | r10    0x0
		  | r11    0x286
		  | r12    0x3
		  | r13    0xc000602780
		  | r14    0x0
		  | r15    0x0
		  | rip    0x48aa61
		  | rflags 0x286
		  | cs     0x33
		  | fs     0x0
		  | gs     0x0
		  | bash: line 1:  6824 Aborted                 (core dumped) bash -c "./workload run tpcc --warehouses=5000 --active-warehouses=2875 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' 'postgres://[email protected]:26257?sslmode=disable' --histograms=perf/warehouses=2875/stats.json"
		  | Error: COMMAND_PROBLEM: exit status 134
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 7. Command with error:
		  |   | ```
		  |   | ./workload run tpcc --warehouses=5000 --active-warehouses=2875 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s {pgurl:1-6} --histograms=perf/warehouses=2875/stats.json
		  |   | ```
		  | Wraps: (3) exit status 134
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (10) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *withstack.withStack (8) *errutil.withPrefix (9) *main.withCommandDetails (10) *exec.ExitError

	test_runner.go:814: test timed out (10h0m0s)


Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

craig bot pushed a commit that referenced this issue Dec 17, 2020
58014: roachtest/tpcc: don't scatter on each tpccbench search iteration r=ajwerner a=nvanbenschoten

Fixes #48255.
Fixes #53443.
Fixes #54258.
Fixes #54570.
Fixes #55599.
Fixes #55688.
Fixes #55817.
Fixes #55939.
Fixes #56996.
Fixes #57062.
Fixes #57864.

This needs to be backported to `release-20.1` and `release-20.2`

In #55688 (comment),
we saw that the failures to create load generators in tpccbench were due to
long-running SCATTER operations. These operations weren't stuck, but were very
slow due to the amount of data being moved and the 2MiB/s limit on snapshots. In
hindsight, this should have been expected, as scatter has the potential to
rebalance data and was being run on datasets on the order of 100s of GBs or even
TBs in size.

But this alone did not explain why we used to see this issue infrequently and
only recently began seeing it regularly. We determined that the most likely
reason this has recently gotten worse is #56942. That PR fixed a
race condition in tpcc's `scatterRanges` function which often resulted in 9
scatters of the `warehouse` table instead of 1 scatter of each table in the
database. So before this PR, we were often (but not always, due to the racy
nature of the bug) avoiding the scatter on all but the dataset's smallest table.
After this PR, we were always scattering all 9 tables in the dataset, leading to
much larger rebalancing.
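
To make the race concrete, here is a minimal Go sketch of the classic loop-variable capture bug that can produce exactly this behavior: one table scattered nine times instead of nine tables scattered once. This is a hypothetical reconstruction, not the actual `scatterRanges` code; the table list and function shape are assumptions, and #56942 contains the real fix.

```go
package tpcc

import (
	"database/sql"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// scatterRanges scatters every TPC-C table concurrently. Names and structure
// are illustrative, not the real workload code.
func scatterRanges(db *sql.DB) error {
	tables := []string{
		"warehouse", "district", "customer", "history", "order",
		"new_order", "item", "stock", "order_line",
	}
	var g errgroup.Group
	for _, table := range tables {
		// BUG (pre-Go 1.22 semantics): a closure that captures the shared
		// loop variable sees whatever value it holds when the goroutine
		// actually runs, so all nine goroutines can scatter the same table:
		//
		//   g.Go(func() error {
		//       _, err := db.Exec(fmt.Sprintf("ALTER TABLE %s SCATTER", table))
		//       return err
		//   })
		//
		// FIX: rebind the variable so each closure gets its own copy.
		table := table
		g.Go(func() error {
			_, err := db.Exec(fmt.Sprintf("ALTER TABLE %s SCATTER", table))
			return err
		})
	}
	return g.Wait()
}
```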

To address these issues, this commit removes the per-iteration scattering in
tpccbench. Scattering on each search iteration was a misguided decision. It
wasn't needed because we already scatter once during dataset import (with a
higher `kv.snapshot_rebalance.max_rate`). It was also disruptive as it had the
potential to slow down the test significantly and cause issues like the one we
are fixing here.

With this change, I've seen `tpccbench/nodes=6/cpu=16/multi-az` go from failing
6 out of 10 times to succeeding 10 out of 10 times. This change appears to have
no impact on performance.

Co-authored-by: Nathan VanBenschoten <[email protected]>
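
The "scatter once during dataset import with a higher `kv.snapshot_rebalance.max_rate`" pattern referenced above can be sketched as follows, reusing the package and imports from the previous sketch. The setting name and the SCATTER statement are real CockroachDB features (2MiB/s was the throttle called out in the commit message), but the 64MiB value, the single-table statement, and the function shape are illustrative assumptions, not the roachtest code.

```go
// scatterOnceDuringImport sketches the import-time pattern described above:
// temporarily raise the snapshot rebalance rate so a one-time scatter
// completes quickly, then restore the slower default before the benchmark
// runs. Values and shape are assumptions.
func scatterOnceDuringImport(db *sql.DB) error {
	// Bump the rebalance throttle for the bulk data movement.
	if _, err := db.Exec(`SET CLUSTER SETTING kv.snapshot_rebalance.max_rate = '64MiB'`); err != nil {
		return err
	}
	// One scatter at import time, instead of one per search iteration.
	if _, err := db.Exec(`ALTER TABLE warehouse SCATTER`); err != nil {
		return err
	}
	// Restore the (then) default so the measured run uses normal settings.
	_, err := db.Exec(`SET CLUSTER SETTING kv.snapshot_rebalance.max_rate = '2MiB'`)
	return err
}
```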
craig bot pushed a commit that referenced this issue Dec 17, 2020
58014: roachtest/tpcc: don't scatter on each tpccbench search iteration r=nvanbenschoten a=nvanbenschoten

craig bot closed this as completed in #58014 on Dec 17, 2020
craig bot closed this as completed in 9dc433d on Dec 17, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Dec 17, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Dec 17, 2020