Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: tpccbench/nodes=9/cpu=4/multi-region failed #52206

Closed
cockroach-teamcity opened this issue Jul 31, 2020 · 7 comments
Closed

roachtest: tpccbench/nodes=9/cpu=4/multi-region failed #52206

cockroach-teamcity opened this issue Jul 31, 2020 · 7 comments
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on release-20.1@20ea783887c1f33ab925cc8233041c54b58da1c5:

	tpcc.go:903,tpcc.go:615,test_runner.go:754: monitor failure: monitor task failed: error running tpcc load generator: output in run_194441.202_n4_workload_run_tpcc: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2134119-1596176904-81-n12cpu4-geo:4 -- ./workload run tpcc --warehouses=5000 --active-warehouses=941 --tolerate-errors --scatter --ramp=3m0s --duration=15m0s --partitions=3 --partition-affinity=0 {pgurl:1-3} --histograms=perf/warehouses=941/stats.json returned: context canceled
		(1) attached stack trace
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2541
		  | main.runTPCCBench.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:880
		  | github.com/cockroachdb/cockroach/pkg/util/search.searchWithSearcher
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:43
		  | github.com/cockroachdb/cockroach/pkg/util/search.(*lineSearcher).Search
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:173
		  | main.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:791
		  | main.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:615
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:754
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2597
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  | main.runTPCCBench.func3.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:863
		  | [...repeated from below...]
		Wraps: (6) 1 safe detail enclosed
		Wraps: (7) error running tpcc load generator
		Wraps: (8) attached stack trace
		  | main.(*cluster).RunE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2201
		  | main.runTPCCBench.func3.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:860
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2531
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1373
		Wraps: (9) 2 safe details enclosed
		Wraps: (10) output in run_194441.202_n4_workload_run_tpcc
		Wraps: (11) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2134119-1596176904-81-n12cpu4-geo:4 -- ./workload run tpcc --warehouses=5000 --active-warehouses=941 --tolerate-errors --scatter --ramp=3m0s --duration=15m0s --partitions=3 --partition-affinity=0 {pgurl:1-3} --histograms=perf/warehouses=941/stats.json returned
		  | stderr:
		  |
		  | stdout:
		Wraps: (12) secondary error attachment
		  | signal: killed
		  | (1) signal: killed
		  | Error types: (1) *exec.ExitError
		Wraps: (13) context canceled
		Error types: (1) *withstack.withStack (2) *errutil.withMessage (3) *withstack.withStack (4) *errutil.withMessage (5) *withstack.withStack (6) *safedetails.withSafeDetails (7) *errutil.withMessage (8) *withstack.withStack (9) *safedetails.withSafeDetails (10) *errutil.withMessage (11) *main.withCommandDetails (12) *secondary.withSecondaryError (13) *errors.errorString

More

Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-release-20.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jul 31, 2020
@cockroach-teamcity cockroach-teamcity added this to the 20.2 milestone Jul 31, 2020
@andreimatei
Copy link
Contributor

I think this is the same as the first failures in #48255.
We've waited 6h to load data:

12:30:07 cluster.go:2190: [w2] > ./workload fixtures load tpcc --warehouses=5000 --scatter --checks=false --partitions=3 --zones="us-east1-b,us-west1-b,europe-west2-b" {pgurl:1}
17:02:44 tpcc.go:689: waiting 1h40m0s for rebalancing
17:02:45 cluster.go:545: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2134119-1596176904-81-n12cpu4-geo:4 -- ./workload run tpcc --warehouses=5000 --workers=5000 --max-rate=2500 --wait=false --duration=1h40m0s --scatter --tolerate-errors {pgurl:1-3,5-7,9-11}
18:45:12 test.go:188: test status: stopping cluster

@nvanbenschoten does that sound ok?

@nvanbenschoten
Copy link
Member

Yes, it looks like that's part of the issue. The other part is the 1h40m delay we add in to wait for rebalancing (see #44999). It's unclear whether even that was enough because we then proceed to badly fail each iteration of the test.

I wonder if we should introduce a minimum warehouse count for these kinds of tests. Right now, they'll just search all the way down to 0 if there are issues. This will likely lead to a test timeout instead of a more descriptive "test failed due to perf issues" error.

I think the other piece of this is that we don't take the expected test time into account when scheduling roachtests. We can see from TC (https://teamcity.cockroachdb.com/project.html?projectId=Cockroach_Nightlies&testNameId=-7667002519850730298&tab=testDetails) that when this test passes, it usually takes a little over 6h to run. If it starts running too late into the overall roachtest run, then there's a good chance that it will not complete in time. I think this might actually be fairly common because of the way we prioritize cluster re-use. I wonder if we should add a new heuristic to run tests in reverse order of their expected duration. Or just throw a bit more parallelism and CPU quota at the nightly roachtests. Is there any reason to not crank to parallelism way up? Just that doing so would undermine the cluster re-use policy if not combined with a "diversity" heuristic for choosing what to run concurrently?

@andreimatei
Copy link
Contributor

If it starts running too late into the overall roachtest run, then there's a good chance that it will not complete in time.

Is there an overall timeout for all the tests, separate from the per-test timeout?

@nvanbenschoten
Copy link
Member

Yes, 20 hours: https://github.com/cockroachdb/cockroach/blob/master/build/teamcity-nightly-roachtest.sh#L114.

But we don't seem to be hitting that here. We're actually hitting the 10h timeout. That doesn't add up, even with the 4 hour import. Looking at the logs, we see:

19:44:41 cluster.go:2190: [w2] > ./workload run tpcc --warehouses=5000 --active-warehouses=941 --tolerate-errors --scatter --ramp=3m0s --duration=15m0s --partitions=3 --partition-affinity=1 {pgurl:5-7} --histograms=perf/warehouses=941/stats.json
19:44:41 cluster.go:2190: [w2] > ./workload run tpcc --warehouses=5000 --active-warehouses=941 --tolerate-errors --scatter --ramp=3m0s --duration=15m0s --partitions=3 --partition-affinity=0 {pgurl:1-3} --histograms=perf/warehouses=941/stats.json
19:44:41 cluster.go:2190: [w2] > ./workload run tpcc --warehouses=5000 --active-warehouses=941 --tolerate-errors --scatter --ramp=3m0s --duration=15m0s --partitions=3 --partition-affinity=2 {pgurl:9-11} --histograms=perf/warehouses=941/stats.json
2: 6842
6: 5476
1: 8928
3: 8120
7: 8340
5: 5868
9: 6637
11: 5085
10: 7446
22:27:46 test_runner.go:774: [w2] dumped stacks to __stacks

Notice the 3 hour pause for a ~20m operation. Looking at the __stacks.txt, we see three of these suspicious-looking goroutines:

goroutine 358158 [syscall, 163 minutes]:
syscall.Syscall6(0xf7, 0x1, 0x6ca7, 0xc00055d620, 0x1000004, 0x0, 0x0, 0x75e763, 0xc0033de300, 0xc00055d660)
        /usr/local/go/src/syscall/asm_linux_amd64.s:41 +0x5
os.(*Process).blockUntilWaitable(0xc003134210, 0x6ebcce, 0x6d5540, 0xc00055d710)
        /usr/local/go/src/os/wait_waitid.go:31 +0x98
os.(*Process).wait(0xc003134210, 0x4200240, 0x4200248, 0x4200238)
        /usr/local/go/src/os/exec_unix.go:22 +0x39
os.(*Process).Wait(...)
        /usr/local/go/src/os/exec.go:125
os/exec.(*Cmd).Wait(0xc0003f0000, 0x0, 0x0)
        /usr/local/go/src/os/exec/exec.go:507 +0x60
os/exec.(*Cmd).Run(0xc0003f0000, 0x41ff9b8, 0xc0018fc380)
        /usr/local/go/src/os/exec/exec.go:341 +0x5c
main.execCmdEx(0x4ae9980, 0xc001e0a040, 0xc0033de120, 0xc000d92300, 0x5, 0x8, 0x0, 0x0, 0x0, 0x0, ...)
        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:475 +0x62a
main.execCmd(...)
        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:360
main.(*cluster).RunL(0xc0011271d0, 0x4ae9980, 0xc001e0a040, 0xc0033de120, 0xc000ce9038, 0x1, 0x1, 0xc00055dd40, 0x1, 0x1, ...)
        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2210 +0x25c
main.(*cluster).RunE(0xc0011271d0, 0x4ae9980, 0xc001e0a040, 0xc000ce9038, 0x1, 0x1, 0xc00055dd40, 0x1, 0x1, 0xc002b6c530, ...)
        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2191 +0x2e8
main.runTPCCBench.func3.1(0x4ae9980, 0xc001e0a040, 0x6e8434, 0xc000f964f0)
        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:860 +0x593
main.(*monitor).Go.func1(0x0, 0x0)
        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2531 +0xc8
golang.org/x/sync/errgroup.(*Group).Go.func1(0xc0010c3800, 0xc000f540a0)
        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57 +0x59
created by golang.org/x/sync/errgroup.(*Group).Go
        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:54 +0x66

@nvanbenschoten
Copy link
Member

Now check out the workload logs (run_194441.202_n4_workload_run_tpcc.failed.log, run_194441.202_n8_workload_run_tpcc.failed.log, and run_194441.202_n12_workload_run_tpcc.failed.log):

run_194441.202_n4_workload_run_tpcc: 19:44:41 cluster.go:380: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2134119-1596176904-81-n12cpu4-geo:4 -- ./workload run tpcc --warehouses=5000 --active-warehouses=941 --tolerate-errors --scatter --ramp=3m0s --duration=15m0s --partitions=3 --partition-affinity=0 {pgurl:1-3} --histograms=perf/warehouses=941/stats.json
run_194441.202_n4_workload_run_tpcc: 22:27:47 cluster.go:2192: > result: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2134119-1596176904-81-n12cpu4-geo:4 -- ./workload run tpcc --warehouses=5000 --active-warehouses=941 --tolerate-errors --scatter --ramp=3m0s --duration=15m0s --partitions=3 --partition-affinity=0 {pgurl:1-3} --histograms=perf/warehouses=941/stats.json returned: context canceled
(1) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2134119-1596176904-81-n12cpu4-geo:4 -- ./workload run tpcc --warehouses=5000 --active-warehouses=941 --tolerate-errors --scatter --ramp=3m0s --duration=15m0s --partitions=3 --partition-affinity=0 {pgurl:1-3} --histograms=perf/warehouses=941/stats.json returned
  | stderr:
  |
  | stdout:
Wraps: (2) secondary error attachment
  | signal: killed
  | (1) signal: killed
  | Error types: (1) *exec.ExitError
Wraps: (3) context canceled
Error types: (1) *main.withCommandDetails (2) *secondary.withSecondaryError (3) *errors.errorString
run_194441.202_n4_workload_run_tpcc: 22:27:47 cluster.go:2194: (note: incoming context was canceled: context canceled

Why did this command get stuck on all three load gen nodes?

@cockroach-teamcity
Copy link
Member Author

(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on release-20.1@607d3fb91a15fbd4613f22396d5828c4d27d390b:

		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2641
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  | main.runTPCCBench.func3.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:878
		  | [...repeated from below...]
		Wraps: (6) 1 safe detail enclosed
		Wraps: (7) error running tpcc load generator
		Wraps: (8) attached stack trace
		  | main.(*cluster).RunE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2245
		  | main.runTPCCBench.func3.1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:875
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2575
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (9) 2 safe details enclosed
		Wraps: (10) output in run_181029.968_n12_workload_run_tpcc
		Wraps: (11) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2172557-1597213822-81-n12cpu4-geo:12 -- ./workload run tpcc --warehouses=5000 --active-warehouses=1000 --tolerate-errors --scatter --ramp=3m0s --duration=15m0s --partitions=3 --partition-affinity=2 {pgurl:9-11} --histograms=perf/warehouses=1000/stats.json returned
		  | stderr:
		  | 43627 1 workload/cli/run.go:379  Attempt to create load generator failed. It's been more than 10m0s since we started trying to create the load generator so we're giving up. Last failure: failed to initialize the load generator: preparing 
		  | 		UPDATE district
		  | 		SET d_next_o_id = d_next_o_id + 1
		  | 		WHERE d_w_id = $1 AND d_id = $2
		  | 		RETURNING d_tax, d_next_o_id: context deadline exceeded
		  | Error: failed to initialize the load generator: preparing 
		  | 		UPDATE district
		  | 		SET d_next_o_id = d_next_o_id + 1
		  | 		WHERE d_w_id = $1 AND d_id = $2
		  | 		RETURNING d_tax, d_next_o_id: context deadline exceeded
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 12. Command with error:
		  |   | ```
		  |   | ./workload run tpcc --warehouses=5000 --active-warehouses=1000 --tolerate-errors --scatter --ramp=3m0s --duration=15m0s --partitions=3 --partition-affinity=2 {pgurl:9-11} --histograms=perf/warehouses=1000/stats.json
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		  | Initializing 2000 connections...
		  | Initializing 10000 workers and preparing statements...
		Wraps: (12) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withMessage (3) *withstack.withStack (4) *errutil.withMessage (5) *withstack.withStack (6) *safedetails.withSafeDetails (7) *errutil.withMessage (8) *withstack.withStack (9) *safedetails.withSafeDetails (10) *errutil.withMessage (11) *main.withCommandDetails (12) *exec.ExitError

More

Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Copy link
Member Author

(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on release-20.1@0df968b1e237ccd88cef89e851bcd52e90932280:

		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2585
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2593
		  | main.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:782
		  | main.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:617
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:754
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 6: dead
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

	cluster.go:1612,context.go:135,cluster.go:1601,test_runner.go:823: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2214769-1598248877-70-n12cpu4-geo --oneshot --ignore-empty-nodes: exit status 1 4: skipped
		3: 4673
		8: skipped
		1: 3873
		2: 4866
		12: skipped
		6: dead
		7: 4562
		5: 4200
		11: 4202
		9: 4178
		10: 4283
		Error: UNCLASSIFIED_PROBLEM: 6: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1143
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:267
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/[email protected]/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/[email protected]/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/[email protected]/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1839
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (3) 6: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError

More

Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
Projects
None yet
Development

No branches or pull requests

4 participants