roachtest: scaledata/filesystem_simulator/nodes=6 failed #51215

Closed
cockroach-teamcity opened this issue Jul 9, 2020 · 17 comments
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
Milestone

Comments

@cockroach-teamcity
Member

(roachtest).scaledata/filesystem_simulator/nodes=6 failed on master@1b5d070c93375d3e14c146241e8bafde349529bd:

		(1) attached stack trace
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2459
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2467
		  | main.runSqlapp
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/scaledata.go:112
		  | main.registerScaleData.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/scaledata.go:49
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:757
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2515
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  | main.(*cluster).RunE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2119
		  | main.runSqlapp.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/scaledata.go:108
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2449
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1373
		Wraps: (6) 2 safe details enclosed
		Wraps: (7) output in run_112602.974_n7_filesystemsimulator_
		Wraps: (8) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2076266-1594275003-70-n7cpu4:7 -- ./filesystem_simulator  --duration_secs=600 --num_workers=16 --cockroach_ip_addresses_csv='10.128.0.86:26257,10.128.15.211:26257,10.128.15.210:26257,10.128.0.57:26257,10.128.0.62:26257,10.128.0.54:26257'  returned
		  | stderr:
		  | 90dc89e2 1 0 147 default}
		  | 2020/07/09 11:32:57 ExecuteTx retry attempt 1 failed, started at 2020-07-09 11:32:56.842796059 +0000 UTC m=+413.158827865, now = 2020-07-09 11:32:57.266879716 +0000 UTC m=+413.582911548, took 424.083683ms
		  | 2020/07/09 11:32:57 pq error - Error code : 57014, Error class : 57
		  | 2020/07/09 11:32:57 pq error - Error code : 57014, Error class : 57
		  | 2020/07/09 11:32:57 Aborting Retries because this error of type *pq.Error is not retryable : pq: query execution canceled
		  | 2020/07/09 11:32:57 postgres error code is 57014 and class is 57
		  | 2020/07/09 11:32:57 pq: query execution canceled
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 7. Command with error:
		  |   | ```
		  |   | ./filesystem_simulator  --duration_secs=600 --num_workers=16 --cockroach_ip_addresses_csv='10.128.0.86:26257,10.128.15.211:26257,10.128.15.210:26257,10.128.0.57:26257,10.128.0.62:26257,10.128.0.54:26257'
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (9) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withMessage (3) *withstack.withStack (4) *errutil.withMessage (5) *withstack.withStack (6) *safedetails.withSafeDetails (7) *errutil.withMessage (8) *main.withCommandDetails (9) *exec.ExitError


Artifacts: /scaledata/filesystem_simulator/nodes=6

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jul 9, 2020
@cockroach-teamcity cockroach-teamcity added this to the 20.2 milestone Jul 9, 2020
@irfansharif
Contributor

Duplicate of #51208.

@irfansharif
Contributor

Actually I'll keep this open since it's on master.

@cockroach-teamcity
Member Author

(roachtest).scaledata/filesystem_simulator/nodes=6 failed on master@542c83e7db1694208e09c7fea7d350db4993b897:

		(1) attached stack trace
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2459
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2467
		  | main.runSqlapp
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/scaledata.go:112
		  | main.registerScaleData.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/scaledata.go:49
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:757
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2515
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  | main.(*cluster).RunE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2119
		  | main.runSqlapp.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/scaledata.go:108
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2449
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1373
		Wraps: (6) 2 safe details enclosed
		Wraps: (7) output in run_095955.594_n7_filesystemsimulator_
		Wraps: (8) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2079204-1594361493-84-n7cpu4:7 -- ./filesystem_simulator  --duration_secs=600 --num_workers=16 --cockroach_ip_addresses_csv='10.128.15.226:26257,10.128.0.146:26257,10.128.0.148:26257,10.128.15.224:26257,10.128.0.79:26257,10.128.0.143:26257'  returned
		  | stderr:
		  | 4c47-9918-df7e227e7ba9
		  | 2020/07/10 10:02:55 ExecuteTx retry attempt 1 failed, started at 2020-07-10 10:02:55.213356219 +0000 UTC m=+178.906071006, now = 2020-07-10 10:02:55.572017402 +0000 UTC m=+179.264732263, took 358.661257ms
		  | 2020/07/10 10:02:55 pq error - Error code : 57014, Error class : 57
		  | 2020/07/10 10:02:55 pq error - Error code : 57014, Error class : 57
		  | 2020/07/10 10:02:55 Aborting Retries because this error of type *pq.Error is not retryable : pq: query execution canceled
		  | 2020/07/10 10:02:55 postgres error code is 57014 and class is 57
		  | 2020/07/10 10:02:55 pq: query execution canceled
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 7. Command with error:
		  |   | ```
		  |   | ./filesystem_simulator  --duration_secs=600 --num_workers=16 --cockroach_ip_addresses_csv='10.128.15.226:26257,10.128.0.146:26257,10.128.0.148:26257,10.128.15.224:26257,10.128.0.79:26257,10.128.0.143:26257'
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (9) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withMessage (3) *withstack.withStack (4) *errutil.withMessage (5) *withstack.withStack (6) *safedetails.withSafeDetails (7) *errutil.withMessage (8) *main.withCommandDetails (9) *exec.ExitError


Artifacts: /scaledata/filesystem_simulator/nodes=6


@cockroach-teamcity
Member Author

(roachtest).scaledata/filesystem_simulator/nodes=6 failed on master@12b58af06d9104bc585bc0773bbe29d68ff2d826:

		(1) attached stack trace
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2482
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2490
		  | main.runSqlapp
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/scaledata.go:112
		  | main.registerScaleData.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/scaledata.go:49
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:757
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2538
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  | main.(*cluster).RunE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2142
		  | main.runSqlapp.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/scaledata.go:108
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2472
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1373
		Wraps: (6) 2 safe details enclosed
		Wraps: (7) output in run_112139.287_n7_filesystemsimulator_
		Wraps: (8) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2084629-1594620495-72-n7cpu4:7 -- ./filesystem_simulator  --duration_secs=600 --num_workers=16 --cockroach_ip_addresses_csv='10.128.0.167:26257,10.128.0.172:26257,10.128.0.161:26257,10.128.0.173:26257,10.128.0.174:26257,10.128.0.171:26257'  returned
		  | stderr:
		  | domDB chose DB at index 3
		  | 2020/07/13 11:22:52 ExecuteTx retry attempt 1 failed, started at 2020-07-13 11:22:52.524715459 +0000 UTC m=+72.516578841, now = 2020-07-13 11:22:52.768805922 +0000 UTC m=+72.760669356, took 244.090515ms
		  | 2020/07/13 11:22:52 pq error - Error code : 57014, Error class : 57
		  | 2020/07/13 11:22:52 pq error - Error code : 57014, Error class : 57
		  | 2020/07/13 11:22:52 Aborting Retries because this error of type *pq.Error is not retryable : pq: query execution canceled
		  | 2020/07/13 11:22:52 postgres error code is 57014 and class is 57
		  | 2020/07/13 11:22:52 pq: query execution canceled
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 7. Command with error:
		  |   | ```
		  |   | ./filesystem_simulator  --duration_secs=600 --num_workers=16 --cockroach_ip_addresses_csv='10.128.0.167:26257,10.128.0.172:26257,10.128.0.161:26257,10.128.0.173:26257,10.128.0.174:26257,10.128.0.171:26257'
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (9) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withMessage (3) *withstack.withStack (4) *errutil.withMessage (5) *withstack.withStack (6) *safedetails.withSafeDetails (7) *errutil.withMessage (8) *main.withCommandDetails (9) *exec.ExitError


Artifacts: /scaledata/filesystem_simulator/nodes=6


@cockroach-teamcity
Member Author

(roachtest).scaledata/filesystem_simulator/nodes=6 failed on master@161d68e8c2149d401738d2c52cf70071a1c3b610:

		(1) attached stack trace
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2482
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2490
		  | main.runSqlapp
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/scaledata.go:112
		  | main.registerScaleData.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/scaledata.go:49
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:757
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2538
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  | main.(*cluster).RunE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2142
		  | main.runSqlapp.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/scaledata.go:108
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2472
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1373
		Wraps: (6) 2 safe details enclosed
		Wraps: (7) output in run_122812.495_n7_filesystemsimulator_
		Wraps: (8) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2087249-1594710321-71-n7cpu4:7 -- ./filesystem_simulator  --duration_secs=600 --num_workers=16 --cockroach_ip_addresses_csv='10.128.15.223:26257,10.128.0.138:26257,10.128.0.144:26257,10.128.15.221:26257,10.128.15.222:26257,10.128.0.140:26257'  returned
		  | stderr:
		  | 8d64 1 3 76 default}
		  | 2020/07/14 12:30:03 ExecuteTx retry attempt 1 failed, started at 2020-07-14 12:30:03.536954552 +0000 UTC m=+110.328639453, now = 2020-07-14 12:30:03.779562754 +0000 UTC m=+110.571247674, took 242.608221ms
		  | 2020/07/14 12:30:03 pq error - Error code : 57014, Error class : 57
		  | 2020/07/14 12:30:03 pq error - Error code : 57014, Error class : 57
		  | 2020/07/14 12:30:03 Aborting Retries because this error of type *pq.Error is not retryable : pq: query execution canceled
		  | 2020/07/14 12:30:03 postgres error code is 57014 and class is 57
		  | 2020/07/14 12:30:03 pq: query execution canceled
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 7. Command with error:
		  |   | ```
		  |   | ./filesystem_simulator  --duration_secs=600 --num_workers=16 --cockroach_ip_addresses_csv='10.128.15.223:26257,10.128.0.138:26257,10.128.0.144:26257,10.128.15.221:26257,10.128.15.222:26257,10.128.0.140:26257'
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (9) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withMessage (3) *withstack.withStack (4) *errutil.withMessage (5) *withstack.withStack (6) *safedetails.withSafeDetails (7) *errutil.withMessage (8) *main.withCommandDetails (9) *exec.ExitError


Artifacts: /scaledata/filesystem_simulator/nodes=6


@irfansharif
Contributor

The test fails after an unexpected query-execution cancellation error:

2020/07/10 10:02:55 Aborting Retries because this error of type *pq.Error is not retryable : pq: query execution canceled

As far as I can tell, there's nothing in the client test code (https://github.com/cockroachdb/rksql) that would issue query cancellations. I don't think these are statement timeouts either: the statements are tiny, terminate quickly, and I believe they would propagate a different error tag (haven't verified).

@yuzefovich: There have been some changes around context cancellation in the vectorized engine (and it seems like there is a bug in related area) which could be connected to the behavior you’re seeing

The discussion in #51375 (comment) and #51518 (comment) seem relevant.

@irfansharif
Contributor

I'll note that this failure started after we resolved #50687, but if I had to guess, I'd guess that #50687 was just masking this particular error mode (and may have been in part due to ed02ab5/#50388 as well). It's a bit hard to reproduce; I haven't been able to in ~10 runs (which admittedly is not a lot).

@knz
Contributor

knz commented Jul 21, 2020

I looked into this. The actual log output is this:

2020/07/14 12:30:03 ExecuteTx retry attempt 1 failed, started at 2020-07-14 12:30:03.536954552 +0000 UTC m=+110.328639453, now = 2020-07-14 12:30:03.779562754 +0000 UTC m=+110.571247674, took 242.608221ms
2020/07/14 12:30:03 pq error - Error code : 57014, Error class : 57
2020/07/14 12:30:03 pq error - Error code : 57014, Error class : 57
2020/07/14 12:30:03 Aborting Retries because this error of type *pq.Error is not retryable : pq: query execution canceled
2020/07/14 12:30:03 postgres error code is 57014 and class is 57

This tells us 3 things:

  • the "query execution canceled" error was generated server-side; it's not a side effect of the context timeout set by the client app.
  • the cancellation took place after ~240ms so it's not due to a txn too long.
  • in the source code, this particular error can occur EITHER from an explicit SQL CANCEL request, or a natural execution timeout.

I don't see explicit CANCEL requests, so I go to fish in the node logs to see more details. Unfortunately I don't see these query timeouts being canceled in logs.

I would suggest a couple of "next actions" here:

  1. change the distsql execution code to separate QueryCanceledErrors between "canceled by operator" and "canceled because of system timeout". This is essential to identify where to look during troubleshooting.

  2. when timing out because of a system timeout, also log the location in the code which set up the timeout. We want to know the origin of the failing request.

  3. stress test the roachtest using roachprod and the same setup as CI to see how frequently the failure occurs. Maybe the VMs are not properly sized and this test overloads CPU, network, disk IOPS, or a combination thereof. Maybe that can be solved by resizing the agents used for this test.

@asubiotto @yuzefovich do you want to create an issue just for items 1 and 2?

@knz
Contributor

knz commented Jul 21, 2020

Sorry, my last sentence was meant to be "for items 1 and 2", not "2 and 3".

@irfansharif
Contributor

In all my prior attempts at reproing it I was running local builds; I just tried a couple (by that I mean literally 2) of non-local runs and hit it each time.

bin/roachtest run scaledata/filesystem-simulator/nodes=3 --cockroach ./cockroach --roachprod bin/roachprod

This is on your run-of-the-mill n4cpu4s. I'll optimistically mark this as "reproducible" and hand it over to @yuzefovich and @asubiotto. Seems it's a pretty pervasive problem (#51647).

@cockroach-teamcity
Member Author

(roachtest).scaledata/filesystem-simulator/nodes=6 failed on master@e9a4f83e3eee59510f97db2c6e0df9b57cf6b944:

		(1) attached stack trace
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2541
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2549
		  | main.runSqlapp
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/scaledata.go:119
		  | main.registerScaleData.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/scaledata.go:49
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:757
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2597
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  | main.(*cluster).RunE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2201
		  | main.runSqlapp.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/scaledata.go:115
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2531
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1373
		Wraps: (6) 2 safe details enclosed
		Wraps: (7) output in run_110458.929_n7_filesystemsimulator_
		Wraps: (8) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2107908-1595398673-120-n7cpu4:7 -- ./filesystem-simulator  --duration_secs=600 --num_workers=16 --cockroach_ip_addresses_csv='10.128.0.56:26257,10.128.0.61:26257,10.128.0.95:26257,10.128.0.18:26257,10.128.0.92:26257,10.128.0.81:26257'  returned
		  | stderr:
		  | 83-a4ab-44fa0089bf61 0 default}
		  | 2020/07/22 11:06:10 ExecuteTx retry attempt 1 failed, started at 2020-07-22 11:06:10.646014074 +0000 UTC m=+71.003253324, now = 2020-07-22 11:06:10.903303088 +0000 UTC m=+71.260542408, took 257.289084ms
		  | 2020/07/22 11:06:10 pq error - Error code : 57014, Error class : 57
		  | 2020/07/22 11:06:10 pq error - Error code : 57014, Error class : 57
		  | 2020/07/22 11:06:10 Aborting Retries because this error of type *pq.Error is not retryable : pq: query execution canceled
		  | 2020/07/22 11:06:10 postgres error code is 57014 and class is 57
		  | 2020/07/22 11:06:10 pq: query execution canceled
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 7. Command with error:
		  |   | ```
		  |   | ./filesystem-simulator  --duration_secs=600 --num_workers=16 --cockroach_ip_addresses_csv='10.128.0.56:26257,10.128.0.61:26257,10.128.0.95:26257,10.128.0.18:26257,10.128.0.92:26257,10.128.0.81:26257'
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (9) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withMessage (3) *withstack.withStack (4) *errutil.withMessage (5) *withstack.withStack (6) *safedetails.withSafeDetails (7) *errutil.withMessage (8) *main.withCommandDetails (9) *exec.ExitError


Artifacts: /scaledata/filesystem-simulator/nodes=6


@yuzefovich
Member

  1. change the distsql execution code to separate QueryCanceledErrors between "canceled by operator" and "canceled because of system timeout". This is essential to identify where to look during troubleshooting.

I think my original statement that we could see "query canceled" errors because statement_timeout has been reached is wrong: in

```go
if res != nil && ctx.Err() != nil && res.Err() != nil {
	if queryTimedOut {
		res.SetError(sqlbase.QueryTimeoutError)
	} else {
		res.SetError(sqlbase.QueryCanceledError)
	}
}
```

we return a different error ("query execution canceled due to statement timeout") in that case. And I'm not aware of any other system timeouts in the SQL layer, so I don't think there is anything to do for this item.

  2. when timing out because of a system timeout, also log the location in the code which set up the timeout. We want to know the origin of the failing request.

This is similar to point 1: I'm only aware of the statement_timeout session setting, which is set by the user, so I don't think there is anything to do here either.

@knz do you agree with my reasoning?

For more context, we have been seeing "query canceled" errors because of a bug in the vectorized engine which I have a PR (#51772) to fix. I'm hoping that once that goes in, these "query canceled" errors will go away.

@cockroach-teamcity
Member Author

(roachtest).scaledata/filesystem-simulator/nodes=6 failed on master@b8a50cc4d062293915969cdc83e3ec4d057cede5:

		(1) attached stack trace
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2541
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2549
		  | main.runSqlapp
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/scaledata.go:119
		  | main.registerScaleData.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/scaledata.go:49
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:757
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2597
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  | main.(*cluster).RunE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2201
		  | main.runSqlapp.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/scaledata.go:115
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2531
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1373
		Wraps: (6) 2 safe details enclosed
		Wraps: (7) output in run_102552.866_n7_filesystemsimulator_
		Wraps: (8) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2111252-1595484018-119-n7cpu4:7 -- ./filesystem-simulator  --duration_secs=600 --num_workers=16 --cockroach_ip_addresses_csv='10.128.0.83:26257,10.128.0.80:26257,10.128.0.90:26257,10.128.0.96:26257,10.128.0.94:26257,10.128.0.81:26257'  returned
		  | stderr:
		  | DB.RandomDB chose DB at index 5
		  | 2020/07/23 10:26:57 ExecuteTx retry attempt 1 failed, started at 2020-07-23 10:26:57.690341872 +0000 UTC m=+64.108126250, now = 2020-07-23 10:26:57.905068479 +0000 UTC m=+64.322852928, took 214.726678ms
		  | 2020/07/23 10:26:57 pq error - Error code : 57014, Error class : 57
		  | 2020/07/23 10:26:57 pq error - Error code : 57014, Error class : 57
		  | 2020/07/23 10:26:57 Aborting Retries because this error of type *pq.Error is not retryable : pq: query execution canceled
		  | 2020/07/23 10:26:57 postgres error code is 57014 and class is 57
		  | 2020/07/23 10:26:57 pq: query execution canceled
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 7. Command with error:
		  |   | ```
		  |   | ./filesystem-simulator  --duration_secs=600 --num_workers=16 --cockroach_ip_addresses_csv='10.128.0.83:26257,10.128.0.80:26257,10.128.0.90:26257,10.128.0.96:26257,10.128.0.94:26257,10.128.0.81:26257'
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (9) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withMessage (3) *withstack.withStack (4) *errutil.withMessage (5) *withstack.withStack (6) *safedetails.withSafeDetails (7) *errutil.withMessage (8) *main.withCommandDetails (9) *exec.ExitError


Artifacts: /scaledata/filesystem-simulator/nodes=6


@cockroach-teamcity
Member Author

(roachtest).scaledata/filesystem-simulator/nodes=6 failed on master@bfa6307c292ef4dfed4a53cb99f506e6dab26533:

		(1) attached stack trace
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2541
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2549
		  | main.runSqlapp
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/scaledata.go:119
		  | main.registerScaleData.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/scaledata.go:49
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:757
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2597
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  | main.(*cluster).RunE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2201
		  | main.runSqlapp.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/scaledata.go:115
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2531
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1373
		Wraps: (6) 2 safe details enclosed
		Wraps: (7) output in run_110049.170_n7_filesystemsimulator_
		Wraps: (8) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2114210-1595571129-70-n7cpu4:7 -- ./filesystem-simulator  --duration_secs=600 --num_workers=16 --cockroach_ip_addresses_csv='10.128.0.18:26257,10.128.0.5:26257,10.128.0.10:26257,10.128.0.15:26257,10.128.0.8:26257,10.128.0.24:26257'  returned
		  | stderr:
		  | 9ca7-a1554b0a9f46 1 0 37 default}
		  | 2020/07/24 11:02:04 ExecuteTx retry attempt 1 failed, started at 2020-07-24 11:02:00.512674288 +0000 UTC m=+70.636792800, now = 2020-07-24 11:02:04.003336641 +0000 UTC m=+74.127455223, took 3.490662423s
		  | 2020/07/24 11:02:04 pq error - Error code : 57014, Error class : 57
		  | 2020/07/24 11:02:04 pq error - Error code : 57014, Error class : 57
		  | 2020/07/24 11:02:04 Aborting Retries because this error of type *pq.Error is not retryable : pq: query execution canceled
		  | 2020/07/24 11:02:04 postgres error code is 57014 and class is 57
		  | 2020/07/24 11:02:04 pq: query execution canceled
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 7. Command with error:
		  |   | ```
		  |   | ./filesystem-simulator  --duration_secs=600 --num_workers=16 --cockroach_ip_addresses_csv='10.128.0.18:26257,10.128.0.5:26257,10.128.0.10:26257,10.128.0.15:26257,10.128.0.8:26257,10.128.0.24:26257'
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (9) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withMessage (3) *withstack.withStack (4) *errutil.withMessage (5) *withstack.withStack (6) *safedetails.withSafeDetails (7) *errutil.withMessage (8) *main.withCommandDetails (9) *exec.ExitError


Artifacts: /scaledata/filesystem-simulator/nodes=6

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity

(roachtest).scaledata/filesystem-simulator/nodes=6 failed on master@a16eb55ed96239dcd288aa1c2f80f306559f0f0b:

		(1) attached stack trace
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2541
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2549
		  | main.runSqlapp
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/scaledata.go:119
		  | main.registerScaleData.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/scaledata.go:49
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:754
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2597
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  | main.(*cluster).RunE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2201
		  | main.runSqlapp.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/scaledata.go:115
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2531
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1373
		Wraps: (6) 2 safe details enclosed
		Wraps: (7) output in run_122222.529_n7_filesystemsimulator_
		Wraps: (8) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2116591-1595658375-62-n7cpu4:7 -- ./filesystem-simulator  --duration_secs=600 --num_workers=16 --cockroach_ip_addresses_csv='10.128.15.229:26257,10.128.0.168:26257,10.128.15.227:26257,10.128.0.165:26257,10.128.0.166:26257,10.128.15.228:26257'  returned
		  | stderr:
		  | 492d718d4 0 default}
		  | 2020/07/25 12:26:10 ExecuteTx retry attempt 1 failed, started at 2020-07-25 12:26:07.338109671 +0000 UTC m=+224.085106142, now = 2020-07-25 12:26:10.291452707 +0000 UTC m=+227.038449205, took 2.953343063s
		  | 2020/07/25 12:26:10 pq error - Error code : 57014, Error class : 57
		  | 2020/07/25 12:26:10 pq error - Error code : 57014, Error class : 57
		  | 2020/07/25 12:26:10 Aborting Retries because this error of type *pq.Error is not retryable : pq: query execution canceled
		  | 2020/07/25 12:26:10 postgres error code is 57014 and class is 57
		  | 2020/07/25 12:26:10 pq: query execution canceled
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 7. Command with error:
		  |   | ```
		  |   | ./filesystem-simulator  --duration_secs=600 --num_workers=16 --cockroach_ip_addresses_csv='10.128.15.229:26257,10.128.0.168:26257,10.128.15.227:26257,10.128.0.165:26257,10.128.0.166:26257,10.128.15.228:26257'
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (9) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withMessage (3) *withstack.withStack (4) *errutil.withMessage (5) *withstack.withStack (6) *safedetails.withSafeDetails (7) *errutil.withMessage (8) *main.withCommandDetails (9) *exec.ExitError


Artifacts: /scaledata/filesystem-simulator/nodes=6

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity

(roachtest).scaledata/filesystem-simulator/nodes=6 failed on master@3edbe4aeb3c7300e6690cb2222a8d5c01e920bf4:

		(1) attached stack trace
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2541
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2549
		  | main.runSqlapp
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/scaledata.go:119
		  | main.registerScaleData.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/scaledata.go:49
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:754
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2597
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  | main.(*cluster).RunE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2201
		  | main.runSqlapp.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/scaledata.go:115
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2531
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1373
		Wraps: (6) 2 safe details enclosed
		Wraps: (7) output in run_113256.155_n7_filesystemsimulator_
		Wraps: (8) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2117697-1595744317-63-n7cpu4:7 -- ./filesystem-simulator  --duration_secs=600 --num_workers=16 --cockroach_ip_addresses_csv='10.128.0.56:26257,10.128.0.70:26257,10.128.15.229:26257,10.128.15.230:26257,10.128.0.4:26257,10.128.0.128:26257'  returned
		  | stderr:
		  | tarted at 2020-07-26 11:33:56.75012345 +0000 UTC m=+59.879151717, now = 2020-07-26 11:33:58.18611116 +0000 UTC m=+61.315139461, took 1.435987744s
		  | 2020/07/26 11:33:58 pq error - Error code : 57014, Error class : 57
		  | 2020/07/26 11:33:58 pq error - Error code : 57014, Error class : 57
		  | 2020/07/26 11:33:58 Aborting Retries because this error of type *pq.Error is not retryable : pq: query execution canceled
		  | 2020/07/26 11:33:58 postgres error code is 57014 and class is 57
		  | 2020/07/26 11:33:58 pq: query execution canceled
		  | 2020/07/26 11:33:58 Removing &{acf791c7-ce9c-42a7-b541-ac1657ea835f 1 0 253 default}
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 7. Command with error:
		  |   | ```
		  |   | ./filesystem-simulator  --duration_secs=600 --num_workers=16 --cockroach_ip_addresses_csv='10.128.0.56:26257,10.128.0.70:26257,10.128.15.229:26257,10.128.15.230:26257,10.128.0.4:26257,10.128.0.128:26257'
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (9) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withMessage (3) *withstack.withStack (4) *errutil.withMessage (5) *withstack.withStack (6) *safedetails.withSafeDetails (7) *errutil.withMessage (8) *main.withCommandDetails (9) *exec.ExitError


Artifacts: /scaledata/filesystem-simulator/nodes=6

See this test on roachdash
powered by pkg/cmd/internal/issues
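
The failures above all follow the same pattern: the simulator's `ExecuteTx` loop retries a failed transaction, then aborts because pq error 57014 (`query_canceled`, error class 57) is classified as non-retryable. A minimal sketch of that classification logic is below; `isRetryable` and `pgClass` are hypothetical helper names for illustration, not the simulator's actual code. Clients of CockroachDB typically retry only serialization failures (code 40001), so a cancellation surfaces directly as a test failure.

```go
package main

import "fmt"

// isRetryable reports whether a Postgres error code should be retried.
// Only serialization failures (40001) are safe to retry automatically;
// 57014 (query_canceled) is not, which is why the simulator logs
// "Aborting Retries" and exits. Illustrative sketch only.
func isRetryable(code string) bool {
	return code == "40001"
}

// pgClass returns the two-character error class, e.g. "57" for 57014,
// matching the "Error code : 57014, Error class : 57" log lines.
func pgClass(code string) string {
	return code[:2]
}

func main() {
	for _, code := range []string{"40001", "57014"} {
		fmt.Printf("code %s (class %s) retryable: %v\n",
			code, pgClass(code), isRetryable(code))
	}
}
```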

@irfansharif

#51772 landed yesterday afternoon, and there wasn't any failure from the nightly run yesterday, so closing 🤞

Thanks @yuzefovich!
