Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: clearrange/checks=false failed #84796

Closed
cockroach-teamcity opened this issue Jul 21, 2022 · 5 comments
Closed

roachtest: clearrange/checks=false failed #84796

cockroach-teamcity opened this issue Jul 21, 2022 · 5 comments
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-storage Storage Team X-unactionable This was closed because it was unactionable.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Jul 21, 2022

roachtest.clearrange/checks=false failed with artifacts on master @ 457d724622e4fa2e62d6f4e7926509dbc7d18511:

		  |  1050.0s        0            0.0         4239.2      0.0      0.0      0.0      0.0 write
		  |  1051.0s        0            0.0         4235.2      0.0      0.0      0.0      0.0 write
		  | I220721 08:08:00.382610 376 workload/pgx_helpers.go:79  [-] 3  pgx logger [error]: Exec logParams=map[args:[8159777662128839569 91] err:ERROR: result is ambiguous: error=rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout [propagate] (SQLSTATE 40003) pid:11265693 sql:kv-2 time:19.998728297s]
		  | I220721 08:08:00.382615 375 workload/pgx_helpers.go:79  [-] 5  pgx logger [error]: Exec logParams=map[args:[2498480036115007573 3f] err:ERROR: result is ambiguous: error=rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout [propagate] (SQLSTATE 40003) pid:10863663 sql:kv-2 time:19.994303201s]
		  | I220721 08:08:00.382615 388 workload/pgx_helpers.go:79  [-] 4  pgx logger [error]: Exec logParams=map[args:[2770579487352219294 b3] err:ERROR: result is ambiguous: error=rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout [propagate] (SQLSTATE 40003) pid:10666432 sql:kv-2 time:20.022435116s]
		  | Error: ERROR: result is ambiguous: error=rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout [propagate] (SQLSTATE 40003)
		  | COMMAND_PROBLEM: exit status 1
		Wraps: (4) secondary error attachment
		  | COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 2. Command with error:
		  |   | ``````
		  |   | ./cockroach workload run kv --concurrency=32 --duration=1h
		  |   | ``````
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		Wraps: (5) context canceled
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *secondary.withSecondaryError (5) *errors.errorString

	monitor.go:127,clearrange.go:206,clearrange.go:39,test_runner.go:896: monitor failure: monitor task failed: t.Fatal() was called
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runClearRange
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/clearrange.go:206
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerClearRange.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/clearrange.go:39
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:896
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	main/pkg/cmd/roachtest/monitor.go:171
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.init
		  | 	main/pkg/cmd/roachtest/monitor.go:80
		  | runtime.doInit
		  | 	GOROOT/src/runtime/proc.go:6498
		  | runtime.main
		  | 	GOROOT/src/runtime/proc.go:238
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (6) t.Fatal() was called
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

Jira issue: CRDB-17869

Epic CRDB-16238

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jul 21, 2022
@cockroach-teamcity cockroach-teamcity added this to the 22.2 milestone Jul 21, 2022
@nicktrav
Copy link
Collaborator

Looks like n1 locked up around 08:07:30:

Screen Shot 2022-07-21 at 8 00 49 AM

That lines up with the logs on n1, at which time there's a sizable gap in the logs:

W220721 08:07:39.337328 780 2@rpc/clock_offset.go:221 ⋮ [n1,rnode=2,raddr=‹10.142.0.106:26257›,class=system,heartbeat] 1462  latency jump (prev avg 1.09ms, current 2.44ms)
W220721 08:11:44.219587 5217397 google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1466  ‹[core]›‹[Server #1] grpc: Server.processUnaryRPC failed to write status connection error: desc = "transport is closing"›

On n2, we see connectivity issues to n1 around the same time:

W220721 08:07:46.334269 3182 kv/kvserver/closedts/sidetransport/receiver.go:139 ⋮ [n2] 1125  closed timestamps side-transport connection dropped from node: 1
I220721 08:07:52.362771 431 kv/kvserver/replica_rangefeed.go:691 ⋮ [n2,s2,r10/3:‹/Table/{6-8}›] 1126  RangeFeed closed timestamp 1658390857.303843493,0 is behind by 15.058923525s
W220721 08:07:55.239608 1109 server/node.go:905 ⋮ [n2,summaries] 1127  health alerts detected: {Alerts:[{StoreID:2 Category:‹METRICS› Description:‹ranges.underreplicated› Value:102}]}
I220721 08:07:57.365165 6543408 kv/kvserver/replica_rangefeed.go:718 ⋮ [n2,s2,r21/3:‹/Table/2{0-1}›] 1128  RangeFeed failed to nudge: ‹retry rangefeed (REASON_NO_LEASEHOLDER)›
W220721 08:07:57.491292 257 kv/kvserver/raft_transport.go:716 ⋮ [n2] 1129  while processing outgoing Raft queue to node 1: ‹rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout›:
W220721 08:07:57.682020 6543570 kv/kvserver/raft_transport.go:711 ⋮ [n2] 1130  creating batch client for node 1 failed: ‹rpc error: code = Canceled desc = grpc: the client connection is closing›
W220721 08:07:57.682020 6543570 kv/kvserver/raft_transport.go:711 ⋮ [n2] 1130 +(1) ‹rpc error: code = Canceled desc = grpc: the client connection is closing›
W220721 08:07:57.682020 6543570 kv/kvserver/raft_transport.go:711 ⋮ [n2] 1130 +Error types: (1) *status.Error
W220721 08:07:57.683120 6543571 google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1131  ‹[core]›‹[Channel #11 SubChannel #12] grpc: addrConn.createTransport failed to connect to {›

Tentatively chalking this one up to a hardware issue, and thus will remove the release-blocker label.

It is interesting that both this roachtest and the checks=true variant both failed in the same run, with the same type of error, namely:

ERROR: result is ambiguous: error=rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout [propagate] (SQLSTATE 40003)

@nicktrav nicktrav removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Jul 21, 2022
@nicktrav nicktrav self-assigned this Jul 21, 2022
@nicktrav nicktrav added the T-storage Storage Team label Jul 21, 2022
@nicktrav
Copy link
Collaborator

nicktrav commented Jul 21, 2022

Kicked off a manual run (logs).

@nicktrav
Copy link
Collaborator

Looks like the manual run (at the same SHA) failed with the same error. Another disk stall:

* FATAL: disk stall detected: unable to sync log files within 20s
*
cockroach exited with code 8: Thu Jul 21 21:13:36 UTC 2022

I've kicked off another couple of runs (logs, logs), but at more recent SHAs.

@nicktrav
Copy link
Collaborator

The two additional runs were fine. Hoping this was just a case of getting unlucky with hardware.

Will leave this open for a few more days, just in case it collects some more failures.

@nicktrav
Copy link
Collaborator

Will leave this open for a few more days, just in case it collects some more failures.

Given this hasn't fired again in the past few days, I'm going to close this.

@nicktrav nicktrav added the X-unactionable This was closed because it was unactionable. label Dec 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-storage Storage Team X-unactionable This was closed because it was unactionable.
Projects
None yet
Development

No branches or pull requests

2 participants