
roachtest: import/tpcc/warehouses=4000/geo failed #88451

Closed

cockroach-teamcity opened this issue Sep 22, 2022 · 3 comments

Labels: branch-release-22.2, C-test-failure, O-roachtest, O-robot, release-blocker, sync-me, T-disaster-recovery
Milestone: 22.2

cockroach-teamcity (Member) commented Sep 22, 2022

roachtest.import/tpcc/warehouses=4000/geo failed with artifacts on release-22.2 @ a33d71dcd904c771d1297323d8d206b8b59d40bf:

		  | I220922 11:00:49.406500 118 ccl/workloadccl/fixture.go:481  [-] 2  imported 213 KiB in warehouse table (4000 rows, 0 index entries, took 2.89905329s, 0.07 MiB/s)
		  | I220922 11:00:49.587118 119 ccl/workloadccl/fixture.go:481  [-] 3  imported 3.9 MiB in district table (40000 rows, 0 index entries, took 3.079702989s, 1.28 MiB/s)
		  | I220922 11:00:54.697132 124 ccl/workloadccl/fixture.go:481  [-] 4  imported 7.9 MiB in item table (100000 rows, 0 index entries, took 8.189486118s, 0.96 MiB/s)
		  | I220922 11:02:01.597690 123 ccl/workloadccl/fixture.go:481  [-] 5  imported 546 MiB in new_order table (36000000 rows, 0 index entries, took 1m15.090089254s, 7.27 MiB/s)
		  | I220922 11:05:43.670160 122 ccl/workloadccl/fixture.go:481  [-] 6  imported 6.5 GiB in order table (120000000 rows, 120000000 index entries, took 4m57.162605335s, 22.45 MiB/s)
		  | I220922 11:06:11.373131 121 ccl/workloadccl/fixture.go:481  [-] 7  imported 8.6 GiB in history table (120000000 rows, 0 index entries, took 5m24.865630229s, 27.13 MiB/s)
		  | I220922 11:22:40.057993 120 ccl/workloadccl/fixture.go:481  [-] 8  imported 69 GiB in customer table (120000000 rows, 120000000 index entries, took 21m53.550423517s, 53.81 MiB/s)
		  | I220922 11:38:12.798540 126 ccl/workloadccl/fixture.go:481  [-] 9  imported 69 GiB in order_line table (1200015993 rows, 0 index entries, took 37m26.290838654s, 31.34 MiB/s)
		  | Error: importing fixture: importing table stock: pq: pausing due to error; use RESUME JOB to try to proceed once the issue is resolved, or CANCEL JOB to rollback: exhausted retries: replica unavailable: (n8,s8):5 unable to serve request to r3:/System/{NodeLivenessMax-tsd} [(n4,s4):6, (n3,s3):2, (n2,s2):3, (n5,s5):4, (n8,s8):5, next=7, gen=12]: closed timestamp: 1663846136.733042161,0 (2022-09-22 11:28:56); raft status: {"id":"5","term":10,"vote":"0","commit":3262,"lead":"6","raftState":"StateFollower","applied":3262,"progress":{},"leadtransferee":"0"}: encountered poisoned latch /System/StatusNode/[email protected],0
		  |
		  | stdout:
		Wraps: (4) COMMAND_PROBLEM
		Wraps: (5) Node 1. Command with error:
		  | ``````
		  | ./cockroach workload fixtures import tpcc --warehouses=4000 --csv-server='http://localhost:8081'
		  | ``````
		Wraps: (6) exit status 1
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) errors.Cmd (5) *hintdetail.withDetail (6) *exec.ExitError

	monitor.go:127,import.go:154,import.go:181,test_runner.go:908: monitor failure: monitor task failed: t.Fatal() was called
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerImportTPCC.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/import.go:154
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerImportTPCC.func3
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/import.go:181
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:908
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	main/pkg/cmd/roachtest/monitor.go:171
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.init
		  | 	main/pkg/cmd/roachtest/monitor.go:80
		  | runtime.doInit
		  | 	GOROOT/src/runtime/proc.go:6340
		  | runtime.main
		  | 	GOROOT/src/runtime/proc.go:233
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1594
		Wraps: (6) t.Fatal() was called
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=16, ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

/cc @cockroachdb/disaster-recovery


Jira issue: CRDB-19829

cockroach-teamcity added the branch-release-22.2, C-test-failure, O-roachtest, O-robot, and release-blocker labels on Sep 22, 2022
cockroach-teamcity added this to the 22.2 milestone on Sep 22, 2022
adityamaru (Contributor) commented:

"encountered poisoned latch /System/StatusNode/[email protected],0" is a new one.

stevendanna (Collaborator) commented:

Possibly interesting, we also see:

W220922 11:30:33.872239 1023 2@rpc/clock_offset.go:315 ⋮ [n8] 1512  uncertain remote offset ‹off=171.88065ms, err=297.084442ms, at=2022-09-22 11:30:19.823901951 +0000 UTC› for maximum tolerated offset 400ms, treating as healthy
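
For context on the "treating as healthy" part: an offset measurement is only actionable when its error bound lets you prove the remote clock is inside or outside the tolerated limit. Here off=171.88ms with err=297.08ms straddles the 400ms limit, so the offset is uncertain and the node gets the benefit of the doubt. Below is a minimal sketch of that three-way decision; the type and function names are illustrative, and the real logic in pkg/rpc/clock_offset.go may differ in detail:

```go
package main

import (
	"fmt"
	"time"
)

// RemoteOffset mirrors the shape of the measurement in the log line:
// an estimated clock offset plus an error bound on the estimate.
// (Illustrative type; the real one lives in pkg/rpc.)
type RemoteOffset struct {
	Offset time.Duration // estimated offset from the remote clock
	Err    time.Duration // uncertainty of that estimate
}

// classify returns "healthy" when the offset is provably within the
// tolerated bound, "unhealthy" when it provably exceeds it, and
// "uncertain" when the error bound is too wide to decide either way.
// As in the log above, uncertain offsets are treated as healthy.
func classify(o RemoteOffset, tolerated time.Duration) string {
	abs := o.Offset
	if abs < 0 {
		abs = -abs
	}
	switch {
	case abs+o.Err < tolerated:
		return "healthy"
	case abs-o.Err > tolerated:
		return "unhealthy"
	default:
		return "uncertain"
	}
}

func main() {
	// The measurement from the log: off=171.88ms, err=297.08ms, limit 400ms.
	o := RemoteOffset{Offset: 171880650, Err: 297084442} // nanoseconds
	fmt.Println(classify(o, 400*time.Millisecond))       // prints "uncertain"
}
```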

erikgrinaker (Contributor) commented Sep 28, 2022

This just looks to me like the network fell apart. n8 lost contact with n1,n2,n3,n4:

W220922 11:29:01.296081 471 2@rpc/clock_offset.go:223 ⋮ [n8,rnode=4,raddr=‹10.164.0.41:26257›,class=system,heartbeat] 1187  latency jump (prev avg 133.82ms, current 2008.39ms)
W220922 11:29:02.268668 487 2@rpc/clock_offset.go:223 ⋮ [n8,rnode=1,raddr=‹10.154.0.22:26257›,class=default,heartbeat] 1188  latency jump (prev avg 149.16ms, current 404.20ms)
W220922 11:29:03.344091 777 2@rpc/clock_offset.go:223 ⋮ [n8,rnode=3,raddr=‹10.164.0.40:26257›,class=system,heartbeat] 1189  latency jump (prev avg 136.55ms, current 2008.77ms)
W220922 11:29:05.372452 866 kv/kvserver/raft_transport.go:602 ⋮ [n8] 1190  while processing outgoing Raft queue to node 3: ‹EOF›:
W220922 11:29:05.892624 966 kv/kvserver/raft_transport.go:602 ⋮ [n8] 1191  while processing outgoing Raft queue to node 4: ‹rpc error: code = Unavailable desc = error reading from server: EOF›:
W220922 11:29:06.232001 771 kv/kvserver/raft_transport.go:602 ⋮ [n8] 1193  while processing outgoing Raft queue to node 1: ‹rpc error: code = Unavailable desc = error reading from server: EOF›:
W220922 11:29:06.536403 559 kv/kvserver/raft_transport.go:602 ⋮ [n8] 1194  while processing outgoing Raft queue to node 2: ‹rpc error: code = Unavailable desc = error reading from server: EOF›:

As did n7:

W220922 11:28:59.515330 512 2@rpc/clock_offset.go:223 ⋮ [n7,rnode=4,raddr=‹10.164.0.41:26257›,class=system,heartbeat] 1388  latency jump (prev avg 133.81ms, current 519.30ms)
W220922 11:28:59.713189 536 2@rpc/clock_offset.go:223 ⋮ [n7,rnode=2,raddr=‹10.154.0.19:26257›,class=default,heartbeat] 1389  latency jump (prev avg 126.27ms, current 258.35ms)
W220922 11:28:59.743611 518 2@rpc/clock_offset.go:223 ⋮ [n7,rnode=1,raddr=‹10.154.0.22:26257›,class=default,heartbeat] 1390  latency jump (prev avg 126.39ms, current 319.66ms)
W220922 11:29:01.760642 988 2@rpc/clock_offset.go:223 ⋮ [n7,rnode=3,raddr=‹10.164.0.40:26257›,class=default,heartbeat] 1391  latency jump (prev avg 133.69ms, current 499.63ms)
W220922 11:29:06.416934 714 kv/kvserver/raft_transport.go:602 ⋮ [n7] 1395  while processing outgoing Raft queue to node 4: ‹rpc error: code = Unavailable desc = error reading from server: EOF›:
W220922 11:29:09.783104 586 kv/kvserver/raft_transport.go:602 ⋮ [n7] 1403  while processing outgoing Raft queue to node 1: ‹rpc error: code = Unavailable desc = error reading from server: EOF›:
W220922 11:29:11.679936 65732 kv/kvserver/raft_transport.go:602 ⋮ [n7] 1406  while processing outgoing Raft queue to node 2: ‹rpc error: code = Unavailable desc = error reading from server: EOF›:

And I'm seeing the same on n6 and n5 too, so we likely had a network partition between n1-4 and n5-8.
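
For reference, the "latency jump (prev avg ..., current ...)" warnings above come from comparing each heartbeat latency sample against a running average and logging when a sample spikes well past it. A minimal sketch of that kind of detector; the EWMA smoothing factor and the 2x jump threshold here are assumptions for illustration, not the values clock_offset.go actually uses:

```go
package main

import (
	"fmt"
	"time"
)

// jumpDetector keeps an exponentially weighted moving average of
// heartbeat latencies and flags samples that spike far above it.
type jumpDetector struct {
	avg time.Duration // EWMA of observed latencies; 0 until seeded
}

func (d *jumpDetector) observe(sample time.Duration) (jump bool, prevAvg time.Duration) {
	prevAvg = d.avg
	if d.avg == 0 {
		d.avg = sample // seed with the first sample
		return false, prevAvg
	}
	jump = sample > 2*d.avg // spike well above the running average (assumed threshold)
	d.avg = time.Duration(0.8*float64(d.avg) + 0.2*float64(sample))
	return jump, prevAvg
}

func main() {
	var d jumpDetector
	for _, s := range []time.Duration{
		130 * time.Millisecond, 135 * time.Millisecond, // steady state
		2008 * time.Millisecond, // the partition hits
	} {
		if jump, prev := d.observe(s); jump {
			fmt.Printf("latency jump (prev avg %v, current %v)\n", prev, s)
		}
	}
}
```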

> "encountered poisoned latch /System/StatusNode/[email protected],0" is a new one.

The salient bit here is the start of the message, "replica unavailable":

replica unavailable: (n8,s8):5 unable to serve request to r3:/System/{NodeLivenessMax-tsd} [(n4,s4):6, (n3,s3):2, (n2,s2):3, (n5,s5):4, (n8,s8):5, next=7, gen=12]: closed timestamp: 1663846136.733042161,0 (2022-09-22 11:28:56); raft status: {"id":"5","term":10,"vote":"0","commit":3262,"lead":"6","raftState":"StateFollower","applied":3262,"progress":{},"leadtransferee":"0"}: encountered poisoned latch /System/StatusNode/[email protected],0

These errors come from the replica circuit breakers, and indicate that the range is unavailable (almost always because it lost quorum).
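
For readers hitting this for the first time: the per-replica circuit breaker is tripped by a probe when the replica can no longer make progress through Raft (here, quorum lost to the partition). While tripped, incoming requests fail fast with the replica unavailable error instead of hanging, and latches held by stuck requests are poisoned so that waiters get the same error rather than queueing forever. A conceptual sketch of that fail-fast pattern, with illustrative names rather than the actual kvserver types:

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// breaker is a minimal replica circuit breaker: a probe trips it when
// the replica can no longer get proposals through Raft (e.g. quorum
// lost to a partition), and requests then fail fast instead of hanging.
type breaker struct {
	tripped atomic.Pointer[error]
}

var errUnavailable = errors.New("replica unavailable: unable to serve request")

// trip records the unavailability; called by the probe on failure.
func (b *breaker) trip(err error) { b.tripped.Store(&err) }

// check is consulted on every request and by latch waiters. A waiter
// that finds the latch it is queued on poisoned gets the same error.
func (b *breaker) check() error {
	if p := b.tripped.Load(); p != nil {
		return fmt.Errorf("encountered poisoned latch: %w", *p)
	}
	return nil
}

func main() {
	var b breaker
	b.trip(errUnavailable) // probe detects lost quorum
	if err := b.check(); err != nil {
		fmt.Println(err) // fails fast rather than waiting on Raft
	}
}
```

The fail-fast design also explains the "pausing due to error ... use RESUME JOB" message above: once the partition heals and the probe untrips the breaker, the paused IMPORT can be resumed rather than restarted.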

Closing this out as a network flake.
