
roachtest: restore/nodeShutdown/worker failed #80821

Closed
cockroach-teamcity opened this issue Apr 30, 2022 · 4 comments
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-disaster-recovery

cockroach-teamcity commented Apr 30, 2022

roachtest.restore/nodeShutdown/worker failed with artifacts on release-21.1 @ f275355bdb6b1c4698185c2ad003298b149359ec:

The test failed on branch=release-21.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/restore/nodeShutdown/worker/run_1
	jobs.go:149,restore.go:288,test_runner.go:733: unexpectedly found job 757686027142758402 in state reverting
		(1) attached stack trace
		  -- stack trace:
		  | main.jobSurvivesNodeShutdown.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/jobs.go:89
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2666
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1581
		Wraps: (2) unexpectedly found job 757686027142758402 in state reverting
		Error types: (1) *withstack.withStack (2) *errutil.leafError

	cluster.go:1667,context.go:91,cluster.go:1656,test_runner.go:820: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-5060799-1651297277-04-n4cpu4 --oneshot --ignore-empty-nodes: exit status 1 3: dead (exit status 137)
		2: 11308
		4: 11032
		1: 11787
		Error: UNCLASSIFIED_PROBLEM: 3: dead (exit status 137)
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1889
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:255
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1581
		Wraps: (3) 3: dead (exit status 137)
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError
Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh restore/nodeShutdown/worker

/cc @cockroachdb/bulk-io

This test on roachdash

Jira issue: CRDB-15502
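For context on the first error in the failure output above: it comes from the roachtest helper jobSurvivesNodeShutdown (jobs.go in the stack trace), which shuts a node down while the job is running and then polls the job's status, failing if the job lands in a state such as reverting instead of succeeding. A minimal sketch of that kind of poll, with the package name, helper name, and SQL assumed for illustration rather than copied from jobs.go:

// Sketch only: a simplified job-status poll in the spirit of the
// jobSurvivesNodeShutdown check; identifiers are illustrative, not
// the actual test code.
package jobpoll

import (
	"database/sql"
	"fmt"
	"time"
)

// waitForJobSuccess polls the job's status and fails fast if the job
// reaches any state other than pending/running/succeeded (e.g. reverting).
func waitForJobSuccess(db *sql.DB, jobID int64, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		var status string
		// In CockroachDB, SHOW JOBS can be used as a query source.
		if err := db.QueryRow(
			`SELECT status FROM [SHOW JOBS] WHERE job_id = $1`, jobID,
		).Scan(&status); err != nil {
			return err
		}
		switch status {
		case "succeeded":
			return nil
		case "pending", "running":
			time.Sleep(3 * time.Second)
		default:
			// This is the shape of the failure reported in this issue.
			return fmt.Errorf("unexpectedly found job %d in state %s", jobID, status)
		}
	}
	return fmt.Errorf("job %d did not succeed within %s", jobID, timeout)
}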

@cockroach-teamcity cockroach-teamcity added branch-release-21.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Apr 30, 2022
@cockroach-teamcity cockroach-teamcity added this to the 21.1 milestone Apr 30, 2022

msbutler commented May 3, 2022

The restore job failed on node 2, which picked up the job after node 3 shut down during execution. A first pass of the logs suggests a potential problem in kv land.

It seems that node 2 picked the job up cleanly:

I220430 05:50:25.300749 4173 jobs/registry.go:1210 ⋮ [n2] 147  RESTORE job ‹757686027142758402›: stepping through state running with error: <nil>
I220430 05:50:25.839663 4173 1@util/log/event_log.go:32 ⋮ [n2,job=‹757686027142758402›] 148 ={"Timestamp":1651297825834832385,"EventType":"restore","JobID":757686027142758402,"JobType":"RESTORE","Description":"‹RESTORE TABLE bank.bank FROM 'gs://cockroach-fixtures/workload/bank/version=1.0.0,payload-bytes=100,ranges=10,rows=10000000,seed=1/bank'›","User":"‹root›","DescriptorIDs":[53],"Status":"running"}
W220430 05:50:25.844383 4173 ccl/backupccl/restore_job.go:909 ⋮ [n2,job=‹757686027142758402›] 149  statistics for table: ‹bank›, table ID: 53 not found in the backup. Query performance on this table could suffer until statistics are recomputed.

But node 2's logs suggest some problems handling the shut-down node's Raft queue (note the "context canceled (without success)" messages):

W220430 05:50:28.936998 2941 kv/kvserver/raft_transport.go:657 ⋮ [n2] 182  while processing outgoing Raft queue to node 3: ‹rpc error: code = Unavailable desc = transport is closing›:
W220430 05:50:28.937240 2551 vendor/google.golang.org/grpc/internal/channelz/logging.go:73 ⋮ [-] 183  ‹grpc: addrConn.createTransport failed to connect to {10.128.0.91:26257  <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...›
I220430 05:50:28.937308 2552 2@rpc/context.go:658 ⋮ [n2] 184  closing ‹[{targetAddr:10.128.0.91:26257 nodeID:3 class:1} {targetAddr:10.128.0.91:26257 nodeID:0 class:1}]›
W220430 05:50:28.937721 995 kv/kvserver/raft_transport.go:657 ⋮ [n2] 185  while processing outgoing Raft queue to node 3: ‹rpc error: code = Unavailable desc = transport is closing›:
I220430 05:50:28.938007 4421 kv/kvserver/replica_command.go:412 ⋮ [n2,s2,r57/2:‹/{Table/53/1/2…-Max}›] 186  initiating a split of this range at key ‹/Table/53/1/3197976› [r59] (‹manual›)‹; delayed by 0.6s to resolve: replica r57/1 not caught up: StateProbe match=0 next=11 paused; context canceled (without success); r57/1 is being probed (may or may not need a Raft snapshot)›
W220430 05:50:28.938411 1028 kv/kvserver/closedts/sidetransport/receiver.go:139 ⋮ [n2] 187  closed timestamps side-transport connection dropped from node: 3
I220430 05:50:28.938474 316 2@rpc/context.go:658 ⋮ [n2] 188  closing ‹[{targetAddr:10.128.0.91:26257 nodeID:3 class:0} {targetAddr:10.128.0.91:26257 nodeID:0 class:0}]›
I220430 05:50:28.938835 4804 kv/kvserver/replica_command.go:412 ⋮ [n2,s2,r57/2:‹/{Table/53/1/2…-Max}›] 189  initiating a split of this range at key ‹/Table/53/1/3487239› [r58] (‹manual›)‹; delayed by 0.7s to resolve: replica r57/1 not caught up: StateProbe match=0 next=11 paused; context canceled (without success); r57/1 is being probed (may or may not need a Raft snapshot)›
I220430 05:50:28.938847 4824 kv/kvserver/replica_command.go:412 ⋮ [n2,s2,r57/2:‹/{Table/53/1/2…-Max}›] 190  initiating a split of this range at key ‹/Table/53/1/2787527› [r60] (‹manual›)‹; delayed by 0.6s to resolve: replica r57/1 not caught up: StateProbe match=0 next=11 paused; context canceled (without success); r57/1 is being probed (may or may not need a Raft snapshot)›

Immediately after the log messages above, this log line appears:
I220430 05:50:28.939361 4173 jobs/registry.go:1210 ⋮ [n2] 192 RESTORE job ‹757686027142758402›: stepping through state reverting with error: importing 41 ranges: context canceled

I do wish this error message were a bit more descriptive. It was formed when the restoreCtx returned an error while waiting for the restore tasks (i.e., likely while the restore_processor and split_and_scatter processors were running):

See restore_job.go:777:

	if err := ctxgroup.GoAndWait(restoreCtx, tasks...); err != nil {
		// This leaves the data that did get imported in case the user wants to
		// retry.
		// TODO(dan): Build tooling to allow a user to restart a failed restore.
		return emptyRowCount, errors.Wrapf(err, "importing %d ranges", len(importSpans))
	}
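One way to make the message more descriptive would be to label each task's error so the wrapped message says which processor goroutine returned the first error, instead of just "importing 41 ranges: context canceled". A rough sketch of that idea, using plain errgroup from golang.org/x/sync rather than the internal ctxgroup helper (runTasks and the task map are hypothetical):

// Sketch only: label per-task errors so the first failure is attributable.
// The group keeps the first non-nil error and cancels the shared context
// for the remaining tasks, so an unlabeled error often surfaces only as
// "context canceled".
package restoresketch

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

func runTasks(ctx context.Context, tasks map[string]func(context.Context) error) error {
	g, gCtx := errgroup.WithContext(ctx)
	for name, task := range tasks {
		name, task := name, task // capture loop variables for the closures
		g.Go(func() error {
			if err := task(gCtx); err != nil {
				// Attach the task name so the job's failure message says
				// which processor actually errored out first.
				return fmt.Errorf("%s: %w", name, err)
			}
			return nil
		})
	}
	return g.Wait()
}

The caveat is that if restoreCtx itself is canceled from the outside, every task just reports "<name>: context canceled" and the root cause is still hidden, which ties into the context-cancellation point below.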

@msbutler msbutler added the T-kv KV Team label May 3, 2022

tbg commented May 5, 2022

The errors are a consequence of n3 having been shut down by the test in the moments prior to the job failing. They don't suggest a KV problem per se.

Tracking down the source of a context cancellation is a real annoyance. I worked on, but never merged, a way to possibly make this better here. Luckily, there is an active proposal to add context cancellation reasons upstream. This doesn't help today, but hopefully one day it will.
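For illustration, a minimal sketch of the shape of that proposed API, assuming it lands roughly as proposed as context.WithCancelCause and context.Cause (names could still change):

// Sketch only: a cancellation-cause API lets the side that cancels record
// *why*, so a bare "context canceled" can be traced back to a concrete
// reason (e.g. node drain, job lease change). The error value and scenario
// here are made up for the example.
package causesketch

import (
	"context"
	"errors"
	"fmt"
)

var errNodeDraining = errors.New("node draining: releasing job execution")

func example() {
	ctx, cancel := context.WithCancelCause(context.Background())

	// The code that tears down the work records the reason.
	cancel(errNodeDraining)

	// The code that observes ctx.Err() == context.Canceled can recover it.
	fmt.Println(context.Cause(ctx)) // node draining: releasing job execution
}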

In the meantime, from a KV point of view, this isn't likely to be a blocker, but if Bulk can figure out what exactly timed out here we could take a closer look.

Since this is 21.1, which isn't seeing many changes, I would not mark this as a blocker, given the low risk of having broken anything.


msbutler commented May 5, 2022

@tbg thank you for the explanation! I'll remove the release blocker and will reinvestigate if another failure pops up.

@msbutler msbutler removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label May 5, 2022
@mwang1026 mwang1026 removed the T-kv KV Team label May 17, 2022
@mwang1026
Taking the KV label off. @msbutler, if you find something pointing to KV, feel free to kick it back over.
