
roachtest: restore/nodeShutdown/worker failed #80821

Closed
cockroach-teamcity opened this issue Apr 30, 2022 · 4 comments
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-disaster-recovery

cockroach-teamcity commented Apr 30, 2022

roachtest.restore/nodeShutdown/worker failed with artifacts on release-21.1 @ f275355bdb6b1c4698185c2ad003298b149359ec:

The test failed on branch=release-21.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/restore/nodeShutdown/worker/run_1
	jobs.go:149,restore.go:288,test_runner.go:733: unexpectedly found job 757686027142758402 in state reverting
		(1) attached stack trace
		  -- stack trace:
		  | main.jobSurvivesNodeShutdown.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/jobs.go:89
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2666
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1581
		Wraps: (2) unexpectedly found job 757686027142758402 in state reverting
		Error types: (1) *withstack.withStack (2) *errutil.leafError

	cluster.go:1667,context.go:91,cluster.go:1656,test_runner.go:820: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-5060799-1651297277-04-n4cpu4 --oneshot --ignore-empty-nodes: exit status 1 3: dead (exit status 137)
		2: 11308
		4: 11032
		1: 11787
		Error: UNCLASSIFIED_PROBLEM: 3: dead (exit status 137)
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1889
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:255
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1581
		Wraps: (3) 3: dead (exit status 137)
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError
Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh restore/nodeShutdown/worker

/cc @cockroachdb/bulk-io

This test on roachdash

Jira issue: CRDB-15502
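For context on the first error in the failure output above: it comes from the roachtest helper jobSurvivesNodeShutdown (jobs.go in the stack trace), which shuts a node down while the job is running and then polls the job's status, failing if the job lands in a state such as reverting instead of succeeding. A minimal sketch of that kind of poll, with the package name, helper name, and SQL assumed for illustration rather than copied from jobs.go:

// Sketch only: a simplified job-status poll in the spirit of the
// jobSurvivesNodeShutdown check; identifiers are illustrative, not
// the actual test code.
package jobpoll

import (
	"database/sql"
	"fmt"
	"time"
)

// waitForJobSuccess polls the job's status and fails fast if the job
// reaches any state other than pending/running/succeeded (e.g. reverting).
func waitForJobSuccess(db *sql.DB, jobID int64, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		var status string
		// In CockroachDB, SHOW JOBS can be used as a query source.
		if err := db.QueryRow(
			`SELECT status FROM [SHOW JOBS] WHERE job_id = $1`, jobID,
		).Scan(&status); err != nil {
			return err
		}
		switch status {
		case "succeeded":
			return nil
		case "pending", "running":
			time.Sleep(3 * time.Second)
		default:
			// This is the shape of the failure reported in this issue.
			return fmt.Errorf("unexpectedly found job %d in state %s", jobID, status)
		}
	}
	return fmt.Errorf("job %d did not succeed within %s", jobID, timeout)
}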

@cockroach-teamcity cockroach-teamcity added branch-release-21.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Apr 30, 2022
@cockroach-teamcity cockroach-teamcity added this to the 21.1 milestone Apr 30, 2022

msbutler commented May 3, 2022

The restore job failed on node 2, which picked up the job after node 3 shut down during execution. A first pass of the logs suggests a potential problem in kv land.

It seems that node 2 picked the job up cleanly:

I220430 05:50:25.300749 4173 jobs/registry.go:1210 ⋮ [n2] 147  RESTORE job ‹757686027142758402›: stepping through state running with error: <nil>
I220430 05:50:25.839663 4173 1@util/log/event_log.go:32 ⋮ [n2,job=‹757686027142758402›] 148 ={"Timestamp":1651297825834832385,"EventType":"restore","JobID":757686027142758402,"JobType":"RESTORE","Description":"‹RESTORE TABLE bank.bank FROM 'gs://cockroach-fixtures/workload/bank/version=1.0.0,payload-bytes=100,ranges=10,rows=10000000,seed=1/bank'›","User":"‹root›","DescriptorIDs":[53],"Status":"running"}
W220430 05:50:25.844383 4173 ccl/backupccl/restore_job.go:909 ⋮ [n2,job=‹757686027142758402›] 149  statistics for table: ‹bank›, table ID: 53 not found in the backup. Query performance on this table could suffer until statistics are recomputed.

But node 2's logs suggest some problems handling the shut-down node's Raft queue (note the "context canceled (without success)" messages):

W220430 05:50:28.936998 2941 kv/kvserver/raft_transport.go:657 ⋮ [n2] 182  while processing outgoing Raft queue to node 3: ‹rpc error: code = Unavailable desc = transport is closing›:
W220430 05:50:28.937240 2551 vendor/google.golang.org/grpc/internal/channelz/logging.go:73 ⋮ [-] 183  ‹grpc: addrConn.createTransport failed to connect to {10.128.0.91:26257  <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...›
I220430 05:50:28.937308 2552 2@rpc/context.go:658 ⋮ [n2] 184  closing ‹[{targetAddr:10.128.0.91:26257 nodeID:3 class:1} {targetAddr:10.128.0.91:26257 nodeID:0 class:1}]›
W220430 05:50:28.937721 995 kv/kvserver/raft_transport.go:657 ⋮ [n2] 185  while processing outgoing Raft queue to node 3: ‹rpc error: code = Unavailable desc = transport is closing›:
I220430 05:50:28.938007 4421 kv/kvserver/replica_command.go:412 ⋮ [n2,s2,r57/2:‹/{Table/53/1/2…-Max}›] 186  initiating a split of this range at key ‹/Table/53/1/3197976› [r59] (‹manual›)‹; delayed by 0.6s to resolve: replica r57/1 not caught up: StateProbe match=0 next=11 paused; context canceled (without success); r57/1 is being probed (may or may not need a Raft snapshot)›
W220430 05:50:28.938411 1028 kv/kvserver/closedts/sidetransport/receiver.go:139 ⋮ [n2] 187  closed timestamps side-transport connection dropped from node: 3
I220430 05:50:28.938474 316 2@rpc/context.go:658 ⋮ [n2] 188  closing ‹[{targetAddr:10.128.0.91:26257 nodeID:3 class:0} {targetAddr:10.128.0.91:26257 nodeID:0 class:0}]›
I220430 05:50:28.938835 4804 kv/kvserver/replica_command.go:412 ⋮ [n2,s2,r57/2:‹/{Table/53/1/2…-Max}›] 189  initiating a split of this range at key ‹/Table/53/1/3487239› [r58] (‹manual›)‹; delayed by 0.7s to resolve: replica r57/1 not caught up: StateProbe match=0 next=11 paused; context canceled (without success); r57/1 is being probed (may or may not need a Raft snapshot)›
I220430 05:50:28.938847 4824 kv/kvserver/replica_command.go:412 ⋮ [n2,s2,r57/2:‹/{Table/53/1/2…-Max}›] 190  initiating a split of this range at key ‹/Table/53/1/2787527› [r60] (‹manual›)‹; delayed by 0.6s to resolve: replica r57/1 not caught up: StateProbe match=0 next=11 paused; context canceled (without success); r57/1 is being probed (may or may not need a Raft snapshot)›

Immediately after the log messages above, this log line appears:
I220430 05:50:28.939361 4173 jobs/registry.go:1210 ⋮ [n2] 192 RESTORE job ‹757686027142758402›: stepping through state reverting with error: importing 41 ranges: context canceled

I do wish this error message were a bit more descriptive. It was formed when the restoreCtx returned an error while waiting for the restore tasks (i.e., likely while the restore_processor and split_and_scatter processors were running):

See restore_job.go:777:

	if err := ctxgroup.GoAndWait(restoreCtx, tasks...); err != nil {
		// This leaves the data that did get imported in case the user wants to
		// retry.
		// TODO(dan): Build tooling to allow a user to restart a failed restore.
		return emptyRowCount, errors.Wrapf(err, "importing %d ranges", len(importSpans))
	}
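One way to make the message more descriptive would be to label each task's error so the wrapped message says which processor goroutine returned the first error, instead of just "importing 41 ranges: context canceled". A rough sketch of that idea, using plain errgroup from golang.org/x/sync rather than the internal ctxgroup helper (runTasks and the task map are hypothetical):

// Sketch only: label per-task errors so the first failure is attributable.
// The group keeps the first non-nil error and cancels the shared context
// for the remaining tasks, so an unlabeled error often surfaces only as
// "context canceled".
package restoresketch

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

func runTasks(ctx context.Context, tasks map[string]func(context.Context) error) error {
	g, gCtx := errgroup.WithContext(ctx)
	for name, task := range tasks {
		name, task := name, task // capture loop variables for the closures
		g.Go(func() error {
			if err := task(gCtx); err != nil {
				// Attach the task name so the job's failure message says
				// which processor actually errored out first.
				return fmt.Errorf("%s: %w", name, err)
			}
			return nil
		})
	}
	return g.Wait()
}

The caveat is that if restoreCtx itself is canceled from the outside, every task just reports "<name>: context canceled" and the root cause is still hidden, which ties into the context-cancellation point below.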

@msbutler msbutler added the T-kv KV Team label May 3, 2022

tbg commented May 5, 2022

The errors are a consequence of n3 having been shut down by the test in the moments prior to the job failing. They don't suggest a KV problem per se.

Tracking down the source of a context cancellation is a real annoyance. I worked on, but never merged, a way to possibly make this better here. Luckily, there is an active proposal to add context cancellation reasons upstream. This doesn't help today, but hopefully one day it will.
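For illustration, a minimal sketch of the shape of that proposed API, assuming it lands roughly as proposed as context.WithCancelCause and context.Cause (names could still change):

// Sketch only: a cancellation-cause API lets the side that cancels record
// *why*, so a bare "context canceled" can be traced back to a concrete
// reason (e.g. node drain, job lease change). The error value and scenario
// here are made up for the example.
package causesketch

import (
	"context"
	"errors"
	"fmt"
)

var errNodeDraining = errors.New("node draining: releasing job execution")

func example() {
	ctx, cancel := context.WithCancelCause(context.Background())

	// The code that tears down the work records the reason.
	cancel(errNodeDraining)

	// The code that observes ctx.Err() == context.Canceled can recover it.
	fmt.Println(context.Cause(ctx)) // node draining: releasing job execution
}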

In the meantime, from a KV point of view, this isn't likely to be a blocker, but if Bulk can figure out what exactly timed out here we could take a closer look.

Since this is 21.1, which isn't seeing many changes, I would not mark this as a blocker, given the low risk of having broken anything.


msbutler commented May 5, 2022

@tbg thank you for the explanation! I'll remove the release blocker and will reinvestigate if another failure pops up.

@msbutler msbutler removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label May 5, 2022
@mwang1026 mwang1026 removed the T-kv KV Team label May 17, 2022
@mwang1026
Taking the KV label off. @msbutler, if you find something pointing to KV, feel free to kick it back over.
