
roachtest: backup/nodeShutdown/coordinator/n4cpu4 failed #68787

Closed
cockroach-teamcity opened this issue Aug 12, 2021 · 1 comment

Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-disaster-recovery

cockroach-teamcity commented Aug 12, 2021

roachtest.backup/nodeShutdown/coordinator/n4cpu4 failed with artifacts on release-21.1 @ c4d0e7baee3925541eed599ae771abb95c97732b:

The test failed on branch=release-21.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/backup/nodeShutdown/coordinator/n4cpu4/run_1
	jobs.go:131,backup.go:132,test_runner.go:733: unexpectedly found job 683797475003269122 in state failed
		(1) attached stack trace
		  -- stack trace:
		  | main.jobSurvivesNodeShutdown.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/jobs.go:79
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2666
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (2) unexpectedly found job 683797475003269122 in state failed
		Error types: (1) *withstack.withStack (2) *errutil.leafError

	cluster.go:1667,context.go:89,cluster.go:1656,test_runner.go:820: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-3290967-1628748601-01-n4cpu4 --oneshot --ignore-empty-nodes: exit status 1 2: dead (exit status 137)
		4: 10659
		3: 10607
		1: 11767
		Error: UNCLASSIFIED_PROBLEM: 2: dead (exit status 137)
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1889
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:225
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (3) 2: dead (exit status 137)
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError
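
For context, the first error above is the roachtest harness asserting that the backup job survives the coordinator shutdown. The following is a minimal sketch, in Go, of that kind of job-state check; the helper name and query are illustrative assumptions, not the actual pkg/cmd/roachtest/jobs.go code.

// Hypothetical sketch of a job-state check in the spirit of the assertion
// in pkg/cmd/roachtest/jobs.go; names and the query are illustrative only.
package jobcheck

import (
	"database/sql"
	"fmt"
	"time"
)

// waitForJobSuccess polls the job's status over SQL and fails fast if the
// job lands in a terminal failed state instead of succeeding.
func waitForJobSuccess(db *sql.DB, jobID int64, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		var status string
		// CockroachDB allows [SHOW JOBS] as a data source in a SELECT.
		row := db.QueryRow(`SELECT status FROM [SHOW JOBS] WHERE job_id = $1`, jobID)
		if err := row.Scan(&status); err != nil {
			return err
		}
		switch status {
		case "succeeded":
			return nil
		case "failed", "canceled":
			// This is the failure reported above: the backup job ended up in
			// state "failed" after the coordinator node was shut down.
			return fmt.Errorf("unexpectedly found job %d in state %s", jobID, status)
		}
		time.Sleep(time.Second)
	}
	return fmt.Errorf("job %d did not finish within %s", jobID, timeout)
}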
Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh backup/nodeShutdown/coordinator/n4cpu4

/cc @cockroachdb/bulk-io

This test on roachdash | Improve this report!

Jira issue: CRDB-9262

@cockroach-teamcity cockroach-teamcity added branch-release-21.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Aug 12, 2021
@cockroach-teamcity cockroach-teamcity added this to the 21.1 milestone Aug 12, 2021
@adityamaru
Contributor

The BACKUP is retrying once with:

W210812 06:14:02.318880 10994 ccl/backupccl/backup_job.go:482 ⋮ [n1,job=‹683797475003269122›] 631  BACKUP job encountered retryable error: exporting 41 ranges: inbox communication error: ‹rpc error: code = Canceled desc = context cancel

but then it appears to fail with:
I210812 06:14:02.372798 10994 jobs/registry.go:1190 ⋮ [n1] 632 BACKUP job ‹683797475003269122›: stepping through state reverting with error: failed to run backup: exporting 40 ranges: unable to dial n2: ‹breaker open›

My suspicion is that we are still planning a spec on node 2 (the node we shut down) on retry, and when distsql tries to set up a flow it realizes it cannot dial the node. Ideally, SetupAllNodesPlanning should not be returning node 2 once it is shut down, but I wonder if its view is not up to date. Either way, I think we can add this error to our set of retryable errors so that we retry again in the hope that we won't plan a flow on the shut-down node.
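
A minimal sketch of the classification this suggests, assuming the dial failure surfaces an error whose text or sentinel contains "breaker open"; the package, sentinel, and function names below are hypothetical, not the actual utilccl code.

// Illustrative sketch of treating a "breaker open" dial failure as
// retryable; names here are hypothetical, not the actual utilccl code.
package retryerr

import (
	"errors"
	"strings"
)

// errBreakerOpen stands in for the circuit-breaker sentinel surfaced when a
// node cannot be dialed; the real error originates in the RPC layer.
var errBreakerOpen = errors.New("breaker open")

// isRetryableBulkError reports whether a bulk job (BACKUP in this case)
// should retry rather than step to the failed state.
func isRetryableBulkError(err error) bool {
	if err == nil {
		return false
	}
	if errors.Is(err, errBreakerOpen) {
		return true
	}
	// The wrapped error in the log above reads "unable to dial n2: breaker
	// open", so a substring match is a pragmatic fallback.
	return strings.Contains(err.Error(), "breaker open")
}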

@adityamaru adityamaru removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Aug 13, 2021
adityamaru added a commit to adityamaru/cockroach that referenced this issue Aug 13, 2021
In cockroachdb#68787
we saw a backup job failing on retry because it attempted
to plan a flow on a node that had been shut down. This should
not usually happen, since every time we plan the flow we fetch
the nodes that can participate in the distsql flow.

In case we do encounter a `breaker open` error, we can retry
in the hope that the next time the flow is planned it doesn't
attempt to dial the dead node.

Release note: None
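
As a rough illustration of the retry behavior the commit relies on, here is a sketch under assumptions: runWithRetry, runBackupFlow, and the attempt limit are hypothetical, not the actual backup_job.go code.

// Sketch of re-planning and re-running the distributed flow when a
// retryable error (now including breaker-open dial failures) is seen.
package retryloop

import (
	"context"
	"fmt"
	"time"
)

// runBackupFlow stands in for planning and executing the distributed backup
// flow against the nodes that are currently live.
type runBackupFlow func(ctx context.Context) error

func runWithRetry(ctx context.Context, run runBackupFlow, isRetryable func(error) bool) error {
	const maxAttempts = 5
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = run(ctx); err == nil {
			return nil
		}
		if !isRetryable(err) {
			return err // permanent failure: the job steps to state "failed"
		}
		// The flow is planned afresh on the next attempt, so a node that has
		// since been detected as dead should no longer be assigned work.
		time.Sleep(time.Duration(attempt) * time.Second)
	}
	return fmt.Errorf("backup retry budget exhausted: %w", err)
}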
craig bot pushed a commit that referenced this issue Aug 16, 2021
68899: utilccl: add ErrBreakerOpen to bulk retryable error r=pbardea a=adityamaru

In #68787
we saw a backup job failing on retry because it attempted
to plan a flow on a node that had been shut down. This should
not usually happen, since every time we plan the flow we fetch
the nodes that can participate in the distsql flow.

In case we do encounter a `breaker open` error, we can retry
in the hope that the next time the flow is planned it doesn't
attempt to dial the dead node.

Release note: None

68961: backupccl: deflake TestBackupRestoreSystemJobProgress r=pbardea a=adityamaru

In the case of backup, we only update the job's fraction
progressed if we have exported a complete span. This change
adds some logic to wait until at least one complete span has
been exported before checking for a progress update.

Release note: None

Co-authored-by: Aditya Maru <[email protected]>
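
A rough sketch of the deflaking approach described for #68961: poll until the job reports a nonzero fraction completed (i.e. at least one complete span has been exported) before asserting on progress. The helper name and query below are illustrative assumptions, not the actual test code.

// Sketch of waiting for the first fraction-completed update before checking
// job progress in a test; the helper and query are illustrative only.
package progresstest

import (
	"database/sql"
	"fmt"
	"time"
)

// waitForSomeProgress polls until the job reports nonzero progress, so a
// later progress assertion does not race with the first update.
func waitForSomeProgress(db *sql.DB, jobID int64, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		var fraction float64
		row := db.QueryRow(
			`SELECT fraction_completed FROM [SHOW JOBS] WHERE job_id = $1`, jobID)
		if err := row.Scan(&fraction); err != nil {
			return err
		}
		if fraction > 0 {
			return nil
		}
		time.Sleep(100 * time.Millisecond)
	}
	return fmt.Errorf("job %d reported no progress within %s", jobID, timeout)
}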