
roachtest: backup/nodeShutdown/coordinator/n4cpu4 failed #68787

Closed
cockroach-teamcity opened this issue Aug 12, 2021 · 1 comment

Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-disaster-recovery

cockroach-teamcity commented Aug 12, 2021

roachtest.backup/nodeShutdown/coordinator/n4cpu4 failed with artifacts on release-21.1 @ c4d0e7baee3925541eed599ae771abb95c97732b:

The test failed on branch=release-21.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/backup/nodeShutdown/coordinator/n4cpu4/run_1
	jobs.go:131,backup.go:132,test_runner.go:733: unexpectedly found job 683797475003269122 in state failed
		(1) attached stack trace
		  -- stack trace:
		  | main.jobSurvivesNodeShutdown.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/jobs.go:79
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2666
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (2) unexpectedly found job 683797475003269122 in state failed
		Error types: (1) *withstack.withStack (2) *errutil.leafError

	cluster.go:1667,context.go:89,cluster.go:1656,test_runner.go:820: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-3290967-1628748601-01-n4cpu4 --oneshot --ignore-empty-nodes: exit status 1 2: dead (exit status 137)
		4: 10659
		3: 10607
		1: 11767
		Error: UNCLASSIFIED_PROBLEM: 2: dead (exit status 137)
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1889
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:225
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (3) 2: dead (exit status 137)
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError
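
For context, the first error above is the roachtest harness asserting that the backup job survives the coordinator shutdown. The following is a minimal sketch, in Go, of that kind of job-state check; the helper name and query are illustrative assumptions, not the actual pkg/cmd/roachtest/jobs.go code.

// Hypothetical sketch of a job-state check in the spirit of the assertion
// in pkg/cmd/roachtest/jobs.go; names and the query are illustrative only.
package jobcheck

import (
	"database/sql"
	"fmt"
	"time"
)

// waitForJobSuccess polls the job's status over SQL and fails fast if the
// job lands in a terminal failed state instead of succeeding.
func waitForJobSuccess(db *sql.DB, jobID int64, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		var status string
		// CockroachDB allows [SHOW JOBS] as a data source in a SELECT.
		row := db.QueryRow(`SELECT status FROM [SHOW JOBS] WHERE job_id = $1`, jobID)
		if err := row.Scan(&status); err != nil {
			return err
		}
		switch status {
		case "succeeded":
			return nil
		case "failed", "canceled":
			// This is the failure reported above: the backup job ended up in
			// state "failed" after the coordinator node was shut down.
			return fmt.Errorf("unexpectedly found job %d in state %s", jobID, status)
		}
		time.Sleep(time.Second)
	}
	return fmt.Errorf("job %d did not finish within %s", jobID, timeout)
}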
Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh backup/nodeShutdown/coordinator/n4cpu4

/cc @cockroachdb/bulk-io

This test on roachdash | Improve this report!

Jira issue: CRDB-9262

@cockroach-teamcity cockroach-teamcity added branch-release-21.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Aug 12, 2021
@cockroach-teamcity cockroach-teamcity added this to the 21.1 milestone Aug 12, 2021
@adityamaru
Contributor

The BACKUP is retrying once with:

W210812 06:14:02.318880 10994 ccl/backupccl/backup_job.go:482 ⋮ [n1,job=‹683797475003269122›] 631  BACKUP job encountered retryable error: exporting 41 ranges: inbox communication error: ‹rpc error: code = Canceled desc = context cancel

but then it appears to fail with:
I210812 06:14:02.372798 10994 jobs/registry.go:1190 ⋮ [n1] 632 BACKUP job ‹683797475003269122›: stepping through state reverting with error: failed to run backup: exporting 40 ranges: unable to dial n2: ‹breaker open›

My suspicion is that we are still planning a spec on node 2 (the node we shut down) on retry, and when distsql tries to set up a flow it realizes it cannot dial the node. Ideally, SetupAllNodesPlanning should not be returning node 2 once it is shut down, but I wonder if its view is not up to date. Either way, I think we can add this error to our set of retryable errors so that we retry again in the hope that we won't plan a flow on the shut-down node.
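
A minimal sketch of the classification this suggests, assuming the dial failure surfaces an error whose text or sentinel contains "breaker open"; the package, sentinel, and function names below are hypothetical, not the actual utilccl code.

// Illustrative sketch of treating a "breaker open" dial failure as
// retryable; names here are hypothetical, not the actual utilccl code.
package retryerr

import (
	"errors"
	"strings"
)

// errBreakerOpen stands in for the circuit-breaker sentinel surfaced when a
// node cannot be dialed; the real error originates in the RPC layer.
var errBreakerOpen = errors.New("breaker open")

// isRetryableBulkError reports whether a bulk job (BACKUP in this case)
// should retry rather than step to the failed state.
func isRetryableBulkError(err error) bool {
	if err == nil {
		return false
	}
	if errors.Is(err, errBreakerOpen) {
		return true
	}
	// The wrapped error in the log above reads "unable to dial n2: breaker
	// open", so a substring match is a pragmatic fallback.
	return strings.Contains(err.Error(), "breaker open")
}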

@adityamaru adityamaru removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Aug 13, 2021
adityamaru added a commit to adityamaru/cockroach that referenced this issue Aug 13, 2021
In cockroachdb#68787
we saw a backup job failing on retry because it attempted
to plan a flow on a node that had been shut down. This should
not usually happen, since every time we plan the flow we fetch
the nodes that can participate in the distsql flow.

In case we do encounter a `breaker open` error, we can retry
in the hope that the next time the flow is planned it doesn't
attempt to dial the dead node.

Release note: None
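
As a rough illustration of the retry behavior the commit relies on, here is a sketch under assumptions: runWithRetry, runBackupFlow, and the attempt limit are hypothetical, not the actual backup_job.go code.

// Sketch of re-planning and re-running the distributed flow when a
// retryable error (now including breaker-open dial failures) is seen.
package retryloop

import (
	"context"
	"fmt"
	"time"
)

// runBackupFlow stands in for planning and executing the distributed backup
// flow against the nodes that are currently live.
type runBackupFlow func(ctx context.Context) error

func runWithRetry(ctx context.Context, run runBackupFlow, isRetryable func(error) bool) error {
	const maxAttempts = 5
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = run(ctx); err == nil {
			return nil
		}
		if !isRetryable(err) {
			return err // permanent failure: the job steps to state "failed"
		}
		// The flow is planned afresh on the next attempt, so a node that has
		// since been detected as dead should no longer be assigned work.
		time.Sleep(time.Duration(attempt) * time.Second)
	}
	return fmt.Errorf("backup retry budget exhausted: %w", err)
}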
craig bot pushed a commit that referenced this issue Aug 16, 2021
68899: utilccl: add ErrBreakerOpen to bulk retryable error r=pbardea a=adityamaru

In #68787
we saw a backup job failing on retry because it attempted
to plan a flow on a node that had been shut down. This should
not usually happen, since every time we plan the flow we fetch
the nodes that can participate in the distsql flow.

In case we do encounter a `breaker open` error, we can retry
in the hope that the next time the flow is planned it doesn't
attempt to dial the dead node.

Release note: None

68961: backupccl: deflake TestBackupRestoreSystemJobProgress r=pbardea a=adityamaru

In the case of backup, we only update the job's fraction
progressed if we have exported a complete span. This change
adds some logic to wait until at least one complete span has
been exported before checking for a progress update.

Release note: None

Co-authored-by: Aditya Maru <[email protected]>
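
A rough sketch of the deflaking approach described for #68961: poll until the job reports a nonzero fraction completed (i.e. at least one complete span has been exported) before asserting on progress. The helper name and query below are illustrative assumptions, not the actual test code.

// Sketch of waiting for the first fraction-completed update before checking
// job progress in a test; the helper and query are illustrative only.
package progresstest

import (
	"database/sql"
	"fmt"
	"time"
)

// waitForSomeProgress polls until the job reports nonzero progress, so a
// later progress assertion does not race with the first update.
func waitForSomeProgress(db *sql.DB, jobID int64, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		var fraction float64
		row := db.QueryRow(
			`SELECT fraction_completed FROM [SHOW JOBS] WHERE job_id = $1`, jobID)
		if err := row.Scan(&fraction); err != nil {
			return err
		}
		if fraction > 0 {
			return nil
		}
		time.Sleep(100 * time.Millisecond)
	}
	return fmt.Errorf("job %d reported no progress within %s", jobID, timeout)
}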