roachtest: restore/nodeShutdown/coordinator failed #85879

Closed
cockroach-teamcity opened this issue Aug 10, 2022 · 8 comments
Assignees
Labels
branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-kv KV Team
Milestone

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Aug 10, 2022

roachtest.restore/nodeShutdown/coordinator failed with artifacts on release-22.1 @ e8a6797dd6b1482b740c7c9ec681a5e84dd8a8d8:

The test failed on branch=release-22.1, cloud=gce:
test artifacts and logs in: /artifacts/restore/nodeShutdown/coordinator/run_1
	monitor.go:127,jobs.go:154,restore.go:322,test_runner.go:883: monitor failure: monitor task failed: unexpectedly found job 786558903758422018 in state failed
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.jobSurvivesNodeShutdown
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/jobs.go:154
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerRestoreNodeShutdown.func3
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/restore.go:322
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:883
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	main/pkg/cmd/roachtest/monitor.go:171
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.jobSurvivesNodeShutdown.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/jobs.go:95
		  | main.(*monitorImpl).Go.func1
		  | 	main/pkg/cmd/roachtest/monitor.go:105
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (6) unexpectedly found job 786558903758422018 in state failed
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
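The "Wraps:" chain above ends in a leaf error that carries the actual message. As a rough illustration only, here is how such a chain can be walked down to its leaf using the Go standard library. The trace itself comes from the cockroachdb/errors types listed in "Error types"; this sketch does not use that library, and `leaf` and `exampleChain` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// leaf follows errors.Unwrap until it reaches the innermost error.
func leaf(err error) error {
	for {
		next := errors.Unwrap(err)
		if next == nil {
			return err
		}
		err = next
	}
}

// exampleChain builds a small chain with the same prefixes as the report:
// "monitor failure" wraps "monitor task failed" wraps the leaf message.
func exampleChain() (wrapped, base error) {
	base = errors.New("unexpectedly found job in state failed")
	wrapped = fmt.Errorf("monitor failure: %w",
		fmt.Errorf("monitor task failed: %w", base))
	return wrapped, base
}

func main() {
	wrapped, _ := exampleChain()
	fmt.Println(wrapped)       // full message, outermost prefix first
	fmt.Println(leaf(wrapped)) // only the leaf message
}
```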
Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

/cc @cockroachdb/bulk-io

This test on roachdash | Improve this report!

Jira issue: CRDB-18473

@cockroach-teamcity added the branch-release-22.1, C-test-failure, O-roachtest, O-robot, and release-blocker labels Aug 10, 2022
@cockroach-teamcity cockroach-teamcity added this to the 22.1 milestone Aug 10, 2022
@benbardin
Collaborator

benbardin commented Aug 10, 2022

Under the hood, from artifacts/logs/3.cockroach.log:

teamcity-6034855-1660108662-02-n4cpu4-0003> I220810 05:26:24.244809 4265 jobs/registry.go:1202 ⋮ [n3] 304 RESTORE job 786558903758422018: stepping through state reverting with error: importing 41 ranges: splitting key ‹×›: change replicas of r78 failed: received invalid ChangeReplicasTrigger LEAVE_JOINT: after=[(n1,s1):1LEARNER (n3,s3):2 (n4,s4):3 (n2,s2):4] next=5 to remove self (leaseholder); lhRemovalAllowed: true; proposed descriptor: r78:‹×› [(n1,s1):1LEARNER, (n3,s3):2, (n4,s4):3, (n2,s2):4, next=5, gen=21, sticky=1660112729.056216949,0]: replica cannot hold lease
(1) attached stack trace
  -- stack trace:
  | github.com/cockroachdb/cockroach/pkg/ccl/backupccl.restore
  | 	github.com/cockroachdb/cockroach/pkg/ccl/backupccl/pkg/ccl/backupccl/restore_job.go:385
  | github.com/cockroachdb/cockroach/pkg/ccl/backupccl.restoreWithRetry
  | 	github.com/cockroachdb/cockroach/pkg/ccl/backupccl/pkg/ccl/backupccl/restore_job.go:145
  | github.com/cockroachdb/cockroach/pkg/ccl/backupccl.(*restoreResumer).doResume
  | 	github.com/cockroachdb/cockroach/pkg/ccl/backupccl/pkg/ccl/backupccl/restore_job.go:1430
  | github.com/cockroachdb/cockroach/pkg/ccl/backupccl.(*restoreResumer).Resume
  | 	github.com/cockroachdb/cockroach/pkg/ccl/backupccl/pkg/ccl/backupccl/restore_job.go:1215
  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).stepThroughStateMachine.func2
  | 	github.com/cockroachdb/cockroach/pkg/jobs/registry.go:1236
  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).stepThroughStateMachine
  | 	github.com/cockroachdb/cockroach/pkg/jobs/registry.go:1237
  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).runJob
  | 	github.com/cockroachdb/cockroach/pkg/jobs/adopt.go:415
  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).resumeJob.func1
  | 	github.com/cockroachdb/cockroach/pkg/jobs/adopt.go:336
  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
  | 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:494
  | runtime.goexit
  | 	GOROOT/src/runtime/asm_amd64.s:1581

@benbardin
Collaborator

Even further under the hood, this appears to be an issue from AdminSplit. We think #85405 may help. Reassigning to KV in case more work is needed.

@benbardin added the T-kv label and removed the T-disaster-recovery label Aug 10, 2022
@irfansharif irfansharif self-assigned this Aug 17, 2022
@nvanbenschoten
Member

nvanbenschoten commented Aug 17, 2022

I don't think this is the same as the issue that is fixed by #85405, though it is related.

What we see here is that we're trying to exit a joint configuration and are hitting an error because the current leaseholder is still a VOTER_DEMOTING_LEARNER and is being demoted to a LEARNER. We do try to transfer the lease away from this replica in maybeTransferLeaseDuringLeaveJoint. This must indicate that the lease was transferred back to this replica for some reason, which has been allowed since #83686.
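To make the failure mode concrete, here is a minimal sketch of the check being tripped. It is a toy model, not the kvserver code: `Replica`, `ReplicaType`, `leaveJoint`, and `errCannotHoldLease` are all hypothetical names, and which node held the lease at the time is an assumption based on the "to remove self (leaseholder)" text in the log.

```go
package main

import (
	"errors"
	"fmt"
)

// ReplicaType is a simplified stand-in for the replica types in the log.
type ReplicaType int

const (
	VoterFull ReplicaType = iota
	VoterDemotingLearner
	Learner
)

// Replica is a hypothetical, stripped-down replica descriptor.
type Replica struct {
	NodeID int
	Type   ReplicaType
}

var errCannotHoldLease = errors.New("replica cannot hold lease")

// leaveJoint computes the replica set after leaving the joint configuration
// (VOTER_DEMOTING_LEARNER becomes LEARNER) and rejects the change if the
// leaseholder would no longer be a full voter afterwards.
func leaveJoint(replicas []Replica, leaseholder int) ([]Replica, error) {
	after := make([]Replica, len(replicas))
	for i, r := range replicas {
		if r.Type == VoterDemotingLearner {
			r.Type = Learner
		}
		after[i] = r
	}
	for _, r := range after {
		if r.NodeID == leaseholder && r.Type != VoterFull {
			return nil, fmt.Errorf("LEAVE_JOINT would demote leaseholder n%d: %w",
				leaseholder, errCannotHoldLease)
		}
	}
	return after, nil
}

func main() {
	// Same node layout as the log: n1 is being demoted, n3/n4/n2 are voters.
	joint := []Replica{
		{1, VoterDemotingLearner}, {3, VoterFull}, {4, VoterFull}, {2, VoterFull},
	}
	_, err := leaveJoint(joint, 1) // lease is back on the demoting replica
	fmt.Println(err)
	_, err = leaveJoint(joint, 3) // lease was transferred away first
	fmt.Println(err)
}
```

In this model the second call succeeds because the lease sits on a full voter, which is the state the lease transfer before leaving the joint configuration is meant to establish.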

Next steps:

  1. This is not a release blocker, because #83686 (kvserver: allow voter demoting to get a lease, in case there's an incoming voter) already made it into v22.1.4.
  2. We should improve the error here (#86319, kv: improve IllegalReplicationChangeError).
  3. We should extend #85405 (kv: retry AdminSplit on LeaseTransferRejectedBecauseTargetMayNeedSnapshot) to also retry on IsIllegalReplicationChangeError.
  4. #83686 made it possible to acquire the lease as a VOTER_DEMOTING and to receive the lease through a lease transfer as a VOTER_DEMOTING. Would it make sense to disallow the second case? cc. @shralex.
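For step 3, the shape of the retry could look roughly like the sketch below. This is not the AdminSplit code and not the change in #85405: `retrySplit` and the sentinel `errIllegalReplicationChange` are hypothetical stand-ins, and a real retry loop would presumably add backoff and respect context cancellation.

```go
package main

import (
	"errors"
	"fmt"
)

// errIllegalReplicationChange is a hypothetical sentinel standing in for
// whatever IsIllegalReplicationChangeError would match.
var errIllegalReplicationChange = errors.New("illegal replication change")

// retrySplit calls split up to maxAttempts times. It retries only when the
// error wraps errIllegalReplicationChange; any other error is returned
// immediately. It reports how many attempts were made.
func retrySplit(maxAttempts int, split func() error) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = split()
		if err == nil || !errors.Is(err, errIllegalReplicationChange) {
			return attempt, err
		}
	}
	return maxAttempts, err
}

func main() {
	calls := 0
	// A split that hits the transient error twice, then succeeds.
	flaky := func() error {
		calls++
		if calls < 3 {
			return fmt.Errorf("change replicas of r78 failed: %w",
				errIllegalReplicationChange)
		}
		return nil
	}
	attempts, err := retrySplit(5, flaky)
	fmt.Println(attempts, err) // prints "3 <nil>"
}
```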

@nvanbenschoten removed the release-blocker label Aug 17, 2022
@irfansharif irfansharif removed their assignment Aug 17, 2022
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Aug 17, 2022
Include information about the replica that is returning the error and
about the current range descriptor.

Related to cockroachdb#85879.

Release justification: low risk, improves debugging.
craig bot pushed a commit that referenced this issue Aug 18, 2022
86319: kv: improve IllegalReplicationChangeError r=nvanbenschoten a=nvanbenschoten

Include information about the replica that is returning the error and about the current range descriptor.

Related to #85879.

Release justification: low risk, improves debugging.

86414: build: bump Pebble metamorphic test duration r=nicktrav a=jbowens

Increase the Pebble metamorphic test duration to 6h. We want the increased test
coverage at least for the duration of stability.

Release justification: Non-production code changes
Release note: None

86418: tree: optimize AvailableTypes r=ajwerner a=ajwerner

Check out this profile from `@msirek:`

![pasted image 0](https://user-images.githubusercontent.com/1839234/185487480-4f33551c-ce0e-4d74-91c7-ab15c101f4dd.png)


Release justification: Minor change with no impact on correctness

Release note: None

Co-authored-by: Nathan VanBenschoten
Co-authored-by: Jackson Owens
Co-authored-by: Andrew Werner
blathers-crl bot pushed a commit that referenced this issue Aug 18, 2022
Include information about the replica that is returning the error and
about the current range descriptor.

Related to #85879.

Release justification: low risk, improves debugging.
@cockroach-teamcity
Member Author

roachtest.restore/nodeShutdown/coordinator failed with artifacts on release-22.1 @ 714fa0ad80c499cbd96ba97c560a9b414c61104f:

The test failed on branch=release-22.1, cloud=gce:
test artifacts and logs in: /artifacts/restore/nodeShutdown/coordinator/run_1
	monitor.go:127,jobs.go:154,restore.go:322,test_runner.go:883: monitor failure: monitor task failed: unexpectedly found job 790239396815568898 in state failed
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.jobSurvivesNodeShutdown
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/jobs.go:154
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerRestoreNodeShutdown.func3
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/restore.go:322
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:883
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	main/pkg/cmd/roachtest/monitor.go:171
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.jobSurvivesNodeShutdown.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/jobs.go:95
		  | main.(*monitorImpl).Go.func1
		  | 	main/pkg/cmd/roachtest/monitor.go:105
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (6) unexpectedly found job 790239396815568898 in state failed
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

@cockroach-teamcity
Member Author

roachtest.restore/nodeShutdown/coordinator failed with artifacts on release-22.1 @ f0da2ccc9f3641e4b8252dced7b3c42dca2f2dc1:

The test failed on branch=release-22.1, cloud=gce:
test artifacts and logs in: /artifacts/restore/nodeShutdown/coordinator/run_1
	monitor.go:127,jobs.go:154,restore.go:322,test_runner.go:883: monitor failure: monitor task failed: unexpectedly found job 792504951490052098 in state failed
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.jobSurvivesNodeShutdown
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/jobs.go:154
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerRestoreNodeShutdown.func3
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/restore.go:322
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:883
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	main/pkg/cmd/roachtest/monitor.go:171
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.jobSurvivesNodeShutdown.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/jobs.go:95
		  | main.(*monitorImpl).Go.func1
		  | 	main/pkg/cmd/roachtest/monitor.go:105
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (6) unexpectedly found job 792504951490052098 in state failed
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

@cockroach-teamcity
Member Author

roachtest.restore/nodeShutdown/coordinator failed with artifacts on release-22.1 @ 9eb4da2a351b2fe4706be9aa4af27ee2e60b405e:

The test failed on branch=release-22.1, cloud=gce:
test artifacts and logs in: /artifacts/restore/nodeShutdown/coordinator/run_1
	monitor.go:127,jobs.go:154,restore.go:322,test_runner.go:883: monitor failure: monitor task failed: unexpectedly found job 800148903845363714 in state failed
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.jobSurvivesNodeShutdown
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/jobs.go:154
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerRestoreNodeShutdown.func3
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/restore.go:322
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:883
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	main/pkg/cmd/roachtest/monitor.go:171
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.jobSurvivesNodeShutdown.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/jobs.go:95
		  | main.(*monitorImpl).Go.func1
		  | 	main/pkg/cmd/roachtest/monitor.go:105
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (6) unexpectedly found job 800148903845363714 in state failed
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

@cockroach-teamcity
Member Author

roachtest.restore/nodeShutdown/coordinator failed with artifacts on release-22.1 @ a769b09b2f664a04b5bf513a12eb21bdbfdb7ca3:

The test failed on branch=release-22.1, cloud=gce:
test artifacts and logs in: /artifacts/restore/nodeShutdown/coordinator/run_1
	monitor.go:127,jobs.go:154,restore.go:322,test_runner.go:883: monitor failure: monitor task failed: unexpectedly found job 801281139947241474 in state failed
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.jobSurvivesNodeShutdown
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/jobs.go:154
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerRestoreNodeShutdown.func3
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/restore.go:322
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:883
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	main/pkg/cmd/roachtest/monitor.go:171
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.jobSurvivesNodeShutdown.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/jobs.go:95
		  | main.(*monitorImpl).Go.func1
		  | 	main/pkg/cmd/roachtest/monitor.go:105
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (6) unexpectedly found job 801281139947241474 in state failed
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

@nvanbenschoten
Member

nvanbenschoten commented Oct 10, 2022

Fixed by #89564 and #89621.


5 participants