
roachtest: restore/nodeShutdown/worker failed #88667

Closed
cockroach-teamcity opened this issue Sep 24, 2022 · 9 comments · Fixed by #89564 or #89340
Labels
branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team
Milestone

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Sep 24, 2022

roachtest.restore/nodeShutdown/worker failed with artifacts on master @ 89f4ad907a1756551bd6864c3e8516eeff6b0e0a:

test artifacts and logs in: /artifacts/restore/nodeShutdown/worker/run_1
	monitor.go:127,jobs.go:154,restore.go:304,test_runner.go:928: monitor failure: monitor command failure: unexpected node event: 2: dead (exit status 137)
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.jobSurvivesNodeShutdown
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/jobs.go:154
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerRestoreNodeShutdown.func2
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/restore.go:304
		  | [...repeated from below...]
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func3
		  | 	main/pkg/cmd/roachtest/monitor.go:202
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1594
		Wraps: (4) monitor command failure
		Wraps: (5) unexpected node event: 2: dead (exit status 137)
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString

	test_runner.go:1059,test_runner.go:958: test timed out (0s)

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-19950

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Sep 24, 2022
@cockroach-teamcity cockroach-teamcity added this to the 22.2 milestone Sep 24, 2022
@cockroach-teamcity
Member Author

roachtest.restore/nodeShutdown/worker failed with artifacts on master @ 51c8aae748d338549400c047796c6c9b892527da:

test artifacts and logs in: /artifacts/restore/nodeShutdown/worker/run_1
	monitor.go:127,jobs.go:154,restore.go:304,test_runner.go:928: monitor failure: monitor command failure: unexpected node event: 4: dead (exit status 137)
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.jobSurvivesNodeShutdown
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/jobs.go:154
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerRestoreNodeShutdown.func2
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/restore.go:304
		  | [...repeated from below...]
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func3
		  | 	main/pkg/cmd/roachtest/monitor.go:202
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1594
		Wraps: (4) monitor command failure
		Wraps: (5) unexpected node event: 4: dead (exit status 137)
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString

	test_runner.go:1059,test_runner.go:958: test timed out (0s)

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

This test on roachdash | Improve this report!

@stevendanna
Collaborator

Here is another one stuck on AdminSplit. However, with the new logging added because of #87837 we can see the reason:

W220925 15:29:05.982361 7926 kv/kvserver/replica_command.go:557 ⋮ [n1,s1,r69/1:‹/Table/106/1/{1770930-2060193}›,*roachpb.AdminSplitRequest] 32863  retrying split after err: refusing to transfer lease to (n3,s3):4VOTER_INCOMING because target may need a Raft snapshot: ‹replica in StateProbe›

Note that n3 in this test has been killed to simulate a worker failure.

@nvanbenschoten
Member

nvanbenschoten commented Sep 26, 2022

The refusing to transfer lease ... message is related to the lease transfer safety checks that we added in this release.

We see that the split is stuck on a previous joint config, which is trying to transfer away the range lease to the incoming voter (logic added in the previous release). However, the transfer is being rejected because the incoming voter may need a snapshot.

Why can't the transferring replica determine whether the incoming voter needs a snapshot? Because the VOTER_INCOMING has been killed by the test.

So this is quite an interesting situation. In the past, we would have transferred the lease to the dead node, which would have eventually expired, and then allowed another replica in the incoming side of the joint config to acquire the lease.

We now get stuck because we refuse to perform a potentially unsafe lease transfer. In fact, this safety check is not even a false positive. The lease transfer would be unsafe — the target is dead! But then we don't have a way out of this situation, which seems quite bad. So the joint config gets stuck.

What can we do here? Some options are:

  1. Allow the unsafe lease transfer in this case. Perhaps by adding a BypassLeaseTransferSafetyChecks flag to LeaseTransfer
  2. Transfer the lease to someone else in maybeTransferLeaseDuringLeaveJoint when we detect (from liveness?) that the VOTER_INCOMING is dead
  3. I'm not seeing many other options

cc. @shralex. I think option 1 might actually be the best here. We've seen fallout from adding these safety checks to the lease transfer performed during a joint config. While it's well-intentioned, it's also hard to justify because we know that the incoming voter was just sent a snapshot.
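
For illustration, a minimal sketch of what option 1 could look like, using hypothetical names (this is not the actual kvserver request or check, just the shape of the idea):

```go
package kvsketch

import "errors"

// TransferLeaseOptions is a hypothetical options struct standing in for
// whatever the real lease transfer request would carry.
type TransferLeaseOptions struct {
	// BypassSafetyChecks skips the "target may need a Raft snapshot"
	// rejection. Intended only for the joint-configuration exit path,
	// where refusing the transfer can leave the range stuck.
	BypassSafetyChecks bool
}

// checkLeaseTransferSafe sketches where such a flag would be consulted.
func checkLeaseTransferSafe(targetUpToDate bool, opts TransferLeaseOptions) error {
	if opts.BypassSafetyChecks {
		return nil // the caller accepts a potentially unsafe transfer
	}
	if !targetUpToDate {
		return errors.New("refusing to transfer lease: target may need a Raft snapshot")
	}
	return nil
}
```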

@aayushshah15

This comment was marked as outdated.

@nvanbenschoten
Member

Allow the unsafe lease transfer in this case. Perhaps by adding a BypassLeaseTransferSafetyChecks flag to LeaseTransfer

A concern with this approach is that the lease transferee needs to be the replica that completes the joint configuration. If we blindly allow lease transfers to dead nodes, then it's possible for the lease transfer to complete, the lease to expire, and then the previous leaseholder to once again acquire the lease (permitted as of #83686). This could then repeat indefinitely without any other replica acquiring the lease and completing the joint configuration.

To guard against that, it does feel like we'll need some variant of option 2.

Note that this isn't a new issue — I think we can already hit this retry loop in v22.1.
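
For concreteness, a variant of option 2 might look roughly like the sketch below. All names here are hypothetical; the real logic would live in maybeTransferLeaseDuringLeaveJoint and consult node liveness.

```go
package kvsketch

import "errors"

// voter is a hypothetical stand-in for a replica descriptor.
type voter struct {
	NodeID   int
	Incoming bool // true for the VOTER_INCOMING in the joint configuration
}

// chooseLeaseTarget prefers the VOTER_INCOMING, but if liveness reports it
// as dead, falls back to any other live voter so the lease can still move
// and the joint configuration can be exited.
func chooseLeaseTarget(voters []voter, isLive func(nodeID int) bool) (voter, error) {
	var fallback *voter
	for i := range voters {
		v := voters[i]
		if !isLive(v.NodeID) {
			continue
		}
		if v.Incoming {
			return v, nil // normal case: hand the lease to the incoming voter
		}
		if fallback == nil {
			fallback = &voters[i]
		}
	}
	if fallback != nil {
		return *fallback, nil // incoming voter is dead; pick another live voter
	}
	return voter{}, errors.New("no live lease transfer target")
}
```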

@aayushshah15
Contributor

It really feels like what we need is a way for the current leaseholder to invalidate its current lease, perhaps by issuing a synthetic lease transfer to a non-existent replica (or something of the sort), and amending the logic modified in #83686 to guarantee that only the other (non VOTER_DEMOTING) replicas can acquire the lease in such cases.

So the rough idea would be something like:

  1. The current leaseholder initiates the joint replication change, becomes a demoting voter.
  2. It tries to transfer the lease away to the incoming voter, which fails if the incoming voter is dead.
  3. (Perhaps after a couple of retry attempts) the current leaseholder simply invalidates its lease and bails. Note that this would mean that it commits some sort of synthetic lease to the range.
  4. A future request (like an AdminSplit or a subsequent AdminChangeReplicas call) forces a lease acquisition on some other (non VOTER_DEMOTING) replica.
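
Sketched in code, that flow might look like the following (purely illustrative, with hypothetical callbacks; no claim this matches kvserver's actual plumbing):

```go
package kvsketch

import "errors"

// demoteAndReleaseLease illustrates steps 2 and 3 above: try to hand the
// lease to the incoming voter a few times, and if that keeps failing
// (e.g. the target is dead), invalidate the current lease instead of
// holding on to it, so a later request forces acquisition on some other
// non-VOTER_DEMOTING replica (step 4 happens elsewhere).
func demoteAndReleaseLease(
	transferToIncoming func() error, // step 2: fails if the incoming voter is dead
	invalidateOwnLease func() error, // step 3: hypothetical "synthetic lease" commit
	maxAttempts int,
) error {
	var lastErr error
	for i := 0; i < maxAttempts; i++ {
		if lastErr = transferToIncoming(); lastErr == nil {
			return nil
		}
	}
	if err := invalidateOwnLease(); err != nil {
		return errors.Join(lastErr, err)
	}
	return nil
}
```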

@cockroach-teamcity
Member Author

roachtest.restore/nodeShutdown/worker failed with artifacts on master @ 84384b50c023dd4c05fff76af85a6975f5d2b0ab:

test artifacts and logs in: /artifacts/restore/nodeShutdown/worker/run_1
	monitor.go:127,jobs.go:154,restore.go:304,test_runner.go:928: monitor failure: monitor task failed: getting the job status: read tcp 172.17.0.3:54916 -> 34.73.127.38:26257: read: connection reset by peer
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.jobSurvivesNodeShutdown
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/jobs.go:154
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerRestoreNodeShutdown.func2
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/restore.go:304
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:928
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	main/pkg/cmd/roachtest/monitor.go:171
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.jobSurvivesNodeShutdown.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/jobs.go:84
		  | main.(*monitorImpl).Go.func1
		  | 	main/pkg/cmd/roachtest/monitor.go:105
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:74
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1594
		Wraps: (6) getting the job status
		Wraps: (7) read tcp 172.17.0.3:54916 -> 34.73.127.38:26257
		Wraps: (8) read
		Wraps: (9) connection reset by peer
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *net.OpError (8) *os.SyscallError (9) syscall.Errno

	test_runner.go:1059,test_runner.go:958: test timed out (0s)

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

This test on roachdash | Improve this report!

@nvanbenschoten
Member

@aayushshah15, @shralex, and I met last week to discuss this issue. In the short term, we decided that option 1 from #88667 (comment) is the safest choice.

I made that change and tested it out on this test. Without any changes, this test has a failure rate of about 7% and gets stuck in a refusing to transfer lease ... target may need a Raft snapshot loop. With that change, it no longer gets stuck in the same way.

However, it still has a 5% failure rate with a different failure mode. The issue is that even if we allow the lease transfer to go through to the dead node, the VOTER_DEMOTING is still allowed to re-acquire the lease. As a result, the test fails in the same way we saw in #85879, with the job hitting a replica cannot hold lease error. The VOTER_DEMOTING transfers the lease, reacquires it, then tries to exit the joint configuration, which fails because it still holds the lease. Arguably, this should be a retryable error, but even if it were, there's no reason to believe it wouldn't retry forever.

What we need is step 4 from #85879 (comment) (the one remaining step from that list). Once the VOTER_DEMOTING transfers away the lease, it shouldn't re-acquire it. I made that change as well and found that the failure rate dropped to 0%.

To summarize, we need the following two changes:

  1. Option 1 from roachtest: restore/nodeShutdown/worker failed #88667 (comment). This is only needed in v22.2.
  2. Step 4 from roachtest: restore/nodeShutdown/coordinator failed #85879 (comment). This needs to be backported to v22.1 because kvserver: allow voter demoting to get a lease, in case there's an incoming voter #83686 made it back to v22.1.

@exalate-issue-sync exalate-issue-sync bot added blocks-22.2.0-beta.2 release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. and removed T-disaster-recovery GA-blocker T-kv KV Team labels Oct 4, 2022
@craig craig bot closed this as completed in 17d288b Oct 8, 2022
blathers-crl bot pushed a commit that referenced this issue Oct 8, 2022
This PR restricts the case when a VOTER_DEMOTING_LEARNER can
acquire the lease in a joint configuration to only the case where
it was the last leaseholder. Since it is being removed, we only
want it to get the lease if no other replica can acquire it,
the scenario described in #83687.

This fix solves a potential starvation scenario where a VOTER_DEMOTING_LEARNER keeps
transferring the lease to the VOTER_INCOMING, succeeding, but then re-acquiring
because the VOTER_INCOMING is dead and the lease expires. In this case, we would
want another replica to pick up the lease, which would allow us to exit the joint configuration.

This PR also removes the leaseHolderRemovalAllowed parameter of CheckCanReceiveLease,
as it is always true as of 22.2.

Release note (bug fix): narrows down the conditions under which a VOTER_DEMOTING_LEARNER
can acquire the lease in a joint configuration to a) there has to be a VOTER_INCOMING
in the configuration and b) the VOTER_DEMOTING_LEARNER was the last leaseholder. This
prevents it from acquiring the lease unless it is the only one that can acquire it,
because transferring the lease away is necessary before exiting the joint config (without
the fix the system can be stuck in a joint configuration in some rare situations).

Fixes: #88667
See also #89340
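
Conceptually, the acquisition-side restriction reads something like the sketch below (a simplified, hypothetical signature rather than the actual CheckCanReceiveLease):

```go
package kvsketch

// ReplicaType is a simplified stand-in for the replica types relevant here.
type ReplicaType int

const (
	Voter ReplicaType = iota
	VoterIncoming
	VoterDemotingLearner
)

// canReceiveLease sketches the restriction described in the commit message:
// ordinary voters and incoming voters can always receive the lease, while a
// VOTER_DEMOTING_LEARNER can receive it only if (a) the configuration also
// contains a VOTER_INCOMING and (b) it was itself the previous leaseholder,
// i.e. it is the only replica positioned to move the lease onward before the
// joint configuration is exited.
func canReceiveLease(typ ReplicaType, hasVoterIncoming, wasLastLeaseholder bool) bool {
	switch typ {
	case Voter, VoterIncoming:
		return true
	case VoterDemotingLearner:
		return hasVoterIncoming && wasLastLeaseholder
	default:
		return false
	}
}
```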
blathers-crl bot pushed a commit that referenced this issue Oct 8, 2022
This PR restricts the case when a VOTER_DEMOTING_LEARNER can
acquire the lease in a joint configuration to only the case where
it was the last leaseholder. Since it is being removed, we only
want it to get the lease if no other replica can acquire it,
the scenario described in #83687.

This fix solves a potential starvation scenario where a VOTER_DEMOTING_LEARNER keeps
transferring the lease to the VOTER_INCOMING, succeeding, but then re-acquiring
because the VOTER_INCOMING is dead and the lease expires. In this case, we would
want another replica to pick up the lease, which would allow us to exit the joint configuration.

This PR also removes the leaseHolderRemovalAllowed parameter of CheckCanReceiveLease,
as it is always true as of 22.2.

Release note (bug fix): narrows down the conditions under which a VOTER_DEMOTING_LEARNER
can acquire the lease in a joint configuration to a) there has to be a VOTER_INCOMING
in the configuration and b) the VOTER_DEMOTING_LEARNER was the last leaseholder. This
prevents it from acquiring the lease unless it is the only one that can acquire it,
because transferring the lease away is necessary before exiting the joint config (without
the fix the system can be stuck in a joint configuration in some rare situations).

Fixes: #88667
See also #89340
@nvanbenschoten nvanbenschoten reopened this Oct 8, 2022
@nvanbenschoten
Member

We still need to land #89340 to resolve this.

shralex added a commit to shralex/cockroach that referenced this issue Oct 8, 2022
This PR restricts the case when a VOTER_DEMOTING_LEARNER can
acquire the lease in a joint configuration to only the case where
it was the last leaseholder. Since it is being removed, we only
want it to get the lease if no other replica can acquire it,
the scenario described in cockroachdb#83687.

This fix solves a potential starvation scenario where a VOTER_DEMOTING_LEARNER keeps
transferring the lease to the VOTER_INCOMING, succeeding, but then re-acquiring
because the VOTER_INCOMING is dead and the lease expires. In this case, we would
want another replica to pick up the lease, which would allow us to exit the joint configuration.

Release note (bug fix): narrows down the conditions under which a VOTER_DEMOTING_LEARNER
can acquire the lease in a joint configuration to a) there has to be a VOTER_INCOMING
in the configuration and b) the VOTER_DEMOTING_LEARNER was the last leaseholder. This
prevents it from acquiring the lease unless it is the only one that can acquire it,
because transferring the lease away is necessary before exiting the joint config (without
the fix the system can be stuck in a joint configuration in some rare situations).

Release justification: solves a serious bug.

Fixes: cockroachdb#88667
See also cockroachdb#89340
craig bot pushed a commit that referenced this issue Oct 9, 2022
89340: kv: bypass lease transfer safety checks during joint consensus r=shralex a=nvanbenschoten

This commit adds logic to bypass lease transfer safety checks (added in 034611b) when in a joint configuration and transferring the lease from a VOTER_DEMOTING to a VOTER_INCOMING. We do so because we could get stuck without a path to exit the joint configuration if we rejected this lease transfer while waiting to confirm that the target is up-to-date on its log. That confirmation may never arrive if the target is dead or partitioned away, and while we'd rather not transfer the lease to a dead node, at least we have a mechanism to recover from that state. We also just sent the VOTER_INCOMING a snapshot (as a LEARNER, before promotion), so it is unlikely that the replica is actually dead or behind on its log.

A better alternative here would be to introduce a mechanism to choose an alternate lease transfer target after some amount of time, if the lease transfer to the VOTER_INCOMING cannot be confirmed to be safe. We may do this in the future, but given the proximity to the release and given that this matches the behavior in v22.1, we choose this approach for now.

Release note: None

Release justification: Needed to resolve release blocker.

Fixes: #88667
See also #89564

Co-authored-by: Nathan VanBenschoten <[email protected]>
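
The condition described in this PR boils down to something like the following simplified sketch (a hypothetical helper, not the kvserver code itself):

```go
package kvsketch

// bypassTransferSafetyChecks sketches the rule from #89340: skip the
// "target may need a Raft snapshot" check only when exiting a joint
// configuration and moving the lease from the VOTER_DEMOTING replica to
// the VOTER_INCOMING replica, which was just sent a snapshot as a LEARNER.
func bypassTransferSafetyChecks(inJointConfig, senderIsDemoting, targetIsIncoming bool) bool {
	return inJointConfig && senderIsDemoting && targetIsIncoming
}
```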
@craig craig bot closed this as completed in 0036160 Oct 9, 2022
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Oct 9, 2022
Fixes cockroachdb#88667.

This commit adds logic to bypass lease transfer safety checks (added in
034611b) when in a joint configuration and transferring the lease from a
VOTER_DEMOTING to a VOTER_INCOMING. We do so because we could get stuck
without a path to exit the joint configuration if we rejected this lease
transfer while waiting to confirm that the target is up-to-date on its
log. That confirmation may never arrive if the target is dead or
partitioned away, and while we'd rather not transfer the lease to a dead
node, at least we have a mechanism to recover from that state. We also
just sent the VOTER_INCOMING a snapshot (as a LEARNER, before
promotion), so it is unlikely that the replica is actually dead or
behind on its log.

A better alternative here would be to introduce a mechanism to choose an
alternate lease transfer target after some amount of time, if the lease
transfer to the VOTER_INCOMING cannot be confirmed to be safe. We may do
this in the future, but given the proximity to the release and given that
this matches the behavior in v22.1, we choose this approach for now.

Release note: None

Release justification: Needed to resolve release blocker.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Oct 9, 2022
Fixes cockroachdb#88667.

This commit adds logic to bypass lease transfer safety checks (added in
034611b) when in a joint configuration and transferring the lease from a
VOTER_DEMOTING to a VOTER_INCOMING. We do so because we could get stuck
without a path to exit the joint configuration if we rejected this lease
transfer while waiting to confirm that the target is up-to-date on its
log. That confirmation may never arrive if the target is dead or
partitioned away, and while we'd rather not transfer the lease to a dead
node, at least we have a mechanism to recover from that state. We also
just sent the VOTER_INCOMING a snapshot (as a LEARNER, before
promotion), so it is unlikely that the replica is actually dead or
behind on its log.

A better alternative here would be to introduce a mechanism to choose an
alternate lease transfer target after some amount of time, if the lease
transfer to the VOTER_INCOMING cannot be confirmed to be safe. We may do
this in the future, but given the proximity to the release and given that
this matches the behavior in v22.1, we choose this approach for now.

Release note: None

Release justification: Needed to resolve release blocker.
@lunevalex lunevalex added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Dec 9, 2022