
kvserver: add timeout for lease acquisitions #81136

Merged: 2 commits merged into cockroachdb:master on May 25, 2022

Conversation

@erikgrinaker (Contributor) commented May 8, 2022

kvserver: deflake TestLeasePreferencesDuringOutage

This patch deflakes TestLeasePreferencesDuringOutage by allowing nodes
to acquire a lease even when they are not the Raft leader.

Release note: None

kvserver: add timeout for lease acquisitions

This patch adds a timeout for lease acquisitions. It is set to
twice the Raft election timeout (6 seconds total), since it may need to
hold a Raft election and repropose the lease acquisition command, each
of which may take up to one election timeout.

Without this timeout, it's possible for a lease acquisition to stall
indefinitely (e.g. in the case of a stalled disk). This prevents a
NotLeaseHolderError from being returned to the client DistSender,
which in turn prevents it from trying other replicas that could acquire
the lease instead. This can cause a lease to remain invalid forever.

Touches #81100.

Release note (bug fix): Fixed a bug where an unresponsive node (e.g.
with a stalled disk) could prevent other nodes from acquiring its
leases, effectively stalling these ranges until the node was shut down
or recovered.
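
For illustration, here is a minimal, self-contained Go sketch of the idea (not the actual CockroachDB code; `raftElectionTimeout`, `acquireLease`, and the printed messages are made-up stand-ins): bound the acquisition with a context deadline of two election timeouts, and surface the timeout so the client can go try other replicas.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// raftElectionTimeout is an assumed placeholder for the configured Raft
// election timeout (tick interval * election ticks).
const raftElectionTimeout = 3 * time.Second

// acquireLease stands in for proposing and applying a lease acquisition
// command; it blocks until the lease is acquired or ctx is done.
func acquireLease(ctx context.Context) error {
	select {
	case <-time.After(10 * time.Second): // simulate a stalled acquisition
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	// Allow up to two election timeouts: one to elect a Raft leader and one
	// to repropose the lease acquisition command.
	timeout := 2 * raftElectionTimeout

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	if err := acquireLease(ctx); err != nil {
		if errors.Is(err, context.DeadlineExceeded) {
			// Returning an error here lets the DistSender try other replicas
			// instead of waiting on this one forever (conceptually, the
			// NotLeaseHolderError described above).
			fmt.Println("lease acquisition timed out; returning NotLeaseHolderError to client")
			return
		}
		fmt.Println("lease acquisition failed:", err)
		return
	}
	fmt.Println("lease acquired")
}
```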

@erikgrinaker erikgrinaker requested a review from a team as a code owner May 8, 2022 16:15
@erikgrinaker erikgrinaker self-assigned this May 8, 2022
@cockroach-teamcity (Member) commented:

This change is Reviewable

@erikgrinaker erikgrinaker force-pushed the lease-acquire-timeout branch from 688a39a to c78e37e Compare May 8, 2022 16:15
@tbg (Member) left a comment

Second commit LGTM. For the first commit, I'm concerned that we're making the API worse rather than better by adding this, since we don't have a way to return a descriptive cancellation error.

Reviewed 2 of 2 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)


pkg/kv/kvserver/replica_range_lease.go line 362 at r2 (raw file):

			// regular heartbeat. We must eventually return a NotLeaseHolderError,
			// otherwise we could prevent anyone else from acquiring the lease.
			Timeout: nlRenewal,

Let's say a range is unavailable, but the circuit breaker hasn't tripped (there isn't a lease). A request comes in and requests the lease. We will do so with a 4.5s timeout. I was worried that we wouldn't ever trip the breaker. But I think I'm wrong - the caller (this code) will abandon the lease proposal, but the proposal will stick around. So we will still tickle the breaker.

(Just rationalizing this for myself.)


pkg/util/stop/stopper.go line 503 at r1 (raw file):

		// infallible.
		if opt.Timeout > 0 {
			_ = contextutil.RunWithTimeout(ctx, opt.TaskName, opt.Timeout,

This seems to defeat the purpose of using RunWithTimeout in the first place. If you specify a timeout, you want to pass a fallible closure. The nesting option that you alluded to would be better since it will actually surface the error massaged by RunWithTimeout.
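
As a hedged illustration of the concern (plain standard library, not `contextutil`; `runWithTimeout` and `slowTask` are made-up stand-ins), discarding the wrapper's return value loses exactly the error the timeout exists to report:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runWithTimeout mimics the general shape of a RunWithTimeout helper: it runs
// fn under a derived deadline and returns fn's error (or the context error).
// This is an illustrative stand-in, not the contextutil implementation.
func runWithTimeout(ctx context.Context, timeout time.Duration, fn func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	return fn(ctx)
}

func main() {
	slowTask := func(ctx context.Context) error {
		select {
		case <-time.After(time.Second):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}

	// Discarding the error (as in `_ = runWithTimeout(...)`) silently loses
	// the timeout; the caller can't tell the task was abandoned.
	_ = runWithTimeout(context.Background(), 10*time.Millisecond, slowTask)

	// Surfacing the error lets the caller react to the timeout.
	if err := runWithTimeout(context.Background(), 10*time.Millisecond, slowTask); err != nil {
		fmt.Println("task did not finish in time:", err)
	}
}
```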

@erikgrinaker (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/kv/kvserver/replica_range_lease.go line 362 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Let's say a range is unavailable, but the circuit breaker hasn't tripped (there isn't a lease). A request comes in and requests the lease. We will do so with a 4.5s timeout. I was worried that we wouldn't ever trip the breaker. But I think I'm wrong - the caller (this code) will abandon the lease proposal, but the proposal will stick around. So we will still tickle the breaker.

(Just rationalizing this for myself.)

Yeah, thanks for verifying. It seems unfortunate that all of these mechanisms (lease acquisitions, circuit breakers, etc) are driven by client requests.


pkg/util/stop/stopper.go line 503 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This seems to defeat the purpose of using RunWithTimeout in the first place. If you specify a timeout, you want to pass a fallible closure. The nesting option that you alluded to would be better since it will actually surface the error massaged by RunWithTimeout.

If we do the call manually we're still in the same situation: we're in an infallible goroutine and have to handle the error from RunWithTimeout. The same is true if the caller simply passes in a cancellable context to RunAsyncTask: the context error must be handled inside the infallible goroutine. These errors must either be logged or passed to a different goroutine regardless of this option.
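
To sketch that point under stated assumptions (standard library only; the channel hand-off is just one option, logging inside the goroutine is another), the error from a fire-and-forget task has to be consumed locally because there is no return value to propagate:

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

func main() {
	var wg sync.WaitGroup
	errCh := make(chan error, 1)

	// A fire-and-forget task (akin to an async task on a stopper) returns
	// nothing to its spawner, so any timeout or cancellation error must be
	// handled inside the goroutine: logged, or handed off as done here.
	wg.Add(1)
	go func() {
		defer wg.Done()
		ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
		defer cancel()

		select {
		case <-time.After(time.Second): // simulate slow work
			errCh <- nil
		case <-ctx.Done():
			errCh <- ctx.Err() // surface the error to whoever is listening
		}
	}()

	wg.Wait()
	if err := <-errCh; err != nil {
		fmt.Println("async task failed:", err)
	}
}
```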

@erikgrinaker (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/util/stop/stopper.go line 503 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

If we do the call manually we're still in the same situation: we're in an infallible goroutine and have to handle the error from RunWithTimeout. The same is true if the caller simply passes in a cancellable context to RunAsyncTask: the context error must be handled inside the infallible goroutine. These errors must either be logged or passed to a different goroutine regardless of this option.

Oh wait, yeah I see what you're saying. I'll drop it.

@erikgrinaker erikgrinaker force-pushed the lease-acquire-timeout branch 2 times, most recently from 1ee0204 to 5a2cbde Compare May 12, 2022 14:03
@erikgrinaker (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/util/stop/stopper.go line 503 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Oh wait, yeah I see what you're saying. I'll drop it.

Done, PTAL.

@erikgrinaker erikgrinaker force-pushed the lease-acquire-timeout branch from 5a2cbde to 1086348 Compare May 12, 2022 14:13
@erikgrinaker (Contributor, Author) left a comment

Previously, tbg (Tobias Grieger) wrote…

Second commit LGTM. For the first commit, I'm concerned that we're making the API worse rather than better by adding this, since we don't have a way to return a descriptive cancellation error.

I think Reviewable ate the other comment thread, since I can't find it anywhere in the UI anymore now that it's resolved. Anyway, the new error handling requires a decode/encode roundtrip for roachpb.Error -- I'm a bit worried that those can be lossy, but let's see what CI says.

Reviewed 3 of 3 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)

@erikgrinaker erikgrinaker force-pushed the lease-acquire-timeout branch 2 times, most recently from 3a8260b to 2de8289 Compare May 19, 2022 08:12
@tbg (Member) left a comment

Reviewed 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)

@erikgrinaker (Contributor, Author) commented:

I'm seeing two CI failures here. Going to resolve these somehow before merging.

@erikgrinaker erikgrinaker force-pushed the lease-acquire-timeout branch from 2de8289 to 0f3f5d2 Compare May 24, 2022 09:36
@erikgrinaker erikgrinaker force-pushed the lease-acquire-timeout branch from 0f3f5d2 to ab74b97 Compare May 24, 2022 12:39
@erikgrinaker (Contributor, Author) commented:

This is being resolved in cockroachdb/pebble#1719. We can merge this regardless, since this is already flaky on master.

I had to increase the timeout from the heartbeat timeout (4.5 seconds) to twice the Raft election timeout (6 seconds), since it can take up to two election timeouts to elect a new Raft leader and then repropose the lease acquisition command. I also deflaked the test in a separate commit.
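
As a rough worked example of that arithmetic (the tick interval and election-tick count below are assumed, roughly CockroachDB-like defaults, not values taken from this PR):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Assumed defaults for illustration: a 200ms Raft tick and 15 election
	// ticks give a 3s election timeout.
	raftTickInterval := 200 * time.Millisecond
	raftElectionTimeoutTicks := 15

	electionTimeout := time.Duration(raftElectionTimeoutTicks) * raftTickInterval // 3s

	// The acquisition may need one election timeout to elect a Raft leader
	// and another to repropose the lease acquisition command.
	leaseAcquisitionTimeout := 2 * electionTimeout // 6s

	fmt.Println("election timeout:", electionTimeout)
	fmt.Println("lease acquisition timeout:", leaseAcquisitionTimeout)
}
```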

Mind giving this another look @tbg?

@lunevalex lunevalex linked an issue May 24, 2022 that may be closed by this pull request
@tbg (Member) left a comment

Reviewed 2 of 2 files at r5, 1 of 1 files at r6, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)

@erikgrinaker (Contributor, Author) commented:

TFTR!

bors r=tbg

@craig (bot) commented May 25, 2022

Build succeeded:

@craig craig bot merged commit cf6e75d into cockroachdb:master May 25, 2022
@blathers-crl (bot) commented May 25, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from ab74b97 to blathers/backport-release-21.2-81136: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.2.x failed. See errors above.


error creating merge commit from ab74b97 to blathers/backport-release-22.1-81136: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
