kvserver: don't allow raft forwarding of lease requests #55148
Conversation
Patch is not quite ready and I will work on tests too, but I wanna see if it raises any discussion.
Reviewed 5 of 5 files at r1, 2 of 2 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @bdarnell, @knz, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/replica_proposal_buf.go, line 163 at r2 (raw file):
// rejectProposalWithRedirectLocked rejects a proposal and redirects the
// proposer to try it on another node. This is used to sometimes reject lease
// acquisitions when the another replica is the leader; the intended
"the another"
pkg/kv/kvserver/replica_proposal_buf.go, line 411 at r2 (raw file):
ents := make([]raftpb.Entry, 0, used)
var follower bool
Give this block a comment.
pkg/kv/kvserver/replica_proposal_buf.go, line 703 at r2 (raw file):
}
// rejectProposalWithRedirectLocked is part of the proposer interface.
I think we also need to call `cleanupFailedProposalLocked` in this function before calling `finishApplication`. That will ensure that the proposal is removed from the `proposals` map and that it will release quota if it acquired any.
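A rough sketch of what that ordering amounts to, based only on the helpers named in this thread (`cleanupFailedProposalLocked`, `finishApplication`, `newNotLeaseHolderError`); the exact signatures and fields here are assumptions, not the merged code:

```go
// Sketch: reject a lease-acquisition proposal and redirect the client to the
// raft leader. The failed-proposal cleanup must run before the proposal is
// finished, so that it is removed from the proposals map and releases any
// quota it acquired.
func (rp *replicaProposer) rejectProposalWithRedirectLocked(
	ctx context.Context, prop *ProposalData, redirectTo roachpb.ReplicaID,
) {
	r := (*Replica)(rp)
	rangeDesc := r.descRLocked() // r.mu is already held; don't re-acquire the read lock
	storeID := r.store.StoreID()
	redirectRep, _ := rangeDesc.GetReplicaDescriptorByID(redirectTo)
	nlhe := newNotLeaseHolderError(nil /* lease */, storeID, rangeDesc)
	nlhe.LeaseHolder = &redirectRep // hint: tell the client which replica to try next

	r.cleanupFailedProposalLocked(prop) // drop from the proposals map, release quota
	prop.finishApplication(ctx, proposalResult{Err: roachpb.NewError(nlhe)})
}
```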
pkg/kv/kvserver/replica_proposal_buf.go, line 708 at r2 (raw file):
) {
	r := (*Replica)(rp)
	rangeDesc := r.Desc()
Doesn't this grab a read lock? Read locks aren't re-entrant. I think you want `descRLocked`.
pkg/kv/kvserver/replica_proposal_buf.go, line 709 at r2 (raw file):
	r := (*Replica)(rp)
	rangeDesc := r.Desc()
	replicaDesc, err := r.getReplicaDescriptorRLocked()
If you just want the `StoreID`, why go through all this effort? Is `r.store.StoreID()` not sufficient?
pkg/kv/kvserver/replica_proposal_buf_test.go, line 74 at r2 (raw file):
}
func (t *testProposer) rejectProposalWithRedirectLocked(
It seems like we should be able to write some tests for this fairly easily.
pkg/kv/kvserver/replica_range_lease.go, line 212 at r3 (raw file):
// every request from making it too far down the proposal path - in particular
// to avoid the heartbeating of the liveness record done below.
leader := p.repl.mu.leaderID
I'm not sold on this. We want to backport this PR, so we'd like it to be as small as possible. And we're planning on using `master` as a staging ground for the backport, so we shouldn't merge anything on top that we don't intend to backport in case it hides issues with the portion we do intend to backport.
So I'm going to assume that if this is in this PR, we want to backport it too. If so, then there's a high bar for changes, especially in such subtle areas as this. This is only an optimization, right? Have we seen that it's anything more than theoretical? Have we seen the liveness heartbeats discussed above cause issues? As of 20.2, liveness heartbeats are coalesced, so I'd be surprised if they were.
I'm primarily worried about the additional complexity this adds in exchange for the unclear optimization. The PR itself is already making a potentially destabilizing change, so I'm wary of anything that makes issues that fall out from it harder to debug or increases the chance that we got it wrong.
r2 looks good to me; I share Nathan's reservations about r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @bdarnell, @knz, and @tbg)
pkg/kv/kvserver/replica_proposal_buf.go, line 421 at r2 (raw file):
leader = roachpb.ReplicaID(status.Lead)
if leader == b.p.replicaID() {
	// !!! this assertion fails, apparently in single-replica groups, so I
Yeah, I remember that single-replica groups short-circuit a lot of logic and don't maintain a lot of things that look like invariants in multi-replica groups. I don't remember anything well enough to give you useful advice right now, though.
pkg/kv/kvserver/replica_proposal_buf.go, line 463 at r2 (raw file):
// - if the leader is known, we reject this proposal and make sure the
// request that needed the lease is redirected to the leaseholder;
// - if the leader is not known, we don't do anything special here to
We don't have to do anything special here because it will get dropped later, but could we just reject the proposal here whether we know the leader or not? Would avoid questions like the one in the assertion above.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @bdarnell, @knz, and @tbg)
pkg/kv/kvserver/replica_proposal_buf.go, line 463 at r2 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
We don't have to do anything special here because it will get dropped later, but could we just reject the proposal here whether we know the leader or not? Would avoid questions like the one in the assertion above.
I was going to respond that we don't want to immediately reject the proposal because the case where `!leaderKnown` is the case where we'd rather wait until the current election completes so we know who to redirect to, but now I'm confused. When will we ever be a follower and not know the current leader? I thought you'll always be a candidate/pre-candidate if you don't know the leader.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @knz, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/replica_proposal_buf.go, line 416 at r2 (raw file):
if raftGroup != nil {
	status := raftGroup.BasicStatus()
	follower = status.RaftState == raft.StateFollower
Should this be `state == follower` or `state != leader`? I think candidates and pre-candidates (and learners, although they just use StateFollower) should be handled the same way as followers here.
pkg/kv/kvserver/replica_proposal_buf.go, line 463 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I was going to respond that we don't want to immediately reject the proposal because the case where `!leaderKnown` is the case where we'd rather wait until the current election completes so we know who to redirect to, but now I'm confused. When will we ever be a follower and not know the current leader? I thought you'll always be a candidate/pre-candidate if you don't know the leader.
The initial state of a raft instance is to be a follower with unknown leader. And there are a few ways to get back to that state. For example, if an election fails without a majority winner, the candidates become followers (with no leaders) as they back off before starting a new election.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @bdarnell, @knz, and @tbg)
pkg/kv/kvserver/replica_proposal_buf.go, line 463 at r2 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
The initial state of a raft instance is to be a follower with unknown leader. And there are a few ways to get back to that state. For example, if an election fails without a majority winner, the candidates become followers (with no leaders) as they back off before starting a new election.
Thanks for the explanation, that's very helpful. Then I agree with your other comment that we should invert the `follower` boolean to be a `leader` boolean and then check `if !leader && p.Request.IsLeaseRequest() {`.
I still think we should only redirect if the leader is known, though, for the reasons listed in #37906 (comment).
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @knz, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/replica_proposal_buf.go, line 463 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Thanks for the explanation, that's very helpful. Then I agree with your other comment that we should invert the `follower` boolean to be a `leader` boolean and then check `if !leader && p.Request.IsLeaseRequest() {`.
I still think we should only redirect if the leader is known, though, for the reasons listed in #37906 (comment).
I don't feel strongly about what happens when there is no leader - it's a very rare case and either bouncing among random replicas or waiting synchronously would be acceptable to me. If we could get rid of the synchronous wait code altogether that would be nice, although I guess that's still useful in the normal cold start path.
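To summarize where this thread landed, here is a rough sketch of the flush-time check being converged on. The names (`BasicStatus`, `IsLeaseRequest`, `rejectProposalWithRedirectLocked`) come from the snippets quoted above; the surrounding shape is an assumption, not the final diff:

```go
// Sketch: inspect raft state once per flush and decide what to do with
// lease-acquisition proposals on this (possibly non-leader) replica.
var leaderKnown, iAmTheLeader bool
var leader roachpb.ReplicaID
if raftGroup != nil {
	status := raftGroup.BasicStatus()
	iAmTheLeader = status.RaftState == raft.StateLeader
	leaderKnown = status.Lead != raft.None
	if leaderKnown {
		leader = roachpb.ReplicaID(status.Lead)
	}
}

for _, p := range proposals {
	if !iAmTheLeader && p.Request.IsLeaseRequest() {
		if leaderKnown {
			// Another replica is the leader: redirect the client there so the
			// leader acquires the lease itself, instead of forwarding a lease
			// request built on a stale view of the lease.
			b.p.rejectProposalWithRedirectLocked(ctx, p, leader)
			continue
		}
		// No known leader (e.g. right after startup, or while candidates back
		// off after a failed election): don't reject; the client ends up
		// waiting for the election to settle, and raft drops (rather than
		// forwards) the proposal in the meantime.
	}
	// ... hand p to raft as usual ...
}
```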
Force-pushed from e27baf6 to b31e727.
PTAL at the test
I hit a problem though: replicas on draining nodes will not take leases. They will become leaders, though, which can lead to deadlock since, if a leader is draining, nobody else will attempt to take the lease and it also won’t.
The logic is here (cockroach/pkg/kv/kvserver/replica_range_lease.go, lines 619 to 623 in 785aea7):
if r.store.IsDraining() {
	// We've retired from active duty.
	return r.mu.pendingLeaseRequest.newResolvedHandle(roachpb.NewError(
		newNotLeaseHolderError(nil, r.store.StoreID(), r.mu.state.Desc)))
}
I’m also wondering whether this is already a deadlock condition: is there anything preventing all the nodes with replicas for a range from draining at the same time? If that happens, nobody will want the lease?
I'm thinking of allowing a leader to acquire the lease even if it's draining.
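A minimal sketch of that idea against the snippet above. How "am I the raft leader" is checked (here via the cached `leaderID`) and the exact lock discipline are assumptions:

```go
// Sketch: a draining store still refuses new leases, unless this replica is
// the raft leader - in which case refusing would leave nobody willing to
// acquire the lease and the range would deadlock.
iAmTheLeader := r.mu.leaderID == r.mu.replicaID // assumed way to check leadership here
if r.store.IsDraining() && !iAmTheLeader {
	// We've retired from active duty; let another replica pick up the lease.
	return r.mu.pendingLeaseRequest.newResolvedHandle(roachpb.NewError(
		newNotLeaseHolderError(nil, r.store.StoreID(), r.mu.state.Desc)))
}
```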
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @bdarnell, @knz, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/replica_proposal_buf.go, line 163 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
"the another"
done
pkg/kv/kvserver/replica_proposal_buf.go, line 411 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Give this block a comment.
done
pkg/kv/kvserver/replica_proposal_buf.go, line 416 at r2 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Should this be `state == follower` or `state != leader`? I think candidates and pre-candidates (and learners, although they just use StateFollower) should be handled the same way as followers here.
done
pkg/kv/kvserver/replica_proposal_buf.go, line 463 at r2 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
I don't feel strongly about what happens when there is no leader - it's a very rare case and either bouncing among random replicas or waiting synchronously would be acceptable to me. If we could get rid of the synchronous wait code altogether that would be nice, although I guess that's still useful in the normal cold start path.
I think I'm in the "let's wait" camp too. Seems cleaner to me... I don't really like `NotLeaseholderErrors` that don't have a redirection in them.
pkg/kv/kvserver/replica_proposal_buf.go, line 703 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I think we also need to call `cleanupFailedProposalLocked` in this function before calling `finishApplication`. That will ensure that the proposal is removed from the `proposals` map and that it will release quota if it acquired any.
done
pkg/kv/kvserver/replica_proposal_buf.go, line 708 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Doesn't this grab a read lock? Read locks aren't re-entrant. I think you want `descRLocked`.
removed
pkg/kv/kvserver/replica_proposal_buf.go, line 709 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
If you just want the `StoreID`, why go through all this effort? Is `r.store.StoreID()` not sufficient?
done, 10x
Before this patch, a draining node would not take any new leases once it's draining. This is a problem in case all replicas of a range are draining at the same time (*) - nobody wants the lease. Particularly because the liveness range is expiration-based (and thus permanently in need of new leases to be granted), this quickly results in nodes failing their liveness. It also becomes more of a problem with cockroachdb#55148, where we start refusing to take the lease on replicas that are not the leader - so if the leader is draining, we deadlock. We have code that tries to transfer the leadership away from draining nodes - which is a way to solve these deadlocks - but that code is pretty best-effort (and also useless if all the replicas are draining? not sure).

This patch makes an exception for leaders, which now no longer refuse the lease even when they're draining. The reasoning being that it's too easy to deadlock otherwise.

(*) Draining multiple nodes at once might be questionable, but I think it's an acceptable practice. At least aspirationally we should allow an operator to drain as many nodes as it wants concurrently, as long as care is taken to not actually stop critical nodes. The way it ought to work is that you start draining 3 nodes, and then the draining becomes successful on one node at a time - this way allowing you to rotate them all. Even with this patch there are multiple holes in that story, one of them being the fact that draining multiple nodes at once will generally not succeed on any node because each node will be stuck with some leases.

Release note: None
Reviewed 2 of 5 files at r1, 1 of 4 files at r4, 5 of 5 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @bdarnell, @knz, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/client_raft_test.go, line 1157 at r5 (raw file):
log.Infof(ctx, "test: truncating log... done")
// !!! assert lease expired
Don't forget this.
Reviewed 5 of 5 files at r5.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @bdarnell, @knz, and @tbg)
pkg/kv/kvserver/client_raft_test.go, line 1054 at r5 (raw file):
// the leader does not attempt to acquire a lease and, thus, does not block until
// it figures out that it cannot, in fact, take the lease.
// !!! more words
!!!
pkg/kv/kvserver/client_raft_test.go, line 1110 at r5 (raw file):
// Wait until the leadership is transferred away from the partitioned replica.
testutils.SucceedsSoon(t, func() error {
	mtc.advanceClock(ctx)
How long does this test take to run? Does `advanceClock` actually speed up the election process? I wouldn't expect it to make a replica tick faster.
pkg/kv/kvserver/client_raft_test.go, line 1175 at r5 (raw file):
// Now we're going to send a request to the behind replica, and we expect it
// to not block; we expect a redirection to the leader.
Did you confirm that this blocked before the introduction of your fix?
pkg/kv/kvserver/replica_proposal_buf.go, line 418 at r5 (raw file):
// requests.
var iAmTheLeader bool
var leaderKnown bool
nit: `var leaderKnown, iAmTheLeader bool`
Before this patch, a draining node would not take any new leases once it's draining. This is a problem in case all replicas of a range are draining at the same time (e.g. when draining a single-node cluster, or when draining multiple nodes at the same time perhaps by mistake) - nobody wants the lease. Particularly because the liveness range is expiration-based (and thus permanently in need of new leases to be granted), this quickly results in nodes failing their liveness. It also becomes more of a problem with cockroachdb#55148, where we start refusing to take the lease on replicas that are not the leader - so if the leader is draining, we deadlock.

This patch makes an exception for leaders, which now no longer refuse the lease even when they're draining. The reasoning being that it's too easy to deadlock otherwise.

Release note: None
55624: kvserver: don't always refuse to take leases when draining r=andreimatei a=andreimatei

Before this patch, a draining node would not take any new leases once it's draining. This is a problem in case all replicas of a range are draining at the same time (e.g. when draining a single-node cluster, or when draining multiple nodes at the same time perhaps by mistake) - nobody wants the lease. Particularly because the liveness range is expiration-based (and thus permanently in need of new leases to be granted), this quickly results in nodes failing their liveness. It also becomes more of a problem with #55148, where we start refusing to take the lease on replicas that are not the leader - so if the leader is draining, we deadlock.

This patch makes an exception for leaders, which now no longer refuse the lease even when they're draining. The reasoning being that it's too easy to deadlock otherwise.

Release note: None

56004: build: add instructions to rectify code that is not regenerated r=irfansharif a=otan

Make it easier for the developer to know how to fix un-generated code by specifying the required commands.

Release note: None

56016: backfill: remove unused bound account from column backfiller r=yuzefovich a=adityamaru

Previously, we were creating a bound account in the column backfiller which was not being used to track anything. This change removes that.

Release note: None

Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Oliver Tan <[email protected]>
Co-authored-by: Aditya Maru <[email protected]>
Should be ready to go. PTAL at the test.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @bdarnell, @knz, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/client_raft_test.go, line 1054 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
!!!
done
pkg/kv/kvserver/client_raft_test.go, line 1110 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
How long does this test take to run? Does `advanceClock` actually speed up the election process? I wouldn't expect it to make a replica tick faster.
moved to TestServer. And tweaked some knobs. It takes ~3.5s
pkg/kv/kvserver/client_raft_test.go, line 1157 at r5 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Don't forget this.
done
pkg/kv/kvserver/client_raft_test.go, line 1175 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Did you confirm that this blocked before the introduction of your fix?
yup
pkg/kv/kvserver/replica_proposal_buf.go, line 421 at r2 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Yeah, I remember that single-replica groups short-circuit a lot of logic and don't maintain a lot of things that look like invariants in multi-replica groups. I don't remember anything well enough to give you useful advice right now, though.
gone
pkg/kv/kvserver/replica_proposal_buf.go, line 418 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: `var leaderKnown, iAmTheLeader bool`
done
pkg/kv/kvserver/replica_proposal_buf_test.go, line 74 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
It seems like we should be able to write some tests for this fairly easily.
added test
pkg/kv/kvserver/replica_range_lease.go, line 212 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I'm not sold on this. We want to backport this PR, so we'd like it to be as small as possible. And we're planning on using `master` as a staging ground for the backport, so we shouldn't merge anything on top that we don't intend to backport in case it hides issues with the portion we do intend to backport.

So I'm going to assume that if this is in this PR, we want to backport it too. If so, then there's a high bar for changes, especially in such subtle areas as this. This is only an optimization, right? Have we seen that it's anything more than theoretical? Have we seen the liveness heartbeats discussed above cause issues? As of 20.2, liveness heartbeats are coalesced, so I'd be surprised if they were.

I'm primarily worried about the additional complexity this adds in exchange for the unclear optimization. The PR itself is already making a potentially destabilizing change, so I'm wary of anything that makes issues that fall out from it harder to debug or increases the chance that we got it wrong.
removed commit
Force-pushed from b31e727 to 6d60b64.
Test failure: pkg/kv/kvserver/client_raft_test.go:1090:11: this value of err is never used (SA4006)
@nvanbenschoten you've got this PR, right?
Reviewed 2 of 8 files at r6, 9 of 11 files at r7.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @bdarnell, @knz, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/client_raft_test.go, line 1283 at r7 (raw file):
// the lease and then blocking traffic for a long time until they find out
// whether they successfully took the lease or not.
func TestRequestsOnLaggingReplicaXXX(t *testing.T) {
Looks like this test name is off. Did you mean to remove this version with `multiTestContext`?
pkg/kv/kvserver/replica_proposal_buf.go, line 189 at r7 (raw file):
}
// setTestingKnobs should only be called immediately after Init(); it cannot be
nit: I would skip the unexported setter method and just assign directly like we do for the other testing knobs, but this is also fine.
pkg/kv/kvserver/replica_proposal_buf_test.go, line 74 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
added test
Did you want to add a test to this file? I think that would be the easiest way to unit test this change. It should be pretty satisfying to do.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @bdarnell, @knz, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/client_raft_test.go, line 1283 at r7 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Looks like this test name is off. Did you mean to remove this version with `multiTestContext`?
done
pkg/kv/kvserver/replica_proposal_buf.go, line 189 at r7 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: I would skip the unexported setter method and just assign directly like we do for the other testing knobs, but this is also fine.
done
pkg/kv/kvserver/replica_proposal_buf_test.go, line 74 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Did you want to add a test to this file? I think that would be the easiest way to unit test this change. It should be pretty satisfying to do.
done, PTAL
This patch backpedals a little bit on the logic introduced in cockroachdb#55148. That patch said that, if a leader is known, every other replica refuses to propose a lease acquisition. Instead, the replica in question redirects whoever was triggering the lease acquisition to the leader, thinking that the leader should take the lease.

That patch introduced a deadlock: some replicas refuse to take the lease because they are not VOTER_FULL (see CheckCanReceiveLease()). To fix the deadlock, this patch incorporates that check in the proposal buffer's decision about whether or not to reject a proposal: if the leader is believed to refuse to take the lease, then we again forward our own lease request.

An edge case to the edge case is when the leader is not even part of the proposer's range descriptor. This can happen if the proposer is far behind. In this case, we assume that the leader is eligible. If it isn't, the deadlock will resolve once the proposer catches up.

A future patch will relax the conditions under which a replica agrees to take the lease. VOTER_INCOMING replicas should take the lease. VOTER_DEMOTING are more controversial.

Fixes cockroachdb#57798

Release note: None
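A rough sketch of how that check could slot into the rejection decision sketched earlier in this conversation. `CheckCanReceiveLease` is the helper named in the commit message; its exact signature and the placement of this check are assumptions:

```go
// Sketch: only reject-and-redirect if the known leader could actually take
// the lease; otherwise keep forwarding our own lease request so the range
// doesn't deadlock.
leaderEligible := true
if leaderRep, ok := rangeDesc.GetReplicaDescriptorByID(leader); ok {
	// The leader is in our descriptor: check its lease eligibility (VOTER_FULL etc.).
	leaderEligible = checkCanReceiveLease(leaderRep, rangeDesc) == nil // assumed signature
}
// If the leader is not in our (possibly stale) descriptor, leaderEligible stays
// true: assume it can take the lease; if it can't, the situation resolves
// itself once we catch up.
if !iAmTheLeader && leaderKnown && leaderEligible && p.Request.IsLeaseRequest() {
	b.p.rejectProposalWithRedirectLocked(ctx, p, leader)
	continue
}
```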
58722: kvserver: don't refuse to fwd lease proposals in some edge cases r=andreimatei a=andreimatei

This patch backpedals a little bit on the logic introduced in #55148. That patch said that, if a leader is known, every other replica refuses to propose a lease acquisition. Instead, the replica in question redirects whoever was triggering the lease acquisition to the leader, thinking that the leader should take the lease.

That patch introduced a deadlock: some replicas refuse to take the lease because they are not VOTER_FULL (see CheckCanReceiveLease()). To fix the deadlock, this patch incorporates that check in the proposal buffer's decision about whether or not to reject a proposal: if the leader is believed to refuse to take the lease, then we again forward our own lease request.

An edge case to the edge case is when the leader is not even part of the proposer's range descriptor. This can happen if the proposer is far behind. In this case, we assume that the leader is eligible. If it isn't, the deadlock will resolve once the proposer catches up.

A future patch will relax the conditions under which a replica agrees to take the lease. VOTER_INCOMING replicas should take the lease. VOTER_DEMOTING are more controversial.

Fixes #57798

Release note: None

59087: util/log: new output format 'crdb-v2' r=itsbilal a=knz

Fixes #50166. This new format intends to address all the known shortcomings with `crdb-v1` while remaining compatible with entry parsers designed for the previous version. See the user-facing release note below for a summary of changes; and the included reference documentation for details.

Example TTY output with colors:

![image](https://user-images.githubusercontent.com/642886/104824568-261e9380-5853-11eb-9ad9-e5936f0890fd.png)

Example for a single-line unstructured entry. Before:

```
I210116 22:17:03.736236 57 cli/start.go:681 ⋮ node startup completed:
```

After:

```
I210116 22:17:03.736236 57 cli/start.go:681 ⋮ [-] 12 node startup completed:
              tag field now always included ^^^
          entry counter now always included ^^^
```

Example for a multi-line unstructured entry. Before:

```
I210116 22:15:38.105666 452 gossip/gossip.go:567 ⋮ [n1] 74 gossip status (ok, 1 node‹›)
gossip client (0/3 cur/max conns)
gossip connectivity
  n1 [sentinel];
```

(subsequent lines lack a log entry prefix; hard to determine where entries start and end)

After:

```
I210116 22:15:38.105666 452 gossip/gossip.go:567 ⋮ [n1] 74 gossip status (ok, 1 node‹›)
I210116 22:15:38.105666 452 gossip/gossip.go:567 ⋮ [n1] 74 +gossip client (0/3 cur/max conns)
I210116 22:15:38.105666 452 gossip/gossip.go:567 ⋮ [n1] 74 +gossip connectivity
I210116 22:15:38.105666 452 gossip/gossip.go:567 ⋮ [n1] 74 + n1 [sentinel];
^^^ common prefix repeated for each msg line
same entry counter for every subsequent line ^^^
continuation marker "+" on subsequent lines ^^
```

Example for a structured entry. Before:

```
I210116 22:14:38.175829 469 util/log/event_log.go:32 ⋮ [n1] Structured entry: {...}
```

After:

```
I210116 22:14:38.175829 469 util/log/event_log.go:32 ⋮ [n1] 21 ={...}
entry counter always present ^^
equal sign "=" denotes structured entry ^^
```

Release note (cli change): The default output format for `file-group` and `stderr` sinks has been changed to `crdb-v2`. This new format is non-ambiguous and makes it possible to reliably parse log files. Refer to the format's documentation for details. Additionally, it prevents single log lines from exceeding a large size; this problem is inherent to the `crdb-v1` format and can prevent `cockroach debug zip` from retrieving v1 log files. The new format has also been designed so that existing log file analyzers for the `crdb-v1` format can read entries written in the new format. However, this conversion may be imperfect. Again, refer to the new format's documentation for details. In case of incompatibility, users can force the previous format by using `format: crdb-v1` in their logging configuration.

59141: ui: upgrade admin-ui-components to new dep r=dhartunian a=dhartunian

We renamed the `admin-ui-components` package to `cluster-ui`.

Release note: None

59388: build,bazel: remove references to `gofmt` in bazel build r=rickystewart a=rickystewart

This was cargo-culted from the `Makefile`, but isn't necessary to get the build to succeed, and interferes with hermiticity because it requires `gofmt` to be globally installed. It's simpler to just remove these references entirely.

Release note: None

Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: David Hartunian <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
…ts"" This reverts commit 49c8337. This is a revert of a revert, reintroducing the backport of the main commit from cockroachdb#55148. The backport was reverted because of a deadlock, which is not being fixed by the next commit. Release note (bug fix): A bug causing queries sent to a freshly-restarted node to sometimes hang for a long time while the node catches up with replication has been fixed.
61982: kvserver: allow lease extensions from followers in propBuf r=nvanbenschoten,andreimatei,tbg a=erikgrinaker

The Raft proposal buffer currently rejects lease acquisition proposals from followers if there is an eligible leader, to avoid interfering with it trying to claim the lease. However, this logic unintentionally rejected lease extension proposals from followers as well, which could cause a leaseholder to lose its lease.

This patch changes the proposal buffer to allow lease extension proposals from followers for extension-based leases as long as they still hold a valid lease.

Resolves #59179. Follow-up to #55148.

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
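Sketching that carve-out on top of the earlier rejection check; the `ownsValidLease`-style helper is an assumption about how the proposal buffer would learn whether this replica still holds a valid lease:

```go
// Sketch: a follower extending its own still-valid (expiration-based) lease is
// not competing with the leader for a new lease, so don't reject the proposal.
ownsCurrentLease := b.p.ownsValidLease(ctx, now) // assumed helper, not the actual API
if p.Request.IsLeaseRequest() && !iAmTheLeader && leaderKnown && !ownsCurrentLease {
	b.p.rejectProposalWithRedirectLocked(ctx, p, leader)
	continue
}
```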
Previously, we were only updating the `leases.error` metric (measuring the number of failed lease requests) if a lease request failed during command evaluation. However, after cockroachdb#55148, lease requests can be rejected by the replica proposal buffer if there is a known raft leader that is different from the replica requesting the lease. We were not recording such rejections as failed lease requests in the metrics.

This commit fixes this oversight.

Release note: None
64531: kvserver: fix metrics collection for rejected lease requests r=aayushshah15 a=aayushshah15

Previously, we were only updating the `leases.error` metric (measuring the number of failed lease requests) if a lease request failed during command evaluation. However, after #55148, lease requests can be rejected by the replica proposal buffer if there is a known raft leader that is different from the replica requesting the lease. We were not recording such rejections as failed lease requests in the metrics.

This commit fixes this oversight.

Release note: None

Co-authored-by: Aayush Shah <[email protected]>
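A plausible shape for that fix, sketched from the description above; the metric field name is an assumption:

```go
// Sketch: count propBuf rejections of lease requests as failed lease requests,
// so that `leases.error` reflects them too.
if p.Request.IsLeaseRequest() {
	r.store.metrics.LeaseRequestErrorCount.Inc(1) // assumed metric field name
}
```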
This patch aims to improve the behavior in scenarios where a follower
replica is behind, unaware of the latest lease, and it tries to acquire
a lease in its ignorant state. That lease acquisition request is bound
to fail (because the lease that it's based on is stale), but while it
fails (or, rather, until the behind replica finds out that it failed)
local requests are blocked. This blocking can last for a very long time
in situations where a snapshot is needed to catch up the follower, and
the snapshot is queued up behind many other snapshots (e.g. after a node
has been down for a while and gets restarted).
This patch tries an opinionated solution: never allow followers to
acquire leases. If there is a leader, it's a better idea for the leader
to acquire the lease. The leader might have a lease anyway or, even if
it doesn't, having the leader acquire it saves a leadership transfer
(leadership follows the lease).
We change the proposal path to recognize lease requests and reject them
early if the current replica is a follower and the leader is known. The
rejection points to the leader, which causes the request that triggered
the lease acquisition to make its way to the leader and attempt to
acquire a lease over there.
The next commit takes this further by short-circuiting the lease
proposal even sooner - but that patch is more best-effort.
Fixes #37906
Release note (bug fix): A bug causing queries sent to a
freshly-restarted node to sometimes hang for a long time while the node
catches up with replication has been fixed.