kvserver: lease counts diverge when a new node is added to a cluster #67740
Comments
I wanted to post this more generally because I'm not quite sure what behaviour we'd rather have here. I also struggled to come up with any concrete ideas that would cleanly improve the situation.
Is it possible to have a leaseholder execute a membership change removing itself away from the raft group? (Either by letting the lease expire, or in a more coordinated fashion by transferring the lease to the incoming replica.) That would reduce the number of lease transfers needed for a given range from 2 to 1, and the lease would stay in the same rack/locality. If that's not possible, can we have an outgoing leaseholder hold onto its lease over the duration of the rebalance, transferring it away to another locality only when the learner replica has been fully caught up? That would reduce how long a cluster observes this diverged state.
This is something that @tbg has discussed in the past. It's certainly not possible today, because we don't have a way to coalesce a lease transfer with a membership change into the same Raft log entry. If we did, then I think we could simplify this and a few other edge cases with 1x replication.
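For illustration only, here is a minimal Go sketch of what a combined "membership change plus lease transfer" proposal payload could look like. All of the types and fields below are hypothetical and do not exist in the codebase; the sketch only shows the shape of the coalescing idea.

```go
package main

import "fmt"

// ReplicaID, LeaseTransfer, and MembershipChange are hypothetical
// placeholders, not real CRDB types.
type ReplicaID int32

type LeaseTransfer struct {
	From, To ReplicaID
}

type MembershipChange struct {
	AddVoters    []ReplicaID
	RemoveVoters []ReplicaID
}

// CombinedProposal sketches a single Raft log entry that would apply a
// membership change and a lease transfer atomically, so a leaseholder
// could remove itself and hand off the lease in one step.
type CombinedProposal struct {
	Change MembershipChange
	Lease  *LeaseTransfer // nil if no lease movement is needed
}

func main() {
	p := CombinedProposal{
		Change: MembershipChange{AddVoters: []ReplicaID{4}, RemoveVoters: []ReplicaID{1}},
		Lease:  &LeaseTransfer{From: 1, To: 4},
	}
	fmt.Printf("proposal: %+v, lease: %+v\n", p.Change, *p.Lease)
}
```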
This is a good idea, though I don't think it would work today because we eagerly throw away LEARNER replicas from previous ChangeReplicas attempts whenever we see them (see […]). An interesting variation of this for when we know we want a longer-lived learner would be to create a NON_VOTER instead of a LEARNER before transferring away the lease, and then allow the new leaseholder to perform the VOTER<->NON_VOTER swap before removing the old VOTER.
Relates to #51867
@aayushshah15 there was a previous idea described in #40333 for how to fix this.
Here's an easy local reproduction of this issue that may come in handy when trying to fix it:
I looked briefly into this intermediate NON_VOTER idea and it doesn't look trivial. After a refresh of the code, I believe the "then allow the new leaseholder to perform the VOTER<->NON_VOTER swap before removing the old VOTER" step would require a good amount of specialized code. Without that, the new leaseholder's allocator would first issue an […]. We could address this by maintaining more state about the purpose of the temporary non-voter in the range descriptor. But this wouldn't be a small change.

This makes me question whether transferring the lease away for a short duration is even good enough to consider this issue resolved. In setups like multi-region deployments with one replica per region and leaseholder preferences (note: REGION survivability uses 5x replication, so it doesn't hit this), violating these preferences even for a short period of time will be disruptive. So why are we doing so in the first place? The straightforward answer that's spelled out in #40333 is that we can't perform a […].

For this to be safe, we'd need to change our leaseholder eligibility rules in two ways. First, we'd need to allow […]. I think that if we made this change, we would not only fix this issue sufficiently but could actually guarantee that the lease would never leave a region during an intra-region rebalancing operation, all while never violating the invariant that a voting replica always holds the lease.
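As a rough illustration of the eligibility-rule direction sketched above, here is a hypothetical Go sketch. The replica role names come from the discussion, but the rule itself is an assumption about one possible reading of the (truncated) proposal, not CRDB's actual lease eligibility logic.

```go
package main

import "fmt"

// ReplicaType mirrors the replica roles discussed in this thread; the
// eligibility function below is a hypothetical sketch, not real code.
type ReplicaType int

const (
	VoterFull ReplicaType = iota
	VoterIncoming
	VoterDemoting
	NonVoter
	Learner
)

// canReceiveLease sketches a relaxed rule in which a VOTER_INCOMING replica
// (part of the joint config) may be the target of a lease transfer, so the
// lease never has to leave the locality during an intra-locality rebalance.
func canReceiveLease(t ReplicaType) bool {
	switch t {
	case VoterFull, VoterIncoming:
		return true
	default:
		// Demoting voters, non-voters, and learners still cannot acquire
		// the lease under this sketch.
		return false
	}
}

func main() {
	fmt.Println("VOTER_INCOMING eligible:", canReceiveLease(VoterIncoming))
	fmt.Println("LEARNER eligible:", canReceiveLease(Learner))
}
```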
That sounds like a good idea. Something that bugs me about the original issue is also how we have the replica locally in the same rack, but we choose to send the lease to a possibly far-away node and then send a snapshot from there. This would also be avoided then. Of course, exploring the idea of requesting snapshots to be sent from suitable replicas (regardless of leadership status) would give us even more flexibility, but that's orthogonal.
This is actually something that came up in a customer call last week as something that substantially adds to the operating costs of CRDB. This is because network traffic within the same AZ is ~free whereas cross-AZ traffic is not. Our current inability to make sure that we're sending snapshots from within the same locality also makes operations like node decommissioning quite expensive (which has prompted customers to try some unsupported workflows).
Related: #42491
@shralex here is a collection of code pointers and areas for investigation. The first is about raft leadership and the flexibility of etcd/raft. If we want to transfer a lease from a […].

Next, we can start looking at how CRDB performs rebalance operations. Let's start at the code in cockroach/pkg/kv/kvserver/allocator.go, line 687 (commit 48b3793).
In such cases, the […]; see cockroach/pkg/kv/kvserver/replicate_queue.go, line 246 (commit 48b3793).
If a target for a lateral rebalance is established, the queue proceeds to its […]; see cockroach/pkg/kv/kvserver/replicate_queue.go, lines 483 to 484 (commit 48b3793).
A replica is only able to initiate a replication change if it is the current leaseholder. So if a replica reaches this part of the code, it knows (with enough certainty) that it is the current leaseholder. When deciding which replica to remove in a lateral replica rebalance, it therefore takes special notice of whether it itself is the removal target. In such cases, it immediately transfers the lease away; see cockroach/pkg/kv/kvserver/replicate_queue.go, lines 1109 to 1110 (commit 48b3793).
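For readers following along, here is a heavily simplified Go sketch of the decision just described. It is not the actual replicate queue code; all names and types are made up.

```go
package main

import "fmt"

// replica is a simplified stand-in for a range replica descriptor.
type replica struct {
	storeID       int
	isLeaseholder bool
}

// maybeTransferLeaseBeforeRemoval sketches today's behavior: if the replica
// chosen for removal in a lateral rebalance is the current leaseholder, it
// must first hand the lease to some other existing voter before the
// rebalance can proceed, even if that voter lives in another rack.
func maybeTransferLeaseBeforeRemoval(removalTarget replica, otherVoters []replica) (*replica, bool) {
	if !removalTarget.isLeaseholder {
		return nil, false // no transfer needed; the rebalance can proceed directly
	}
	if len(otherVoters) == 0 {
		return nil, false
	}
	// Today the target is simply another voter, which with one replica per
	// rack necessarily sits in a different rack.
	return &otherVoters[0], true
}

func main() {
	target, transferred := maybeTransferLeaseBeforeRemoval(
		replica{storeID: 1, isLeaseholder: true},
		[]replica{{storeID: 2}, {storeID: 3}},
	)
	fmt.Println("lease transferred:", transferred, "to store:", target.storeID)
}
```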
This is where we'll need to start making changes. As discussed above and in person, we'd like to defer this lease transfer until after the incoming replica has received its initial snapshot, so let's see where that is. Back in the code, the replica then (if it didn't just transfer the lease away) calls into […]. I'll leave this by noting that the rebalance operation enters and leaves its joint voter configuration in cockroach/pkg/kv/kvserver/replica_command.go, lines 1697 to 1718 (commit 48b3793).
So this is where we could move the lease transfer from the VOTER_OUTGOING to the VOTER_INCOMING. But that's likely not where we'll actually want to put it. The reason is that we need to be able to recover from a partially successful rebalance (i.e. leave the joint quorum configuration) and roll back or roll forward. In the cases where we need to roll forward, we still want to perform the lease transfer, because we can't let the VOTER_OUTGOING be removed while it still holds the lease. Luckily, we have a single place where we perform this roll forward, in […].

That's a high-level lay of the land, but there will certainly be things that come up when we begin touching the code. One thing that I'm aware of from spending a short amount of time trying to prototype this a few months ago is that this code won't let a Raft proposal remove its proposer. But it's using the wrong definition of "remove", so it won't let the proposer move to VOTER_OUTGOING. That's one of the many things we'll need to resolve.
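A minimal sketch of the proposed roll-forward ordering, with entirely made-up helper names (the real roll-forward lives in the code referenced above); it only illustrates that the lease transfer happens while still in the joint config, before the outgoing voter is removed.

```go
package main

import "fmt"

// Everything below is hypothetical and only illustrates ordering.
type rangeDescriptor struct{ rangeID int }

func transferLeaseToIncomingVoter(d rangeDescriptor) error {
	fmt.Println("transferring lease to VOTER_INCOMING for range", d.rangeID)
	return nil
}

func leaveJointConfig(d rangeDescriptor) error {
	fmt.Println("leaving joint config for range", d.rangeID)
	return nil
}

// rollForwardJointConfig sketches the proposed roll-forward: move the lease
// onto the incoming voter while still in the joint config, and only then
// exit the joint config, so the outgoing voter is never removed while it
// still holds the lease.
func rollForwardJointConfig(d rangeDescriptor) error {
	if err := transferLeaseToIncomingVoter(d); err != nil {
		return err
	}
	return leaveJointConfig(d)
}

func main() {
	_ = rollForwardJointConfig(rangeDescriptor{rangeID: 42})
}
```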
Based on my reading of the code below, this should "just work", but we still need to verify it.
@tbg can you clarify this case vs what you referenced above: https://github.com/etcd-io/etcd/blob/519f62b269cbc5f0438587cdcd9e3d4653c6515b/raft/testdata/confchange_v1_remove_leader.txt#L201 |
That unit test shows what happens when you remove the raft leader (tl;dr: nothing good, actually pretty bad in terms of loss of availability, though I think it would be temporary in our case since once the old leader heartbeats anyone it will be told to delete itself, and after the election timeout someone else would campaign). We would try to avoid this case by transferring leadership away while the leader is a VOTER_DEMOTING. But yes, raft definitely has sharp edges here that we may be better off addressing. We don't control raft elections, so even if we transfer the lease and raft leadership away, it's possible for the old leader to slip back into leadership before we actually demote the replica. And you don't want to run the risk of having a learner that thinks it's the raft leader, or the unavailability mentioned above. However, we should be able to fix this. If we are in a joint config of […]
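To make the joint-config mechanics concrete, here is a small standalone example using etcd/raft's ConfChangeV2 to express the demotion discussed above: node 1 (the current leader/leaseholder) is demoted to a learner while node 4 is added as a voter. The import path may differ depending on which etcd/raft version is vendored.

```go
package main

import (
	"fmt"

	"go.etcd.io/etcd/raft/v3/raftpb"
)

func main() {
	cc := raftpb.ConfChangeV2{
		Changes: []raftpb.ConfChangeSingle{
			// Remove + AddLearner for the same node expresses a demotion:
			// node 1 becomes a demoting voter in the joint config.
			{Type: raftpb.ConfChangeRemoveNode, NodeID: 1},
			{Type: raftpb.ConfChangeAddLearnerNode, NodeID: 1},
			// Node 4 becomes a VOTER_INCOMING.
			{Type: raftpb.ConfChangeAddNode, NodeID: 4},
		},
	}
	autoLeave, joint := cc.EnterJoint()
	fmt.Printf("enters joint config: %v (auto-leave: %v)\n", joint, autoLeave)
	// While in the joint config, raft leadership (and the lease) would be
	// transferred away from node 1 before leaving the joint config, e.g.
	// via Node.TransferLeadership or RawNode.TransferLeader.
}
```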
As part of my job, I have to ask for updates pertaining to tickets that involve our customers. This gh issue impacts gh 1228, which was closed in lieu of this ticket.
@daniel-crlabs we are hoping to address this issue in the v22.1 release. Given the complexity involved, the fix here will likely not be backported to earlier release branches.
Sounds great, thank you for the update, I'll let the customer know. |
@shralex to make some of this concrete, here's the order of operations that we would expect before and after this change during a replica rebalance. Imagine we are starting with replicas […].

Current replica rebalance steps: […]

New replica rebalance steps: […]
Writing this all out also makes me realize that this change will reduce cross-zone/region network traffic as well. With the new order of operations, we are able to send the learner snapshot directly from 1 to 4, instead of sending it from the temporary leaseholder across zones/regions. This gives us part of the benefit promised by #42491 for free.

NOTE: we no longer use VOTER_OUTGOING, so please pretend my previous references to VOTER_OUTGOING said VOTER_DEMOTING_LEARNER. This is most relevant to the earlier comment: […]
Describe the problem
In a cluster with at least 3 localities, adding a new node to an existing locality reliably triggers a lease count divergence. The lease counts continue to diverge until this newly added node is fully hydrated with the ~mean number of replicas.
This looks something like the following: […]
Cause
In order to understand how this happens, consider a cluster with 3 racks (`rack=0`, `rack=1`, `rack=2`) and 9 nodes. Let's walk through adding a new node (`n10`) to `rack=0`. Because of the diversity heuristic, only the other existing nodes in `rack=0` are allowed to shed their replicas away to `n10`. For ranges that also have their leases in `rack=0`, this means that those leaseholders will first need to shed their lease away to one of the nodes in racks `1` or `2` and expect those nodes to execute the rebalance. This will continue until `n10` has received roughly the mean number of replicas relative to the rest of the nodes in the cluster.

However, in a cluster with enough data, fully hydrating the new node will take a while, sometimes on the order of hours (note that `n10` can only receive new replicas at the snapshot rate dictated by `kv.snapshot_rebalance.max_rate`). Until this happens, nodes in `rack=0` will continue shedding leases away to nodes in racks `1` and `2` until they basically have zero leases.

To Reproduce
I can reproduce this by following the steps outlined above on both 20.2 and 21.1.
Additional details
Until `n10` is fully hydrated, nodes in `rack=0` will continue hitting the `considerRebalance` path (for the ranges for which they are the leaseholders) in the allocator: cockroach/pkg/kv/kvserver/replicate_queue.go, line 487 (commit 7495434).

Since there is indeed a valid rebalance candidate, `RebalanceVoter` will return `ok==true` here: cockroach/pkg/kv/kvserver/replicate_queue.go, lines 1079 to 1087 (commit 7495434).

This will then lead to the call to `maybeTransferLeaseAway` here: cockroach/pkg/kv/kvserver/replicate_queue.go, lines 1106 to 1107 (commit 7495434).

This will transfer the lease away to another replica.
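To make the divergence dynamic concrete, here is a toy Go simulation of the effect described above. The rates and counts are arbitrary illustrative assumptions, not measurements from a real cluster: leases can be shed from `rack=0` far faster than `n10` can absorb replicas, because replica movement is bounded by the snapshot rate (`kv.snapshot_rebalance.max_rate`) while lease transfers are cheap.

```go
package main

import "fmt"

func main() {
	const (
		leaseTransfersPerMin   = 100 // leases shed by rack=0 per minute (assumed)
		replicaTransfersPerMin = 5   // replicas n10 can receive per minute (assumed)
		replicasNeeded         = 1000
		initialRack0Leases     = 3000
	)

	rack0Leases := initialRack0Leases
	replicasMoved := 0
	for minute := 0; replicasMoved < replicasNeeded; minute++ {
		// Leases drain out of rack=0 quickly...
		rack0Leases -= minInt(leaseTransfersPerMin, rack0Leases)
		// ...while n10 hydrates slowly at the snapshot rate.
		replicasMoved += replicaTransfersPerMin
		if minute%30 == 0 {
			fmt.Printf("minute %3d: rack=0 leases=%4d, replicas on n10=%4d\n",
				minute, rack0Leases, replicasMoved)
		}
	}
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}
```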
/cc. @cockroachdb/kv
gz#5876
Epic CRDB-10569
gz#9817