storage: can't change the membership with three replicas in a three-node cluster #2067
Comments
I don't understand. If you have fewer than three nodes, specify a config that has fewer than three replicas. You should be able to use any number (including one). What are you suggesting?
For example, I have a cluster with three nodes: node1, node2, and node3. Every node has 3 stores, and the number of replicas in the zone config is 3. At the beginning, the replicas of range1 are on [node1:store1, node2:store1, node3:store1]. As data grows, ranges begin to split, and the replicas of the newly created ranges end up in the same places as range1. If I want to move a new range to [node1:store2, node2:store2, node3:store2] to balance capacity between stores, it can't be done: we do not allow two replicas of a range on the same node, and we do not allow a replica to move within a node, so the move fails.
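To make the constraint in this scenario concrete, here is a minimal, self-contained sketch in Go. It is not CockroachDB's actual allocator code, and the type and function names are illustrative assumptions; it only shows why adding a replica on node1:store2 is rejected while node1:store1 still holds one.

```go
// Illustrative sketch only: models the rule that a range may not have two
// replicas on the same node, which is what blocks the store1 -> store2 move.
package main

import "fmt"

type replica struct {
	nodeID  int
	storeID int
}

// canAddReplica reports whether adding a replica on the target store would
// place two replicas of the range on the same node.
func canAddReplica(existing []replica, target replica) bool {
	for _, r := range existing {
		if r.nodeID == target.nodeID {
			return false
		}
	}
	return true
}

func main() {
	// range1: [node1:store1, node2:store1, node3:store1]
	rng := []replica{{1, 1}, {2, 1}, {3, 1}}
	// The lateral move would first add node1:store2 before removing node1:store1.
	fmt.Println(canAddReplica(rng, replica{nodeID: 1, storeID: 2})) // false: rejected
}
```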
@thundercw yes you're correct. I believe the current logic would disallow such a lateral move. @mrtracy is working on rebalancing logic and will be able to incorporate this change.
@spencerkimball @mrtracy We have implemented routing to multiple stores. If you need it, I can merge it in.
@thundercw, would you be able to open a PR with the changes you mentioned above?
This came up again today and I wanted to point out a risk with moving replicas from one store to another on the same node. During the move, if the node dies, the range will lose two replicas, which is enough to make it lose quorum. We shouldn't try to support this until we have preemptive snapshots to minimize the window in which both local replicas are active, and even then it's a little riskier than usual, so we might not want to enable this in clusters that are large enough to allow the rebalancer to work normally.
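As a back-of-the-envelope illustration of the risk described above (plain arithmetic in Go, not code from the repository), assuming the move is performed by adding the new local replica before removing the old one:

```go
// Sketch of the quorum arithmetic during a same-node store-to-store move:
// the range briefly has 4 replicas, two of them on one node, so losing that
// node drops it to 2 surviving replicas, below the quorum of 3.
package main

import "fmt"

func quorum(n int) int { return n/2 + 1 } // majority of n replicas

func main() {
	const replicasDuringMove = 4 // 3 original + 1 being added on the same node
	const lostIfNodeDies = 2     // the old and new replica share that node
	survivors := replicasDuringMove - lostIfNodeDies
	fmt.Printf("need %d, have %d -> quorum lost: %v\n",
		quorum(replicasDuringMove), survivors, survivors < quorum(replicasDuringMove))
}
```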
User @yznming reports hitting this issue today too (5 nodes, 6 stores per node, only 2 stores per node actually used) |
@bdarnell: I believe there is a way to completely eliminate this window of risk by using Howard quorums[1]. To reiterate the problem that @bdarnell mentioned above, consider normal rebalancing (across distinct nodes): the range goes from three replicas to four by adding a replica on a fourth node, then back to three by removing the old one. At every step, losing any single node still leaves a quorum of surviving replicas.
Everything is fine. But we cannot rebalance ranges across two stores on the same node. Consider: during that same transition, two of the four replicas would live on one node, so if that node dies the range is left with only two of four replicas and cannot reach the quorum of three.
This window is not good. So we currently do not allow rebalances across the same node. However, this rebalance can be done if we have Howard quorums. Here are the basics. Define the following variables: N is the total number of replicas, E is the size of an election quorum, and W is the size of a write quorum. The condition required for a valid Howard quorum is E + W > N. Regular Raft just sets E = W = ceil((N+1)/2). However, we can tweak these carefully to thread the needle. Let's denote a Howard quorum as (N, E, W). Vanilla Raft with 3 replicas is (3, 2, 2); when you up-replicate vanilla Raft to 4 replicas, you move to (4, 3, 3). Howard quorums, however, also allow (4, 3, 2) configurations. That opens the door to a new algorithm for rebalancing ranges across stores on a single node: pass through a temporary (4, 3, 2) configuration while both local replicas exist, so that a write quorum still survives on the other two nodes even if the shared node dies.
[1]: Flexible Paxos: https://arxiv.org/pdf/1608.06696v1.pdf
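For concreteness, here is a small Go sketch of the validity condition from the Flexible Paxos paper cited above, using the (N, E, W) notation of this comment. The helper is an illustration of the E + W > N check only, not an implementation of the rebalancing proposal:

```go
// Checks whether an (N, E, W) configuration satisfies the Flexible Paxos
// requirement that election and write quorums must intersect: E + W > N.
package main

import "fmt"

func valid(n, e, w int) bool { return e+w > n && e <= n && w <= n }

func main() {
	fmt.Println(valid(3, 2, 2)) // vanilla Raft with 3 replicas: true
	fmt.Println(valid(4, 3, 3)) // vanilla Raft up-replicated to 4: true
	fmt.Println(valid(4, 3, 2)) // the (4, 3, 2) configuration discussed above: true
	fmt.Println(valid(4, 2, 2)) // 2 + 2 is not > 4, quorums may not intersect: false
}
```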
@tschottdorf and I were talking about this yesterday. In short: introduce a new storage layer between replicas and physical stores; I'll call this a "virtual store" below. How this would work: replicas would be placed on virtual stores rather than directly on physical stores, a range would keep at most one replica per virtual store, and each virtual store would map to one or more physical stores.
How does this relate to what we already have? I also suggested equating "virtual store IDs" with attribute groups: a virtual store would be defined by a set of attributes, and membership of a physical store in a virtual store would be determined by the attributes used to start the store.
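A purely hypothetical Go sketch of the attribute-based grouping described above; none of these types or names exist in CockroachDB, they are assumptions used to make the "virtual store" idea concrete:

```go
// Sketch: a virtual store is defined by a set of attributes, and a physical
// store belongs to it if the store was started with all of those attributes.
package main

import "fmt"

type physicalStore struct {
	id    int
	attrs []string
}

type virtualStore struct {
	attrs   []string // defining attribute set (stands in for a virtual store ID)
	members []physicalStore
}

// hasAttrs reports whether the store carries every attribute in want.
func hasAttrs(store physicalStore, want []string) bool {
	have := map[string]bool{}
	for _, a := range store.attrs {
		have[a] = true
	}
	for _, a := range want {
		if !have[a] {
			return false
		}
	}
	return true
}

func main() {
	stores := []physicalStore{
		{1, []string{"ssd"}},
		{2, []string{"ssd"}},
		{3, []string{"hdd"}},
	}
	vs := virtualStore{attrs: []string{"ssd"}}
	for _, s := range stores {
		if hasAttrs(s, vs.attrs) {
			vs.members = append(vs.members, s)
		}
	}
	// Replicas would be placed per virtual store, so both ssd stores count as one target.
	fmt.Printf("virtual store %v has %d member stores\n", vs.attrs, len(vs.members))
}
```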
Oh, and I forgot the most important part, of course: once this is in place, we can also implement online migration of a physical store from one virtual store to another (in effect, migrating some replicas from one to the other) by ensuring the invariant is preserved.
(As an alternative to this entire story, we can also suggest that users define a single store over a RAID-backed volume, which amounts to the same thing!)
This is actually a little worse than I think we had previously realized. Not only will no replicas get rebalanced between stores, but leases can't even be properly balanced sometimes, as found in https://forum.cockroachlabs.com/t/how-to-enable-leaseholder-load-balancing/1732/4. The allocator decides on replica rebalancing before lease rebalancing, and in the linked case it keeps deciding that it should try to rebalance to the other store on the same node as the leaseholder because of the large imbalance between their range counts. The attempt to add the new replica fails because they're on the same node. However, because we decided that we should move the replica, we didn't even consider whether we should move the lease. This is leading to a large lease imbalance on the cluster.
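A deliberately simplified Go sketch of the decision order described in this comment; the function names are invented and this is not the actual allocator code, but it shows how bailing out after the failed replica rebalance means the lease-transfer branch is never reached:

```go
// Sketch of the problematic control flow: replica rebalancing is decided
// first, and when the add fails (same node), lease rebalancing is skipped.
package main

import "fmt"

func shouldRebalanceReplica() bool { return true } // large range-count imbalance
func addReplicaOnSameNode() error  { return fmt.Errorf("two replicas on the same node") }
func shouldTransferLease() bool    { return true } // leases are imbalanced too

func main() {
	if shouldRebalanceReplica() {
		if err := addReplicaOnSameNode(); err != nil {
			fmt.Println("rebalance failed:", err)
			return // lease transfer is never even considered
		}
	}
	if shouldTransferLease() {
		fmt.Println("transferring lease")
	}
}
```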
Zendesk ticket #2288 has been linked to this issue.
Recently I've seen "atomic membership changes" mentioned a few times as something we should be working on soon. Has there been a more concrete proposal on what to do here?
The main issue for atomic membership changes is #12768. We need to implement "joint consensus" (which was the membership change protocol in the original raft paper, then demoted in favor of a simpler but more limited protocol in the final dissertation. See section 4.3 of the dissertation) in upstream raft (etcd-io/etcd#7625), then use it in ChangeReplicas.
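To make "joint consensus" concrete, here is a standalone Go illustration (not etcd/raft code) of the rule from section 4.3 of the Raft dissertation: while the cluster is in the joint configuration, an entry commits only with majorities in both the old and the new configurations:

```go
// Sketch of the joint-consensus commit rule: during C(old,new), agreement
// requires a majority of the old members AND a majority of the new members.
package main

import "fmt"

func majority(acks map[int]bool, members []int) bool {
	n := 0
	for _, id := range members {
		if acks[id] {
			n++
		}
	}
	return n > len(members)/2
}

func main() {
	oldCfg := []int{1, 2, 3}
	newCfg := []int{2, 3, 4}
	acks := map[int]bool{2: true, 3: true} // nodes that acknowledged the entry

	committed := majority(acks, oldCfg) && majority(acks, newCfg)
	fmt.Println("committed under joint consensus:", committed) // true: 2 of 3 in both configs
}
```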
Closing for #12768
Two replicas of a range are not allowed to exist on the same node. If we set the zone config to three replicas, multiple stores per node can't be used in a cluster with three nodes, because newly split ranges can't be moved to the other stores.
I think this is unreasonable. It limits the usable capacity of a single node in this scenario and requires users to build a cluster with more than three nodes.