
storage: can't change the membership with three replicas in three nodes cluster #2067

Closed
thundercw opened this issue Aug 12, 2015 · 16 comments
Labels
A-kv-distribution: Relating to rebalancing and leasing.
C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
C-investigation: Further steps needed to qualify. C-label will change.
O-community: Originated from the community
S-3-ux-surprise: Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.

Comments

@thundercw
Contributor

Two replicas of the same range are not allowed on the same node. If the zone config specifies three replicas, multiple stores per node can't be used in a three-node cluster, because newly split ranges can't be moved to other stores.

I think this is unreasonable. It limits the capability of a single node in this scenario and forces users to build a cluster with more than three nodes.

@tbg
Member

tbg commented Aug 12, 2015

I don't understand. If you have fewer than three nodes, specify a config with fewer than three replicas. You should be able to use any number (including one). What are you suggesting?

@thundercw
Contributor Author

For example, I have a cluster with three nodes (node1, node2, node3), each node has three stores, and the number of replicas in the zone config is 3. At the beginning, the replicas of range1 are on [node1:store1, node2:store1, node3:store1]. As data grows, ranges begin to split, and the replicas of the newly created ranges end up in the same places as range1's. If I want to move a new range to [node1:store2, node2:store2, node3:store2] to balance capacity between the stores, it can't be done: we don't allow two replicas of a range on the same node, nor a replica move within the same node, so the move fails.
My suggestion is that we support moving replicas across stores on the same node.
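
To illustrate the constraint that blocks this move, here is a minimal Go sketch (not the actual allocator code; all type and function names are invented for illustration):

```go
package main

import "fmt"

// StoreID identifies a store; NodeID identifies the node it lives on.
type StoreID int
type NodeID int

// ReplicaPlacement records where one replica of a range lives.
type ReplicaPlacement struct {
	NodeID  NodeID
	StoreID StoreID
}

// canAddReplica mimics the "one replica per node" rule: a candidate store is
// rejected if any existing replica of the range already lives on its node,
// even if that replica is on a different store of that node.
func canAddReplica(existing []ReplicaPlacement, candidateNode NodeID) bool {
	for _, r := range existing {
		if r.NodeID == candidateNode {
			return false
		}
	}
	return true
}

func main() {
	// Range replicated on one store of node1, node2, node3.
	replicas := []ReplicaPlacement{{1, 1}, {2, 4}, {3, 7}}

	// Adding a replica on node1:store2 (same node, different store) is
	// rejected, so the lateral move can never even start.
	fmt.Println(canAddReplica(replicas, 1)) // false
	fmt.Println(canAddReplica(replicas, 4)) // true (a hypothetical fourth node)
}
```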

@spencerkimball
Member

@thundercw yes you're correct. I believe the current logic would disallow such a lateral move. @mrtracy is working on rebalancing logic and will be able to incorporate this change.

@thundercw
Contributor Author

@spencerkimball @mrtracy We have implemented routing to multiple stores. If needed, I can merge it in.

@tbg
Member

tbg commented Dec 27, 2015

@thundercw, would you be able to open a PR with the changes you mentioned above?

@petermattis petermattis added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Feb 14, 2016
@petermattis petermattis modified the milestone: Beta Feb 14, 2016
@petermattis petermattis changed the title Can`t change the membership with three replicas in three nodes cluster Can't change the membership with three replicas in three nodes cluster Feb 16, 2016
@petermattis petermattis changed the title Can't change the membership with three replicas in three nodes cluster storage: can't change the membership with three replicas in three nodes cluster Feb 23, 2016
@bdarnell bdarnell modified the milestones: 1.0, Beta Mar 8, 2016
@bdarnell
Contributor

This came up again today and I wanted to point out a risk with moving replicas from one store to another on the same node. During the move, if the node dies, the range loses two replicas, which is enough to make it lose quorum. We shouldn't try to support this until we have preemptive snapshots to minimize the amount of time when both local replicas are active. Even then it's a little riskier than usual, so we might not want to enable it in clusters that are large enough for the rebalancer to work normally.
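
To make the arithmetic of that window concrete, a toy calculation (not CockroachDB code):

```go
package main

import "fmt"

func main() {
	// During the move, the range briefly has 4 replicas, two of which live
	// on the same node. Majority quorum for 4 replicas is 3.
	const replicas = 4
	quorum := replicas/2 + 1 // 3

	// If that node dies, both of its replicas are lost at once.
	survivors := replicas - 2 // 2

	fmt.Printf("quorum=%d survivors=%d available=%v\n",
		quorum, survivors, survivors >= quorum) // quorum=3 survivors=2 available=false
}
```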

@knz
Contributor

knz commented Sep 3, 2016

User @yznming reports hitting this issue today too (5 nodes, 6 stores per node, only 2 stores per node actually used)

@rjnn
Contributor

rjnn commented Sep 6, 2016

@bdarnell: I believe there is a way to completely eliminate this window of risk by using Howard quorums[1].

To reiterate the problem that @bdarnell mentioned above, consider normal rebalancing (across distinct nodes):

  1. You have a range that is replicated on 3 different nodes. You can tolerate 1 failure.
  2. You upreplicate this range to 4 nodes (add a new replica on a new node). You can still tolerate 1 failure.
  3. You downreplicate this range to 3 nodes (remove a node). You can still tolerate 1 failure.

Everything is fine.

But we cannot rebalance ranges across two stores on the same node. Consider:

  1. You have a range that is replicated on 3 different nodes. You can tolerate 1 failure.
  2. You upreplicate the range from [Node 1, Store 1] to also include [Node 1, Store 2]. This is bad, because if Node 1 goes down, the remaining replicas cannot form a quorum (2 replicas left out of a group of 4). So you cannot tolerate a failure of Node 1.
  3. If that failure doesn't happen, you downreplicate [Node 1, Store 1]. You are now safe again and can tolerate 1 failure.

This window is not good. So we currently do not allow rebalances across the same node.

However, this rebalance can be done if we have Howard quorums. Here are the basics of Howard quorums. Define the following variables:
N: Number of replicas.
E: Quorum required for a leader election.
W: Quorum required for a proposal.

Conditions required for a valid Howard quorum:
2E > N (so that we never can have two leaders)
E+W > N (so we can never have a successful proposal with an obsolete leader).

Regular raft just sets E=W=ceil((N+1)/2). However, we can tweak these carefully to thread the needle.

Let's denote a Howard quorum as (N,E,W). So vanilla Raft with 3 replicas is (3,2,2). When you upreplicate vanilla Raft to 4 replicas, you move to a (4,3,3) group. However, Howard quorums now allow for (4,3,2) groups.
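
As a quick sanity check of those two conditions against the configurations mentioned here (just a sketch, not an implementation):

```go
package main

import "fmt"

// valid reports whether (n, e, w) satisfies the flexible-quorum conditions:
// 2e > n (no two leaders) and e+w > n (no commit under an obsolete leader).
func valid(n, e, w int) bool {
	return 2*e > n && e+w > n
}

func main() {
	fmt.Println(valid(3, 2, 2)) // true: vanilla Raft with 3 replicas
	fmt.Println(valid(4, 3, 3)) // true: vanilla Raft with 4 replicas
	fmt.Println(valid(4, 3, 2)) // true: the (4,3,2) configuration proposed here
	fmt.Println(valid(4, 2, 2)) // false: two disjoint election quorums possible
}
```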

Let's try a new algorithm for rebalancing ranges across stores on a single node (a small sanity check of these steps is sketched below):

  1. You have a range that is replicated on a (3,2,2) raft group. You can tolerate 1 failure.
  2. You want to rebalance a range across 2 stores on Node 1. You first ensure that leadership is transferred to a replica that is NOT on Node 1.
  3. You upreplicate the range from [Node 1, Store 1] to also include [Node 1, Store 2], changing the Howard quorum to (4,3,2).
    1. This is safe because if Node 1 goes down, you lose 2 replicas and can no longer elect a leader, but the existing leader (which we've ensured is not on Node 1) can still make progress committing proposals (importantly, the leader can upreplicate onto some other node once this failure is detected).
    2. Recovery: the leader would upreplicate to a (5,3,3) group, then downreplicate back to (3,2,2).
    3. This is safe because it takes two failures (Node 1 must go down and the leader's node must go down) to bring things down. Throughout this whole process you could always tolerate one failure.
  4. If no failure occurs, once you have upreplicated [Node 1, Store 2], you safely downreplicate the replica on [Node 1, Store 1] and return to a stable (3,2,2) group.

[1]: Flexible Paxos: https://arxiv.org/pdf/1608.06696v1.pdf
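
Here is the small sanity check of those steps, under the assumption that the leader sits off Node 1 and only the write quorum W matters for its progress (invented names, not a real implementation):

```go
package main

import "fmt"

// config is a flexible quorum configuration (N replicas, election quorum E,
// write quorum W) together with how many of its replicas live on Node 1.
type config struct {
	name            string
	n, e, w         int
	replicasOnNode1 int
}

// survivesNode1 reports whether a leader that is not on Node 1 can still
// commit writes after Node 1 fails: enough replicas must remain to form a
// write quorum.
func survivesNode1(c config) bool {
	return c.n-c.replicasOnNode1 >= c.w
}

func main() {
	steps := []config{
		{"step 1: stable group", 3, 2, 2, 1},
		{"step 3: added node1:store2", 4, 3, 2, 2},
		{"step 4: removed node1:store1", 3, 2, 2, 1},
	}
	for _, c := range steps {
		fmt.Printf("%s -> writes survive Node 1 failure: %v\n",
			c.name, survivesNode1(c))
	}
	// With vanilla Raft the intermediate step would be (4,3,3), and
	// survivesNode1 would report false: 4-2 = 2 < 3.
}
```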

@spencerkimball spencerkimball removed this from the 1.0 milestone Mar 28, 2017
@knz knz added the C-investigation Further steps needed to qualify. C-label will change. label Apr 20, 2017
@knz knz added this to the 1.1 milestone Apr 20, 2017
@knz
Contributor

knz commented May 9, 2017

@tschottdorf and I were talking about this yesterday.
Here's what came up: we could solve the problem without any changes to the replication and lease protocol.

In short: introduce a new storage layer between replicas and physical stores; I'll call it a "virtual store" below.

How this would work:

  • instead of mapping replicas to stores, we would map replicas to a pair (node ID, virtual store ID)
  • any physical store in a node would then belong to one virtual store
  • we define the following (new) invariant: the data of a replica must be duplicated across all physical stores in its virtual store
  • to enforce the invariant:
    • when a node starts up with a new physical store, we copy the data from the other physical stores in the same virtual store into the new physical store
    • when a new replica is accepted on a running node, we only acknowledge reception of the replica when its data has been duplicated across all the physical stores in its virtual store
    • when a node is restarted with a different virtual store ID on one of its previously-populated physical stores, we synchronize that physical store with its new virtual store:
      • any range in the group not already copied to the migrating physical store must be duplicated there
      • (I think) any range in the migrating physical store not already in the new virtual store must be duplicated to every other physical store in the new group, and a new replica must be announced to its peers in the network; or we can simply drop these ranges (higher-level replication already guarantees there are enough replicas on other nodes).
      • if this synchronization is impossible, then we have a choice: either the node cannot start, or we simply drop the ranges in the migrating physical store and re-populate it from its virtual store group
    • when a node starts up with one less physical store, we don't do anything (again, the higher-level replication protocol already guarantees there are other replicas on other nodes)

How does this relate to what we already have? I also suggested equating "virtual store IDs" with attribute groups: a virtual store would be defined by a set of attributes, and membership of a physical store in a virtual store would be determined by the attributes used to start the store.
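
To make the proposed mapping and invariant concrete, a minimal sketch with invented types (not actual CockroachDB code):

```go
package main

import "fmt"

type (
	NodeID         int
	PhysicalStore  int
	VirtualStoreID int
)

// virtualStore groups the physical stores on one node that must all hold a
// copy of every replica assigned to the virtual store.
type virtualStore struct {
	id       VirtualStoreID
	node     NodeID
	physical []PhysicalStore
}

// replicaCopies records which physical stores currently hold a given replica.
type replicaCopies map[PhysicalStore]bool

// invariantHolds checks the proposed invariant: a replica mapped to a virtual
// store must be duplicated on every physical store in that virtual store.
func invariantHolds(vs virtualStore, copies replicaCopies) bool {
	for _, ps := range vs.physical {
		if !copies[ps] {
			return false
		}
	}
	return true
}

func main() {
	vs := virtualStore{id: 1, node: 1, physical: []PhysicalStore{1, 2, 3}}

	fmt.Println(invariantHolds(vs, replicaCopies{1: true, 2: true, 3: true})) // true
	// A newly added physical store has not been populated yet, so the replica
	// must be copied there before the invariant holds again.
	fmt.Println(invariantHolds(vs, replicaCopies{1: true, 2: true})) // false
}
```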

@knz
Contributor

knz commented May 9, 2017

Oh, and I forgot the most important part, of course: once this is in place, we can also implement online migration of a physical store from one virtual store to another (in effect, migrating some replicas from one to the other) by ensuring the invariant is preserved.

@knz
Contributor

knz commented May 9, 2017

(As an alternative to this entire story, we can also suggest that users define a single store on top of a RAID-backed volume, which amounts to the same thing!)

@cuongdo cuongdo added this to the Later milestone Aug 22, 2017
@cuongdo cuongdo removed this from the 1.1 milestone Aug 22, 2017
@knz knz added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. O-community Originated from the community and removed C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-community-questions labels Apr 24, 2018
@tbg tbg added the A-kv-distribution Relating to rebalancing and leasing. label May 15, 2018
@a-robinson
Contributor

This is actually a little worse than I think we had previously realized. Not only will no replicas get rebalanced between stores, but leases can't even be properly balanced sometimes, as found in https://forum.cockroachlabs.com/t/how-to-enable-leaseholder-load-balancing/1732/4.

The allocator decides on replica rebalancing before lease rebalancing, and in the linked case it keeps deciding that it should try to rebalance to the other store on the same node as the leaseholder because of the large imbalance between their range counts. The attempt to add the new replica fails because they're on the same node. However, because we decided that we should move the replica, we didn't even consider whether we should move the lease. This is leading to a large lease imbalance on the cluster.
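
A toy model of that ordering, just to illustrate the short-circuit (not the actual allocator logic; names are made up):

```go
package main

import "fmt"

// action is what the allocator decides to do for one range.
type action string

// decide models the ordering described above: replica rebalancing is
// considered before lease rebalancing, so if a replica rebalance target is
// found (even one that will later be rejected because it shares a node with
// an existing replica), the lease transfer is never considered.
func decide(haveReplicaRebalanceTarget, leaseImbalanced bool) action {
	if haveReplicaRebalanceTarget {
		return "rebalance replica" // may fail if the target shares a node with a replica
	}
	if leaseImbalanced {
		return "transfer lease"
	}
	return "no-op"
}

func main() {
	// The other store on the leaseholder's node keeps looking attractive by
	// range count, so the allocator keeps choosing the (doomed) replica
	// rebalance and never gets to the lease transfer.
	fmt.Println(decide(true, true))  // rebalance replica
	fmt.Println(decide(false, true)) // transfer lease
}
```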

@tbg tbg modified the milestones: Later, 2.2 Jul 19, 2018
@tim-o
Contributor

tim-o commented Aug 12, 2018

Zendesk ticket #2288 has been linked to this issue.

@tbg
Member

tbg commented Aug 13, 2018

Recently I've seen "atomic membership changes" mentioned a few times as something we should be working on soon. Has there been a more concrete proposal on what to do here?

@bdarnell
Contributor

The main issue for atomic membership changes is #12768. We need to implement "joint consensus" (which was the membership change protocol in the original raft paper, later demoted in favor of a simpler but more limited protocol in the final dissertation; see section 4.3 of the dissertation) in upstream raft (etcd-io/etcd#7625), then use it in ChangeReplicas.

@tbg
Member

tbg commented Oct 11, 2018

Closing for #12768

@tbg tbg closed this as completed Oct 11, 2018