
storage: can't change the membership with three replicas in three nodes cluster #2067

Closed
thundercw opened this issue Aug 12, 2015 · 16 comments
Labels
A-kv-distribution: Relating to rebalancing and leasing.
C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
C-investigation: Further steps needed to qualify. C-label will change.
O-community: Originated from the community
S-3-ux-surprise: Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.

Comments

@thundercw
Contributor

Two replicas of the same range are not allowed on the same node. If the zone config specifies three replicas, multiple stores per node can't be used in a three-node cluster, because newly split ranges can't be moved to other stores.

I think this is unreasonable. It limits the capability of a single node in this scenario and forces users to build a cluster with more than three nodes.

@tbg
Member

tbg commented Aug 12, 2015

I don't understand. If you have fewer than three nodes, specify a config with fewer than three replicas. You should be able to use any number (including one). What are you suggesting?

@thundercw
Contributor Author

For example, I have a cluster with three nodes (node1, node2, node3), each node has three stores, and the number of replicas in the zone config is 3. At the beginning, the replicas of range1 are on [node1:store1, node2:store1, node3:store1]. As data grows, ranges begin to split, and the replicas of the newly created ranges end up in the same places as range1's. If I want to move a new range to [node1:store2, node2:store2, node3:store2] to balance capacity between the stores, it can't be done: we don't allow two replicas of a range on the same node, nor a replica move within the same node, so the move fails.
My suggestion is that we support moving replicas across stores on the same node.
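
To illustrate the constraint that blocks this move, here is a minimal Go sketch (not the actual allocator code; all type and function names are invented for illustration):

```go
package main

import "fmt"

// StoreID identifies a store; NodeID identifies the node it lives on.
type StoreID int
type NodeID int

// ReplicaPlacement records where one replica of a range lives.
type ReplicaPlacement struct {
	NodeID  NodeID
	StoreID StoreID
}

// canAddReplica mimics the "one replica per node" rule: a candidate store is
// rejected if any existing replica of the range already lives on its node,
// even if that replica is on a different store of that node.
func canAddReplica(existing []ReplicaPlacement, candidateNode NodeID) bool {
	for _, r := range existing {
		if r.NodeID == candidateNode {
			return false
		}
	}
	return true
}

func main() {
	// Range replicated on one store of node1, node2, node3.
	replicas := []ReplicaPlacement{{1, 1}, {2, 4}, {3, 7}}

	// Adding a replica on node1:store2 (same node, different store) is
	// rejected, so the lateral move can never even start.
	fmt.Println(canAddReplica(replicas, 1)) // false
	fmt.Println(canAddReplica(replicas, 4)) // true (a hypothetical fourth node)
}
```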

@spencerkimball
Member

@thundercw yes you're correct. I believe the current logic would disallow such a lateral move. @mrtracy is working on rebalancing logic and will be able to incorporate this change.

@thundercw
Contributor Author

@spencerkimball @mrtracy We have implemented routing to multiple stores. If needed, I can merge it in.

@tbg
Member

tbg commented Dec 27, 2015

@thundercw, would you be able to open a PR with the changes you mentioned above?

@petermattis petermattis added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Feb 14, 2016
@petermattis petermattis modified the milestone: Beta Feb 14, 2016
@petermattis petermattis changed the title Can`t change the membership with three replicas in three nodes cluster Can't change the membership with three replicas in three nodes cluster Feb 16, 2016
@petermattis petermattis changed the title Can't change the membership with three replicas in three nodes cluster storage: can't change the membership with three replicas in three nodes cluster Feb 23, 2016
@bdarnell bdarnell modified the milestones: 1.0, Beta Mar 8, 2016
@bdarnell
Contributor

This came up again today and I wanted to point out a risk with moving replicas from one store to another on the same node. During the move, if the node dies, the range loses two replicas, which is enough to make it lose quorum. We shouldn't try to support this until we have preemptive snapshots to minimize the amount of time when both local replicas are active. Even then it's a little riskier than usual, so we might not want to enable it in clusters that are large enough for the rebalancer to work normally.
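
To make the arithmetic of that window concrete, a toy calculation (not CockroachDB code):

```go
package main

import "fmt"

func main() {
	// During the move, the range briefly has 4 replicas, two of which live
	// on the same node. Majority quorum for 4 replicas is 3.
	const replicas = 4
	quorum := replicas/2 + 1 // 3

	// If that node dies, both of its replicas are lost at once.
	survivors := replicas - 2 // 2

	fmt.Printf("quorum=%d survivors=%d available=%v\n",
		quorum, survivors, survivors >= quorum) // quorum=3 survivors=2 available=false
}
```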

@knz
Contributor

knz commented Sep 3, 2016

User @yznming reports hitting this issue today too (5 nodes, 6 stores per node, only 2 stores per node actually used)

@rjnn
Contributor

rjnn commented Sep 6, 2016

@bdarnell: I believe there is a way to completely eliminate this window of risk by using Howard quorums[1].

To reiterate the problem that @bdarnell mentioned above, consider normal rebalancing (across distinct nodes):

  1. You have a range that is replicated on 3 different nodes. You can tolerate 1 failure.
  2. You upreplicate this range to 4 nodes (add a new replica on a new node). You can still tolerate 1 failure.
  3. You downreplicate this range to 3 nodes (remove a node). You can still tolerate 1 failure.

Everything is fine.

But we cannot rebalance ranges across two stores on the same node. Consider:

  1. You have a range that is replicated on 3 different nodes. You can tolerate 1 failure.
  2. You upreplicate the range from [Node 1, Store 1] to also include [Node 1, Store 2]. This is bad, because if Node 1 goes down, the remaining replicas cannot form a quorum (2 replicas left out of a group of 4). So you cannot tolerate a failure of Node 1.
  3. If that failure doesn't happen, you downreplicate [Node 1, Store 1]. You are now safe again and can tolerate 1 failure.

This window is not good. So we currently do not allow rebalances across the same node.

However, this rebalance can be done if we have Howard quorums. Here are the basics of Howard quorums. Define the following variables:
N: Number of replicas.
E: Quorum required for a leader election.
W: Quorum required for a proposal.

Conditions required for a valid Howard quorum:
2E > N (so that we never can have two leaders)
E+W > N (so we can never have a successful proposal with an obsolete leader).

Regular raft just sets E=W=ceil((N+1)/2). However, we can tweak these carefully to thread the needle.

Let's denote a Howard quorum as (N,E,W). So vanilla Raft with 3 replicas is (3,2,2). When you upreplicate vanilla Raft to 4 replicas, you move to a (4,3,3) group. However, Howard quorums now allow for (4,3,2) groups.
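
As a quick sanity check of those two conditions against the configurations mentioned here (just a sketch, not an implementation):

```go
package main

import "fmt"

// valid reports whether (n, e, w) satisfies the flexible-quorum conditions:
// 2e > n (no two leaders) and e+w > n (no commit under an obsolete leader).
func valid(n, e, w int) bool {
	return 2*e > n && e+w > n
}

func main() {
	fmt.Println(valid(3, 2, 2)) // true: vanilla Raft with 3 replicas
	fmt.Println(valid(4, 3, 3)) // true: vanilla Raft with 4 replicas
	fmt.Println(valid(4, 3, 2)) // true: the (4,3,2) configuration proposed here
	fmt.Println(valid(4, 2, 2)) // false: two disjoint election quorums possible
}
```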

Let's try a new algorithm for rebalancing ranges across stores on a single node (a small sanity check of these steps is sketched below):

  1. You have a range that is replicated on a (3,2,2) raft group. You can tolerate 1 failure.
  2. You want to rebalance a range across 2 stores on Node 1. You first ensure that leadership is transferred to a replica that is NOT on Node 1.
  3. You upreplicate the range from [Node 1, Store 1] to also include [Node 1, Store 2], changing the Howard quorum to (4,3,2).
    1. This is safe because if Node 1 goes down, you lose 2 replicas and can no longer elect a leader, but the existing leader (which we've ensured is not on Node 1) can still make progress committing proposals (importantly, the leader can upreplicate onto some other node once this failure is detected).
    2. Recovery: the leader would upreplicate to a (5,3,3) group, then downreplicate back to (3,2,2).
    3. This is safe because it takes two failures (Node 1 must go down and the leader's node must go down) to bring things down. Throughout this whole process you could always tolerate one failure.
  4. If no failure occurs, once you have upreplicated [Node 1, Store 2], you safely downreplicate the replica on [Node 1, Store 1] and return to a stable (3,2,2) group.

[1]: Flexible Paxos: https://arxiv.org/pdf/1608.06696v1.pdf
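
Here is the small sanity check of those steps, under the assumption that the leader sits off Node 1 and only the write quorum W matters for its progress (invented names, not a real implementation):

```go
package main

import "fmt"

// config is a flexible quorum configuration (N replicas, election quorum E,
// write quorum W) together with how many of its replicas live on Node 1.
type config struct {
	name            string
	n, e, w         int
	replicasOnNode1 int
}

// survivesNode1 reports whether a leader that is not on Node 1 can still
// commit writes after Node 1 fails: enough replicas must remain to form a
// write quorum.
func survivesNode1(c config) bool {
	return c.n-c.replicasOnNode1 >= c.w
}

func main() {
	steps := []config{
		{"step 1: stable group", 3, 2, 2, 1},
		{"step 3: added node1:store2", 4, 3, 2, 2},
		{"step 4: removed node1:store1", 3, 2, 2, 1},
	}
	for _, c := range steps {
		fmt.Printf("%s -> writes survive Node 1 failure: %v\n",
			c.name, survivesNode1(c))
	}
	// With vanilla Raft the intermediate step would be (4,3,3), and
	// survivesNode1 would report false: 4-2 = 2 < 3.
}
```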

@spencerkimball spencerkimball removed this from the 1.0 milestone Mar 28, 2017
@knz knz added the C-investigation Further steps needed to qualify. C-label will change. label Apr 20, 2017
@knz knz added this to the 1.1 milestone Apr 20, 2017
@knz
Contributor

knz commented May 9, 2017

@tschottdorf and I were talking about this yesterday.
Here's what came up: we could solve the problem without any changes to the replication and lease protocol.

In short: introduce a new storage layer between replicas and physical stores; I'll call it a "virtual store" below.

How this would work:

  • instead of mapping replicas to stores, we would map replicas to a pair (node ID, virtual store ID)
  • any physical store in a node would then belong to one virtual store
  • we define the following (new) invariant: the data of a replica must be duplicated across all physical stores in its virtual store
  • to enforce the invariant:
    • when a node starts up with a new physical store, we copy the data from the other physical stores in the same virtual store into the new physical store
    • when a new replica is accepted on a running node, we only acknowledge reception of the replica when its data has been duplicated across all the physical stores in its virtual store
    • when a node is restarted with a different virtual store ID on one of its previously-populated physical stores, we synchronize that physical store with its new virtual store:
      • any range in the group not already copied to the migrating physical store must be duplicated there
      • (I think) any range in the migrating physical store not already in the new virtual store must be duplicated to every other physical store in the new group, and a new replica must be announced to its peers in the network; or we can simply drop these ranges (higher-level replication already guarantees there are enough replicas on other nodes).
      • if this synchronization is impossible, then we have a choice: either the node cannot start, or we simply drop the ranges in the migrating physical store and re-populate it from its virtual store group
    • when a node starts up with one less physical store, we don't do anything (again, the higher-level replication protocol already guarantees there are other replicas on other nodes)

How does this relate to what we already have? I also suggested equating "virtual store IDs" with attribute groups: a virtual store would be defined by a set of attributes, and membership of a physical store in a virtual store would be determined by the attributes used to start the store.
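
To make the proposed mapping and invariant concrete, a minimal sketch with invented types (not actual CockroachDB code):

```go
package main

import "fmt"

type (
	NodeID         int
	PhysicalStore  int
	VirtualStoreID int
)

// virtualStore groups the physical stores on one node that must all hold a
// copy of every replica assigned to the virtual store.
type virtualStore struct {
	id       VirtualStoreID
	node     NodeID
	physical []PhysicalStore
}

// replicaCopies records which physical stores currently hold a given replica.
type replicaCopies map[PhysicalStore]bool

// invariantHolds checks the proposed invariant: a replica mapped to a virtual
// store must be duplicated on every physical store in that virtual store.
func invariantHolds(vs virtualStore, copies replicaCopies) bool {
	for _, ps := range vs.physical {
		if !copies[ps] {
			return false
		}
	}
	return true
}

func main() {
	vs := virtualStore{id: 1, node: 1, physical: []PhysicalStore{1, 2, 3}}

	fmt.Println(invariantHolds(vs, replicaCopies{1: true, 2: true, 3: true})) // true
	// A newly added physical store has not been populated yet, so the replica
	// must be copied there before the invariant holds again.
	fmt.Println(invariantHolds(vs, replicaCopies{1: true, 2: true})) // false
}
```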

@knz
Contributor

knz commented May 9, 2017

Oh, and I forgot the most important part, of course: once this is in place, we can also implement online migration of a physical store from one virtual store to another (in effect, migrating some replicas from one to the other) by ensuring the invariant is preserved.

@knz
Contributor

knz commented May 9, 2017

(As an alternative to this entire story, we can also suggest that users define a single store on top of a RAID-backed volume, which amounts to the same thing!)

@cuongdo cuongdo added this to the Later milestone Aug 22, 2017
@cuongdo cuongdo removed this from the 1.1 milestone Aug 22, 2017
@knz knz added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. O-community Originated from the community and removed C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-community-questions labels Apr 24, 2018
@tbg tbg added the A-kv-distribution Relating to rebalancing and leasing. label May 15, 2018
@a-robinson
Contributor

This is actually a little worse than I think we had previously realized. Not only will no replicas get rebalanced between stores, but leases can't even be properly balanced sometimes, as found in https://forum.cockroachlabs.com/t/how-to-enable-leaseholder-load-balancing/1732/4.

The allocator decides on replica rebalancing before lease rebalancing, and in the linked case it keeps deciding that it should try to rebalance to the other store on the same node as the leaseholder because of the large imbalance between their range counts. The attempt to add the new replica fails because they're on the same node. However, because we decided that we should move the replica, we didn't even consider whether we should move the lease. This is leading to a large lease imbalance on the cluster.
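
A toy model of that ordering, just to illustrate the short-circuit (not the actual allocator logic; names are made up):

```go
package main

import "fmt"

// action is what the allocator decides to do for one range.
type action string

// decide models the ordering described above: replica rebalancing is
// considered before lease rebalancing, so if a replica rebalance target is
// found (even one that will later be rejected because it shares a node with
// an existing replica), the lease transfer is never considered.
func decide(haveReplicaRebalanceTarget, leaseImbalanced bool) action {
	if haveReplicaRebalanceTarget {
		return "rebalance replica" // may fail if the target shares a node with a replica
	}
	if leaseImbalanced {
		return "transfer lease"
	}
	return "no-op"
}

func main() {
	// The other store on the leaseholder's node keeps looking attractive by
	// range count, so the allocator keeps choosing the (doomed) replica
	// rebalance and never gets to the lease transfer.
	fmt.Println(decide(true, true))  // rebalance replica
	fmt.Println(decide(false, true)) // transfer lease
}
```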

@tbg tbg modified the milestones: Later, 2.2 Jul 19, 2018
@tim-o
Contributor

tim-o commented Aug 12, 2018

Zendesk ticket #2288 has been linked to this issue.

@tbg
Member

tbg commented Aug 13, 2018

Recently I've seen "atomic membership changes" mentioned a few times as something we should be working on soon. Has there been a more concrete proposal on what to do here?

@bdarnell
Contributor

The main issue for atomic membership changes is #12768. We need to implement "joint consensus" (which was the membership change protocol in the original raft paper, later demoted in favor of a simpler but more limited protocol in the final dissertation; see section 4.3 of the dissertation) in upstream raft (etcd-io/etcd#7625), then use it in ChangeReplicas.

@tbg
Member

tbg commented Oct 11, 2018

Closing for #12768

@tbg tbg closed this as completed Oct 11, 2018