core: Don't let nodes from one cluster interfere with another #15801

a-robinson · 2017-05-09T14:47:32Z

We do a good job in the gossip layer of not letting nodes from one cluster join another, but not at any other layer of the stack. If one of the clusters, cluster X, used to have at least one node at the address of one of the nodes in cluster Y, then bad things can happen because the nodes in X will talk to that address in ways that can mess with cluster Y. One instance of this problem is explained in my comment on #15591:

At some point before I got involved, someone had set up a cross-cloud cluster in clouds A and B. After their testing was done, they took down the nodes in A but left up the nodes in B.
Later, while I was around, we brought up a new cluster in cloud A with the exact same IP addresses.
Although the nodes in B from the old cluster couldn't properly join the new cluster via gossip due to the cluster ID check, they could still talk to the nodes in A because they were at the same IP addresses. In practice, it looks like the old nodes left over in B do two main things:
- Open raft transport streams and send messages to the new nodes in A. If the node/store ID of the node in A doesn't match the expectation of the node in B, all that happens is that requests get rejected. However, if the ID does match, then bad things can happen, like the crash in the PDF I attached above in which one of the new nodes crashed because it didn't have a raft group in place for the request.
- Try to update the node liveness table. This is presumably why the IPs on the node list page of the UI would change every time we refreshed it -- sometimes the new node would have most recently updated the liveness and sometimes the old node would have.
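To make the first bullet concrete, here is a minimal sketch of a recipient that validates only the destination store ID. It is not CockroachDB's actual raft transport code; the `raftMessage` and `store` types, their fields, and the returned strings are all hypothetical and exist only to show why an ID match across two unrelated clusters is enough for a stray message to get through.

```go
package main

import "fmt"

// raftMessage is a hypothetical, stripped-down raft transport message.
// FromCluster is carried only for illustration; the check below ignores it.
type raftMessage struct {
	FromCluster string
	ToStoreID   int
	RangeID     int
}

// store is a hypothetical recipient that knows its own store ID and which
// ranges it has a raft group for.
type store struct {
	cluster string
	storeID int
	ranges  map[int]bool
}

// handle mimics a recipient that checks only the destination store ID.
// A message from a different cluster passes whenever the IDs collide.
func (s *store) handle(m raftMessage) string {
	if m.ToStoreID != s.storeID {
		return "rejected: store ID mismatch"
	}
	if !s.ranges[m.RangeID] {
		// This is the dangerous case: the message was accepted even
		// though the recipient has no raft group for the range.
		return "accepted, but no raft group for range"
	}
	return "accepted"
}

func main() {
	newNode := &store{cluster: "new", storeID: 1, ranges: map[int]bool{7: true}}

	// Old node expects store 2 at this address: harmlessly rejected.
	fmt.Println(newNode.handle(raftMessage{FromCluster: "old", ToStoreID: 2, RangeID: 7}))

	// Old node expects store 1 at this address: the ID matches by coincidence.
	fmt.Println(newNode.handle(raftMessage{FromCluster: "old", ToStoreID: 1, RangeID: 42}))
}
```

The second call is the case described above: nothing in the ID check distinguishes the old cluster's store 1 from the new cluster's store 1.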

We can wait and see whether or not anyone else runs into something like this for the sake of prioritizing it, since I don't think it's likely to be common, but it will be pretty confusing for anyone that it does happen to.

a-robinson · 2017-06-02T15:30:51Z

Assigning to @a6802739 since he's assigned himself to #15898, which should have the same fix.

In the heartbeat, nodes now share their cluster IDs and check that they match. We allow for missing cluster IDs, since new nodes do not have a cluster ID until they obtain one via gossip, but conflicting IDs will result in a heartbeat error. In addition, connections are now not added to the connection pool until the heartbeat succeeds. This allows us to fail fast when a node attempts to join the wrong cluster. Fixes cockroachdb#15801. Fixes cockroachdb#15898. Refers cockroachdb#18058.

a-robinson · 2017-12-18T19:16:57Z

Mostly fixed by #20163. The remaining work is tracked by #20537.

This was referenced May 9, 2017

stability: cluster unable to recover after 2 node outage #15591

Closed

gossip: Can allocate same node ID to two different nodes in very unlikely startup race #15898

Closed

petermattis modified the milestone: 1.1 Jun 1, 2017

a-robinson assigned a6802739 Jun 2, 2017

a-robinson mentioned this issue Jul 11, 2017

rfc: version migration for backwards incompatible functionality #16977

Merged

a-robinson unassigned a6802739 Aug 31, 2017

a-robinson modified the milestones: 1.2, 1.1 Aug 31, 2017

This was referenced Aug 31, 2017

rpc: fail at handshake time when node versions are incompatible #18058

Closed

storage: panic in raftGroup.Step while running under stressrace #14231

Closed

solongordon mentioned this issue Nov 20, 2017

core: Check for cluster ID conflicts on handshake #20163

Merged

a-robinson closed this as completed Dec 18, 2017

a-robinson mentioned this issue Dec 19, 2017

storage: panic: tocommit(28) is out of range [lastIndex(0)] in StartTestCluster #20764

Closed

bdarnell mentioned this issue Apr 23, 2018

stability: partitioned gossip network caused by ping-ponging of r1's lease #24753

Closed
