core: Don't let nodes from one cluster interfere with another #15801

Closed
a-robinson opened this issue May 9, 2017 · 2 comments

@a-robinson
Contributor

We do a good job in the gossip layer of not letting nodes from one cluster join another, but we don't do so at any other layer of the stack. If cluster X used to have at least one node at an address that is now used by a node in cluster Y, the nodes in X will keep talking to that address in ways that can interfere with cluster Y. One instance of this problem is explained in my comment on #15591 (comment):

  • At some point before I got involved, someone had set up a cross-cloud cluster in clouds A and B. After their testing was done, they took down the nodes in A but left up the nodes in B.
  • Later, while I was around, we brought up a new cluster in cloud A with the exact same IP addresses.
  • Although the nodes in B from the old cluster couldn't properly join the new cluster via gossip due to the cluster ID check, they could still talk to the nodes in A because those nodes were at the same IP addresses. In practice, it looks like the old nodes left over in B do two main things:
    • Open raft transport streams and send messages to the new nodes in A. If the node/store ID of the node in A doesn't match the expectation of the node in B, all that happens is that requests get rejected. However, if the ID does match, then bad things can happen, like the crash in the PDF I attached above in which one of the new nodes crashed because it didn't have a raft group in place for the request.
    • Try to update the node liveness table. This is presumably why the IPs on the node list page of the UI would change every time we refreshed it -- sometimes the new node would have most recently updated the liveness and sometimes the old node would have.

We can wait to see whether anyone else runs into something like this before prioritizing it, since I don't think it's likely to be common, but it will be pretty confusing for anyone it does happen to.

@a-robinson
Contributor Author

Assigning to @a6802739 since he's assigned himself to #15898, which should have the same fix.

@a-robinson a-robinson modified the milestones: 1.2, 1.1 Aug 31, 2017
solongordon added a commit to solongordon/cockroach that referenced this issue Nov 20, 2017
In the heartbeat, nodes now share their cluster IDs and check that they
match. We allow for missing cluster IDs, since new nodes do not have a
cluster ID until they obtain one via gossip, but conflicting IDs will
result in a heartbeat error.

In addition, connections are now not added to the connection pool until
the heartbeat succeeds. This allows us to fail fast when a node attempts
to join the wrong cluster.

Fixes cockroachdb#15801.
Fixes cockroachdb#15898.
Refers cockroachdb#18058.
solongordon added a commit to solongordon/cockroach that referenced this issue Nov 20, 2017
solongordon added a commit to solongordon/cockroach that referenced this issue Nov 22, 2017
solongordon added a commit to solongordon/cockroach that referenced this issue Nov 27, 2017
solongordon added a commit to solongordon/cockroach that referenced this issue Nov 30, 2017
solongordon added a commit to solongordon/cockroach that referenced this issue Dec 4, 2017
solongordon added a commit to solongordon/cockroach that referenced this issue Dec 4, 2017
@a-robinson
Contributor Author

Mostly fixed by #20163. The remaining work is tracked by #20537.
