Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core: RPC re-connections are not validated #20537

Closed
solongordon opened this issue Dec 6, 2017 · 4 comments · Fixed by #22518
Closed

core: RPC re-connections are not validated #20537

solongordon opened this issue Dec 6, 2017 · 4 comments · Fixed by #22518
Assignees
Milestone

Comments

@solongordon
Copy link
Contributor

This is a follow-up to #20163.

We currently validate new RPC connections before handing them out by waiting for a successful heartbeat. (See rpc.Connect().) However, if a node goes down and is restarted at the same address, gRPC happily reconnects and this validation is not performed. You can easily demonstrate this as follows:

# Bootstrap node1.
$ cockroach start --insecure --store=/tmp/node1

# Connect node2.
$ cockroach start --insecure --store=/tmp/node2 --port=26258 --http-port=8081 --join=localhost:26257

# Kill node1 and wipe its store.
$ rm -r /tmp/node1

# Restart node1.
$ cockroach start --insecure --store=/tmp/node1

Ideally node2 would revalidate its node1 connection at this point and refuse to connect due to a cluster ID mismatch. But instead the connection is reused, and node2 panics.

@petermattis suggests we could avoid this issue by using a custom gRPC dialer. See #20163 (comment).

cc @tschottdorf

@tbg
Copy link
Member

tbg commented Dec 6, 2017

We should also address the issue that the Ping heartbeat loop will practically never stop heartbeating (this was an unfortunate and likely unintended side effect of letting grpc handle heartbeats). I suggest the following:

  • if no Ping has succeeded within the last $LARGEDURATION (hour?), decide that this isn't just TCP connection throttling holding us back; there is definitely something wrong and the connection should be closed.
  • if a Ping goes through, and your validation fails, definitely close the connection as well.

The first would likely be obviated by inserting a Dialer that doesn't retry infinitely (I suppose our dialer would never retry and we'd just throw away the connection instead).

@tbg tbg added this to the 1.2 milestone Dec 6, 2017
@tbg tbg self-assigned this Dec 6, 2017
@tbg
Copy link
Member

tbg commented Dec 6, 2017

Assigning myself for bookkeeping, not actually planning to tackle this anytime soon.

@a-robinson
Copy link
Contributor

Also for bookkeeping purposes, this is the same issue as #15898

@bdarnell
Copy link
Contributor

bdarnell commented Feb 8, 2018

I'm taking this over, on the theory that #20764 and #22320 are due to port reuse.

bdarnell added a commit to bdarnell/cockroach that referenced this issue Feb 8, 2018
GRPC will transparently reconnect when a connection fails, but if the
next process to use that port is not a part of the same cluster, this
leads to confusing errors and potential data corruption. (this is most
common in tests, but it can also occur in other situations).

This change disables grpc's automatic reconnections so that in the
event of a failed connection, we go through our full dialing process
including an initial heartbeat that validates certain parameters.

Fixes cockroachdb#20537

Release note (bug fix): Implement additional safeguards against RPC
connections between nodes that belong to different clusters.
bdarnell added a commit to bdarnell/cockroach that referenced this issue Feb 8, 2018
GRPC will transparently reconnect when a connection fails, but if the
next process to use that port is not a part of the same cluster, this
leads to confusing errors and potential data corruption. (this is most
common in tests, but it can also occur in other situations).

This change disables grpc's automatic reconnections so that in the
event of a failed connection, we go through our full dialing process
including an initial heartbeat that validates certain parameters.

Fixes cockroachdb#20537

Release note (bug fix): Implement additional safeguards against RPC
connections between nodes that belong to different clusters.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants