core: RPC re-connections are not validated #20537

solongordon · 2017-12-06T20:43:18Z

This is a follow-up to #20163.

We currently validate new RPC connections before handing them out by waiting for a successful heartbeat. (See rpc.Connect().) However, if a node goes down and is restarted at the same address, gRPC happily reconnects and this validation is not performed. You can easily demonstrate this as follows:

# Bootstrap node1.
$ cockroach start --insecure --store=/tmp/node1

# Connect node2.
$ cockroach start --insecure --store=/tmp/node2 --port=26258 --http-port=8081 --join=localhost:26257

# Kill node1 and wipe its store.
$ rm -r /tmp/node1

# Restart node1.
$ cockroach start --insecure --store=/tmp/node1

Ideally node2 would revalidate its node1 connection at this point and refuse to connect due to a cluster ID mismatch. But instead the connection is reused, and node2 panics.

@petermattis suggests we could avoid this issue by using a custom gRPC dialer. See #20163 (comment).

cc @tschottdorf


tbg · 2017-12-06T21:04:35Z

We should also address the issue that the Ping heartbeat loop will practically never stop heartbeating (this was an unfortunate and likely unintended side effect of letting grpc handle heartbeats). I suggest the following:

1. If no Ping has succeeded within the last $LARGEDURATION (an hour?), decide that this isn't just TCP connection throttling holding us back; there is definitely something wrong and the connection should be closed.
2. If a Ping goes through and your validation fails, definitely close the connection as well.

The first would likely be obviated by inserting a Dialer that doesn't retry infinitely (I suppose our dialer would never retry and we'd just throw away the connection instead).

tbg · 2017-12-06T21:05:34Z

Assigning myself for bookkeeping, not actually planning to tackle this anytime soon.

a-robinson · 2017-12-06T21:39:26Z

Also for bookkeeping purposes, this is the same issue as #15898

bdarnell · 2018-02-08T16:08:23Z

I'm taking this over, on the theory that #20764 and #22320 are due to port reuse.

gRPC will transparently reconnect when a connection fails, but if the next process to use that port is not part of the same cluster, this leads to confusing errors and potential data corruption. (This is most common in tests, but it can also occur in other situations.)

This change disables gRPC's automatic reconnections so that in the event of a failed connection, we go through our full dialing process, including an initial heartbeat that validates certain parameters.

Fixes cockroachdb#20537

Release note (bug fix): Implement additional safeguards against RPC connections between nodes that belong to different clusters.

tbg added this to the 1.2 milestone Dec 6, 2017

tbg self-assigned this Dec 6, 2017

a-robinson mentioned this issue Dec 6, 2017

core: crash on insert (after nodes from different cluster attempt to join) #20207

Closed

This was referenced Dec 18, 2017

core: Don't let nodes from one cluster interfere with another #15801

Closed

storage: panic: tocommit(28) is out of range [lastIndex(0)] in StartTestCluster #20764

Closed

storage: panic in raftGroup.Step while running under stressrace #14231

Closed

bdarnell assigned bdarnell and unassigned tbg Feb 8, 2018

This was referenced Feb 8, 2018

rpc: Perform initial-heartbeat validation on GRPC reconnections #22518

Merged

gossip: Can allocate same node ID to two different nodes in very unlikely startup race #15898

Closed

bdarnell closed this as completed in #22518 Feb 9, 2018
