core: RPC re-connections are not validated #20537
Comments
We should also address the issue that the Ping heartbeat loop will practically never stop heartbeating (this was an unfortunate and likely unintended side effect of letting grpc handle heartbeats). I suggest the following:
The first would likely be obviated by inserting a
Assigning myself for bookkeeping, not actually planning to tackle this anytime soon.
Also for bookkeeping purposes, this is the same issue as #15898
gRPC will transparently reconnect when a connection fails, but if the next process to use that port is not a part of the same cluster, this leads to confusing errors and potential data corruption. (This is most common in tests, but it can also occur in other situations.)

This change disables gRPC's automatic reconnections so that in the event of a failed connection, we go through our full dialing process, including an initial heartbeat that validates certain parameters.

Fixes cockroachdb#20537

Release note (bug fix): Implement additional safeguards against RPC connections between nodes that belong to different clusters.
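The behavior this change describes (a failed connection is never silently reused, and the next caller goes back through the full validated dial) can be sketched roughly as below. This is a simplified illustration, not the actual CockroachDB code: `conn`, `connCache`, `get`, and `dialAndValidate` are hypothetical names, and the heartbeat is reduced to a callback.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// conn is a hypothetical handle for a connection that has passed validation.
type conn struct {
	addr   string
	failed bool // set once the underlying transport has broken
}

// connCache is a hypothetical cache of validated connections keyed by address.
type connCache struct {
	mu    sync.Mutex
	conns map[string]*conn
	// dialAndValidate stands in for the full dialing process,
	// including the initial heartbeat that checks cluster parameters.
	dialAndValidate func(addr string) (*conn, error)
}

// get returns a healthy cached connection, or discards a failed one and
// runs the full validated dial again instead of reconnecting transparently.
func (c *connCache) get(addr string) (*conn, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if existing, ok := c.conns[addr]; ok && !existing.failed {
		return existing, nil
	}
	delete(c.conns, addr)
	fresh, err := c.dialAndValidate(addr)
	if err != nil {
		return nil, err
	}
	c.conns[addr] = fresh
	return fresh, nil
}

func main() {
	dials := 0
	cache := &connCache{
		conns: map[string]*conn{},
		dialAndValidate: func(addr string) (*conn, error) {
			dials++
			if dials > 1 {
				// Pretend a process from another cluster now owns the port.
				return nil, errors.New("initial heartbeat rejected: cluster ID mismatch")
			}
			return &conn{addr: addr}, nil
		},
	}
	first, _ := cache.get("node1:26257")
	first.failed = true // simulate the connection breaking
	_, err := cache.get("node1:26257")
	fmt.Println("dials:", dials, "err:", err)
}
```

The point of the design is that validation is tied to every dial rather than only to the first one, so a restarted process at the same address cannot inherit a connection that was validated against its predecessor.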
This is a follow-up to #20163.
We currently validate new RPC connections before handing them out by waiting for a successful heartbeat (see rpc.Connect()). However, if a node goes down and is restarted at the same address, gRPC happily reconnects and this validation is not performed. You can easily demonstrate this as follows:

Ideally node2 would revalidate its node1 connection at this point and refuse to connect due to a cluster ID mismatch. But instead the connection is reused, and node2 panics.
@petermattis suggests we could avoid this issue by using a custom gRPC dialer. See #20163 (comment).
cc @tschottdorf