rpc: refuse incoming connections unless dialback succeeds #84289

knz · 2022-07-12T16:33:59Z

Is your feature request related to a problem? Please describe.

Followup to #49220 / cockroachlabs/support#1690

We know of multiple situations where a cluster can find itself into an asymmetric partition, whtiich causes all kinds of symptoms (#49220), these include at least:

low-level network partitions, when e.g IP packets get dropped in one direction but not the other
firewall partitions, when e.g. existing TCP sessions stay alive but new TCP sessions cannot be established
TLS misconfigurations, when two nodes use different client/server certs, so that the TLS handshake succeeds in one direction but fails in the other
DNS misconfigurations, where the advertised hostname of a node is not resolvable from (some) other nodes

It would be good if we had some automation to exclude nodes which appear to be partially partitioned away. (and require operator attention)

Describe the solution you'd like

We could have a couple relatively simple mechanisms to protect a cluster:

Two point-to-point mechanisms, to protect against pairwise partions:

When receiving an incoming connection from another node, refuse to respond to heartbeats until we get a successful outgoing dial to the same node. In other words: when n2 connects to n1, have n1 refuse the conn until it can dial back to n2 successfully.
When an outgoing dial fails, or a heartbeat fails, keep a timestamp of the failure for the remote node ID, and when receiving a heartbeat from that ID, refuse to respond to the heartbeat if there's a recent failure (and possibly actively close the connection). In other words, start refusing a heartbeat from n2 to n1, if n1 has failed to connect/heartbeat to n2 recently.

Then a cluster-wisde mechanism, to protect against global partitions (e.g. n1-n2 and n2-n3 can connect, but not n1-n3)

if one of the two mechanisms above decides to exclude an ID, they would also register a gossip entry (with suitable TTL) that says the node is potentially blocked, containing the IDs of both the blocked node and the decider node. Then, on every node, use these gossip entries as follows: if a node X connects to node Y, and node Y is in a "blocked node" entry created by node Z, and node X has a valid connection to the node Z (i.e. X and Z are in the same side of the partition), then have X proactively start blocking Y too.

Jira issue: CRDB-17572

gz#13169

Epic CRDB-2488

erikgrinaker · 2022-07-12T16:53:36Z

This is related to #70111, in that we don't currently attempt to form a full mesh network -- a node will only connect to a different node if some component needs to reach it. I don't know exactly if or how this impacts the proposal here, but it's possible that we may find ourself in a situation where a node connects to a peer but the peer does not attempt to connect back (unless we explicitly make it so).

knz · 2022-07-12T20:40:50Z

we may find ourself in a situation where a node connects to a peer but the peer does not attempt to connect back (unless we explicitly make it so).

100% agree, hence the proposal above to:

When receiving an incoming connection from another node, refuse to respond to heartbeats until we get a successful outgoing dial to the same node. In other words: when n2 connects to n1, have n1 refuse the conn until it can dial back to n2 successfully.

zhoufang-joe · 2022-11-08T16:35:54Z

Do we have ETA for this one?
We recently experienced the very similar issue: when a host DNS was unavailable, but the one direction connection/dial still worked, the cluster rejected almost all the incoming client connections to the cluster. Thus, it caused a production outage.
Thx

erikgrinaker · 2022-11-11T15:09:48Z

As seen in https://github.com/cockroachlabs/support/issues/1875, 22.2 includes two additional protections that may help here, by preventing lease transfers to nodes that aren't able to apply the lease:

These are particularly relevant for the case where a lease is transferred to a node which is able to establish outbound connections (and thus can successfully heartbeat), but can't receive inbound connections (in which case it won't catch up on the Raft log and can't apply the lease). Previously, this would cause the range to become unavailable. However, this doesn't help when leases are already present on the semi-partitioned node, so the proposal here is still relevant (in particular by preventing the node from heartbeating, assuming the liveness leaseholder is also one-way partitioned from the node).

Do we have ETA for this one?

No concrete ETA I'm afraid, but we're very aware of this class of problem.

ajwerner · 2022-11-16T20:04:26Z

A simpler thing than described here which feels worth doing is finding a way to at least observe and identify when one-way partitions do happen. Today we can't tell. The latency tracking code is happy to use just one direction of connection to update its view.

knz · 2022-11-16T20:22:10Z

The latency tracking code is happy to use just one direction of connection to update its view.

Technically both sides of the connection maintain the latency metric for "their side" and we can see both in the UI (across the diagonal). We'd lose information if we merged them together.

erikgrinaker · 2022-12-11T20:45:54Z

In addition to this, we should require that all connection classes can be established. We have seen cases where nodes were able to establish one but not the other connection class due to a bug in the RPC circuit breakers.

andrewbaptist · 2023-01-04T21:35:58Z

Looking at solutions for this (both the initial and ongoing partitions), there are two general ways it could be done:

Don't send heartbeat PingResponse unless a backward connection is set up. If the reverse connection cannot be established or broken, stop sending the PingResponse messages. (what is described in the original proposal)
Add a field to PingResponse called badReverseConnections which specifies the list of reverse connections that are in an error state. A receiver of a PingResponse would treat a message with this field set as an error to the caller.

The second since it allows for differentiating why the heartbeats are not making it back and has more information such as the status of each connection class. It is not exactly clear what we should do if some connection classes are able to connect backward and others can't but at least we would have the information for logging/debugging purposes.

Regarding the gossip of the data and its usage for blocking. I agree it is useful to put the data in gossip, but I'm less clear about exactly how it should be used. If we are blocking certain connections based on "hearsay" then it seems possible that a bad node can cause a larger outage than it should (in the example, what if X is connected to both Y and Z and both report the other through gossip)? I'm not exactly sure about the conditions where this would happen, but it seems possible. Unless anyone feels strongly I wasn't going to use this "proxy" information initially, but it will be available in the logs / gossip dump.

erikgrinaker · 2023-01-05T10:30:55Z

I think we can start simple with 1 and not send ping responses. We can log an error on the rejecting node, which I think should be sufficient and wouldn't need any protocol changes.

As for the gossip aspect, let's defer that. This is related to global failure detection, and there are multiple conversations going on about this, so let's not implement anything until we have a clear picture of how failure detection should work in general (in particular wrt. leases).

andrewbaptist · 2023-01-06T17:21:15Z

The implementation of this was a little more complicated than I originally thought due to the behavior of the
The original thought was that it would look something like this:

Time 0 - Node A connects to Node B
Time 1 - Node A calls `heartbeatClient.Ping` to Node B
Time 2 - Node B receives PingRequest from A - does not send PingResponse since no backward connection
Time 3 - Node B connects to Node A
Time 4 - Node B calls `heartbeatClient.Ping` to Node A
Time 5 - Node A receives PingRequest from B - notices that a connection attempt is in progress and waits for it

But this doesn't work because they are deadlocked. I was hoping to examine the state of the connection that is being used since there are really three "important" states

TCP connection set up
PingRequest sent
PingResponse received

However, there does not appear to be a reliable way to tell after either 1 or 2 is complete with the way our connections are set up today.

The approach I am looking at now is the following:

Time 0 - Node A connects to Node B
Time 1 - Node A calls `heartbeatClient.Ping` to Node B
Time 2 - Node B receives PingRequest from A - does not send PingResponse since no fully established backward connection
Time 3 - Node B connects to Node A on a new connection set up with `grpc.WithBlock()`
Time 4 - Node B waits until TCP connection is established
Time 5 - Node B sends a PingResponse to node A (on the original connection)
Time 6 - Node B closes the TCP connection it just established

The other alternative is to send an error back immediately on the first PingRequest but that has the unfortunate impact of also failing the first request sent and also not sending another PingResponse for a while.

Previously the latency to remote nodes was tracked by address rather than the node's id. This could result in a few problems. First, the remote address could be reused across nodes. This could result in incorrect information. Additionally, places that used this information (such as the allocator) needed to unnecessarily map the node id to address just to do a lookup. Finally in preparation for dialback on heartbeat cockroachdb#84289 the use of the OriginAddr field in the PingRequest will change to be the actual address that a node should use to dial back. Currently this field is not set correctly. Epic: none Release note: None

Previously the latency to remote nodes was tracked by address rather than the node's id. This could result in a few problems. First, the remote address could be reused across nodes. This could result in incorrect information. Additionally, places that used this information (such as the allocator) needed to unnecessarily map the node id to address just to do a lookup. Finally in preparation for dialback on heartbeat cockroachdb#84289 the use of the OriginAddr field in the PingRequest is deprecated. That PR will add back another field that can be used to lookup the Locality correct field to use. Epic: none Release note: None

94438: sql,storage: add support for COL_BATCH_RESPONSE scan format r=yuzefovich a=yuzefovich This commit introduces a new `COL_BATCH_RESPONSE` scan format for Scans and ReverseScans which results only in needed columns to be returned from the KV server. In other words, this commit introduces the ability to perform the KV projection pushdown. The main idea of this feature is to use the injected decoding logic from SQL in order to process each KV and keep only the needed parts (i.e. necessary SQL columns). Those needed parts are then propagated back to the KV client as coldata.Batch'es (serialized in the Apache Arrow format). Here is the outline of all components involved: ``` ┌────────────────────────────────────────────────┐ │ SQL │ │________________________________________________│ │ colfetcher.ColBatchDirectScan │ │ │ │ │ ▼ │ │ row.txnKVFetcher │ │ (behind the row.KVBatchFetcher interface) │ └────────────────────────────────────────────────┘ │ ▼ ┌────────────────────────────────────────────────┐ │ KV Client │ └────────────────────────────────────────────────┘ │ ▼ ┌────────────────────────────────────────────────┐ │ KV Server │ │________________________________________________│ │ colfetcher.cFetcherWrapper │ │ (behind the storage.CFetcherWrapper interface) │ │ │ │ │ ▼ │ │ colfetcher.cFetcher │ │ │ │ │ ▼ │ │ storage.mvccScanFetchAdapter ────────┐│ │ (behind the storage.NextKVer interface) ││ │ │ ││ │ ▼ ││ │ storage.pebbleMVCCScanner ││ │ (which put's KVs into storage.singleResults) <┘│ └────────────────────────────────────────────────┘ ``` On the KV client side, `row.txnKVFetcher` issues Scans and ReverseScans with the `COL_BATCH_RESPONSE` format and returns the response (which contains the columnar data) to the `colfetcher.ColBatchDirectScan`. On the KV server side, we create a `storage.CFetcherWrapper` that asks the `colfetcher.cFetcher` for the next `coldata.Batch`. The `cFetcher`, in turn, fetches the next KV, decodes it, and keeps only values for the needed SQL columns, discarding the rest of the KV. The KV is emitted by the `mvccScanFetchAdapter` which - via the `singleResults` struct - exposes access to the current KV that the `pebbleMVCCScanner` is pointing at. Note that there is an additional "implicit synchronization" between components that is not shown on this diagram. In particular, `storage.singleResults.maybeTrimPartialLastRow` must be in sync with the `colfetcher.cFetcher` which is achieved by - the `cFetcher` exposing access to the first key of the last incomplete SQL row via the `FirstKeyOfRowGetter`, - the `singleResults` using that key as the resume key for the response, - and the `cFetcher` removing that last partial SQL row when `NextKV()` returns `partialRow=true`. This "upstream" link (although breaking the layering a bit) allows us to avoid a performance penalty for handling the case with multiple column families. (This case is handled by the `storage.pebbleResults` via tracking offsets into the `pebbleResults.repr`.) This code structure deserves some elaboration. First, there is a mismatch between the "push" mode in which the `pebbleMVCCScanner` operates and the "pull" mode that the `NextKVer` exposes. The adaption between two different modes is achieved via the `mvccScanFetcherAdapter` grabbing (when the control returns to it) the current unstable KV pair from the `singleResults` struct which serves as a one KV pair buffer that the `pebbleMVCCScanner` `put`s into. Second, in order be able to use the unstable KV pair without performing a copy, the `pebbleMVCCScanner` stops at the current KV pair and returns the control flow (which is exactly what `pebbleMVCCScanner.getOne` does) back to the `mvccScanFetcherAdapter`, with the adapter advancing the scanner only when the next KV pair is needed. There are multiple scenarios which are currently not supported: - SQL cannot issue Get requests (likely will support in 23.1) - `TraceKV` option is not supported (likely will support in 23.1) - user-defined types other than enums are not supported (will _not_ support in 23.1) - non-default key locking strength as well as SKIP LOCKED wait policy are not supported (will _not_ support in 23.1). The usage of this feature is currently disabled by default, but I intend to enable it by default for multi-tenant setups. The rationale is that currently there is a large performance hit when enabling it for single-tenant deployments whereas it offers significant speed up in the multi-tenant world. The microbenchmarks [show](https://gist.github.com/yuzefovich/669c295a8a4fdffa6490532284c5a719) the expected improvement in multi-tenant setups when the tenant runs in a separate process whenever we don't need to decode all of the columns from the table. The TPCH numbers, though, don't show the expected speedup: ``` Q1: before: 11.47s after: 8.84s -22.89% Q2: before: 0.41s after: 0.29s -27.71% Q3: before: 7.89s after: 9.68s 22.63% Q4: before: 4.48s after: 4.52s 0.86% Q5: before: 10.39s after: 10.35s -0.29% Q6: before: 33.57s after: 33.41s -0.48% Q7: before: 23.82s after: 23.81s -0.02% Q8: before: 3.78s after: 3.76s -0.68% Q9: before: 28.15s after: 28.03s -0.42% Q10: before: 5.00s after: 4.98s -0.42% Q11: before: 2.44s after: 2.44s 0.22% Q12: before: 34.78s after: 34.65s -0.37% Q13: before: 3.20s after: 2.94s -8.28% Q14: before: 3.13s after: 3.21s 2.43% Q15: before: 16.80s after: 16.73s -0.38% Q16: before: 1.60s after: 1.65s 2.96% Q17: before: 0.85s after: 0.96s 13.04% Q18: before: 16.39s after: 15.47s -5.61% Q19: before: 13.76s after: 13.01s -5.45% Q20: before: 55.33s after: 55.12s -0.38% Q21: before: 24.31s after: 24.31s -0.00% Q22: before: 1.28s after: 1.41s 10.26% ``` At the moment, `coldata.Batch` that is included into the response is always serialized into the Arrow format, but I intend to introduce the local fastpath to avoid that serialization. That work will be done in a follow-up and should be able to reduce the perf hit for single-tenant deployments. A quick note on the TODOs sprinkled in this commit: - `TODO(yuzefovich)` means that this will be left for 23.2 or later. - `TODO(yuzefovich, 23.1)` means that it should be addressed in 23.1. A quick note on testing: this commit randomizes the fact whether the new infrastructure is used in almost all test builds. Introducing some unit testing (say, in `storage` package) seems rather annoying since we must create keys that are valid SQL keys (i.e. have TableID / Index ID prefix) and need to come with the corresponding `fetchpb.IndexFetchSpec`. Not having unit tests in the `storage` seems ok to me given that the "meat" of the work there is still done by the `pebbleMVCCScanner` which is exercised using the regular Scans. End-to-end testing is well covered by all of our existing tests which now runs randomly. I did run the CI multiple times with the new feature enabled by default with no failure, so I hope that it shouldn't become flaky. Addresses: #82323. Informs: #87610. Epic: CRDB-14837 Release note: None 95701: gossip: Track latency by nodeID rather than addr r=kvoli,erikgrinaker a=andrewbaptist Previously the latency to remote nodes was tracked by address rather than the node's id. This could result in a few problems. First, the remote address could be reused across nodes. This could result in incorrect information. Additionally, places that used this information (such as the allocator) needed to unnecessarily map the node id to address just to do a lookup. Finally in preparation for dialback on heartbeat #84289 the use of the OriginAddr field in the PingRequest will change to be the actual address that a node should use to dial back. Currently this field is not set correctly. Epic: none Release note: None 95796: ui: add CPU Time chart do statement details r=maryliag a=maryliag This commit adds a new chart for CPU time on Statement Details page. Part Of #87213 <img width="1508" alt="Screen Shot 2023-01-24 at 6 01 07 PM" src="https://user-images.githubusercontent.com/1017486/214440274-c48d3bb6-ecbe-47a2-861a-0a8407d219c4.png"> Release note (ui change): Add CPU Time chart to Statement Details page. 95832: cdc: remove 'nonsensitive' tag from changefeed description in telemetry logs r=jayshrivastava a=jayshrivastava Previously, the description field in changefeed telemetry logs was marked as `nonsensitive`. This is incorrect because the description field may contain an SQL statement which is not safe to report. This change removes the `nonsensitive` tag so the field is redacted by default. Fixes: #95823 Epic: none Release note: none 95838: logictest: remove smallEngineBlocks randomization r=yuzefovich a=yuzefovich This metamorphic randomization has caused some flakiness (due to a subset of tests taking very long time) so is now removed. This feature should be tested in a more targeted fashion. Fixes: #95799. Fixes: #95829 Release note: None 95840: opt: replace make with dev in test instructions r=mgartner a=mgartner Epic: None Release note: None 95842: roachtest: fix parameters passed to require.NoError r=yuzefovich,srosenberg,herkolategan a=renatolabs When context is passed to an assertion, the parameters *must* be a string format, followed by arguments (as you would in a call to `fmt.Sprintf`). The previous code would panic trying to cast int to string. Informs #95416 Release note: None Co-authored-by: Yahor Yuzefovich <[email protected]> Co-authored-by: Andrew Baptist <[email protected]> Co-authored-by: maryliag <[email protected]> Co-authored-by: Jayant Shrivastava <[email protected]> Co-authored-by: Marcus Gartner <[email protected]> Co-authored-by: Renato Costa <[email protected]>

blathers-crl · 2023-03-08T14:10:14Z

Hi @andrewbaptist, please add branch-* labels to identify which branch(es) this release-blocker affects.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.}

Fixes: cockroachdb#84289 Previously one-way partitions where a node could initiate a successful TCP connection in one direction, but the reverse connection fails which causes problems. The node that initiates outgoing connections can acquire leases and cause failures for reads and writes to those ranges. This is particularly a problem if it acquires the liveness range leases, but is a problem even for other ranges. This commit adds an additional check during server-to-server communication where the recipient of a new PingRequest first validates that it is able to open a reverse connection to the initiator before responding. Additionally, it will monitor whether it has a successful reverse connection over time and asynchronously validate reverse connections to the sender. The ongoing validation is asynchronous to avoid adding delays to PingResponses as they are used for measuring clock offsets. Also the onlyOnceDialer prevents retrying after a dial error, however this can get into a state where it continually retries for certain network connections. This is not easy to reproduce in a unit tests as it requires killing the connection using iptables (normal closes don't cauuse this). After this change the onlyOnceDialer will no longer repeatedly retry to reconnect after a broken connection during setup. Release note (bug fix): RPC connections between nodes now require RPC connections to be established in both directions, otherwise the connection will be closed. This is done to prevent asymmetric network partitions where nodes are able to send outbound messages but not receive inbound messages, which could result in persistent unavailability. This behavior can be disabled by the cluster setting rpc.dialback.enabled. Epic: CRDB-2488

Fixes: cockroachdb#99104 Informs cockroachdb#84289 As part of the previous fix for partition handling, we tracked the state of a previous attempt and use that result on the next attempt. However if there were multiple connections, we may only block system traffic connections and not default class connections. This change addresses that by ensuring a failed dialback attempt is remembered until we are able to successfully connect back to the pinging node. Epic: none Release note: None

99429: ccl/sqlproxyccl: add PROXY protocol support via CLI flag to sqlproxy r=pjtatlow a=jaylim-crl This commits adds a new `require-proxy-protocol` flag to `mt start-proxy`, and that changes the sqlproxy's behavior to support the PROXY protocol. When the flag is set, the protocol will be enforced on the SQL listener, and supported on a best-effort basis on the HTTP listener. If the PROXY protocol isn't used, but is enforced, the connection will be rejected. The rationale behind doing best-effort basis on the HTTP listener is that some healthcheck systems don't support the protocol. This work is needed for the AWS PrivateLink work in CockroachCloud, which requires the use of the PROXY protocol. Release note: None Epic: none Release justification: SQL Proxy change only. Changes are needed for the AWS PrivateLink work in CockroachCloud. 99840: rpc: Fix blackhole recv r=erikgrinaker a=andrewbaptist Fixes: #99104 Informs #84289 As part of the previous fix for partition handling, we tracked the state of a previous attempt and use that result on the next attempt. However if there were multiple connections, we may only block system traffic connections and not default class connections. This change addresses that by ensuring a failed dialback attempt is remembered until we are able to successfully connect back to the pinging node. Epic: none Release note: None Co-authored-by: Jay <[email protected]> Co-authored-by: Andrew Baptist <[email protected]>

Fixes: #99104 Informs #84289 As part of the previous fix for partition handling, we tracked the state of a previous attempt and use that result on the next attempt. However if there were multiple connections, we may only block system traffic connections and not default class connections. This change addresses that by ensuring a failed dialback attempt is remembered until we are able to successfully connect back to the pinging node. Epic: none Release note: None

Previously the latency to remote nodes was tracked by address rather than the node's id. This could result in a few problems. First, the remote address could be reused across nodes. This could result in incorrect information. Additionally, places that used this information (such as the allocator) needed to unnecessarily map the node id to address just to do a lookup. Finally in preparation for dialback on heartbeat cockroachdb#84289 the use of the OriginAddr field in the PingRequest is deprecated. That PR will add back another field that can be used to lookup the Locality correct field to use. Epic: none Release note: None

knz added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jul 12, 2022

blathers-crl bot added the T-kv KV Team label Jul 12, 2022

knz added A-kv-server Relating to the KV-level RPC server T-kv KV Team and removed T-kv KV Team labels Jul 12, 2022

erikgrinaker mentioned this issue Nov 21, 2022

kvserver: Raft prevote does not prevent election despite active leader #92088

Closed

exalate-issue-sync bot assigned andrewbaptist Jan 3, 2023

erikgrinaker mentioned this issue Jan 4, 2023

roachtest: failover variant for partial network partitions #94614

Closed

andrewbaptist linked a pull request Jan 5, 2023 that will close this issue

kv: convert uni-directional network partitions to bi-directional #94778

Merged

andrewbaptist mentioned this issue Jan 11, 2023

kv: partial/asymmetric partition that does not isolate node liveness does not force range lease revocation #49220

Open

andrewbaptist mentioned this issue Jan 23, 2023

gossip: Track latency by nodeID rather than addr #95701

Merged

andrewbaptist added the GA-blocker label Mar 8, 2023

craig bot closed this as completed in 5800482 Mar 9, 2023

andrewbaptist mentioned this issue Mar 28, 2023

rpc: Fix blackhole recv #99840

Merged

blathers-crl bot mentioned this issue Mar 30, 2023

release-23.1: rpc: Fix blackhole recv #100238

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rpc: refuse incoming connections unless dialback succeeds #84289

rpc: refuse incoming connections unless dialback succeeds #84289

knz commented Jul 12, 2022 •

edited by exalate-issue-sync bot

Loading

erikgrinaker commented Jul 12, 2022 •

edited

Loading

knz commented Jul 12, 2022

zhoufang-joe commented Nov 8, 2022

erikgrinaker commented Nov 11, 2022

ajwerner commented Nov 16, 2022

knz commented Nov 16, 2022

erikgrinaker commented Dec 11, 2022

andrewbaptist commented Jan 4, 2023

erikgrinaker commented Jan 5, 2023 •

edited

Loading

andrewbaptist commented Jan 6, 2023

blathers-crl bot commented Mar 8, 2023

rpc: refuse incoming connections unless dialback succeeds #84289

rpc: refuse incoming connections unless dialback succeeds #84289

Comments

knz commented Jul 12, 2022 • edited by exalate-issue-sync bot Loading

erikgrinaker commented Jul 12, 2022 • edited Loading

knz commented Jul 12, 2022

zhoufang-joe commented Nov 8, 2022

erikgrinaker commented Nov 11, 2022

ajwerner commented Nov 16, 2022

knz commented Nov 16, 2022

erikgrinaker commented Dec 11, 2022

andrewbaptist commented Jan 4, 2023

erikgrinaker commented Jan 5, 2023 • edited Loading

andrewbaptist commented Jan 6, 2023

blathers-crl bot commented Mar 8, 2023

knz commented Jul 12, 2022 •

edited by exalate-issue-sync bot

Loading

erikgrinaker commented Jul 12, 2022 •

edited

Loading

erikgrinaker commented Jan 5, 2023 •

edited

Loading