rpc: Fix blackhole recv #99840
Conversation
Force-pushed 56f52c4 to 84444f5
Force-pushed 84444f5 to 5af4759
we may only block system traffic connections and not default class connections

Why is this a problem?

In the blackhole-recv case, the node has no inbound connections, but it has outbound connections. It therefore can't receive RPC traffic from the gateways. If we close the SystemClass connection, it also can't send Raft heartbeats nor liveness heartbeats, so it can't hold on to leadership or leases. Why does the workload stall in that case?

Are you sure the problem here wasn't that we closed DefaultClass but not SystemClass? That would allow the node to hang onto leases and leadership, even though no gateways could reach it.
rpc: Fix blackhole recv
Nit: consider describing the actual bug here, i.e. that we often wouldn't detect dialback failures for some classes.
pkg/cmd/roachtest/tests/failover.go (outdated)

f.c.Run(ctx, f.c.Node(nodeID),
	`sudo iptables -A INPUT -m multiport -p tcp --ports 26257 -j DROP`)
f.c.Run(ctx, f.c.Node(nodeID),
	`sudo iptables -A OUTPUT -m multiport -p tcp --ports 26257 -j DROP`)
Why remove this? This puts blackhole nodes into this weird state where they can receive TCP packets from remote port 26257 and send packets from local port 26257, but not the other way around (recall that client ports are randomly chosen). While that's still going to result in a useless TCP connection since packets can't be acked, the common case of a network outage is that all packets are dropped, and we should have coverage for that case.

We could arguably change the input/output cases below to drop packets in both directions (still maintaining connection asymmetry), which would have the same effect; see the sketch below. However, asymmetric partitions are often caused by an incorrect firewall rule that drops packets in one direction, so it might be useful to cover that scenario. That seems less important than covering the common scenario of a total outage, though.
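As a rough sketch of that both-directions variant (not from the PR; the --dport/--sport rules are an assumption about how one might express it), dropping every packet belonging to inbound-initiated connections while leaving outbound connections intact could look like:

	// Hypothetical variant: drop all packets, in both directions, for
	// connections that target this node's server port 26257, while
	// connections this node dials out (random client ports) keep working.
	f.c.Run(ctx, f.c.Node(nodeID),
		`sudo iptables -A INPUT -p tcp --dport 26257 -j DROP`)
	f.c.Run(ctx, f.c.Node(nodeID),
		`sudo iptables -A OUTPUT -p tcp --sport 26257 -j DROP`)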
ctx := rpcCtx.makeDialCtx(target, 0, SystemClass)
conn, err := rpcCtx.grpcDialRaw(ctx, target, SystemClass, grpc.WithBlock())
if err != nil {
	log.Warningf(ctx, "dialback connection failed to %s, n%d, %v", target, nodeID, err)
nit: shouldn't this be a warning/error, since it results in a connection failure?
// continue to return success. Once this completes, we remove this from our map
// and return whatever error this attempt returned.
func (rpcCtx *Context) loadOrCreateConnAttempt(
	nodeID roachpb.NodeID, createConnFunc func() *Connection,
nit: why the closure?
pkg/rpc/context.go (outdated)
func (rpcCtx *Context) previousAttempt(nodeID roachpb.NodeID) (error, bool) {
	// Check if there was a previous attempt and if so use that and clear out
	// the previous attempt.
	// createPreviousAttempt handles the case where we don't have a previous attempt
nit: method names in comments here seem outdated.
pkg/rpc/context.go (outdated)
// connection attempt on future pings. Use the SystemClass to ensure that Raft
// traffic is not interrupted. It is unusual for some classes to be affected and
// not others but the SystemClass is the one we really care about.
nit: Raft/liveness heartbeats (most Raft traffic goes via DefaultClass).
if err == nil {
	// If it completed without error then don't track the connection
	// anymore. If it did have an error we need to track it until it later gets cleared.
	rpcCtx.dialbackMu.m[nodeID] = nil
}
Right, ok. I guess the problem here was that we never left the error around for other classes to pick up, so given a regular ping rate only one class would see the error. I think this becomes particularly problematic when it's the SystemClass that doesn't see the error, rather than the DefaultClass.

Is there a risk here that we can leave a stray failed connection around, permanently wedging it? Do we need to reset this when we successfully make an outbound connection to a node, outside of VerifyDialback?
On second thought, I guess that's covered by the explicit check for active connections in VerifyDialback, which also avoids leaking this stuff all over rpcContext. Might be nice to call that out.
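For context, the flow being described is roughly the following (a sketch; hasHealthyConn is an illustrative helper name, not necessarily the real API, and the real code may clear the entry differently):

	// Sketch: a healthy established connection back to the pinging node
	// means dialback trivially succeeds, so any stale tracked failure can
	// be dropped here and nothing leaks outside this map.
	if rpcCtx.hasHealthyConn(nodeID) {
		rpcCtx.dialbackMu.Lock()
		delete(rpcCtx.dialbackMu.m, nodeID)
		rpcCtx.dialbackMu.Unlock()
		return nil
	}
	// Otherwise, consult (or create) the tracked reverse-connection attempt
	// and keep surfacing its error on subsequent pings until it resolves.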
Force-pushed 5af4759 to b52c62f
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
TFTR! I cleaned up the comments and removed the change to the failover test. I'll bors it once it passes and we should be able to see results tonight.
pkg/rpc/context.go
line 0 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: shouldn't this be a warning/error, since it results in a connection failure?
I'm not sure what the better option is here; we already flood the log with messages when a node goes down. I downgraded the log level since this isn't an "abnormal" case in all situations (e.g. if they were upgrading or taking a node down), so I didn't want warnings for "normal" operations. I'll leave this for now, but we could consider upgrading it later.
pkg/rpc/context.go
line 2689 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: method names in comments here seem outdated.
Done - cleaned up the name and comments.
pkg/rpc/context.go
line 2696 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: Raft/liveness heartbeats (most Raft traffic goes via DefaultClass).
Done - clarified the text regarding this.
pkg/rpc/context.go
line 2703 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: why the closure?
This was to allow us to defer creating the connection until we know we need to, but also do it under the lock. It is similar to Java's computeIfAbsent function. I could also just inline this method back in, but keeping it separate made it clearer where the locks are held.
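As a rough illustration of that pattern (paraphrased, not the exact PR code; the return type and body are assumptions based on the excerpts above):

	// computeIfAbsent-style helper: the closure defers the potentially
	// expensive connection creation until, while holding the lock, we know
	// no attempt is being tracked for this node yet.
	func (rpcCtx *Context) loadOrCreateConnAttempt(
		nodeID roachpb.NodeID, createConnFunc func() *Connection,
	) *Connection {
		rpcCtx.dialbackMu.Lock()
		defer rpcCtx.dialbackMu.Unlock()
		if conn, ok := rpcCtx.dialbackMu.m[nodeID]; ok && conn != nil {
			return conn // reuse the tracked attempt
		}
		conn := createConnFunc() // only invoked if absent, under the lock
		rpcCtx.dialbackMu.m[nodeID] = conn
		return conn
	}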
pkg/rpc/context.go
line 2725 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
On second thought, I guess that's covered by the explicit check for active connections in VerifyDialback, which also avoids leaking this stuff all over rpcContext. Might be nice to call that out.
Added comments about this.
pkg/cmd/roachtest/tests/failover.go
line 0 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Why remove this? This puts blackhole nodes into this weird state where they can receive TCP packets from remote port 26257 and send packets from local port 26257, but not the other way around (recall that client ports are randomly chosen). While that's still going to result in a useless TCP connection since packets can't be acked, the common case of a network outage is that all packets are dropped, and we should have coverage for that case.

We could arguably change the input/output cases below to drop packets in both directions (still maintaining connection asymmetry), which would have the same effect. However, asymmetric partitions are often caused by an incorrect firewall rule that drops packets in one direction, so it might be useful to cover that scenario. That seems less important than covering the common scenario of a total outage, though.
I pulled this change out of the commit. I think it wasn't changing anything, but I don't want to change the code and the test at the same time. I'll pull it out into a separate PR that doesn't need to be backported.
Force-pushed b52c62f to 2c88c13
Nice!
Reviewed 2 of 4 files at r1, all commit messages.
pkg/rpc/context.go
line 2703 at r1 (raw file):
Previously, andrewbaptist (Andrew Baptist) wrote…
This was to allow us to defer creating the connection until we know we need to, but also do it under the lock. It is similar to Java's computeIfAbsent function. I could also just inline this method back in, but keeping it separate made it clearer where the locks are held.
I see. We usually do this by suffixing the method name, e.g. loadOrCreateConnAttemptLocked() or loadOrCreateConnAttemptDialbackMuLocked(); see e.g. the plethora of RaftMuLocked() methods.
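A hypothetical illustration of that convention (names and signature are for the example only, not the PR's actual code):

	// loadOrCreateConnAttempt acquires dialbackMu and delegates to the
	// Locked variant, whose suffix encodes its locking requirement.
	func (rpcCtx *Context) loadOrCreateConnAttempt(nodeID roachpb.NodeID) *Connection {
		rpcCtx.dialbackMu.Lock()
		defer rpcCtx.dialbackMu.Unlock()
		return rpcCtx.loadOrCreateConnAttemptDialbackMuLocked(nodeID)
	}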
pkg/cmd/roachtest/tests/failover.go
line 0 at r1 (raw file):
I think it was not changing anything
It sort of does, at least technically. Consider two nodes A and B, with a (randomly picked) client port 16384 and server port 26257. We'll have TCP packets going in four directions, across two TCP connections:
- A:16384 → B:26257
- A:16384 ← B:26257
- A:26257 → B:16384
- A:26257 ← B:16384
Consider the existing rules, applied on B:
iptables -A INPUT -m multiport -p tcp --ports 26257 -j DROP
iptables -A OUTPUT -m multiport -p tcp --ports 26257 -j DROP
The first rule blocks 1 and 3, and the second rule blocks 2 and 4, because --ports matches both the source port and the destination port. Now consider the suggested rules:
iptables -A INPUT -p tcp --dport 26257 -j DROP
iptables -A OUTPUT -p tcp --dport 26257 -j DROP
This only blocks 1 and 4, allowing packets through for 2 and 3.
As I said, this probably doesn't matter all that much since the TCP connections won't work (SYNs and ACKs don't get through), but on the off chance that stray packets can affect the TCP stack behavior somehow I'd rather cover the common outage scenario where all packets are dropped.
Force-pushed 2c88c13 to 53c94ad
bors r=erikgrinaker
Build succeeded.
Fixes: #99104
Informs #84289
As part of the previous fix for partition handling, we tracked the state of a previous attempt and used that result on the next attempt. However, if there were multiple connections, we might block only system-class traffic connections and not default-class connections. This change addresses that by ensuring a failed dialback attempt is remembered until we are able to successfully connect back to the pinging node.
Epic: none
Release note: None