
rpc: Fix blackhole recv #99840

Merged
merged 1 commit into cockroachdb:master from the 2023.03.28-blackhole_recv branch on Mar 30, 2023

Conversation

andrewbaptist
Collaborator

Fixes: #99104
Informs #84289

As part of the previous fix for partition handling, we tracked the state of a previous attempt and used that result on the next attempt. However, if there were multiple connections, we may only block system traffic connections and not default class connections. This change addresses that by ensuring a failed dialback attempt is remembered until we are able to successfully connect back to the pinging node.

Epic: none
Release note: None
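
To make the fix concrete, here is a minimal Go sketch of the tracking described above. The names (dialbackTracker, record, previousFailure) and the plain nodeID type are hypothetical simplifications; the actual change lives in pkg/rpc/context.go and stores its state under rpcCtx.dialbackMu.

// Minimal sketch only: hypothetical names, not the actual rpc.Context code.
package main

import (
    "errors"
    "fmt"
    "sync"
)

type nodeID int32

// dialbackTracker remembers a failed dialback per node until a later attempt
// succeeds, so that pings on every connection class observe the failure
// instead of only the class that happened to trigger the dialback.
type dialbackTracker struct {
    mu     sync.Mutex
    failed map[nodeID]error
}

func newDialbackTracker() *dialbackTracker {
    return &dialbackTracker{failed: map[nodeID]error{}}
}

// record stores err on failure and clears any remembered failure on success.
func (t *dialbackTracker) record(id nodeID, err error) {
    t.mu.Lock()
    defer t.mu.Unlock()
    if err == nil {
        delete(t.failed, id)
        return
    }
    t.failed[id] = err
}

// previousFailure reports the remembered failure, if any, so that subsequent
// pings keep failing until the node can be dialed back successfully.
func (t *dialbackTracker) previousFailure(id nodeID) (error, bool) {
    t.mu.Lock()
    defer t.mu.Unlock()
    err, ok := t.failed[id]
    return err, ok
}

func main() {
    t := newDialbackTracker()
    t.record(1, errors.New("dialback to n1 failed"))
    // Pings on both SystemClass and DefaultClass would now see the failure.
    if err, ok := t.previousFailure(1); ok {
        fmt.Println("reject ping:", err)
    }
    t.record(1, nil) // a successful dialback clears the entry
    _, stillFailed := t.previousFailure(1)
    fmt.Println("still failed:", stillFailed)
}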

@andrewbaptist andrewbaptist requested review from a team as code owners March 28, 2023 18:35
@andrewbaptist andrewbaptist requested review from srosenberg and smg260 and removed request for a team March 28, 2023 18:35
@cockroach-teamcity
Member

This change is Reviewable

Contributor

@erikgrinaker erikgrinaker left a comment


we may only block system traffic connections and not default class connections

Why is this a problem?

In the blackhole-recv case, the node has no inbound connections, but it has outbound connections. It therefore can't receive RPC traffic from the gateways. If we close the SystemClass connection, it also can't send Raft or liveness heartbeats, so it can't hold on to leadership or leases. Why does the workload stall in that case?

Are you sure the problem here wasn't that we closed DefaultClass but not SystemClass? That would allow the node to hang onto leases and leadership, even though no gateways could reach it.

rpc: Fix blackhole recv

Nit: consider describing the actual bug here, i.e. that we often wouldn't detect dialback failures for some classes.

Comment on lines 745 to 757
f.c.Run(ctx, f.c.Node(nodeID),
`sudo iptables -A INPUT -m multiport -p tcp --ports 26257 -j DROP`)
f.c.Run(ctx, f.c.Node(nodeID),
`sudo iptables -A OUTPUT -m multiport -p tcp --ports 26257 -j DROP`)
Contributor

Why remove this? This puts blackhole nodes into this weird state where they can receive TCP packets from remote port 26257 and send packets from local port 26257, but not the other way around (recall that client ports are randomly chosen). While that's still going to result in a useless TCP connection since packets can't be acked, the common case of a network outage is that all packets are dropped, and we should have coverage for that case.

We could arguably change the input/output cases below to drop packets in both directions (still maintaining connection asymmetry), which would have the same effect. However, asymmetric partitions are often caused by an incorrect firewall rule that drops packets in one direction, so it might be useful to cover that scenario. That seems less important than covering the common scenario of a total outage, though.

ctx := rpcCtx.makeDialCtx(target, 0, SystemClass)
conn, err := rpcCtx.grpcDialRaw(ctx, target, SystemClass, grpc.WithBlock())
if err != nil {
log.Warningf(ctx, "dialback connection failed to %s, n%d, %v", target, nodeID, err)
Contributor

nit: shouldn't this be a warning/error, since it results in a connection failure?

// continue to return success. Once this completes, we remove this from our map
// and return whatever error this attempt returned.
func (rpcCtx *Context) loadOrCreateConnAttempt(
nodeID roachpb.NodeID, createConnFunc func() *Connection,
Contributor

nit: why the closure?

func (rpcCtx *Context) previousAttempt(nodeID roachpb.NodeID) (error, bool) {
// Check if there was a previous attempt and if so use that and clear out
// the previous attempt.
// createPreviousAttempt handles the case where we don't have a previous attempt
Contributor

nit: method names in comments here seem outdated.

Comment on lines 2694 to 2696
// connection attempt on future pings. Use the SystemClass to ensure that Raft
// traffic is not interrupted. It is unusual for some classes to be affected and
// not others but the SystemClass is the one we really care about.
Contributor

nit: Raft/liveness heartbeats (most Raft traffic goes via DefaultClass).

Comment on lines +2721 to +2723
if err == nil {
// If it completed without error then don't track the connection
// anymore. If it did have an error we need to track it until it later gets cleared.
rpcCtx.dialbackMu.m[nodeID] = nil
}
Contributor

Right, ok. I guess the problem here was that we never left the error around for other classes to pick up, so given a regular ping rate only one class would see the error. I think this becomes particularly problematic when it's the SystemClass that doesn't see the error, rather than the DefaultClass.

Is there a risk here that we can leave a stray failed connection around, permanently wedging it? Do we need to reset this when we successfully make an outbound connection to a node, outside of VerifyDialback?

Contributor

@erikgrinaker erikgrinaker Mar 30, 2023

On second thought, I guess that's covered by the explicit check for active connections in VerifyDialback, which also avoids leaking this stuff all over rpcContext. Might be nice to call that out.
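
For illustration, here is a rough Go sketch of that short-circuit, reusing the hypothetical nodeID type from the sketch under the PR description. The names and signature are made up for this sketch; the actual check is part of VerifyDialback in pkg/rpc/context.go.

// Sketch only: if a healthy reverse connection already exists, there is
// nothing to verify and no failure needs to be tracked; otherwise the result
// of the dialback attempt is what gets remembered until a later success.
func verifyDialbackSketch(
    hasHealthyConn func(id nodeID) bool,
    dialBack func(id nodeID) error,
    id nodeID,
) error {
    if hasHealthyConn(id) {
        return nil
    }
    return dialBack(id)
}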

@erikgrinaker erikgrinaker added the backport-23.1.x Flags PRs that need to be backported to 23.1 label Mar 30, 2023
@andrewbaptist andrewbaptist force-pushed the 2023.03.28-blackhole_recv branch from 5af4759 to b52c62f Compare March 30, 2023 20:57
@blathers-crl

blathers-crl bot commented Mar 30, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Collaborator Author

@andrewbaptist andrewbaptist left a comment


TFTR! I cleaned up the comments and removed the change to the failover test. I'll bors it once it passes and we should be able to see results tonight.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @smg260, and @srosenberg)


pkg/rpc/context.go line 0 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

(Reviewable was unable to map this GitHub inline comment thread to the right spot — sorry!)

nit: shouldn't this be a warning/error, since it results in a connection failure?

I'm not sure what the right option is here; we already flood the log with messages when a node goes down. I downgraded the error level since this is not always an "abnormal" case (e.g. a node being upgraded or taken down), and I didn't want warnings for "normal" operations. I'll leave this for now, but we could consider upgrading it later.


pkg/rpc/context.go line 2689 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: method names in comments here seem outdated.

Done - cleaned up the name and comments.


pkg/rpc/context.go line 2696 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: Raft/liveness heartbeats (most Raft traffic goes via DefaultClass).

Done - clarified the text regarding this.


pkg/rpc/context.go line 2703 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: why the closure?

This was to allow us to defer creating the connection until we know we need it, while still doing so under the lock. It is similar to Java's computeIfAbsent function. I could also just inline this method, but keeping it separate makes it clearer where the locks are held.
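
For readers unfamiliar with the pattern, here is a generic Go sketch of load-or-create under a lock (a hypothetical lockedMap type, not the actual loadOrCreateConnAttempt code): the closure only runs when the key is absent, and it runs while the mutex is held, so concurrent callers end up sharing a single entry.

package main

import (
    "fmt"
    "sync"
)

// lockedMap is a hypothetical illustration of the computeIfAbsent-style
// pattern: creation is deferred until we know the entry is missing, and it
// happens under the lock so concurrent callers share a single entry.
type lockedMap[K comparable, V any] struct {
    mu sync.Mutex
    m  map[K]V
}

func (lm *lockedMap[K, V]) loadOrCreate(key K, createFunc func() V) V {
    lm.mu.Lock()
    defer lm.mu.Unlock()
    if v, ok := lm.m[key]; ok {
        return v
    }
    v := createFunc()
    lm.m[key] = v
    return v
}

func main() {
    lm := &lockedMap[int, string]{m: map[int]string{}}
    conn := lm.loadOrCreate(1, func() string {
        fmt.Println("creating connection attempt for n1") // runs only once
        return "conn-attempt-n1"
    })
    fmt.Println(conn)
    fmt.Println(lm.loadOrCreate(1, func() string { return "unused" }))
}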


pkg/rpc/context.go line 2725 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

On second thought, I guess that's covered by the explicit check for active connections in VerifyDialback, which also avoids leaking this stuff all over rpcContext. Might be nice to call that out.

Added comments about this.


pkg/cmd/roachtest/tests/failover.go line 0 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

(Reviewable was unable to map this GitHub inline comment thread to the right spot — sorry!)

Why remove this? This puts blackhole nodes into this weird state where they can receive TCP packets from remote port 26257 and send packets from local port 26257, but not the other way around (recall that client ports are randomly chosen). While that's still going to result in a useless TCP connection since packets can't be acked, the common case of a network outage is that all packets are dropped, and we should have coverage for that case.

We could arguably change the input/output cases below to drop packets in both directions (still maintaining connection asymmetry), which would have the same effect. However, asymmetric partitions are often caused by an incorrect firewall rule that drops packets in one direction, so it might be useful to cover that scenario. That seems less important than covering the common scenario of a total outage, though.

I pulled this change out of the commit. I think it was not changing anything, but I don't want to change the code and the test at the same time. I'll pull this out to a separate PR that doesn't need to be backported.

@andrewbaptist andrewbaptist force-pushed the 2023.03.28-blackhole_recv branch from b52c62f to 2c88c13 Compare March 30, 2023 20:59
Contributor

@erikgrinaker erikgrinaker left a comment


Nice!

Reviewed 2 of 4 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist, @smg260, and @srosenberg)


pkg/rpc/context.go line 2703 at r1 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

This was to allow us to defer creating the connection until we know we need it, while still doing so under the lock. It is similar to Java's computeIfAbsent function. I could also just inline this method, but keeping it separate makes it clearer where the locks are held.

I see. We usually do this by suffixing the method name, e.g. loadOrCreateConnAttemptLocked() or loadOrCreateConnAttemptDialbackMuLocked(), see e.g. the plethora of RaftMuLocked() methods.


pkg/cmd/roachtest/tests/failover.go line 0 at r1 (raw file):

I think it was not changing anything

It sort of does, at least technically. Consider two nodes A and B, with a (randomly picked) client port 16384 and server port 26257. We'll have TCP packets going in four directions, across two TCP connections:

  1. A:16384 → B:26257
  2. A:16384 ← B:26257
  3. A:26257 → B:16384
  4. A:26257 ← B:16384

Consider the existing rules, applied on B:

iptables -A INPUT -m multiport -p tcp --ports 26257 -j DROP
iptables -A OUTPUT -m multiport -p tcp --ports 26257 -j DROP

The first rule blocks 1 and 3, the second rule blocks 2 and 4, because --ports matches both the source port and the destination port. Now consider the suggested rules:

iptables -A INPUT -p tcp --dport 26257 -j DROP
iptables -A OUTPUT -p tcp --dport 26257 -j DROP

This only blocks 1 and 4, allowing packets through for 2 and 3.

As I said, this probably doesn't matter all that much since the TCP connections won't work (SYNs and ACKs don't get through), but on the off chance that stray packets can affect the TCP stack behavior somehow I'd rather cover the common outage scenario where all packets are dropped.
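
For concreteness, one hypothetical variant in the style of the f.c.Run excerpt quoted earlier in this thread (not the rules the test actually uses) would drop both directions of connections to the node's listen port while leaving the connections it initiates untouched:

// Drops direction 1 (A:16384 → B:26257): packets arriving at B's listen port.
f.c.Run(ctx, f.c.Node(nodeID),
    `sudo iptables -A INPUT -p tcp --dport 26257 -j DROP`)
// Drops direction 2 (A:16384 ← B:26257): packets B sends from its listen port.
f.c.Run(ctx, f.c.Node(nodeID),
    `sudo iptables -A OUTPUT -p tcp --sport 26257 -j DROP`)
// Directions 3 and 4 (connections B initiates from ephemeral ports) stay open,
// preserving the asymmetric blackhole-recv partition while dropping all
// packets on the affected connections.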

Fixes: cockroachdb#99104
Informs cockroachdb#84289

As part of the previous fix for partition handling, we tracked the state
of a previous attempt and used that result on the next attempt. However,
if there were multiple connections, we may only block system traffic
connections and not default class connections. This change addresses
that by ensuring a failed dialback attempt is remembered until we are
able to successfully connect back to the pinging node.

Epic: none
Release note: None
@andrewbaptist andrewbaptist force-pushed the 2023.03.28-blackhole_recv branch from 2c88c13 to 53c94ad Compare March 30, 2023 21:29
@andrewbaptist
Collaborator Author

bors r=erikgrinaker

@craig
Contributor

craig bot commented Mar 30, 2023

Build succeeded:

@craig craig bot merged commit 2af36b8 into cockroachdb:master Mar 30, 2023
@andrewbaptist andrewbaptist deleted the 2023.03.28-blackhole_recv branch December 15, 2023 21:34
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1
Development

Successfully merging this pull request may close these issues.

kvserver: RPC dialback does not resolve failover/non-system/blackhole-recv