rpc: Fix blackhole recv #99840
Conversation
Force-pushed 56f52c4 to 84444f5
Force-pushed 84444f5 to 5af4759
we may only block system traffic connections and not default class connections

Why is this a problem?

In the blackhole-recv case, the node has no inbound connections, but it has outbound connections. It therefore can't receive RPC traffic from the gateways. If we close the SystemClass connection, it also can't send Raft heartbeats nor liveness heartbeats, so it can't hold on to leadership or leases. Why does the workload stall in that case?

Are you sure the problem here wasn't that we closed DefaultClass but not SystemClass? That would allow the node to hang onto leases and leadership, even though no gateways could reach it.
rpc: Fix blackhole recv
Nit: consider describing the actual bug here, i.e. that we often wouldn't detect dialback failures for some classes.
pkg/cmd/roachtest/tests/failover.go (outdated)

f.c.Run(ctx, f.c.Node(nodeID),
	`sudo iptables -A INPUT -m multiport -p tcp --ports 26257 -j DROP`)
f.c.Run(ctx, f.c.Node(nodeID),
	`sudo iptables -A OUTPUT -m multiport -p tcp --ports 26257 -j DROP`)
Why remove this? This puts blackhole nodes into this weird state where they can receive TCP packets from remote port 26257 and send packets from local port 26257, but not the other way around (recall that client ports are randomly chosen). While that's still going to result in a useless TCP connection since packets can't be acked, the common case of a network outage is that all packets are dropped, and we should have coverage for that case.

We could arguably change the input/output cases below to drop packets in both directions (still maintaining connection asymmetry), which would have the same effect; see the sketch below. However, asymmetric partitions are often caused by an incorrect firewall rule that drops packets in one direction, so it might be useful to cover that scenario. That seems less important than covering the common scenario of a total outage, though.
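As a rough sketch of that both-directions variant (not from the PR; the --dport/--sport rules are an assumption about how one might express it), dropping every packet belonging to inbound-initiated connections while leaving outbound connections intact could look like:

	// Hypothetical variant: drop all packets, in both directions, for
	// connections that target this node's server port 26257, while
	// connections this node dials out (random client ports) keep working.
	f.c.Run(ctx, f.c.Node(nodeID),
		`sudo iptables -A INPUT -p tcp --dport 26257 -j DROP`)
	f.c.Run(ctx, f.c.Node(nodeID),
		`sudo iptables -A OUTPUT -p tcp --sport 26257 -j DROP`)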
ctx := rpcCtx.makeDialCtx(target, 0, SystemClass)
conn, err := rpcCtx.grpcDialRaw(ctx, target, SystemClass, grpc.WithBlock())
if err != nil {
	log.Warningf(ctx, "dialback connection failed to %s, n%d, %v", target, nodeID, err)
nit: shouldn't this be a warning/error, since it results in a connection failure?
// continue to return success. Once this completes, we remove this from our map
// and return whatever error this attempt returned.
func (rpcCtx *Context) loadOrCreateConnAttempt(
	nodeID roachpb.NodeID, createConnFunc func() *Connection,
nit: why the closure?
pkg/rpc/context.go (outdated)
func (rpcCtx *Context) previousAttempt(nodeID roachpb.NodeID) (error, bool) {
	// Check if there was a previous attempt and if so use that and clear out
	// the previous attempt.
	// createPreviousAttempt handles the case where we don't have a previous attempt
nit: method names in comments here seem outdated.
pkg/rpc/context.go (outdated)
// connection attempt on future pings. Use the SystemClass to ensure that Raft
// traffic is not interrupted. It is unusual for some classes to be affected and
// not others but the SystemClass is the one we really care about.
nit: Raft/liveness heartbeats (most Raft traffic goes via DefaultClass).
if err == nil {
	// If it completed without error then don't track the connection
	// anymore. If it did have an error we need to track it until it later gets cleared.
	rpcCtx.dialbackMu.m[nodeID] = nil
}
Right, ok. I guess the problem here was that we never left the error around for other classes to pick up, so given a regular ping rate only one class would see the error. I think this becomes particularly problematic when it's the SystemClass that doesn't see the error, rather than the DefaultClass.

Is there a risk here that we can leave a stray failed connection around, permanently wedging it? Do we need to reset this when we successfully make an outbound connection to a node, outside of VerifyDialback?
On second thought, I guess that's covered by the explicit check for active connections in VerifyDialback, which also avoids leaking this stuff all over rpcContext. Might be nice to call that out.
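For context, the flow being described is roughly the following (a sketch; hasHealthyConn is an illustrative helper name, not necessarily the real API, and the real code may clear the entry differently):

	// Sketch: a healthy established connection back to the pinging node
	// means dialback trivially succeeds, so any stale tracked failure can
	// be dropped here and nothing leaks outside this map.
	if rpcCtx.hasHealthyConn(nodeID) {
		rpcCtx.dialbackMu.Lock()
		delete(rpcCtx.dialbackMu.m, nodeID)
		rpcCtx.dialbackMu.Unlock()
		return nil
	}
	// Otherwise, consult (or create) the tracked reverse-connection attempt
	// and keep surfacing its error on subsequent pings until it resolves.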
Force-pushed 5af4759 to b52c62f
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
TFTR! I cleaned up the comments and removed the change to the failover test. I'll bors it once it passes and we should be able to see results tonight.
pkg/rpc/context.go
line 0 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: shouldn't this be a warning/error, since it results in a connection failure?
I'm not sure what the better option is here; we already flood the log with messages when a node goes down. I downgraded the log level since this isn't an "abnormal" case in all situations (e.g. if they were upgrading or taking a node down), so I didn't want warnings for "normal" operations. I'll leave this for now, but we could consider upgrading it later.
pkg/rpc/context.go
line 2689 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: method names in comments here seem outdated.
Done - cleaned up the name and comments.
pkg/rpc/context.go
line 2696 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: Raft/liveness heartbeats (most Raft traffic goes via DefaultClass).
Done - clarified the text regarding this.
pkg/rpc/context.go
line 2703 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: why the closure?
This was to allow us to defer creating the connection until we know we need to, but also do it under the lock. It is similar to Java's computeIfAbsent function. I could also just inline this method back in, but keeping it separate made it clearer where the locks are held.
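As a rough illustration of that pattern (paraphrased, not the exact PR code; the return type and body are assumptions based on the excerpts above):

	// computeIfAbsent-style helper: the closure defers the potentially
	// expensive connection creation until, while holding the lock, we know
	// no attempt is being tracked for this node yet.
	func (rpcCtx *Context) loadOrCreateConnAttempt(
		nodeID roachpb.NodeID, createConnFunc func() *Connection,
	) *Connection {
		rpcCtx.dialbackMu.Lock()
		defer rpcCtx.dialbackMu.Unlock()
		if conn, ok := rpcCtx.dialbackMu.m[nodeID]; ok && conn != nil {
			return conn // reuse the tracked attempt
		}
		conn := createConnFunc() // only invoked if absent, under the lock
		rpcCtx.dialbackMu.m[nodeID] = conn
		return conn
	}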
pkg/rpc/context.go
line 2725 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
On second thought, I guess that's covered by the explicit check for active connections in VerifyDialback, which also avoids leaking this stuff all over rpcContext. Might be nice to call that out.
Added comments about this.
pkg/cmd/roachtest/tests/failover.go
line 0 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Why remove this? This puts blackhole nodes into this weird state where they can receive TCP packets from remote port 26257 and send packets from local port 26257, but not the other way around (recall that client ports are randomly chosen). While that's still going to result in a useless TCP connection since packets can't be acked, the common case of a network outage is that all packets are dropped, and we should have coverage for that case.

We could arguably change the input/output cases below to drop packets in both directions (still maintaining connection asymmetry), which would have the same effect. However, asymmetric partitions are often caused by an incorrect firewall rule that drops packets in one direction, so it might be useful to cover that scenario. That seems less important than covering the common scenario of a total outage, though.
I pulled this change out of the commit. I think it wasn't changing anything, but I don't want to change the code and the test at the same time. I'll pull it out into a separate PR that doesn't need to be backported.
Force-pushed b52c62f to 2c88c13
Nice!
Reviewed 2 of 4 files at r1, all commit messages.
pkg/rpc/context.go
line 2703 at r1 (raw file):
Previously, andrewbaptist (Andrew Baptist) wrote…
This was to allow us to defer creating the connection until we know we need to, but also do it under the lock. It is similar to Java's computeIfAbsent function. I could also just inline this method back in, but keeping it separate made it clearer where the locks are held.
I see. We usually do this by suffixing the method name, e.g. loadOrCreateConnAttemptLocked() or loadOrCreateConnAttemptDialbackMuLocked(); see e.g. the plethora of RaftMuLocked() methods.
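A hypothetical illustration of that convention (names and signature are for the example only, not the PR's actual code):

	// loadOrCreateConnAttempt acquires dialbackMu and delegates to the
	// Locked variant, whose suffix encodes its locking requirement.
	func (rpcCtx *Context) loadOrCreateConnAttempt(nodeID roachpb.NodeID) *Connection {
		rpcCtx.dialbackMu.Lock()
		defer rpcCtx.dialbackMu.Unlock()
		return rpcCtx.loadOrCreateConnAttemptDialbackMuLocked(nodeID)
	}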
pkg/cmd/roachtest/tests/failover.go
line 0 at r1 (raw file):
I think it was not changing anything
It sort of does, at least technically. Consider two nodes A and B, with a (randomly picked) client port 16384 and server port 26257. We'll have TCP packets going in four directions, across two TCP connections:
- A:16384 → B:26257
- A:16384 ← B:26257
- A:26257 → B:16384
- A:26257 ← B:16384
Consider the existing rules, applied on B:
iptables -A INPUT -m multiport -p tcp --ports 26257 -j DROP
iptables -A OUTPUT -m multiport -p tcp --ports 26257 -j DROP
The first rule blocks 1 and 3, and the second rule blocks 2 and 4, because --ports matches both the source port and the destination port. Now consider the suggested rules:
iptables -A INPUT -p tcp --dport 26257 -j DROP
iptables -A OUTPUT -p tcp --dport 26257 -j DROP
This only blocks 1 and 4, allowing packets through for 2 and 3.
As I said, this probably doesn't matter all that much since the TCP connections won't work (SYNs and ACKs don't get through), but on the off chance that stray packets can affect the TCP stack behavior somehow I'd rather cover the common outage scenario where all packets are dropped.
Force-pushed 2c88c13 to 53c94ad
bors r=erikgrinaker
Build succeeded.
Fixes: #99104
Informs #84289
As part of the previous fix for partition handling, we tracked the state of a previous attempt and used that result on the next attempt. However, if there were multiple connections, we might block only system-class traffic connections and not default-class connections. This change addresses that by ensuring a failed dialback attempt is remembered until we are able to successfully connect back to the pinging node.
Epic: none
Release note: None