kv: convert uni-directional network partitions to bi-directional #94778
Conversation
andrewbaptist commented on Jan 5, 2023 (edited)
Force-pushed from 99ab197 to 9003190
Force-pushed from 77d9bfe to dfc5f86
pkg/rpc/context.go (outdated)

    // We want this connection to block on connection so we verify it
    // succeeds.
    dialOpts = append(dialOpts, grpc.WithBlock())
    conn, err := grpc.DialContext(dialCtx, target, dialOpts...)
As we discussed elsewhere, we should go via grpcDialNodeInternal() such that we'll retain this connection for future use and join any in-flight connection attempts.
For an initial inbound connection, this will naïvely result in a circular dependency, since we'll be attempting to ping each other in both directions with each side waiting on the other's ping response. However, for the dialback connection attempt we don't have to wait for the ping response itself; we can simply wait for the underlying gRPC connection to be established asynchronously, e.g. via ClientConn.WaitForStateChange.
We may want to limit this to only initial ping attempts, such that we later verify that we're actually receiving pings in both directions too, but this is a lesser concern.
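A minimal Go sketch of the asynchronous wait described above, using only the public gRPC API (grpc.DialContext, ClientConn.Connect, GetState, WaitForStateChange). The target address, dial options, and timeout are placeholders; this is not the PR's dialback code.

```go
package main

import (
	"context"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
	"google.golang.org/grpc/credentials/insecure"
)

// waitForReady kicks off the connection attempt and waits for the ClientConn
// to reach READY without issuing any RPC (and thus without waiting on a ping
// response). It returns the context error on timeout or cancellation.
func waitForReady(ctx context.Context, conn *grpc.ClientConn) error {
	conn.Connect() // move out of IDLE without sending an RPC
	for {
		state := conn.GetState()
		if state == connectivity.Ready {
			return nil
		}
		// WaitForStateChange blocks until the state is no longer `state`,
		// or returns false once ctx is done.
		if !conn.WaitForStateChange(ctx, state) {
			return ctx.Err()
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	// Placeholder target and insecure credentials for illustration only.
	conn, err := grpc.DialContext(ctx, "127.0.0.1:26257",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return
	}
	defer conn.Close()
	_ = waitForReady(ctx, conn)
}
```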
pkg/server/server.go (outdated)

    @@ -339,6 +339,18 @@ func NewServer(cfg Config, stopper *stop.Stopper) (*Server, error) {
    	nodeDialer := nodedialer.NewWithOpt(rpcContext, gossip.AddressResolver(g),
    		nodedialer.DialerOpt{TestingKnobs: dialerKnobs})

    	// This is somewhat tangled due to the need to use gossip to resolve the
    	// address. If we are not able to resolve, we simply skip this verification.
    	rpcContext.PingResolver = func(ctx context.Context, nodeID roachpb.NodeID) string {
We can avoid this tangle by not passing rpcContext in to gossip.New. Gossip only uses the RPC context once it starts making outbound connections, and that only happens once we call Gossip.Start() far below here. The cleanest approach is probably to remove the Gossip.rpcContext field entirely and thread the RPC context via Gossip.Start() down into bootstrap() and manage(), but alternatively we can set the rpcContext field during Gossip.Start() as long as we properly synchronize access to it. This should preferably be done as a separate commit.
When we do that, we can set up gossip before the RPC context and just pass a nodedialer.AddressResolver to rpc.NewContext() (if we can get away with it without any dependency cycles; alternatively, use some other corresponding type).
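A self-contained sketch of the dependency direction suggested here: the RPC context holds only an address-resolver closure rather than the gossip object. NodeID, AddressResolver, RPCContext, and Gossip below are simplified stand-ins, not the actual CockroachDB types.

```go
package main

import (
	"fmt"
	"net"
)

type NodeID int32

// AddressResolver maps a node ID to a network address.
type AddressResolver func(NodeID) (net.Addr, error)

// RPCContext depends only on the resolver function, not on gossip itself.
type RPCContext struct {
	resolve AddressResolver // used by dialback verification
}

func NewRPCContext(resolve AddressResolver) *RPCContext {
	return &RPCContext{resolve: resolve}
}

type Gossip struct {
	addrs map[NodeID]net.Addr
}

// Resolver returns a closure the RPC context can hold without depending on
// the whole Gossip object.
func (g *Gossip) Resolver() AddressResolver {
	return func(id NodeID) (net.Addr, error) {
		addr, ok := g.addrs[id]
		if !ok {
			return nil, fmt.Errorf("no address for n%d", id)
		}
		return addr, nil
	}
}

func main() {
	g := &Gossip{addrs: map[NodeID]net.Addr{
		2: &net.TCPAddr{IP: net.IPv4(127, 0, 0, 1), Port: 26258},
	}}
	rpcCtx := NewRPCContext(g.Resolver())
	if addr, err := rpcCtx.resolve(2); err == nil {
		fmt.Println("dialback target for n2:", addr)
	}
}
```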
pkg/rpc/context.go (outdated)

    func (rpcCtx *Context) VerifyDialback(ctx context.Context, nodeID roachpb.NodeID) error {
    	log.Errorf(ctx, "Verifying health of connection to n%d", nodeID)
    	if nodeID == 0 {
    		// FIXME: at startup, the nodeID might not be set. Unfortunately
This only happens when bootstrapping a new node, right? Once the node has joined a cluster, this will always be set even on the initial connection attempts? If so, this seems totally fine, since the node can't participate in consensus or acquire any leases before it's allocated a node ID.
pkg/rpc/heartbeat.go (outdated)

    // established. If there is already a healthy connection set up, it will
    // simply return immediately, however if not, it attempts to establish a
    // "dummy" connection which is never used to send messages on.
    verifyDialback func(context.Context, roachpb.NodeID) error
We already have onHandlePing, which effectively does the same thing; should we hook into that instead? We can't go via ContextOptions.OnIncomingPing because that's necessarily constructed before the RPC context itself, but we can inject a handler during RPC context construction before setting onHandlePing. Not particularly important, and I'm not sure if it's really making things more clear or less, so take it or leave it.
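A rough sketch of what injecting a dialback check ahead of an existing ping handler could look like. The PingRequest, pingHandler, and withDialbackCheck names are illustrative placeholders, not the real rpc package API.

```go
package main

import (
	"context"
	"fmt"
)

// PingRequest and pingHandler are simplified placeholders.
type PingRequest struct{ OriginNodeID int32 }

type pingHandler func(context.Context, *PingRequest) error

// withDialbackCheck wraps an existing handler so the dialback verification
// runs first; the original handler is still invoked afterwards.
func withDialbackCheck(next pingHandler, verify func(context.Context, int32) error) pingHandler {
	return func(ctx context.Context, req *PingRequest) error {
		if err := verify(ctx, req.OriginNodeID); err != nil {
			return err
		}
		if next != nil {
			return next(ctx, req)
		}
		return nil
	}
}

func main() {
	base := func(ctx context.Context, req *PingRequest) error {
		fmt.Printf("handling ping from n%d\n", req.OriginNodeID)
		return nil
	}
	verify := func(ctx context.Context, nodeID int32) error { return nil }
	handler := withDialbackCheck(base, verify)
	_ = handler(context.Background(), &PingRequest{OriginNodeID: 2})
}
```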
Force-pushed from 4578413 to 8d4965b
Force-pushed from 84fff6b to 7467c6f
Thanks for getting this across the line, I appreciate it was a longer slog than expected.
We should mark this as resolving #84289, and put the GA blocker label on that issue (labels on PRs don't count).
Reviewed 1 of 4 files at r5, 10 of 10 files at r14, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist)
pkg/rpc/context.go
line 2589 at r6 (raw file):
Previously, andrewbaptist (Andrew Baptist) wrote…
Changed to wait until the dialback connection is completed before verifying, rather than just one ping later.
Thanks, appreciated -- blocking pings didn't really buy us anything, and only had downsides, so I think this is much better.
As for the RTT considerations, if we do want to drop the ping timeout, we would likely differentiate the initial ping timeout and subsequent timeouts to account for the additional dial RTTs, since we only have to account for the additional RTTs on the blocking ping. That would allow us to aggressively lower the ping timeout on low-latency (regional) clusters. There's also head-of-line blocking to consider.
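A small illustration of the timeout split mentioned here: a larger budget for the initial ping, which also pays the extra dial and dialback round trips, and a tighter one for subsequent pings. The constants and function are made up for the example.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical timeout values; real settings would be tuned separately.
const (
	initialPingTimeout    = 6 * time.Second        // absorbs extra dial + dialback RTTs
	subsequentPingTimeout = 1500 * time.Millisecond // can be aggressive on low-latency clusters
)

// pingTimeout picks a budget depending on whether this is the first,
// blocking heartbeat on a connection or a steady-state ping.
func pingTimeout(initialHeartbeat bool) time.Duration {
	if initialHeartbeat {
		return initialPingTimeout
	}
	return subsequentPingTimeout
}

func main() {
	fmt.Println("initial:", pingTimeout(true), "subsequent:", pingTimeout(false))
}
```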
pkg/rpc/context.go
line 2249 at r14 (raw file):
    // so we ignore it.
    err := rpcCtx.runHeartbeat(ctx, conn, target)
    log.Health.Infof(ctx, "connection heartbeat loop ended with err: %v", err)
nit: will this result in duplicate logging? The error is stored in conn.err and propagated to callers via Connect() and Health(), where they will presumably log it or propagate it further up the stack. If we should log it, shouldn't it be logged as an error?
pkg/rpc/context.go
line 2297 at r14 (raw file):
"rpc.dialback.enabled", "if true, require bidirectional RPC connections between nodes to prevent one-way network unavailability", true,
Should this be TenantReadOnly? I think we generally default to that unless we have a very good reason to use SystemOnly, since changes otherwise aren't visible to tenants and they'll always use the default value regardless of the host's value.
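For reference, a hedged sketch of what the registration might look like with the class switched; the variable name is a placeholder, and the exact settings.RegisterBoolSetting signature should be checked against the settings package in this release.

```go
// Hedged sketch only: the variable name is a placeholder and the
// RegisterBoolSetting signature may differ between releases.
package rpc

import "github.com/cockroachdb/cockroach/pkg/settings"

var enableRPCDialback = settings.RegisterBoolSetting(
	settings.TenantReadOnly, // visible to tenants; only the system tenant can change it
	"rpc.dialback.enabled",
	"if true, require bidirectional RPC connections between nodes to prevent one-way network unavailability",
	true,
)
```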
pkg/rpc/context.go
line 2578 at r14 (raw file):
    // not the node. In that case, we can't look up if we have a connection to the
    // node and instead need to always try dialback.
    var err error
nit: we can drop this declaration and use a locally scoped err := inside the branch, since we don't need to keep the result around for later.
pkg/rpc/context.go
line 2600 at r14 (raw file):
    // Clear out the ServerTime on our response so the receiver does not use
    // this response in its latency calculations.
    response.ServerTime = 0
Did you consider checking this on the sender side instead, where we update the clock/latency measurements? They're only affected when we set BLOCKING on the request, and this is controlled by the sender, so it can locally choose to ignore the measurement in that case. That avoids plumbing through the response here, keeps the logic in one place, and also avoids any subtle ordering dependencies where e.g. the OnPing has to be called after ServerTime has been populated.
Also, in clusters with clock skew, this will prevent us from detecting the skew on the initial connection attempt, thus we'll run with known clock skew for up to several seconds, potentially violating linearizability in the meanwhile. Maybe this isn't that big of a deal since these checks are always going to be best-effort anyway, but it seems a bit unfortunate that we're relaxing clock skew protections here. Do you have any thoughts on this? Afaict, we're forced to relax either clock skew protection or asymmetric partition detection here.
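A simplified sketch of the sender-side alternative: drop the clock-offset/latency sample when the outgoing ping requested blocking dialback, since the dialback inflates the measured round trip. All types below are stand-ins for illustration, not the rpc package's.

```go
package main

import (
	"fmt"
	"time"
)

// Simplified stand-ins for the request/response and the dialback mode.
type dialbackMode int

const (
	dialbackNone dialbackMode = iota
	dialbackBlocking
)

type pingResult struct {
	mode       dialbackMode  // what the sender asked for
	serverTime time.Time     // remote clock reading from the response
	rtt        time.Duration // measured round trip
}

// maybeUpdateClock applies the measurement only when the ping was not a
// blocking dialback ping, keeping the decision entirely on the sender.
func maybeUpdateClock(res pingResult, update func(serverTime time.Time, rtt time.Duration)) {
	if res.mode == dialbackBlocking {
		// The blocking dialback added extra RTTs; the sample would skew
		// latency and offset estimates, so drop it.
		return
	}
	update(res.serverTime, res.rtt)
}

func main() {
	update := func(t time.Time, rtt time.Duration) {
		fmt.Println("recording offset sample, rtt =", rtt)
	}
	maybeUpdateClock(pingResult{mode: dialbackBlocking, rtt: 40 * time.Millisecond}, update)
	maybeUpdateClock(pingResult{mode: dialbackNone, serverTime: time.Now(), rtt: 2 * time.Millisecond}, update)
}
```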
pkg/rpc/context.go
line 2645 at r14 (raw file):
    previousAttempt := rpcCtx.dialbackMu.m[nodeID]
    // Block here for the previous connection to be completed (successfully or
nit: comment seems outdated, we don't block here.
pkg/rpc/context.go
line 2652 at r14 (raw file):
    if previousAttempt != nil {
    	select {
    	case <-previousAttempt.initialHeartbeatDone:
nit: isn't this exactly the same logic as Health(), where the default branch corresponds to ErrNotHeartbeated?
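For comparison, the pattern the nit refers to in a self-contained form: a non-blocking select on the initial-heartbeat channel, where the default branch means the connection has not heartbeated yet. Names are illustrative, not the rpc package's.

```go
package main

import (
	"errors"
	"fmt"
)

var errNotHeartbeated = errors.New("connection not yet heartbeated")

type connAttempt struct {
	initialHeartbeatDone chan struct{} // closed once the first heartbeat finishes
	err                  error         // set before the channel is closed on failure
}

// health mirrors the shape of a Health()-style check: done and failed,
// done and healthy, or not heartbeated yet (the default branch).
func health(c *connAttempt) error {
	select {
	case <-c.initialHeartbeatDone:
		return c.err
	default:
		return errNotHeartbeated
	}
}

func main() {
	pending := &connAttempt{initialHeartbeatDone: make(chan struct{})}
	fmt.Println(health(pending)) // connection not yet heartbeated

	done := &connAttempt{initialHeartbeatDone: make(chan struct{})}
	close(done.initialHeartbeatDone)
	fmt.Println(health(done)) // <nil>
}
```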
pkg/rpc/heartbeat.proto
line 74 at r3 (raw file):
Previously, andrewbaptist (Andrew Baptist) wrote…
This works fine with mixed-version clusters and CLIs from different versions. I will do one more pass to make sure I didn't miss anything, but it seems best to not add a new flag.
Sure. I suppose we could use a version gate whenever we add an enum value instead.
pkg/rpc/context_test.go
line 2618 at r14 (raw file):
    rpcCtx := NewContext(context.Background(), opts)
    // This is normally set up inside the server, we want to hold onto all
    // PingRequests that come through.
    rpcCtx.OnIncomingPing = func(ctx context.Context, req *PingRequest, resp *PingResponse) error {
So we basically rely on failover/*/blackhole-(recv|send) to verify that this works in a real cluster, yeah? Do we feel like that coverage is sufficient, or should we wire up some rudimentary integration tests using a stock TestCluster too?
pkg/rpc/context_test.go
line 2715 at r14 (raw file):
    // Forcibly shut down listener 2 and the connection node1 -> node2.
    // Verify the reverse connection will also close within a DialTimeout.
nit: DialTimeout doesn't come into play here because the closed listener will respond with an immediate TCP RST packet. I'm not sure we can easily test DialTimeout here since we have to fiddle with the OS TCP stack.
Force-pushed from 7467c6f to 0249427
TFTR - I updated the PR and comments also. I'm rerunning all the blackhole tests one more time with the final code and will push once that is done and successful.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/rpc/context.go
line 2589 at r6 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Thanks, appreciated -- blocking pings didn't really buy us anything, and only had downsides, so I think this is much better.
As for the RTT considerations, if we do want to drop the ping timeout, we would likely differentiate the initial ping timeout and subsequent timeouts to account for the additional dial RTTs, since we only have to account for the additional RTTs on the blocking ping. That would allow us to aggressively lower the ping timeout on low-latency (regional) clusters. There's also head-of-line blocking to consider.
Thanks, I agree this is better. I was worried about the timeouts on connection attempts, but it seems OK since we have a couple of failsafes in place. I'll add a note about this as well.
pkg/rpc/context.go
line 2249 at r14 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: will this result in duplicate logging? The error is stored in conn.err and propagated to callers via Connect() and Health(), where they will presumably log it or propagate it further up the stack. If we should log it, shouldn't it be logged as an error?
It is duplicate, but it is better to know when this fails immediately rather than waiting for a future (non-heartbeat) message. Granted, future messages typically come soon, but there is no guarantee when they actually happen. I left it at Info since the connection establishment is also logged at Info and I wanted to be symmetric with that.
pkg/rpc/context.go
line 2297 at r14 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Should this be TenantReadOnly? I think we generally default to that unless we have a very good reason to use SystemOnly, since changes otherwise aren't visible to tenants and they'll always use the default value regardless of the host's value.
Updated to TenantReadOnly.
pkg/rpc/context.go
line 2578 at r14 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: we can drop this declaration and use a locally scoped err := inside the branch, since we don't need to keep the result around for later.
Done
pkg/rpc/context.go
line 2600 at r14 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Did you consider checking this on the sender side instead, where we update the clock/latency measurements? They're only affected when we set BLOCKING on the request, and this is controlled by the sender, so it can locally choose to ignore the measurement in that case. That avoids plumbing through the response here, keeps the logic in one place, and also avoids any subtle ordering dependencies where e.g. the OnPing has to be called after ServerTime has been populated.
Also, in clusters with clock skew, this will prevent us from detecting the skew on the initial connection attempt, thus we'll run with known clock skew for up to several seconds, potentially violating linearizability in the meanwhile. Maybe this isn't that big of a deal since these checks are always going to be best-effort anyway, but it seems a bit unfortunate that we're relaxing clock skew protections here. Do you have any thoughts on this? Afaict, we're forced to relax either clock skew protection or asymmetric partition detection here.
Good idea! I changed to check on the sender side only and removed zeroing it out here.
pkg/rpc/context.go
line 2645 at r14 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: comment seems outdated, we don't block here.
Updated comment
pkg/rpc/context.go
line 2652 at r14 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: isn't this exactly the same logic as Health(), where the default branch corresponds to ErrNotHeartbeated?
It was very close to calling Health and I considered that, but it needed to handle dialbackMu.m differently, and by the time I had handled all the different cases it was basically harder to read than just copying the checks here. Two of the three "branches" do something different, so it really became easier to just reimplement this rather than using Health.
pkg/rpc/heartbeat.proto
line 74 at r3 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Sure. I suppose we could use a version gate whenever we add an enum value instead.
I left it for now; I don't expect we will add new values, and it will probably require a version gate if we ever do. Leaving the default value as 0 simplified things a little.
pkg/rpc/context_test.go
line 2618 at r14 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
So we basically rely on failover/*/blackhole-(recv|send) to verify that this works in a real cluster, yeah? Do we feel like that coverage is sufficient, or should we wire up some rudimentary integration tests using a stock TestCluster too?
I'm going to add a TODO to wire up a test for this. I don't want to hold up merging this PR though. I'll work on this later this week.
pkg/rpc/context_test.go
line 2715 at r14 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: DialTimeout doesn't come into play here because the closed listener will respond with an immediate TCP RST packet. I'm not sure we can easily test DialTimeout here since we have to fiddle with the OS TCP stack.
Updated the comment.
bors r=erikgrinaker
Build failed (retrying...)
Build failed
Fixes: cockroachdb#84289

Previously, one-way partitions, where a node could initiate a successful TCP connection in one direction but the reverse connection failed, caused problems. The node that initiates outgoing connections can acquire leases and cause failures for reads and writes to those ranges. This is particularly a problem if it acquires the liveness range lease, but it is a problem even for other ranges.

This commit adds an additional check during server-to-server communication where the recipient of a new PingRequest first validates that it is able to open a reverse connection to the initiator before responding. Additionally, it will monitor whether it has a successful reverse connection over time and asynchronously validate reverse connections to the sender. The ongoing validation is asynchronous to avoid adding delays to PingResponses, as they are used for measuring clock offsets.

Also, the onlyOnceDialer prevents retrying after a dial error; however, it can get into a state where it continually retries for certain network connections. This is not easy to reproduce in a unit test as it requires killing the connection using iptables (normal closes don't cause this). After this change, the onlyOnceDialer will no longer repeatedly retry to reconnect after a broken connection during setup.

Release note (bug fix): RPC connections between nodes now require RPC connections to be established in both directions, otherwise the connection will be closed. This is done to prevent asymmetric network partitions where nodes are able to send outbound messages but not receive inbound messages, which could result in persistent unavailability. This behavior can be disabled by the cluster setting rpc.dialback.enabled.

Epic: CRDB-2488
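A compressed sketch of the flow the commit message describes: on receiving a ping, the server verifies a reverse connection (synchronously for the initial ping, asynchronously afterwards) before acknowledging the sender. All types and fields below are simplified stand-ins rather than the actual PingRequest handler.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

type nodeID int32

type pingRequest struct {
	origin        nodeID
	needsDialback bool // set on the first ping of a connection
}

type server struct {
	// reverseHealthy reports whether we already have a healthy connection
	// back to the origin node; dialBack attempts to establish one.
	reverseHealthy func(nodeID) bool
	dialBack       func(context.Context, nodeID) error
}

// handlePing refuses to acknowledge the sender unless a reverse connection
// exists or can be established, turning a one-way partition into a two-way
// failure that the rest of the system already knows how to handle.
func (s *server) handlePing(ctx context.Context, req *pingRequest) error {
	if s.reverseHealthy(req.origin) {
		return nil // already verified, respond immediately
	}
	if req.needsDialback {
		// Initial ping: verify synchronously before responding.
		return s.dialBack(ctx, req.origin)
	}
	// Subsequent pings: verify asynchronously so ping latency (used for
	// clock-offset measurement) is not inflated.
	go func() { _ = s.dialBack(context.Background(), req.origin) }()
	return nil
}

func main() {
	s := &server{
		reverseHealthy: func(nodeID) bool { return false },
		dialBack: func(ctx context.Context, id nodeID) error {
			return errors.New("reverse connection refused")
		},
	}
	err := s.handlePing(context.Background(), &pingRequest{origin: 2, needsDialback: true})
	fmt.Println("initial ping:", err)
}
```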
Force-pushed from 0249427 to 5800482
bors r=erikgrinaker
Build succeeded
This was added recently, in cockroachdb#94778, and contributes to log spam of the following sort:

    I230404 15:00:33.826337 2400 rpc/context.go:2249 [T1,n1,rnode=2,raddr=127.0.0.1:55941,class=default,rpc] 268 connection heartbeat loop ended with err: <nil>
    I230404 15:00:33.826338 3986 rpc/context.go:2249 [T1,n2,rnode=3,raddr=127.0.0.1:55955,class=system,rpc] 269 connection heartbeat loop ended with err: <nil>
    I230404 15:00:33.826367 3455 rpc/context.go:2249 [T1,n2,rnode=3,raddr=127.0.0.1:55955,class=default,rpc] 270 connection heartbeat loop ended with err: <nil>
    I230404 15:00:33.826394 3354 rpc/context.go:2249 [T1,n2,rnode=2,raddr=127.0.0.1:55941,class=default,rpc] 271 connection heartbeat loop ended with err: <nil>

Release note: None
@andrewbaptist I'm making a small docs update based on this PR. Can you confirm which versions this is in? I checked which tags contain this PR's commit SHA, but let me know if I'm holding it wrong. If that's correct, it looks like it's in everything v23.1+ at this point, but please correct me if I'm wrong.
Ah, I just found it in the release notes as well for v23.1.0: https://www.cockroachlabs.com/docs/releases/v23.1.html#v23-1-0-cluster-settings. I guess that answers my question, sorry for the noise.