Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: convert uni-directional network partitions to bi-directional #94778

Merged
merged 1 commit into from
Mar 9, 2023

Conversation

andrewbaptist
Copy link
Collaborator

@andrewbaptist andrewbaptist commented Jan 5, 2023

Previously one-way partitions where a node could initiate a successful TCP
connection in one direction, but the reverse connection fails which causes
problems. The node that initiates outgoing connections can acquire
leases and cause failures for reads and writes to those ranges. This is
particularly a problem if it acquires the liveness range leases, but is
a problem even for other ranges.

This commit adds an additional check during server-to-server
communication where the recipient of a new PingRequest first validates
that it is able to open a reverse connection to the initiator before
responding. Additionally, it will monitor whether it has a successful
reverse connection over time and asynchronously validate reverse
connections to the sender. The ongoing validation is asynchronous to
avoid adding delays to PingResponses as they are used for measuring
clock offsets.

Release note (bug fix): Detects and addresses one-way partitions.
Epic: none

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@andrewbaptist andrewbaptist force-pushed the 221228.network-partition branch from 99ab197 to 9003190 Compare January 5, 2023 19:18
@andrewbaptist andrewbaptist linked an issue Jan 5, 2023 that may be closed by this pull request
@andrewbaptist andrewbaptist force-pushed the 221228.network-partition branch 16 times, most recently from 77d9bfe to dfc5f86 Compare January 10, 2023 17:49
// We want this connection to block on connection so we verify it
// succeeds.
dialOpts = append(dialOpts, grpc.WithBlock())
conn, err := grpc.DialContext(dialCtx, target, dialOpts...)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed elsewhere, we should go via grpcDialNodeInternal() such that we'll retain this connection for future use and join any in-flight connection attempts.

For an initial inbound connection, this will naïvely result in a circular dependency since we'll be attempting to ping each other in either direction, both waiting on the other's ping response. However, for the dialback connection attempt we don't have to wait for the ping response itself, we can simply wait for the underlying gRPC connection to be established asynchronously, e.g. via ClientConn.WaitForStateChange.

We may want to limit this to only initial ping attempts, such that we later verify that we're actually receiving pings in both directions too, but this is a lesser concern.

@@ -339,6 +339,18 @@ func NewServer(cfg Config, stopper *stop.Stopper) (*Server, error) {
nodeDialer := nodedialer.NewWithOpt(rpcContext, gossip.AddressResolver(g),
nodedialer.DialerOpt{TestingKnobs: dialerKnobs})

// This is somewhat tangled due to the need to use gossip to resolve the
// address. If we are not able to resolve, we simply skip this verification.
rpcContext.PingResolver = func(ctx context.Context, nodeID roachpb.NodeID) string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can avoid this tangle by not passing rpcContext in to gossip.New. Gossip only uses the RPC context once it starts making outbound connections, and that only happens once we call Gossip.Start() far below here. The cleanest approach is probably to remove the Gossip.rpcContext field entirely, and thread the RPC context via Gossip.Start() down into bootstrap() and manage(), but alternatively we can set the rpcContext field during Gossip.Start() as long as we properly synchronize access to it. This should preferably be done as a separate commit.

When we do that, we can set up gossip before the RPC context and just pass a nodedialer.AddressResolver to rpc.NewContext() (if we get away with it without any dependency cycles, alternatively use some other corresponding type).

func (rpcCtx *Context) VerifyDialback(ctx context.Context, nodeID roachpb.NodeID) error {
log.Errorf(ctx, "Verifying health of connection to n%d", nodeID)
if nodeID == 0 {
//FIXME: at startup, the nodeID might not be set. Unfortunately
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only happens when bootstrapping a new node, right? Once the node has joined a cluster, this will always be set even on the initial connection attempts? If so, this seems totally fine, since the node can't participate in consensus or acquire any leases before it's allocated a node ID.

// established. If there is already a healthy connection set up, it will
// simply return immediately, however if not, it attempts to establish a
// "dummy" connection which is never used to send messages on.
verifyDialback func(context.Context, roachpb.NodeID) error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have onHandlePing which effectively does the same thing, should we hook into that instead? We can't go via ContextOptions.OnIncomingPing because that's necessarily constructed before the RPC context itself, but we can inject a handler during RPC context construction before setting onHandlePing. Not particularly important, and I'm not sure if it's really making things more clear or less, so take it or leave it.

@andrewbaptist andrewbaptist force-pushed the 221228.network-partition branch 9 times, most recently from 4578413 to 8d4965b Compare January 31, 2023 15:19
@andrewbaptist andrewbaptist force-pushed the 221228.network-partition branch 2 times, most recently from 84fff6b to 7467c6f Compare March 7, 2023 22:30
Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for getting this across the line, I appreciate it was a longer slog than expected.

We should mark this as resolving #84289, and put the GA blocker label on that issue (labels on PRs don't count).

Reviewed 1 of 4 files at r5, 10 of 10 files at r14, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist)


pkg/rpc/context.go line 2589 at r6 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

Changed to wait until the dialback connection is completed before verified rather than just one ping later.

Thanks, appreciated -- blocking pings didn't really buy us anything, and only had downsides, so I think this is much better.

As for the RTT considerations, if we do want to drop the ping timeout, we would likely differentiate the initial ping timeout and subsequent timeouts to account for the additional dial RTTs, since we only have to account for the additional RTTs on the blocking ping. That would allow us to aggressively lower the ping timeout on low-latency (regional) clusters. There's also head-of-line blocking to consider.


pkg/rpc/context.go line 2249 at r14 (raw file):

			// so we ignore it.
			err := rpcCtx.runHeartbeat(ctx, conn, target)
			log.Health.Infof(ctx, "connection heartbeat loop ended with err: %v", err)

nit: will this result in duplicate logging? The error is stored in conn.err and propagated to callers via Connect() and Health() where they will presumably log it or propagate it further up the stack. If we should log it, shouldn't it be logged as an error?


pkg/rpc/context.go line 2297 at r14 (raw file):

	"rpc.dialback.enabled",
	"if true, require bidirectional RPC connections between nodes to prevent one-way network unavailability",
	true,

Should this be TenantReadOnly? I think we generally default to that unless we have a very good reason to use SystemOnly, since changes otherwise aren't visible to tenants and they'll always use the default value regardless of the host's value.


pkg/rpc/context.go line 2578 at r14 (raw file):

	// not the node. In that case, we can't look up if we have a connection to the
	// node and instead need to always try dialback.
	var err error

nit: we can drop this declaration and use a locally scoped err := inside the branch, since we don't need to keep the result around for later.


pkg/rpc/context.go line 2600 at r14 (raw file):

		// Clear out the ServerTime on our response so the receiver does not use
		// this response in its latency calculations.
		response.ServerTime = 0

Did you consider checking this on the sender side instead, where we update the clock/latency measurements? They're only affected when we set BLOCKING on the request, and this is controlled by the sender, so it can locally choose to ignore the measurement in that case. That avoids plumbing through the response here, keeps the logic in one place, and also avoids any subtle ordering dependencies where e.g. the OnPing has to be called after ServerTime has been populated.

Also, in clusters with clock skew, this will prevent us from detecting the skew on the initial connection attempt, thus we'll run with known clock skew for up to several seconds, potentially violating linearizability in the meanwhile. Maybe this isn't that big of a deal since these checks are always going to be best-effort anyway, but it seems a bit unfortunate that we're relaxing clock skew protections here. Do you have any thoughts on this? Afaict, we're forced to relax either clock skew protection or asymmetric partition detection here.


pkg/rpc/context.go line 2645 at r14 (raw file):

	previousAttempt := rpcCtx.dialbackMu.m[nodeID]

	// Block here for the previous connection to be completed (successfully or

nit: comment seems outdated, we don't block here.


pkg/rpc/context.go line 2652 at r14 (raw file):

	if previousAttempt != nil {
		select {
		case <-previousAttempt.initialHeartbeatDone:

nit: isn't this exactly the same logic as Health(), where the default branch corresponds to ErrNotHeartbeated?


pkg/rpc/heartbeat.proto line 74 at r3 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

This works fine with mixed-version clusters and CLIs from different versions. I will do one more pass to make sure I didn't miss anything, but it seems best to not add a new flag.

Sure. I suppose we could use a version gate whenever we add an enum value instead.


pkg/rpc/context_test.go line 2618 at r14 (raw file):

	rpcCtx := NewContext(context.Background(), opts)
	// This is normally set up inside the server, we want to hold onto all PingRequests that come through.
	rpcCtx.OnIncomingPing = func(ctx context.Context, req *PingRequest, resp *PingResponse) error {

So we basically rely on failover/*/blackhole-(recv|send) to verify that this work in a real cluster, yeah? Do we feel like that coverage is sufficient, or should we wire up some rudimentary integration tests using a stock TestCluster too?


pkg/rpc/context_test.go line 2715 at r14 (raw file):

	// Forcibly shut down listener 2 and the connection node1 -> node2.
	// Verify the reverse connection will also close within a DialTimeout.

nit: DialTimeout doesn't come into play here because the closed listener will respond with an immediate TCP RST packet. I'm not sure we can easily test DialTimeout here since we have to fiddle with the OS TCP stack.

@andrewbaptist andrewbaptist force-pushed the 221228.network-partition branch from 7467c6f to 0249427 Compare March 8, 2023 15:17
Copy link
Collaborator Author

@andrewbaptist andrewbaptist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR - I updated the PR and comments also. I'm rerunning all the blackhole tests one more time with the final code and will push once that is done and successful.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)


pkg/rpc/context.go line 2589 at r6 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Thanks, appreciated -- blocking pings didn't really buy us anything, and only had downsides, so I think this is much better.

As for the RTT considerations, if we do want to drop the ping timeout, we would likely differentiate the initial ping timeout and subsequent timeouts to account for the additional dial RTTs, since we only have to account for the additional RTTs on the blocking ping. That would allow us to aggressively lower the ping timeout on low-latency (regional) clusters. There's also head-of-line blocking to consider.

Thanks, I agree this is better. I was worried about the timeouts on connection attempts, but it seems OK since we have a couple of failsafes in place. I'll add a note about this as well.


pkg/rpc/context.go line 2249 at r14 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: will this result in duplicate logging? The error is stored in conn.err and propagated to callers via Connect() and Health() where they will presumably log it or propagate it further up the stack. If we should log it, shouldn't it be logged as an error?

It is duplicate, but it is better to know when this fails immediately rather than waiting for a future (non-heartbeat) message. Granted that future messages are typically soon, but there is no guarantee when they actually happen. I left it as Info since the connection is also Info and I wanted to be symmetric with that.


pkg/rpc/context.go line 2297 at r14 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Should this be TenantReadOnly? I think we generally default to that unless we have a very good reason to use SystemOnly, since changes otherwise aren't visible to tenants and they'll always use the default value regardless of the host's value.

Update to TenantReadOnly


pkg/rpc/context.go line 2578 at r14 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: we can drop this declaration and use a locally scoped err := inside the branch, since we don't need to keep the result around for later.

Done


pkg/rpc/context.go line 2600 at r14 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Did you consider checking this on the sender side instead, where we update the clock/latency measurements? They're only affected when we set BLOCKING on the request, and this is controlled by the sender, so it can locally choose to ignore the measurement in that case. That avoids plumbing through the response here, keeps the logic in one place, and also avoids any subtle ordering dependencies where e.g. the OnPing has to be called after ServerTime has been populated.

Also, in clusters with clock skew, this will prevent us from detecting the skew on the initial connection attempt, thus we'll run with known clock skew for up to several seconds, potentially violating linearizability in the meanwhile. Maybe this isn't that big of a deal since these checks are always going to be best-effort anyway, but it seems a bit unfortunate that we're relaxing clock skew protections here. Do you have any thoughts on this? Afaict, we're forced to relax either clock skew protection or asymmetric partition detection here.

Good idea! I changed to check on the sender side only and removed zeroing it out here.


pkg/rpc/context.go line 2645 at r14 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: comment seems outdated, we don't block here.

Updated comment


pkg/rpc/context.go line 2652 at r14 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: isn't this exactly the same logic as Health(), where the default branch corresponds to ErrNotHeartbeated?

It was very close to calling Health and I considered that, but it needed to handle the dialbackMu.m differently and by time I had handled all the different cases it was basically harder to read than just copying the checks here. Two of the three "branches" do something different, so it really became easier to just reimplemnt this rather than using Healkth.


pkg/rpc/heartbeat.proto line 74 at r3 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Sure. I suppose we could use a version gate whenever we add an enum value instead.

I left for now, I don't expect we will add new values and it is probably going to require a version gate if we ever do. Leaving the default value as 0 simplified things a little.


pkg/rpc/context_test.go line 2618 at r14 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

So we basically rely on failover/*/blackhole-(recv|send) to verify that this work in a real cluster, yeah? Do we feel like that coverage is sufficient, or should we wire up some rudimentary integration tests using a stock TestCluster too?

I'm going to add a TODO to wire up a test for this. I don't want to hold up merging this PR though. I'll work on this later this week.


pkg/rpc/context_test.go line 2715 at r14 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: DialTimeout doesn't come into play here because the closed listener will respond with an immediate TCP RST packet. I'm not sure we can easily test DialTimeout here since we have to fiddle with the OS TCP stack.

Updated the comment.

@andrewbaptist
Copy link
Collaborator Author

bors r=erikgrinaker

@craig
Copy link
Contributor

craig bot commented Mar 8, 2023

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Mar 8, 2023

Build failed:

Fixes: cockroachdb#84289

Previously one-way partitions where a node could initiate a successful
TCP connection in one direction, but the reverse connection fails which
causes problems. The node that initiates outgoing connections can
acquire leases and cause failures for reads and writes to those ranges.
This is particularly a problem if it acquires the liveness range leases,
but is a problem even for other ranges.

This commit adds an additional check during server-to-server
communication where the recipient of a new PingRequest first validates
that it is able to open a reverse connection to the initiator before
responding. Additionally, it will monitor whether it has a successful
reverse connection over time and asynchronously validate reverse
connections to the sender. The ongoing validation is asynchronous to
avoid adding delays to PingResponses as they are used for measuring
clock offsets.

Also the onlyOnceDialer prevents retrying after a dial error, however
this can get into a state where it continually retries for certain
network connections. This is not easy to reproduce in a unit tests as
it requires killing the connection using iptables (normal closes don't
cauuse this).

After this change the onlyOnceDialer will no longer repeatedly retry to
reconnect after a broken connection during setup.

Release note (bug fix): RPC connections between nodes now require RPC
connections to be established in both directions, otherwise the
connection will be closed. This is done to prevent asymmetric network
partitions where nodes are able to send outbound messages but not
receive inbound messages, which could result in persistent
unavailability. This behavior can be disabled by the cluster setting
rpc.dialback.enabled.

Epic: CRDB-2488
@andrewbaptist andrewbaptist force-pushed the 221228.network-partition branch from 0249427 to 5800482 Compare March 8, 2023 21:31
@andrewbaptist
Copy link
Collaborator Author

bors r=erikgrinaker

@craig
Copy link
Contributor

craig bot commented Mar 9, 2023

Build succeeded:

@craig craig bot merged commit 4dc9e98 into cockroachdb:master Mar 9, 2023
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Apr 7, 2023
This was added recently, in cockroachdb#94778, and contributes to log spam of the
following sort:

I230404 15:00:33.826337 2400 rpc/context.go:2249  [T1,n1,rnode=2,raddr=127.0.0.1:55941,class=default,rpc] 268  connection heartbeat loop ended with err: <nil>
I230404 15:00:33.826338 3986 rpc/context.go:2249  [T1,n2,rnode=3,raddr=127.0.0.1:55955,class=system,rpc] 269  connection heartbeat loop ended with err: <nil>
I230404 15:00:33.826367 3455 rpc/context.go:2249  [T1,n2,rnode=3,raddr=127.0.0.1:55955,class=default,rpc] 270  connection heartbeat loop ended with err: <nil>
I230404 15:00:33.826394 3354 rpc/context.go:2249  [T1,n2,rnode=2,raddr=127.0.0.1:55941,class=default,rpc] 271  connection heartbeat loop ended with err: <nil>

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Apr 27, 2023
This was added recently, in cockroachdb#94778, and contributes to log spam of the
following sort:

I230404 15:00:33.826337 2400 rpc/context.go:2249  [T1,n1,rnode=2,raddr=127.0.0.1:55941,class=default,rpc] 268  connection heartbeat loop ended with err: <nil>
I230404 15:00:33.826338 3986 rpc/context.go:2249  [T1,n2,rnode=3,raddr=127.0.0.1:55955,class=system,rpc] 269  connection heartbeat loop ended with err: <nil>
I230404 15:00:33.826367 3455 rpc/context.go:2249  [T1,n2,rnode=3,raddr=127.0.0.1:55955,class=default,rpc] 270  connection heartbeat loop ended with err: <nil>
I230404 15:00:33.826394 3354 rpc/context.go:2249  [T1,n2,rnode=2,raddr=127.0.0.1:55941,class=default,rpc] 271  connection heartbeat loop ended with err: <nil>

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this pull request May 7, 2023
This was added recently, in cockroachdb#94778, and contributes to log spam of the
following sort:

I230404 15:00:33.826337 2400 rpc/context.go:2249  [T1,n1,rnode=2,raddr=127.0.0.1:55941,class=default,rpc] 268  connection heartbeat loop ended with err: <nil>
I230404 15:00:33.826338 3986 rpc/context.go:2249  [T1,n2,rnode=3,raddr=127.0.0.1:55955,class=system,rpc] 269  connection heartbeat loop ended with err: <nil>
I230404 15:00:33.826367 3455 rpc/context.go:2249  [T1,n2,rnode=3,raddr=127.0.0.1:55955,class=default,rpc] 270  connection heartbeat loop ended with err: <nil>
I230404 15:00:33.826394 3354 rpc/context.go:2249  [T1,n2,rnode=2,raddr=127.0.0.1:55941,class=default,rpc] 271  connection heartbeat loop ended with err: <nil>

Release note: None
@andrewbaptist andrewbaptist deleted the 221228.network-partition branch December 15, 2023 21:33
@rmloveland
Copy link
Collaborator

@andrewbaptist i'm making a small docs update based on this PR - can you confirm what versions this is in?

I did the following to check which tags contain this PR's commit SHA but let me know if i'm holding it wrong

if below is correct, looks like it's in everything v23.1+ at this point but please correct me if wrong

$ git tag --contains 5800482 | egrep '^v2' | egrep -v '(alpha|beta|rc)'        
v23.1.0
v23.1.1
v23.1.10
v23.1.11
v23.1.12
v23.1.13
v23.1.14
v23.1.15
v23.1.16
v23.1.17
v23.1.18
v23.1.19
v23.1.2
v23.1.20
v23.1.21
v23.1.22
v23.1.23
v23.1.24
v23.1.25
v23.1.26
v23.1.3
v23.1.4
v23.1.5
v23.1.6
v23.1.7
v23.1.8
v23.1.9
v23.2.0
v23.2.1
v23.2.10
v23.2.11
v23.2.2
v23.2.3
v23.2.4
v23.2.5
v23.2.6
v23.2.7
v23.2.8
v23.2.9
v24.1.0
v24.1.1
v24.1.2
v24.1.3
v24.1.4
v24.2.0
v24.2.1
v24.2.2

@rmloveland
Copy link
Collaborator

ah i just found it in the release notes as well for v23.1.0

https://www.cockroachlabs.com/docs/releases/v23.1.html#v23-1-0-cluster-settings

guess that answers my question, sorry for the noise

rmloveland added a commit to cockroachdb/docs that referenced this pull request Oct 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

rpc: refuse incoming connections unless dialback succeeds
5 participants