core: Check for cluster ID conflicts on handshake #20163
Review status: 0 of 13 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending. pkg/rpc/context.go, line 245 at r1 (raw file):
@tschottdorf I didn't end up guarding this with a mutex since there didn't seem to be any risk of contention. (It's only modified in Comments from Reviewable |
de05d1c to 1c26d86
Nice work, @solongordon! Reviewed 11 of 13 files at r1, 2 of 2 files at r2. pkg/cli/cli_test.go, line 343 at r2 (raw file):
This is a much less clear error message compared to what we had before. Can we find a way to make it more clear to users who don't know what the "initial heartbeat" is? pkg/cli/start.go, line 938 at r2 (raw file):
Nice catch! Isn't this also needed before the previous early return? pkg/rpc/context.go, line 440 at r2 (raw file):
Isn't pkg/rpc/context.go, line 442 at r2 (raw file):
Nit, but I think "initial connection heartbeat failed" may be a little clearer to folks less familiar with how this stuff is implemented. pkg/rpc/context.go, line 443 at r2 (raw file):
Doesn't this need to be pkg/rpc/context_test.go, line 801 at r2 (raw file):
It looks like it'll work fine right now due to the semantics of pkg/rpc/context_test.go, line 802 at r2 (raw file):
Can we be more specific and expect an error message indicating that the cluster IDs don't match? pkg/rpc/heartbeat.go, line 51 at r2 (raw file):
Why this change? It's typically more maintainable and testable to ask for just the few things that you need than to embed a larger type with tons of fields that you don't need. pkg/rpc/heartbeat.go, line 70 at r2 (raw file):
I don't believe there's ever a valid scenario where a client with a cluster ID would connect to a node without a cluster ID. We may want to be more strict about that to avoid issues when initializing a new node. I'd say we may even just want to disallow incoming connections to a node that doesn't have a cluster ID, but that would break the init command. cc @tschottdorf to double check that first statement. pkg/rpc/heartbeat.go, line 71 at r2 (raw file):
Huh, TIL that you can directly compare arrays in go. pkg/rpc/heartbeat.go, line 73 at r2 (raw file):
I expect it'd be helpful to include both IDs in this case, e.g. "client cluster ID %s doesn't match server cluster ID %s" pkg/rpc/heartbeat.proto, line 51 at r2 (raw file):
nit, but s/illegal connections/connections between nodes in different clusters/ pkg/server/server.go, line 571 at r2 (raw file):
s/engine/store/ to match the more widely-known terminology pkg/server/server.go, line 940 at r2 (raw file):
At this point we should definitely have a known cluster ID, either via a bootstrapped store or via gossip -- shouldn't we update the pkg/server/server.go, line 1055 at r2 (raw file):
Doing this after starting so much up seems wrong. Am I just mistaken or should this be done earlier? Also, how is this safe from races with incoming/outgoing connection heartbeats? If it isn't, then are we missing a test that would have caught it when run under the go race detector? Comments from Reviewable |
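For readers following along, here is a minimal sketch of the check under discussion, assuming the semantics from the PR description (a missing ID on either side is tolerated, conflicting IDs are an error). `ClusterID` and `checkClusterID` are hypothetical stand-ins, not the names used in pkg/rpc/heartbeat.go. It relies on Go arrays being directly comparable and puts both IDs in the error message, as suggested above.

```go
package main

import "fmt"

// ClusterID is a hypothetical stand-in for a 16-byte UUID. Go arrays with
// comparable element types can be compared directly with == and !=.
type ClusterID [16]byte

// nilClusterID is the zero value, used here to mean "not yet known".
var nilClusterID ClusterID

// checkClusterID returns an error only when both sides know their cluster
// ID and the two differ. A missing ID on either side is tolerated.
func checkClusterID(client, server ClusterID) error {
	if client == nilClusterID || server == nilClusterID {
		return nil
	}
	if client != server {
		return fmt.Errorf(
			"client cluster ID %x doesn't match server cluster ID %x",
			client, server)
	}
	return nil
}

func main() {
	a, b := ClusterID{1}, ClusterID{2}
	fmt.Println(checkClusterID(a, a))            // same cluster
	fmt.Println(checkClusterID(a, nilClusterID)) // server not bootstrapped yet
	fmt.Println(checkClusterID(a, b))            // conflicting clusters
}
```

A stricter variant, as floated above, would also reject a client that has a cluster ID when the server has none; the thread defers that change.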
Review status: all files reviewed at latest revision, 16 unresolved discussions, all commit checks successful. pkg/rpc/heartbeat.go, line 51 at r2 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Mostly because I needed the Comments from Reviewable |
Review status: all files reviewed at latest revision, 16 unresolved discussions, all commit checks successful. pkg/cli/start.go, line 938 at r2 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Done. pkg/rpc/context.go, line 440 at r2 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Done. pkg/rpc/context.go, line 442 at r2 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Done. pkg/rpc/context.go, line 443 at r2 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Ah, maybe. I didn't run across the deadlock in testing but it seems possible. I'm a bit worried about removing the connection asynchronously since the whole point of this is to ensure that a bad connection doesn't get handed out. Does that seem like a valid concern? pkg/rpc/context_test.go, line 802 at r2 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Done. pkg/rpc/heartbeat.go, line 70 at r2 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Interesting, I'll experiment with disallowing missing cluster IDs on the ping handler side and see what happens. pkg/rpc/heartbeat.go, line 73 at r2 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Done. pkg/rpc/heartbeat.proto, line 51 at r2 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Done. pkg/server/server.go, line 571 at r2 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Done. pkg/server/server.go, line 940 at r2 (raw file): Previously, a-robinson (Alex Robinson) wrote…
From reading the code it seemed like obtaining a cluster ID via gossip happens in pkg/server/server.go, line 1055 at r2 (raw file): Previously, a-robinson (Alex Robinson) wrote…
See above comment about the placement. This seemed like the earliest I could do it but I could easily be wrong. I didn't worry about race conditions since there are no concurrent writes going on, and the heartbeat succeeds whether this value is empty or populated. But maybe I'm missing something subtler. What race did you have in mind? I'll read up on the race detector. Comments from Reviewable |
1c26d86 to 9f932d1
Review status: 5 of 13 files reviewed at latest revision, 16 unresolved discussions, some commit checks pending. pkg/cli/cli_test.go, line 343 at r2 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Yeah, the error message does have a preamble that's less cryptic (see ~10 lines up) but I'll see if I can clean this up. Comments from Reviewable |
Review status: 5 of 13 files reviewed at latest revision, 16 unresolved discussions, some commit checks failed. pkg/cli/cli_test.go, line 343 at r2 (raw file): Previously, solongordon wrote…
Added back in the Comments from Reviewable |
e648cb5 to f9c09c2
Review status: 4 of 13 files reviewed at latest revision, 16 unresolved discussions. pkg/rpc/context_test.go, line 801 at r2 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Good idea, done! Comments from Reviewable |
Speaking of races, looks like the race detector found something it didn't like: https://teamcity.cockroachdb.com/viewLog.html?buildId=417603&buildTypeId=Cockroach_UnitTests You can run the race detector on our tests locally using the
Reviewed 8 of 9 files at r3. pkg/rpc/context.go, line 443 at r2 (raw file): Previously, solongordon wrote…
Removing it asynchronously is fine because we'll already have set its pkg/rpc/heartbeat.go, line 51 at r2 (raw file): Previously, solongordon wrote…
Ah, I see. pkg/rpc/heartbeat.go, line 70 at r2 (raw file): Previously, solongordon wrote…
Just to follow up on this with a little evidence, our existing gossip code's ClusterID logic rejects incoming connections that have a cluster ID if the server receiving the connection doesn't have one yet: https://github.com/cockroachdb/cockroach/blob/master/pkg/gossip/server.go#L129 pkg/rpc/heartbeat.go, line 73 at r2 (raw file): Previously, solongordon wrote…
I realize you haven't made the change yet, but if we do reject incoming connections that have a cluster ID when we don't, we may want to switch these pkg/server/server.go, line 940 at r2 (raw file): Previously, solongordon wrote…
Ah, I'm sorry! You're right -- gossip may learn of the cluster ID before we call The main thing I'm worried about here is that we're opening up the main listener, allowing incoming connections, when we might not have set the clusterID held by There's also the potential for similar issues with outgoing connections being established before To summarize, I think we need to be more careful here, unless you think I'm missing something. Sorry for the wall of text - hopefully I'm overthinking things and we don't have to add much more complexity. Let me know if you'd like to video chat about any of this. pkg/server/server.go, line 1055 at r2 (raw file): Previously, solongordon wrote…
Let's move the conversation above, but the tl;dr is that the race I'm worried about is an incoming connection/heartbeat causing Comments from Reviewable |
Review status: 12 of 13 files reviewed at latest revision, 8 unresolved discussions, some commit checks failed. pkg/rpc/context.go, line 443 at r2 (raw file): Previously, a-robinson (Alex Robinson) wrote…
I'm a bit anxious about changing the logic here. In particular, removing the connection if the initial heartbeat fails means we could be opening and closing lots of connections if the remote server is down rather than relying on gRPC to find us a good connection. It is possible my anxiety is misplaced as I haven't looked at this code recently. An alternative to removing the connection here is to mark the connection as unavailable until the first heartbeat succeeds. Comments from Reviewable |
Review status: 12 of 13 files reviewed at latest revision, 8 unresolved discussions, some commit checks failed. pkg/rpc/context.go, line 443 at r2 (raw file): Previously, petermattis (Peter Mattis) wrote…
Do we have an existing mechanism for marking a connection as unavailable? Or would that mean just changing the Comments from Reviewable |
Review status: 12 of 13 files reviewed at latest revision, 8 unresolved discussions, some commit checks failed. pkg/rpc/context.go, line 443 at r2 (raw file): Previously, solongordon wrote…
I was thinking the latter: we'd change Comments from Reviewable |
f9c09c2 to 993ef2b
Thanks for the race detector details! Thankfully it was a race in the test itself. Should be fixed. Review status: 9 of 13 files reviewed at latest revision, 8 unresolved discussions. pkg/rpc/context.go, line 443 at r2 (raw file): Previously, petermattis (Peter Mattis) wrote…
Got it. I had already made the initial heartbeat synchronous, so I think this was a relatively straightforward change. I just added a bit more bookkeeping to the connection metadata to store whether the heartbeat has ever succeeded. pkg/rpc/heartbeat.go, line 73 at r2 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Done. pkg/server/server.go, line 940 at r2 (raw file): Previously, a-robinson (Alex Robinson) wrote…
OK, thanks, this seems more subtle than I realized. I'll spend some more time investigating the different scenarios we need to handle here and follow up with you when I understand it better. Comments from Reviewable |
993ef2b to 5552231
Review status: 9 of 13 files reviewed at latest revision, 10 unresolved discussions, some commit checks failed. pkg/rpc/context.go, line 218 at r4 (raw file):
On the other hand, performance isn't a concern here so perhaps this is fine as is. pkg/rpc/context.go, line 445 at r4 (raw file):
This could block for up to Comments from Reviewable |
Review status: 9 of 13 files reviewed at latest revision, 10 unresolved discussions, some commit checks failed. pkg/rpc/context.go, line 218 at r4 (raw file): Previously, petermattis (Peter Mattis) wrote…
Thanks, switched to pkg/rpc/context.go, line 445 at r4 (raw file): Previously, petermattis (Peter Mattis) wrote…
OK, I took a stab at this. The blocking heartbeat is gone. Instead, pkg/rpc/heartbeat.go, line 51 at r2 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Now that we have pkg/rpc/heartbeat.go, line 70 at r2 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Yup, this does seem like the right thing to do. Unfortunately it breaks a ton of tests since they don't bother to set cluster ID. If it's ok with you, I'll hold off on this for now to avoid bloating this PR further. Now that I'm doing a better job of setting the cluster ID before opening the main listener, it shouldn't be an issue. pkg/server/server.go, line 940 at r2 (raw file): Previously, solongordon wrote…
This should be in better shape now. I introduced pkg/server/server.go, line 1055 at r2 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Makes sense. Should be fixed by Comments from Reviewable |
5552231 to 8b3c6e9
Good question. I didn't realize Review status: 3 of 46 files reviewed at latest revision, 13 unresolved discussions, some commit checks failed. pkg/rpc/context.go, line 232 at r5 (raw file): Previously, petermattis (Peter Mattis) wrote…
Done. pkg/rpc/context.go, line 236 at r5 (raw file): Previously, petermattis (Peter Mattis) wrote…
D'oh, of course. pkg/rpc/context.go, line 242 at r5 (raw file): Previously, petermattis (Peter Mattis) wrote…
Yup, I hear that. I'll hold off on renaming to see if @tschottdorf or @a-robinson have preferences. Comments from Reviewable |
8b3c6e9 to c5157db
Review status: 3 of 46 files reviewed at latest revision, 11 unresolved discussions, some commit checks failed. pkg/rpc/context.go, line 248 at r5 (raw file): Previously, solongordon wrote…
You'll want the caller to pass in a context. Comments from Reviewable |
c5157db to 8673c06
Review status: 3 of 46 files reviewed at latest revision, 11 unresolved discussions, some commit checks failed. pkg/rpc/context.go, line 248 at r5 (raw file): Previously, petermattis (Peter Mattis) wrote…
Done. Note this required a bunch of new context plumbing since GRPCDial previously didn't take a context. Comments from Reviewable |
modulo the naming of Review status: 3 of 47 files reviewed at latest revision, 11 unresolved discussions, some commit checks failed. pkg/rpc/context_test.go, line 100 at r6 (raw file):
I think this should be Comments from Reviewable |
8673c06 to 6f80df8
Review status: 3 of 47 files reviewed at latest revision, 11 unresolved discussions, some commit checks failed. pkg/rpc/context_test.go, line 100 at r6 (raw file): Previously, petermattis (Peter Mattis) wrote…
Done. Comments from Reviewable |
Regarding my concern about internal gRPC reconnections, I think we can use a custom dialer:
The idea would be that we have a dialer that only dials an address once and then fails. We'd need to find out where the error is reported and remove the connection from our Review status: 3 of 47 files reviewed at latest revision, 10 unresolved discussions, some commit checks failed. Comments from Reviewable |
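The snippet from the comment above did not survive in this copy of the thread, so what follows is only a rough sketch of the idea as described: a dialer that dials an address once and then fails. All names here (`dialFunc`, `onceDialer`, `fakeDial`, `dialTwice`) and the address are hypothetical, and the wiring into gRPC's dial options is deliberately left out.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"sync/atomic"
)

// dialFunc is the shape of a context-aware dialer.
type dialFunc func(ctx context.Context, addr string) (net.Conn, error)

// errAlreadyDialed is returned on every attempt after the first.
var errAlreadyDialed = errors.New("dialer already used once; refusing to redial")

// onceDialer wraps dial so that only the first call reaches it. Later calls
// fail immediately, so any internal reconnection attempt surfaces as an
// error instead of silently producing a fresh connection.
func onceDialer(dial dialFunc) dialFunc {
	var used int32
	return func(ctx context.Context, addr string) (net.Conn, error) {
		if !atomic.CompareAndSwapInt32(&used, 0, 1) {
			return nil, errAlreadyDialed
		}
		return dial(ctx, addr)
	}
}

// fakeDial is a test double backed by an in-memory pipe.
func fakeDial(ctx context.Context, addr string) (net.Conn, error) {
	client, server := net.Pipe()
	server.Close()
	return client, nil
}

// dialTwice calls d twice and reports the error from each attempt.
func dialTwice(d dialFunc, addr string) (first, second error) {
	ctx := context.Background()
	if conn, err := d(ctx, addr); err != nil {
		first = err
	} else {
		conn.Close()
	}
	if conn, err := d(ctx, addr); err != nil {
		second = err
	} else {
		conn.Close()
	}
	return first, second
}

func main() {
	first, second := dialTwice(onceDialer(fakeDial), "node1:26257")
	fmt.Println("first attempt error:", first)
	fmt.Println("second attempt error:", second)
}
```

The caller would still need to notice the second-attempt error and drop the connection from its own bookkeeping, which is the part the comment says needs investigation.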
Yeah, a follow-up issue/PR seems sufficient to handle the reconnection issue. Reviewed 1 of 9 files at r3, 13 of 42 files at r5, 17 of 29 files at r6, 13 of 13 files at r7. pkg/base/cluster_id.go, line 32 at r7 (raw file):
A pkg/base/cluster_id.go, line 60 at r7 (raw file):
s/%d/%s/ pkg/base/cluster_id.go, line 63 at r7 (raw file):
s/%d/%s/ pkg/rpc/context.go, line 242 at r5 (raw file): Previously, solongordon wrote…
pkg/rpc/context.go, line 291 at r7 (raw file):
It feels off to me that pkg/rpc/context.go, line 480 at r7 (raw file):
Is it safe to ignore pkg/rpc/context.go, line 537 at r7 (raw file):
What does this defer statement do for us? I'm a little concerned that it could mark a connection as validated that has never actually been validated (if pkg/rpc/context_test.go, line 814 at r3 (raw file):
Why did you remove this? pkg/rpc/heartbeat.go, line 70 at r2 (raw file): Previously, solongordon wrote…
Yup, that's ok with me. Mind filing an issue to make sure we don't forget about it? pkg/server/node.go, line 559 at r5 (raw file): Previously, solongordon wrote…
This looks reasonable to me. pkg/server/server.go, line 1055 at r2 (raw file): Previously, solongordon wrote…
This initialization ordering is subtle enough that a comment around the ordering of its initialization seems worthwhile. Can you find a place to add such a comment? Comments from Reviewable |
@petermattis that looks like a pretty good solution. Agreed that this PR should land though; it's gotten pretty big. While adding the dialer option is probably doable, adding the test might be a mouthful (though I think the localcluster acceptance tests could do it nicely). Adding the DialOption would also deal with one oddity of the current code (at least last I checked) which is that heartbeat loops essentially never stop (even if the node goes away). This PR should land sooner rather than later, so take my (too many) comments with a grain of salt. Reviewed 1 of 13 files at r1, 3 of 9 files at r3, 14 of 42 files at r5, 29 of 29 files at r6. pkg/base/cluster_id.go, line 32 at r6 (raw file):
You don't need pkg/base/cluster_id.go, line 32 at r7 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Are you sure? I find it hard to argue that one is faster than the other. The critical sections for this mutex are extremely small, so I wonder whether the I guess if we wanted raw performance, we'd go for https://golang.org/pkg/sync/atomic/#CompareAndSwapPointer and https://golang.org/pkg/sync/atomic/#LoadPointer but likely premature. pkg/cli/start.go, line 934 at r6 (raw file):
Just a warning: once you rebase, you'll see that pkg/rpc/context.go, line 242 at r5 (raw file): Previously, solongordon wrote…
I actually like pkg/rpc/context.go, line 222 at r6 (raw file):
pkg/rpc/context.go, line 223 at r6 (raw file):
pkg/rpc/context.go, line 225 at r6 (raw file):
nit: pkg/rpc/context.go, line 254 at r6 (raw file):
I think this is racy. You could have the following sequence of events:
I think you want to fold pkg/rpc/context.go, line 552 at r6 (raw file):
pkg/rpc/context.go, line 573 at r6 (raw file):
everSucceeded = everSucceeded || err == nil
conn.heartbeatErr.Store(errValue{err: err, everSucceeded: everSucceeded})
pkg/rpc/context.go, line 576 at r6 (raw file):
This can go then. pkg/rpc/context.go, line 480 at r7 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Makes you wonder why we dial in the first place when we know we're going to go local. pkg/rpc/context.go, line 537 at r7 (raw file): Previously, a-robinson (Alex Robinson) wrote…
I think he just wants to unblock clients when the connection goes away. Agree that it's slightly concerning. pkg/rpc/heartbeat.go, line 70 at r2 (raw file):
Wouldn't that happen if you let a new node join an existing cluster? Sure, the new node usually reaches out first, but there's nothing that prevents other nodes in the cluster from also connecting to the new one. In this code, we'll need to accept any combination of missing UUIDs anyway, since we need mixed-version compat with 1.1 which doesn't ever send any. pkg/rpc/heartbeat.proto, line 52 at r6 (raw file):
I know you're just cargo-culting this, but I'm not sure this should be nullable. You'll have interop with old versions that actually don't send the field, so you can distinguish the two. For example, your code could send a trivial UUID to signal that they understand the UUID check but are not bootstrapped yet. Not suggesting you should do that, but it motivates that pkg/server/node.go, line 559 at r5 (raw file): Previously, solongordon wrote…
I'd add a comment explaining that there is a scenario in which pkg/server/server.go, line 940 at r2 (raw file): Previously, solongordon wrote…
Are there any races that could take place if a node comes up waiting for bootstrap and two different clusters connect to it at about the same time? If not, what will happen? Is our node just going to Fatal (due to setting two different IDs)? Would be good to document the extent of the protection and its shortcomings. I think the ID container is a good place to put it as that's one hop from everything related. pkg/server/server.go, line 557 at r6 (raw file):
Check the error first. If there's an error, nothing to compare. If there is no error, I think you always have a Comments from Reviewable |
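To make the `everSucceeded` suggestion above concrete, here is a self-contained sketch of the bookkeeping, assuming a single heartbeat loop is the only writer. `errValue`, `heartbeatErr` and `everSucceeded` come from the snippet in the review; the `connection` type, `newConnection`, `recordHeartbeat` and `validated` are hypothetical names for illustration and may not match pkg/rpc/context.go.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// errNotYetHeartbeated is the placeholder error before the first heartbeat.
var errNotYetHeartbeated = errors.New("not yet heartbeated")

// errValue boxes the heartbeat outcome. atomic.Value requires every stored
// value to have the same concrete type, so the error is wrapped in a struct
// rather than stored directly.
type errValue struct {
	err           error
	everSucceeded bool
}

// connection holds the latest heartbeat result for one remote address.
type connection struct {
	heartbeatErr atomic.Value // always holds an errValue
}

func newConnection() *connection {
	c := &connection{}
	c.heartbeatErr.Store(errValue{err: errNotYetHeartbeated})
	return c
}

// recordHeartbeat stores the newest result and latches everSucceeded once
// any heartbeat has returned a nil error. It assumes a single writer.
func (c *connection) recordHeartbeat(err error) {
	prev := c.heartbeatErr.Load().(errValue)
	everSucceeded := prev.everSucceeded || err == nil
	c.heartbeatErr.Store(errValue{err: err, everSucceeded: everSucceeded})
}

// validated reports whether at least one heartbeat has ever succeeded.
func (c *connection) validated() bool {
	return c.heartbeatErr.Load().(errValue).everSucceeded
}

func main() {
	c := newConnection()
	fmt.Println(c.validated()) // no heartbeat yet
	c.recordHeartbeat(nil)
	c.recordHeartbeat(errNotYetHeartbeated) // a later failure keeps the latch
	fmt.Println(c.validated())
}
```

Readers can then distinguish "never validated" from "validated earlier but currently failing" with a single atomic load.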
Review status: all files reviewed at latest revision, 27 unresolved discussions, some commit checks failed. pkg/rpc/heartbeat.go, line 70 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
I was going to say that I didn't think it was even possible for a node with an unset cluster ID to receive a non- pkg/server/server.go, line 1126 at r7 (raw file):
Just a suggestion, but we may want to assert that our Comments from Reviewable |
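As a reference point for the ID container discussion, here is a minimal sketch of a set-once container guarded by a plain mutex. It is an assumption-laden illustration, not the code in pkg/base/cluster_id.go: `ClusterID`, `ClusterIDContainer`, `Get` and `Set` are hypothetical here, and this version returns an error on a conflicting ID where the thread suggests the real code may fatal instead.

```go
package main

import (
	"fmt"
	"sync"
)

// ClusterID is a hypothetical stand-in for a 16-byte UUID.
type ClusterID [16]byte

// ClusterIDContainer holds a cluster ID that may be set once. The critical
// sections are tiny, so a plain sync.Mutex is used.
type ClusterIDContainer struct {
	mu sync.Mutex
	id ClusterID
}

// Get returns the current ID, or the zero value if none has been set.
func (c *ClusterIDContainer) Get() ClusterID {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.id
}

// Set stores id. Setting the same ID again is a no-op; setting a different
// ID, or an empty one, is reported as an error.
func (c *ClusterIDContainer) Set(id ClusterID) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	switch {
	case id == (ClusterID{}):
		return fmt.Errorf("refusing to set an empty cluster ID")
	case c.id == (ClusterID{}):
		c.id = id
	case c.id != id:
		return fmt.Errorf("cluster ID already set to %x, cannot change it to %x", c.id, id)
	}
	return nil
}

func main() {
	var c ClusterIDContainer
	fmt.Println(c.Set(ClusterID{1})) // first set succeeds
	fmt.Println(c.Set(ClusterID{1})) // same ID again is fine
	fmt.Println(c.Set(ClusterID{2})) // conflicting ID is rejected
}
```

With this shape, two different clusters racing to reach an uninitialized node would result in one `Set` winning and the other being rejected, which is the kind of behavior the comment above asks to have documented.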
Review status: all files reviewed at latest revision, 26 unresolved discussions, some commit checks failed. pkg/base/cluster_id.go, line 32 at r6 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/base/cluster_id.go, line 60 at r7 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Done. pkg/base/cluster_id.go, line 63 at r7 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Done. pkg/rpc/context.go, line 222 at r6 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/rpc/context.go, line 223 at r6 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/rpc/context.go, line 225 at r6 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/rpc/context.go, line 254 at r6 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
I think I'm handling this by only loading pkg/rpc/context.go, line 291 at r7 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Yeah, I had the same thought. Pragmatically I think it worked out simpler this way for testing purposes, but I'll revisit. pkg/rpc/context.go, line 480 at r7 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
It looks like it's being ignored, but it gets checked when pkg/rpc/context.go, line 537 at r7 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Yes, I wanted to make sure pkg/rpc/context_test.go, line 814 at r3 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Based on Peter's suggestion the connection is no longer removed from the pkg/server/server.go, line 557 at r6 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. Comments from Reviewable |
335115b to a826bfb
Review status: 44 of 47 files reviewed at latest revision, 26 unresolved discussions, some commit checks failed. pkg/cli/start.go, line 934 at r6 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/rpc/context.go, line 242 at r5 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
I don't have a strong preference and pkg/rpc/context.go, line 254 at r6 (raw file): Previously, solongordon wrote…
OK, I went with the pkg/rpc/context.go, line 552 at r6 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/rpc/context.go, line 573 at r6 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/rpc/context.go, line 576 at r6 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/rpc/heartbeat.proto, line 52 at r6 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Sure, makes sense. It feels a little funny since the rest of this proto has pkg/server/node.go, line 559 at r5 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
I realized that the cluster ID checks here are actually redundant now since I'm doing them in pkg/server/server.go, line 1126 at r7 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Good idea. I added this check right after Comments from Reviewable |
Review status: 19 of 47 files reviewed at latest revision, 24 unresolved discussions, some commit checks failed. pkg/server/server.go, line 940 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
To be honest the gossip flow is confusing enough that it's hard for me to tell. I tried to force this race in a couple ways, like specifying two different clusters in Comments from Reviewable |
Reviewed 18 of 28 files at r8. pkg/rpc/context.go, line 242 at r5 (raw file): Previously, solongordon wrote…
Leaving it as pkg/rpc/context.go, line 537 at r7 (raw file): Previously, solongordon wrote…
Couldn't Comments from Reviewable |
a826bfb to 4e4d224
Review status: 37 of 47 files reviewed at latest revision, 24 unresolved discussions, some commit checks failed. pkg/rpc/context.go, line 537 at r7 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Done. pkg/server/server.go, line 1055 at r2 (raw file): Previously, a-robinson (Alex Robinson) wrote…
I added a comment to Comments from Reviewable |
In the heartbeat, nodes now share their cluster IDs and check that they match. We allow for missing cluster IDs, since new nodes do not have a cluster ID until they obtain one via gossip, but conflicting IDs will result in a heartbeat error. In addition, connections are now not added to the connection pool until the heartbeat succeeds. This allows us to fail fast when a node attempts to join the wrong cluster. Refers cockroachdb#18058. Release note: None
4e4d224 to 78d1ae6
Congrats 🛩 |