core: Check for cluster ID conflicts on handshake #20163

Merged

Conversation

solongordon
Contributor

@solongordon solongordon commented Nov 20, 2017

In the heartbeat, nodes now share their cluster IDs and check that they
match. We allow for missing cluster IDs, since new nodes do not have a
cluster ID until they obtain one via gossip, but conflicting IDs will
result in a heartbeat error.

In addition, connections are now not added to the connection pool until
the heartbeat succeeds. This allows us to fail fast when a node attempts
to join the wrong cluster.

Refers #18058.
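For readers skimming the diff, here is a minimal sketch of the handshake rule described above. The helper name checkClusterID is invented for illustration; the real check lives in the heartbeat handler in pkg/rpc.

package rpc

import (
	"fmt"

	"github.com/cockroachdb/cockroach/pkg/util/uuid"
)

// checkClusterID is an invented helper illustrating the rule: a missing
// (zero) ID on either side is tolerated, but two different non-zero IDs
// fail the heartbeat.
func checkClusterID(serverID, clientID uuid.UUID) error {
	if serverID != uuid.Nil && clientID != uuid.Nil && serverID != clientID {
		return fmt.Errorf(
			"client cluster ID %s doesn't match server cluster ID %s",
			clientID, serverID)
	}
	return nil
}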

@solongordon solongordon requested a review from tbg November 20, 2017 19:30
@solongordon solongordon requested a review from a team as a code owner November 20, 2017 19:30
@solongordon solongordon requested a review from a team November 20, 2017 19:30
@cockroach-teamcity
Member

This change is Reviewable

@solongordon
Contributor Author

Review status: 0 of 13 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


pkg/rpc/context.go, line 245 at r1 (raw file):

	stats StatsHandler

	ClusterID uuid.UUID

@tschottdorf I didn't end up guarding this with a mutex since there didn't seem to be any risk of contention. (It's only modified in server.Start().) Let me know if that seems reasonable to you.


Comments from Reviewable

@a-robinson
Contributor

Nice work, @solongordon!


Reviewed 11 of 13 files at r1, 2 of 2 files at r2.
Review status: all files reviewed at latest revision, 16 unresolved discussions, all commit checks successful.


pkg/cli/cli_test.go, line 343 at r2 (raw file):

		// Error returned directly from GRPC.
		{`quit`, styled(
			`initial heartbeat failed: rpc error: code = Unavailable desc = grpc: the connection is unavailable`),

This is a much less clear error message compared to what we had before. Can we find a way to make it more clear to users who don't know what the "initial heartbeat" is?


pkg/cli/start.go, line 938 at r2 (raw file):

	conn, err := rpcContext.GRPCDial(addr)
	if err != nil {
		stopper.Stop(stopperContext(stopper))

Nice catch! Isn't this also needed before the previous early return?


pkg/rpc/context.go, line 440 at r2 (raw file):

			// Run an initial heartbeat to ensure the connection is healthy.
			ctx.runHeartbeat(meta, target)
			err = value.(*connMeta).heartbeatErr.Load().(errValue).error

Isn't value.(*connMeta) just meta?


pkg/rpc/context.go, line 442 at r2 (raw file):

			err = value.(*connMeta).heartbeatErr.Load().(errValue).error
			if err != nil {
				meta.dialErr = errors.Wrap(err, "initial heartbeat failed")

Nit, but I think "initial connection heartbeat failed" may be a little clearer to folks less familiar with how this stuff is implemented.


pkg/rpc/context.go, line 443 at r2 (raw file):

			if err != nil {
				meta.dialErr = errors.Wrap(err, "initial heartbeat failed")
				ctx.removeConn(target, meta)

Doesn't this need to be go ctx.removeConn(target, meta) for the same reason as in the error case below?


pkg/rpc/context_test.go, line 801 at r2 (raw file):

	clientCtx.ClusterID = uuid.MakeV4()

	_, err = clientCtx.GRPCDial(remoteAddr)

It looks like it'll work fine right now due to the semantics of sync.Once.Do, but in case the implementation ever changes it'd be nice to try creating race conditions by calling GRPCDial in multiple goroutines and verifying they all fail with the expected error.
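A sketch of that kind of test, assuming the surrounding test's clientCtx and remoteAddr, the GRPCDial signature at this revision, and imports of strings, sync, and testing; the function name, goroutine count, and error text are illustrative.

// testConcurrentDialClusterIDMismatch is a hypothetical test helper: several
// goroutines dial the same mismatched remote and each must see the heartbeat
// error.
func testConcurrentDialClusterIDMismatch(t *testing.T, clientCtx *Context, remoteAddr string) {
	var wg sync.WaitGroup
	for i := 0; i < 16; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Every concurrent dial should observe the failed initial heartbeat.
			_, err := clientCtx.GRPCDial(remoteAddr)
			if err == nil || !strings.Contains(err.Error(), "initial heartbeat failed") {
				t.Errorf("expected initial heartbeat failure, got %v", err)
			}
		}()
	}
	wg.Wait()
}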


pkg/rpc/context_test.go, line 802 at r2 (raw file):

	_, err = clientCtx.GRPCDial(remoteAddr)
	expected := "initial heartbeat failed"

Can we be more specific and expect an error message indicating that the cluster IDs don't match?


pkg/rpc/heartbeat.go, line 51 at r2 (raw file):

// remote clocks sent to it by storing them in the remoteClockMonitor.
type HeartbeatService struct {
	rpcContext *Context

Why this change? It's typically more maintainable and testable to ask for just the few things that you need than to embed a larger type with tons of fields that you don't need.


pkg/rpc/heartbeat.go, line 70 at r2 (raw file):

			"does not match that of node %s (%s)", mo, args.Addr, amo))
	}
	if args.ClusterID != uuid.Nil && hs.rpcContext.ClusterID != uuid.Nil &&

I don't believe there's ever a valid scenario where a client with a cluster ID would connect to a node without a cluster ID. We may want to be more strict about that to avoid issues when initializing a new node. I'd say we may even just want to disallow incoming connections to a node that doesn't have a cluster ID, but that would break the init command.

cc @tschottdorf to double check that first statement.


pkg/rpc/heartbeat.go, line 71 at r2 (raw file):

	}
	if args.ClusterID != uuid.Nil && hs.rpcContext.ClusterID != uuid.Nil &&
		args.ClusterID != hs.rpcContext.ClusterID {

Huh, TIL that you can directly compare arrays in go.


pkg/rpc/heartbeat.go, line 73 at r2 (raw file):

		args.ClusterID != hs.rpcContext.ClusterID {

		return nil, errors.Errorf("unexpected cluster ID: %s", args.ClusterID)

I expect it'd be helpful to include both IDs in this case, e.g. "client cluster ID %s doesn't match server cluster ID %s"


pkg/rpc/heartbeat.proto, line 51 at r2 (raw file):

  // The configured maximum clock offset (in nanoseconds) on the server.
  optional int64 max_offset_nanos = 4 [(gogoproto.nullable) = false];
  // Cluster ID to prevent illegal connections.

nit, but s/illegal connections/connections between nodes in different clusters/


pkg/server/server.go, line 571 at r2 (raw file):

			} else if storeIdent.ClusterID != clusterID {
				return nil, nil, uuid.Nil, cluster.ClusterVersion{},
					errors.Errorf("conflicting engine cluster IDs: %s, %s", storeIdent.ClusterID, clusterID)

s/engine/store/ to match the more widely-known terminology


pkg/server/server.go, line 940 at r2 (raw file):

		atomic.StoreInt32(&initLActive, 0)
	}

At this point we should definitely have a known cluster ID, either via a bootstrapped store or via gossip -- shouldn't we update the rpcContext's ClusterID before opening up the main listener?


pkg/server/server.go, line 1055 at r2 (raw file):

	log.Event(ctx, "started node")

	// ClusterID may have been determined via gossip. Update RPC context.

Doing this after starting so much up seems wrong. Am I just mistaken or should this be done earlier?

Also, how is this safe from races with incoming/outgoing connection heartbeats? If it isn't, then are we missing a test that would have caught it when run under the go race detector?


Comments from Reviewable

@a-robinson a-robinson requested review from a-robinson and removed request for tbg November 21, 2017 21:48
@solongordon
Contributor Author

Review status: all files reviewed at latest revision, 16 unresolved discussions, all commit checks successful.


pkg/rpc/heartbeat.go, line 51 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Why this change? It's typically more maintainable and testable to ask for just the few things that you need than to embed a larger type with tons of fields that you don't need.

Mostly because I needed the ClusterID value to be updated if it changes in rpcContext. Is there an idiomatic way to do this with a single field? atomic.Value?


Comments from Reviewable

@solongordon
Contributor Author

Review status: all files reviewed at latest revision, 16 unresolved discussions, all commit checks successful.


pkg/cli/start.go, line 938 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Nice catch! Isn't this also needed before the previous early return?

Done.


pkg/rpc/context.go, line 440 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Isn't value.(*connMeta) just meta?

Done.


pkg/rpc/context.go, line 442 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Nit, but I think "initial connection heartbeat failed" may be a little clearer to folks less familiar with how this stuff is implemented.

Done.


pkg/rpc/context.go, line 443 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Doesn't this need to be go ctx.removeConn(target, meta) for the same reason as in the error case below?

Ah, maybe. I didn't run across the deadlock in testing but it seems possible.

I'm a bit worried about removing the connection asynchronously since the whole point of this is to ensure that a bad connection doesn't get handed out. Does that seem like a valid concern?


pkg/rpc/context_test.go, line 802 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Can we be more specific and expect an error message indicating that the cluster IDs don't match?

Done.


pkg/rpc/heartbeat.go, line 70 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

I don't believe there's ever a valid scenario where a client with a cluster ID would connect to a node without a cluster ID. We may want to be more strict about that to avoid issues when initializing a new node. I'd say we may even just want to disallow incoming connections to a node that doesn't have a cluster ID, but that would break the init command.

cc @tschottdorf to double check that first statement.

Interesting, I'll experiment with disallowing missing cluster IDs on the ping handler side and see what happens.


pkg/rpc/heartbeat.go, line 73 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

I expect it'd be helpful to include both IDs in this case, e.g. "client cluster ID %s doesn't match server cluster ID %s"

Done.


pkg/rpc/heartbeat.proto, line 51 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

nit, but s/illegal connections/connections between nodes in different clusters/

Done.


pkg/server/server.go, line 571 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

s/engine/store/ to match the more widely-known terminology

Done.


pkg/server/server.go, line 940 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

At this point we should definitely have a known cluster ID, either via a bootstrapped store or via gossip -- shouldn't we update the rpcContext's ClusterID before opening up the main listener?

From reading the code it seemed like obtaining a cluster ID via gossip happens in node.start(). That's why I update the rpcContext way down on line 1056. Does that seem right to you or did I miss something?


pkg/server/server.go, line 1055 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Doing this after starting so much up seems wrong. Am I just mistaken or should this be done earlier?

Also, how is this safe from races with incoming/outgoing connection heartbeats? If it isn't, then are we missing a test that would have caught it when run under the go race detector?

See above comment about the placement. This seemed like the earliest I could do it but I could easily be wrong.

I didn't worry about race conditions since there are no concurrent writes going on, and the heartbeat succeeds whether this value is empty or populated. But maybe I'm missing something subtler. What race did you have in mind? I'll read up on the race detector.


Comments from Reviewable

@solongordon solongordon force-pushed the check-cluster-id-in-heartbeat branch from 1c26d86 to 9f932d1 Compare November 22, 2017 16:31
@solongordon
Contributor Author

Review status: 5 of 13 files reviewed at latest revision, 16 unresolved discussions, some commit checks pending.


pkg/cli/cli_test.go, line 343 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

This is a much less clear error message compared to what we had before. Can we find a way to make it more clear to users who don't know what the "initial heartbeat" is?

Yeah, the error message does have a preamble that's less cryptic (see ~10 lines up) but I'll see if I can clean this up.


Comments from Reviewable

@solongordon
Contributor Author

Review status: 5 of 13 files reviewed at latest revision, 16 unresolved discussions, some commit checks failed.


pkg/cli/cli_test.go, line 343 at r2 (raw file):

Previously, solongordon wrote…

Yeah, the error message does have a preamble that's less cryptic (see ~10 lines up) but I'll see if I can clean this up.

Added back the "Failed to connect to the node:" prefix.


Comments from Reviewable

@solongordon solongordon force-pushed the check-cluster-id-in-heartbeat branch 2 times, most recently from e648cb5 to f9c09c2 Compare November 22, 2017 18:09
@solongordon
Contributor Author

Review status: 4 of 13 files reviewed at latest revision, 16 unresolved discussions.


pkg/rpc/context_test.go, line 801 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

It looks like it'll work fine right now due to the semantics of sync.Once.Do, but in case the implementation ever changes it'd be nice to try creating race conditions by calling GRPCDial in multiple goroutines and verifying they all fail with the expected error.

Good idea, done!


Comments from Reviewable

@a-robinson
Contributor

Speaking of races, looks like the race detector found something it didn't like: https://teamcity.cockroachdb.com/viewLog.html?buildId=417603&buildTypeId=Cockroach_UnitTests

You can run the race detector on our tests locally using the make testrace or make stressrace commands from our Makefile. As an example, if I wanted to try to reproduce a flake in TestClusterIDMismatch under the race detector, running 32 instances of the test in parallel until one fails, I'd run:

make stressrace PKG=./rpc TESTS=TestClusterIDMismatch STRESSFLAGS='-stderr -maxfails 1 -p 32'

Reviewed 8 of 9 files at r3.
Review status: 12 of 13 files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.


pkg/rpc/context.go, line 443 at r2 (raw file):

Previously, solongordon wrote…

Ah, maybe. I didn't run across the deadlock in testing but it seems possible.

I'm a bit worried about removing the connection asynchronously since the whole point of this is to ensure that a bad connection doesn't get handed out. Does that seem like a valid concern?

Removing it asynchronously is fine because we'll already have set its dialErr to be non-nil, meaning it won't be used by anyone. However, I did a little digging and it looks like the comment below is outdated and now unnecessary. You can leave this as is, and I'll send a separate PR to update the existing one.


pkg/rpc/heartbeat.go, line 51 at r2 (raw file):

Previously, solongordon wrote…

Mostly because I needed the ClusterID value to be updated if it changes in rpcContext. Is there an idiomatic way to do this with a single field? atomic.Value?

Ah, I see. atomic.Value would be an idiomatic way to handle it, but this is probably fine, then.


pkg/rpc/heartbeat.go, line 70 at r2 (raw file):

Previously, solongordon wrote…

Interesting, I'll experiment with disallowing missing cluster IDs on the ping handler side and see what happens.

Just to follow up on this with a little evidence, our existing gossip code's ClusterID logic rejects incoming connections that have a cluster ID if the server receiving the connection doesn't have one yet: https://github.com/cockroachdb/cockroach/blob/master/pkg/gossip/server.go#L129


pkg/rpc/heartbeat.go, line 73 at r2 (raw file):

Previously, solongordon wrote…

Done.

I realize you haven't made the change yet, but if we do reject incoming connections that have a cluster ID when we don't, we may want to switch these %s formatters to %q.


pkg/server/server.go, line 940 at r2 (raw file):

Previously, solongordon wrote…

From reading the code it seemed like obtaining a cluster ID via gossip happens in node.start(). That's why I update the rpcContext way down on line 1056. Does that seem right to you or did I miss something?

Ah, I'm sorry! You're right -- gossip may learn of the cluster ID before we call node.start, but it doesn't get passed up the stack to here until node.start is called, since gossip stores it separately.

The main thing I'm worried about here is that we're opening up the main listener, allowing incoming connections, when we might not have set the clusterID held by rpc.Context yet. That seems like it means we could allow incoming connections that shouldn't actually be allowed. It's also why I'm worried about data races on the ClusterID object -- if incoming connections are allowed starting right here, and incoming connections cause rpc.Context to read from the ClusterID object, then data races on ClusterID seem possible.

There's also the potential for similar issues with outgoing connections being established before ClusterID is set properly. It's ok for gossip or admin RPCs to go out without a ClusterID, but we don't want to be able to send kv/raft/distsql RPCs without a connection whose ID has been properly vetted.

To summarize, I think we need to be more careful here, unless you think I'm missing something. node.start is hairy enough that I'm not sure we can actually be perfect without more significant work, such as doing things a bit differently depending on whether we already have our cluster ID set. It looks like we may want to move some of this into node.start itself -- in particular I'm pretty sure it isn't safe that we call node.initNodeID in node.start before setting the ClusterID on rpc.Context - in certain pathological cases we could hit #15898. Along those lines, we may also need to verify and/or purge existing connections when setting the ClusterID on rpc.Context.

Sorry for the wall of text - hopefully I'm overthinking things and we don't have to add much more complexity. Let me know if you'd like to video chat about any of this.


pkg/server/server.go, line 1055 at r2 (raw file):

Previously, solongordon wrote…

See above comment about the placement. This seemed like the earliest I could do it but I could easily be wrong.

I didn't worry about race conditions since there are no concurrent writes going on, and the heartbeat succeeds whether this value is empty or populated. But maybe I'm missing something subtler. What race did you have in mind? I'll read up on the race detector.

Let's move the conversation above, but the tl;dr is that the race I'm worried about is an incoming connection/heartbeat causing rpc.Context to read ClusterID before/while it's being updated here.


Comments from Reviewable

@petermattis
Collaborator

Review status: 12 of 13 files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.


pkg/rpc/context.go, line 443 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Removing it asynchronously is fine because we'll already have set its dialErr to be non-nil, meaning it won't be used by anyone. However, I did a little digging and it looks like the comment below is outdated and now unnecessary. You can leave this as is, and I'll send a separate PR to update the existing one.

I'm a bit anxious about changing the logic here. In particular, removing the connection if the initial heartbeat fails means we could be opening and closing lots of connections if the remote server is down rather than relying on gRPC to find us a good connection.

It is possible my anxiety is misplaced as I haven't looked at this code recently. An alternative to removing the connection here is to mark the connection as unavailable until the first heartbeat succeeds.


Comments from Reviewable

@solongordon
Contributor Author

Review status: 12 of 13 files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.


pkg/rpc/context.go, line 443 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I'm a bit anxious about changing the logic here. In particular, removing the connection if the initial heartbeat fails means we could be opening and closing lots of connections if the remote server is down rather than relying on gRPC to find us a good connection.

It is possible my anxiety is misplaced as I haven't looked at this code recently. An alternative to removing the connection here is to mark the connection as unavailable until the first heartbeat succeeds.

Do we have an existing mechanism for marking a connection as unavailable? Or would that mean just changing the GRPCDial logic to return an error unless there has been at least one successful heartbeat?


Comments from Reviewable

@petermattis
Collaborator

Review status: 12 of 13 files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.


pkg/rpc/context.go, line 443 at r2 (raw file):

Previously, solongordon wrote…

Do we have an existing mechanism for marking a connection as unavailable? Or would that mean just changing the GRPCDial logic to return an error unless there has been at least one successful heartbeat?

I was thinking the latter: we'd change GRPCDial to return an error until there has been one successful heartbeat. There is some complexity in doing this as we don't want the initial RPCs on a connection to fail so there needs to be some facility for waiting for that initial heartbeat to occur. That facility would need to accept a context so that we can wait with a timeout and also cancel waiting if the context gets cancelled. I'm not quite sure what this would look like.


Comments from Reviewable

@solongordon solongordon force-pushed the check-cluster-id-in-heartbeat branch from f9c09c2 to 993ef2b Compare November 27, 2017 18:09
@solongordon
Contributor Author

Thanks for the race detector details! Thankfully it was a race in the test itself. Should be fixed.


Review status: 9 of 13 files reviewed at latest revision, 8 unresolved discussions.


pkg/rpc/context.go, line 443 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I was thinking the latter: we'd change GRPCDial to return an error until there has been one successful heartbeat. There is some complexity in doing this as we don't want the initial RPCs on a connection to fail so there needs to be some facility for waiting for that initial heartbeat to occur. That facility would need to accept a context so that we can wait with a timeout and also cancel waiting if the context gets cancelled. I'm not quite sure what this would look like.

Got it. I had already made the initial heartbeat synchronous, so I think this was a relatively straightforward change. I just added a bit more bookkeeping to the connection metadata to store whether the heartbeat has ever succeeded. GRPCDial will not hand out a connection if this is false. Let me know if that seems reasonable.


pkg/rpc/heartbeat.go, line 73 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

I realize you haven't made the change yet, but if we do reject incoming connections that have a cluster ID when we don't, we may want to switch these %s formatters to %q.

Done.


pkg/server/server.go, line 940 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Ah, I'm sorry! You're right -- gossip may learn of the cluster ID before we call node.start, but it doesn't get passed up the stack to here until node.start is called, since gossip stores it separately.

The main thing I'm worried about here is that we're opening up the main listener, allowing incoming connections, when we might not have set the clusterID held by rpc.Context yet. That seems like it means we could allow incoming connections that shouldn't actually be allowed. It's also why I'm worried about data races on the ClusterID object -- if incoming connections are allowed starting right here, and incoming connections cause rpc.Context to read from the ClusterID object, then data races on ClusterID seem possible.

There's also the potential for similar issues with outgoing connections being established before ClusterID is set properly. It's ok for gossip or admin RPCs to go out without a ClusterID, but we don't want to be able to send kv/raft/distsql RPCs without a connection whose ID has been properly vetted.

To summarize, I think we need to be more careful here, unless you think I'm missing something. node.start is hairy enough that I'm not sure we can actually be perfect without more significant work, such as doing things a bit differently depending on whether we already have our cluster ID set. It looks like we may want to move some of this into node.start itself -- in particular I'm pretty sure it isn't safe that we call node.initNodeID in node.start before setting the ClusterID on rpc.Context - in certain pathological cases we could hit #15898. Along those lines, we may also need to verify and/or purge existing connections when setting the ClusterID on rpc.Context.

Sorry for the wall of text - hopefully I'm overthinking things and we don't have to add much more complexity. Let me know if you'd like to video chat about any of this.

OK, thanks, this seems more subtle than I realized. I'll spend some more time investigating the different scenarios we need to handle here and follow up with you when I understand it better.


Comments from Reviewable

@solongordon solongordon force-pushed the check-cluster-id-in-heartbeat branch from 993ef2b to 5552231 Compare November 27, 2017 21:41
@petermattis
Collaborator

Review status: 9 of 13 files reviewed at latest revision, 10 unresolved discussions, some commit checks failed.


pkg/rpc/context.go, line 218 at r4 (raw file):

	sync.Once
	conn         *grpc.ClientConn
	validated    atomic.Value // bool. true if the heartbeat has ever succeeded

atomic.Value is general purpose. Since you're only ever storing a boolean here, it would be better/faster for this to be of type int32 and to use atomic.StoreInt32(&validated, 1) to set it and atomic.LoadInt32(&validated) == 1 to check it.

On the other hand, performance isn't a concern here so perhaps this is fine as is.
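For reference, the int32 pattern being described, as a tiny standalone illustration (only the connMeta name comes from the PR; the methods and the reduced field set are invented, and "sync/atomic" is assumed imported):

type connMeta struct {
	validated int32 // set to 1 once a heartbeat has ever succeeded
}

func (m *connMeta) markValidated() { atomic.StoreInt32(&m.validated, 1) }

func (m *connMeta) isValidated() bool { return atomic.LoadInt32(&m.validated) == 1 }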


pkg/rpc/context.go, line 445 at r4 (raw file):

		if meta.dialErr == nil {
			// Run an initial heartbeat to validate the connection.
			ctx.runHeartbeat(meta, target)

This could block for up to defaultHeartbeatInterval (3s) which would also block the caller. Is that problematic for the various callers? I'm thinking of kv/transport.go:grpcTransportFactoryImpl() and the calls in storage/raft_transport.go. Perhaps a GRPCDialContext method should be added and GRPCDial should call GRPCDialContext with context.Background(). We'd need to plumb the context down into runHeartbeat(). Ugh, that doesn't look too pleasant.


Comments from Reviewable

@solongordon
Contributor Author

Review status: 9 of 13 files reviewed at latest revision, 10 unresolved discussions, some commit checks failed.


pkg/rpc/context.go, line 218 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

atomic.Value is general purpose. Since you're only ever storing a boolean here, it would be better/faster for this to be of type int32 and to use atomic.StoreInt32(&validated, 1) to set it and atomic.LoadInt32(&validated) == 1 to check it.

On the other hand, performance isn't a concern here so perhaps this is fine as is.

Thanks, switched to int32.


pkg/rpc/context.go, line 445 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This could block for up to defaultHeartbeatInterval (3s) which would also block the caller. Is that problematic for the various callers? I'm thinking of kv/transport.go:grpcTransportFactoryImpl() and the calls in storage/raft_transport.go. Perhaps a GRPCDialContext method should be added and GRPCDial should call GRPCDialContext with context.Background(). We'd need to plumb the context down into runHeartbeat(). Ugh, that doesn't look too pleasant.

OK, I took a stab at this. The blocking heartbeat is gone. Instead, GRPCDial now returns a wrapper around the connection which acts as a promise. Calling .Connect() on the wrapper is a blocking call which doesn't return until the connection has been heartbeated at least once. This should keep the current GRPCDial contract intact and support the grpcTransportFactoryImpl use case, since it can wait to call Connect() until after it has sorted the connections by health. All the other existing callers just call Connect() immediately.
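Roughly the shape being described, as a sketch with invented details (the exact field names, the errValue wrapper, and the caller-supplied context, which comes up again further down); it assumes imports of context, sync/atomic, google.golang.org/grpc, and github.com/pkg/errors.

// errValue wraps an error so that a nil result can still be stored in an
// atomic.Value.
type errValue struct{ error }

// Connection is what GRPCDial hands out; Connect blocks until the first
// heartbeat attempt has completed (or the context is canceled) and only then
// exposes the underlying gRPC connection.
type Connection struct {
	grpcConn      *grpc.ClientConn
	dialErr       error         // error while dialing; if set, connection is unusable
	heartbeatErr  atomic.Value  // errValue; outcome of the last heartbeat
	validatedOnce chan struct{} // closed after the first heartbeat attempt
}

func (c *Connection) Connect(ctx context.Context) (*grpc.ClientConn, error) {
	if c.dialErr != nil {
		return nil, c.dialErr
	}
	select {
	case <-c.validatedOnce:
	case <-ctx.Done():
		return nil, ctx.Err()
	}
	if v, ok := c.heartbeatErr.Load().(errValue); ok && v.error != nil {
		return nil, errors.Wrap(v.error, "initial connection heartbeat failed")
	}
	return c.grpcConn, nil
}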


pkg/rpc/heartbeat.go, line 51 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Ah, I see. atomic.Value would be an idiomatic way to handle it, but this is probably fine, then.

Now that we have ClusterIDContainer I was able to bring back the more specific fields.


pkg/rpc/heartbeat.go, line 70 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Just to follow up on this with a little evidence, our existing gossip code's ClusterID logic rejects incoming connections that have a cluster ID if the server receiving the connection doesn't have one yet: https://github.com/cockroachdb/cockroach/blob/master/pkg/gossip/server.go#L129

Yup, this does seem like the right thing to do. Unfortunately it breaks a ton of tests since they don't bother to set cluster ID. If it's ok with you, I'll hold off on this for now to avoid bloating this PR further. Now that I'm doing a better job of setting the cluster ID before opening the main listener, it shouldn't be an issue.


pkg/server/server.go, line 940 at r2 (raw file):

Previously, solongordon wrote…

OK, thanks, this seems more subtle than I realized. I'll spend some more time investigating the different scenarios we need to handle here and follow up with you when I understand it better.

This should be in better shape now. I introduced ClusterIDContainer to avoid races, and also so that the cluster ID is available everywhere as soon as possible after it is determined, e.g. by bootstrapping or gossip. I think this significantly reduces the potential for weird races, but please let me know if you still see risks.
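For context, a condensed sketch of what a ClusterIDContainer looks like; the method bodies are illustrative and assume cockroach's syncutil, uuid, util/log, and context packages, not the exact merged pkg/base/cluster_id.go.

// ClusterIDContainer is a shared, mutex-protected holder for the cluster ID.
// It can be set once; setting a different non-nil value afterwards is fatal.
type ClusterIDContainer struct {
	syncutil.Mutex
	clusterID uuid.UUID
}

// Get returns the current cluster ID (uuid.Nil if not yet determined).
func (c *ClusterIDContainer) Get() uuid.UUID {
	c.Lock()
	defer c.Unlock()
	return c.clusterID
}

// Set records the cluster ID, or fatals if a conflicting ID was already set.
func (c *ClusterIDContainer) Set(ctx context.Context, val uuid.UUID) {
	c.Lock()
	defer c.Unlock()
	if c.clusterID == uuid.Nil {
		c.clusterID = val
	} else if c.clusterID != val {
		log.Fatalf(ctx, "different ClusterIDs set: %s, then %s", c.clusterID, val)
	}
}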


pkg/server/server.go, line 1055 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Let's move the conversation above, but the tl;dr is that the race I'm worried about is an incoming connection/heartbeat causing rpc.Context to read ClusterID before/while it's being updated here.

Makes sense. Should be fixed by ClusterIDContainer.


Comments from Reviewable

@solongordon solongordon force-pushed the check-cluster-id-in-heartbeat branch from 5552231 to 8b3c6e9 Compare November 30, 2017 21:14
@solongordon solongordon requested a review from a team November 30, 2017 21:14
@solongordon
Contributor Author

Good question. I didn't realize grpc.ClientConns could survive node restarts. I'll do some testing.


Review status: 3 of 46 files reviewed at latest revision, 13 unresolved discussions, some commit checks failed.


pkg/rpc/context.go, line 232 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Style nit: if you know you're creating a pointer, explicitly create a pointer. s/Connection/&Connection/g.

Done.


pkg/rpc/context.go, line 236 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This isn't necessary. Go guarantees zero initialization of all fields.

D'oh, of course.


pkg/rpc/context.go, line 242 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

The name is a bit strange to me. You can call Connect on a Connection multiple times. Perhaps WaitForInit? Perhaps GRPCConn()?

Yup, I hear that. I'll hold off on renaming to see if @tschottdorf or @a-robinson have preferences.


Comments from Reviewable

@solongordon solongordon force-pushed the check-cluster-id-in-heartbeat branch from 8b3c6e9 to c5157db Compare November 30, 2017 22:19
@petermattis
Collaborator

Review status: 3 of 46 files reviewed at latest revision, 11 unresolved discussions, some commit checks failed.


pkg/rpc/context.go, line 248 at r5 (raw file):

Previously, solongordon wrote…

Should the context be passed in by the caller, or is it better to use Context.masterCtx like GRPCDial does? Not clear to me what the difference would be in practice.

You'll want the caller to pass in a context. Context.masterCtx is tied to the global stopper and won't be canceled unless the node is shut down.


Comments from Reviewable

@solongordon solongordon force-pushed the check-cluster-id-in-heartbeat branch from c5157db to 8673c06 Compare December 4, 2017 16:38
@solongordon
Contributor Author

Review status: 3 of 46 files reviewed at latest revision, 11 unresolved discussions, some commit checks failed.


pkg/rpc/context.go, line 248 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

You'll want the caller to pass in a context. Context.masterCtx is tied to the global stopper and won't be canceled unless the node is shut down.

Done. Note this required a bunch of new context plumbing since GRPCDial previously didn't take a context.


Comments from Reviewable

@petermattis
Collaborator

:lgtm: modulo the naming of rpc.Connection.Connect.


Review status: 3 of 47 files reviewed at latest revision, 11 unresolved discussions, some commit checks failed.


pkg/rpc/context_test.go, line 100 at r6 (raw file):

		}

		if _, err := clientCtx.GRPCDial(remoteAddr).Connect(context.TODO()); err != nil {

I think this should be context.Background(). Ditto for any new use of context.TODO() in tests.


Comments from Reviewable

@solongordon solongordon force-pushed the check-cluster-id-in-heartbeat branch from 8673c06 to 6f80df8 Compare December 4, 2017 19:12
@solongordon
Contributor Author

Review status: 3 of 47 files reviewed at latest revision, 11 unresolved discussions, some commit checks failed.


pkg/rpc/context_test.go, line 100 at r6 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I think this should be context.Background(). Ditto for any new use of context.TODO() in tests.

Done.


Comments from Reviewable

@petermattis
Collaborator

Regarding my concern about internal gRPC reconnections, I think we can use a custom dialer:

// WithDialer returns a DialOption that specifies a function to use for dialing network addresses.
// If FailOnNonTempDialError() is set to true, and an error is returned by f, gRPC checks the error's
// Temporary() method to decide if it should try to reconnect to the network address.
func WithDialer(f func(string, time.Duration) (net.Conn, error)) DialOption {
	return func(o *dialOptions) {
		o.copts.Dialer = func(ctx context.Context, addr string) (net.Conn, error) {
			if deadline, ok := ctx.Deadline(); ok {
				return f(addr, deadline.Sub(time.Now()))
			}
			return f(addr, 0)
		}
	}
}

The idea would be that we have a dialer that only dials an address once and then fails. We'd need to find out where the error is reported and remove the connection from our rpc.Context or mark the connection as unhealthy so that we don't send RPCs on it. Seems doable, but out of scope for this PR.
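A sketch of such a dial-once dialer (the type name and error text are invented); it relies on gRPC's FailOnNonTempDialError option so that the non-temporary error stops reconnect attempts:

// onlyOnceDialer dials its target exactly once; any later redial attempt from
// gRPC fails, so a broken connection surfaces instead of being silently
// re-established behind our backs.
type onlyOnceDialer struct {
	mu     sync.Mutex
	dialed bool
}

func (d *onlyOnceDialer) dial(addr string, timeout time.Duration) (net.Conn, error) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.dialed {
		return nil, errors.New("cannot reuse connection; redial explicitly")
	}
	d.dialed = true
	return net.DialTimeout("tcp", addr, timeout)
}

// Wired up roughly as:
//   d := &onlyOnceDialer{}
//   grpc.Dial(target, grpc.WithDialer(d.dial), grpc.FailOnNonTempDialError(true), ...)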


Review status: 3 of 47 files reviewed at latest revision, 10 unresolved discussions, some commit checks failed.


Comments from Reviewable

@a-robinson
Contributor

a-robinson commented Dec 4, 2017

Yeah, a follow-up issue/PR seems sufficient to handle the reconnection issue.


Reviewed 1 of 9 files at r3, 13 of 42 files at r5, 17 of 29 files at r6, 13 of 13 files at r7.
Review status: all files reviewed at latest revision, 15 unresolved discussions, some commit checks failed.


pkg/base/cluster_id.go, line 32 at r7 (raw file):

	//lint:ignore U1000 this marker prevents by-value copies.
	noCopy util.NoCopy
	syncutil.Mutex

A RWMutex would be beneficial here since this is read often but only written once.


pkg/base/cluster_id.go, line 60 at r7 (raw file):

		c.clusterID = val
		if log.V(2) {
			log.Infof(ctx, "ClusterID set to %d", val)

s/%d/%s/


pkg/base/cluster_id.go, line 63 at r7 (raw file):

		}
	} else if c.clusterID != val {
		log.Fatalf(ctx, "different ClusterIDs set: %d, then %d", c.clusterID, val)

s/%d/%s/


pkg/rpc/context.go, line 242 at r5 (raw file):

Previously, solongordon wrote…

Yup, I hear that. I'll hold off on renaming to see if @tschottdorf or @a-robinson have preferences.

GRPCConn sounds good to me.


pkg/rpc/context.go, line 291 at r7 (raw file):

	stats StatsHandler

	ClusterID base.ClusterIDContainer

It feels off to me that rpc.Context owns this rather than server.Server or server.Node owning it. Feel free to ignore, though.


pkg/rpc/context.go, line 480 at r7 (raw file):

		}
		conn.grpcConn, conn.dialErr = grpc.DialContext(ctx.masterCtx, target, dialOpts...)
		if ctx.GetLocalInternalServerForAddr(target) != nil {

Is it safe to ignore dialErr here? It's a dangerous practice even if it is safe in this particular instance...


pkg/rpc/context.go, line 537 at r7 (raw file):

func (ctx *Context) runHeartbeat(conn *Connection, target string) error {
	defer conn.setValidationDone()

What does this defer statement do for us? I'm a little concerned that it could mark a connection as validated that has never actually been validated (if stopper.Stop() gets called before we run any heartbeats)


pkg/rpc/context_test.go, line 814 at r3 (raw file):

	}
	wg.Wait()
	if _, ok := clientCtx.conns.Load(remoteAddr); ok {

Why did you remove this?


pkg/rpc/heartbeat.go, line 70 at r2 (raw file):

Previously, solongordon wrote…

Yup, this does seem like the right thing to do. Unfortunately it breaks a ton of tests since they don't bother to set cluster ID. If it's ok with you, I'll hold off on this for now to avoid bloating this PR further. Now that I'm doing a better job of setting the cluster ID before opening the main listener, it shouldn't be an issue.

Yup, that's ok with me. Mind filing an issue to make sure we don't forget about it?


pkg/server/node.go, line 559 at r5 (raw file):

Previously, solongordon wrote…

This is a little funky. When I started setting the cluster ID immediately after bootstrap, it broke this check which expected it to still be empty in that case. Checking for NodeID == 0 seems to be equivalent but it's still a bit mysterious. I'm open to other suggestions here.

This looks reasonable to me.


pkg/server/server.go, line 1055 at r2 (raw file):

Previously, solongordon wrote…

Makes sense. Should be fixed by ClusterIDContainer.

This initialization ordering is subtle enough that a comment around the ordering of its initialization seems worthwhile. Can you find a place to add such a comment?


Comments from Reviewable

@tbg
Member

tbg commented Dec 4, 2017

@petermattis that looks like a pretty good solution. Agreed that this PR should land though; it's gotten pretty big. While adding the dialer option is probably doable, adding the test might be a mouthful (though I think the localcluster acceptance tests could do it nicely).

Adding the DialOption would also deal with one oddity of the current code (at least last I checked) which is that heartbeat loops essentially never stop (even if the node goes away).

This PR should land sooner rather than later, so take my (too many) comments with a grain of salt.


Reviewed 1 of 13 files at r1, 3 of 9 files at r3, 14 of 42 files at r5, 29 of 29 files at r6.
Review status: all files reviewed at latest revision, 26 unresolved discussions, some commit checks failed.


pkg/base/cluster_id.go, line 32 at r6 (raw file):

	//lint:ignore U1000 this marker prevents by-value copies.
	noCopy util.NoCopy
	syncutil.Mutex

You don't need noCopy here because you have a Mutex.


pkg/base/cluster_id.go, line 32 at r7 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

A RWMutex would be beneficial here since this is read often but only written once.

Are you sure? I find it hard to argue that one is faster than the other. The critical sections for this mutex are extremely small, so I wonder whether the RW overhead will just make it a pessimization.

I guess if we wanted raw performance, we'd go for https://golang.org/pkg/sync/atomic/#CompareAndSwapPointer and https://golang.org/pkg/sync/atomic/#LoadPointer but likely premature.


pkg/cli/start.go, line 934 at r6 (raw file):

	addr, err := addrWithDefaultHost(serverCfg.AdvertiseAddr)
	if err != nil {
		stopper.Stop(stopperContext(stopper))

Just a warning: once you rebase, you'll see that stopperContext is no more. But I think you'll get an incoming ctx, which is really better suited to your needs anyway.


pkg/rpc/context.go, line 242 at r5 (raw file):

Previously, solongordon wrote…

Yup, I hear that. I'll hold off on renaming to see if @tschottdorf or @a-robinson have preferences.

I actually like Connect because it makes it very explicit that it can block, and semantically we are connecting (it is just that there's an underlying connection cache that makes sure we usually aren't). This is similar to how GRPCDial doesn't actually dial.


pkg/rpc/context.go, line 222 at r6 (raw file):

// error while dialing; if set, connection is unusable


pkg/rpc/context.go, line 223 at r6 (raw file):

// outcome of last heartbeat


pkg/rpc/context.go, line 225 at r6 (raw file):

	heartbeatErr   atomic.Value
	valid          int32         // 1 if heartbeat has ever succeeded, 0 otherwise
	validationDone chan struct{} // Closed after first heartbeat

nit: closed after first heartbeat (for inline comments).


pkg/rpc/context.go, line 254 at r6 (raw file):

	// If connection is invalid, return latest heartbeat error.
	if atomic.LoadInt32(&c.valid) == 0 {

I think this is racy. You could have the following sequence of events:

  • LoadInt32 returns 0
  • heartbeat succeeds; heartbeatErr gets set to nil
  • errors.Wrap(nil, anything) == nil
  • return nil, nil
  • probably an NPE (null pointer exception) trying to use the nil at the caller.

I think you want to fold c.valid into errValue. I've sprinkled comments about that below that might be helpful.
If they work, add a comment to the heartbeatErr field cautioning future coders to not Store to this on more than one goroutine to avoid clobbering (an unlikely problem, but better to warn about it).


pkg/rpc/context.go, line 552 at r6 (raw file):

	// Give the first iteration a wait-free heartbeat attempt.
	heartbeatTimer.Reset(0)

var everSucceeded bool (re: raciness)


pkg/rpc/context.go, line 573 at r6 (raw file):

			cancel()
		}
		conn.heartbeatErr.Store(errValue{err})
everSucceeded = everSucceeded || err == nil
conn.heartbeatErr.Store(errValue{err: err, everSucceeded: everSucceeded})

pkg/rpc/context.go, line 576 at r6 (raw file):

		if err == nil {
			atomic.StoreInt32(&conn.valid, 1)

This can go then.


pkg/rpc/context.go, line 480 at r7 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Is it safe to ignore dialErr here? It's a dangerous practice even if it is safe in this particular instance...

Makes you wonder why we dial in the first place when we know we're going to go local.


pkg/rpc/context.go, line 537 at r7 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

What does this defer statement do for us? I'm a little concerned that it could mark a connection as validated that has never actually been validated (if stopper.Stop() gets called before we run any heartbeats)

I think he just wants to unblock clients when the connection goes away. Agree that it's slightly concerning.


pkg/rpc/heartbeat.go, line 70 at r2 (raw file):

I don't believe there's ever a valid scenario where a client with a cluster ID would connect to a node without a cluster ID

Wouldn't that happen if you let a new node join an existing cluster? Sure, the new node usually reaches out first, but there's nothing that prevents other nodes in the cluster from also connecting to the new one.
Hmm, just saw your other comment about the check in the gossip server. Honestly, I'm not sure this check isn't just unintentional for empty IDs. I for one would rather restrict that check to the case in which both UUIDs are nonempty. There is no reason you shouldn't be able to have a node join a cluster by having it be pointed at by existing nodes and not giving it any outgoing ones by itself. Certainly nonstandard, but it may be desired for example if replacing the first node in a cluster (which is usually pointed at by everyone and points at no one, at least after a naive setup). Feels like the Gossip check could just be removed entirely since there's no point in keeping it if it's identical to the one here.

In this code, we'll need to accept any combination of missing UUIDs anyway, since we need mixed-version compat with 1.1 which doesn't ever send any.


pkg/rpc/heartbeat.proto, line 52 at r6 (raw file):

  optional int64 max_offset_nanos = 4 [(gogoproto.nullable) = false];
  // Cluster ID to prevent connections between nodes in different clusters.
  optional bytes cluster_id = 5 [(gogoproto.nullable) = false,

I know you're just cargo-culting this, but I'm not sure this should be nullable. You'll have interop with old versions that actually don't send the field, so you can distinguish the two. For example, your code could send a trivial UUID to signal that they understand the UUID check but are not bootstrapped yet. Not suggesting you should do that, but it motivates that nullable = true (the default) seems more appropriate.


pkg/server/node.go, line 559 at r5 (raw file):

Previously, solongordon wrote…

This is a little funky. When I started setting the cluster ID immediately after bootstrap, it broke this check which expected it to still be empty in that case. Checking for NodeID == 0 seems to be equivalent but it's still a bit mysterious. I'm open to other suggestions here.

I'd add a comment explaining that there is a scenario in which NodeID == 0 but the cluster ID is already set (and why you set it that early).


pkg/server/server.go, line 940 at r2 (raw file):

Previously, solongordon wrote…

This should be in better shape now. I introduced ClusterIDContainer to avoid races, and also so that the cluster ID is available everywhere as soon as possible after it is determined, e.g. by bootstrapping or gossip. I think this significantly reduces the potential for weird races, but please let me know if you still see risks.

Are there any races that could take place if a node comes up waiting for bootstrap and two different clusters connect to it at about the same time? If not, what will happen? Is our node just going to Fatal (due to setting two different IDs)? Would be good to document the extent of the protection and its shortcomings. I think the ID container is a good place to put it as that's one hop from everything related.


pkg/server/server.go, line 557 at r6 (raw file):

	for _, engine := range engines {
		storeIdent, err := storage.ReadStoreIdent(ctx, engine)
		if storeIdent.ClusterID != uuid.Nil {

Check the error first. If there's an error, nothing to compare. If there is no error, I think you always have a ClusterID? In other words, the check moves just before the bootstrappedEngines = ... line.


Comments from Reviewable

@a-robinson
Contributor

a-robinson commented Dec 4, 2017

Review status: all files reviewed at latest revision, 27 unresolved discussions, some commit checks failed.


pkg/rpc/heartbeat.go, line 70 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I don't believe there's ever a valid scenario where a client with a cluster ID would connect to a node without a cluster ID

Wouldn't that happen if you let a new node join an existing cluster? Sure, the new node usually reaches out first, but there's nothing that prevents other nodes in the cluster from also connecting to the new one.
Hmm, just saw your other comment about the check in the gossip server. Honestly, I'm not sure this check isn't just unintentional for empty IDs. I for one would rather restrict that check to the case in which both UUIDs are nonempty. There is no reason you shouldn't be able to have a node join a cluster by having it be pointed at by existing nodes and not giving it any outgoing ones by itself. Certainly nonstandard, but it may be desired for example if replacing the first node in a cluster (which is usually pointed at by everyone and points at no one, at least after a naive setup). Feels like the Gossip check could just be removed entirely since there's no point in keeping it if it's identical to the one here.

In this code, we'll need to accept any combination of missing UUIDs anyway, since we need mixed-version compat with 1.1 which doesn't ever send any.

I was going to say that I didn't think it was even possible for a node with an unset cluster ID to receive a non-Init RPC after solongordon's work here, but I forgot about 636ec6b. That change has definitely messed with my intuition to the point where I'm not sure anymore. I don't really like the thought that a new node is willing to accept cluster IDs from random IP addresses that weren't in its join list, but I guess --insecure really means insecure, and given that I think you're right about this.


pkg/server/server.go, line 1126 at r7 (raw file):

	s.distSQLServer.Start()

	s.serveMode.set(modeOperational)

Just a suggestion, but we may want to assert that our ClusterID is set before opening up the listener (i.e. log.Fatal if it isn't).


Comments from Reviewable

@solongordon
Contributor Author

Review status: all files reviewed at latest revision, 26 unresolved discussions, some commit checks failed.


pkg/base/cluster_id.go, line 32 at r6 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

You don't need noCopy here because you have a Mutex.

Done.


pkg/base/cluster_id.go, line 60 at r7 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

s/%d/%s/

Done.


pkg/base/cluster_id.go, line 63 at r7 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

s/%d/%s/

Done.


pkg/rpc/context.go, line 222 at r6 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

// error while dialing; if set, connection is unusable

Done.


pkg/rpc/context.go, line 223 at r6 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

// outcome of last heartbeat

Done.


pkg/rpc/context.go, line 225 at r6 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

nit: closed after first heartbeat (for inline comments).

Done.


pkg/rpc/context.go, line 254 at r6 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I think this is racy. You could have the following sequence of events:

  • LoadInt32 returns 0
  • heartbeat succeeds; heartbeatErr gets set to nil
  • errors.Wrap(nil, anything) == nil
  • return nil, nil
  • probably an NPE (null pointer exception) trying to use the nil at the caller.

I think you want to fold c.valid into errValue. I've sprinkled comments about that below that might be helpful.
If they work, add a comment to the heartbeatErr field cautioning future coders to not Store to this on more than one goroutine to avoid clobbering (an unlikely problem, but better to warn about it).

I think I'm handling this by only loading heartbeatErr once and explicitly checking for the nil case. If it got set to nil, I just return the connection. I'll look for a clearer way to express this though.


pkg/rpc/context.go, line 291 at r7 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

It feels off to me that rpc.Context owns this rather than server.Server or server.Node owning it. Feel free to ignore, though.

Yeah, I had the same thought. Pragmatically I think it worked out simpler this way for testing purposes, but I'll revisit.


pkg/rpc/context.go, line 480 at r7 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Makes you wonder why we dial in the first place when we know we're going to go local.

It looks like it's being ignored, but it gets checked when Connect() is called. Probably could use an explanatory comment here.


pkg/rpc/context.go, line 537 at r7 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I think he just wants to unblock clients when the connection goes away. Agree that it's slightly concerning.

Yes, I wanted to make sure Connect() was unblocked if the first heartbeat never happens. Note that we're not marking the connection as valid, just that validation is done. Confusing nomenclature maybe.


pkg/rpc/context_test.go, line 814 at r3 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Why did you remove this?

Based on Peter's suggestion the connection is no longer removed from the conns pool if validation fails, just marked as invalid.


pkg/server/server.go, line 557 at r6 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Check the error first. If there's an error, nothing to compare. If there is no error, I think you always have a ClusterID? In other word, the check moves just before the bootstrappedEngines = ... line.

Done.


Comments from Reviewable

@solongordon solongordon force-pushed the check-cluster-id-in-heartbeat branch 2 times, most recently from 335115b to a826bfb Compare December 5, 2017 18:55
@solongordon
Contributor Author

Review status: 44 of 47 files reviewed at latest revision, 26 unresolved discussions, some commit checks failed.


pkg/cli/start.go, line 934 at r6 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Just a warning: once you rebase, you'll see that stopperContext is no more. But I think you'll get an incoming ctx, which is really better suited to your needs anyway.

Done.


pkg/rpc/context.go, line 242 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I actually like Connect because it makes it very explicit that it can block, and semantically we are connecting (it is just that there's an underlying connection cache that makes sure we usually aren't). This is similar to how GRPCDial doesn't actually dial.

I don't have a strong preference and GRPCConn is winning 2-to-1, so I'll go with that unless Tobi convinces anyone to change their vote.


pkg/rpc/context.go, line 254 at r6 (raw file):

Previously, solongordon wrote…

I think I'm handling this by only loading heartbeatErr once and explicitly checking for the nil case. If it got set to nil, I just return the connection. I'll look for a clearer way to express this though.

OK, I went with the everSucceeded approach like you suggested (and renamed heartbeatErr to heartbeatResult.)
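Concretely, something along these lines (a sketch; the names follow the discussion above rather than the merged code):

// heartbeatResult is the value the heartbeat loop stores into an atomic.Value,
// so readers observe the latest error and the ever-succeeded bit in a single
// load and the race described above cannot occur.
type heartbeatResult struct {
	everSucceeded bool  // true once any heartbeat on this connection has succeeded
	err           error // outcome of the most recent heartbeat
}

// In the heartbeat loop (the only writer):
//   everSucceeded = everSucceeded || err == nil
//   conn.heartbeatResult.Store(heartbeatResult{everSucceeded: everSucceeded, err: err})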


pkg/rpc/context.go, line 552 at r6 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

var everSucceeded bool (re: raciness)

Done.


pkg/rpc/context.go, line 573 at r6 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…
everSucceeded = everSucceeded || err == nil
conn.heartbeatErr.Store(errValue{err: err, everSucceeded: everSucceeded})

Done.


pkg/rpc/context.go, line 576 at r6 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

This can go then.

Done.


pkg/rpc/heartbeat.proto, line 52 at r6 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I know you're just cargo-culting this, but I'm not sure this should be nullable. You'll have interop with old versions that actually don't send the field, so you can distinguish the two. For example, your code could send a trivial UUID to signal that they understand the UUID check but are not bootstrapped yet. Not suggesting you should do that, but it motivates that nullable = true (the default) seems more appropriate.

Sure, makes sense. It feels a little funny since the rest of this proto has nullable = false but I definitely see the logic.


pkg/server/node.go, line 559 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I'd add a comment explaining that there is a scenario in which NodeID == 0 but the cluster ID is already set (and why you set it that early).

I realized that the cluster ID checks here are actually redundant now since I'm doing them in inspectEngines. I changed this to only care about node ID and noted that cluster ID is checked elsewhere.


pkg/server/server.go, line 1126 at r7 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Just a suggestion, but we may want to assert that our ClusterID is set before opening up the listener (i.e. log.Fatal if it isn't).

Good idea. I added this check right after node.start().
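Something like the following right after node.start() returns; the exact placement and wording in the PR may differ.

// The cluster ID must be known by now, either from a bootstrapped store or
// via gossip, before we open the main listener to regular traffic.
if s.rpcContext.ClusterID.Get() == uuid.Nil {
	log.Fatal(ctx, "programming error: cluster ID is not set after node start")
}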


Comments from Reviewable

@solongordon
Contributor Author

Review status: 19 of 47 files reviewed at latest revision, 24 unresolved discussions, some commit checks failed.


pkg/server/server.go, line 940 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Are there any races that could take place if a node comes up waiting for bootstrap and two different clusters connect to it at about the same time? If not, what will happen? Is our node just going to Fatal (due to setting two different IDs)? Would be good to document the extent of the protection and its shortcomings. I think the ID container is a good place to put it as that's one hop from everything related.

To be honest, the gossip flow is confusing enough that it's hard for me to tell. I tried to force this race in a couple of ways, like specifying two different clusters in --join or sending the init signal to two nodes simultaneously. In both cases the race was prevented by the existing cluster ID checks in gossip/server.go.


Comments from Reviewable

@a-robinson
Contributor

:lgtm_strong:


Reviewed 18 of 28 files at r8.
Review status: 37 of 47 files reviewed at latest revision, 24 unresolved discussions, some commit checks failed.


pkg/rpc/context.go, line 242 at r5 (raw file):

Previously, solongordon wrote…

I don't have a strong preference and GRPCConn is winning 2-to-1, so I'll go with that unless Tobi convinced anyone to change their vote.

Leaving it as Connect is fine.


pkg/rpc/context.go, line 537 at r7 (raw file):

Previously, solongordon wrote…

Yes, I wanted to make sure Connect() was unblocked if the first heartbeat never happens. Note that we're not marking the connection as valid, just that validation is done. Confusing nomenclature maybe.

Couldn't Connect() just also wait on the stopper?


Comments from Reviewable

@solongordon solongordon force-pushed the check-cluster-id-in-heartbeat branch from a826bfb to 4e4d224 Compare December 6, 2017 18:30
@solongordon
Contributor Author

Review status: 37 of 47 files reviewed at latest revision, 24 unresolved discussions, some commit checks failed.


pkg/rpc/context.go, line 537 at r7 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Couldn't Connect() just also wait on the stopper?

Done.


pkg/server/server.go, line 1055 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

This initialization ordering is subtle enough that a comment around the ordering of its initialization seems worthwhile. Can you find a place to add such a comment?

I added a comment to cluster_id.go about how the cluster ID is determined. Hopefully that helps.


Comments from Reviewable

In the heartbeat, nodes now share their cluster IDs and check that they
match. We allow for missing cluster IDs, since new nodes do not have a
cluster ID until they obtain one via gossip, but conflicting IDs will
result in a heartbeat error.

In addition, connections are now not added to the connection pool until
the heartbeat succeeds. This allows us to fail fast when a node attempts
to join the wrong cluster.

Refers cockroachdb#18058.

Release note: None
@solongordon solongordon force-pushed the check-cluster-id-in-heartbeat branch from 4e4d224 to 78d1ae6 Compare December 6, 2017 19:00
@solongordon solongordon merged commit 8fa8d5d into cockroachdb:master Dec 6, 2017
@solongordon solongordon deleted the check-cluster-id-in-heartbeat branch December 6, 2017 20:09
@tbg
Member

tbg commented Dec 6, 2017

Congrats 🛩
