server: introduce join rpc for node id allocation #52526
Conversation
Force-pushed from f829883 to 832e30a.
Force-pushed from 832e30a to 0d193a7.
Force-pushed from c3c3459 to 7410cb1.
@tbg, this is not complete yet but should be ready for a sanity check now. The remaining TODOs (including testing) are relatively minor. I'm looking to add on top of this base commit, in this PR, changes to disallow mismatched version nodes from joining. In a separate PR I'll patch on the changes to disallow decomm'ed nodes from joining (which is a bit more involved, and follows the ideas discussed in the context of #48843).
I reviewed all this in detail except the changes to server/init.go. All that I reviewed looks good (with nits).
For server/init.go I'm still a bit 🔬 🐶. Probably I can review this after we chat and you explain to me what's going on in prose.
Reviewed 14 of 14 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @tbg)
pkg/cli/start.go, line 673 at r1 (raw file):
if waitForInit { log.Shout(ctx, log.Severity_INFO, "initial startup completed.\n"+
Was there a reason to change this?
pkg/cmd/roachtest/decommission.go, line 323 at r1 (raw file):
// XXX: Hack, to see node come into liveness table. We should do this as // part of the connect rpc.
reminder to remove this before the PR gets merged.
pkg/gossip/client.go, line 278 at r1 (raw file):
// being done by an incoming client, either because an outgoing // matches an incoming or the client is connecting to itself. if nodeID := g.NodeID.Get(); nodeID != 0 && nodeID == c.peerID {
How do you plan to detect loopback connections when the nodes don't have an ID yet?
pkg/server/init.go, line 54 at r1 (raw file):
// // TODO(irfansharif): Remove this in 21.1. var ErrJoinRPCUnimplemented = fmt.Errorf("node does not implement join rpc")
"rpc" in capitals
pkg/server/init.go, line 107 at r1 (raw file):
// propagates the cluster ID to the rest of the nodes). // // TODO(irfansharif): This last sentence is not yet true. Update this list
Second sentence has two verbs.
pkg/server/init.go, line 265 at r1 (raw file):
} log.Infof(ctx, "**** cluster %s has been created", state.clusterID)
Here and below: drop the **** - we don't do magic characters. (If you need markers for your own debugging, add additional log lines with e.g. log.Warningf(ctx, "XXX"), which you'll remove before merging the PR.)
pkg/server/server.go, line 1021 at r1 (raw file):
// should represent the general startup operation. // // XXX: Write a summary. What are _all_ the components I care about?
if you need help we can chat
(reminder to remove the comment before the PR merges)
pkg/server/server.go, line 1301 at r1 (raw file):
// We need gossip to get spun up before the init server (which internally // makes use of KV to allocate node IDs_. In an ideal world we'd be able to
stray _
pkg/server/server.go, line 1307 at r1 (raw file):
// example: // // In a two node cluster where n1 is started+bootstrapped, when n2
id -> ID throughout the comment
pkg/server/server.go, line 1394 at r1 (raw file):
// prevent bugs during further refactors). if s.rpcContext.ClusterID.Get() == uuid.Nil { return errors.New("programming error: expected cluster id to be populated in rpc context")
New("programming error: ...
-> AssertionFailedf("...
pkg/server/server.go, line 1413 at r1 (raw file):
// when a quorum of nodes has gone fully operational. _ = s.stopper.RunAsyncTask(ctx, "connect-gossip", func(ctx context.Context) { log.Infof(ctx, "connecting to gossip network to verify cluster id %q", state.clusterID)
id in uppercase.
pkg/server/serverpb/init.proto, line 29 at r1 (raw file):
// TODO(irfansharif): Use this field to provide the client's address so that // the server is able to reach back to it, setting up bidirectional network // links.
Add a comment that the field should be populated with the advertised address if different from the listen address.
pkg/server/serverpb/init.proto, line 47 at r1 (raw file):
service Init { // Bootstrap an uninitialized cluster (inter-node links set up through the
I don't understand the phrasing of the comment. Can you perhaps use more words?
This looks pretty good, though there is one major change that I suggested (in a few places, sorry about being a broken record): the join RPC should live on *node. It has no business being served by the early stages of startup since it requires a fully functional server.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif)
pkg/server/init.go, line 100 at r1 (raw file):
// We hold onto a *kv.DB object to assign node IDs through the Join RPC. db *kv.DB
I'm not sure why this is here. The initServer should only be used in the early phase of startup (i.e. before we start the *node and, down the road, ideally before connecting gossip). In that early phase, we can't reliably use the *kv.DB. I expected the join RPC to live on InternalServer (i.e. *node) just like most other KV endpoints.
pkg/server/init.go, line 113 at r1 (raw file):
// it. We'll need to make sure that each server loops through its join list // at least once so that the "links" are setup. We don't want get // insta-bootstrapped and never set up those links.
If we have a connected graph, I don't think we need this? In a connected graph, for any node A and B, there is a path from A to B (i.e. A joins X joins .... joins Z joins B) for any A and B. The nodes that are directly adjacent to initialized nodes can join it and become themselves initialized. This means that in a finite number of iterations all nodes have joined.
It's more weakly connected graphs that are tricky. If n1 joins n2 joins n3 and we initialize n1, then naively nothing will happen. Even with your proposed fix (n1 tries to reach out to n2 at least once) we might be in trouble because what if n2 is not reachable when n1 tries that? It basically has to keep doing this forever and forever. Nodes can also swap addresses; an n2 that was earlier reached may now be a new node in need of incoming contact in order to be able to join. So to really fix this, we have to reach out to everyone ~periodically. Note that not even gossip does that! The above example with Gossip will "usually" work because n1 will establish outgoing Gossip connections, will reach n2, and since n2 has Gossip running n1 learns about n3, etc (though n2 would also directly Gossip to n3). However, this is just the example that works. Nodes don't generally connect to every other node; in a very large cluster n1 might be perfectly happy with its gossip connectivity and would never reach out to n2 or n3, which would perpetually stay isolated.
We have "gossip bootstrap persistence" to hack around our lack of clarity here - nodes remember and auto-load all of the recent addresses observed in the cluster. That too is only a best effort mechanism, as all of these addresses might have changed again by the time they are used.
I think when it comes down to it, at the time of writing, the real requirement that we have in order to have a converging network is in fact a connected graph. This is fine to work with if you have a load balancer, but it's awkward for manual deployments - we don't want to change all nodes' join flags whenever a new node joins. So our guidance has been to point everyone at a "few" common nodes to make it "highly likely" that convergence will happen (since everyone will connect to the same few nodes).
This all seems unsatisfying, but there's not enough pain to really do much about it; in practice the band-aids we have work rather well: manual deployments are mostly static (and follow the "everyone joins n1-n3" pattern), so they work well.
I would be inclined to say that it is fine that, when a node is joining a cluster, it needs to directly be able to reach one node that has already joined (or will eventually join) the cluster. This is intuitively easy to understand - a node that isn't pointed at the existing cluster can't quite be added to it.
pkg/server/init.go, line 205 at r1 (raw file):
// easy. // // TODO(tbg): give this a KV client and thus initialize at least one store in
Isn't this TODO mostly done now, mod the migration?
pkg/server/init.go, line 268 at r1 (raw file):
log.Infof(ctx, "**** allocated node id %d for self", state.nodeID) s.inspectState.clusterID = state.clusterID
data race? You're not holding the mutex, and Bootstrap reads this for NeedsInit.
pkg/server/init.go, line 290 at r1 (raw file):
s.inspectState.clusterID = state.clusterID // Ensure we're draining out join attempt. cancelJoin()
why do we have to cancel it? It just sent on the channel, so it should know it's done and shut down?
pkg/server/init.go, line 308 at r1 (raw file):
// support). Commence startup; the Node will realize it's short a // NodeID and will request one. clusterID, err := gossip.GetClusterID()
Don't shadow the package name here.
pkg/server/init.go, line 328 at r1 (raw file):
error attempting to join: %v
pkg/server/init.go, line 382 at r1 (raw file):
// record here so as to no longer have to worry about the liveness record not // existing for a given node. func (s *initServer) Join(
As mentioned elsewhere, it seems that this lives in the wrong place. Join RPCs are served by full-fledged KV nodes, not by the initServer. The initServer should not do anything any more once it has handed out a NodeID. By extension, the initServer should not have a *kv.DB.
pkg/server/init.go, line 409 at r1 (raw file):
func (s *initServer) startJoinLoop(ctx context.Context, stopper *stop.Stopper) error { dialOpts, err := s.rpcContext.GRPCDialOptions()
Can you try handing only what's needed from rpcContext instead of the whole thing? rpc.Context is large and we do make changes in it, so it is better to not pull it into the part of the startup sequence that is hard to migrate. You don't need more than the dial options, right? Those are fair game.
pkg/server/init.go, line 459 at r1 (raw file):
func (s *initServer) attemptJoin( ctx context.Context, addr string, conn *grpc.ClientConn, ) (success bool, err error) {
with my comment below on unconditionally retrying, this becomes just error, and the special casing to stop trying is at the caller.
pkg/server/init.go, line 462 at r1 (raw file):
initClient := serverpb.NewInitClient(conn) req := &serverpb.JoinNodeRequest{ MinSupportedVersion: &clusterversion.TestingBinaryMinSupportedVersion,
testing?
pkg/server/init.go, line 467 at r1 (raw file):
resp, err := initClient.Join(ctx, req) if err != nil { if strings.Contains(err.Error(), ErrNodeUninitialized.Error()) {
Please use a structured grpc error here: err := status.Error(codes.<SOMETHING>, "can not serve this RPC yet"), and if status.Code(err) == codes.<SOMETHING> { /* fall back to gossip */ }
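For reference, a compileable sketch of that pattern with an illustrative status code and message (not the actual CRDB error or package):

package sketch

import (
	"context"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// errNotReady is the structured error the server would return while it cannot
// serve the RPC yet, instead of a plain error whose text the client matches on.
var errNotReady = status.Error(codes.Unavailable, "cannot serve this RPC yet")

// joinOrFallback branches on the gRPC status code rather than on
// strings.Contains over the error text.
func joinOrFallback(ctx context.Context, join, fallback func(context.Context) error) error {
	err := join(ctx)
	if status.Code(err) == codes.Unavailable {
		return fallback(ctx) // e.g. fall back to the gossip-based path
	}
	return err // nil on success (status.Code(nil) == codes.OK)
}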
pkg/server/init.go, line 472 at r1 (raw file):
} if grpcutil.ConnectionRefusedRe.MatchString(err.Error()) {
Do we need this? For general errors, we'll always retry, won't we? We certainly will try the other addresses first, but when we wrap around we would give this one another shot. Consequently I don't see why we need to try to suss out this particular error.
pkg/server/init.go, line 482 at r1 (raw file):
// mechanism for the clusterID. if code := status.Code(errors.Cause(err)); code == codes.Unimplemented && strings.Contains(err.Error(), `unknown method Join`) {
why do you need this strings.Contains?
pkg/server/init.go, line 526 at r1 (raw file):
// only the fields needed by the init server. type initServerCfg struct { wrapped Config
Hmm, this pattern is a little odd. You still need to populate "just the right things" in wrapped to get a working initServerCfg. Have you considered something like the following?
type initServerCfg struct {
	bootstrapVersion roachpb.Version
	// ...
}
newInitServerCfg(cfg Config) initServerCfg {
	return initServerCfg{bootstrapVersion: ...}
}
That way in tests you get to work with a small struct. Besides, initServer can access wrapped so it's all a bit dubious as is.
pkg/server/server.go, line 1302 at r1 (raw file):
// We need gossip to get spun up before the init server (which internally // makes use of KV to allocate node IDs_. In an ideal world we'd be able to // only spin up the very small subset of KV we need to allocate node IDs (or
This is an artifact of having put the Join RPC on the initServer. If you move it to *node as I suggested in another comment, this problem goes away. I think you also don't have to have a special return error for the "not initialized yet" case, as the RPC will simply not be allowed yet at that point. (It should not be in the allowlist).
pkg/server/serverpb/init.proto, line 24 at r1 (raw file):
// MinimumSupportedVersion is. If it's not compatible with the rest of the // cluster, the join attempt is refused. message JoinNodeRequest {
Assuming I'm not wrong about having this live on *node, it should move closer to the other stuff from service Internal.
pkg/server/serverpb/init.proto, line 27 at r1 (raw file):
roachpb.Version min_supported_version = 1; // TODO(irfansharif): Use this field to provide the client's address so that
As discussed elsewhere, this may not be worth doing.
pkg/server/serverpb/init.proto, line 40 at r1 (raw file):
// gossip to bump to for us). // TODO(irfansharif): We should use this RPC to also generate store IDs, instead // of having each node do it for itself after being handed out a node ID.
For the reasons discussed in one of the comments you touched in this PR, we can't generally remove the need for nodes to allocate their own nodeIDs. For that reason my preference would be to hand back exactly one NodeID here (that for the first store). This makes sure that once we test with more than one store, we generically exercise both paths.
Force-pushed from 3c5fff3 to 22365db.
I've addressed most of the comments here, mind taking another quick look @tbg? I'd normally only ask after it was a bit more polished but I'm a bit pressed for time this week, and am really hoping to land this pre-stability to make 21.1 code a bit easier. I might have to solicit another review before end of day. I'll have another commit I'm going to push to this PR after, building atop #53113, that tests all the behaviors we're interested in (including in mixed-version clusters).
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @knz, and @tbg)
pkg/cli/start.go, line 673 at r1 (raw file):
Previously, knz (kena) wrote…
Was there a reason to change this?
Only superficially; I've been staring at startup logs a fair amount while working on this and wanted this log to look similar to other logging that happens during startup. Some of the other logging changes I made in this PR were motivated similarly by wanting to de-dup/simplify what we logged during startup.
pkg/cmd/roachtest/decommission.go, line 323 at r1 (raw file):
Previously, knz (kena) wrote…
reminder to remove this before the PR gets merged.
Yup, I use these XXX tags as markers for myself when working on something. I somehow got my IDE to spit out a list of these on demand.
pkg/server/init.go, line 54 at r1 (raw file):
Previously, knz (kena) wrote…
"rpc" in capitals
Done.
pkg/server/init.go, line 100 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I'm not sure why this is here. The initServer should only be used in the early phase of startup (i.e. before we start the *node and, down the road, ideally before connecting gossip). In that early phase, we can't reliably use the *kv.DB. I expected the join RPC to live on InternalServer (i.e. *node) just like most other KV endpoints.
Done.
pkg/server/init.go, line 107 at r1 (raw file):
Previously, knz (kena) wrote…
Second sentence has two verbs.
The comment was removed entirely; we're not going to do it anymore.
pkg/server/init.go, line 113 at r1 (raw file):
I would be inclined to say that it is fine that, when a node is joining a cluster, it needs to directly be able to reach one node that has already joined (or will eventually join) the cluster. This is intuitively easy to understand - a node that isn't pointed at the existing cluster can't quite be added to it.
I think I'm happy with this conclusion. It got pretty unwieldy trying to re-implement bidirectional gossip through RPCs.
pkg/server/init.go, line 205 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Isn't this TODO mostly done now, mod the migration?
I still need to seed the first store ID through the join RPC, but I've moved this TODO elsewhere.
pkg/server/init.go, line 265 at r1 (raw file):
Previously, knz (kena) wrote…
Here and below: drop the **** - we don't do magic characters. (If you need markers for your own debugging, add additional log lines with e.g. log.Warningf(ctx, "XXX"), which you'll remove before merging the PR.)
Done. The "****" characters were there from before, and I figured they were to to call out bootstrap related logging (and that's the convention I followed). I've now removed the additions I made.
pkg/server/init.go, line 268 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
data race? You're not holding the mutex, and Bootstrap reads this for NeedsInit.
Done (was leftover as a TODO I wanted to address in this PR).
pkg/server/init.go, line 290 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
why do we have to cancel it? It just sent on the channel, so it should know it's done and shut down?
Done.
pkg/server/init.go, line 308 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Don't shadow the package name here.
Done.
pkg/server/init.go, line 328 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
error attempting to join: %v
Done.
pkg/server/init.go, line 382 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
As mentioned elsewhere, it seems that this lives in the wrong place. Join RPCs are served by full-fledged KV nodes, not by the initServer. The initServer should not do anything any more once it has handed out a NodeID. By extension, the initServer should not have a *kv.DB.
Done.
pkg/server/init.go, line 409 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Can you try handing only what's needed from rpcContext instead of the whole thing? rpc.Context is large and we do make changes in it, so it is better to not pull it into the part of the startup sequence that is hard to migrate. You don't need more than the dial options, right? Those are fair game.
Done.
pkg/server/init.go, line 459 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
with my comment below on unconditionally retrying, this becomes just error, and the special casing to stop trying is at the caller.
Done.
pkg/server/init.go, line 462 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
testing?
Done.
pkg/server/init.go, line 467 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Please use a structured grpc error here: err := status.Error(codes.<SOMETHING>, "can not serve this RPC yet"), and if status.Code(err) == codes.<SOMETHING> { /* fall back to gossip */ }
This was removed entirely.
pkg/server/init.go, line 472 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Do we need this? For general errors, we'll always retry, won't we? We certainly will try the other addresses first, but when we wrap around we would give this one another shot. Consequently I don't see why we need to try to suss out this particular error.
Do we want to blindly retry for general errors? The only instance where I know I'd want to retry is when the server node is not yet available; what are the other possible error conditions that merit retrying?
pkg/server/init.go, line 482 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
why do you need this
strings.Contains
?
Done.
pkg/server/init.go, line 526 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Hmm, this pattern is a little odd. You still need to populate "just the right things" in wrapped to get a working initServerCfg. Have you considered something like type initServerCfg struct { bootstrapVersion roachpb.Version /* ... */ } and newInitServerCfg(cfg Config) initServerCfg { return initServerCfg{bootstrapVersion: ...} }?
That way in tests you get to work with a small struct. Besides, initServer can access wrapped so it's all a bit dubious as is.
Done. Yeah, I was trying this new pattern to see how it would look, exposing just the subset of the configs we're interested in, but we might as well copy them all over.
pkg/server/server.go, line 1301 at r1 (raw file):
Previously, knz (kena) wrote…
stray _
Done (woops).
pkg/server/server.go, line 1302 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This is an artifact of having put the Join RPC on the initServer. If you move it to *node as I suggested in another comment, this problem goes away. I think you also don't have to have a special return error for the "not initialized yet" case, as the RPC will simply not be allowed yet at that point. (It should not be in the allowlist).
Moved the Join RPC off of initServer, but we're going to have to maintain this code path for 20.2 to be compatible with 20.1 servers. We can defer gossip start to down below in 21.1. Updated the comment to say as much.
pkg/server/server.go, line 1307 at r1 (raw file):
Previously, knz (kena) wrote…
id -> ID throughout the comment
Done.
pkg/server/server.go, line 1394 at r1 (raw file):
Previously, knz (kena) wrote…
New("programming error: ...
->AssertionFailedf("...
Done.
pkg/server/server.go, line 1413 at r1 (raw file):
Previously, knz (kena) wrote…
id in uppercase.
Done.
pkg/server/serverpb/init.proto, line 24 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Assuming I'm not wrong about having this live on *node, it should move closer to the other stuff from service Internal.
Done.
pkg/server/serverpb/init.proto, line 27 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
As discussed elsewhere, this may not be worth doing.
Done. Happy to avoid work when possible!
pkg/server/serverpb/init.proto, line 29 at r1 (raw file):
Previously, knz (kena) wrote…
Add a comment that the field should be populated with the advertised address if different from the listen address.
This will no longer be applicable.
pkg/server/serverpb/init.proto, line 40 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
For the reasons discussed in one of the comments you touched in this PR, we can't generally remove the need for nodes to allocate their own nodeIDs. For that reason my preference would be to hand back exactly one NodeID here (that for the first store). This makes sure that once we test with more than one store, we generically exercise both paths.
For now I've kept this PR focused on allocating the node ID, leaving the store ID generation untouched. I'll follow up with another doing that.
pkg/server/serverpb/init.proto, line 47 at r1 (raw file):
Previously, knz (kena) wrote…
I don't understand the phrasing of the comment. Can you perhaps use more words?
Done. (Sorry! That was pretty terse.)
Sure! This looks mostly good. I left lots of comments but the real work remaining here is to clarify the story of what comes next. Maybe I'm wrong, but it seemed like you were perhaps assuming that nodes would always call into Join when they start (but it's not how the code is written). Or perhaps they don't need to and the version protection is still fine (since, once we've reached our goal, versions can only be changed when all nodes ack them)? There's a good chance I've since paged out a bit too much and you have to remind me of what the plan is, but either way it will be good to dump it in a TODO before we merge to avoid surprises along the way.
I'm very excited that on the "new master branch" we will be allowed to start gossip later, and that we will always have a NodeID (and StoreID, I hope you will get to that as well! Seems like a small step from this PR).
Reviewed 7 of 14 files at r1, 7 of 7 files at r2, 18 of 18 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @knz)
pkg/gossip/client.go, line 278 at r1 (raw file):
Previously, knz (kena) wrote…
How do you plan to detect loopback connections when the nodes don't have an ID yet?
What made this change necessary, btw? Were you seeing nodes kill their gossip connections when they both didn't have a nodeID yet?
It's nice to know that, a cycle in from this PR, we won't ever use Gossip without NodeIDs any more.
pkg/roachpb/api.proto, line 2199 at r3 (raw file):
} // JoinNodeRequest is used to specify to the server node what the client's
This isn't really what it's used for, no? It primarily requests a NodeID and StoreID.
In the end, we will have two RPCs - Join (which needs to be served by a live KV layer) and ... Connect/PreJoin (?), which checks against the node-local blocklist (decommissioned nodes).
I think that ultimately the startup sequence will be that nodes will always use Connect against another node very early and then use Join (if they don't have a NodeID yet). This would make it more attractive to run the version check in Connect (since it's earlier). In fact I don't think we can do it in Join, since that is only served on nodes that have KV running (so if you restart a fully down cluster, nodes would never be able to serve Join to each other).
I think we need to do this in Connect (with the caveat that a node will only serve Connect after it itself has made it past its outgoing Connect call, to ensure the cluster won't bump the version without notifying the client) and then we have to check it again during Join if we call it (as only that adds the new nodeID to the liveness table, making sure that version bumps won't happen in a racy way). In practice, I think that means that Join should also run whatever code backs Connect internally (meaning it needs to have the min_supported_version field). So - this code is fine, but maybe more accidentally than intentionally? Brain dump a comment here so we can merge this. No need to implement this fully in this PR, I am fine with TODOs, but we should know the definite path forward.
pkg/roachpb/api.proto, line 2210 at r3 (raw file):
// node id was allocated to it. // // TODO(irfansharif): Think about the semantics for how store IDs get allocated.
+1 to exactly one
pkg/roachpb/api.proto, line 2214 at r3 (raw file):
// and entrust the node itself to allocate IDs for the remaining stores. // // TODO(irfansharif): We should use this RPC to tell us the right cluster
+1
pkg/rpc/context.go, line 511 at r3 (raw file):
ctx context.Context, req *roachpb.JoinNodeRequest, _ ...grpc.CallOption, ) (*roachpb.JoinNodeResponse, error) { return a.InternalServer.Join(ctx, req)
We could refuse those - a node should never join itself. I'm not sure it's worth breaking the pattern of this code though; let's leave as is.
pkg/server/init.go, line 265 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Done. The "****" characters were there from before, and I figured they were to to call out bootstrap related logging (and that's the convention I followed). I've now removed the additions I made.
Might just remove the ***** while you're here. Raphael is adding some linters.
pkg/server/init.go, line 472 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Do we want to blindly retry for general errors? Only instance I know I'd want to retry is when the server node is not yet available; what are the other possible error conditions that merit retrying?
I think we would blindly retry. The node is trying to join the cluster, it will need to reach a node at some point, so it should go around its join list and try (with backoff) until something works. This is similar to how Gossip will try forever to connect. It won't give up.
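To make the shape of that loop concrete, a minimal sketch (tryJoin here is an illustrative stand-in, not actual CRDB code; the kvtenantccl snippet quoted further down is the real-world template):

package sketch

import (
	"context"
	"time"
)

// retryJoin cycles through the join list with a fixed backoff until one
// address answers, mirroring how gossip keeps trying to connect forever.
func retryJoin(ctx context.Context, addrs []string, tryJoin func(ctx context.Context, addr string) error) error {
	for {
		for _, addr := range addrs {
			if err := tryJoin(ctx, addr); err == nil {
				return nil
			}
			// Any error: real code would log and continue; keep cycling here.
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second): // fixed backoff for the sketch
		}
	}
}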
pkg/server/init.go, line 95 at r3 (raw file):
dialOpts []grpc.DialOption, inspectState *initDiskState, config initServerCfg,
why aren't the dial opts in the config?
pkg/server/init.go, line 159 at r3 (raw file):
// NeedsInitLocked returns true if (and only if) none if the engines are // initialized. In this case, server startup is blocked until either an
s/server startup/ServeAndWait/
pkg/server/init.go, line 167 at r3 (raw file):
} // ServeAndWait waits until the server is ready to bootstrap. In the common case
s/is ready to bootstrap/is initialized, that is, has a NodeID and has permission to join the cluster/
pkg/server/init.go, line 170 at r3 (raw file):
// of restarting an existing node, this immediately returns. When starting with // a blank slate (i.e. only empty engines), it waits for incoming Bootstrap // request or for a successful Join RPC, whichever happens earlier. See [1].
add "outgoing"
pkg/server/init.go, line 173 at r3 (raw file):
// // The returned initState reflects a bootstrapped cluster (i.e. it has a cluster // ID and a node ID for this server). See [2].
and that initDiskState within should also correspond to what you would get if you reloaded from disk (not true right now).
pkg/server/init.go, line 196 at r3 (raw file):
// If already bootstrapped, return early. s.mu.Lock() if !s.NeedsInitLocked() {
should not be exported
pkg/server/init.go, line 241 at r3 (raw file):
// and it had no chance of joining anywhere (since we are starting the new cluster and are not serving Join yet).
pkg/server/init.go, line 256 at r3 (raw file):
// // [1]: See the top-level comment in pkg/clusterversion to make // sense of the many versions of...versions.
of ...versions.
pkg/server/init.go, line 265 at r3 (raw file):
s.mu.Lock() s.mu.inspectState.clusterID = state.clusterID
why not *s.mu.inspectState = state.initDiskState?
pkg/server/init.go, line 283 at r3 (raw file):
} log.Infof(ctx, "**** joined cluster %s through join rpc", state.clusterID)
ditto
pkg/server/init.go, line 287 at r3 (raw file):
s.mu.Lock() s.mu.inspectState.clusterID = state.clusterID
Here too, can we avoid manually messing with inspectState? I think it would be better if attemptJoin wrote this to disk in the first place so that we're not spending time with an initDiskState that is not actually consistent with what's on disk. (Which of course means you have to bootstrap the first store there). I'm ok with deferring this to the next PR but the plan has to be solid and written out here.
pkg/server/init.go, line 292 at r3 (raw file):
return state, nil case <-gossipConnectedCh: // Ensure we're draining out the join attempt.
We should also have the join rpc return an error (at the server) if the cluster version has not been bumped yet. This will help avoid races in mixed version clusters where the join RPC manages to reach a new node (and gets a nodeID) but, at the same time, gossip connects and we throw the nodeID away (which will then orphan the nodeID, which will confuse long-running migrations, etc).
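A hedged sketch of what that server-side guard could look like - the versionGateActive hook is a hypothetical stand-in for the real cluster-version check; only the gRPC status plumbing is standard:

package sketch

import (
	"context"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// joinGuard is a toy stand-in for whatever ends up serving the Join RPC.
// versionGateActive is a hypothetical hook answering "has the cluster version
// enabling the join RPC been bumped on this node yet?".
type joinGuard struct {
	versionGateActive func() bool
	allocateNodeID    func(ctx context.Context) (int32, error)
}

// Join refuses to hand out a node ID until the version gate is active, so a
// mixed-version race (join RPC succeeding while gossip connects concurrently)
// cannot orphan a freshly allocated node ID.
func (g *joinGuard) Join(ctx context.Context) (int32, error) {
	if !g.versionGateActive() {
		return 0, status.Error(codes.FailedPrecondition,
			"cluster version does not permit the join RPC yet")
	}
	return g.allocateNodeID(ctx)
}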
pkg/server/init.go, line 321 at r3 (raw file):
bootstrapped: false, } log.Infof(ctx, "**** joined cluster %s through gossip (legacy behavior)", state.clusterID)
*****
pkg/server/init.go, line 394 at r3 (raw file):
<-stopper.ShouldQuiesce() if err := conn.Close(); err != nil { log.Fatalf(ctx, "%v", err)
Is this fatal worthy? Can't conn.Close() return errors if the conn got disrupted?
I'm also pretty sure you need to re-open these conns for each attempt, rather than making them once upfront and hoping for the best.
pkg/server/init.go, line 408 at r3 (raw file):
} for idx := 0; ; idx = (idx + 1) % len(conns) {
As mentioned elsewhere, the only reason to break out of this loop is that it succeeded or it bails because of a mixed-version cluster. Here's the corresponding code from kvtenantccl, which might be useful as a template.
cockroach/pkg/ccl/kvccl/kvtenantccl/connector.go
Lines 361 to 382 in 5416c9c
// dialAddrs attempts to dial each of the configured addresses in a retry loop.
// The method will only return a non-nil error on context cancellation.
func (c *Connector) dialAddrs(ctx context.Context) error {
	for r := retry.StartWithCtx(ctx, c.rpcRetryOptions); r.Next(); {
		// Try each address on each retry iteration.
		randStart := rand.Intn(len(c.addrs))
		for i := range c.addrs {
			addr := c.addrs[(i+randStart)%len(c.addrs)]
			conn, err := c.dialAddr(ctx, addr)
			if err != nil {
				log.Warningf(ctx, "error dialing tenant KV address %s: %v", addr, err)
				continue
			}
			client := roachpb.NewInternalClient(conn)
			c.mu.Lock()
			c.mu.client = client
			c.mu.Unlock()
			return nil
		}
	}
	return ctx.Err()
}
pkg/server/init.go, line 448 at r3 (raw file):
not Info? After all, this isn't a sign of anything unexpected. Might also add "; falling back to Gossip-based cluster join".
pkg/server/node.go, line 1105 at r3 (raw file):
// Join implements the roachpb.InternalServer service. This is the // "connectivity" API; individual CRDB servers are passed in a --join list and // the join targets are addressed through this API.
This is again a bit at odds with how I expected it to work. I thought Join would only be called when a new node joins for the first time (there doesn't seem to be a compelling reason to call it every time). Let's make sure we're on the same page before proceeding (I'm free today).
pkg/server/server.go, line 1072 at r3 (raw file):
// startRPCServer (and for the loopback grpc-gw connection). initConfig := newInitServerConfig(s.cfg) inspectState, err := inspectEngines(
I originally had put this into newInitServer to avoid having an initState floating around in NewServer. Could you make sure to either restore the original code or to avoid leaking initState into the main method here?
var initServer *initServer
{
	inspectState, err := inspectEngines()
	if err != nil {}
	initServer, err := ...
}
pkg/server/server.go, line 1290 at r3 (raw file):
// // TODO(irfansharif): Can this be gated behind a version flag? So we only // actually start gossip pre-init server when we find out that we need to?
I would do this:
startGossipIdempotent := func() func() {
	var once sync.Once
	do := func() {
		s.gossip.Start(advAddrU, filtered)
	}
	return func() {
		once.Do(do)
	}
}()
Then pass startGossipIdempotent via initServerConfig, and invoke it there when you are falling back to Gossip. Add a second call in the main method after the init server is done. When the migration is no longer necessary, only the second call remains and can turn back to s.gossip.Start directly again.
pkg/server/server.go, line 1343 at r3 (raw file):
// read its ID from the stores or request a new one via KV. // // TODO(irfansharif): Delete this once we 20.2 is cut. This only exists to
you mean "make this unconditional", right?
pkg/server/serverpb/init.proto, line 23 at r3 (raw file):
This is a bit vague - how about: "This is primarily used by cockroach init. It writes the data for a single-node cluster comprised of this node (with NodeID one) to the first store (StoreID one), after which the node can start and allow other nodes to join the cluster."
pkg/util/grpcutil/log.go, line 149 at r3 (raw file):
var ( transportFailedRe = regexp.MustCompile(`^` + regexp.QuoteMeta(`grpc: addrConn.createTransport failed to connect to`)) ConnectionRefusedRe = regexp.MustCompile(
This would then be unexported again.
pkg/util/grpcutil/log.go, line 162 at r3 (raw file):
clientConnReuseRe = regexp.MustCompile("cannot reuse client connection") ConnectivitySpamRe = regexp.MustCompile(transportFailedRe.String() + `.*` +
Ditto
Reviewed 1 of 7 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @knz)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @knz, and @tbg)
pkg/gossip/client.go, line 278 at r1 (raw file):
Were you seeing nodes kill their gossip connections when they both didn't have a nodeID yet?
Yup.
I'm not really sure how to respond to Raphael's question. In a cluster running 20.2 we won't have gossip started without NodeIDs. In mixed-version settings where we're adding a 20.2 node to a 20.1 cluster, we could (we need to, to maintain backwards compatibility). What exactly happens if we allow loopback connections for nodes without a node ID during that window of time? I don't think this code was designed to disallow that explicitly; I think it just happened because we didn't pay much mind to empty nodeIDs, but I could be wrong. I don't have enough understanding of gossip to think about what could break here with my patch.
pkg/roachpb/api.proto, line 2199 at r3 (raw file):
[replying to this and the top level comment] The comment here is a bit dated, but yes, I was originally hoping we'd call something on every server start (and not just when not bootstrapped), but realized soon after we're going to need a separate mechanism/RPC to disallow decomm'ed nodes from entering the cluster. That RPC would be closely tied to the killfiles we'd distribute during decomm + persisting a list of decomm'ed nodes in store-local keys.
I see your point about preferring the version check to sit on this other Connect RPC; I think that makes sense. It's a similar gating mechanism to the one we're using for decommissioned nodes. I was originally thinking to place the Connect RPC call right after the init server set up, because it's only after that point in the code today that we gain knowledge about our node ID, but that's not a good reason. That's also what had informed the placement of the version check in the Join RPC instead of Connect as well.
Join, since that is only served on nodes that have KV running
Aside, I think I missed this detail. Mind pointing me to exactly where the code sits that sequences the serving of InternalServer RPCs after having "KV running"? Related, where would this Connect/PreJoin RPC sit? I don't suppose it's *Node anymore; do we want a separate server for it entirely then?
pkg/server/node.go, line 1105 at r3 (raw file):
I thought Join would only be called when a new node joins for the first time
That is right. See other comment about needing a separate Connect RPC.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @knz)
pkg/gossip/client.go, line 278 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Were you seeing nodes kill their gossip connections when they both didn't have a nodeID yet?
Yup.
I'm not really sure how to respond to Raphael's question. In a cluster running 20.2 we won't have gossip started without NodeIDs. In mixed-version settings where we're adding a 20.2 node to a 20.1 cluster, we could (we need to, to maintain backwards compatibility). What exactly happens if we allow loopback connections for nodes without a node ID during that window of time? I don't think this code was designed to disallow that explicitly; I think it just happened because we didn't pay much mind to empty nodeIDs, but I could be wrong. I don't have enough understanding of gossip to think about what could break here with my patch.
We talked 1:1 and the plan is for Irfan to check if he needs to make any changes here to get things working (and if so, understand why). Otherwise we'll leave this alone since NodeIDs will always be available in 21.1+
pkg/roachpb/api.proto, line 2199 at r3 (raw file):
We chatted about most of this 1:1. The version check is fine on the Join RPC (long-running migrations will include every "non-decommissioned" node in an upgrade, so once we have checked the version at join time, we'll receive all updates (or, if we're down, will block upgrades until decommissioned, in which case we cannot connect to the cluster again in the future)).
Aside, I think I missed this detail. Mind pointing me to exactly where the code sits that sequences the serving of InternalServer RPCs after having "KV running"? Related, where would this Connect/PreJoin RPC sit? I don't suppose it's *Node anymore; do we want a separate server for it entirely then?
I mean that the node only serves Internal (for the most part) when we hit this:
cockroach/pkg/server/server.go
Line 1480 in 30946b9
s.grpc.setMode(modeOperational)
In particular, while the initServer is blocking, the node does not serve Join.
The Connect/PreJoin RPC would be similar to the initServer, I think. It would be registered around
cockroach/pkg/server/server.go
Line 1157 in 30946b9
serverpb.RegisterInitServer(s.grpc.Server, initServer)
(and we would already have passed the outbound Connect RPC at that point; this is necessary for correctness). It would sit on a small struct that only has the engines (which would be populated via some callback whenever we learn that a node is decommissioned, or we inject a db handle into it so that it can try to hit the liveness range; not sure, this is all tbd for me).
53800: server: defer gossip start to when absolutely needed r=irfansharif a=irfansharif
This was pulled out of #52526 to keep the diff there focussed on the introduction of the RPC (and to see if that alone shook out any failures). That change lets us make this one, where we can defer gossip start until right before we start up Node.
Release justification: low risk, high benefit change to existing functionality
Release note: None

53846: sql: fix inflated "overhead" in statement timings r=solongordon a=solongordon
Fixes #40675 Fixes #50108
Release justification: Low-risk, high-reward fix to existing functionality.
Release note (admin ui change): In some cases, the Execution Stats page would show a confusingly high Overhead latency for a statement. This could happen due to multiple statements being parsed together or due to statement execution being retried. To avoid this, we stop considering the time between when parsing ends and execution begins when determining service latency.

53906: roachtest: Unskip disk-full roachtest r=tbg a=itsbilal
The disk-full roachtest no longer fails now that Pebble is the default storage engine. Remove the skipped marker. Fixes #35328.
Release justification: Roachtest-only change.
Release note: None.

53907: geos: bump to include seg fault fix r=sumeerbhola a=otan
Resolves #53254.
Release justification: bug fix to new functionality
Release note: None

Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Solon Gordon <[email protected]>
Co-authored-by: Bilal Akhtar <[email protected]>
Co-authored-by: Oliver Tan <[email protected]>
53713: server: busy loop through resolver list during join process r=irfansharif a=irfansharif
Deferred doing this in #52526. Probably a good idea to have it; it'll bring down the cluster convergence time (time taken for all nodes to find out about the initialization) by a bit.
Release justification: low risk, high benefit changes to existing functionality
Release note: None
Co-authored-by: irfan sharif <[email protected]>
53911: roachtest: fix tpchvec/perf in some cases r=yuzefovich a=yuzefovich Previously, whenever a slowness threshold is exceeded, we would run EXPLAIN ANALYZE (DEBUG) on the query at fault using the same connection for all cluster setups. This would result in running the query only with vectorize=on which is not what we want. We will now be opening up new connections to the database after the cluster settings have been updated in order to get the correct behavior for each of the setups. Informs: #53884. Release note: None 54072: server: introduce Test{ClusterConnectivity,JoinVersionGate} r=irfansharif a=irfansharif Re-adding a few tests that I didn't check in while working on #52526. --- TestClusterConnectivity sets up an uninitialized cluster with custom join flags (individual nodes point to specific others, constructed using connectivity matrices). Included in the test configurations are the more "standard" ones where every node points to n1, or where there are bidirectional links set up between nodes. It tests various topologies and ensures that cluster IDs are distributed throughout connected components, and store/node IDs are allocated in the way we'd expect them to be. TestJoinVersionGate checks to see that improperly versioned cockroach nodes are not able to join a cluster. Release note: None 54159: opt: reduce allocations in Implicator and JoinOrderBuilder r=RaduBerinde a=mgartner #### partialidx: lazily initialize the implicator constraint cache This commit makes the constructing of the Implicator's constrain cache map lazy. Instead of making the map in the `Init` method, it is made only when it is needed. This fixes performance regressions in microbenchmarks. name old time/op new time/op delta Phases/kv-read-const/Normalize-16 170ns ± 0% 139ns ± 0% -18.66% (p=0.008 n=5+5) Phases/kv-read-const/OptBuild-16 173ns ± 1% 141ns ± 1% -18.38% (p=0.008 n=5+5) Phases/kv-read-const/Explore-16 183ns ± 1% 151ns ± 0% -17.70% (p=0.008 n=5+5) Phases/kv-read-const/ExecBuild-16 674ns ± 2% 636ns ± 1% -5.58% (p=0.008 n=5+5) Phases/kv-read/ExecBuild-16 13.3µs ± 1% 13.1µs ± 1% -1.78% (p=0.008 n=5+5) Phases/tpcc-delivery/ExecBuild-16 29.4µs ± 1% 28.9µs ± 0% -1.64% (p=0.016 n=5+4) Phases/tpcc-stock-level/Explore-16 153µs ± 1% 151µs ± 1% -1.24% (p=0.008 n=5+5) Phases/kv-read-no-prep/Explore-16 42.3µs ± 1% 41.9µs ± 0% -0.85% (p=0.008 n=5+5) Phases/tpcc-delivery/OptBuild-16 12.1µs ± 1% 12.1µs ± 0% -0.65% (p=0.032 n=5+5) Phases/kv-read/Parse-16 2.49ns ± 1% 2.48ns ± 0% ~ (p=0.413 n=5+4) Phases/kv-read/OptBuild-16 6.46µs ± 2% 6.44µs ± 1% ~ (p=0.841 n=5+5) Phases/kv-read/Normalize-16 6.48µs ± 3% 6.48µs ± 2% ~ (p=1.000 n=5+5) Phases/kv-read/Explore-16 12.1µs ± 1% 11.9µs ± 2% ~ (p=0.056 n=5+5) Phases/kv-read-no-prep/OptBuild-16 32.4µs ± 3% 32.2µs ± 1% ~ (p=1.000 n=5+5) Phases/kv-read-no-prep/Normalize-16 34.9µs ± 1% 34.7µs ± 2% ~ (p=0.222 n=5+5) Phases/kv-read-no-prep/ExecBuild-16 44.0µs ± 3% 43.7µs ± 1% ~ (p=0.310 n=5+5) Phases/kv-read-const/Parse-16 2.60ns ± 1% 2.60ns ± 1% ~ (p=0.643 n=5+5) Phases/tpcc-new-order/Normalize-16 12.8µs ± 1% 13.9µs ±10% ~ (p=0.690 n=5+5) Phases/tpcc-new-order/ExecBuild-16 38.5µs ± 1% 38.6µs ± 2% ~ (p=1.000 n=5+5) Phases/tpcc-delivery/Parse-16 2.55ns ± 1% 2.56ns ± 1% ~ (p=0.921 n=5+5) Phases/tpcc-delivery/Normalize-16 12.6µs ± 2% 12.5µs ± 1% ~ (p=0.151 n=5+5) Phases/tpcc-delivery/Explore-16 28.0µs ± 2% 27.6µs ± 1% ~ (p=0.063 n=5+5) Phases/tpcc-stock-level/OptBuild-16 52.8µs ± 1% 52.9µs ± 1% ~ (p=1.000 n=5+5) Phases/tpcc-stock-level/Normalize-16 53.7µs ± 1% 53.7µs ± 1% ~ (p=1.000 
n=5+5) Phases/tpcc-stock-level/ExecBuild-16 160µs ± 1% 161µs ± 1% ~ (p=0.222 n=5+5) Phases/tpcc-new-order/Explore-16 36.2µs ± 0% 36.7µs ± 1% +1.47% (p=0.016 n=4+5) Phases/kv-read-no-prep/Parse-16 12.6µs ± 1% 12.8µs ± 1% +1.58% (p=0.016 n=4+5) Phases/tpcc-new-order/Parse-16 2.58ns ± 1% 2.67ns ± 7% +3.33% (p=0.048 n=5+5) Phases/tpcc-stock-level/Parse-16 2.52ns ± 1% 2.63ns ± 4% +4.29% (p=0.016 n=5+5) Phases/tpcc-new-order/OptBuild-16 13.0µs ± 1% 14.7µs ± 2% +12.50% (p=0.008 n=5+5) Release note: None #### opt: embed JoinOrderBuilder in Optimizer This commit embeds the JoinOrderBuilder in the Optimizer struct. This avoids allocating a new JoinOrderBuilder on the heap every time `Optimizer.Init` is called. BenchmarkPhase microbenchmarks comparing the current master branch to this commit: name old time/op new time/op delta Phases/kv-read-const/Normalize-16 171ns ± 1% 77ns ± 1% -55.11% (p=0.008 n=5+5) Phases/kv-read-const/OptBuild-16 175ns ± 1% 79ns ± 2% -54.83% (p=0.008 n=5+5) Phases/kv-read-const/Explore-16 189ns ± 7% 89ns ± 0% -53.14% (p=0.008 n=5+5) Phases/tpcc-new-order/OptBuild-16 16.6µs ± 6% 12.7µs ± 2% -23.47% (p=0.008 n=5+5) Phases/tpcc-new-order/Normalize-16 14.7µs ± 8% 12.4µs ± 1% -15.39% (p=0.008 n=5+5) Phases/kv-read-const/ExecBuild-16 687ns ± 4% 592ns ± 1% -13.89% (p=0.008 n=5+5) Phases/tpcc-new-order/Parse-16 2.94ns ± 8% 2.58ns ± 2% -12.23% (p=0.008 n=5+5) Phases/tpcc-delivery/Parse-16 2.86ns ± 8% 2.54ns ± 1% -11.33% (p=0.008 n=5+5) Phases/tpcc-new-order/ExecBuild-16 43.0µs ±10% 38.6µs ± 0% -10.21% (p=0.016 n=5+4) Phases/tpcc-new-order/Explore-16 39.0µs ± 3% 36.6µs ± 0% -6.12% (p=0.008 n=5+5) Phases/tpcc-delivery/OptBuild-16 12.4µs ± 1% 12.0µs ± 0% -3.49% (p=0.008 n=5+5) Phases/kv-read/ExecBuild-16 13.4µs ± 0% 12.9µs ± 0% -3.37% (p=0.008 n=5+5) Phases/kv-read/Explore-16 12.1µs ± 1% 11.7µs ± 2% -3.04% (p=0.008 n=5+5) Phases/tpcc-delivery/Normalize-16 12.8µs ± 1% 12.4µs ± 1% -3.00% (p=0.008 n=5+5) Phases/kv-read-no-prep/ExecBuild-16 44.1µs ± 2% 43.2µs ± 1% -2.16% (p=0.008 n=5+5) Phases/kv-read/Normalize-16 6.48µs ± 1% 6.36µs ± 1% -1.73% (p=0.008 n=5+5) Phases/kv-read-no-prep/Explore-16 42.6µs ± 1% 41.9µs ± 1% -1.55% (p=0.032 n=5+5) Phases/tpcc-delivery/ExecBuild-16 29.5µs ± 1% 29.1µs ± 0% -1.41% (p=0.008 n=5+5) Phases/kv-read-no-prep/OptBuild-16 32.2µs ± 0% 31.7µs ± 1% -1.34% (p=0.008 n=5+5) Phases/kv-read-no-prep/Normalize-16 34.8µs ± 1% 34.4µs ± 1% -1.25% (p=0.016 n=5+4) Phases/tpcc-stock-level/OptBuild-16 53.1µs ± 0% 52.7µs ± 0% -0.83% (p=0.016 n=5+5) Phases/kv-read/Parse-16 2.49ns ± 0% 2.49ns ± 1% ~ (p=0.952 n=4+5) Phases/kv-read/OptBuild-16 6.48µs ± 0% 6.44µs ± 1% ~ (p=0.310 n=5+5) Phases/kv-read-no-prep/Parse-16 12.7µs ± 2% 12.7µs ± 1% ~ (p=0.508 n=5+5) Phases/kv-read-const/Parse-16 2.59ns ± 1% 2.59ns ± 1% ~ (p=0.857 n=5+5) Phases/tpcc-delivery/Explore-16 27.9µs ± 0% 27.9µs ± 2% ~ (p=0.310 n=5+5) Phases/tpcc-stock-level/Parse-16 2.57ns ± 2% 2.54ns ± 3% ~ (p=0.254 n=5+5) Phases/tpcc-stock-level/Normalize-16 53.8µs ± 0% 53.5µs ± 2% ~ (p=0.151 n=5+5) Phases/tpcc-stock-level/Explore-16 153µs ± 0% 152µs ± 3% ~ (p=0.151 n=5+5) Phases/tpcc-stock-level/ExecBuild-16 162µs ± 1% 161µs ± 1% ~ (p=0.095 n=5+5) Release note: None Co-authored-by: Yahor Yuzefovich <[email protected]> Co-authored-by: irfan sharif <[email protected]> Co-authored-by: Marcus Gartner <[email protected]>
55502: rpc: disable use of http proxies by default r=irfansharif a=irfansharif
GRPC uses the HTTPS_PROXY environment variable by default[1]. This is surprising, and likely undesirable for CRDB because it turns the proxy into an availability risk and a throughput bottleneck. We disable the use of proxies by default when retrieving grpc dial options. This diff came up in the context of #55289, and was a recent regression in 20.2 after having introduced the join rpc in #52526 (it's the rpc responsible for adding new nodes to the cluster #52526). Our existing RPC connections use the `WithContextDialer` option already, which has the side effect of disabling proxy use. The join RPC didn't, so an existing cluster upgrade to 20.2 wouldn't use the proxy until a node was added. Fixes #55289. This will need to get backported to release-20.2.
[1]: https://github.com/grpc/grpc-go/blob/c0736608/Documentation/proxy.md
Release note: Previously we used the HTTPS_PROXY variable for the "join RPC" when adding a node to the cluster (the RPC prevents new clusters from starting or adding nodes to an existing cluster). The proxy needed to be configured to transparently pass HTTP/2+GRPC inter-node traffic. This was an unintentional addition, and this patch ignores the proxies for all intra-node traffic. They were already ignored in releases prior to 20.2.
Co-authored-by: irfan sharif <[email protected]>
56263: server: rm gossip usage from node/store/cluster ID allocation r=irfansharif a=irfansharif
In 20.2, with #52526, we introduced the join RPC for new nodes in the system to be handed a pre-allocated node ID, a pre-allocated store ID, and be informed by the existing cluster what the cluster ID was. In 20.1 and earlier this functionality was provided by our use of gossip, and by constructing hollowed out servers (without node/store IDs) and allocating an ID at the joining node. This required an awkward dance around needing KV to be up in order to allocate the right IDs, but not having an ID for the server process itself. We retained the deprecated gossip codepaths in 20.2 in order to maintain compatibility with 20.1. Now that 20.2 is cut however, in 21.1 code we can always assume that we're talking to nodes that are at least 20.2, and therefore nodes that are able to use the join RPC correctly. This lets us strip out our usage of gossip for ID allocation, which in turn paves the way for a few other simplifications.
We also delete the mixed-version/join-init roachtest, as it was only ever relevant for the 20.1/20.2 cycle. The current structure around joining nodes being allocated/informed of the right IDs has adequate coverage with TestClusterConnectivity.
Release note: None
Co-authored-by: irfan sharif <[email protected]>
This mostly follows the ideas in #32574, and serves as a crucial
building block for #48843. Specifically this PR introduces a new Join
RPC that new nodes can use, addressing already initialized nodes, to
learn about the cluster ID and its node ID. Previously joining nodes were responsible for allocating their own IDs and used gossip to discover the cluster ID.
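Roughly, the message shapes involved look like the following sketch; the field names are illustrative mirrors of the protobuf definitions touched by this PR (e.g. the min_supported_version field discussed in the review), not the authoritative schema:

package sketch

// joinNodeRequest is an illustrative mirror of the request proto: the joining
// node announces the minimum version its binary supports so the cluster can
// refuse incompatible joiners.
type joinNodeRequest struct {
	MinSupportedVersion string
}

// joinNodeResponse is an illustrative mirror of the response proto: the
// addressed, already-initialized node hands back the cluster ID and a
// pre-allocated node ID.
type joinNodeResponse struct {
	ClusterID string
	NodeID    int32
}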
By moving towards a more understandable flow of how nodes join the
cluster, we can build a few useful primitives on top of this:
- adding the liveness record for a joining node as part of the join RPC,
which would simplify our liveness record handling code that is
perennially concerned with missing liveness records
The tiny bit of complexity in this PR comes from how we're able to
migrate into this behavior from the old. To that end we retain the
earlier gossip-based cluster ID discovery + node ID self-allocation
behavior. Nodes with this patch will attempt to use this join RPC, if
implemented on the addressed node, and fall back to using the previous
behavior if not. It wasn't possible to use cluster versions for this
migration because this happens very early in the node startup process,
and the version gating this change will not be active until much later
in the crdb process lifetime.
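A minimal sketch of that fallback, assuming an illustrative attemptJoin helper and a legacy gossip path; the real code keys off codes.Unimplemented (and an ErrJoinRPCUnimplemented sentinel), as discussed in the review above:

package sketch

import (
	"context"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// joinWithFallback tries the new Join RPC against an already-initialized node
// and falls back to the pre-20.2 gossip-based flow if the addressed node runs
// an older binary that doesn't implement the RPC.
func joinWithFallback(
	ctx context.Context,
	attemptJoin func(ctx context.Context, addr string) error, // illustrative
	gossipJoin func(ctx context.Context) error, // legacy path, illustrative
	addr string,
) error {
	err := attemptJoin(ctx, addr)
	if err == nil {
		return nil
	}
	if status.Code(err) == codes.Unimplemented {
		// The addressed node predates the join RPC: fall back to gossip-based
		// cluster ID discovery and self-allocated node IDs.
		return gossipJoin(ctx)
	}
	return err
}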
There are some leftover TODOs that I'm looking to address in this PR.
They should be tiny, and be easy to retro-fit into what we have so far.
Specifically I'm going to plumb the client address into the RPC so the
server is able to generate backlinks (and solve the bidirectionality
problem). I'm also going to try and add the liveness record for a
joining node as part of the join rpc. Right now the tests verifying
connectivity/bootstrap/join flags pass out of the box, but I'm going to
try adding more randomized testing here to test full connectivity once I
address these TODOs.
Release justification: Low risk, high benefit changes to existing functionality
Release note: None