-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: allocate NodeID via one-off RPC to initialized node #32574
Comments
Join-time NodeID/ClusterID RPCSwitch to a dedicated mechanism for joining an existing cluster that does not require a running Gossip network and Node. ProblemNodes joining a cluster for the first time currently perform an awkward dance: to allocate a nodeID, they set up a more or less fully functional node so that they can run a KV Lines 23 to 32 in c097a16
which complicates matters in all components in which the NodeID plays a role (that's many of them). A similar problem occurs for the ClusterID, and there's a corresponding container: cockroach/pkg/base/cluster_id.go Lines 21 to 33 in c097a16
The ClusterID is primarily distributed via Gossip, and nodes asking to join a cluster must thus connect to Gossip, before they're really ready to do so. Start-up sequence todayWe shouldn't have to set up an incomplete server and deal with updating it in-place. Instead, a node needing to bootstrap via joining should, at a high level, ask an existing node for the ClusterID and a freshly assigned NodeID before really entering its start sequence. The sequence to join a cluster today takes roughly the following form First, the node must find that no NodeID/ClusterID has been persisted yet cockroach/pkg/server/server.go Lines 1308 to 1315 in 6cb71bf
and that the gossip resolver slice is not empty cockroach/pkg/server/server.go Lines 1347 to 1353 in 6cb71bf
Now we accept incoming requests on the init server (but in this case there aren't any, since we look at joining an existing cluster). In today's code, Gossip is already started at this point (even though it would like to know a NodeID, tough luck). Two things can happen: either we get an explicit request to bootstrap a new cluster, in which case that is done. Otherwise, Gossip manages to connect, this means that some other node was bootstrapped, and it tells us the cluster ID and we continue. Next, the node is started. This is awkward: we're starting a node without a NodeID because that gives us a KV stack that we can then use to allocate a NodeID: Lines 358 to 375 in 6cb71bf
This is the step that will need more work. We want this code to go away. Start-up sequence (proposed)If we are waiting for init/join, roughly here we want to
The message definitions for this could look like this: message JoinNodeRequest {
// NB: a node that joins has the lowest version active that it supports,
// i.e. its MinSupportedVersion. We could just name it that, too?
// Recipient will refuse this request if it's not compatible with active_version.
roachpb.Version active_version = 1;
}
message JoinNodeResponse {
int32 node_id = 1 [(gogoproto.customname) = "NodeID"];
// A first StoreID allocated for the new node. The Node can in principle allocate
// it own StoreIDs once it has a NodeID, but it doesn't have a good way to persist
// a NodeID until it also has an initialized Store. Handing out a one-off StoreID here
// is a trivial way to avoid additional complexity related to that.
int32 store_id = 2 [(gogoproto.customname) = "StoreID"];
bytes cluster_id = 3 [(gogoproto.customname) = "ClusterID"];
}
We still haven't removed the early Gossip server. I believe we can start it only after the ClusterID/NodeID are known from a successful JoinNode request simply because it isn't required for bootstrapping any more. And we should be able to bypass Potential issue: (more) loss of bidirectionalityConsider the following setup, where an edge from i to j means that node i has node j in its join flags.
Assume all nodes are initially uninitialized (i.e. waiting for bootstrap). The cluster will only come together successfully if the node that receives the BootstrapRequest is reachable (transitively) from all other nodes. For example:
In today's code, bootstrapping 3 would actually "push" the clusterID to 1 via Gossip and then 1 would initialize, and then everyone else. This is enough of a limitation to do something about, for example add a ShouldJoinRequest which, when received, simply triggers a |
\## Motivation Today, starting up a server is complicated. This is especially true when bootstrap is necessary. By this, we mean that either - no NodeID is known. This implies that none of the engines were initialized yet - the NodeID is known (i.e. at least one store is initialized), but a new engine was added. When the process first starts, and a NodeID is known, a ClusterID is also known (since they get persisted together, on every store). For the same reason, a persisted cluster version is known. In this case, we can start a fully initialized cluster, so this is the easy case. It is more difficult when no NodeID is known, in which case the server is just starting up for the first time, with its engines all blank. It needs to somehow allocate NodeIDs (and StoreIDs) which are then written to the engines. It also needs to, at least initially, use the lowest possible cluster version it can tolerate (after all, the cluster may actually run at that lowest version). Right now, there is a delicate dance: we thread late-bound ClusterID and NodeID containers all over the place, spin up a mostly dysfunctional Node, use its KV client to allocate NodeID and StoreIDs once this is possible - we need Gossip to have connected first - and update the containers. It is complex, error prone, and ossifies any code it touches. Cluster versions deserve an extra shout-out for complexity. Even if a cluster version was persisted, the node may have been decommissioned and the cluster upgraded. Much work went into our RPC layer to prevent connections between incompatible nodes, but there is no boot-time check that results in a swift and descriptive error - there will be a fatal error, originating from the RPC subsystem. One aim of our work is to simplify this by checking the version via an RPC before setting too many gears in motion. \## Context This marks the beginning of a series of PRs aiming at improving the startup code. Ultimately, the goal is to have the Node and all Stores bootstrapped as early as possible, but in particular before starting the KV or SQL server subsystems. Furthermore, we want to achieve this without relying on Gossip, to prepare for a split between the SQL and KV servers in the context of multitenancy support (SQL server won't be able to rely on Gossip, but will still need to announce itself to the KV servers). \## This PR This PR is an initial simplifying refactor that can help achieve these goals. The init server (which hosts the Bootstrap RPC) is given more responsibilities: it is now directly in charge of determining which, if any, engines are bootstrapped, and explicitly listens to Gossip as well as the Bootstrap RPC. It returns the cluster ID to the main server start-up goroutine when it is available. As a result, the main startup code has simplified, and a thread to be pulled on further has appeared and is called out in TODOs. Down the road (i.e. in later PRs), the init server will bootstrap a NodeID and StoreIDs even when joining an existing cluster. It will initially mimic/front-load the strategy taken by Node today, i.e. use a `kv.DB`, but ultimately will bypass Gossip completely and use a simple RPC call to ask the existing cluster to assign these IDs as needed. This RPC will also establish the active cluster version, which is required for SQL multi-tenancy, and generally follows the ideas in cockroachdb#32574. Release note: None
\## Motivation Today, starting up a server is complicated. This is especially true when bootstrap is necessary. By this, we mean that either - no NodeID is known. This implies that none of the engines were initialized yet - the NodeID is known (i.e. at least one store is initialized), but a new engine was added. When the process first starts, and a NodeID is known, a ClusterID is also known (since they get persisted together, on every store). For the same reason, a persisted cluster version is known. In this case, we can start a fully initialized cluster, so this is the easy case. It is more difficult when no NodeID is known, in which case the server is just starting up for the first time, with its engines all blank. It needs to somehow allocate NodeIDs (and StoreIDs) which are then written to the engines. It also needs to, at least initially, use the lowest possible cluster version it can tolerate (after all, the cluster may actually run at that lowest version). Right now, there is a delicate dance: we thread late-bound ClusterID and NodeID containers all over the place, spin up a mostly dysfunctional Node, use its KV client to allocate NodeID and StoreIDs once this is possible - we need Gossip to have connected first - and update the containers. It is complex, error prone, and ossifies any code it touches. Cluster versions deserve an extra shout-out for complexity. Even if a cluster version was persisted, the node may have been decommissioned and the cluster upgraded. Much work went into our RPC layer to prevent connections between incompatible nodes, but there is no boot-time check that results in a swift and descriptive error - there will be a fatal error, originating from the RPC subsystem. One aim of our work is to simplify this by checking the version via an RPC before setting too many gears in motion. \## Context This marks the beginning of a series of PRs aiming at improving the startup code. Ultimately, the goal is to have the Node and all Stores bootstrapped as early as possible, but in particular before starting the KV or SQL server subsystems. Furthermore, we want to achieve this without relying on Gossip, to prepare for a split between the SQL and KV servers in the context of multitenancy support (SQL server won't be able to rely on Gossip, but will still need to announce itself to the KV servers). \## This PR This PR is an initial simplifying refactor that can help achieve these goals. The init server (which hosts the Bootstrap RPC) is given more responsibilities: it is now directly in charge of determining which, if any, engines are bootstrapped, and explicitly listens to Gossip as well as the Bootstrap RPC. It returns the cluster ID to the main server start-up goroutine when it is available. As a result, the main startup code has simplified, and a thread to be pulled on further has appeared and is called out in TODOs. Down the road (i.e. in later PRs), the init server will bootstrap a NodeID and StoreIDs even when joining an existing cluster. It will initially mimic/front-load the strategy taken by Node today, i.e. use a `kv.DB`, but ultimately will bypass Gossip completely and use a simple RPC call to ask the existing cluster to assign these IDs as needed. This RPC will also establish the active cluster version, which is required for SQL multi-tenancy, and generally follows the ideas in cockroachdb#32574. Release note: None
46843: server: simplify bootstrap via fatter init server r=andreimatei a=tbg ## Motivation Today, starting up a server is complicated. This is especially true when bootstrap is necessary. By this, we mean that either - no NodeID is known. This implies that none of the engines were initialized yet - the NodeID is known (i.e. at least one store is initialized), but a new engine was added. When the process first starts, and a NodeID is known, a ClusterID is also known (since they get persisted together, on every store). For the same reason, a persisted cluster version is known. In this case, we can start a fully initialized cluster, so this is the easy case. It is more difficult when no NodeID is known, in which case the server is just starting up for the first time, with its engines all blank. It needs to somehow allocate NodeIDs (and StoreIDs) which are then written to the engines. It also needs to, at least initially, use the lowest possible cluster version it can tolerate (after all, the cluster may actually run at that lowest version). Right now, there is a delicate dance: we thread late-bound ClusterID and NodeID containers all over the place, spin up a mostly dysfunctional Node, use its KV client to allocate NodeID and StoreIDs once this is possible - we need Gossip to have connected first - and update the containers. It is complex, error prone, and ossifies any code it touches. Cluster versions deserve an extra shout-out for complexity. Even if a cluster version was persisted, the node may have been decommissioned and the cluster upgraded. Much work went into our RPC layer to prevent connections between incompatible nodes, but there is no boot-time check that results in a swift and descriptive error - there will be a fatal error, originating from the RPC subsystem. One aim of our work is to simplify this by checking the version via an RPC before setting too many gears in motion. ## Context This marks the beginning of a series of PRs aiming at improving the startup code. Ultimately, the goal is to have the Node and all Stores bootstrapped as early as possible, but in particular before starting the KV or SQL server subsystems. Furthermore, we want to achieve this without relying on Gossip, to prepare for a split between the SQL and KV servers in the context of multitenancy support (SQL server won't be able to rely on Gossip, but will still need to announce itself to the KV servers). ## This PR This PR is an initial simplifying refactor that can help achieve these goals. The init server (which hosts the Bootstrap RPC) is given more responsibilities: it is now directly in charge of determining which, if any, engines are bootstrapped, and explicitly listens to Gossip as well as the Bootstrap RPC. It returns the cluster ID to the main server start-up goroutine when it is available. As a result, the main startup code has simplified, and a thread to be pulled on further has appeared and is called out in TODOs. Down the road (i.e. in later PRs), the init server will bootstrap a NodeID and StoreIDs even when joining an existing cluster. It will initially mimic/front-load the strategy taken by Node today, i.e. use a `kv.DB`, but ultimately will bypass Gossip completely and use a simple RPC call to ask the existing cluster to assign these IDs as needed. This RPC will also establish the active cluster version, which is required for SQL multi-tenancy, and generally follows the ideas in #32574. Release note: None Co-authored-by: Tobias Schottdorf <[email protected]>
\## Motivation Today, starting up a server is complicated. This is especially true when bootstrap is necessary. By this, we mean that either - no NodeID is known. This implies that none of the engines were initialized yet - the NodeID is known (i.e. at least one store is initialized), but a new engine was added. When the process first starts, and a NodeID is known, a ClusterID is also known (since they get persisted together, on every store). For the same reason, a persisted cluster version is known. In this case, we can start a fully initialized cluster, so this is the easy case. It is more difficult when no NodeID is known, in which case the server is just starting up for the first time, with its engines all blank. It needs to somehow allocate NodeIDs (and StoreIDs) which are then written to the engines. It also needs to, at least initially, use the lowest possible cluster version it can tolerate (after all, the cluster may actually run at that lowest version). Right now, there is a delicate dance: we thread late-bound ClusterID and NodeID containers all over the place, spin up a mostly dysfunctional Node, use its KV client to allocate NodeID and StoreIDs once this is possible - we need Gossip to have connected first - and update the containers. It is complex, error prone, and ossifies any code it touches. Cluster versions deserve an extra shout-out for complexity. Even if a cluster version was persisted, the node may have been decommissioned and the cluster upgraded. Much work went into our RPC layer to prevent connections between incompatible nodes, but there is no boot-time check that results in a swift and descriptive error - there will be a fatal error, originating from the RPC subsystem. One aim of our work is to simplify this by checking the version via an RPC before setting too many gears in motion. \## Context This marks the beginning of a series of PRs aiming at improving the startup code. Ultimately, the goal is to have the Node and all Stores bootstrapped as early as possible, but in particular before starting the KV or SQL server subsystems. Furthermore, we want to achieve this without relying on Gossip, to prepare for a split between the SQL and KV servers in the context of multitenancy support (SQL server won't be able to rely on Gossip, but will still need to announce itself to the KV servers). \## This PR This PR is an initial simplifying refactor that can help achieve these goals. The init server (which hosts the Bootstrap RPC) is given more responsibilities: it is now directly in charge of determining which, if any, engines are bootstrapped, and explicitly listens to Gossip as well as the Bootstrap RPC. It returns the cluster ID to the main server start-up goroutine when it is available. As a result, the main startup code has simplified, and a thread to be pulled on further has appeared and is called out in TODOs. Down the road (i.e. in later PRs), the init server will bootstrap a NodeID and StoreIDs even when joining an existing cluster. It will initially mimic/front-load the strategy taken by Node today, i.e. use a `kv.DB`, but ultimately will bypass Gossip completely and use a simple RPC call to ask the existing cluster to assign these IDs as needed. This RPC will also establish the active cluster version, which is required for SQL multi-tenancy, and generally follows the ideas in cockroachdb#32574. Release note: None
@irfansharif @tbg will this fix also help resolve this issue #39415? |
@irfansharif I believe it maybe because we are re-working how StoreIDs are being allocated and based on the findings here #39497 that seems to be the crux of this issue. |
This would help in some situations but not in others. Consider a three-node cluster that you shut down completely. Then you try to restart n1 with an extra store. The Connect RPC can't allocate the extra StoreID since there is no KV layer available. We need to proceed with startup (leaving the extra store alone for now) and initialize the store ID later, when the other two nodes are also up. I wrote a proposal on the other thread on how this could be done: |
This mostly follows the ideas in cockroachdb#32574, and serves as a crucial building block for cockroachdb#48843. Specifically this PR introduces a new Join RPC that new nodes can use, addressing already initialized nodes, to learn about the cluster ID and its node id. Previously joining nodes were responsible for allocating their own IDs and used to discover the cluster ID. By moving towards a more understandable flow of how nodes joins the cluster, we can build a few useful primitives on top of this: - we can prevent mismatched version nodes from joining the cluster - we can prevent decommissioned nodes from joining the cluster - we can add the liveness record for a given node as soon as it joins, which would simplify our liveness record handling code that is perennially concerned with missing liveness records The tiny bit of complexity in this PR comes from how we're able to migrate into this behavior from the old. To that end we retain the earlier gossip-based cluster ID discovery+node ID allocation for self behavior. Nodes with this patch will attempt to use this join RPC, if implemented on the addressed node, and fall back to using the previous behavior if not. It wasn't possible to use cluster versions for this migrations because this happens very early in the node start up process, and the version gating this change will not be active until much later in the crdb process lifetime. --- There are some leftover TODOs that I'm looking to address in this PR. They should be tiny, and be easy to retro-fit into what we have so far. Specifically I'm going to plumb the client address into the RPC so the server is able to generate backlinks (and solve the bidirectionality problem). I'm also going to try and add the liveness record for a joining node as part of the join rpc. Right now the tests verifying connectivity/bootstrap/join flags pass out of the box, but I'm going to try adding more randomized testing here to test full connectivity once I address these TODOs. Release note: None
This mostly follows the ideas in cockroachdb#32574, and serves as a crucial building block for cockroachdb#48843. Specifically this PR introduces a new Join RPC that new nodes can use, addressing already initialized nodes, to learn about the cluster ID and its node id. Previously joining nodes were responsible for allocating their own IDs and used to discover the cluster ID. By moving towards a more understandable flow of how nodes joins the cluster, we can build a few useful primitives on top of this: - we can prevent mismatched version nodes from joining the cluster - we can prevent decommissioned nodes from joining the cluster - we can add the liveness record for a given node as soon as it joins, which would simplify our liveness record handling code that is perennially concerned with missing liveness records The tiny bit of complexity in this PR comes from how we're able to migrate into this behavior from the old. To that end we retain the earlier gossip-based cluster ID discovery+node ID allocation for self behavior. Nodes with this patch will attempt to use this join RPC, if implemented on the addressed node, and fall back to using the previous behavior if not. It wasn't possible to use cluster versions for this migrations because this happens very early in the node start up process, and the version gating this change will not be active until much later in the crdb process lifetime. --- There are some leftover TODOs that I'm looking to address in this PR. They should be tiny, and be easy to retro-fit into what we have so far. Specifically I'm going to plumb the client address into the RPC so the server is able to generate backlinks (and solve the bidirectionality problem). I'm also going to try and add the liveness record for a joining node as part of the join rpc. Right now the tests verifying connectivity/bootstrap/join flags pass out of the box, but I'm going to try adding more randomized testing here to test full connectivity once I address these TODOs. Release note: None
This mostly follows the ideas in cockroachdb#32574, and serves as a crucial building block for cockroachdb#48843. Specifically this PR introduces a new Join RPC that new nodes can use, addressing already initialized nodes, to learn about the cluster ID and its node ID. Previously joining nodes were responsible for allocating their own IDs and used to discover the cluster ID. By moving towards a more understandable flow of how nodes joins the cluster, we can build a few useful primitives on top of this: - we can prevent mismatched version nodes from joining the cluster which (this commit) - we can allocate the first store ID for a given node, which is a nice code simplification (this commit) - we can prevent decommissioned nodes from joining the cluster (future PR) - we can eliminate another usage of gossip where we previously used it to disseminate the cluster ID. In the 21.1 cycle we can defer gossip start until much later in the server start lifecycle (future PR) - we can add the liveness record for a given node as soon as it joins, which would simplify our liveness record handling code that is perennially concerned with missing liveness records (future PR) The tiny bit of complexity in this PR comes from how we're able to migrate into this behavior from the old. To that end we retain the earlier gossip-based cluster ID discovery+node ID allocation for self behavior. Nodes with this patch will attempt to use this join RPC, if implemented on the addressed node, and fall back to using the previous behavior if not. Release note: None
This mostly follows the ideas in cockroachdb#32574, and serves as a crucial building block for cockroachdb#48843. Specifically this PR introduces a new Join RPC that new nodes can use, addressing already initialized nodes, to learn about the cluster ID and its node ID. Previously joining nodes were responsible for allocating their own IDs and used to discover the cluster ID. By moving towards a more understandable flow of how nodes joins the cluster, we can build a few useful primitives on top of this: - we can prevent mismatched version nodes from joining the cluster which (this commit) - we can allocate the first store ID for a given node, which is a nice code simplification (this commit) - we can prevent decommissioned nodes from joining the cluster (future PR) - we can eliminate another usage of gossip where we previously used it to disseminate the cluster ID. In the 21.1 cycle we can defer gossip start until much later in the server start lifecycle (future PR) - we can add the liveness record for a given node as soon as it joins, which would simplify our liveness record handling code that is perennially concerned with missing liveness records (future PR) The tiny bit of complexity in this PR comes from how we're able to migrate into this behavior from the old. To that end we retain the earlier gossip-based cluster ID discovery+node ID allocation for self behavior. Nodes with this patch will attempt to use this join RPC, if implemented on the addressed node, and fall back to using the previous behavior if not. Release note: None
This mostly follows the ideas in cockroachdb#32574, and serves as a crucial building block for cockroachdb#48843. Specifically this PR introduces a new Join RPC that new nodes can use, addressing already initialized nodes, to learn about the cluster ID and its node ID. Previously joining nodes were responsible for allocating their own IDs and used to discover the cluster ID. By moving towards a more understandable flow of how nodes joins the cluster, we can build a few useful primitives on top of this: - we can prevent mismatched version nodes from joining the cluster which (this commit) - we can allocate the first store ID for a given node, which is a nice code simplification (this commit) - we can prevent decommissioned nodes from joining the cluster (future PR) - we can eliminate another usage of gossip where we previously used it to disseminate the cluster ID. In the 21.1 cycle we can defer gossip start until much later in the server start lifecycle (future PR) - we can add the liveness record for a given node as soon as it joins, which would simplify our liveness record handling code that is perennially concerned with missing liveness records (future PR) The tiny bit of complexity in this PR comes from how we're able to migrate into this behavior from the old. To that end we retain the earlier gossip-based cluster ID discovery+node ID allocation for self behavior. Nodes with this patch will attempt to use this join RPC, if implemented on the addressed node, and fall back to using the previous behavior if not. Release justification: low risk, high benefit changes to existing functionality Release note: None
52526: server: introduce join rpc for node id allocation r=irfansharif a=irfansharif This mostly follows the ideas in #32574, and serves as a crucial building block for #48843. Specifically this PR introduces a new Join RPC that new nodes can use, addressing already initialized nodes, to learn about the cluster ID and its node id. Previously joining nodes were responsible for allocating their own IDs and used to discover the cluster ID. By moving towards a more understandable flow of how nodes joins the cluster, we can build a few useful primitives on top of this: - we can prevent mismatched version nodes from joining the cluster - we can prevent decommissioned nodes from joining the cluster - we can add the liveness record for a given node as soon as it joins, which would simplify our liveness record handling code that is perennially concerned with missing liveness records The tiny bit of complexity in this PR comes from how we're able to migrate into this behavior from the old. To that end we retain the earlier gossip-based cluster ID discovery+node ID allocation for self behavior. Nodes with this patch will attempt to use this join RPC, if implemented on the addressed node, and fall back to using the previous behavior if not. It wasn't possible to use cluster versions for this migrations because this happens very early in the node start up process, and the version gating this change will not be active until much later in the crdb process lifetime. --- There are some leftover TODOs that I'm looking to address in this PR. They should be tiny, and be easy to retro-fit into what we have so far. Specifically I'm going to plumb the client address into the RPC so the server is able to generate backlinks (and solve the bidirectionality problem). I'm also going to try and add the liveness record for a joining node as part of the join rpc. Right now the tests verifying connectivity/bootstrap/join flags pass out of the box, but I'm going to try adding more randomized testing here to test full connectivity once I address these TODOs. Release justification: Low risk, high benefit changes to existing functionality Release note: None Co-authored-by: irfan sharif <[email protected]>
@irfansharif this is done, right? |
Our handling of the NodeID is currently quite involved (and leaks into many places) because we pretty much require a working node to be set up before it can use its own KV store to request a NodeID.
There are two levels of improvement here. The first is that once Gossip is connected, and the Node is uninitialized, we use a one-off RPC
AllocateNodeID
which simply asks a remote node of our choosing (taken from Gossip) to allocate a NodeID for us. This shifts the complexity of using the KV store to a node that has it in perfect working condition. (Similar improvements apply to StoreID allocation, which we may be able to hoist out of the depths of(*Node).Start
where it is buried right now).This removes uninitialized NodeIDs from everywhere except Gossip. Removing it from Gossip is more involved, depending on what guarantees we want to offer. The basic idea is to read the Gossip bootstrap list directly to send
AllocateNodeID
, and not starting Gossip before that has succeeded. However, consider the case in which a node isn't able to reach out to other nodes (perhaps its join flags are stale) but other nodes are able to reach out to this node (this is a poor setup, but it could happen): we'd get stuck, unless there's at least a minimal subset of Gossip running that would add the peer's address to the list of addresses we pick from forAllocateNodeID
. This doesn't seem terrible to implement, but it's a little more involved.Related to #30553.
The text was updated successfully, but these errors were encountered: