server: simplify bootstrap via fatter init server
## Motivation

Today, starting up a server is complicated. This is especially true
when bootstrap is necessary. By this, we mean that either

- no NodeID is known. This implies that none of the engines were
  initialized yet
- the NodeID is known (i.e. at least one store is initialized), but a
  new engine was added.

When the process first starts, and a NodeID is known, a ClusterID is
also known (since they get persisted together, on every store). For the
same reason, a persisted cluster version is known. In this case, we can
start a fully initialized cluster, so this is the easy case.

It is more difficult when no NodeID is known, in which case the server
is just starting up for the first time, with its engines all blank. It
needs to somehow allocate NodeIDs (and StoreIDs) which are then written
to the engines. It also needs to, at least initially, use the lowest
possible cluster version it can tolerate (after all, the cluster may
actually run at that lowest version). Right now, there is a delicate
dance: we thread late-bound ClusterID and NodeID containers all over the
place, spin up a mostly dysfunctional Node, use its KV client to
allocate NodeID and StoreIDs once this is possible - we need Gossip
to have connected first - and update the containers. This dance is complex
and error-prone, and it ossifies any code it touches.
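
To make "late-bound containers" concrete, here is a minimal sketch of the
pattern (plain Go, not the actual container types in the codebase):
components receive the container before the ID exists and have to tolerate
it being unset.

```go
package main

import (
	"fmt"
	"sync"
)

// nodeIDContainer is a late-bound holder for a NodeID: it is handed out to
// components before the ID is known and only filled in once allocation
// (via the KV client, after Gossip connects) has succeeded.
type nodeIDContainer struct {
	mu sync.Mutex
	id int32 // 0 means "not allocated yet"
}

// Set records the allocated NodeID.
func (c *nodeIDContainer) Set(id int32) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.id = id
}

// Get returns the NodeID and whether it has been set yet.
func (c *nodeIDContainer) Get() (int32, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.id, c.id != 0
}

func main() {
	var c nodeIDContainer
	if _, ok := c.Get(); !ok {
		fmt.Println("NodeID not yet allocated; every consumer must cope with this")
	}
	c.Set(7) // later, once allocation becomes possible
	id, _ := c.Get()
	fmt.Println("NodeID:", id)
}
```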

Cluster versions deserve an extra shout-out for complexity. Even if
a cluster version was persisted, the node may have been decommissioned
and the cluster upgraded. Much work went into our RPC layer to prevent
connections between incompatible nodes, but there is no boot-time check
that results in a swift and descriptive error - there will be a fatal
error, originating from the RPC subsystem. One aim of our work is to
simplify this by checking the version via an RPC before setting too
many gears in motion.
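
As an illustration of the kind of boot-time check we have in mind, here is a
sketch (a hypothetical helper, not the actual implementation; it assumes the
`roachpb` and `errors` packages imported in the diff below): the node refuses
to start if the persisted cluster version falls outside the window the binary
supports.

```go
// checkVersionSketch is a hypothetical helper illustrating a boot-time
// compatibility check: the persisted cluster version must lie between the
// binary's minimum supported version and the binary's own version.
func checkVersionSketch(persisted, binaryMin, binary roachpb.Version) error {
	if persisted.Less(binaryMin) {
		return errors.Errorf(
			"cluster version %s is too old for this binary (minimum supported: %s)",
			persisted, binaryMin)
	}
	if binary.Less(persisted) {
		return errors.Errorf(
			"cluster version %s is newer than this binary (maximum supported: %s)",
			persisted, binary)
	}
	return nil
}
```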

## Context

This marks the beginning of a series of PRs aiming at improving the
startup code. Ultimately, the goal is to have the Node and all Stores
bootstrapped as early as possible, but in particular before starting
the KV or SQL server subsystems.

Furthermore, we want to achieve this without relying on Gossip, to
prepare for a split between the SQL and KV servers in the context of
multitenancy support (SQL server won't be able to rely on Gossip, but
will still need to announce itself to the KV servers).

## This PR

This PR is an initial simplifying refactor that can help achieve these
goals. The init server (which hosts the Bootstrap RPC) is given more
responsibilities: it is now directly in charge of determining which, if
any, engines are bootstrapped, and explicitly listens to Gossip as well
as the Bootstrap RPC. It returns the cluster ID to the main server
start-up goroutine once it is available. As a result, the main startup
code is simpler, and further threads to pull on have appeared and are
called out in TODOs.
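
A condensed sketch of the flow this enables, using the `newInitServer` and
`ServeAndWait` signatures from the diff below (the surrounding wiring,
including the gRPC registration, is illustrative rather than the verbatim
server code):

```go
// startupSketch shows how the startup goroutine can hand bootstrap entirely
// to the init server. Illustrative only; error handling and the exact call
// sites in pkg/server differ.
func startupSketch(
	ctx context.Context,
	stopper *stop.Stopper,
	grpcServer *grpc.Server,
	g *gossip.Gossip,
	engines []storage.Engine,
	binaryVersion, binaryMinSupportedVersion roachpb.Version,
	bootstrapVersion clusterversion.ClusterVersion,
	zoneCfg, systemZoneCfg *zonepb.ZoneConfig,
) (*initState, error) {
	// The init server now owns bootstrap: it inspects the engines, serves
	// the Bootstrap RPC, and listens to Gossip.
	initSrv := newInitServer(
		binaryVersion, binaryMinSupportedVersion,
		bootstrapVersion, zoneCfg, systemZoneCfg,
	)
	serverpb.RegisterInitServer(grpcServer, initSrv)

	// Block until either an operator bootstraps a new cluster (the `init`
	// command reaches the Bootstrap RPC) or Gossip connects us to an
	// existing one; either way we come back with a cluster ID and the
	// inspected engine state.
	return initSrv.ServeAndWait(ctx, stopper, engines, g)
}
```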

Down the road (i.e. in later PRs), the init server will bootstrap a
NodeID and StoreIDs even when joining an existing cluster. It will
initially mimic/front-load the strategy taken by Node today, i.e. use a
`kv.DB`, but ultimately will bypass Gossip completely and use a simple
RPC call to ask the existing cluster to assign these IDs as needed. This
RPC will also establish the active cluster version, which is required
for SQL multi-tenancy, and generally follows the ideas in cockroachdb#32574.

Release note: None
tbg committed Apr 1, 2020
1 parent 8acbab6 commit 634e427
Showing 6 changed files with 471 additions and 287 deletions.
3 changes: 2 additions & 1 deletion pkg/server/config.go
@@ -261,7 +261,8 @@ type Config struct {
// The argument waitForInit indicates (iff true) that the
// server is not bootstrapped yet, will not bootstrap itself and
// will be waiting for an `init` command or accept bootstrapping
// from a joined node.
// from a joined node. It is set in an advisory fashion, that is, it
// should be used for logging output only.
ReadyFn func(waitForInit bool)

// DelayedBootstrapFn is called if the bootstrap process does not complete
222 changes: 173 additions & 49 deletions pkg/server/init.go
@@ -14,10 +14,23 @@ import (
"context"
"fmt"

"github.com/cockroachdb/cockroach/pkg/clusterversion"
"github.com/cockroachdb/cockroach/pkg/config/zonepb"
"github.com/cockroachdb/cockroach/pkg/gossip"
"github.com/cockroachdb/cockroach/pkg/roachpb"
"github.com/cockroachdb/cockroach/pkg/server/serverpb"
"github.com/cockroachdb/cockroach/pkg/storage"
"github.com/cockroachdb/cockroach/pkg/util/log"
"github.com/cockroachdb/cockroach/pkg/util/stop"
"github.com/cockroachdb/cockroach/pkg/util/syncutil"
"github.com/cockroachdb/cockroach/pkg/util/uuid"
"github.com/cockroachdb/errors"
)

// ErrClusterInitialized is reported when the Bootstrap RPC is run on
// a node already part of an initialized cluster.
var ErrClusterInitialized = fmt.Errorf("cluster has already been initialized")

// initServer manages the temporary init server used during
// bootstrapping.
type initServer struct {
@@ -26,16 +39,153 @@ type initServer struct {
// If set, a Bootstrap() call is rejected with this error.
rejectErr error
}
bootstrapReqCh chan struct{}
connected <-chan struct{}
shouldStop <-chan struct{}
bootstrapBlockCh chan struct{} // blocks Bootstrap() until ServeAndWait() is invoked
bootstrapReqCh chan *initState

engs []storage.Engine // late-bound in ServeAndWait

binaryVersion, binaryMinSupportedVersion roachpb.Version
bootstrapVersion clusterversion.ClusterVersion
bootstrapZoneConfig, bootstrapSystemZoneConfig *zonepb.ZoneConfig
}

func newInitServer(connected <-chan struct{}, shouldStop <-chan struct{}) *initServer {
func newInitServer(
binaryVersion, binaryMinSupportedVersion roachpb.Version,
bootstrapVersion clusterversion.ClusterVersion,
bootstrapZoneConfig, bootstrapSystemZoneConfig *zonepb.ZoneConfig,
) *initServer {
return &initServer{
bootstrapReqCh: make(chan struct{}),
connected: connected,
shouldStop: shouldStop,
bootstrapReqCh: make(chan *initState, 1),
bootstrapBlockCh: make(chan struct{}),

binaryVersion: binaryVersion,
binaryMinSupportedVersion: binaryMinSupportedVersion,
bootstrapVersion: bootstrapVersion,
bootstrapZoneConfig: bootstrapZoneConfig,
bootstrapSystemZoneConfig: bootstrapSystemZoneConfig,
}
}

// initDiskState contains the part of initState that is read from stable
// storage.
//
// TODO(tbg): the above is a lie in the case in which we join an existing
// cluster. In that case, the state returned from ServeAndWait will have the
// clusterID set from Gossip (and there will be no NodeID). The plan is to
// allocate the IDs in ServeAndWait itself eventually, at which point the
// lie disappears.
type initDiskState struct {
// nodeID is zero if joining an existing cluster.
//
// TODO(tbg): see TODO above.
nodeID roachpb.NodeID
// All fields below are always set.
clusterID uuid.UUID
clusterVersion clusterversion.ClusterVersion
initializedEngines []storage.Engine
newEngines []storage.Engine
}

// initState contains the cluster and node IDs as well as the stores, from which
// a CockroachDB server can be started up after ServeAndWait returns.
type initState struct {
initDiskState
// joined is true if this is a new node. Note that the initDiskState may
// reflect the result of bootstrapping a new cluster, i.e. it is not true
// that joined==true implies that the initDiskState shows no initialized
// engines.
//
// This flag should only be used for logging and reporting. A newly
// bootstrapped single-node cluster is functionally equivalent to one that
// restarted; any decisions should be made on persisted data instead of
// this flag.
//
// TODO(tbg): remove this bool. The Node can find out another way whether
// it just joined or restarted.
joined bool
// bootstrapped is true if a new cluster was initialized. If this is true,
// 'joined' above is also true. Usage of this field should follow that of
// 'joined' as well.
bootstrapped bool
}

// ServeAndWait sets up the initServer to accept Bootstrap requests (which will
// block until then). It uses the provided engines and gossip to block until
// either a new cluster was bootstrapped or Gossip connected to an existing
// cluster.
//
// This method must be called only once.
func (s *initServer) ServeAndWait(
ctx context.Context, stopper *stop.Stopper, engs []storage.Engine, g *gossip.Gossip,
) (*initState, error) {
if s.engs != nil {
return nil, errors.New("cannot call ServeAndWait twice")
}

s.engs = engs // Bootstrap() is still blocked, so no data race here

inspectState, err := inspectEngines(ctx, engs, s.binaryVersion, s.binaryMinSupportedVersion)
if err != nil {
return nil, err
}

if len(inspectState.initializedEngines) != 0 {
// We have a NodeID/ClusterID, so don't allow bootstrap.
if err := s.testOrSetRejectErr(ErrClusterInitialized); err != nil {
return nil, errors.Wrap(err, "error unexpectedly set previously")
}
// If anyone mistakenly tried to bootstrap, unblock them so they can get
// the above error.
close(s.bootstrapBlockCh)

// In fact, it's crucial that we return early. This is because Gossip
// won't necessarily connect until a leaseholder for range 1 gossips the
// cluster ID, and all nodes in the cluster might be starting up right
// now. Without this return, we could have all nodes in the cluster
// deadlocked on g.Connected below. For similar reasons, we can't ever
// hope to initialize the newEngines below, for which we would need to
// increment a KV counter.
return &initState{
initDiskState: *inspectState,
joined: false,
bootstrapped: false,
}, nil
}

log.Info(ctx, "no stores bootstrapped and --join flag specified, awaiting init command or join with an already initialized node.")
close(s.bootstrapBlockCh)

select {
case <-stopper.ShouldQuiesce():
return nil, stop.ErrUnavailable
case state := <-s.bootstrapReqCh:
// Bootstrap() did its job.
log.Infof(ctx, "**** cluster %s has been created", state.clusterID)
return state, nil
case <-g.Connected:
// Gossip connected, that is, we know a ClusterID. Due to the early
// return above, we know that all of our engines are empty, i.e. we
// don't have a NodeID yet (and the cluster version is the minimum we
// support). Commence startup; the Node will realize it's short a NodeID
// and will request one.
//
// TODO(tbg): use a kv.DB to get NodeID and StoreIDs when necessary and
// set everything up here. This will take the Node out of that business
// entirely and means we'll need much fewer NodeID/ClusterIDContainers.
// (It's also so much simpler to think about). The RPC will also tell us
// a cluster version to use instead of the lowest possible one (reducing
// the short amount of time until the Gossip hook bumps the version);
// this doesn't fix anything but again, is simpler to think about.
clusterID, err := g.GetClusterID()
if err != nil {
return nil, err
}
inspectState.clusterID = clusterID
return &initState{
initDiskState: *inspectState,
joined: true,
bootstrapped: false,
}, nil
}
}

Expand All @@ -52,54 +202,28 @@ func (s *initServer) testOrSetRejectErr(err error) error {
return nil
}

type initServerResult int

const (
invalidInitResult initServerResult = iota
connectedToCluster
needBootstrap
)

// awaitBootstrap blocks until the connected channel is closed or a Bootstrap()
// call is received. It returns true if a Bootstrap() call is received,
// instructing the caller to perform cluster bootstrap. It returns false if the
// connected channel is closed, telling the caller that someone else
// bootstrapped the cluster. Assuming that the connected channel comes from
// Gossip, this means that the cluster ID is now available in gossip.
func (s *initServer) awaitBootstrap() (initServerResult, error) {
select {
case <-s.connected:
_ = s.testOrSetRejectErr(fmt.Errorf("already connected to cluster"))
return connectedToCluster, nil
case <-s.bootstrapReqCh:
return needBootstrap, nil
case <-s.shouldStop:
err := fmt.Errorf("stop called while waiting to bootstrap")
_ = s.testOrSetRejectErr(err)
return invalidInitResult, err
}
}

// Bootstrap unblocks an awaitBootstrap() call. If awaitBootstrap() hasn't been
// called yet, it will not block the next time it's called.
// Bootstrap implements the serverpb.Init service. Users set up a new
// CockroachDB cluster by calling this endpoint on *exactly one node* in the
// cluster (retrying only on that node).
// Attempting to bootstrap a node that was already bootstrapped will result in
// an error.
//
// TODO(andrei): There's a race between gossip connecting and this initServer
// getting a Bootstrap request that allows both to succeed: there's no
// synchronization between gossip and this server and so gossip can succeed in
// propagating one cluster ID while this call succeeds in telling the Server to
// bootstrap and created a new cluster ID. We should fix it somehow by tangling
// the gossip.Server with this initServer such that they serialize access to a
// clusterID and decide among themselves a single winner for the race.
// NB: there is no protection against users erroneously bootstrapping multiple
// nodes. In that case, they end up with more than one cluster, and nodes
// panicking or refusing to connect to each other.
func (s *initServer) Bootstrap(
ctx context.Context, request *serverpb.BootstrapRequest,
) (response *serverpb.BootstrapResponse, err error) {
<-s.bootstrapBlockCh // block until ServeAndWait() is active
// NB: this isn't necessary since bootstrapCluster would fail, but this is
// cleaner.
if err := s.testOrSetRejectErr(ErrClusterInitialized); err != nil {
return nil, err
}
close(s.bootstrapReqCh)
state, err := bootstrapCluster(ctx, s.engs, s.bootstrapVersion, s.bootstrapZoneConfig, s.bootstrapSystemZoneConfig)
if err != nil {
return nil, err
}
s.bootstrapReqCh <- state
return &serverpb.BootstrapResponse{}, nil
}

// ErrClusterInitialized is reported when the Bootstrap RPC is run on
// a node already part of an initialized cluster.
var ErrClusterInitialized = fmt.Errorf("cluster has already been initialized")