server: simplify bootstrap via fatter init server
## Motivation

Today, starting up a server is complicated. This is especially true
when bootstrap is necessary. By this, we mean that either

- no NodeID is known. This implies that none of the engines were
  initialized yet
- the NodeID is known (i.e. at least one store is initialized), but a
  new engine was added.

When the process first starts, and a NodeID is known, a ClusterID is
also known (since they get persisted together, on every store). For the
same reason, a persisted cluster version is known. In this case, we can
start a fully initialized cluster, so this is the easy case.

It is more difficult when no NodeID is known, in which case the server
is just starting up for the first time, with its engines all blank. It
needs to somehow allocate NodeIDs (and StoreIDs) which are then written
to the engines. It also needs to, at least initially, use the lowest
possible cluster version it can tolerate (after all, the cluster may
actually run at that lowest version). Right now, there is a delicate
dance: we thread late-bound ClusterID and NodeID containers all over the
place, spin up a mostly dysfunctional Node, use its KV client to
allocate NodeID and StoreIDs once this is possible - we need Gossip
to have connected first - and update the containers. This dance is complex
and error-prone, and it ossifies any code it touches.
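
To make "late-bound containers" concrete, here is a minimal sketch of the
pattern (plain Go, not the actual container types in the codebase):
components receive the container before the ID exists and have to tolerate
it being unset.

```go
package main

import (
	"fmt"
	"sync"
)

// nodeIDContainer is a late-bound holder for a NodeID: it is handed out to
// components before the ID is known and only filled in once allocation
// (via the KV client, after Gossip connects) has succeeded.
type nodeIDContainer struct {
	mu sync.Mutex
	id int32 // 0 means "not allocated yet"
}

// Set records the allocated NodeID.
func (c *nodeIDContainer) Set(id int32) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.id = id
}

// Get returns the NodeID and whether it has been set yet.
func (c *nodeIDContainer) Get() (int32, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.id, c.id != 0
}

func main() {
	var c nodeIDContainer
	if _, ok := c.Get(); !ok {
		fmt.Println("NodeID not yet allocated; every consumer must cope with this")
	}
	c.Set(7) // later, once allocation becomes possible
	id, _ := c.Get()
	fmt.Println("NodeID:", id)
}
```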

Cluster versions deserve an extra shout-out for complexity. Even if
a cluster version was persisted, the node may have been decommissioned
and the cluster upgraded. Much work went into our RPC layer to prevent
connections between incompatible nodes, but there is no boot-time check
that results in a swift and descriptive error - there will be a fatal
error, originating from the RPC subsystem. One aim of our work is to
simplify this by checking the version via an RPC before setting too
many gears in motion.
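
As an illustration of the kind of boot-time check we have in mind, here is a
sketch (a hypothetical helper, not the actual implementation; it assumes the
`roachpb` and `errors` packages imported in the diff below): the node refuses
to start if the persisted cluster version falls outside the window the binary
supports.

```go
// checkVersionSketch is a hypothetical helper illustrating a boot-time
// compatibility check: the persisted cluster version must lie between the
// binary's minimum supported version and the binary's own version.
func checkVersionSketch(persisted, binaryMin, binary roachpb.Version) error {
	if persisted.Less(binaryMin) {
		return errors.Errorf(
			"cluster version %s is too old for this binary (minimum supported: %s)",
			persisted, binaryMin)
	}
	if binary.Less(persisted) {
		return errors.Errorf(
			"cluster version %s is newer than this binary (maximum supported: %s)",
			persisted, binary)
	}
	return nil
}
```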

## Context

This marks the beginning of a series of PRs aiming at improving the
startup code. Ultimately, the goal is to have the Node and all Stores
bootstrapped as early as possible, but in particular before starting
the KV or SQL server subsystems.

Furthermore, we want to achieve this without relying on Gossip, to
prepare for a split between the SQL and KV servers in the context of
multitenancy support (SQL server won't be able to rely on Gossip, but
will still need to announce itself to the KV servers).

## This PR

This PR is an initial simplifying refactor that can help achieve these
goals. The init server (which hosts the Bootstrap RPC) is given more
responsibilities: it is now directly in charge of determining which, if
any, engines are bootstrapped, and explicitly listens to Gossip as well
as the Bootstrap RPC. It returns the cluster ID to the main server
start-up goroutine once it is available. As a result, the main startup
code is simpler, and further threads to pull on have appeared and are
called out in TODOs.
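
A condensed sketch of the flow this enables, using the `newInitServer` and
`ServeAndWait` signatures from the diff below (the surrounding wiring,
including the gRPC registration, is illustrative rather than the verbatim
server code):

```go
// startupSketch shows how the startup goroutine can hand bootstrap entirely
// to the init server. Illustrative only; error handling and the exact call
// sites in pkg/server differ.
func startupSketch(
	ctx context.Context,
	stopper *stop.Stopper,
	grpcServer *grpc.Server,
	g *gossip.Gossip,
	engines []storage.Engine,
	binaryVersion, binaryMinSupportedVersion roachpb.Version,
	bootstrapVersion clusterversion.ClusterVersion,
	zoneCfg, systemZoneCfg *zonepb.ZoneConfig,
) (*initState, error) {
	// The init server now owns bootstrap: it inspects the engines, serves
	// the Bootstrap RPC, and listens to Gossip.
	initSrv := newInitServer(
		binaryVersion, binaryMinSupportedVersion,
		bootstrapVersion, zoneCfg, systemZoneCfg,
	)
	serverpb.RegisterInitServer(grpcServer, initSrv)

	// Block until either an operator bootstraps a new cluster (the `init`
	// command reaches the Bootstrap RPC) or Gossip connects us to an
	// existing one; either way we come back with a cluster ID and the
	// inspected engine state.
	return initSrv.ServeAndWait(ctx, stopper, engines, g)
}
```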

Down the road (i.e. in later PRs), the init server will bootstrap a
NodeID and StoreIDs even when joining an existing cluster. It will
initially mimic/front-load the strategy taken by Node today, i.e. use a
`kv.DB`, but ultimately will bypass Gossip completely and use a simple
RPC call to ask the existing cluster to assign these IDs as needed. This
RPC will also establish the active cluster version, which is required
for SQL multi-tenancy, and generally follows the ideas in cockroachdb#32574.

Release note: None
tbg committed Apr 1, 2020
1 parent 8acbab6 commit 634e427
Showing 6 changed files with 471 additions and 287 deletions.
3 changes: 2 additions & 1 deletion pkg/server/config.go
@@ -261,7 +261,8 @@ type Config struct {
// The argument waitForInit indicates (iff true) that the
// server is not bootstrapped yet, will not bootstrap itself and
// will be waiting for an `init` command or accept bootstrapping
// from a joined node.
// from a joined node. It is set in an advisory fashion, that is, it
// should be used for logging output only.
ReadyFn func(waitForInit bool)

// DelayedBootstrapFn is called if the bootstrap process does not complete
222 changes: 173 additions & 49 deletions pkg/server/init.go
@@ -14,10 +14,23 @@ import (
"context"
"fmt"

"github.com/cockroachdb/cockroach/pkg/clusterversion"
"github.com/cockroachdb/cockroach/pkg/config/zonepb"
"github.com/cockroachdb/cockroach/pkg/gossip"
"github.com/cockroachdb/cockroach/pkg/roachpb"
"github.com/cockroachdb/cockroach/pkg/server/serverpb"
"github.com/cockroachdb/cockroach/pkg/storage"
"github.com/cockroachdb/cockroach/pkg/util/log"
"github.com/cockroachdb/cockroach/pkg/util/stop"
"github.com/cockroachdb/cockroach/pkg/util/syncutil"
"github.com/cockroachdb/cockroach/pkg/util/uuid"
"github.com/cockroachdb/errors"
)

// ErrClusterInitialized is reported when the Bootstrap RPC is run on
// a node already part of an initialized cluster.
var ErrClusterInitialized = fmt.Errorf("cluster has already been initialized")

// initServer manages the temporary init server used during
// bootstrapping.
type initServer struct {
@@ -26,16 +39,153 @@ type initServer struct {
// If set, a Bootstrap() call is rejected with this error.
rejectErr error
}
bootstrapReqCh chan struct{}
connected <-chan struct{}
shouldStop <-chan struct{}
bootstrapBlockCh chan struct{} // blocks Bootstrap() until ServeAndWait() is invoked
bootstrapReqCh chan *initState

engs []storage.Engine // late-bound in ServeAndWait

binaryVersion, binaryMinSupportedVersion roachpb.Version
bootstrapVersion clusterversion.ClusterVersion
bootstrapZoneConfig, bootstrapSystemZoneConfig *zonepb.ZoneConfig
}

func newInitServer(connected <-chan struct{}, shouldStop <-chan struct{}) *initServer {
func newInitServer(
binaryVersion, binaryMinSupportedVersion roachpb.Version,
bootstrapVersion clusterversion.ClusterVersion,
bootstrapZoneConfig, bootstrapSystemZoneConfig *zonepb.ZoneConfig,
) *initServer {
return &initServer{
bootstrapReqCh: make(chan struct{}),
connected: connected,
shouldStop: shouldStop,
bootstrapReqCh: make(chan *initState, 1),
bootstrapBlockCh: make(chan struct{}),

binaryVersion: binaryVersion,
binaryMinSupportedVersion: binaryMinSupportedVersion,
bootstrapVersion: bootstrapVersion,
bootstrapZoneConfig: bootstrapZoneConfig,
bootstrapSystemZoneConfig: bootstrapSystemZoneConfig,
}
}

// initDiskState contains the part of initState that is read from stable
// storage.
//
// TODO(tbg): the above is a lie in the case in which we join an existing
// cluster. In that case, the state returned from ServeAndWait will have the
// clusterID set from Gossip (and there will be no NodeID). The plan is to
// allocate the IDs in ServeAndWait itself eventually, at which point the
// lie disappears.
type initDiskState struct {
// nodeID is zero if joining an existing cluster.
//
// TODO(tbg): see TODO above.
nodeID roachpb.NodeID
// All fields below are always set.
clusterID uuid.UUID
clusterVersion clusterversion.ClusterVersion
initializedEngines []storage.Engine
newEngines []storage.Engine
}

// initState contains the cluster and node IDs as well as the stores, from which
// a CockroachDB server can be started up after ServeAndWait returns.
type initState struct {
initDiskState
// joined is true if this is a new node. Note that the initDiskState may
// reflect the result of bootstrapping a new cluster, i.e. it is not true
// that joined==true implies that the initDiskState shows no initialized
// engines.
//
// This flag should only be used for logging and reporting. A newly
// bootstrapped single-node cluster is functionally equivalent to one that
// restarted; any decisions should be made on persisted data instead of
// this flag.
//
// TODO(tbg): remove this bool. The Node can find out another way whether
// it just joined or restarted.
joined bool
// bootstrapped is true if a new cluster was initialized. If this is true,
// 'joined' above is also true. Usage of this field should follow that of
// 'joined' as well.
bootstrapped bool
}

// ServeAndWait sets up the initServer to accept Bootstrap requests (which will
// block until then). It uses the provided engines and gossip to block until
// either a new cluster was bootstrapped or Gossip connected to an existing
// cluster.
//
// This method must be called only once.
func (s *initServer) ServeAndWait(
ctx context.Context, stopper *stop.Stopper, engs []storage.Engine, g *gossip.Gossip,
) (*initState, error) {
if s.engs != nil {
return nil, errors.New("cannot call ServeAndWait twice")
}

s.engs = engs // Bootstrap() is still blocked, so no data race here

inspectState, err := inspectEngines(ctx, engs, s.binaryVersion, s.binaryMinSupportedVersion)
if err != nil {
return nil, err
}

if len(inspectState.initializedEngines) != 0 {
// We have a NodeID/ClusterID, so don't allow bootstrap.
if err := s.testOrSetRejectErr(ErrClusterInitialized); err != nil {
return nil, errors.Wrap(err, "error unexpectedly set previously")
}
// If anyone mistakenly tried to bootstrap, unblock them so they can get
// the above error.
close(s.bootstrapBlockCh)

// In fact, it's crucial that we return early. This is because Gossip
// won't necessarily connect until a leaseholder for range 1 gossips the
// cluster ID, and all nodes in the cluster might be starting up right
// now. Without this return, we could have all nodes in the cluster
// deadlocked on g.Connected below. For similar reasons, we can't ever
// hope to initialize the newEngines below, for which we would need to
// increment a KV counter.
return &initState{
initDiskState: *inspectState,
joined: false,
bootstrapped: false,
}, nil
}

log.Info(ctx, "no stores bootstrapped and --join flag specified, awaiting init command or join with an already initialized node.")
close(s.bootstrapBlockCh)

select {
case <-stopper.ShouldQuiesce():
return nil, stop.ErrUnavailable
case state := <-s.bootstrapReqCh:
// Bootstrap() did its job.
log.Infof(ctx, "**** cluster %s has been created", state.clusterID)
return state, nil
case <-g.Connected:
// Gossip connected, that is, we know a ClusterID. Due to the early
// return above, we know that all of our engines are empty, i.e. we
// don't have a NodeID yet (and the cluster version is the minimum we
// support). Commence startup; the Node will realize it's short a NodeID
// and will request one.
//
// TODO(tbg): use a kv.DB to get NodeID and StoreIDs when necessary and
// set everything up here. This will take the Node out of that business
// entirely and means we'll need much fewer NodeID/ClusterIDContainers.
// (It's also so much simpler to think about). The RPC will also tell us
// a cluster version to use instead of the lowest possible one (reducing
// the short amount of time until the Gossip hook bumps the version);
// this doesn't fix anything but again, is simpler to think about.
clusterID, err := g.GetClusterID()
if err != nil {
return nil, err
}
inspectState.clusterID = clusterID
return &initState{
initDiskState: *inspectState,
joined: true,
bootstrapped: false,
}, nil
}
}

Expand All @@ -52,54 +202,28 @@ func (s *initServer) testOrSetRejectErr(err error) error {
return nil
}

type initServerResult int

const (
invalidInitResult initServerResult = iota
connectedToCluster
needBootstrap
)

// awaitBootstrap blocks until the connected channel is closed or a Bootstrap()
// call is received. It returns true if a Bootstrap() call is received,
// instructing the caller to perform cluster bootstrap. It returns false if the
// connected channel is closed, telling the caller that someone else
// bootstrapped the cluster. Assuming that the connected channel comes from
// Gossip, this means that the cluster ID is now available in gossip.
func (s *initServer) awaitBootstrap() (initServerResult, error) {
select {
case <-s.connected:
_ = s.testOrSetRejectErr(fmt.Errorf("already connected to cluster"))
return connectedToCluster, nil
case <-s.bootstrapReqCh:
return needBootstrap, nil
case <-s.shouldStop:
err := fmt.Errorf("stop called while waiting to bootstrap")
_ = s.testOrSetRejectErr(err)
return invalidInitResult, err
}
}

// Bootstrap unblocks an awaitBootstrap() call. If awaitBootstrap() hasn't been
// called yet, it will not block the next time it's called.
// Bootstrap implements the serverpb.Init service. Users set up a new
// CockroachDB cluster by calling this endpoint on *exactly one node* in the
// cluster (retrying only on that node).
// Attempting to bootstrap a node that was already bootstrapped will result in
// an error.
//
// TODO(andrei): There's a race between gossip connecting and this initServer
// getting a Bootstrap request that allows both to succeed: there's no
// synchronization between gossip and this server and so gossip can succeed in
// propagating one cluster ID while this call succeeds in telling the Server to
// bootstrap and created a new cluster ID. We should fix it somehow by tangling
// the gossip.Server with this initServer such that they serialize access to a
// clusterID and decide among themselves a single winner for the race.
// NB: there is no protection against users erroneously bootstrapping multiple
// nodes. In that case, they end up with more than one cluster, and nodes
// panicking or refusing to connect to each other.
func (s *initServer) Bootstrap(
ctx context.Context, request *serverpb.BootstrapRequest,
) (response *serverpb.BootstrapResponse, err error) {
<-s.bootstrapBlockCh // block until ServeAndWait() is active
// NB: this isn't necessary since bootstrapCluster would fail, but this is
// cleaner.
if err := s.testOrSetRejectErr(ErrClusterInitialized); err != nil {
return nil, err
}
close(s.bootstrapReqCh)
state, err := bootstrapCluster(ctx, s.engs, s.bootstrapVersion, s.bootstrapZoneConfig, s.bootstrapSystemZoneConfig)
if err != nil {
return nil, err
}
s.bootstrapReqCh <- state
return &serverpb.BootstrapResponse{}, nil
}

// ErrClusterInitialized is reported when the Bootstrap RPC is run on
// a node already part of an initialized cluster.
var ErrClusterInitialized = fmt.Errorf("cluster has already been initialized")