Skip to content

Commit

Permalink
Merge #85983
Browse files Browse the repository at this point in the history
85983: kv: permit different max clock offsets on different nodes in same cluster r=nvanbenschoten a=nvanbenschoten

Fixes #66868.

Before this commit, it was not possible to change the maximum clock offset of a cluster without shutting down all nodes in the cluster and bringing them all back up with a new `--max-offset` flag. This was enforced by protection added in #9612, which would crash a node if it detected different settings on different nodes. A stop-the-world cluster restart is quite disruptive and prevents us from changing the default value of the `--max-offset` flag between releases of CockroachDB or on existing CC clusters.

This commit removes this protection. The change itself is trivial. The real work is convincing ourselves that it is safe to run in a mixed max-offset cluster. We do that below.

In the following table, we detail each use of `--max-offset`. We then explore the hazards of running in a cluster with mixed `--max-offset` settings.

### Uses of `--max-offset`

| Use         | Description |
| ----------- | ----------- |
| `kv.NewTxn` | Used to configure the transaction's global uncertainty limit |
| `computeIntervalForNonTxn` | Used to configure the non-transacitonal request's global uncertainty limit |
| `TxnCoordSender.maybeCommitWait` | **When in "linearizable" mode**, used to sleep until all clocks beyond local HLC |
| `Txn.DeadlineLikelySufficient` | Used to estimate whether the current txn deadline is sufficiently far in the future |
| `closedts.TargetForPolicy` | Used to estimate lead time for global tables |
| `Replica.leaseStatus` | Used to determine the start of the stasis period for leases |
| `startupmigrations.LeaseManager.timeRemaining` | Used to determine the time remaining on a migration lease |
| `RemoteClockMonitor.VerifyClockOffset` | Used to detect clock synchronization errors |

### Hazards of mixed `--max-offset`

| Use         | skew < min(max-offset) | min(max-offset) < skew < max(max-offset) | max(max-offset) < skew |
| ----------- | ----------- | ----------- | ----------- |
| `kv.NewTxn` | N/A       | Stale reads       | Stale reads       |
| `computeIntervalForNonTxn` | N/A       | Stale reads       | Stale reads       |
| `TxnCoordSender.maybeCommitWait` | N/A       | Causal reverse       | Causal reverse       |
| `Txn.DeadlineLikelySufficient` | N/A       | Deadline exceeded retry errors       | Deadline exceeded retry errors       |
| `closedts.TargetForPolicy` | N/A       | Global table reads redirect to leaseholder       | Global table reads redirect to leaseholder       |
| `Replica.leaseStatus` | N/A       | Stale reads for non-txn requests       | Stale reads for non-txn requests       |
| `startupmigrations.LeaseManager.timeRemaining` | N/A       | Lease replaced unexpectedly, [error](https://github.com/cockroachdb/cockroach/blob/8e3ee57f3c8ea734c7221ff4c758c91cd928b6d3/pkg/startupmigrations/leasemanager/lease.go#L167)       | Lease replaced unexpectedly, [error](https://github.com/cockroachdb/cockroach/blob/8e3ee57f3c8ea734c7221ff4c758c91cd928b6d3/pkg/startupmigrations/leasemanager/lease.go#L167)       |
| `RemoteClockMonitor.VerifyClockOffset` | N/A       | Nodes in minority self-terminate **if their own max-offset < skew**  | Nodes in minority self-terminate       |

Notice that in all cases except the last, the outcome of `min(max-offset) < skew < max(max-offset)` and `max(max-offset) < skew` are identical. This means that, ignoring the last use, the hazards of `max-offset < skew` in a cluster with a homogeneous setting for `--max-offset` is the same as the hazard of `min(max-offset) < skew` in a cluster with a heterogeneous setting for `--max-offset`. As a result, we can consider `min(max-offset)` to be the "effective" max-offset of a cluster.

#### Use in `Replica.leaseStatus`

The use in `Replica.leaseStatus` of the max clock offset to determine whether a request falls into a lease's stasis period is a subtle case here, so it deserves a bit more explanation. As a reminder, the lease stasis period is described [here](https://github.com/cockroachdb/cockroach/blob/8e3ee57f3c8ea734c7221ff4c758c91cd928b6d3/pkg/kv/kvserver/kvserverpb/lease_status.proto#L27).

Consider the case where `min(max-offset) < skew < max(max-offset)` and a request is sent to an outgoing leaseholder near the end of its lease. The interesting case is where the outgoing leaseholder has a smaller max-offset than the skew. In such cases, it may not consider a non-transaction request to be in its stasis period even though a different replica saw the lease as expired, requested a new lease, and served a write. This could allow the non-transactional request to serve a stale read.

Transactional requests are [not subject](https://github.com/cockroachdb/cockroach/blob/8e3ee57f3c8ea734c7221ff4c758c91cd928b6d3/pkg/kv/kvserver/replica_range_lease.go#L611) to the stasis period, so they are not discussed in this example. However, the effect of `min(max-offset) < skew < max(max-offset)` is the same — stale reads.

#### Use in `RemoteClockMonitor.VerifyClockOffset`

The use of `RemoteClockMonitor.VerifyClockOffset` also deserves discussion because it's the one case where mixed `--max-offset` values make a difference. If the nodes with skew (i.e. those in the minority) have lower `--max-offset` values, they will self-terminate. However, if the nodes with skew have higher `--max-offset` values, they will not self-terminate. This could leave the cluster open to the other hazards indefinitely, without the skewed nodes ever deciding to self-terminate. For instance, these skewed nodes could continue to perform writes at high timestamps, causing other nodes with lower `--max-offset` values to serve stale reads.

One option I explored to address this was to have `RemoteClockMonitor.VerifyClockOffset` use the minimum max offset across the cluster instead of its local max offset when verifying clocks. This would be possible by using the `OriginMaxOffsetNanos` value communicated in each `PingRequest`. I decided against this for now in favor of keeping things simple, as it had some rough edges around ignoring stale max-offset values. It's also not clear how this protection will work in a world with dynamic clock uncertainty error bounds.

----

After this lands, we'll need to remove the following text in the docs, which was added in cockroachdb/docs#10988 and in other PRs:
```
Note that this value must be the same on all nodes in the cluster and cannot be
changed with a rolling upgrade. In order to change it, first stop every node in
the cluster. Then once the entire cluster is offline, restart each node with the
new value.
```
and
```
For existing clusters, changing the setting will require restarting all of the
nodes in your cluster at the same time; it cannot be done with a rolling
restart.
```

----

Release note (ops change): clusters can now run nodes with different --max-offset settings at the same time. This enables operators to perform a rolling restart to change the value of each node's --max-offset flag.

Co-authored-by: Nathan VanBenschoten <[email protected]>
  • Loading branch information
craig[bot] and nvanbenschoten committed Aug 13, 2022
2 parents aa77fba + d6da774 commit 70a3863
Show file tree
Hide file tree
Showing 6 changed files with 1 addition and 82 deletions.
1 change: 0 additions & 1 deletion pkg/rpc/BUILD.bazel
Original file line number Diff line number Diff line change
Expand Up @@ -90,7 +90,6 @@ go_test(
embed = [":rpc"],
deps = [
"//pkg/base",
"//pkg/cli/exit",
"//pkg/clusterversion",
"//pkg/keys",
"//pkg/roachpb",
Expand Down
1 change: 0 additions & 1 deletion pkg/rpc/context.go
Original file line number Diff line number Diff line change
Expand Up @@ -1821,7 +1821,6 @@ func (rpcCtx *Context) runHeartbeat(
func (rpcCtx *Context) NewHeartbeatService() *HeartbeatService {
return &HeartbeatService{
clock: rpcCtx.Clock,
maxOffset: rpcCtx.MaxOffset,
remoteClockMonitor: rpcCtx.RemoteClocks,
clusterName: rpcCtx.ClusterName(),
disableClusterNameVerification: rpcCtx.Config.DisableClusterNameVerification,
Expand Down
11 changes: 0 additions & 11 deletions pkg/rpc/context_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -133,7 +133,6 @@ func TestHeartbeatCB(t *testing.T) {
s := newTestServer(t, serverCtx)
RegisterHeartbeatServer(s, &HeartbeatService{
clock: clock,
maxOffset: maxOffset,
remoteClockMonitor: serverCtx.RemoteClocks,
clusterID: serverCtx.StorageClusterID,
nodeID: serverCtx.NodeID,
Expand Down Expand Up @@ -1025,7 +1024,6 @@ func TestHeartbeatHealthTransport(t *testing.T) {
s := grpc.NewServer(grpc.Creds(credentials.NewTLS(tlsConfig)))
RegisterHeartbeatServer(s, &HeartbeatService{
clock: clock,
maxOffset: maxOffset,
remoteClockMonitor: serverCtx.RemoteClocks,
clusterID: serverCtx.StorageClusterID,
nodeID: serverCtx.NodeID,
Expand Down Expand Up @@ -1207,7 +1205,6 @@ func TestOffsetMeasurement(t *testing.T) {
s := newTestServer(t, serverCtx)
RegisterHeartbeatServer(s, &HeartbeatService{
clock: serverClock,
maxOffset: maxOffset,
remoteClockMonitor: serverCtx.RemoteClocks,
clusterID: serverCtx.StorageClusterID,
nodeID: serverCtx.NodeID,
Expand Down Expand Up @@ -1391,7 +1388,6 @@ func TestRemoteOffsetUnhealthy(t *testing.T) {
s := newTestServer(t, nodeCtxs[i].ctx)
RegisterHeartbeatServer(s, &HeartbeatService{
clock: clock,
maxOffset: maxOffset,
remoteClockMonitor: nodeCtxs[i].ctx.RemoteClocks,
clusterID: nodeCtxs[i].ctx.StorageClusterID,
nodeID: nodeCtxs[i].ctx.NodeID,
Expand Down Expand Up @@ -1586,7 +1582,6 @@ func grpcRunKeepaliveTestCase(testCtx context.Context, c grpcKeepaliveTestCase)
hss := &HeartbeatStreamService{
HeartbeatService: HeartbeatService{
clock: clock,
maxOffset: maxOffset,
remoteClockMonitor: serverCtx.RemoteClocks,
clusterID: serverCtx.StorageClusterID,
nodeID: serverCtx.NodeID,
Expand Down Expand Up @@ -1868,7 +1863,6 @@ func TestClusterIDMismatch(t *testing.T) {
s := newTestServer(t, serverCtx)
RegisterHeartbeatServer(s, &HeartbeatService{
clock: clock,
maxOffset: maxOffset,
remoteClockMonitor: serverCtx.RemoteClocks,
clusterID: serverCtx.StorageClusterID,
nodeID: serverCtx.NodeID,
Expand Down Expand Up @@ -1943,7 +1937,6 @@ func TestClusterNameMismatch(t *testing.T) {
s := newTestServer(t, serverCtx)
RegisterHeartbeatServer(s, &HeartbeatService{
clock: clock,
maxOffset: maxOffset,
remoteClockMonitor: serverCtx.RemoteClocks,
clusterID: serverCtx.StorageClusterID,
nodeID: serverCtx.NodeID,
Expand Down Expand Up @@ -1995,7 +1988,6 @@ func TestNodeIDMismatch(t *testing.T) {
s := newTestServer(t, serverCtx)
RegisterHeartbeatServer(s, &HeartbeatService{
clock: clock,
maxOffset: maxOffset,
remoteClockMonitor: serverCtx.RemoteClocks,
clusterID: serverCtx.StorageClusterID,
nodeID: serverCtx.NodeID,
Expand Down Expand Up @@ -2070,7 +2062,6 @@ func TestVersionCheckBidirectional(t *testing.T) {
s := newTestServer(t, serverCtx)
RegisterHeartbeatServer(s, &HeartbeatService{
clock: clock,
maxOffset: maxOffset,
remoteClockMonitor: serverCtx.RemoteClocks,
clusterID: serverCtx.StorageClusterID,
nodeID: serverCtx.NodeID,
Expand Down Expand Up @@ -2118,7 +2109,6 @@ func TestGRPCDialClass(t *testing.T) {
s := newTestServer(t, serverCtx)
RegisterHeartbeatServer(s, &HeartbeatService{
clock: clock,
maxOffset: maxOffset,
remoteClockMonitor: serverCtx.RemoteClocks,
clusterID: serverCtx.StorageClusterID,
nodeID: serverCtx.NodeID,
Expand Down Expand Up @@ -2178,7 +2168,6 @@ func TestTestingKnobs(t *testing.T) {
))
RegisterHeartbeatServer(s, &HeartbeatService{
clock: clock,
maxOffset: maxOffset,
remoteClockMonitor: serverCtx.RemoteClocks,
clusterID: serverCtx.StorageClusterID,
nodeID: serverCtx.NodeID,
Expand Down
13 changes: 0 additions & 13 deletions pkg/rpc/heartbeat.go
Original file line number Diff line number Diff line change
Expand Up @@ -42,8 +42,6 @@ func (r RemoteOffset) String() string {
// remote clocks sent to it by storing them in the remoteClockMonitor.
type HeartbeatService struct {
clock hlc.WallClock
// maxOffset is the maximum tolerated clock skew between nodes.
maxOffset time.Duration
// A pointer to the RemoteClockMonitor configured in the RPC Context,
// shared by rpc clients, to keep track of remote clock measurements.
remoteClockMonitor *RemoteClockMonitor
Expand Down Expand Up @@ -158,17 +156,6 @@ func (hs *HeartbeatService) Ping(ctx context.Context, args *PingRequest) (*PingR
return nil, errors.Wrap(err, "version compatibility check failed on ping request")
}

// Enforce that clock max offsets are identical between nodes.
// Commit suicide in the event that this is ever untrue.
// This check is ignored if either offset is set to 0 (for unittests).
// Note that we validated this connection already. Different clusters
// could very well have different max offsets.
mo, amo := hs.maxOffset, time.Duration(args.OriginMaxOffsetNanos)
if hs.maxOffset != 0 && amo != 0 && mo != amo {
log.Fatalf(ctx, "locally configured maximum clock offset (%s) "+
"does not match that of node %s (%s)", mo, args.OriginAddr, amo)
}

if fn := hs.onHandlePing; fn != nil {
if err := fn(ctx, args); err != nil {
return nil, err
Expand Down
1 change: 1 addition & 0 deletions pkg/rpc/heartbeat.proto
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,7 @@ message PingRequest {
// The address of the client.
optional string origin_addr = 3 [(gogoproto.nullable) = false];
// The configured maximum clock offset (in nanoseconds) on the server.
// TODO(nvanbenschoten): remove this field in v23.1. It is no longer read.
optional int64 origin_max_offset_nanos = 4 [(gogoproto.nullable) = false];
// Cluster ID to prevent connections between nodes in different clusters.
optional bytes origin_cluster_id = 5 [
Expand Down
56 changes: 0 additions & 56 deletions pkg/rpc/heartbeat_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -16,13 +16,11 @@ import (
"time"

"github.com/cockroachdb/cockroach/pkg/base"
"github.com/cockroachdb/cockroach/pkg/cli/exit"
"github.com/cockroachdb/cockroach/pkg/clusterversion"
"github.com/cockroachdb/cockroach/pkg/roachpb"
"github.com/cockroachdb/cockroach/pkg/settings/cluster"
"github.com/cockroachdb/cockroach/pkg/util/hlc"
"github.com/cockroachdb/cockroach/pkg/util/leaktest"
"github.com/cockroachdb/cockroach/pkg/util/log"
"github.com/cockroachdb/cockroach/pkg/util/stop"
"github.com/cockroachdb/cockroach/pkg/util/syncutil"
"github.com/cockroachdb/cockroach/pkg/util/timeutil"
Expand Down Expand Up @@ -51,7 +49,6 @@ func TestHeartbeatReply(t *testing.T) {
st := cluster.MakeTestingClusterSettings()
heartbeat := &HeartbeatService{
clock: clock,
maxOffset: maxOffset,
remoteClockMonitor: newRemoteClockMonitor(clock, maxOffset, time.Hour, 0),
clusterID: &base.ClusterIDContainer{},
settings: st,
Expand Down Expand Up @@ -103,7 +100,6 @@ func (mhs *ManualHeartbeatService) Ping(
}
hs := HeartbeatService{
clock: mhs.clock,
maxOffset: mhs.maxOffset,
remoteClockMonitor: mhs.remoteClockMonitor,
clusterID: &base.ClusterIDContainer{},
settings: mhs.settings,
Expand All @@ -126,7 +122,6 @@ func TestManualHeartbeat(t *testing.T) {
}
regularHeartbeat := &HeartbeatService{
clock: clock,
maxOffset: maxOffset,
remoteClockMonitor: newRemoteClockMonitor(clock, maxOffset, time.Hour, 0),
clusterID: &base.ClusterIDContainer{},
settings: st,
Expand Down Expand Up @@ -158,54 +153,6 @@ func TestManualHeartbeat(t *testing.T) {
}
}

func TestClockOffsetMismatch(t *testing.T) {
defer leaktest.AfterTest(t)()

var hasExited syncutil.AtomicBool
log.SetExitFunc(false, func(_ exit.Code) {
if hasExited.Get() {
// The test below asserts that a clock offset error
// triggers log.Fatal; however we also want the test
// to assert that log.Fatal is not called more than
// once, hence this check here.
t.Errorf("multiple log.Fatal calls encountered")
}
hasExited.Set(true)
})
defer log.ResetExitFunc()

ctx := context.Background()

clock := &timeutil.DefaultTimeSource{}
maxOffset := 250 * time.Millisecond
st := cluster.MakeTestingClusterSettings()
hs := &HeartbeatService{
clock: clock,
maxOffset: maxOffset,
remoteClockMonitor: newRemoteClockMonitor(clock, maxOffset, time.Hour, 0),
clusterID: &base.ClusterIDContainer{},
settings: st,
}
hs.clusterID.Set(ctx, uuid.Nil)

request := &PingRequest{
Ping: "testManual",
OriginAddr: "test",
OriginMaxOffsetNanos: (500 * time.Millisecond).Nanoseconds(),
ServerVersion: st.Version.BinaryVersion(),
}

if hasExited.Get() {
t.Fatalf("fatal call arrived too early")
}

response, err := hs.Ping(context.Background(), request)

if !hasExited.Get() {
t.Fatalf("should not have reached but got response=%v err=%v", response, err)
}
}

func TestClusterIDCompare(t *testing.T) {
defer leaktest.AfterTest(t)()
uuid1, uuid2 := uuid.MakeV4(), uuid.MakeV4()
Expand All @@ -227,7 +174,6 @@ func TestClusterIDCompare(t *testing.T) {
st := cluster.MakeTestingClusterSettings()
heartbeat := &HeartbeatService{
clock: clock,
maxOffset: maxOffset,
remoteClockMonitor: newRemoteClockMonitor(clock, maxOffset, time.Hour, 0),
clusterID: &base.ClusterIDContainer{},
settings: st,
Expand Down Expand Up @@ -272,7 +218,6 @@ func TestNodeIDCompare(t *testing.T) {
st := cluster.MakeTestingClusterSettings()
heartbeat := &HeartbeatService{
clock: clock,
maxOffset: maxOffset,
remoteClockMonitor: newRemoteClockMonitor(clock, maxOffset, time.Hour, 0),
clusterID: &base.ClusterIDContainer{},
nodeID: &base.NodeIDContainer{},
Expand Down Expand Up @@ -311,7 +256,6 @@ func TestTenantVersionCheck(t *testing.T) {
true /* initialize */)
heartbeat := &HeartbeatService{
clock: clock,
maxOffset: maxOffset,
remoteClockMonitor: newRemoteClockMonitor(clock, maxOffset, time.Hour, 0),
clusterID: &base.ClusterIDContainer{},
settings: st,
Expand Down

0 comments on commit 70a3863

Please sign in to comment.