Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

migration: introduce primitives for below-raft migrations #58088

Merged
merged 1 commit into from
Dec 30, 2020

Conversation

irfansharif
Copy link
Contributor

These include:

  • The KV ranged Migrate command. This command forces all ranges
    overlapping with the request spans to execute the (below-raft)
    migrations corresponding to the specific, stated version. This has the
    effect of moving the range out of any legacy modes operation they may
    currently be in. KV waits for this command to durably apply on all the
    replicas before returning, guaranteeing to the caller that all
    pre-migration state has been completely purged from the system.

  • The FlushAllEngines RPC. This is be used to instruct the target node
    to persist all in-memory state to disk. Like we mentioned above, KV
    currently waits for the Migrate command to have applied on all
    replicas before returning. With the applied state, there's no
    necessity to durably persist it (the representative version is already
    stored in the raft log). Out of an abundance of caution, and to really
    really ensure that no pre-migrated state is ever seen in the system,
    we provide the migration manager a mechanism to flush out all
    in-memory state to disk. This will let us guarantee that by the time a
    specific cluster version is bumped, all pre-migrated state from prior
    to a certain version will have been fully purged from the system.

  • The PurgeOutdatedReplicas RPC. This too comes up in the context of
    wanting the ensure that ranges where we've executed a ranged Migrate
    command over have no way of ever surfacing pre-migrated state. This can
    happen with older replicas in the replica GC queue and with applied
    state that is not yet persisted. Currently we wait for the Migrate
    to have applied on all replicas of a range before returning to the
    caller. This does not include earlier incarnations of the range,
    possibly sitting idle in the replica GC queue. These replicas can
    still request leases, and go through the request evaluation paths,
    possibly tripping up assertions that check to see no pre-migrated
    state is found. The PurgeOutdatedReplicas lets the migration manager
    do exactly as the name suggests, ensuring all "outdated" replicas are
    processed before declaring the specific cluster version bump complete.

  • The concept of a "replica state version". This is what's used to
    construct the migration manager's view of what's "outdated", telling
    us which migrations can be assumed to have run against a particular
    replica. When we introduce backwards incompatible changes to the
    replica state (for example using the unreplicated truncated state
    instead of the replicated variant), the version would inform us if,
    for a given replica, we should expect a state representation prior to,
    or after the migration (in our example this corresponds to whether or
    not we can assume an unreplicated truncated state).

  • A gating mechanism for "outdated" snapshots. Below-raft migrations
    want to rely on the invariant that there are no extant replicas in the
    system that haven't seen the specific Migrate command. This is made
    difficult by the fact that it's possible for in-flight snapshots to
    instantiate replicas with pre-migrated state. To guard against this,
    we gate any incoming snapshots that were generated from a replica that
    (at the time) had a version less than our store's current active
    version.
    As part of this change, we re-order the steps taken by the migration
    manager so that it executes a given migration first before bumping
    version gates cluster wide. This is in consideration of the fact that
    the migration process can be arbitrarily long running, and it'd be
    undesirable to pause rebalancing/repair mechanisms from functioning
    during this time (which we're doing by disallowing snapshots from
    older replica versions).


This PR motivates all of the above by also onboarding the
TruncatedAndRangeAppliedState migration, lets us do the following:

i. Use the RangeAppliedState on all ranges
ii. Use the unreplicated TruncatedState on all ranges

In 21.2 we'll finally be able to delete holdover code that knows how to
handle the legacy replicated truncated state.

Release note(general change): Cluster version upgrades, as initiated by
SET CLUSTER SETTING version = -, now perform internal
maintenance duties that will delay how long it takes for the command to
complete. The delay is proportional to the amount of data currently
stored in the cluster. The cluster will also experience a small amount
of additional load during this period while the upgrade is being
finalized.


The ideas here follow from our original prototype in #57445. This PR
needs to be broken down into a few more piecemeal ones:

  • replica state versioning and gating old snapshots (separate PR)
  • pkg/migration primitives
    • Migrate (stand-alone commit)
    • FlushAllEngines/PurgeOutdatedReplicas (stand-alone commit)
    • Manager.Migrate changes (stand-alone commit)
  • the truncated state migration (separate PR)

@irfansharif irfansharif requested a review from a team as a code owner December 19, 2020 04:43
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@irfansharif irfansharif removed the request for review from a team December 19, 2020 15:22
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good! I didn't see anything substantial, so I think this is just a question of some more ardent polishing.

Reviewed 57 of 57 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif)


pkg/kv/db.go, line 685 at r1 (raw file):

// Migrate proactively forces ranges overlapping with the provided keyspace to
// transition out of any legacy modes of operation (as defined by the target

the phrasing of comments around Migrate could use a clarification of its semantics. This is given a target version, so you'd think that it would somehow run from whatever version it's at to the target version, but it doesn't do that (because it's a primitive used in ececuting a single version step). So it's really more a way to attach some below-raft code to a particular version.


pkg/kv/kvserver/client_migration_test.go, line 79 at r1 (raw file):

	desc := tc.RemoveVotersOrFatal(t, k, tc.Target(n2))

	unregister := batcheval.TestingRegisterMigrationInterceptor(migrationVersion, func() {})

add a comment why this is needed.


pkg/kv/kvserver/client_migration_test.go, line 101 at r1 (raw file):

	// Check to see that the replica was purged from n2.
	require.Nil(t, store.LookupReplica(roachpb.RKey(k)))

nice.


pkg/kv/kvserver/client_migration_test.go, line 115 at r1 (raw file):

	blockSnapshotsCh := make(chan struct{})
	knobs, ltk := makeReplicationTestKnobs()
	ltk.storeKnobs.ReceiveSnapshot = func(h *kvserver.SnapshotRequest_Header) error {

make this closure idempotent, so that a double-close here won't happen (as it would crash the whole pkg)
There's no way to make the snapshot we're waiting for uniquely identifiable, right? We could at least check that it's for the right range.


pkg/kv/kvserver/client_migration_test.go, line 143 at r1 (raw file):

	// added.
	<-blockUntilSnapshotCh
	desc := tc.LookupRangeOrFatal(t, k)

this uses a consistent kv lookup, right? Otherwise it could be racy.


pkg/kv/kvserver/client_migration_test.go, line 152 at r1 (raw file):

	// Migrate the scratch range. We expect this to hang given the in-flight
	// snapshot is held up.
	err := tc.Server(n1).DB().Migrate(ctx, desc.StartKey, desc.EndKey, migrationVersion)

How long does it hang?


pkg/kv/kvserver/client_migration_test.go, line 153 at r1 (raw file):

	// snapshot is held up.
	err := tc.Server(n1).DB().Migrate(ctx, desc.StartKey, desc.EndKey, migrationVersion)
	require.Error(t, err)

Check the type of the error or, if we can't get this to work, string match. (But I think the type stuff should work now thanks to EncodedError).


pkg/kv/kvserver/client_migration_test.go, line 236 at r1 (raw file):

	// command application on n3.
	err := tc.Server(n1).DB().Migrate(ctx, desc.StartKey, desc.EndKey, endV)
	require.Error(t, err)

ditto


pkg/kv/kvserver/client_replica_test.go, line 3610 at r1 (raw file):

	key := tc.ScratchRange(t)
	require.NoError(t, tc.WaitForSplitAndInitialization(key))

Huh, never saw this one before. We don't want to fold this into ScratchRange?


pkg/kv/kvserver/replica_application_result.go, line 297 at r1 (raw file):

func (r *Replica) handleVersionResult(ctx context.Context, version *roachpb.Version) {
	if (*version == roachpb.Version{}) {

not nil?


pkg/kv/kvserver/replica_proposal.go, line 776 at r1 (raw file):

		r.mu.RUnlock()
		if !usingAppliedStateKey {
			// The range applied state was introduced in v21.1. The cluster

It was not introduced in v21.1, but sometime earlier.


pkg/kv/kvserver/store.go, line 2827 at r1 (raw file):

		alloc, err := qp.Acquire(ctx, 1)
		if err != nil {
			retErr = err

use g.GoCtx(func(ctx context.Context) error { return err }) and then you don't have to bother with retErr?


pkg/kv/kvserver/store.go, line 2834 at r1 (raw file):

			defer alloc.Release()

			processed, err := s.replicaGCQueue.process(ctx, repl, cfg)

process doesn't even use the system config (it's forced into it only because of queueImpl. I think we should make use of that and pass nil here to avoid a gossip depedency.


pkg/kv/kvserver/store.go, line 2836 at r1 (raw file):

			processed, err := s.replicaGCQueue.process(ctx, repl, cfg)
			if err != nil {
				return err

errors.Wrapf(err, "on %s", repl.Desc())


pkg/kv/kvserver/store_create_replica.go, line 188 at r1 (raw file):

	if s.ClusterSettings().Version.IsActive(ctx, clusterversion.ReplicaVersions) {
		version := s.ClusterSettings().Version.ActiveVersion(ctx).Version
		repl.mu.state.Version = &version

This strikes me as a problematic. Anything in repl.mu.state must match on-disk state, and here there's no on-disk state. I think the version should be unspecified until a snapshot arrives. This is going to mildly awkward, but we can probably apply this idea here in spirit whenever we look at the replica's versions, which will then be the right place to do so. If we need to do this multiple times, should be easy enough to come up with a helper.


pkg/kv/kvserver/store_snapshot.go, line 627 at r1 (raw file):

	}

	// We gate any incoming snapshots that were generated from a replica, that

s/gate/reject/


pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 1007 at r1 (raw file):

		var replicaVersion roachpb.Version
		if rec.ClusterSettings().Version.IsActive(ctx, clusterversion.ReplicaVersions) {
			replicaVersion = rec.ClusterSettings().Version.ActiveVersion(ctx).Version

I don't think this is right. The version needs to be that of the LHS (and the split needs to declare a read latch on the version, and the migrate cmd a write). As is, we could conceivably split a v10 LHS that has passed the migrate check and create a v9 RHS, and then bump the active version to v10 even though we didn't upgrade the v9 rhs.


pkg/kv/kvserver/batcheval/cmd_migrate.go, line 41 at r1 (raw file):

	// cumbersome. We could spruce up the migration type and allow authors to
	// define the allow authors for specific set of keys each migration needs to
	// grab latches and locks over.

needs rw latch for range version key.

Also, read latch on the range descriptor. You're accessing the rangeID which technically comes from the descriptor.


pkg/kv/kvserver/batcheval/cmd_migrate.go, line 61 at r1 (raw file):

}

// Migrate executes the below-raft migration corresponding to the given version.

put a reference to the exhaustive comment (not sure where the best place to put it is, I like MigrateRequest)


pkg/kv/kvserver/kvserverpb/state.proto, line 89 at r1 (raw file):

  // initialized (either during bootstrap, or through snapshots, or splits),
  // it's annotated with a replica version. During bootstrap, replicas source
  // the active cluster version at the time. For replicas created through

Will need an update as I commented on that code elsewhere.


pkg/kv/kvserver/stateloader/initial.go, line 105 at r1 (raw file):

	replicaVersion roachpb.Version,
) error {
	initialTruncStateType := TruncatedStateUnreplicated

nit: const


pkg/kv/kvserver/stateloader/initial.go, line 111 at r1 (raw file):

// WriteInitialRangeStateWithTruncatedState is the same as
// WriteInitialRangeState, but allows the caller to override the truncated
// state.

type


pkg/migration/manager.go, line 126 at r1 (raw file):

		// upgrades) are running binaries that know about the version.

		// Bumping the version gates only after running the migration itself is

This makes it seem as though we migrate-before-bump only to accommodate below-raft migrations, but it's because of the general invariant that we want that if version X is active, it is fully active. If we roll out the migration after activating the version, then what good is the version? You would have to wait for X+1 instead, so you arrive at the same place but with an annoying one-off. You're really just explaining the details of below-raft migrations here which is useful, but seems like the wrong place. At a high level, it seems that on line 105 above, you could instead write

// First, run the actual migration. The cluster version bump will be rolled out below. This
// means that when a migration for version X first starts running and for some time after
// its completions, parts of the cluster might correspond to version X-1, and others to version X.
// However, once version X has fully been rolled out, there is certainty that all data has been
// migrated, and in particular any legacy behavior is no longer exercised.

and put the detail on below-raft migrations in a more appropriate place (the helpers related to below-raft migrations)?


pkg/migration/manager.go, line 128 at r1 (raw file):

		// Bumping the version gates only after running the migration itself is
		// important for below-raft migrations. Below-raft migrations mutate
		// replica state, making use of the Migrate(version=V) primitive when

which they issue against the entire keyspace.


pkg/migration/migrations.go, line 81 at r1 (raw file):

// TODO(irfansharif): Introduce a `system.migrations` table, and populate it here.
func (m *Migration) Run(ctx context.Context, h *Helper) (err error) {
	ctx = logtags.AddTag(ctx, fmt.Sprintf("migration=%s", h.ClusterVersion()), nil)

🤔 why not AddTag(ctx, "migration", h.ClusterVersion())?


pkg/migration/migrations.go, line 90 at r1 (raw file):

}

// defaultPageSize controls how many ranges are paged in by default when

range descriptors


pkg/migration/migrations.go, line 121 at r1 (raw file):

		// TODO(irfansharif): Instead of logging this to the debug log, we
		// should insert these into a `system.migrations` table for external

I'm not even so sure about that any more! I would be curious how @knz things this should ultimately work. A special migrations logging channel?


pkg/migration/migrations.go, line 135 at r1 (raw file):

	// Make sure that all stores have synced. Given we're a below-raft
	// migrations, this ensures that the applied state is flushed to disk.

Hrm, unfortunately, the following could happen:

  1. replica X applies the change, we return from the migration above
  2. node w/ replica X restarts, unapplying the change (it wasn't synced)
  3. the flush request below syncs, but the cmd isn't applied yet

I think we just have to run this whole method in the context of an everynode.


pkg/migration/migrations.go, line 155 at r1 (raw file):

	op := fmt.Sprintf("purge-outdated-replicas=%s", req.Version)
	if err := h.EveryNode(ctx, op, func(ctx context.Context, client serverpb.MigrationClient) error {
		_, err := client.PurgeOutdatedReplicas(ctx, req)

I guess this one needs to flush too, and does need to do so in the same everynode

Makes me wonder if Everynode is still the right name, since we appear to need its stabilizing property all around. h.UntilClusterStable?


pkg/migration/migrations_test.go, line 74 at r1 (raw file):

						s.VisitReplicas(func(repl *kvserver.Replica) bool {
							err = f(repl)
							return err == nil

// wantMore


pkg/migration/migrations_test.go, line 241 at r1 (raw file):

	})

	// Register the below raft migration.

Is it still ok to register these now? I thought it would have to be done before starting up the system.


pkg/migration/migrations_test.go, line 247 at r1 (raw file):

	// Register the top-level migration.
	migrated := false
	unregister := migration.TestingRegisterMigrationInterceptor(endCV, func(ctx context.Context, h *migration.Helper) error {

ditto


pkg/migration/migrations_test.go, line 264 at r1 (raw file):

	}

	if err := g.Wait(); !testutils.IsError(err, fmt.Sprintf("snapshot generated at %s, only accepting versions >= %s", startCV, endCV)) {

can you make this error typed and assert on the type (errors.Is)? I believe it should round-trip now.


pkg/migration/migrations_test.go, line 331 at r1 (raw file):

	require.NoError(t, err)

	if got := repl.Version(); got != endCV.Version {

This is flaky in theory if repl happens to be a follower.


pkg/roachpb/api.go, line 1329 at r1 (raw file):

	return isWrite | isRange | isAlone | isUnsplittable | canBackpressure
}
func (*MigrateRequest) flags() int { return isWrite | isRange | isAlone }

yep, there's the isAlone.


pkg/server/migration.go, line 173 at r1 (raw file):

	ctx = logtags.AddTag(ctx, "purge-outdated-replicas", nil)

	if err := m.server.node.stores.VisitStores(func(s *kvserver.Store) error {

Awkward complication: stores can be added asynchronously, so we have to find a way to determine that the async bootstrap has happened. I hate that this is necessary but it is.


pkg/server/migration_test.go, line 231 at r1 (raw file):

	}

	s.Stopper().Stop(context.Background())

why not defer a few lines up? You're leaking the stopper on fatals now.


pkg/server/serverpb/migration.proto, line 74 at r1 (raw file):

   // FlushAllEngines is used to instruct the target node to flush all its
   // engines.
   rpc FlushAllEngines (FlushAllEnginesRequest) returns (FlushAllEnginesResponse) { }

do we actually want to flush? I think we just want to sync?

Copy link
Contributor Author

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the super detailed review, that was really good ha. I think I've addressed all of it. The one flaking test was from our snapshot gating mechanism, but turns out that's not actually needed (explained in comments).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)


pkg/kv/db.go, line 685 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

the phrasing of comments around Migrate could use a clarification of its semantics. This is given a target version, so you'd think that it would somehow run from whatever version it's at to the target version, but it doesn't do that (because it's a primitive used in ececuting a single version step). So it's really more a way to attach some below-raft code to a particular version.

Done.


pkg/kv/kvserver/client_migration_test.go, line 79 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

add a comment why this is needed.

Done.


pkg/kv/kvserver/client_migration_test.go, line 115 at r1 (raw file):

make this closure idempotent

Done. I don't think we need to check that it's for the right range, right? We're using the base.ReplicationManual mode, so there are no snapshots other than the one we're generating. Also I'm not sure how to cleanly check for it being the right range when we'll only know the descriptor below, when creating the scratch range.


pkg/kv/kvserver/client_migration_test.go, line 143 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

this uses a consistent kv lookup, right? Otherwise it could be racy.

Yup.


pkg/kv/kvserver/client_migration_test.go, line 152 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

How long does it hang?

Too long, changed it out to 50ms (cargo-culting from what you did here).


pkg/kv/kvserver/client_migration_test.go, line 153 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Check the type of the error or, if we can't get this to work, string match. (But I think the type stuff should work now thanks to EncodedError).

Done (the type didn't match context.DeadlineExceeded 🤷‍♀️)


pkg/kv/kvserver/client_migration_test.go, line 236 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

ditto

Done.


pkg/kv/kvserver/client_replica_test.go, line 3610 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Huh, never saw this one before. We don't want to fold this into ScratchRange?

Don't actually need it here (was just cargo culting as usual). I'm not sure why it isn't already folded in. Maybe for tests that disable the split queues?


pkg/kv/kvserver/replica_application_result.go, line 297 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

not nil?

Only called when non-nil, structurally identical to the other callsites within handleNonTrivialReplicatedEvalResult. This is actually dead-code, we'll only ever have non-nil versions passed in as part of the replica state when there's a below-raft migration happening (so with a non-zero version attached). I had it written this way to convince myself that we're never setting in-memory version states unless the cluster version gating the use of replica.Version was active (and so it was being funneled in from elsewhere where we presumably checked for that condition). Do you prefer to skip it entirely? Add a comment? Assertion?


pkg/kv/kvserver/replica_proposal.go, line 776 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

It was not introduced in v21.1, but sometime earlier.

Done.


pkg/kv/kvserver/store.go, line 2827 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

use g.GoCtx(func(ctx context.Context) error { return err }) and then you don't have to bother with retErr?

Done.


pkg/kv/kvserver/store.go, line 2834 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

process doesn't even use the system config (it's forced into it only because of queueImpl. I think we should make use of that and pass nil here to avoid a gossip depedency.

Done.


pkg/kv/kvserver/store.go, line 2836 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

errors.Wrapf(err, "on %s", repl.Desc())

Done.


pkg/kv/kvserver/store_create_replica.go, line 188 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This strikes me as a problematic. Anything in repl.mu.state must match on-disk state, and here there's no on-disk state. I think the version should be unspecified until a snapshot arrives. This is going to mildly awkward, but we can probably apply this idea here in spirit whenever we look at the replica's versions, which will then be the right place to do so. If we need to do this multiple times, should be easy enough to come up with a helper.

Done, and makes sense. I only had this here to be able to gate incoming snapshots instantiating outdated replicas (so we could compare the snapshot's version with the store's version, but we could've been checking for the store's active cluster version there anyway, so I'm not sure anymore why I had it. Removed.


pkg/kv/kvserver/store_snapshot.go, line 627 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

s/gate/reject/

No longer applicable.


pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 1007 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I don't think this is right. The version needs to be that of the LHS (and the split needs to declare a read latch on the version, and the migrate cmd a write). As is, we could conceivably split a v10 LHS that has passed the migrate check and create a v9 RHS, and then bump the active version to v10 even though we didn't upgrade the v9 rhs.

Done, thanks for catching this. The split already declares a read span over all replicated range-ID keys (of which version key is one.


pkg/kv/kvserver/batcheval/cmd_migrate.go, line 41 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

needs rw latch for range version key.

Also, read latch on the range descriptor. You're accessing the rangeID which technically comes from the descriptor.

Done, but I'm not sure I follow why declaring a read latch over the range descriptor is necessary. What's the possible hazard if we don't declare it?


pkg/kv/kvserver/batcheval/cmd_migrate.go, line 61 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

put a reference to the exhaustive comment (not sure where the best place to put it is, I like MigrateRequest)

Done.


pkg/kv/kvserver/kvserverpb/state.proto, line 89 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Will need an update as I commented on that code elsewhere.

Done.


pkg/kv/kvserver/stateloader/initial.go, line 105 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

nit: const

Done.


pkg/kv/kvserver/stateloader/initial.go, line 111 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

type

Done.


pkg/migration/manager.go, line 126 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This makes it seem as though we migrate-before-bump only to accommodate below-raft migrations, but it's because of the general invariant that we want that if version X is active, it is fully active. If we roll out the migration after activating the version, then what good is the version? You would have to wait for X+1 instead, so you arrive at the same place but with an annoying one-off. You're really just explaining the details of below-raft migrations here which is useful, but seems like the wrong place. At a high level, it seems that on line 105 above, you could instead write

// First, run the actual migration. The cluster version bump will be rolled out below. This
// means that when a migration for version X first starts running and for some time after
// its completions, parts of the cluster might correspond to version X-1, and others to version X.
// However, once version X has fully been rolled out, there is certainty that all data has been
// migrated, and in particular any legacy behavior is no longer exercised.

and put the detail on below-raft migrations in a more appropriate place (the helpers related to below-raft migrations)?

Done. As for moving the comment block, I'm not sure I'm finding a better place for it. On the one hand it's a bit out of odd here, but it's also now a single page with fully commentary on the entire machinery for long running migrations, which is a nice.


pkg/migration/manager.go, line 128 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

which they issue against the entire keyspace.

Done.


pkg/migration/migrations.go, line 81 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

🤔 why not AddTag(ctx, "migration", h.ClusterVersion())?

Doh, done.


pkg/migration/migrations.go, line 90 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

range descriptors

Done.


pkg/migration/migrations.go, line 121 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I'm not even so sure about that any more! I would be curious how @knz things this should ultimately work. A special migrations logging channel?

Referring to #58183 now instead.


pkg/migration/migrations.go, line 135 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Hrm, unfortunately, the following could happen:

  1. replica X applies the change, we return from the migration above
  2. node w/ replica X restarts, unapplying the change (it wasn't synced)
  3. the flush request below syncs, but the cmd isn't applied yet

I think we just have to run this whole method in the context of an everynode.

Done, PTAL.


pkg/migration/migrations.go, line 155 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I guess this one needs to flush too, and does need to do so in the same everynode

Makes me wonder if Everynode is still the right name, since we appear to need its stabilizing property all around. h.UntilClusterStable?

Done.


pkg/migration/migrations_test.go, line 74 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

// wantMore

Done (with the named return type).


pkg/migration/migrations_test.go, line 241 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Is it still ok to register these now? I thought it would have to be done before starting up the system.

Yup, but it's kind of a shitty pattern to use because we're mutating global state. If there are multiple tests running all registering to the same version, there'll be a confusing panic. I'm haven't done anything about it in this review pass, but I think plumbing a testing knob in is in order.


pkg/migration/migrations_test.go, line 247 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

ditto

See above.


pkg/migration/migrations_test.go, line 264 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

can you make this error typed and assert on the type (errors.Is)? I believe it should round-trip now.

No longer applicable.


pkg/migration/migrations_test.go, line 331 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This is flaky in theory if repl happens to be a follower.

Shouldn't be, the migration will bump the version on all followers. But this test was flaky for another reason: we need to ensure all nodes have heartbeat their liveness records at least once before attempting a migration, lest we consider them unavailable.


pkg/server/migration.go, line 173 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Awkward complication: stores can be added asynchronously, so we have to find a way to determine that the async bootstrap has happened. I hate that this is necessary but it is.

Done. Ha, I'm learning a lot from how paranoid your reviews are. I should be less sloppy.


pkg/server/migration_test.go, line 231 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

why not defer a few lines up? You're leaking the stopper on fatals now.

No good reason, done.


pkg/server/serverpb/migration.proto, line 74 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

do we actually want to flush? I think we just want to sync?

good point, Done.

@irfansharif irfansharif changed the title [wip] migration: introduce primitives for below-raft migrations migration: introduce primitives for below-raft migrations Dec 23, 2020
@irfansharif irfansharif force-pushed the 201211.replica-version branch 2 times, most recently from 245553a to f0c8afd Compare December 23, 2020 15:56
@tbg tbg requested a review from nvanbenschoten December 23, 2020 20:09
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 25 of 25 files at r3.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif and @nvanbenschoten)


pkg/kv/kvserver/client_migration_test.go, line 153 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Done (the type didn't match context.DeadlineExceeded 🤷‍♀️)

Ah, it's a wrapped grpc error with code codes.DeadlineExceeded, but status.Code(err) doesn't know how to wrap through to the wrapped grpc error. It's not worth improving; this is fine


pkg/kv/kvserver/replica_application_result.go, line 297 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Only called when non-nil, structurally identical to the other callsites within handleNonTrivialReplicatedEvalResult. This is actually dead-code, we'll only ever have non-nil versions passed in as part of the replica state when there's a below-raft migration happening (so with a non-zero version attached). I had it written this way to convince myself that we're never setting in-memory version states unless the cluster version gating the use of replica.Version was active (and so it was being funneled in from elsewhere where we presumably checked for that condition). Do you prefer to skip it entirely? Add a comment? Assertion?

Assertion sounds right.


pkg/kv/kvserver/replica_proposal.go, line 776 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Done.

Doesn't look done - I meant that the range applied state was not introduced in v21.1. It was introduced a lot earlier. It is first enforced on all ranges in 21.1.


pkg/kv/kvserver/replica_write.go, line 243 at r3 (raw file):

				// [1]: See PurgeOutdatedReplicas from the Migration service.
				// [2]: pkg/migration
				desc := r.Desc()

Add something to the comment about the races. It is possible that between the proposal returning and the call to r.Desc() below, the descriptor has changed. But the only thing that matters is that r.Desc() is at least as up to date as the descriptor the command applied on previously. If a replica got removed - fine, waitForApplication will fail; we will have to cope with that. If one got added - it was either already a learner when we migrated (in which case waitForApplication will get it) or, thanks to Migrate read latching on the range descriptor, the replication change serialized after the migration and so the snapshot will also include the migration (by virtue of being started after a replica change txn).

Hmm, in a cluster that is constantly changing its replica sets, I wonder if we can get into a situation in which a Migrate command never manages to complete - all it takes is a single range in each attempt to throw things off. I wonder if an error on waitForApplication should lead to a retry of just that RPC instead of propagating an error for the whole migrate invocation.


pkg/kv/kvserver/store.go, line 2824 at r3 (raw file):

				return err
			})
			return false

Can you explain the return false here? On errors below, you keep going. What does it mean if the quota pool doesn't let us acquire?


pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 974 at r3 (raw file):

		}

		replicaVersion, err := MakeStateLoader(rec).LoadVersion(ctx, rec.Engine())

This is the third (at least) stateloader in this code, are there any issues with taking the opportunity to declare it once? This also makes sure that my TODO above wondering why this is using rec.Engine() is not needed multiple times.


pkg/kv/kvserver/batcheval/cmd_migrate.go, line 41 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Done, but I'm not sure I follow why declaring a read latch over the range descriptor is necessary. What's the possible hazard if we don't declare it?

Assume we run a migration that declares no keys at all. It could run concurrently with a split. So you could be reading rangeID 15 here but the split finishes first, and propagates the old version to the RHS. Yet, you return claiming to have migrated the whole range, but the right hand side of the split has the old version still (and isn't migrated).

I have to admit that I don't 100% feel sure that I completely grok this. Usually, a span based command (i.e. one that uses EndKey) declares its span [Key, EndKey) and so automatically it's not going to interact weirdly with a split. But here we are intentionally not doing that, so we need another mechanism that makes sure that we have migrated every range that touches our key span, because that is what we promise to do. A read latch on the range descriptor makes sure that nobody changes the keyspan while we migrate.

I wouldn't mind if @nvanbenschoten came back to this comment and sanity checked me here at some point.


pkg/kv/kvserver/kvserverpb/state.proto, line 87 at r3 (raw file):

  // that all replica versions are bumped along side it. For everything else,
  // the coupling is neither necessary nor enforced. When a new replica is being
  // initialized (either during bootstrap, or through snapshots, or splits),

I'm not sure we refer to any of this as bootstrap. Let's explicitly say that replicas are either created at cluster creation time (in which case they source the binary version, i.e. the cluster version of that new cluster) and otherwise they inherit the version from either the snapshot or the LHS of the split.


pkg/kv/kvserver/kvserverpb/state.proto, line 89 at r3 (raw file):

  // initialized (either during bootstrap, or through snapshots, or splits),
  // it's annotated with a replica version. During bootstrap, replicas source
  // the binary version. For replicas created through snapshots or splits, they

(i.e. the version the cluster is bootstrapped with)


pkg/migration/helper.go, line 124 at r3 (raw file):

//   (b) We'll invoke the closure
//   (c) We'll retrieve the list of node IDs again to account for the
//       possibility of a new node being added during (b)

or a node restarting


pkg/migration/manager.go, line 146 at r3 (raw file):

		// forward.
		//
		// That still leaves rooms for replicas in the GC queue to evade

replica GC queue


pkg/migration/migrations.go, line 150 at r3 (raw file):

	// truncated state and the range applied state. We're sure to also durably
	// persist any changes made in the same closure. Doing so in separate
	// UntilClusterStable closure runs the (small) risk that a node might have

nit: s/runs/would run/


pkg/server/migration.go, line 161 at r3 (raw file):

	for _, eng := range m.server.engines {
		batch := eng.NewBatch()
		if err := batch.Commit(true); err != nil {

Can someone from storage check that this is doing the right thing? Seems like the kind of operation that might accidentally be optimized into a noop. @sumeerbhola or @jbowens maybe?

Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif, @nvanbenschoten, and @sumeerbhola)


pkg/server/migration.go, line 161 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Can someone from storage check that this is doing the right thing? Seems like the kind of operation that might accidentally be optimized into a noop. @sumeerbhola or @jbowens maybe?

Good call @tbg. An empty batch skips the WAL: https://github.com/cockroachdb/pebble/blob/108cd0260986c5972d20bd807a4d7192d912c97a/commit.go#L241-L243

You can use batch.LogData(nil) to add a log-only entry to the batch and ensure the WAL is synced.

@irfansharif irfansharif force-pushed the 201211.replica-version branch from f0c8afd to 512875f Compare December 30, 2020 02:16
Copy link
Contributor Author

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sweet, thanks for the review Tobi! Will merge on green 🤞

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten, @sumeerbhola, and @tbg)


pkg/kv/kvserver/replica_write.go, line 243 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Add something to the comment about the races. It is possible that between the proposal returning and the call to r.Desc() below, the descriptor has changed. But the only thing that matters is that r.Desc() is at least as up to date as the descriptor the command applied on previously. If a replica got removed - fine, waitForApplication will fail; we will have to cope with that. If one got added - it was either already a learner when we migrated (in which case waitForApplication will get it) or, thanks to Migrate read latching on the range descriptor, the replication change serialized after the migration and so the snapshot will also include the migration (by virtue of being started after a replica change txn).

Hmm, in a cluster that is constantly changing its replica sets, I wonder if we can get into a situation in which a Migrate command never manages to complete - all it takes is a single range in each attempt to throw things off. I wonder if an error on waitForApplication should lead to a retry of just that RPC instead of propagating an error for the whole migrate invocation.

Done, and added a TODO here. I expect a fair bit of shaking out to happen as we onboard more migrations in earnest, and I'll whittle it down then.


pkg/kv/kvserver/store.go, line 2824 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Can you explain the return false here? On errors below, you keep going. What does it mean if the quota pool doesn't let us acquire?

Errors from the quota pool, quickly skimming through it, seem to only come about when the pool is already closed (we're not doing that here) or configured with zero total capacity (also not applicable). Seeing as how it's "dead-code", short of asserting on no-errors, terminating early seems appropriate. For the errors below they'll only come about when individually processing replicas, which I figured shouldn't interfere much with the processing of other replicas.


pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 974 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This is the third (at least) stateloader in this code, are there any issues with taking the opportunity to declare it once? This also makes sure that my TODO above wondering why this is using rec.Engine() is not needed multiple times.

Done.


pkg/kv/kvserver/kvserverpb/state.proto, line 87 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I'm not sure we refer to any of this as bootstrap. Let's explicitly say that replicas are either created at cluster creation time (in which case they source the binary version, i.e. the cluster version of that new cluster) and otherwise they inherit the version from either the snapshot or the LHS of the split.

Done.


pkg/kv/kvserver/kvserverpb/state.proto, line 89 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

(i.e. the version the cluster is bootstrapped with)

Done.


pkg/migration/helper.go, line 124 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

or a node restarting

Done.


pkg/migration/manager.go, line 146 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

replica GC queue

Done.


pkg/migration/migrations.go, line 150 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

nit: s/runs/would run/

Done.


pkg/server/migration.go, line 161 at r3 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Good call @tbg. An empty batch skips the WAL: https://github.com/cockroachdb/pebble/blob/108cd0260986c5972d20bd807a4d7192d912c97a/commit.go#L241-L243

You can use batch.LogData(nil) to add a log-only entry to the batch and ensure the WAL is synced.

Done, thanks for verifying Jackson.

@irfansharif irfansharif force-pushed the 201211.replica-version branch from 512875f to 1d1d5a8 Compare December 30, 2020 04:50
and onboard the first long running migration (for the truncated state,
see below). The primitives here include:

- The KV ranged `Migrate` command. This command forces all ranges
  overlapping with the request spans to execute the (below-raft)
  migrations corresponding to the specific, stated version. This has the
  effect of moving the range out of any legacy modes operation they may
  currently be in. KV waits for this command to durably apply on all the
  replicas before returning, guaranteeing to the caller that all
  pre-migration state has been completely purged from the system.

- The `SyncAllEngines` RPC. This is be used to instruct the target node
  to persist releveant in-memory state to disk. Like we mentioned above,
  KV currently waits for the `Migrate` command to have applied on all
  replicas before returning. With the applied state, there's no
  necessity to durably persist it (the representative version is already
  stored in the raft log). Out of an abundance of caution, and to really
  really ensure that no pre-migrated state is ever seen in the system,
  we provide the migration manager a mechanism to flush out all
  in-memory state to disk. This will let us guarantee that by the time a
  specific cluster version is bumped, all pre-migrated state from prior
  to a certain version will have been fully purged from the system.
  We'll also use it in conjunction with PurgeOutdatedReplicas below.

- The `PurgeOutdatedReplicas` RPC. This too comes up in the context of
  wanting the ensure that ranges where we've executed a ranged `Migrate`
  command over have no way of ever surfacing pre-migrated state. This can
  happen with older replicas in the replica GC queue and with applied
  state that is not yet persisted. Currently we wait for the `Migrate`
  to have applied on all replicas of a range before returning to the
  caller. This does not include earlier incarnations of the range,
  possibly sitting idle in the replica GC queue. These replicas can
  still request leases, and go through the request evaluation paths,
  possibly tripping up assertions that check to see no pre-migrated
  state is found. The `PurgeOutdatedReplicas` lets the migration manager
  do exactly as the name suggests, ensuring all "outdated" replicas are
  processed before declaring the specific cluster version bump complete.

- The concept of a "replica state version". This is what's used to
  construct the migration manager's view of what's "outdated", telling
  us which migrations can be assumed to have run against a particular
  replica.  When we introduce backwards incompatible changes to the
  replica state (for example using the unreplicated truncated state
  instead of the replicated variant), the version would inform us if,
  for a given replica, we should expect a state representation prior to,
  or after the migration (in our example this corresponds to whether or
  not we can assume an unreplicated truncated state).

As part of this commit, we also re-order the steps taken by the
migration manager so that it executes a given migration first before
bumping version gates cluster wide. This is because we want authors of
migrations to ascertain that their own migrations have run to
completion, instead of attaching that check to the next version.

---

This PR motivates all of the above by also onboarding the
TruncatedAndRangeAppliedState migration, lets us do the following:

  i. Use the RangeAppliedState on all ranges
  ii. Use the unreplicated TruncatedState on all ranges

In 21.2 we'll finally be able to delete holdover code that knows how to
handle the legacy replicated truncated state.

Release note (general change): Cluster version upgrades, as initiated by
SET CLUSTER SETTING version = <major>-<minor>, now perform internal
maintenance duties that will delay how long it takes for the command to
complete. The delay is proportional to the amount of data currently
stored in the cluster. The cluster will also experience a small amount
of additional load during this period while the upgrade is being
finalized.

---

The ideas here follow from our original prototype in cockroachdb#57445.
@irfansharif irfansharif force-pushed the 201211.replica-version branch from 1d1d5a8 to 89781b1 Compare December 30, 2020 05:37
@irfansharif
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Dec 30, 2020

Build succeeded:

@craig craig bot merged commit 75ad154 into cockroachdb:master Dec 30, 2020
@irfansharif irfansharif deleted the 201211.replica-version branch January 8, 2021 21:41
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Mar 31, 2021
Fixes cockroachdb#58378.
Fixes cockroachdb#62267.

Previously it was possible for us to have replicas in-memory, with
pre-migrated state, even after a migration was finalized. This led to
the kind of badness we were observing in cockroachdb#62267, where it appeared that
a replica was not using the applied state key despite us having migrated
into it (see TruncatedAndRangeAppliedState, introduced in cockroachdb#58088).

---

To see how, consider the following set of events:

- Say r42 starts off on n1, n2, and n3
- n3 flaps and so we place a replica for r42 on n4
- n3's replica, r42/3, is now GC-able, but still un-GC-ed
- We run the applied state migration, first migrating all ranges into it
  and then purging outdated replicas
- Well, we should want to purge r42/3, cause it's un-migrated and
  evaluating anything on it (say a lease request) is unsound because
  we've bumped version gates that tell the kvserver to always expect
  post-migration state
- What happens when we try to purge r42/3? Previous to this PR if it
  didn't have a replica version, we'd skip over it (!)
- Was it possible for r42/3 to not have a replica version? Shouldn't it
  have been accounted for when we migrated all ranges? No, that's precisely
  why the migration infrastructure purge outdated replicas. The migrate
  request only returns once its applied on all followers; in our example
  that wouldn't include r42/3 since it was no longer one
- The stop-gap in cockroachdb#60429 made it so that we didn't GC r42/3, when we
  should've been doing the opposite. When iterating over a store's
  replicas for purging purposes, an empty replica version is fine and
  expected; we should interpret that as signal that we're dealing with a
  replica that was obviously never migrated (to even start using replica
  versions in the first place). Because it didn't have a valid replica
  version installed, we can infer that it's soon to be GC-ed (else we
  wouldn't have been able to finalize the applied state + replica
  version migration)
- The conditions above made it possible for us to evaluate requests on
  replicas with migration state out-of-date relative to the store's
  version
- Boom

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Mar 31, 2021
Fixes cockroachdb#58378.
Fixes cockroachdb#62267.

Previously it was possible for us to have replicas in-memory, with
pre-migrated state, even after a migration was finalized. This led to
the kind of badness we were observing in cockroachdb#62267, where it appeared that
a replica was not using the applied state key despite us having migrated
into it (see TruncatedAndRangeAppliedState, introduced in cockroachdb#58088).

---

To see how, consider the following set of events:

- Say r42 starts off on n1, n2, and n3
- n3 flaps and so we place a replica for r42 on n4
- n3's replica, r42/3, is now GC-able, but still un-GC-ed
- We run the applied state migration, first migrating all ranges into it
  and then purging outdated replicas
- Well, we should want to purge r42/3, cause it's un-migrated and
  evaluating anything on it (say a lease request) is unsound because
  we've bumped version gates that tell the kvserver to always expect
  post-migration state
- What happens when we try to purge r42/3? Previous to this PR if it
  didn't have a replica version, we'd skip over it (!)
- Was it possible for r42/3 to not have a replica version? Shouldn't it
  have been accounted for when we migrated all ranges? No, that's precisely
  why the migration infrastructure purge outdated replicas. The migrate
  request only returns once its applied on all followers; in our example
  that wouldn't include r42/3 since it was no longer one
- The stop-gap in cockroachdb#60429 made it so that we didn't GC r42/3, when we
  should've been doing the opposite. When iterating over a store's
  replicas for purging purposes, an empty replica version is fine and
  expected; we should interpret that as signal that we're dealing with a
  replica that was obviously never migrated (to even start using replica
  versions in the first place). Because it didn't have a valid replica
  version installed, we can infer that it's soon to be GC-ed (else we
  wouldn't have been able to finalize the applied state + replica
  version migration)
- The conditions above made it possible for us to evaluate requests on
  replicas with migration state out-of-date relative to the store's
  version
- Boom

Release note: None
craig bot pushed a commit that referenced this pull request Mar 31, 2021
60835: kv: bump timestamp cache to Pushee.MinTimestamp on PUSH_ABORT r=nvanbenschoten a=nvanbenschoten

Fixes #60779.
Fixes #60580.

We were only checking that the batch header timestamp was equal to or
greater than this pushee's min timestamp, so this is as far as we can
bump the timestamp cache.

62832: geo: minor performance improvement for looping over edges r=otan a=andyyang890

This patch slightly improves the performance of many
spatial builtins by storing the number of edges used
in the loop conditions of for loops into a variable.
We discovered this was taking a lot of time when
profiling the point-in-polygon optimization.

Release note: None

62838: kvserver: purge gc-able, unmigrated replicas during migrations r=irfansharif a=irfansharif

Fixes #58378.
Fixes #62267.

Previously it was possible for us to have replicas in-memory, with
pre-migrated state, even after a migration was finalized. This led to
the kind of badness we were observing in #62267, where it appeared that
a replica was not using the applied state key despite us having migrated
into it (see TruncatedAndRangeAppliedState, introduced in #58088).

---

To see how, consider the following set of events:

- Say r42 starts off on n1, n2, and n3
- n3 flaps and so we place a replica for r42 on n4
- n3's replica, r42/3, is now GC-able, but still un-GC-ed
- We run the applied state migration, first migrating all ranges into it
  and then purging outdated replicas
- Well, we should want to purge r42/3, cause it's un-migrated and
  evaluating anything on it (say a lease request) is unsound because
  we've bumped version gates that tell the kvserver to always expect
  post-migration state
- What happens when we try to purge r42/3? Previous to this PR if it
  didn't have a replica version, we'd skip over it (!)
- Was it possible for r42/3 to not have a replica version? Shouldn't it
  have been accounted for when we migrated all ranges? No, that's precisely
  why the migration infrastructure purge outdated replicas. The migrate
  request only returns once its applied on all followers; in our example
  that wouldn't include r42/3 since it was no longer one
- The stop-gap in #60429 made it so that we didn't GC r42/3, when we
  should've been doing the opposite. When iterating over a store's
  replicas for purging purposes, an empty replica version is fine and
  expected; we should interpret that as signal that we're dealing with a
  replica that was obviously never migrated (to even start using replica
  versions in the first place). Because it didn't have a valid replica
  version installed, we can infer that it's soon to be GC-ed (else we
  wouldn't have been able to finalize the applied state + replica
  version migration)
- The conditions above made it possible for us to evaluate requests on
  replicas with migration state out-of-date relative to the store's
  version
- Boom

Release note: None


62839: zonepb: make subzone DiffWithZone more accurate r=ajstorm a=otan

* Subzones may be defined in a different order. We did not take this
  into account which can cause bugs when e.g. ADD REGION adds a subzone
  in the end rather than in the old "expected" location in the subzones
  array. This has been fixed by comparing subzones using an unordered
  map.
* The ApplyZoneConfig we previously did overwrote subzone fields on the
  original subzone array element, meaning that if there was a mismatch
  it would not be reported through validation. This is now fixed by
  applying the expected zone config to *zonepb.NewZoneConfig() instead.
* Added logic to only check for zone config matches subzones from
  active subzone IDs.
* Improve the error messaging when a subzone config is mismatching -
  namely, add index and partitioning information and differentiate
  between missing fields and missing / extraneous zone configs

Resolves #62790

Release note (bug fix): Fixed validation bugs during ALTER TABLE ... SET
LOCALITY / crdb_internal.validate_multi_region_zone_config where
validation errors could occur when the database of a REGIONAL BY ROW
table has a new region added. Also fix a validation bug partition zone
mismatches configs were not caught.

62872: build: use -json for RandomSyntax test r=otan a=rafiss

I'm hoping this will help out with an issue where the test failures seem
to be missing helpful logs.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Andy Yang <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Oliver Tan <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Sep 7, 2021
Fixes cockroachdb#66544. We fully migrated away from the replicated truncated state
and the old range applied state keys in cockroachdb#58088. We're now guaranteed to
never see the legacy truncated+applied state representations.

Release justification: cluster version cleanup
Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Sep 7, 2021
Fixes cockroachdb#66544. We fully migrated away from the replicated truncated state
and the old range applied state keys (RaftAppliedIndexLegacyKey,
LeaseAppliedIndexLegacyKey, RangeStatsLegacyKey) in cockroachdb#58088. We're now
guaranteed to never see the legacy truncated+applied state
representations, making this code (finally) safe to delete.

Release justification: cluster version cleanup
Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Sep 7, 2021
Fixes cockroachdb#66544. We fully migrated away from the replicated truncated state
and the old range applied state keys (RaftAppliedIndexLegacyKey,
LeaseAppliedIndexLegacyKey, and RangeStatsLegacyKey) in cockroachdb#58088. We're
now guaranteed to never see the legacy truncated+applied state
representations, making this code (finally) safe to delete.

Release justification: cluster version cleanup
Release note: None

---

This turned out to be pretty hairy and invasive. I'll wait for a few CI
runs and inspect the diffs here a bit more closely. Given the surface
area here, I'm curious if we feel this is worth getting in during
stability. Some questions to consider:

- The various legacy keys that were removed were removed in their
  entirety. Our CLI to scan over store keys will no longer recognize these
  keys. Is that ok?
- I don't think we need any checks to make sure we're not re-using some
  of these keys in the future -- but I'd like other opinions.
- Looks like we re-purposed the LocalRaftTruncatedStateLegacySuffix for
  the unreplicated truncated key as well, so just renamed it more
  appropriately.
- v21.1 nodes have to interop with v20.2 nodes where the legacy
  representation was possible, and it identified such cases through
  booleans in proto messages (SnapshotRequest.Header.UseUnreplicatedTruncatedState
  for example). This means that we still need to explicitly set them in
  v21.2 code. Crucially we'll stop reading these fields in v21.2 code,
  letting us remove these fields entirely in v22.1.
- The treatment above does not extend to ReplicaState.UsingAppliedStateKey;
  that field was only being used to trigger an idempotent migration to
  start using the newer applied state key representation, not as a way
  to express that the replica should expect the legacy representation.
  Since cockroachdb#58088 has already migrated every range in the system, we no
  longer need to trigger anything, and this field can be removed.
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Sep 7, 2021
Fixes cockroachdb#66544. We fully migrated away from the replicated truncated state
and the old range applied state keys (RaftAppliedIndexLegacyKey,
LeaseAppliedIndexLegacyKey, and RangeStatsLegacyKey) in cockroachdb#58088. We're
now guaranteed to never see the legacy truncated+applied state
representations, making this code (finally) safe to delete.

Release justification: cluster version cleanup
Release note: None

---

This turned out to be pretty hairy and invasive. I'll wait for a few CI
runs and inspect the diffs here a bit more closely. Given the surface
area here, I'm curious if we feel this is worth getting in during
stability. Some questions to consider:

- The various legacy keys that were removed were removed in their
  entirety. Our CLI to scan over store keys will no longer recognize these
  keys. Is that ok?
- I don't think we need any checks to make sure we're not re-using some
  of these keys in the future -- but I'd like other opinions.
- Looks like we re-purposed the LocalRaftTruncatedStateLegacySuffix for
  the unreplicated truncated key as well, so just renamed it more
  appropriately.
- v21.1 nodes have to interop with v20.2 nodes where the legacy
  representation was possible, and it identified such cases through
  booleans in proto messages (SnapshotRequest.Header.UseUnreplicatedTruncatedState
  for example). This means that we still need to explicitly set them in
  v21.2 code. Crucially we'll stop reading these fields in v21.2 code,
  letting us remove these fields entirely in v22.1.
- The treatment above does not extend to ReplicaState.UsingAppliedStateKey;
  that field was only being used to trigger an idempotent migration to
  start using the newer applied state key representation, not as a way
  to express that the replica should expect the legacy representation.
  Since cockroachdb#58088 has already migrated every range in the system, we no
  longer need to trigger anything, and this field can be removed.
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Sep 13, 2021
Fixes cockroachdb#66544.
Fixes cockroachdb#66834.

We fully migrated away from the replicated truncated state
and the old range applied state keys (RaftAppliedIndexLegacyKey,
LeaseAppliedIndexLegacyKey, and RangeStatsLegacyKey) in cockroachdb#58088. We're
now guaranteed to never see the legacy truncated+applied state
representations, making this code (finally) safe to delete.

Release justification: cluster version cleanup
Release note: None

---

This turned out to be pretty hairy and invasive. I'll wait for a few CI
runs and inspect the diffs here a bit more closely. Given the surface
area here, I'm curious if we feel this is worth getting in during
stability. Some questions to consider:

- The various legacy keys that were removed were removed in their
  entirety. Our CLI to scan over store keys will no longer recognize these
  keys. Is that ok?
- I don't think we need any checks to make sure we're not re-using some
  of these keys in the future -- but I'd like other opinions.
- Looks like we re-purposed the LocalRaftTruncatedStateLegacySuffix for
  the unreplicated truncated key as well, so just renamed it more
  appropriately.
- v21.1 nodes have to interop with v20.2 nodes where the legacy
  representation was possible, and it identified such cases through
  booleans in proto messages (SnapshotRequest.Header.UseUnreplicatedTruncatedState
  for example). This means that we still need to explicitly set them in
  v21.2 code. Crucially we'll stop reading these fields in v21.2 code,
  letting us remove these fields entirely in v22.1.
- The treatment above does not extend to ReplicaState.UsingAppliedStateKey;
  that field was only being used to trigger an idempotent migration to
  start using the newer applied state key representation, not as a way
  to express that the replica should expect the legacy representation.
  Since cockroachdb#58088 has already migrated every range in the system, we no
  longer need to trigger anything, and this field can be removed.
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Sep 14, 2021
Fixes cockroachdb#66544.
Fixes cockroachdb#66834.

We fully migrated away from the replicated truncated state
and the old range applied state keys (RaftAppliedIndexLegacyKey,
LeaseAppliedIndexLegacyKey, and RangeStatsLegacyKey) in cockroachdb#58088. We're
now guaranteed to never see the legacy truncated+applied state
representations, making this code (finally) safe to delete.

Release justification: cluster version cleanup
Release note: None

---

This turned out to be pretty hairy and invasive. I'll wait for a few CI
runs and inspect the diffs here a bit more closely. Given the surface
area here, I'm curious if we feel this is worth getting in during
stability. Some questions to consider:

- The various legacy keys that were removed were removed in their
  entirety. Our CLI to scan over store keys will no longer recognize these
  keys. Is that ok?
- I don't think we need any checks to make sure we're not re-using some
  of these keys in the future -- but I'd like other opinions.
- Looks like we re-purposed the LocalRaftTruncatedStateLegacySuffix for
  the unreplicated truncated key as well, so just renamed it more
  appropriately.
- v21.1 nodes have to interop with v20.2 nodes where the legacy
  representation was possible, and it identified such cases through
  booleans in proto messages (SnapshotRequest.Header.UseUnreplicatedTruncatedState
  for example). This means that we still need to explicitly set them in
  v21.2 code. Crucially we'll stop reading these fields in v21.2 code,
  letting us remove these fields entirely in v22.1.
- The treatment above does not extend to ReplicaState.UsingAppliedStateKey;
  that field was only being used to trigger an idempotent migration to
  start using the newer applied state key representation, not as a way
  to express that the replica should expect the legacy representation.
  Since cockroachdb#58088 has already migrated every range in the system, we no
  longer need to trigger anything, and this field can be removed.
craig bot pushed a commit that referenced this pull request Sep 14, 2021
67737: opt: zigzag scan hints r=cucaroach a=cucaroach

Fixes #35814

Enable table@{FORCE_ZIGZAG} and table@{FORCE_ZIGZAG=index} hints to lower the cost
of zigzag plans to that they will be preferred over other plans.

Note that this is a breaking change, if customer has an index called
FORCE_ZIGZAG and is hinting with table@{FORCE_ZIGZAG} that will no longer do what
it did before. Not sure what the ramifications of that are.

Release note (sql change): Extend index scan hint to allow zigzag joins to be preferred.


69887: kv,migration: rm code handling legacy raft truncated state r=irfansharif a=irfansharif

Fixes #66544. 
Fixes #66834.

We fully migrated away from the replicated truncated state
and the old range applied state keys (RaftAppliedIndexLegacyKey,
LeaseAppliedIndexLegacyKey, and RangeStatsLegacyKey) in #58088. We're
now guaranteed to never see the legacy truncated+applied state
representations, making this code (finally) safe to delete.

Release justification: cluster version cleanup
Release note: None

---

This turned out to be pretty hairy and invasive. I'll wait for a few CI
runs and inspect the diffs here a bit more closely. Given the surface
area here, I'm curious if we feel this is worth getting in during
stability. Some questions to consider:

- The various legacy keys that were removed were removed in their
  entirety. Our CLI to scan over store keys will no longer recognize these
  keys. Is that ok?
- I don't think we need any checks to make sure we're not re-using some
  of these keys in the future -- but I'd like other opinions.
- Looks like we re-purposed the LocalRaftTruncatedStateLegacySuffix for
  the unreplicated truncated key as well, so just renamed it more
  appropriately.
- v21.1 nodes have to interop with v20.2 nodes where the legacy
  representation was possible, and it identified such cases through
  booleans in proto messages (SnapshotRequest.Header.UseUnreplicatedTruncatedState
  for example). This means that we still need to explicitly set them in
  v21.2 code. Crucially we'll stop reading these fields in v21.2 code,
  letting us remove these fields entirely in v22.1.
- The treatment above does not extend to ReplicaState.UsingAppliedStateKey;
  that field was only being used to trigger an idempotent migration to
  start using the newer applied state key representation, not as a way
  to express that the replica should expect the legacy representation.
  Since [#58088](#58088) has already migrated every range in the system, we no
  longer need to trigger anything, and this field can be removed.

70059: roachtest: make --workload optional r=mgartner a=mgartner

This commit makes the `--workload` flag optional because some roachtests
do not require a workload binary.

Release note: None

70120: opt: match ScalarGroupBy in EliminateIndexJoinOrProjectInsideGroupBy r=mgartner a=mgartner

This commit extends EliminateIndexJoinOrProjectInsideGroupBy so that
ScalarGroupBy expressions are also matched. This allows the rule to
eliminate unnecessary index joins in more cases.

The primary motivation for this change was to make partial index
validation queries more efficient. These queries always have
ScalarGroupBy expressions because they are in the form:

    SELECT count(1) FROM table@partial_index WHERE predicate

Prior to this change, an index join was planned for these queries which
would operate on every row in the partial index. This could be extremely
expensive for large partial indexes.

Fixes #70116

Release note (performance improvement): A limitation has been fixed that
made creating partial indexes inefficient.

70162: ui: fix default behaviours of columns on stmt page on cc console r=maryliag a=maryliag

When a CC Console user open the Statement page for the first time
(no cache was created for column selector), this commits make sure
that the default columns will be displayed.

Fixes: #70160

Release justification: Category 4
Release note (bug fix): Default columns being displayed on Statements
page on CC console when the user never made any selection

Co-authored-by: Tommy Reilly <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Marylia Gutierrez <[email protected]>
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Oct 29, 2021
Fixes cockroachdb#72029.

using_applied_state_key was previously used to check whether the Range
had upgraded to begin using the RangeAppliedState key. When set to true,
replicas receiving the state (through a ReplicatedEvalResult) knew to
begin using the new key. In cockroachdb#58088 (in 21.1) we introduced a migration
to iterate through all ranges in the system and have them start using
the RangeAppliedState key. In 21.2, this field was always set to true --
21.2 nodes were already using the RangeAppliedState key, or receiving
messages from 21.1 nodes that were also using the RangeAppliedState key.
In 21.2 (and in 21.1 for that matter) we didn't need to trigger the "if
set to true in an incoming message, start using the RangeAppliedState
key" code path.

When looking to get rid of this field in 22.1 (cockroachdb#70464), we observed that
there was an unintentional read of this field in 21.2 nodes (see cockroachdb#72029 and
\cockroachdb#72222); the saga is as follows:
 - Removing this field in 22.1 meant it was set as false when received at
   21.2 nodes.
 - This should've been fine! We weren't using this field to trigger any
   upgrade code paths (like it was originally intended for).
 - It turns out that in 21.2 we were using the ReplicaState from the
   incoming snapshot to update our in-memory replica state
 - Because the proto field was being phased out, there was now a divergence
   between the on-disk state (field set to true, from earlier 21.2
   operations) and the in-memory state (field set to false, because sent
   from a version that attempted to get rid of this field).

Removing proto fields from the replica state are not possible until we stop
using the protobuf copy of the replica state when applying a snapshot
(cockroachdb#72222). Once that's done, we should be able to stop sending the replica
state as part of the snapshot in the subsequent release.

Release note: None
craig bot pushed a commit that referenced this pull request Oct 29, 2021
72132: coldata: fix BenchmarkAppend so that it could be run with multiple count r=yuzefovich a=yuzefovich

Previously, we could get an index out of bounds because we'd allocate
larger `Bytes.data` slice than `int32` offset can support.

Release note: None

72239: kvserver: resurrect using_applied_state_key in ReplicaState r=irfansharif a=irfansharif

Fixes #72029.

`using_applied_state_key` was previously used to check whether the Range
had upgraded to begin using the `RangeAppliedState` key. When set to true,
replicas receiving the state (through a `ReplicatedEvalResult`) knew to
begin using the new key. In #58088 (in 21.1) we introduced a migration
to iterate through all ranges in the system and have them start using
the `RangeAppliedState` key. In 21.2, this field was always set to true --
21.2 nodes were already using the `RangeAppliedState` key, or receiving
messages from 21.1 nodes that were also using the `RangeAppliedState` key.
In 21.2 (and in 21.1 for that matter) we didn't need to trigger the "if
set to true in an incoming message, start using the `RangeAppliedState`
key" code path.

When looking to get rid of this field in 22.1 (#70464), we observed that
there was an unintentional read of this field in 21.2 nodes (see #72029 and
\#72222); the saga is as follows:
 - Removing this field in 22.1 meant it was set as false when received at
   21.2 nodes.
 - This should've been fine! We weren't using this field to trigger any
   upgrade code paths (like it was originally intended for).
 - It turns out that in 21.2 we were using the `ReplicaState` from the
   incoming snapshot to update our in-memory replica state
 - Because the proto field was being phased out, there was now a divergence
   between the on-disk state (field set to true, from earlier 21.2
   operations) and the in-memory state (field set to false, because sent
   from a version that attempted to get rid of this field).

Removing proto fields from the replica state are not possible until we stop
using the protobuf copy of the replica state when applying a snapshot
(#72222). Once that's done, we should be able to stop sending the replica
state as part of the snapshot in the subsequent release.

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
tbg added a commit to tbg/cockroach that referenced this pull request Nov 2, 2021
We used to have to send the raft log along with snapshots as a result of
(in the olden days) requiring all replicas to agree on the truncated
state. This hasn't been (generally) true as of v19.1 (cockroachdb#35701), though it
was still a possibility until v22.1 (cockroachdb#58088). This commit removes the
corresponding code.

Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Nov 2, 2021
We used to have to send the raft log along with snapshots as a result of
(in the olden days) requiring all replicas to agree on the truncated
state. This hasn't been (generally) true as of v19.1 (cockroachdb#35701), though it
was still a possibility until v22.1 (cockroachdb#58088). This commit removes the
corresponding code.

Release note: None
craig bot pushed a commit that referenced this pull request Nov 2, 2021
72310: kvserver: remove unused snapshot log entry code r=erikgrinaker a=tbg

We used to have to send the raft log along with snapshots as a result of
(in the olden days) requiring all replicas to agree on the truncated
state. This hasn't been (generally) true as of v19.1 (#35701), though it
was still a possibility until v22.1 (#58088). This commit removes the
corresponding code.

Release note: None


Co-authored-by: Tobias Grieger <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants