[prototype] migration: onboard the first long-running migration #57445
Conversation
Heads up, this is still a [wip] branch. It's a bit far from landing. I have zero tests, and intend to add them here. I also intend to break it up into a few commits: introducing the system table, introducing the request type, introducing IterateRangeDescriptors, and only after all that, adding the migration itself. For now I've just squashed them for an early first-pass look/sanity check before I spend more time on final polish. I have a few open questions in there as comments (grep for XXX).
This looks pretty good already! I left a bunch of comments but I think it's basically time to figure out how to write proper tests and how to put this into a number of smaller PRs.
The migration table introduction looks like it could be a good last PR. The migrate command + testing could come first. Then a PR that updates the manager and adds tests and hooks up the migrate command. Just a suggestion.
Reviewed 49 of 49 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @knz)
pkg/kv/kvserver/replica_proposal.go, line 781 at r1 (raw file):
// below. If we're not running 21.1 yet, migrate over as we've done
// since the introduction of the applied state key.
if r.ClusterSettings().Version.IsActive(ctx, clusterversion.TruncatedAndRangeAppliedStateMigration) {
This is exactly the kind of thing that could cause spurious fatal errors when there are replicaGC'able replicas left around that didn't get included in the raft command. On these, you can still hit the evaluation path (they'll try to request leases). I don't know if this PR already does anything to avoid this behavior (I'll just leave this comment now and will continue reviewing then), but as we've discussed in the past there are two ingredients:
- track the range version (in the replica state) and disallow snapshots from an old version to any newer version
- force a run of replicaGC on all replicas in the cluster that don't have the updated range version (via EveryNode presumably)
That way we can
- call Migrate on the whole keyspace
- force the run
and be assured that all replicas now in the cluster are at the new version, i.e. it's safe to bump the cluster version.
pkg/kv/kvserver/replica_write.go, line 223 at r1 (raw file):
// applied on all peers, which we achieve via waitForApplication.
//
// XXX: Could a snapshot below maxLeaseIndex be in-flight to a
Snapshots that don't include the recipient (as specified by replicaID and descriptor in the snap vs the replicaID of the raft instance) are discarded by the recipient, so this should not be an issue. The particular case you mention will have the learner have a higher replicaID than the incoming snapshot, so the snapshot will also be out for that reason. Finally, a snapshot is discarded unless it moves the LAI forward.
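To make the discard rules above concrete, here is a minimal, self-contained Go sketch of the checks as described in this comment. The types and field names are hypothetical stand-ins, not the actual kvserver structures:

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the real kvserver types.
type replicaID int32

type snapshot struct {
	descReplicas []replicaID // replicas named in the snapshot's range descriptor
	toReplica    replicaID   // the replica the snapshot is addressed to
	appliedLAI   uint64      // lease applied index captured in the snapshot
}

// shouldAcceptSnapshot mirrors the rules from the comment above: the
// recipient discards a snapshot that (a) isn't addressed to its current
// replicaID, (b) doesn't include it in the descriptor, or (c) doesn't
// move the lease applied index (LAI) forward.
func shouldAcceptSnapshot(recipient replicaID, recipientLAI uint64, snap snapshot) bool {
	if snap.toReplica != recipient {
		return false // addressed to an older/newer incarnation of this replica
	}
	included := false
	for _, r := range snap.descReplicas {
		if r == recipient {
			included = true
			break
		}
	}
	if !included {
		return false // descriptor doesn't include the recipient
	}
	return snap.appliedLAI > recipientLAI // must move the LAI forward
}

func main() {
	snap := snapshot{descReplicas: []replicaID{1, 2, 3}, toReplica: 3, appliedLAI: 10}
	fmt.Println(shouldAcceptSnapshot(3, 12, snap)) // false: doesn't move the LAI forward
	fmt.Println(shouldAcceptSnapshot(3, 7, snap))  // true
}
```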
pkg/kv/kvserver/batcheval/cmd_migrate.go, line 16 at r1 (raw file):
"context" cv "github.com/cockroachdb/cockroach/pkg/clusterversion"
small nit, but I'm generally -1 on unnecessary import renames.
pkg/migration/helper.go, line 39 at r1 (raw file):
// to execute all relevant node-level operations on them. If any of the nodes
// are found to be unavailable, an error is returned.
func (h *Helper) RequiredNodeIDs(ctx context.Context) ([]roachpb.NodeID, error) {
Should this be exported, given that you need to wrap it in a stabilizing retry loop and it is thus a footgun if used from a migration work function?
pkg/migration/helper.go, line 100 at r1 (raw file):
// (accessible using the `progress` column in `system.migrations`) for the
// ongoing migration.
func (h *Helper) UpdateProgress(ctx context.Context, progress string) error {
"progress" is a bit of a misnomer, how about "status"
pkg/migration/helper.go, line 136 at r1 (raw file):
//
// By the time EveryNode returns, we'll have thus invoked the closure against
// every node in the cluster.
That's not true. A node can always join just as EveryNode returns. What we have achieved is that there is now a causal happened-before relation between executing something on every node in the cluster and the next node joining the cluster. We managed to run something on all nodes without a new node joining half-way through, which could have allowed it to pick up some state (off one of the existing nodes that hadn't finished processing yet).
For the example of bumping the cluster version, we'll have achieved proving that by the time EveryNode returns, every node in the cluster, current and future, will use the new cluster version.
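A rough sketch of the retry-until-stable idea behind EveryNode, for illustration only (the real Helper/EveryNode live in pkg/migration; the listNodes and fn callbacks here are hypothetical):

```go
package main

import (
	"fmt"
	"reflect"
)

type nodeID int32

// everyNode repeatedly runs fn against the current node list until the list
// observed before and after a full round is identical. It can't prevent a
// node from joining right after it returns; it only establishes a
// happened-before relation: anything a newly joining node learns from the
// existing nodes was produced after fn had run on them.
func everyNode(listNodes func() []nodeID, fn func(nodeID) error) error {
	for {
		before := listNodes()
		for _, n := range before {
			if err := fn(n); err != nil {
				return err
			}
		}
		after := listNodes()
		if reflect.DeepEqual(before, after) {
			return nil // membership stable across the round
		}
		// Membership changed mid-round; retry against the new list.
	}
}

func main() {
	nodes := []nodeID{1, 2, 3}
	_ = everyNode(
		func() []nodeID { return nodes },
		func(n nodeID) error { fmt.Println("visiting", n); return nil },
	)
}
```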
pkg/migration/helper.go, line 162 at r1 (raw file):
for {
	// TODO(irfansharif): We can/should send out these RPCs in parallel.
We should do that soon, too. Ideally in the burst of commits coming out of this one!
pkg/migration/migrations.go, line 110 at r1 (raw file):
func TruncatedStateMigration(ctx context.Context, h *Helper) error {
	// TODO(irfansharif): What should the default block size be?
	const blockSize = 50
At 50k ranges, with each range taking ~200ms (a rather conservative assumption but in a wide-area cluster could be true due to hops from the manager to the replicas) to process the Migrate, a block size of 1 is >2.7 hours. Block size 50 is ~3.3 minutes. Block size 200 should easily be in the cards given that each range is doing very little work here, so down to 50s in this case which seems reasonable.
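The back-of-the-envelope numbers, spelled out (assuming 50k ranges and ~200ms per ranged Migrate request, as above):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		numRanges = 50_000
		perBatch  = 200 * time.Millisecond // assume each ranged Migrate request takes ~200ms
	)
	// One Migrate request per block; blocks are processed sequentially.
	for _, blockSize := range []int{1, 50, 200} {
		numBlocks := (numRanges + blockSize - 1) / blockSize
		total := time.Duration(numBlocks) * perBatch
		fmt.Printf("block size %3d: %5d blocks, ~%s\n", blockSize, numBlocks, total)
	}
	// block size   1: 50000 blocks, ~2h46m40s
	// block size  50:  1000 blocks, ~3m20s
	// block size 200:   250 blocks, ~50s
}
```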
pkg/migration/migrations.go, line 116 at r1 (raw file):
for _, desc := range descriptors {
	if bytes.Compare(desc.StartKey, keys.LocalMax) < 0 {
		// XXX: Does it make sense to "migrate" ranges with local keys?
Yeah, this is a bit of a wart. We do want to reach the first range, but we can't address StartKey. However, LocalMax is on r1, so in this case we basically just use LocalMax instead of StartKey.
I wonder if we'd get sanity back if we officially let r1 begin at LocalMax, but that is not worth getting into here or probably ever.
pkg/migration/migrations.go, line 128 at r1 (raw file):
numMigratedRanges += len(descriptors)
progress := fmt.Sprintf("[batch %d/??] migrated %d ranges", batchIdx, numMigratedRanges)
Doesn't IterateRangeDescriptors have to handle txn retries? The closure here would need to be idempotent, but the counters aren't. Maybe IterateRangeDescriptors should take an init func() as well which can reset them.
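A minimal sketch of the init-func idea (a hypothetical signature modeled on the description above, not the actual pkg/migration code): the iterator calls init before every (re-)attempt so that closure-captured state, like the counters here, gets reset on txn retries.

```go
package main

import "fmt"

type rangeDesc struct{ rangeID int }

// iterateRangeDescriptors is a stand-in for a paginated, possibly-retried
// scan over the range descriptors. init runs at the start of every attempt
// so the caller can reset any state accumulated by fn.
func iterateRangeDescriptors(
	blockSize int, init func(), fn func(descs ...rangeDesc) error,
) error {
	for attempt := 0; attempt < 2; attempt++ { // pretend the first txn attempt gets retried
		init()
		all := []rangeDesc{{1}, {2}, {3}, {4}, {5}}
		for i := 0; i < len(all); i += blockSize {
			end := i + blockSize
			if end > len(all) {
				end = len(all)
			}
			if err := fn(all[i:end]...); err != nil {
				return err
			}
		}
	}
	return nil
}

func main() {
	var numMigrated, batchIdx int
	_ = iterateRangeDescriptors(2,
		func() { numMigrated, batchIdx = 0, 1 }, // reset on retry
		func(descs ...rangeDesc) error {
			numMigrated += len(descs)
			fmt.Printf("[batch %d/??] migrated %d ranges\n", batchIdx, numMigrated)
			batchIdx++
			return nil
		})
}
```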
pkg/migration/migrations.go, line 140 at r1 (raw file):
// XXX: I think we'll need to make sure all stores have synced once to
// persist all applications of the `Migrate` raft command.
Very good point. It sounds reasonable to do this as a separate round of EveryNode (as opposed to doing it on each raft command application, which would cause tons of syncs).
It's probably also just generally a good idea to make this happen on every migration that does work. It could even be part of bumping the cluster version, which by definition already requires syncing; we could just extend that to explicitly mandate syncing all engines.
Hmm, thinking about this further, this is not good enough when nodes restart rapidly. I think the best way to address this is to make EveryNode more aggressive about retrying: in addition to the list of nodes stabilizing, their corresponding epochs also need to remain stable (and, of course, everyone live). That way we don't impose a requirement that everything we do needs to be synced immediately, which is super expensive and we're also pretty much guaranteed to get it wrong.
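The stricter stability condition suggested here could look something like this (illustrative only; the liveness record shape is a hypothetical simplification):

```go
package main

import "fmt"

// A hypothetical, simplified view of a node's liveness record.
type liveness struct {
	nodeID int32
	epoch  int64
	live   bool
}

// stable reports whether two membership snapshots, taken before and after a
// round of RPCs, describe the same set of nodes with unchanged epochs, all
// of them live. An epoch bump implies a restart in between, so the round
// has to be retried.
func stable(before, after []liveness) bool {
	if len(before) != len(after) {
		return false
	}
	for i := range before {
		b, a := before[i], after[i]
		if b.nodeID != a.nodeID || b.epoch != a.epoch || !a.live {
			return false
		}
	}
	return true
}

func main() {
	before := []liveness{{1, 5, true}, {2, 3, true}}
	after := []liveness{{1, 5, true}, {2, 4, true}} // n2 restarted: epoch bumped
	fmt.Println(stable(before, after))              // false
}
```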
pkg/roachpb/api.go, line 1295 at r1 (raw file):
func (*AdminScatterRequest) flags() int { return isAdmin | isRange | isAlone }
func (*AdminVerifyProtectedTimestampRequest) flags() int { return isAdmin | isRange | isAlone }
func (*MigrateRequest) flags() int { return isWrite | isRange | isAlone }
I wonder if MigrateRange is a better name.
pkg/roachpb/api.proto, line 1500 at r1 (raw file):
// is already going to be the one responsible for stepping through all
// intermediate versions, it's probably saner to only carry out the specific
// migration associated with the target version.
Yeah, that's the impression I had as well.
pkg/roachpb/api.proto, line 1502 at r1 (raw file):
// migration associated with the target version.
//
// TODO(irfansharif): Should this be an admin request instead? There's no good
No, admin requests are really just a way to invoke random code on the leaseholder. Here we want to properly go through the raft pipeline, which admin requests don't.
pkg/roachpb/api.proto, line 1509 at r1 (raw file):
RequestHeader header = 1 [(gogoproto.nullable) = false, (gogoproto.embed) = true];

// The version that will become active next. A request with the
incomplete
pkg/roachpb/method.go, line 161 at r1 (raw file):
This comment could be better, something like
Migrate updates the range state to conform to a specified cluster version. It is our main mechanism for phasing out legacy code below Raft.
pkg/sql/catalog/systemschema/system.go, line 347 at r1 (raw file):
MigrationsTableSchema = `
CREATE TABLE system.migrations (
Is this going to be gossiped? (The answer should be "no", I forget what the rules are).
pkg/sql/catalog/systemschema/system.go, line 615 at r1 (raw file):
// MigrationsTable is the descriptor for the migrations table. It contains
// details for all past and ongoing migrations.
We'll want someone from SQL to review this part of the PR. This all looks sane, but I'd want someone to sign off that it is.
pkg/sql/logictest/testdata/logic_test/distsql_agg, line 578 at r1 (raw file):
SELECT corr(y, x)::decimal, covar_pop(y, x)::decimal, covar_samp(y, x)::decimal FROM statistics_agg_test
----
0.04522896319136315  3.75  3.787878787878788
Huh. What are these about?
This PR onboards the first real long-running migration using the infrastructure we've been building up within pkg/migration. It adds in the final missing pieces described in our original RFC (cockroachdb#48843). These components were originally prototyped in cockroachdb#56107.

The migration in question (which happens to be a below-Raft one, first attempted in cockroachdb#42822) now lets us do the following:

i. Use the RangeAppliedState on all ranges
ii. Use the unreplicated TruncatedState on all ranges

The missing pieces we introduce alongside this migration are:

a. The `Migrate` KV request. This forces all ranges overlapping with the request spans to execute the (below-raft) migrations corresponding to the specific version, moving out of any legacy modes they may currently be in. KV waits for this command to durably apply on all the replicas before returning, guaranteeing to the caller that all pre-migration state has been completely purged from the system.

b. `IterateRangeDescriptors`. This provides a handle on every range descriptor in the system, which callers can then use to send out arbitrary KV requests to in order to run arbitrary KV-level migrations. These requests will typically just be the `Migrate` request, with added code next to the `Migrate` command implementation to do the specific KV-level things intended for the specified version.

c. The `system.migrations` table. We use it to store metadata about ongoing migrations for external visibility/introspection. The schema (listed below) is designed with an eye towards scriptability. We want operators to be able to programmatically use this table to control their upgrade processes, if needed.

CREATE TABLE public.migrations (
    version STRING NOT NULL,
    status STRING NOT NULL,
    description STRING NOT NULL,
    start TIMESTAMP NOT NULL DEFAULT now():::TIMESTAMP,
    completed TIMESTAMP NULL,
    progress STRING NULL,
    CONSTRAINT "primary" PRIMARY KEY (version ASC),
    FAMILY "primary" (version, status, description, start, completed, progress)
)

Release note (general change): Cluster version upgrades, as initiated by `SET CLUSTER SETTING version = <major>-<minor>`, now perform internal maintenance duties that will delay how long it takes for the command to complete. The delay is proportional to the amount of data currently stored in the cluster. The cluster will also experience a small amount of additional load during this period while the upgrade is being finalized.

Release note (general change): We introduce a new `system.migrations` table for introspection into crdb internal data migrations. These migrations are the "maintenance duties" mentioned above. The table surfaces the currently ongoing migrations, the previously completed migrations, and in the case of failure, the errors from the last failed attempt.
And address open comments in cockroachdb#57445. Release note: None
Force-pushed from 45ce792 to 12bf0c2
Thanks for looking. I've addressed all open comments here, and added a bunch of tests. I'm still going to leave this up as an open prototype branch (for this top level review + discussion + running CI) and pull out bits into separate PRs elsewhere.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @miretskiy, @tbg, and @yuzefovich)
pkg/kv/kvserver/replica_proposal.go, line 781 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This is exactly the kind of thing that could cause spurious fatal errors when there are replicaGC'able replicas left around that didn't get included in the raft command. On these, you can still hit the evaluation path (they'll try to request leases). I don't know if this PR already does anything to avoid this behavior (I'll just leave this comment now and will continue reviewing then), but as we've discussed in the past there are two ingredients:
- track the range version (in the replica state) and disallow snapshots from an old version to any newer version
- force a run of replicaGC on all replicas in the cluster that don't have the updated range version (via EveryNode presumably)
That way we can
- call Migrate on the whole keyspace
- force the run
and be assured that all replicas now in the cluster are at the new version, i.e. it's safe to bump the cluster version.
Instead of tracking the range version in the replica state (somewhat out of laziness), I opted to instead introduce a GCReplicas RPC to the Migration service and wire it in using EveryNode. Whenever we're running a ranged Migrate command across the entire keyspace for vX, we'll want to GC previous-version replicas immediately after. I think almost always this would mean processing the entire replica GC queue anyway.
As for disallowing previous version snapshots, I think our MLAI safety checks (discussed below) will already prevent it from happening.
pkg/kv/kvserver/replica_write.go, line 223 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Snapshots that don't include the recipient (as specified by replicaID and descriptor in the snap vs the replicaID of the raft instance) are discarded by the recipient, so this should not be an issue. The particular case you mention will have the learner have a higher replicaID than the incoming snapshot, so the snapshot will also be out for that reason. Finally, a snapshot is discarded unless it moves the LAI forward.
Done, incorporated into the comment here.
pkg/kv/kvserver/batcheval/cmd_migrate.go, line 16 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
small nit, but I'm generally -1 on unnecessary import renames.
Done.
pkg/migration/helper.go, line 39 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Should this be exported given that you need to wrap it in a stabilizing retry loop, and thus is a footgun if used from a migration work function?
good idea, Done.
pkg/migration/helper.go, line 100 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
"progress" is a bit of a misnomer, how about "status"
I agree that "status" is better, but for parity with (the now soured upon) system.jobs
, I reserved the use of "status" for the overall status of the migration ("ongoing/succeeded/failed"). Progress was the closest next thing here that seemed appropriate.
pkg/migration/helper.go, line 136 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
That's not true. A node can always join just as EveryNode returns. What we have achieved is that there is now a causal happened-before relation between executing something on every node in the cluster and the next node joining the cluster. We managed to run something on all nodes without a new node joining half-way through, which could have allowed it to pick up some state (off one of the existing nodes that hadn't finished processing yet).
For the example of bumping the cluster version, we'll have achieved proving that by the time EveryNode returns, every node in the cluster, current and future, will use the new cluster version.
Yea, I had tried elaborating on that in the paragraphs, but we're database people; I should talk in terms of causality. Reworked this entire block.
pkg/migration/helper.go, line 162 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
We should do that soon, too. Ideally in the burst of commits coming out of this one!
Done.
pkg/migration/migrations.go, line 110 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
At 50k ranges, with each range taking ~200ms (a rather conservative assumption but in a wide-area cluster could be true due to hops from the manager to the replicas) to process the Migrate, a block size of 1 is >2.7 hours. Block size 50 is ~3.3 minutes. Block size 200 should easily be in the cards given that each range is doing very little work here, so down to 50s in this case which seems reasonable.
Makes sense, I'll include this back-of-the-envelope calculation in there.
pkg/migration/migrations.go, line 116 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Yeah, this is a bit of a wart. We do want to reach the first range, but we can't address StartKey. However, LocalMax is on r1, so in this case we basically just use LocalMax instead of StartKey.
I wonder if we'd get sanity back if we officially let r1 begin at LocalMax, but that is not worth getting into here or probably ever.
Done.
pkg/migration/migrations.go, line 128 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Doesn't IterateRangeDescriptors have to handle txn retries? The closure here would need to be idempotent, but the counters aren't. Maybe IterateRangeDescriptors should take an init func() as well which can reset them.
Ah, good point. I like the init idea, done.
pkg/migration/migrations.go, line 140 at r1 (raw file):
It sounds reasonable to do this as a separate round of EveryNode
Done. I've added a primitive that simply flushes all engines. Seems like a useful property for the migrations infrastructure to provide.
we could just extend that to explicitly mandate syncing all engines.
I prefer dead-simple, to-the-point APIs, so I'm keeping the BumpClusterVersion RPC as is.
pkg/roachpb/api.go, line 1295 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I wonder if MigrateRange is a better name.
I think I'll keep it as is. We'd technically still be able to send out a ranged Migrate request over arbitrary keyspans; we're only choosing to paginate because it'd be insane not to.
pkg/roachpb/api.proto, line 1500 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Yeah, that's the impression I had as well.
Cool.
pkg/roachpb/api.proto, line 1502 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
No, admin requests are really just a way to invoke random code on the leaseholder. Here we want to properly go through the raft pipeline, which admin requests don't.
Makes sense, this TODO was more for aesthetic reasons than anything functional.
pkg/roachpb/api.proto, line 1509 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
incomplete
Looks like I forgot to push out some changes. Done.
pkg/roachpb/method.go, line 161 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This comment could be better, something like
Migrate updates the range state to conform to a specified cluster version. It is our main mechanism for phasing out legacy code below Raft.
Done.
pkg/sql/catalog/systemschema/system.go, line 347 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Is this going to be gossiped? (The answer should be "no", I forget what the rules are).
I'm not sure (lazily hoping one of the SQL reviewers can answer).
pkg/sql/logictest/testdata/logic_test/distsql_agg, line 578 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Huh. What are these about?
Who knows, just came up trying to re-write the logictests to include details for the new system table.
make -j1 testlogic TESTFLAGS='--rewrite'
I've previously leaned on @yuzefovich for emotional support during these.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @miretskiy, and @tbg)
pkg/sql/logictest/testdata/logic_test/distsql_agg, line 578 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Who knows, just came up trying to re-write the logictests to include details for the new system table.
make -j1 testlogic TESTFLAGS='--rewrite'
I've previously leaned on @yuzefovich for emotional support during these.
I think I fixed it on master, if you rebase, these should go away.
Tests the machinery that's being used to bump cluster versions during version upgrades. Pulled out of our prototype in cockroachdb#57445. Release note: None
@@ -342,6 +342,18 @@ CREATE TABLE system.sqlliveness (
	expiration DECIMAL NOT NULL,
	FAMILY fam0_session_id_expiration (session_id, expiration)
)`

	MigrationsTableSchema = `
CREATE TABLE system.migrations (
@ajwerner, would love to see what you think of this schema.
57637: server: add TestBumpClusterVersion r=irfansharif a=irfansharif

Tests the machinery that's being used to bump cluster versions during version upgrades. Pulled out of our prototype in #57445.

Release note: None

Co-authored-by: irfan sharif <[email protected]>
We can separate out the `Helper`, `Migration`, and various utilities into their own files. We'll add tests for individual components in future commits; the physical separation here sets the foundation for doing so (prototyped in cockroachdb#57445). This commit is purely code movement. Release note: None
We re-define what the Migration type is to be able to annotate it a description. We'll later use this description when populating the `system.migrations` table (originally prototyped in cockroachdb#57445). Release note: None
We make it a bit more ergonomic (this revision was originally prototyped in cockroachdb#57445). Release note: None
It's not currently wired up to anything. We'll use it in future PRs to send out `Migrate` requests to the entire keyspace. This was originally prototyped in cockroachdb#57445. See the inline comments and the RFC (cockroachdb#48843) for the motivation here. Release note: None
57650: migration: introduce various other bells and whistles... r=irfansharif a=irfansharif

...to pave the way for real migrations

This PR cannibalizes most of our prototype in #57445 and shimmies in a few interfaces to make it all unit-testable. While here, we:

- parallelize the execution of the EveryNode primitive
- introduce the IterateRangeDescriptors primitive
- re-write the migration registration process for better ergonomics
- introduce a more structured Migration type, to annotate it with migration specific metadata (it also paves the way for logging the progress of an ongoing migration to a system table)

See individual commits for details.

58036: roachtest: add test_9_6_diagnostics to psycopg blocklist r=rafiss a=arulajmani

Previously, this test was skipped as our pg server version (9.5) did not meet the threshold (9.6) for this test to run. After the pg server version bump to 13 this is no longer the case. A change to explicitly skip this test for CRDB has been merged upstream, but it won't apply until the next release (psycopg_2_8_7) comes out and we switch to testing against that. Until then, this test lives in the blocklist.

Closes #57986

Release note: None

Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: arulajmani <[email protected]>
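For illustration, a sketch of what "a more structured Migration type" annotated with metadata might look like (the names here are made up; the real type lives in pkg/migration):

```go
package main

import (
	"context"
	"fmt"
)

// clusterVersion is a stand-in for clusterversion.ClusterVersion.
type clusterVersion struct{ major, minor int32 }

// migration pairs the version a migration is tied to with a human-readable
// description (what gets surfaced in logs or a system table) and the work
// function itself.
type migration struct {
	version     clusterVersion
	description string
	run         func(ctx context.Context) error
}

// registry maps versions to their long-running migrations.
var registry = map[clusterVersion]migration{}

func register(cv clusterVersion, desc string, run func(context.Context) error) {
	registry[cv] = migration{version: cv, description: desc, run: run}
}

func main() {
	register(clusterVersion{21, 1},
		"use unreplicated TruncatedState and RangeAppliedState on all ranges",
		func(ctx context.Context) error { return nil })
	for cv, m := range registry {
		fmt.Printf("v%d.%d: %s\n", cv.major, cv.minor, m.description)
	}
}
```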
Most of this prototype was folded into #57650. The contentious opinions here (#57662) were scrapped. As for introspection into in-progress migration state, we're simply making use of logging channels for now. There's something to be said for "jobs-ifying" long running migrations at a per-migration level (which in this prototype we simply re-built and dumped into a system table).
and onboard the first long running migration (for the truncated state, see below). The primitives here include:

- The KV ranged `Migrate` command. This command forces all ranges overlapping with the request spans to execute the (below-raft) migrations corresponding to the specific, stated version. This has the effect of moving the range out of any legacy modes of operation they may currently be in. KV waits for this command to durably apply on all the replicas before returning, guaranteeing to the caller that all pre-migration state has been completely purged from the system.

- The `SyncAllEngines` RPC. This is used to instruct the target node to persist relevant in-memory state to disk. Like we mentioned above, KV currently waits for the `Migrate` command to have applied on all replicas before returning. With the applied state, there's no necessity to durably persist it (the representative version is already stored in the raft log). Out of an abundance of caution, and to really really ensure that no pre-migrated state is ever seen in the system, we provide the migration manager a mechanism to flush out all in-memory state to disk. This will let us guarantee that by the time a specific cluster version is bumped, all pre-migrated state from prior to a certain version will have been fully purged from the system. We'll also use it in conjunction with PurgeOutdatedReplicas below.

- The `PurgeOutdatedReplicas` RPC. This too comes up in the context of wanting to ensure that ranges where we've executed a ranged `Migrate` command have no way of ever surfacing pre-migrated state. This can happen with older replicas in the replica GC queue and with applied state that is not yet persisted. Currently we wait for the `Migrate` to have applied on all replicas of a range before returning to the caller. This does not include earlier incarnations of the range, possibly sitting idle in the replica GC queue. These replicas can still request leases, and go through the request evaluation paths, possibly tripping up assertions that check to see no pre-migrated state is found. The `PurgeOutdatedReplicas` RPC lets the migration manager do exactly as the name suggests, ensuring all "outdated" replicas are processed before declaring the specific cluster version bump complete.

- The concept of a "replica state version". This is what's used to construct the migration manager's view of what's "outdated", telling us which migrations can be assumed to have run against a particular replica. When we introduce backwards incompatible changes to the replica state (for example using the unreplicated truncated state instead of the replicated variant), the version informs us whether, for a given replica, we should expect a state representation prior to, or after, the migration (in our example this corresponds to whether or not we can assume an unreplicated truncated state).

As part of this commit, we also re-order the steps taken by the migration manager so that it executes a given migration first before bumping version gates cluster wide. This is because we want authors of migrations to ascertain that their own migrations have run to completion, instead of attaching that check to the next version.

---

This PR motivates all of the above by also onboarding the TruncatedAndRangeAppliedState migration, which lets us do the following:

i. Use the RangeAppliedState on all ranges
ii. Use the unreplicated TruncatedState on all ranges

In 21.2 we'll finally be able to delete holdover code that knows how to handle the legacy replicated truncated state.

Release note (general change): Cluster version upgrades, as initiated by SET CLUSTER SETTING version = <major>-<minor>, now perform internal maintenance duties that will delay how long it takes for the command to complete. The delay is proportional to the amount of data currently stored in the cluster. The cluster will also experience a small amount of additional load during this period while the upgrade is being finalized.

---

The ideas here follow from our original prototype in cockroachdb#57445.
58088: migration: introduce primitives for below-raft migrations r=irfansharif a=irfansharif

These include:

- The KV ranged `Migrate` command. This command forces all ranges overlapping with the request spans to execute the (below-raft) migrations corresponding to the specific, stated version. This has the effect of moving the range out of any legacy modes of operation they may currently be in. KV waits for this command to durably apply on all the replicas before returning, guaranteeing to the caller that all pre-migration state has been completely purged from the system.

- The `FlushAllEngines` RPC. This is used to instruct the target node to persist all in-memory state to disk. Like we mentioned above, KV currently waits for the `Migrate` command to have applied on all replicas before returning. With the applied state, there's no necessity to durably persist it (the representative version is already stored in the raft log). Out of an abundance of caution, and to really really ensure that no pre-migrated state is ever seen in the system, we provide the migration manager a mechanism to flush out all in-memory state to disk. This will let us guarantee that by the time a specific cluster version is bumped, all pre-migrated state from prior to a certain version will have been fully purged from the system.

- The `PurgeOutdatedReplicas` RPC. This too comes up in the context of wanting to ensure that ranges where we've executed a ranged `Migrate` command have no way of ever surfacing pre-migrated state. This can happen with older replicas in the replica GC queue and with applied state that is not yet persisted. Currently we wait for the `Migrate` to have applied on all replicas of a range before returning to the caller. This does not include earlier incarnations of the range, possibly sitting idle in the replica GC queue. These replicas can still request leases, and go through the request evaluation paths, possibly tripping up assertions that check to see no pre-migrated state is found. The `PurgeOutdatedReplicas` RPC lets the migration manager do exactly as the name suggests, ensuring all "outdated" replicas are processed before declaring the specific cluster version bump complete.

- The concept of a "replica state version". This is what's used to construct the migration manager's view of what's "outdated", telling us which migrations can be assumed to have run against a particular replica. When we introduce backwards incompatible changes to the replica state (for example using the unreplicated truncated state instead of the replicated variant), the version informs us whether, for a given replica, we should expect a state representation prior to, or after, the migration (in our example this corresponds to whether or not we can assume an unreplicated truncated state).

- A gating mechanism for "outdated" snapshots. Below-raft migrations want to rely on the invariant that there are no extant replicas in the system that haven't seen the specific Migrate command. This is made difficult by the fact that it's possible for in-flight snapshots to instantiate replicas with pre-migrated state. To guard against this, we gate any incoming snapshots that were generated from a replica that (at the time) had a version less than our store's current active version.

As part of this change, we re-order the steps taken by the migration manager so that it executes a given migration first before bumping version gates cluster wide. This is in consideration of the fact that the migration process can be arbitrarily long running, and it'd be undesirable to pause rebalancing/repair mechanisms from functioning during this time (which we're doing by disallowing snapshots from older replica versions).

---

This PR motivates all of the above by also onboarding the TruncatedAndRangeAppliedState migration, which lets us do the following:

i. Use the RangeAppliedState on all ranges
ii. Use the unreplicated TruncatedState on all ranges

In 21.2 we'll finally be able to delete holdover code that knows how to handle the legacy replicated truncated state.

Release note (general change): Cluster version upgrades, as initiated by SET CLUSTER SETTING version = <major>-<minor>, now perform internal maintenance duties that will delay how long it takes for the command to complete. The delay is proportional to the amount of data currently stored in the cluster. The cluster will also experience a small amount of additional load during this period while the upgrade is being finalized.

---

The ideas here follow from our original prototype in #57445. This PR needs to be broken down into a few more piecemeal ones:

- replica state versioning and gating old snapshots (separate PR)
- pkg/migration primitives
- `Migrate` (stand-alone commit)
- `FlushAllEngines/PurgeOutdatedReplicas` (stand-alone commit)
- `Manager.Migrate` changes (stand-alone commit)
- the truncated state migration (separate PR)

Co-authored-by: irfan sharif <[email protected]>
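Putting the pieces above together, the manager's per-version sequence reads roughly as follows. This is a sketch under the assumptions in this description; the hook names mirror the prose, not actual exported APIs:

```go
package main

import (
	"context"
	"fmt"
)

type clusterVersion struct{ major, minor int32 }

// hooks stands in for the manager's dependencies: the registered migration
// work function plus the node-level RPC fan-outs described above.
type hooks struct {
	runMigration          func(context.Context, clusterVersion) error
	syncAllEngines        func(context.Context) error // flush in-memory applied state
	purgeOutdatedReplicas func(context.Context, clusterVersion) error
	bumpClusterVersion    func(context.Context, clusterVersion) error
}

// migrateTo walks the per-version steps in the re-ordered sequence: run the
// migration itself first, make its effects durable and purge stragglers,
// and only then bump the version gate cluster-wide.
func migrateTo(ctx context.Context, h hooks, versions []clusterVersion) error {
	for _, cv := range versions {
		if err := h.runMigration(ctx, cv); err != nil { // e.g. ranged Migrate over the keyspace
			return err
		}
		if err := h.syncAllEngines(ctx); err != nil {
			return err
		}
		if err := h.purgeOutdatedReplicas(ctx, cv); err != nil {
			return err
		}
		if err := h.bumpClusterVersion(ctx, cv); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	noop := func(context.Context) error { return nil }
	noopCV := func(_ context.Context, cv clusterVersion) error {
		fmt.Printf("step at v%d.%d\n", cv.major, cv.minor)
		return nil
	}
	h := hooks{runMigration: noopCV, syncAllEngines: noop, purgeOutdatedReplicas: noopCV, bumpClusterVersion: noopCV}
	_ = migrateTo(context.Background(), h, []clusterVersion{{21, 1}})
}
```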
This PR onboards the first real long-running migration using the
infrastructure we've been building up within pkg/migration. It adds in
the final missing pieces described in our original RFC (#48843). These
components were originally prototyped in #56107.
The migration in question (which happens to be a below-Raft one, first
attempted in #42822) now lets us do the following:
i. Use the RangeAppliedState on all ranges
ii. Use the unreplicated TruncatedState on all ranges
The missing pieces we introduce alongside this migration are:
a. The `Migrate` KV request. This forces all ranges overlapping with
the request spans to execute the (below-raft) migrations corresponding
to the specific version, moving out of any legacy modes they may
currently be in. KV waits for this command to durably apply on all the
replicas before returning, guaranteeing to the caller that all
pre-migration state has been completely purged from the system.
b. `IterateRangeDescriptors`. This provides a handle on every range
descriptor in the system, which callers can then use to send out
arbitrary KV requests to in order to run arbitrary KV-level
migrations. These requests will typically just be the `Migrate`
request, with added code next to the `Migrate` command implementation
to do the specific KV-level things intended for the specified version.
c. The `system.migrations` table. We use it to store metadata about
ongoing migrations for external visibility/introspection. The schema
(listed below) is designed with an eye towards scriptability. We
want operators to be able to programmatically use this table to control
their upgrade processes, if needed.
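As a usage sketch tying (a) and (b) together, a migration work function might look roughly like this. The helper methods here are illustrative stand-ins for the pkg/migration Helper described above, not its exact API:

```go
package main

import (
	"context"
	"fmt"
)

type rangeDesc struct {
	startKey, endKey string
}

// helper is a hypothetical stand-in for the migration Helper: it can walk
// range descriptors in blocks, issue ranged Migrate requests, and record
// progress for system.migrations.
type helper struct {
	iterate        func(blockSize int, fn func(...rangeDesc) error) error
	migrate        func(ctx context.Context, start, end string) error // sends the Migrate KV request
	updateProgress func(ctx context.Context, progress string) error
}

func truncatedStateMigration(ctx context.Context, h helper) error {
	const blockSize = 200
	migrated := 0
	return h.iterate(blockSize, func(descs ...rangeDesc) error {
		for _, desc := range descs {
			if err := h.migrate(ctx, desc.startKey, desc.endKey); err != nil {
				return err
			}
		}
		migrated += len(descs)
		return h.updateProgress(ctx, fmt.Sprintf("migrated %d ranges", migrated))
	})
}

func main() {
	h := helper{
		iterate: func(_ int, fn func(...rangeDesc) error) error {
			return fn(rangeDesc{"a", "b"}, rangeDesc{"b", "c"})
		},
		migrate:        func(context.Context, string, string) error { return nil },
		updateProgress: func(_ context.Context, p string) error { fmt.Println(p); return nil },
	}
	_ = truncatedStateMigration(context.Background(), h)
}
```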
Release note (general change): Cluster version upgrades, as initiated by
`SET CLUSTER SETTING version = <major>-<minor>`, now perform internal
maintenance duties that will delay how long it takes for the command to
complete. The delay is proportional to the amount of data currently
stored in the cluster. The cluster will also experience a small amount
of additional load during this period while the upgrade is being
finalized.
Release note (general change): We introduce a new `system.migrations`
table for introspection into crdb internal data migrations. These
migrations are the "maintenance duties" mentioned above. The table
surfaces the currently ongoing migrations, the previously completed
migrations, and in the case of failure, the errors from the last failed
attempt.