
[prototype] migration: onboard the first long-running migration #57445

Closed

Conversation

irfansharif
Contributor

This PR onboards the first real long-running migration using the
infrastructure we've been building up within pkg/migration. It adds in
the final missing pieces described in our original RFC (#48843). These
components were originally prototyped in #56107.

The migration in question (which happens to be a below-Raft one, first
attempted in #42822) now lets us do the following:

i. Use the RangeAppliedState on all ranges
ii. Use the unreplicated TruncatedState on all ranges

The missing pieces we introduce alongside this migration are:

a. The Migrate KV request. This forces all ranges overlapping with
the request spans to execute the (below-raft) migrations corresponding
to the specific version, moving out of any legacy modes they may
currently be in. KV waits for this command to durably apply on all the
replicas before returning, guaranteeing to the caller that all
pre-migration state has been completely purged from the system.

b. IterateRangeDescriptors. This provides a handle on every range
descriptor in the system, which callers can then use to send out
arbitrary KV requests in order to run arbitrary KV-level migrations
(see the sketch below the schema). These requests will typically just be the Migrate
request, with added code next to the Migrate command implementation
to do the specific KV-level things intended for the specified version.

c. The system.migrations table. We use it to store metadata about
ongoing migrations for external visibility/introspection. The schema
(listed below) is designed with an eye towards scriptability. We
want operators to be able to programmatically use this table to control
their upgrade processes, if needed.

  CREATE TABLE public.migrations (
      version STRING NOT NULL,
      status STRING NOT NULL,
      description STRING NOT NULL,
      start TIMESTAMP NOT NULL DEFAULT now():::TIMESTAMP,
      completed TIMESTAMP NULL,
      progress STRING NULL,
      CONSTRAINT "primary" PRIMARY KEY (version ASC),
      FAMILY "primary" (version, status, description, start, completed, progress)
  )
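
To make the intended flow concrete, here is a minimal Go sketch of how (a)
and (b) above compose into a migration work function. The interface and
method names below are illustrative stand-ins, not the actual pkg/migration
signatures in this PR.

```go
package migrationsketch

import (
    "context"
    "fmt"
)

// Hypothetical stand-ins for pkg/migration's helper and the KV Migrate
// request; the real signatures may differ.
type rangeDesc struct{ StartKey, EndKey []byte }

type helper interface {
    // IterateRangeDescriptors pages through all range descriptors,
    // handing them to fn in blocks of up to blockSize.
    IterateRangeDescriptors(ctx context.Context, blockSize int, fn func([]rangeDesc) error) error
    // Migrate sends a ranged Migrate request over [start, end) and returns
    // once it has durably applied on all replicas of overlapping ranges.
    Migrate(ctx context.Context, start, end []byte) error
    // UpdateProgress writes a progress string to system.migrations.
    UpdateProgress(ctx context.Context, progress string) error
}

// runMigration walks every range and asks it to move out of its legacy mode.
func runMigration(ctx context.Context, h helper) error {
    const blockSize = 50
    var migrated int
    return h.IterateRangeDescriptors(ctx, blockSize, func(descs []rangeDesc) error {
        for _, desc := range descs {
            if err := h.Migrate(ctx, desc.StartKey, desc.EndKey); err != nil {
                return err
            }
        }
        migrated += len(descs)
        return h.UpdateProgress(ctx, fmt.Sprintf("migrated %d ranges", migrated))
    })
}
```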

Release note (general change): Cluster version upgrades, as initiated by
SET CLUSTER SETTING version = <major>-<minor>, now perform internal
maintenance duties that will delay how long it takes for the command to
complete. The delay is proportional to the amount of data currently
stored in the cluster. The cluster will also experience a small amount
of additional load during this period while the upgrade is being
finalized.

Release note (general change): We introduce a new system.migrations
table for introspection into crdb internal data migrations. These
migrations are the "maintenance duties" mentioned above. The table
surfaces the currently ongoing migrations, the previously completed
migrations, and in the case of failure, the errors from the last failed
attempt.

@irfansharif irfansharif requested review from knz and tbg December 3, 2020 08:54
@irfansharif irfansharif requested review from a team as code owners December 3, 2020 08:54
@cockroach-teamcity
Member

This change is Reviewable

@irfansharif irfansharif removed the request for review from a team December 3, 2020 08:55
@irfansharif
Contributor Author

Heads up, this is still a [wip] branch. It's a bit far from landing. I have zero tests, and intend to add them here. I also intend to break it up into a few commits: introducing the system table, introducing the request type, introducing IterateRangeDescriptors, and only after all that, adding the migration itself. For now I've just squashed them for an early first-pass look/sanity check before I spend more time on final polish.

I have a few open questions in there as comments (grep for XXX:), but any+all thoughts welcome on the rest of it.

@tbg tbg (Member) left a comment


This looks pretty good already! I left a bunch of comments but I think it's basically time to figure out how to write proper tests and how to put this into a number of smaller PRs.

The migration table introduction looks like it could be a good last PR. The migrate command + testing could come first. Then a PR that updates the manager and adds tests and hooks up the migrate command. Just a suggestion.

Reviewed 49 of 49 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @knz)


pkg/kv/kvserver/replica_proposal.go, line 781 at r1 (raw file):

			// below. If we're not running 21.1 yet, migrate over as we've done
			// since the introduction of the applied state key.
			if r.ClusterSettings().Version.IsActive(ctx, clusterversion.TruncatedAndRangeAppliedStateMigration) {

This is exactly the kind of thing that could cause spurious fatal errors when there are replicaGC'able replicas left around that didn't get included in the raft command. On these, you can still hit the evaluation path (they'll try to request leases). I don't know if this PR already does anything to avoid this behavior (I'll just leave this comment now and will continue reviewing then), but as we've discussed in the past there are two ingredients:

  • track the range version (in the replica state) and disallow snapshots from an old version to any newer version
  • force a run of replicaGC on all replicas in the cluster that don't have the updated range version (via EveryNode presumably)

That way we can

  1. call Migrate on the whole keyspace
  2. force the run

and be assured that all replicas now in the cluster are at the new version, i.e. it's safe to bump the cluster version.


pkg/kv/kvserver/replica_write.go, line 223 at r1 (raw file):

				// applied on all peers, which we achieve via waitForApplication.
				//
				// XXX: Could a snapshot below maxLeaseIndex be in-flight to a

Snapshots that don't include the recipient (as specified by replicaID and descriptor in the snap vs the replicaID of the raft instance) are discarded by the recipient, so this should not be an issue. The particular case you mention will have the learner have a higher replicaID than the incoming snapshot, so the snapshot will also be out for that reason. Finally, a snapshot is discarded unless it moves the LAI forward.
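
For readers less familiar with the snapshot plumbing, the three discard
conditions above amount to roughly the following recipient-side check. This is
a simplified, hypothetical sketch; the type and field names are illustrative,
not the actual kvserver types.

```go
package kvsketch

// incomingSnapshot models only the fields relevant to the checks above.
type incomingSnapshot struct {
    toReplicaID     int   // replica ID the snapshot is addressed to
    descReplicaIDs  []int // replica IDs in the snapshot's range descriptor
    leaseAppliedIdx uint64
}

type localReplica struct {
    replicaID       int
    leaseAppliedIdx uint64
}

// discard reports whether the recipient would drop the snapshot.
func discard(r localReplica, s incomingSnapshot) bool {
    // The snapshot's descriptor must include the recipient replica.
    included := false
    for _, id := range s.descReplicaIDs {
        if id == r.replicaID {
            included = true
        }
    }
    if !included {
        return true
    }
    // A snapshot addressed to an older incarnation of the replica is stale.
    if s.toReplicaID < r.replicaID {
        return true
    }
    // The snapshot must move the lease applied index forward.
    return s.leaseAppliedIdx <= r.leaseAppliedIdx
}
```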


pkg/kv/kvserver/batcheval/cmd_migrate.go, line 16 at r1 (raw file):

	"context"

	cv "github.com/cockroachdb/cockroach/pkg/clusterversion"

small nit, but I'm generally -1 on unnecessary import renames.


pkg/migration/helper.go, line 39 at r1 (raw file):

// to execute all relevant node-level operations on them. If any of the nodes
// are found to be unavailable, an error is returned.
func (h *Helper) RequiredNodeIDs(ctx context.Context) ([]roachpb.NodeID, error) {

Should this be exported given that you need to wrap it in a stabilizing retry loop, and thus is a footgun if used from a migration work function?


pkg/migration/helper.go, line 100 at r1 (raw file):

// (accessible using the `progress` column in `system.migrations`) for the
// ongoing migration.
func (h *Helper) UpdateProgress(ctx context.Context, progress string) error {

"progress" is a bit of a misnomer, how about "status"


pkg/migration/helper.go, line 136 at r1 (raw file):

//
// By the time EveryNode returns, we'll have thus invoked the closure against
// every node in the cluster.

That's not true. A node can always join just as EveryNode returns. What we have achieved is that there is now a causal happened-before relation between executing something on every node in the cluster and the next node joining the cluster. We managed to run something on all nodes without a new node joining half-way through, which could have allowed it to pick up some state (off one of the existing nodes that hadn't finished processing yet).

For the example of bumping the cluster version, we'll have achieved proving that by the time EveryNode returns, every node in the cluster, current and future, will use the new cluster version.
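
A minimal sketch of the loop that gives EveryNode this fencing property,
assuming hypothetical hooks in place of the real liveness and node-dialing
plumbing:

```go
package migrationsketch

import (
    "context"
    "sort"
)

// cluster is a hypothetical stand-in for the helper's view of the cluster.
type cluster interface {
    LiveNodeIDs(ctx context.Context) ([]int, error)
    ForEveryNode(ctx context.Context, op func(ctx context.Context, nodeID int) error) error
}

// everyNode runs op on all live nodes and retries until the set of live nodes
// is identical before and after the run. It cannot stop a node from joining
// right after it returns; what it establishes is a happened-before relation:
// any node that joins later does so after op has run everywhere.
func everyNode(ctx context.Context, c cluster, op func(context.Context, int) error) error {
    for {
        if err := ctx.Err(); err != nil {
            return err
        }
        before, err := c.LiveNodeIDs(ctx)
        if err != nil {
            return err
        }
        if err := c.ForEveryNode(ctx, op); err != nil {
            return err
        }
        after, err := c.LiveNodeIDs(ctx)
        if err != nil {
            return err
        }
        if equalIDs(before, after) {
            return nil
        }
        // Membership changed mid-flight; run another round.
    }
}

func equalIDs(a, b []int) bool {
    if len(a) != len(b) {
        return false
    }
    sort.Ints(a)
    sort.Ints(b)
    for i := range a {
        if a[i] != b[i] {
            return false
        }
    }
    return true
}
```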


pkg/migration/helper.go, line 162 at r1 (raw file):

	for {
		// TODO(irfansharif): We can/should send out these RPCs in parallel.

We should do that soon, too. Ideally in the burst of commits coming out of this one!


pkg/migration/migrations.go, line 110 at r1 (raw file):

func TruncatedStateMigration(ctx context.Context, h *Helper) error {
	// TODO(irfansharif): What should the default block size be?
	const blockSize = 50

At 50k ranges, with each range taking ~200ms (a rather conservative assumption but in a wide-area cluster could be true due to hops from the manager to the replicas) to process the Migrate, a block size of 1 is >2.7 hours. Block size 50 is ~3.3 minutes. Block size 200 should easily be in the cards given that each range is doing very little work here, so down to 50s in this case which seems reasonable.
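
The arithmetic behind those numbers, assuming batches are processed
sequentially and the Migrate requests within a batch go out in parallel (so
each batch costs roughly one ~200ms round trip):

```go
package main

import (
    "fmt"
    "time"
)

func main() {
    const numRanges = 50000
    const perBatch = 200 * time.Millisecond // conservative per-Migrate latency
    for _, blockSize := range []int{1, 50, 200} {
        batches := (numRanges + blockSize - 1) / blockSize
        fmt.Printf("block size %3d: %5d batches, ~%v\n",
            blockSize, batches, time.Duration(batches)*perBatch)
    }
    // Output:
    // block size   1: 50000 batches, ~2h46m40s
    // block size  50:  1000 batches, ~3m20s
    // block size 200:   250 batches, ~50s
}
```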


pkg/migration/migrations.go, line 116 at r1 (raw file):

		for _, desc := range descriptors {
			if bytes.Compare(desc.StartKey, keys.LocalMax) < 0 {
				// XXX: Does it make sense to "migrate" ranges with local keys?

Yeah, this is a bit of a wart. We do want to reach the first range, but we can't address StartKey. However, LocalMax is on r1, so in this case we basically just use LocalMax instead of StartKey.

I wonder if we'd get sanity back if we officially let r1 begin at LocalMax, but that is not worth getting into here or probably ever.


pkg/migration/migrations.go, line 128 at r1 (raw file):

		numMigratedRanges += len(descriptors)
		progress := fmt.Sprintf("[batch %d/??] migrated %d ranges", batchIdx, numMigratedRanges)

Doesn't IterateRangeDescriptors have to handle txn retries? The closure here would need to be idempotent, but the counters aren't. Maybe IterateRangeDescriptors should take an init func() as well which can reset them.
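
One way the suggested init hook could look, sketched with illustrative names
rather than the PR's eventual signature:

```go
package migrationsketch

import (
    "context"
    "log"
)

type rangeDesc struct{ StartKey, EndKey []byte }

// descIterator is hypothetical: the iterator calls init whenever it
// (re)starts the underlying txn, so side effects of fn can be reset on
// retries.
type descIterator interface {
    IterateRangeDescriptors(
        ctx context.Context, blockSize int, init func(), fn func([]rangeDesc) error,
    ) error
}

func migrateWithProgress(ctx context.Context, it descIterator) error {
    var batchIdx, numMigrated int
    init := func() {
        // Reset the counters so a txn retry doesn't double count ranges.
        batchIdx, numMigrated = 1, 0
    }
    return it.IterateRangeDescriptors(ctx, 50, init, func(descs []rangeDesc) error {
        numMigrated += len(descs)
        log.Printf("[batch %d/??] migrated %d ranges", batchIdx, numMigrated)
        batchIdx++
        return nil
    })
}
```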


pkg/migration/migrations.go, line 140 at r1 (raw file):

	// XXX: I think we'll need to make sure all stores have synced once to
	// persist all applications of the `Migrate` raft command.

Very good point. It sounds reasonable to do this as a separate round of EveryNode (as opposed to doing it on each raft command application, which would cause tons of syncs).
It's probably also just generally a good idea to make this happen on every migration that does work. It could even be part of bumping the cluster version, which by definition already requires syncing; we could just extend that to explicitly mandate syncing all engines.

Hmm, thinking about this further, this is not good enough when nodes restart rapidly. I think the best way to address this is to make EveryNode more aggressive about retrying: in addition to the list of nodes stabilizing, their corresponding epochs also need to remain stable (and, of course, everyone live). That way we don't impose a requirement that everything we do needs to be synced immediately, which is super expensive and we're also pretty much guaranteed to get it wrong.
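
The stricter stability condition suggested here might be expressed as follows;
nodeInfo is a hypothetical stand-in for a node's liveness record.

```go
package migrationsketch

// nodeInfo is a hypothetical, pared-down liveness record.
type nodeInfo struct {
    ID    int
    Epoch int64
}

// stable reports whether cluster membership was quiescent across an EveryNode
// round: the same node IDs are present and no node restarted (same epochs).
func stable(before, after []nodeInfo) bool {
    if len(before) != len(after) {
        return false
    }
    epochs := make(map[int]int64, len(before))
    for _, n := range before {
        epochs[n.ID] = n.Epoch
    }
    for _, n := range after {
        if e, ok := epochs[n.ID]; !ok || e != n.Epoch {
            return false
        }
    }
    return true
}
```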


pkg/roachpb/api.go, line 1295 at r1 (raw file):

func (*AdminScatterRequest) flags() int                  { return isAdmin | isRange | isAlone }
func (*AdminVerifyProtectedTimestampRequest) flags() int { return isAdmin | isRange | isAlone }
func (*MigrateRequest) flags() int                       { return isWrite | isRange | isAlone }

I wonder if MigrateRange is a better name.


pkg/roachpb/api.proto, line 1500 at r1 (raw file):

// is already going to be the one responsible for stepping through all
// intermediate versions, it's probably saner to only carry out the specific
// migration associated with the target version.

Yeah, that's the impression I had as well.


pkg/roachpb/api.proto, line 1502 at r1 (raw file):

// migration associated with the target version.
//
// TODO(irfansharif): Should this be an admin request instead? There's no good

No, admin requests are really just a way to invoke random code on the leaseholder. Here we want to properly go through the raft pipeline, which admin requests don't.


pkg/roachpb/api.proto, line 1509 at r1 (raw file):

  RequestHeader header = 1 [(gogoproto.nullable) = false, (gogoproto.embed) = true];

  // The version that will become active next. A request with the

incomplete


pkg/roachpb/method.go, line 161 at r1 (raw file):
This comment could be better, something like

Migrate updates the range state to conform to a specified cluster version. It is our main mechanism for phasing out legacy code below Raft.


pkg/sql/catalog/systemschema/system.go, line 347 at r1 (raw file):

	MigrationsTableSchema = `
CREATE TABLE system.migrations (

Is this going to be gossiped? (The answer should be "no", I forget what the rules are).


pkg/sql/catalog/systemschema/system.go, line 615 at r1 (raw file):

	// MigrationsTable is the descriptor for the migrations table. It contains
	// details for all past and ongoing migrations.

We'll want someone from SQL to review this part of the PR. This all looks sane, but I'd want someone to sign off that it is.


pkg/sql/logictest/testdata/logic_test/distsql_agg, line 578 at r1 (raw file):

SELECT corr(y, x)::decimal, covar_pop(y, x)::decimal, covar_samp(y, x)::decimal FROM statistics_agg_test
----
0.04522896319136315  3.75  3.787878787878788

Huh. What are these about?

@irfansharif irfansharif force-pushed the 201201.migrate-raft-state branch from 45ce792 to 12bf0c2 Compare December 7, 2020 14:34
@irfansharif irfansharif requested review from a team and miretskiy and removed request for a team December 7, 2020 14:34

@irfansharif irfansharif (Contributor Author) left a comment


Thanks for looking. I've addressed all open comments here, and added a bunch of tests. I'm still going to leave this up as an open prototype branch (for this top level review + discussion + running CI) and pull out bits into separate PRs elsewhere.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @miretskiy, @tbg, and @yuzefovich)


pkg/kv/kvserver/replica_proposal.go, line 781 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This is exactly the kind of thing that could cause spurious fatal errors when there are replicaGC'able replicas left around that didn't get included in the raft command. On these, you can still hit the evaluation path (they'll try to request leases). I don't know if this PR already does anything to avoid this behavior (I'll just leave this comment now and will continue reviewing then), but as we've discussed in the past there are two ingredients:

  • track the range version (in the replica state) and disallow snapshots from an old version to any newer version
  • force a run of replicaGC on all replicas in the cluster that don't have the updated range version (via EveryNode presumably)

That way we can

  1. call Migrate on the whole keyspace
  2. force the run

and be assured that all replicas now in the cluster are at the new version, i.e. it's safe to bump the cluster version.

Instead of tracking the range version in the replica state (somewhat out of laziness), I opted to instead introduce a GCReplicas RPC to the Migration service and wire it in using EveryNode. Whenever we're running a ranged Migrate command across the entire keyspace for vX, we'll want to GC previous version replicas immediately after. I think almost always this would mean processing the entire replica GC queue anyway.
As for disallowing previous version snapshots, I think our MLAI safety checks (discussed below) will already prevent it from happening.


pkg/kv/kvserver/replica_write.go, line 223 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Snapshots that don't include the recipient (as specified by replicaID and descriptor in the snap vs the replicaID of the raft instance) are discarded by the recipient, so this should not be an issue. The particular case you mention will have the learner have a higher replicaID than the incoming snapshot, so the snapshot will also be out for that reason. Finally, a snapshot is discarded unless it moves the LAI forward.

Done, incorporated into the comment here.


pkg/kv/kvserver/batcheval/cmd_migrate.go, line 16 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

small nit, but I'm generally -1 on unnecessary import renames.

Done.


pkg/migration/helper.go, line 39 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Should this be exported given that you need to wrap it in a stabilizing retry loop, and thus is a footgun if used from a migration work function?

good idea, Done.


pkg/migration/helper.go, line 100 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

"progress" is a bit of a misnomer, how about "status"

I agree that "status" is better, but for parity with (the now soured upon) system.jobs, I reserved the use of "status" for the overall status of the migration ("ongoing/succeeded/failed"). Progress was the closest next thing here that seemed appropriate.


pkg/migration/helper.go, line 136 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

That's not true. A node can always join just as EveryNode returns. What we have achieved is that there is now a causal happened-before relation between executing something on every node in the cluster and the next node joining the cluster. We managed to run something on all nodes without a new node joining half-way through, which could have allowed it to pick up some state (off one of the existing nodes that hadn't finished processing yet).

For the example of bumping the cluster version, we'll have achieved proving that by the time EveryNode returns, every node in the cluster, current and future, will use the new cluster version.

yea, I had tried elaborating on that in the paragraphs, but we're database people, I should talk in terms of causality. Reworked this entire block.


pkg/migration/helper.go, line 162 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

We should do that soon, too. Ideally in the burst of commits coming out of this one!

Done.


pkg/migration/migrations.go, line 110 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

At 50k ranges, with each range taking ~200ms (a rather conservative assumption but in a wide-area cluster could be true due to hops from the manager to the replicas) to process the Migrate, a block size of 1 is >2.7 hours. Block size 50 is ~3.3 minutes. Block size 200 should easily be in the cards given that each range is doing very little work here, so down to 50s in this case which seems reasonable.

Makes sense, I'll include this back-of-the-envelope calculation in there.


pkg/migration/migrations.go, line 116 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Yeah, this is a bit of a wart. We do want to reach the first range, but we can't address StartKey. However, LocalMax is on r1, so in this case we basically just use LocalMax instead of StartKey.

I wonder if we'd get sanity back if we officially let r1 begin at LocalMax, but that is not worth getting into here or probably ever.

Done.


pkg/migration/migrations.go, line 128 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Doesn't IterateRangeDescriptors have to handle txn retries? The closure here would need to be idempotent, but the counters aren't. Maybe IterateRangeDescriptors should take an init func() as well which can reset them.

Ah, good point. I like the init idea, done.


pkg/migration/migrations.go, line 140 at r1 (raw file):

It sounds reasonable to do this as a separate round of EveryNode

Done. I've added a primitive that simply flushes all engines. Seems like a useful property for the migrations infrastructure to provide.

we could just extend that to explicitly mandate syncing all engines.

I prefer dead-simple to-the-point APIs, so keeping BumpClusterVersion RPC as is.


pkg/roachpb/api.go, line 1295 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I wonder if MigrateRange is a better name.

I think I'll keep it as is. We'd technically still be able to send out a ranged Migrate request over arbitrary keyspans, we're only choosing to paginate because it'd be insane not to.


pkg/roachpb/api.proto, line 1500 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Yeah, that's the impression I had as well.

Cool.


pkg/roachpb/api.proto, line 1502 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

No, admin requests are really just a way to invoke random code on the leaseholder. Here we want to properly go through the raft pipeline, which admin requests don't.

Makes sense, this TODO was more for aesthetic reasons than anything functional.


pkg/roachpb/api.proto, line 1509 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

incomplete

Looks like I forgot to push out some changes. Done.


pkg/roachpb/method.go, line 161 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This comment could be better, something like

Migrate updates the range state to conform to a specified cluster version. It is our main mechanism for phasing out legacy code below Raft.

Done.


pkg/sql/catalog/systemschema/system.go, line 347 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Is this going to be gossiped? (The answer should be "no", I forget what the rules are).

I'm not sure (lazily hoping one of the SQL reviewers can answer).


pkg/sql/logictest/testdata/logic_test/distsql_agg, line 578 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Huh. What are these about?

Who knows, just came up trying to re-write the logictests to include details for the new system table.

make -j1 testlogic TESTFLAGS='--rewrite'

I've previously leaned on @yuzefovich for emotional support during these.

@yuzefovich yuzefovich (Member) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @miretskiy, and @tbg)


pkg/sql/logictest/testdata/logic_test/distsql_agg, line 578 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Who knows, just came up trying to re-write the logictests to include details for the new system table.

make -j1 testlogic TESTFLAGS='--rewrite'

I've previously leaned on @yuzefovich for emotional support during these.

I think I fixed it on master, if you rebase, these should go away.

irfansharif added a commit to irfansharif/cockroach that referenced this pull request Dec 7, 2020
Tests the machinery that's being used to bump cluster versions during version
upgrades. Pulled out of our prototype in cockroachdb#57445.

Release note: None
@irfansharif irfansharif changed the title [wip] migration: onboard the first long-running migration [prototype] migration: onboard the first long-running migration Dec 7, 2020
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Dec 9, 2020
Tests the machinery that's being used to bump cluster versions during version
upgrades. Pulled out of our prototype in cockroachdb#57445.

Release note: None
@@ -342,6 +342,18 @@ CREATE TABLE system.sqlliveness (
expiration DECIMAL NOT NULL,
FAMILY fam0_session_id_expiration (session_id, expiration)
)`

MigrationsTableSchema = `
CREATE TABLE system.migrations (
Contributor Author


@ajwerner, would love to see what you think of this schema.

@irfansharif irfansharif requested a review from ajwerner December 9, 2020 20:11
craig bot pushed a commit that referenced this pull request Dec 9, 2020
57637: server: add TestBumpClusterVersion r=irfansharif a=irfansharif

Tests the machinery that's being used to bump cluster versions during version
upgrades. Pulled out of our prototype in #57445.

Release note: None

Co-authored-by: irfan sharif <[email protected]>
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Dec 11, 2020
We can separate out the `Helper`, `Migration`, and various utilities
into their own files. We'll add tests for individual components in
future commits; the physical separation here sets the foundation for
doing so (prototyped in cockroachdb#57445). This commit is purely code movement.

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Dec 11, 2020
We re-define the Migration type to be able to annotate it with a
description. We'll later use this description when populating the
`system.migrations` table (originally prototyped in cockroachdb#57445).

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Dec 11, 2020
We make it a bit more ergonomic (this revision was originally
prototyped in cockroachdb#57445).

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Dec 11, 2020
It's not currently wired up to anything. We'll use it in future PRs to
send out `Migrate` requests to the entire keyspace. This was originally
prototyped in cockroachdb#57445. See the inline comments and the RFC (cockroachdb#48843) for
the motivation here.

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Dec 17, 2020
It's not currently wired up to anything. We'll use it in future PRs to
send out `Migrate` requests to the entire keyspace. This was originally
prototyped in cockroachdb#57445. See the inline comments and the RFC (cockroachdb#48843) for
the motivation here.

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Dec 18, 2020
It's not currently wired up to anything. We'll use it in future PRs to
send out `Migrate` requests to the entire keyspace. This was originally
prototyped in cockroachdb#57445. See the inline comments and the RFC (cockroachdb#48843) for
the motivation here.

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Dec 18, 2020
It's not currently wired up to anything. We'll use it in future PRs to
send out `Migrate` requests to the entire keyspace. This was originally
prototyped in cockroachdb#57445. See the inline comments and the RFC (cockroachdb#48843) for
the motivation here.

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Dec 18, 2020
It's not currently wired up to anything. We'll use it in future PRs to
send out `Migrate` requests to the entire keyspace. This was originally
prototyped in cockroachdb#57445. See the inline comments and the RFC (cockroachdb#48843) for
the motivation here.

Release note: None
craig bot pushed a commit that referenced this pull request Dec 18, 2020
57650: migration: introduce various other bells and whistles... r=irfansharif a=irfansharif

...to pave the way for real migrations

This PR cannibalizes most of our prototype in #57445 and shimmies in a few
interfaces to make it all unit-testable. While here, we:
- parallelize the execution of the EveryNode primitive
- introduce the IterateRangeDescriptors primitive
- re-write the migration registration process for better ergonomics
- introduce a more structured Migration type, to annotate it with migration
  specific metadata (it also paves the way for logging the progress of an
  ongoing migration to a system table).

See individual commits for details. 


58036: roachtest: add test_9_6_diagnostics to psycopg blocklist r=rafiss a=arulajmani

Previously, this test was skipped as our pg server version (9.5) did
not meet the threshold (9.6) for this test to run. After the pg server
version bump to 13 this is no longer the case. A change to explicitly
skip this test for CRDB has been merged upstream, but it won't apply
until the next release (psycopg_2_8_7) comes out and we switch to
testing against that. Until then, this test lives in the blocklist.

Closes #57986

Release note: None

Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: arulajmani <[email protected]>
@irfansharif
Contributor Author

Most of this prototype was folded into #57650. The contentious opinions here (#57662) were scrapped. As for introspection into in-progress migration state, we're simply making use of logging channels for now. There's something to be said for "jobs-ifying" long running migrations at a per-migration level (which in this prototype we simply re-built and dumped into a system.migrations table), but that's not a thread we're going to pull on for a bit I think.

irfansharif added a commit to irfansharif/cockroach that referenced this pull request Dec 19, 2020
These include:

- The KV ranged `Migrate` command. This command forces all ranges
  overlapping with the request spans to execute the (below-raft)
  migrations corresponding to the specific, stated version. This has the
  effect of moving the range out of any legacy modes of operation they may
  currently be in. KV waits for this command to durably apply on all the
  replicas before returning, guaranteeing to the caller that all
  pre-migration state has been completely purged from the system.

- The `FlushAllEngines` RPC. This is used to instruct the target node
  to persist all in-memory state to disk. Like we mentioned above, KV
  currently waits for the `Migrate` command to have applied on all
  replicas before returning. With the applied state, there's no
  necessity to durably persist it (the representative version is already
  stored in the raft log). Out of an abundance of caution, and to really
  really ensure that no pre-migrated state is ever seen in the system,
  we provide the migration manager a mechanism to flush out all
  in-memory state to disk. This will let us guarantee that by the time a
  specific cluster version is bumped, all pre-migrated state from prior
  to a certain version will have been fully purged from the system.

- The `PurgeOutdatedReplicas` RPC. This too comes up in the context of
  wanting to ensure that ranges over which we've executed a ranged `Migrate`
  command have no way of ever surfacing pre-migrated state. This can
  happen with older replicas in the replica GC queue and with applied
  state that is not yet persisted. Currently we wait for the `Migrate`
  to have applied on all replicas of a range before returning to the
  caller. This does not include earlier incarnations of the range,
  possibly sitting idle in the replica GC queue. These replicas can
  still request leases, and go through the request evaluation paths,
  possibly tripping up assertions that check to see no pre-migrated
  state is found. The `PurgeOutdatedReplicas` lets the migration manager
  do exactly as the name suggests, ensuring all "outdated" replicas are
  processed before declaring the specific cluster version bump complete.

- The concept of a "replica state version". This is what's used to
  construct the migration manager's view of what's "outdated", telling
  us which migrations can be assumed to have run against a particular
  replica.  When we introduce backwards incompatible changes to the
  replica state (for example using the unreplicated truncated state
  instead of the replicated variant), the version would inform us if,
  for a given replica, we should expect a state representation prior to,
  or after the migration (in our example this corresponds to whether or
  not we can assume an unreplicated truncated state).

- A gating mechanism for "outdated" snapshots. Below-raft migrations
  want to rely on the invariant that there are no extant replicas in the
  system that haven't seen the specific Migrate command. This is made
  difficult by the fact that it's possible for in-flight snapshots to
  instantiate replicas with pre-migrated state. To guard against this,
  we gate any incoming snapshots that were generated from a replica that
  (at the time) had a version less than our store's current active
  version.
  As part of this change, we re-order the steps taken by the migration
  manager so that it executes a given migration first before bumping
  version gates cluster wide. This is in consideration of the fact that
  the migration process can be arbitrarily long running, and it'd be
  undesirable to pause rebalancing/repair mechanisms from functioning
  during this time (which we're doing by disallowing snapshots from
  older replica versions).

---

This PR motivates all of the above by also onboarding the
TruncatedAndRangeAppliedState migration, which lets us do the following:

  i. Use the RangeAppliedState on all ranges
  ii. Use the unreplicated TruncatedState on all ranges

In 21.2 we'll finally be able to delete holdover code that knows how to
handle the legacy replicated truncated state.

Release note (general change): Cluster version upgrades, as initiated by
SET CLUSTER SETTING version = <major>-<minor>, now perform internal
maintenance duties that will delay how long it takes for the command to
complete. The delay is proportional to the amount of data currently
stored in the cluster. The cluster will also experience a small amount
of additional load during this period while the upgrade is being
finalized.

---

The ideas here follow from our original prototype in cockroachdb#57445. This PR
needs to be broken down into a few more piecemeal ones:
- replica state versioning and gating old snapshots (separate PR)
- pkg/migration primitives
  - `Migrate` (stand-alone commit)
  - `FlushAllEngines/PurgeOutdatedReplicas` (stand-alone commit)
  - `Manager.Migrate` changes (stand-alone commit)
- the truncated state migration (separate PR)
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Dec 23, 2020
and onboard the truncated state migration. These include:

- The KV ranged `Migrate` command. This command forces all ranges
  overlapping with the request spans to execute the (below-raft)
  migrations corresponding to the specific, stated version. This has the
  effect of moving the range out of any legacy modes of operation they may
  currently be in. KV waits for this command to durably apply on all the
  replicas before returning, guaranteeing to the caller that all
  pre-migration state has been completely purged from the system.

- The `SyncAllEngines` RPC. This is used to instruct the target node
  to persist relevant in-memory state to disk. Like we mentioned above,
  KV currently waits for the `Migrate` command to have applied on all
  replicas before returning. With the applied state, there's no
  necessity to durably persist it (the representative version is already
  stored in the raft log). Out of an abundance of caution, and to really
  really ensure that no pre-migrated state is ever seen in the system,
  we provide the migration manager a mechanism to flush out all
  in-memory state to disk. This will let us guarantee that by the time a
  specific cluster version is bumped, all pre-migrated state from prior
  to a certain version will have been fully purged from the system.
  We'll also use it in conjunction with PurgeOutdatedReplicas below.

- The `PurgeOutdatedReplicas` RPC. This too comes up in the context of
  wanting to ensure that ranges over which we've executed a ranged `Migrate`
  command have no way of ever surfacing pre-migrated state. This can
  happen with older replicas in the replica GC queue and with applied
  state that is not yet persisted. Currently we wait for the `Migrate`
  to have applied on all replicas of a range before returning to the
  caller. This does not include earlier incarnations of the range,
  possibly sitting idle in the replica GC queue. These replicas can
  still request leases, and go through the request evaluation paths,
  possibly tripping up assertions that check to see no pre-migrated
  state is found. The `PurgeOutdatedReplicas` lets the migration manager
  do exactly as the name suggests, ensuring all "outdated" replicas are
  processed before declaring the specific cluster version bump complete.

- The concept of a "replica state version". This is what's used to
  construct the migration manager's view of what's "outdated", telling
  us which migrations can be assumed to have run against a particular
  replica.  When we introduce backwards incompatible changes to the
  replica state (for example using the unreplicated truncated state
  instead of the replicated variant), the version would inform us if,
  for a given replica, we should expect a state representation prior to,
  or after the migration (in our example this corresponds to whether or
  not we can assume an unreplicated truncated state).

As part of this commit, we also re-order the steps taken by the
migration manager so that it executes a given migration first before
bumping version gates cluster wide. This is because we want authors of
migrations to ascertain that their own migrations have run to
completion, instead of attaching that check to the next version.

---

This PR motivates all of the above by also onboarding the
TruncatedAndRangeAppliedState migration, which lets us do the following:

  i. Use the RangeAppliedState on all ranges
  ii. Use the unreplicated TruncatedState on all ranges

In 21.2 we'll finally be able to delete holdover code that knows how to
handle the legacy replicated truncated state.

Release note (general change): Cluster version upgrades, as initiated by
SET CLUSTER SETTING version = <major>-<minor>, now perform internal
maintenance duties that will delay how long it takes for the command to
complete. The delay is proportional to the amount of data currently
stored in the cluster. The cluster will also experience a small amount
of additional load during this period while the upgrade is being
finalized.

---

The ideas here follow from our original prototype in cockroachdb#57445.
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Dec 30, 2020
and onboard the first long running migration (for the truncated state,
see below). The primitives here include:

- The KV ranged `Migrate` command. This command forces all ranges
  overlapping with the request spans to execute the (below-raft)
  migrations corresponding to the specific, stated version. This has the
  effect of moving the range out of any legacy modes of operation they may
  currently be in. KV waits for this command to durably apply on all the
  replicas before returning, guaranteeing to the caller that all
  pre-migration state has been completely purged from the system.

- The `SyncAllEngines` RPC. This is used to instruct the target node
  to persist relevant in-memory state to disk. Like we mentioned above,
  KV currently waits for the `Migrate` command to have applied on all
  replicas before returning. With the applied state, there's no
  necessity to durably persist it (the representative version is already
  stored in the raft log). Out of an abundance of caution, and to really
  really ensure that no pre-migrated state is ever seen in the system,
  we provide the migration manager a mechanism to flush out all
  in-memory state to disk. This will let us guarantee that by the time a
  specific cluster version is bumped, all pre-migrated state from prior
  to a certain version will have been fully purged from the system.
  We'll also use it in conjunction with PurgeOutdatedReplicas below.

- The `PurgeOutdatedReplicas` RPC. This too comes up in the context of
  wanting to ensure that ranges over which we've executed a ranged `Migrate`
  command have no way of ever surfacing pre-migrated state. This can
  happen with older replicas in the replica GC queue and with applied
  state that is not yet persisted. Currently we wait for the `Migrate`
  to have applied on all replicas of a range before returning to the
  caller. This does not include earlier incarnations of the range,
  possibly sitting idle in the replica GC queue. These replicas can
  still request leases, and go through the request evaluation paths,
  possibly tripping up assertions that check to see no pre-migrated
  state is found. The `PurgeOutdatedReplicas` lets the migration manager
  do exactly as the name suggests, ensuring all "outdated" replicas are
  processed before declaring the specific cluster version bump complete.

- The concept of a "replica state version". This is what's used to
  construct the migration manager's view of what's "outdated", telling
  us which migrations can be assumed to have run against a particular
  replica.  When we introduce backwards incompatible changes to the
  replica state (for example using the unreplicated truncated state
  instead of the replicated variant), the version would inform us if,
  for a given replica, we should expect a state representation prior to,
  or after the migration (in our example this corresponds to whether or
  not we can assume an unreplicated truncated state).

As part of this commit, we also re-order the steps taken by the
migration manager so that it executes a given migration first before
bumping version gates cluster wide. This is because we want authors of
migrations to ascertain that their own migrations have run to
completion, instead of attaching that check to the next version.

---

This PR motivates all of the above by also onboarding the
TruncatedAndRangeAppliedState migration, which lets us do the following:

  i. Use the RangeAppliedState on all ranges
  ii. Use the unreplicated TruncatedState on all ranges

In 21.2 we'll finally be able to delete holdover code that knows how to
handle the legacy replicated truncated state.

Release note (general change): Cluster version upgrades, as initiated by
SET CLUSTER SETTING version = <major>-<minor>, now perform internal
maintenance duties that will delay how long it takes for the command to
complete. The delay is proportional to the amount of data currently
stored in the cluster. The cluster will also experience a small amount
of additional load during this period while the upgrade is being
finalized.

---

The ideas here follow from our original prototype in cockroachdb#57445.
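
Putting the pieces described above together, the manager's ordering for a
below-raft migration might look roughly like this. The method names are
hypothetical; the real Manager wiring in the PRs that followed differs in
detail.

```go
package migrationsketch

import "context"

// clusterOps is a hypothetical view of the operations the manager needs.
type clusterOps interface {
    // MigrateKeyspace issues ranged Migrate requests covering the entire
    // keyspace for the given version.
    MigrateKeyspace(ctx context.Context, version string) error
    // SyncAllEngines and PurgeOutdatedReplicas fan out to every node, via
    // the EveryNode primitive.
    SyncAllEngines(ctx context.Context) error
    PurgeOutdatedReplicas(ctx context.Context, version string) error
    // BumpClusterVersion activates the version gate cluster-wide.
    BumpClusterVersion(ctx context.Context, version string) error
}

// migrateTo runs the below-raft work for a version first and only then bumps
// the cluster-wide version gate, so rebalancing and repair aren't paused for
// the (arbitrarily long) duration of the migration itself.
func migrateTo(ctx context.Context, c clusterOps, version string) error {
    if err := c.MigrateKeyspace(ctx, version); err != nil {
        return err
    }
    if err := c.SyncAllEngines(ctx); err != nil {
        return err
    }
    if err := c.PurgeOutdatedReplicas(ctx, version); err != nil {
        return err
    }
    return c.BumpClusterVersion(ctx, version)
}
```
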
craig bot pushed a commit that referenced this pull request Dec 30, 2020
58088: migration: introduce primitives for below-raft migrations r=irfansharif a=irfansharif

These include:

- The KV ranged `Migrate` command. This command forces all ranges
  overlapping with the request spans to execute the (below-raft)
  migrations corresponding to the specific, stated version. This has the
  effect of moving the range out of any legacy modes of operation they may
  currently be in. KV waits for this command to durably apply on all the
  replicas before returning, guaranteeing to the caller that all
  pre-migration state has been completely purged from the system.

- The `FlushAllEngines` RPC. This is used to instruct the target node
  to persist all in-memory state to disk. Like we mentioned above, KV
  currently waits for the `Migrate` command to have applied on all
  replicas before returning. With the applied state, there's no
  necessity to durably persist it (the representative version is already
  stored in the raft log). Out of an abundance of caution, and to really
  really ensure that no pre-migrated state is ever seen in the system,
  we provide the migration manager a mechanism to flush out all
  in-memory state to disk. This will let us guarantee that by the time a
  specific cluster version is bumped, all pre-migrated state from prior
  to a certain version will have been fully purged from the system.

- The `PurgeOutdatedReplicas` RPC. This too comes up in the context of
  wanting to ensure that ranges over which we've executed a ranged `Migrate`
  command have no way of ever surfacing pre-migrated state. This can
  happen with older replicas in the replica GC queue and with applied
  state that is not yet persisted. Currently we wait for the `Migrate`
  to have applied on all replicas of a range before returning to the
  caller. This does not include earlier incarnations of the range,
  possibly sitting idle in the replica GC queue. These replicas can
  still request leases, and go through the request evaluation paths,
  possibly tripping up assertions that check to see no pre-migrated
  state is found. The `PurgeOutdatedReplicas` lets the migration manager
  do exactly as the name suggests, ensuring all "outdated" replicas are
  processed before declaring the specific cluster version bump complete.

- The concept of a "replica state version". This is what's used to
  construct the migration manager's view of what's "outdated", telling
  us which migrations can be assumed to have run against a particular
  replica.  When we introduce backwards incompatible changes to the
  replica state (for example using the unreplicated truncated state
  instead of the replicated variant), the version would inform us if,
  for a given replica, we should expect a state representation prior to,
  or after the migration (in our example this corresponds to whether or
  not we can assume an unreplicated truncated state).

- A gating mechanism for "outdated" snapshots. Below-raft migrations
  want to rely on the invariant that there are no extant replicas in the
  system that haven't seen the specific Migrate command. This is made
  difficult by the fact that it's possible for in-flight snapshots to
  instantiate replicas with pre-migrated state. To guard against this,
  we gate any incoming snapshots that were generated from a replica that
  (at the time) had a version less than our store's current active
  version.
  As part of this change, we re-order the steps taken by the migration
  manager so that it executes a given migration first before bumping
  version gates cluster wide. This is in consideration of the fact that
  the migration process can be arbitrarily long running, and it'd be
  undesirable to pause rebalancing/repair mechanisms from functioning
  during this time (which we're doing by disallowing snapshots from
  older replica versions).

---

This PR motivates all of the above by also onboarding the
TruncatedAndRangeAppliedState migration, which lets us do the following:

  i. Use the RangeAppliedState on all ranges
  ii. Use the unreplicated TruncatedState on all ranges

In 21.2 we'll finally be able to delete holdover code that knows how to
handle the legacy replicated truncated state.

Release note (general change): Cluster version upgrades, as initiated by
SET CLUSTER SETTING version = <major>-<minor>, now perform internal
maintenance duties that will delay how long it takes for the command to
complete. The delay is proportional to the amount of data currently
stored in the cluster. The cluster will also experience a small amount
of additional load during this period while the upgrade is being
finalized.

---

The ideas here follow from our original prototype in #57445. This PR
needs to be broken down into a few more piecemeal ones:
- replica state versioning and gating old snapshots (separate PR)
- pkg/migration primitives
  - `Migrate` (stand-alone commit)
  - `FlushAllEngines/PurgeOutdatedReplicas` (stand-alone commit)
  - `Manager.Migrate` changes (stand-alone commit)
- the truncated state migration (separate PR)

Co-authored-by: irfan sharif <[email protected]>
@irfansharif irfansharif deleted the 201201.migrate-raft-state branch January 8, 2021 21:41