
*: introduce the kv Migrate command, and GCReplicas/FlushAllEngines RPCs #57662

Closed
wants to merge 11 commits

Conversation

irfansharif
Contributor

The Migrate command forces all ranges overlapping with the request
spans to execute the (below-raft) migrations corresponding to the
specific, stated version. This has the effect of moving those ranges out
of any legacy mode of operation they may currently be in. KV waits for
this command to durably apply on all replicas before returning,
guaranteeing to the caller that all pre-migration state has been
completely purged from the system.
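
To make the shape of this concrete, here's a hedged sketch of how a caller might issue the ranged command. The request type and its fields below are assumptions based on the description above, not necessarily the final API.

```go
import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/kv"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// migrateSpan (hypothetical) issues a ranged Migrate over sp, targeting
// version v. The exact MigrateRequest fields are assumptions for illustration.
func migrateSpan(ctx context.Context, db *kv.DB, sp roachpb.Span, v roachpb.Version) error {
	b := &kv.Batch{}
	b.AddRawRequest(&roachpb.MigrateRequest{
		RequestHeader: roachpb.RequestHeader{Key: sp.Key, EndKey: sp.EndKey},
		Version:       v,
	})
	// The batch is expected to return only once the migrations have durably
	// applied on all replicas of all overlapping ranges.
	return db.Run(ctx, b)
}
```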

We also introduce two new RPCs for the migration infrastructure to use.
GCReplicas will be used to instruct the target node to process all
GC-able replicas. FlushAllEngines will be used to instruct the target
node to persist all in-memory state to disk. Both of these are necessary
primitives for the migration infrastructure.
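
As a rough illustration (not the actual protobuf service definition), the capability these two RPCs give the migration infrastructure amounts to something like the following client-side interface; the names and signatures here are assumptions.

```go
import "context"

// migrationClient is a hypothetical client-side view of the two new RPCs.
type migrationClient interface {
	// GCReplicas asks the target node to process all of its GC-able replicas.
	GCReplicas(ctx context.Context) error
	// FlushAllEngines asks the target node to persist all in-memory state
	// held in its storage engines to disk.
	FlushAllEngines(ctx context.Context) error
}
```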

Specifically, this comes up in the context of wanting to ensure that
ranges over which we've executed a ranged Migrate command have no way
of ever surfacing pre-migrated state. This can happen with older
replicas in the replica GC queue and with applied state that is not yet
persisted. We elaborate on both of these below:

Motivation for GCReplicas: Currently we wait for the Migrate command to
have applied on all replicas of a range before returning to the caller.
This does not include earlier incarnations of the range, possibly
sitting idle in the replica GC queue. These replicas can still request
leases and go through the request evaluation paths, possibly tripping up
assertions that check that no pre-migrated state is found. For this
reason we introduce the GCReplicas RPC, which the migration manager can
use to ensure all GC-able replicas are processed before declaring the
specific cluster version bump complete.

Motivation for FlushAllEngines: As we mentioned above, KV currently
waits for the Migrate command to have applied on all replicas before
returning. The applied state doesn't strictly need to be durably
persisted (the representative version is already stored in the raft
log). Still, out of an abundance of caution, and to really ensure that
no pre-migrated state is ever seen in the system, we provide the
migration manager a mechanism to flush all in-memory state to disk.
This way the manager can guarantee that by the time a specific cluster
version is bumped, all pre-migrated state from before that cluster
version will have been fully purged from the system.

The ideas here follow from our original prototype in #57445. Neither of
these RPCs nor the Migrate command is currently wired up to anything.
That'll happen in a future PR introducing the raft truncated state
migration.


Only the last two commits here are of interest. All prior commits are
from #57650 and #57637 respectively.

@irfansharif irfansharif requested a review from a team as a code owner December 7, 2020 20:46
@cockroach-teamcity
Member

This change is Reviewable

Makes for better logging.

Release note: None
We can separate out the `Helper`, `Migration`, and various utilities
into their own files. We'll add tests for individual components in
future commits; the physical separation here sets the foundation for
doing so (prototyped in cockroachdb#57445). This commit is purely code movement.

Release note: None
It's clearer to talk explicitly in terms of causality.

Release note: None
We re-define the Migration type to be able to annotate it with a
description. We'll later use this description when populating the
`system.migrations` table (originally prototyped in cockroachdb#57445).
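
For illustration, the re-defined type might look roughly like the sketch below; the field names here are assumptions, not the actual definition.

```go
// Migration (sketch): the work to run for a given cluster version, annotated
// with a human-readable description.
type Migration struct {
	// cv is the cluster version this migration is associated with.
	cv clusterversion.ClusterVersion
	// fn is the work the migration carries out.
	fn func(context.Context, *Helper) error
	// desc is a human-readable description, later used to populate
	// the system.migrations table.
	desc string
}
```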

Release note: None
We make it a bit more ergonomic (this revision was originally
prototyped in cockroachdb#57445).

Release note: None
To facilitate testing `Helper` in isolation, we introduce a `cluster`
interface that we'll mock out in tests. It's through this interface that
the migration infrastructure will be able to dial out to a specific node,
grab hold of a kv.DB instance, and retrieve the current cluster
membership.
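
A minimal sketch of what such an interface could look like, under assumed method names (the real one may well differ):

```go
// cluster (sketch) abstracts the pieces of the running cluster the migration
// infrastructure needs, so tests can swap in a mock implementation.
type cluster interface {
	// Nodes returns the current cluster membership.
	Nodes(ctx context.Context) ([]roachpb.NodeID, error)
	// Dial returns a gRPC connection to the given node.
	Dial(ctx context.Context, id roachpb.NodeID) (*grpc.ClientConn, error)
	// DB returns a handle to a kv.DB instance.
	DB() *kv.DB
}
```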

Part of this diff also downgrades `RequiredNodes` from being a first-class
primitive, instead tucking it away for internal use only. Given that
retrieving the cluster membership makes no guarantees about new nodes
being added to the cluster, it's entirely possible for that to happen
concurrently with the retrieval. Appropriate usage then entails wrapping
it in a stabilizing loop, like we do in `EveryNode`. This tells us there's
no need to expose it directly to migration authors.
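
Here's a hedged sketch of that stabilizing loop, with `listNodes` standing in for the membership retrieval; the real `EveryNode` differs in its details.

```go
import (
	"context"
	"reflect"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// everyNodeSketch runs fn against every node in the current membership, then
// re-fetches the membership and retries if it changed underneath us. Only
// when two consecutive retrievals agree do we know no node slipped in
// unprocessed.
func everyNodeSketch(
	ctx context.Context,
	listNodes func(context.Context) ([]roachpb.NodeID, error),
	fn func(context.Context, roachpb.NodeID) error,
) error {
	for {
		nodes, err := listNodes(ctx)
		if err != nil {
			return err
		}
		for _, id := range nodes {
			if err := fn(ctx, id); err != nil {
				return err
			}
		}
		after, err := listNodes(ctx)
		if err != nil {
			return err
		}
		if reflect.DeepEqual(nodes, after) {
			return nil // membership stable: every node has seen fn
		}
	}
}
```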

Release note: None
It's not currently wired up to anything. We'll use it in future PRs to
send out `Migrate` requests to the entire keyspace. This was originally
prototyped in cockroachdb#57445. See the inline comments and the RFC (cockroachdb#48843) for
the motivation here.

Release note: None
@irfansharif irfansharif force-pushed the 201207.migrate-cmd branch 2 times, most recently from 2a72dae to f4606d1 on December 7, 2020 23:37
@irfansharif irfansharif force-pushed the 201207.migrate-cmd branch 2 times, most recently from e0274c0 to 07c10b1 on December 8, 2020 05:10
This command forces all ranges overlapping with the request spans to
execute the (below-raft) migrations corresponding to the specific,
stated version. This has the effect of moving those ranges out of any
legacy mode of operation they may currently be in. KV waits for this
command to durably apply on all the replicas before returning,
guaranteeing to the caller that all pre-migration state has been
completely purged from the system.

We're currently not wiring it up to anything. We will in a future commit
that introduces the truncated state migration. This commit was pulled
out of our prototype in cockroachdb#57445.

Release note: None
We introduce two new RPCs for the migration infrastructure to use.
`GCReplicas` will be used to instruct the target node to process all
GC-able replicas. `FlushAllEngines` will be used to instruct the target
node to persist all in-memory state to disk. Both of these are necessary
primitives for the migration infrastructure.

Specifically, this comes up in the context of wanting to ensure that
ranges over which we've executed a ranged `Migrate` command have no way
of ever surfacing pre-migrated state. This can happen with older
replicas in the replica GC queue and with applied state that is not yet
persisted. We elaborate on both of these below:

Motivation for `GCReplicas`: Currently we wait for the `Migrate` command
to have applied on all replicas of a range before returning to the
caller. This does not include earlier incarnations of the range,
possibly sitting idle in the replica GC queue. These replicas can still
request leases and go through the request evaluation paths, possibly
tripping up assertions that check that no pre-migrated state is found.
For this reason we introduce the `GCReplicas` RPC, which the migration
manager can use to ensure all GC-able replicas are processed before
declaring the specific cluster version bump complete.

Motivation for `FlushAllEngines`: As we mentioned above, KV currently
waits for the `Migrate` command to have applied on all replicas before
returning. The applied state doesn't strictly need to be durably
persisted (the representative version is already stored in the raft
log). Still, out of an abundance of caution, and to really ensure that
no pre-migrated state is ever seen in the system, we provide the
migration manager a mechanism to flush all in-memory state to disk.
This way the manager can guarantee that by the time a specific cluster
version is bumped, all pre-migrated state from before that cluster
version will have been fully purged from the system.

---

The ideas here follow from our original prototype in cockroachdb#57445.
Neither of these RPCs is currently wired up to anything. That'll happen
in a future commit introducing the raft truncated state migration.

Release note: None
Member

@tbg tbg left a comment


Reviewed 1 of 1 files at r1, 1 of 1 files at r2, 1 of 3 files at r6, 5 of 7 files at r7, 4 of 4 files at r9.


pkg/kv/kvserver/store.go, line 2772 at r11 (raw file):

		interceptor()
	}
	return forceScanAndProcess(s, s.replicaGCQueue.baseQueue)

This isn't good enough. We need them to actually be processed (which maybeAdd below doesn't do). This is why I initially suggested adding the version to the replica state: we could then force only the replicas that have an old version through the queue (knowing they would very likely be GC'ed). As is, we are forced to process all replicas on the node, which seems excessive for what it is (that's number-of-replicas reads of the meta ranges, so this could easily go into the 100ks).

I think it is still worthwhile to add the roachpb.Version to the replica state, and then have the impl here look at all of the replicas and actually force the ones with an old version through the queue. Also open to better ideas.
I think this would play out nicely though. We would also solve the problem of old snapshots. Roughly we would do:

  1. Migrate keyspace to version X
  2. GCReplicas:
  • persists a StoreMinIncomingSnapshotVersion = X, which prevents any new snapshots at versions <X from being applied (with strong ordering, i.e. serialized with snapshot application)
  • forces all replicas at versions <X through the queue

GCReplicas should probably have a different name, like PurgeOutdatedReplicas, reflecting its more holistic task.

Interestingly, the StoreMinIncomingSnapshotVersion is almost like a wrapped version step. We could possibly get things to a more modular place if a below-raft migration got split into two:

  1. run Migrate command
  2. run GCReplicas command (still need the version on the state)

The snapshot prevention would be implicit: we'd have something like

```go
func onApplySnapshot() error {
  // Once the cluster version associated with the Migrate command is active,
  // refuse any incoming snapshot that could carry pre-migrated state.
  if version.IsActive(VersionWhichHadTheMigrate) {
    return refuseSnap()
  }
  // ...
  return nil
}
```

This isn't possible in a single step, because the version would only be rolled out at the very end, so there would be a gap for snapshots to slip in. In effect, the two phases echo what we do elsewhere to make things safe: the first phase introduces the new behavior, the second phase removes the old behavior.

@tbg tbg self-requested a review December 8, 2020 12:31
Contributor Author

@irfansharif irfansharif left a comment




pkg/kv/kvserver/store.go, line 2772 at r11 (raw file):

We need them to actually be processed (which maybeAdd below doesn't do)

But we're not just enqueuing these ranges; we're ensuring the queue is fully drained before returning. See the usage of DrainQueue within the RPC code path. As I type this out, I realize we don't actually need to enqueue any ranges: we can simply drain the replica GC queue before returning to the migration manager.

I'm not opposed to adding a roachpb.Version to the replica state, but I didn't realize (?) it was necessary. After the application of the Migrate command, given that the MLAI moves forward and incoming snapshots never roll back past that point, I'm not sure I follow why we'd need additional ordering for the snapshot application? But attaching an active version along with the snapshot SGTM, and should be easy enough to do.

Member

@tbg tbg left a comment




pkg/kv/kvserver/store.go, line 2772 at r11 (raw file):

Previously, irfansharif (irfan sharif) wrote…

We need them to actually be processed (which maybeAdd below doesn't do)

But we're not just enqueuing these ranges; we're ensuring the queue is fully drained before returning. See the usage of DrainQueue within the RPC code path. As I type this out, I realize we don't actually need to enqueue any ranges: we can simply drain the replica GC queue before returning to the migration manager.

I'm not opposed to adding a roachpb.Version to the replica state, but I didn't realize (?) it was necessary. After the application of the Migrate command, given that the MLAI moves forward and incoming snapshots never roll back past that point, I'm not sure I follow why we'd need additional ordering for the snapshot application? But attaching an active version along with the snapshot SGTM, and should be easy enough to do.

Draining a queue processes the replicas in it, but not the ones that never got added. So if maybeAdd doesn't queue a replica, DrainQueue won't force it through the queue.

The MLAI argument works for initialized ranges, but what about ones getting their first snapshot? For example, if a learner is being added while the migrate command is rolled out, the learner might be getting a snapshot streamed that precedes the migrate command.

@tbg tbg requested review from tbg December 8, 2020 14:56
Member

@tbg tbg left a comment




pkg/kv/kvserver/store.go, line 2772 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Draining a queue processes the replicas in it, but not the ones that never got added. So if maybeAdd doesn't queue a replica, DrainQueue won't force it through the queue.

The MLAI argument works for initialized ranges, but what about ones getting their first snapshot? For example, if a learner is being added while the migrate command is rolled out, the learner might be getting a snapshot streamed that precedes the migrate command.

Btw: even if maybeAdd does queue the replica, doing so might evict other replicas from the queue, as the queue size is capped.

@irfansharif
Contributor Author

I've ended up pretty much scrapping this entire approach in favor of tracking versions in replicas; I'll send that out in another PR.
