
*: introduce the kv Migrate command, and GCReplicas/FlushAllEngines RPCs #57662

Closed
wants to merge 11 commits

Conversation

irfansharif
Contributor

The Migrate command forces all ranges overlapping with the request
spans to execute the (below-raft) migrations corresponding to the
specific, stated version. This has the effect of moving those ranges out
of any legacy mode of operation they may currently be in. KV waits for
this command to durably apply on all replicas before returning,
guaranteeing to the caller that all pre-migration state has been
completely purged from the system.
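
To make the shape of this concrete, here's a hedged sketch of how a caller might issue the ranged command. The request type and its fields below are assumptions based on the description above, not necessarily the final API.

```go
import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/kv"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// migrateSpan (hypothetical) issues a ranged Migrate over sp, targeting
// version v. The exact MigrateRequest fields are assumptions for illustration.
func migrateSpan(ctx context.Context, db *kv.DB, sp roachpb.Span, v roachpb.Version) error {
	b := &kv.Batch{}
	b.AddRawRequest(&roachpb.MigrateRequest{
		RequestHeader: roachpb.RequestHeader{Key: sp.Key, EndKey: sp.EndKey},
		Version:       v,
	})
	// The batch is expected to return only once the migrations have durably
	// applied on all replicas of all overlapping ranges.
	return db.Run(ctx, b)
}
```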

We also introduce two new RPCs for the migration infrastructure to use.
GCReplicas will be used to instruct the target node to process all
GC-able replicas. FlushAllEngines will be used to instruct the target
node to persist all in-memory state to disk. Both of these are necessary
primitives for the migration infrastructure.
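
As a rough illustration (not the actual protobuf service definition), the capability these two RPCs give the migration infrastructure amounts to something like the following client-side interface; the names and signatures here are assumptions.

```go
import "context"

// migrationClient is a hypothetical client-side view of the two new RPCs.
type migrationClient interface {
	// GCReplicas asks the target node to process all of its GC-able replicas.
	GCReplicas(ctx context.Context) error
	// FlushAllEngines asks the target node to persist all in-memory state
	// held in its storage engines to disk.
	FlushAllEngines(ctx context.Context) error
}
```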

Specifically, this comes up in the context of wanting to ensure that
ranges over which we've executed a ranged Migrate command have no way
of ever surfacing pre-migrated state. This can happen with older
replicas in the replica GC queue and with applied state that is not yet
persisted. We elaborate on both of these below:

Motivation for GCReplicas: Currently we wait for the Migrate command to
have applied on all replicas of a range before returning to the caller.
This does not include earlier incarnations of the range, possibly
sitting idle in the replica GC queue. These replicas can still request
leases and go through the request evaluation paths, possibly tripping up
assertions that check that no pre-migrated state is found. For this
reason we introduce the GCReplicas RPC, which the migration manager can
use to ensure all GC-able replicas are processed before declaring the
specific cluster version bump complete.

Motivation for FlushAllEngines: As we mentioned above, KV currently
waits for the Migrate command to have applied on all replicas before
returning. The applied state doesn't strictly need to be durably
persisted (the representative version is already stored in the raft
log). Still, out of an abundance of caution, and to really ensure that
no pre-migrated state is ever seen in the system, we provide the
migration manager a mechanism to flush all in-memory state to disk.
This way the manager can guarantee that by the time a specific cluster
version is bumped, all pre-migrated state from before that cluster
version will have been fully purged from the system.

The ideas here follow from our original prototype in #57445. Neither of
these RPCs nor the Migrate command is currently wired up to anything.
That'll happen in a future PR introducing the raft truncated state
migration.


Only the last two commits here are of interest. All prior commits are
from #57650 and #57637 respectively.

@irfansharif irfansharif requested a review from a team as a code owner December 7, 2020 20:46
@cockroach-teamcity
Member

This change is Reviewable

Makes for better logging.

Release note: None
We can separate out the `Helper`, `Migration`, and various utilities
into their own files. We'll add tests for individual components in
future commits; the physical separation here sets the foundation for
doing so (prototyped in cockroachdb#57445). This commit is purely code movement.

Release note: None
It's clearer to talk explicitly in terms of causality.

Release note: None
We re-define the Migration type to be able to annotate it with a
description. We'll later use this description when populating the
`system.migrations` table (originally prototyped in cockroachdb#57445).
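
For illustration, the re-defined type might look roughly like the sketch below; the field names here are assumptions, not the actual definition.

```go
// Migration (sketch): the work to run for a given cluster version, annotated
// with a human-readable description.
type Migration struct {
	// cv is the cluster version this migration is associated with.
	cv clusterversion.ClusterVersion
	// fn is the work the migration carries out.
	fn func(context.Context, *Helper) error
	// desc is a human-readable description, later used to populate
	// the system.migrations table.
	desc string
}
```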

Release note: None
We make it a bit more ergonomic (this revision was originally
prototyped in cockroachdb#57445).

Release note: None
To facilitate testing `Helper` in isolation, we introduce a `cluster`
interface that we'll mock out in tests. It's through this interface that
the migration infrastructure will be able to dial out to a specific node,
grab hold of a kv.DB instance, and retrieve the current cluster
membership.
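
A minimal sketch of what such an interface could look like, under assumed method names (the real one may well differ):

```go
// cluster (sketch) abstracts the pieces of the running cluster the migration
// infrastructure needs, so tests can swap in a mock implementation.
type cluster interface {
	// Nodes returns the current cluster membership.
	Nodes(ctx context.Context) ([]roachpb.NodeID, error)
	// Dial returns a gRPC connection to the given node.
	Dial(ctx context.Context, id roachpb.NodeID) (*grpc.ClientConn, error)
	// DB returns a handle to a kv.DB instance.
	DB() *kv.DB
}
```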

Part of this diff also downgrades `RequiredNodes` from being a first-class
primitive, instead tucking it away for internal use only. Given that
retrieving the cluster membership makes no guarantees about new nodes
being added to the cluster, it's entirely possible for that to happen
concurrently with the retrieval. Appropriate usage then entails wrapping
it in a stabilizing loop, like we do in `EveryNode`. This tells us there's
no need to expose it directly to migration authors.
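
Here's a hedged sketch of that stabilizing loop, with `listNodes` standing in for the membership retrieval; the real `EveryNode` differs in its details.

```go
import (
	"context"
	"reflect"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// everyNodeSketch runs fn against every node in the current membership, then
// re-fetches the membership and retries if it changed underneath us. Only
// when two consecutive retrievals agree do we know no node slipped in
// unprocessed.
func everyNodeSketch(
	ctx context.Context,
	listNodes func(context.Context) ([]roachpb.NodeID, error),
	fn func(context.Context, roachpb.NodeID) error,
) error {
	for {
		nodes, err := listNodes(ctx)
		if err != nil {
			return err
		}
		for _, id := range nodes {
			if err := fn(ctx, id); err != nil {
				return err
			}
		}
		after, err := listNodes(ctx)
		if err != nil {
			return err
		}
		if reflect.DeepEqual(nodes, after) {
			return nil // membership stable: every node has seen fn
		}
	}
}
```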

Release note: None
It's not currently wired up to anything. We'll use it in future PRs to
send out `Migrate` requests to the entire keyspace. This was originally
prototyped in cockroachdb#57445. See the inline comments and the RFC (cockroachdb#48843) for
the motivation here.

Release note: None
@irfansharif irfansharif force-pushed the 201207.migrate-cmd branch 2 times, most recently from 2a72dae to f4606d1 on December 7, 2020 23:37
@irfansharif irfansharif force-pushed the 201207.migrate-cmd branch 2 times, most recently from e0274c0 to 07c10b1 on December 8, 2020 05:10
This command forces all ranges overlapping with the request spans to
execute the (below-raft) migrations corresponding to the specific,
stated version. This has the effect of moving those ranges out of any
legacy mode of operation they may currently be in. KV waits for this
command to durably apply on all the replicas before returning,
guaranteeing to the caller that all pre-migration state has been
completely purged from the system.

We're currently not wiring it up to anything. We will in a future commit
that introduces the truncated state migration. This commit was pulled
out of our prototype in cockroachdb#57445.

Release note: None
We introduce two new RPCs for the migration infrastructure to use.
`GCReplicas` will be used to instruct the target node to process all
GC-able replicas. `FlushAllEngines` will be used to instruct the target
node to persist all in-memory state to disk. Both of these are necessary
primitives for the migration infrastructure.

Specifically, this comes up in the context of wanting to ensure that
ranges over which we've executed a ranged `Migrate` command have no way
of ever surfacing pre-migrated state. This can happen with older
replicas in the replica GC queue and with applied state that is not yet
persisted. We elaborate on both of these below:

Motivation for `GCReplicas`: Currently we wait for the `Migrate` command
to have applied on all replicas of a range before returning to the
caller. This does not include earlier incarnations of the range,
possibly sitting idle in the replica GC queue. These replicas can still
request leases and go through the request evaluation paths, possibly
tripping up assertions that check that no pre-migrated state is found.
For this reason we introduce the `GCReplicas` RPC, which the migration
manager can use to ensure all GC-able replicas are processed before
declaring the specific cluster version bump complete.

Motivation for `FlushAllEngines`: As we mentioned above, KV currently
waits for the `Migrate` command to have applied on all replicas before
returning. The applied state doesn't strictly need to be durably
persisted (the representative version is already stored in the raft
log). Still, out of an abundance of caution, and to really ensure that
no pre-migrated state is ever seen in the system, we provide the
migration manager a mechanism to flush all in-memory state to disk.
This way the manager can guarantee that by the time a specific cluster
version is bumped, all pre-migrated state from before that cluster
version will have been fully purged from the system.

---

The ideas here follow from our original prototype in cockroachdb#57445.
Neither of these RPCs is currently wired up to anything. That'll happen
in a future commit introducing the raft truncated state migration.

Release note: None
Member

@tbg tbg left a comment


Reviewed 1 of 1 files at r1, 1 of 1 files at r2, 1 of 3 files at r6, 5 of 7 files at r7, 4 of 4 files at r9.


pkg/kv/kvserver/store.go, line 2772 at r11 (raw file):

		interceptor()
	}
	return forceScanAndProcess(s, s.replicaGCQueue.baseQueue)

This isn't good enough. We need them to actually be processed (which maybeAdd below doesn't do). This is why I initially suggested adding the version to the replica state: we could then force only the replicas that have an old version through the queue (knowing they would very likely be GC'ed). As is, we are forced to process all replicas on the node, which seems excessive for what it is (that's number-of-replicas reads of the meta ranges, so this could easily go into the 100ks).

I think it is still worthwhile to add the roachpb.Version to the replica state, and then have the impl here look at all of the replicas and actually force the ones with an old version through the queue. Also open to better ideas.
I think this would play out nicely though. We would also solve the problem of old snapshots. Roughly we would do:

  1. Migrate keyspace to version X
  2. GCReplicas:
  • persists a StoreMinIncomingSnapshotVersion = X, which prevents any new snapshots at versions <X from being applied (with strong ordering, i.e. serialized with snapshot application)
  • forces all replicas at versions <X through the queue

GCReplicas should probably have a different name, like PurgeOutdatedReplicas, reflecting its more holistic task.

Interestingly, the StoreMinIncomingSnapshotVersion is almost like a wrapped version step. We could possibly get things to a more modular place if a below-raft migration got split into two:

  1. run Migrate command
  2. run GCReplicas command (still need the version on the state)

The snapshot prevention would be implicit: we'd have something like

```go
func onApplySnapshot() error {
  // Once the cluster version associated with the Migrate command is active,
  // refuse any incoming snapshot that could carry pre-migrated state.
  if version.IsActive(VersionWhichHadTheMigrate) {
    return refuseSnap()
  }
  // ...
  return nil
}
```

This isn't possible in a single step, because the version would only be rolled out at the very end, so there would be a gap for snapshots to slip in. In effect, the two phases echo what we do elsewhere to make things safe: the first phase introduces the new behavior, the second phase removes the old behavior.

@tbg tbg self-requested a review December 8, 2020 12:31
Contributor Author

@irfansharif irfansharif left a comment




pkg/kv/kvserver/store.go, line 2772 at r11 (raw file):

We need them to actually be processed (which maybeAdd below doesn't do)

But we're not just enqueuing these ranges; we're ensuring the queue is fully drained before returning. See the usage of DrainQueue within the RPC code path. As I type this out, I realize we don't actually need to enqueue any ranges: we can simply drain the replica GC queue before returning to the migration manager.

I'm not opposed to adding a roachpb.Version to the replica state, but I didn't realize (?) it was necessary. After the application of the Migrate command, given that the MLAI moves forward and incoming snapshots never roll back past that point, I'm not sure I follow why we'd need additional ordering for the snapshot application? But attaching an active version along with the snapshot SGTM, and should be easy enough to do.

Member

@tbg tbg left a comment




pkg/kv/kvserver/store.go, line 2772 at r11 (raw file):

Previously, irfansharif (irfan sharif) wrote…

We need them to actually be processed (which maybeAdd below doesn't do)

But we're not just enqueuing these ranges; we're ensuring the queue is fully drained before returning. See the usage of DrainQueue within the RPC code path. As I type this out, I realize we don't actually need to enqueue any ranges: we can simply drain the replica GC queue before returning to the migration manager.

I'm not opposed to adding a roachpb.Version to the replica state, but I didn't realize (?) it was necessary. After the application of the Migrate command, given that the MLAI moves forward and incoming snapshots never roll back past that point, I'm not sure I follow why we'd need additional ordering for the snapshot application? But attaching an active version along with the snapshot SGTM, and should be easy enough to do.

Draining a queue processes the replicas in it, but not the ones that never got added. So if maybeAdd doesn't queue a replica, DrainQueue won't force it through the queue.

The MLAI argument works for initialized ranges, but what about ones getting their first snapshot? For example, if a learner is being added while the migrate command is rolled out, the learner might be getting a snapshot streamed that precedes the migrate command.

@tbg tbg requested review from tbg December 8, 2020 14:56
Member

@tbg tbg left a comment




pkg/kv/kvserver/store.go, line 2772 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Draining a queue processes the replicas in it, but not the ones that never got added. So if maybeAdd doesn't queue a replica, DrainQueue won't force it through the queue.

The MLAI argument works for initialized ranges, but what about ones getting their first snapshot? For example, if a learner is being added while the migrate command is rolled out, the learner might be getting a snapshot streamed that precedes the migrate command.

Btw: even if maybeAdd does queue the replica, doing so might evict other replicas from the queue, as the queue size is capped.

@irfansharif
Contributor Author

I've ended up pretty much scrapping this entire approach in favor of tracking versions in replicas; I'll send that out in another PR.
