*: introduce the kv `Migrate` command, and `GCReplicas`/`FlushAllEngines` RPCs #57662
Conversation
Release note: None
Makes for better logging. Release note: None
We can separate out the `Helper`, `Migration`, and various utilities into their own files. We'll add tests for individual components in future commits; the physical separation here sets the foundation for doing so (prototyped in cockroachdb#57445). This commit is purely code movement. Release note: None
It's clearer to talk explicitly in terms of causality. Release note: None
We re-define the `Migration` type to be able to annotate it with a description. We'll later use this description when populating the `system.migrations` table (originally prototyped in cockroachdb#57445). Release note: None
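For illustration, a minimal sketch of what the annotated type could look like (the field names and the `migrationFn` signature here are assumptions, not the PR's actual code):

```go
package migration

import (
	"context"
	"fmt"

	"github.com/cockroachdb/cockroach/pkg/clusterversion"
)

// migrationFn is the workhorse of a Migration. The signature is an
// assumption for this sketch.
type migrationFn func(context.Context, *Helper) error

// Migration is keyed to a specific cluster version and now also carries a
// human-readable description, which we'll later use when populating the
// system.migrations table.
type Migration struct {
	cv   clusterversion.ClusterVersion
	fn   migrationFn
	desc string
}

func (m Migration) String() string {
	return fmt.Sprintf("migration for %s: %s", m.cv, m.desc)
}
```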
We make it a bit more ergonomic (this revision was originally prototyped in cockroachdb#57445). Release note: None
To facilitate testing `Helper` in isolation, we introduce a `cluster` interface that we'll mock out in tests. It's through this interface that the migration infrastructure will be able to dial out to a specific node, grab hold of a kv.DB instance, and retrieve the current cluster membership. Part of this diff also downgrades `RequiredNodes` from being a first-class primitive, instead tucking it away for internal usage only. Since retrieving the cluster membership makes no guarantees about new nodes being added to the cluster, it's entirely possible for that to happen concurrently with the retrieval. Appropriate usage then entailed wrapping it under a stabilizing loop, as we do in `EveryNode`. This tells us there's no need to expose it directly to migration authors. Release note: None
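A rough sketch of the shape such an interface could take (the exact method set below is an assumption for illustration):

```go
package migration

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/kv"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"google.golang.org/grpc"
)

// cluster mediates access to the CockroachDB cluster on behalf of the
// migration infrastructure; it's what we mock out in tests.
type cluster interface {
	// nodes returns the current cluster membership.
	nodes(ctx context.Context) ([]roachpb.NodeID, error)
	// dial returns a gRPC connection to the given node.
	dial(ctx context.Context, id roachpb.NodeID) (*grpc.ClientConn, error)
	// db provides access to a kv.DB instance.
	db() *kv.DB
}
```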
Release note: None
It's not currently wired up to anything. We'll use it in future PRs to send out `Migrate` requests to the entire keyspace. This was originally prototyped in cockroachdb#57445. See the inline comments and the RFC (cockroachdb#48843) for the motivation here. Release note: None
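As a hedged sketch of how that future wiring might look, assuming a hypothetical paginated `Helper.IterateRangeDescriptors` iterator and a `kv.DB`-level `Migrate` helper (neither is claimed to be this PR's actual API):

```go
package migration

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/clusterversion"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// migrateKeyspace sketches how a later PR could walk the entire keyspace,
// one page of range descriptors at a time, and issue a ranged Migrate
// request over each descriptor's span. IterateRangeDescriptors and db()
// are assumptions for this sketch.
func migrateKeyspace(
	ctx context.Context, h *Helper, target clusterversion.ClusterVersion,
) error {
	const pageSize = 200
	return h.IterateRangeDescriptors(ctx, pageSize, func(descs ...roachpb.RangeDescriptor) error {
		for _, desc := range descs {
			start, end := desc.StartKey.AsRawKey(), desc.EndKey.AsRawKey()
			if err := h.db().Migrate(ctx, start, end, target.Version); err != nil {
				return err
			}
		}
		return nil
	})
}
```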
This command forces all ranges overlapping with the request spans to execute the (below-raft) migrations corresponding to the specific, stated version. This has the effect of moving the ranges out of any legacy mode of operation they may currently be in. KV waits for this command to durably apply on all the replicas before returning, guaranteeing to the caller that all pre-migration state has been completely purged from the system. We're currently not wiring it up to anything. We will in a future commit that introduces the truncated state migration. This commit was pulled out of our prototype in cockroachdb#57445. Release note: None
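For illustration, a client-side sketch of issuing the new ranged request via a raw batch; `roachpb.MigrateRequest` and its `Version` field follow the commit's description, while the surrounding plumbing is an assumption:

```go
package migration

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/kv"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// migrateSpan sketches the client-side shape of the new command: a ranged
// write that only returns once it has durably applied on every replica of
// every range overlapping the span.
func migrateSpan(
	ctx context.Context, db *kv.DB, sp roachpb.Span, version roachpb.Version,
) error {
	b := &kv.Batch{}
	b.AddRawRequest(&roachpb.MigrateRequest{
		RequestHeader: roachpb.RequestHeader{
			Key:    sp.Key,
			EndKey: sp.EndKey,
		},
		Version: version,
	})
	return db.Run(ctx, b)
}
```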
We introduce two new RPCs for the migration infrastructure to use. `GCReplicas` will be used to instruct the target node to process all GC-able replicas. `FlushAllEngines` will be used to instruct the target node to persist all in-memory state to disk. Both of these are necessary primitives for the migration infrastructure. Specifically, this comes up in the context of wanting to ensure that ranges over which we've executed a ranged `Migrate` command have no way of ever surfacing pre-migrated state. This can happen with older replicas in the replica GC queue and with applied state that is not yet persisted. We elaborate on both of these below.

Motivation for `GCReplicas`: Currently we wait for the `Migrate` command to have applied on all replicas of a range before returning to the caller. This does not include earlier incarnations of the range, possibly sitting idle in the replica GC queue. These replicas can still request leases, and go through the request evaluation paths, possibly tripping up assertions that check that no pre-migrated state is found. For this reason we introduce the `GCReplicas` RPC, which the migration manager can use to ensure all GC-able replicas are processed before declaring the specific cluster version bump complete.

Motivation for `FlushAllEngines`: As mentioned above, KV currently waits for the `Migrate` command to have applied on all replicas before returning. As for the applied state, there's no necessity to durably persist it (the representative version is already stored in the raft log). Out of an abundance of caution, and to really, really ensure that no pre-migrated state is ever seen in the system, we provide the migration manager a mechanism to flush out all in-memory state to disk. This way the manager can guarantee that by the time a specific cluster version is bumped, all pre-migrated state from prior to that cluster version will have been fully purged from the system.

---

The ideas here follow from our original prototype in cockroachdb#57445. Neither of these RPCs is currently wired up to anything. That'll happen in a future commit introducing the raft truncated state migration. Release note: None
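A hedged sketch of how the migration manager might eventually fan these RPCs out to every node, reusing the `cluster` interface sketched earlier; the `serverpb` service, client, and message names below are assumptions:

```go
package migration

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/server/serverpb"
)

// flushAndGCEverywhere sketches the manager-side fan-out of the two new
// RPCs before declaring a version bump complete.
func flushAndGCEverywhere(ctx context.Context, c cluster) error {
	ids, err := c.nodes(ctx)
	if err != nil {
		return err
	}
	for _, id := range ids {
		conn, err := c.dial(ctx, id)
		if err != nil {
			return err
		}
		client := serverpb.NewMigrationClient(conn)
		// Have the node process all of its GC-able replicas...
		if _, err := client.GCReplicas(ctx, &serverpb.GCReplicasRequest{}); err != nil {
			return err
		}
		// ...and then flush all in-memory engine state out to disk.
		if _, err := client.FlushAllEngines(ctx, &serverpb.FlushAllEnginesRequest{}); err != nil {
			return err
		}
	}
	return nil
}
```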
Reviewed 1 of 1 files at r1, 1 of 1 files at r2, 1 of 3 files at r6, 5 of 7 files at r7, 4 of 4 files at r9.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @knz, @tbg, and @TheSamHuang)
pkg/kv/kvserver/store.go, line 2772 at r11 (raw file):
```go
	interceptor()
}
return forceScanAndProcess(s, s.replicaGCQueue.baseQueue)
```
This isn't good enough. We need them to actually be processed (which maybeAdd
below doesn't do). This is why I initially suggested adding the version to the replica state: we could then force only the replicas that have an old version through the queue (knowing they would very likely be GC'ed). As is, we are forced to process all replicas on the node, which seems excessive for what it is (that's number-of-replicas reads of the meta ranges, so could easily go into the 100ks)
I think it is still worthwhile to add the roachpb.Version
to the replica state, and then have the impl here look at all of the replicas and actually force the ones with an old version through the queue. Also open to better ideas.
I think this would play out nicely though. We would also solve the problem of old snapshots. Roughly we would do:
- Migrate keyspace to version X
- GCReplicas:
  - persists a `StoreMinIncomingSnapshotVersion = X`, which prevents any new snapshots from <X from being applied (with strong ordering, i.e. serialized with snapshot application)
  - forces all replicas at versions <X through the queue
`GCReplicas` should probably have a different name, like `PurgeOutdatedReplicas`, reflecting its more holistic task.
Interestingly, the `StoreMinIncomingSnapshotVersion` is almost like a wrapped version step. We could possibly get things to a more modular place if a below-raft migration got split into two:
- run Migrate command
- run GCReplicas command (still need the version on the state)
The snapshot prevention would be implicit: we'd have something like
```go
func onApplySnapshot() error {
	if version.IsActive(VersionWhichHadTheMigrate) {
		return refuseSnap()
	}
	// ...
}
```
This isn't possible in a single step, because the version would only be rolled out at the very end, so there would be a gap for snapshots to slip in. In effect, the two phases echo what we do elsewhere to make things safe: the first phase introduces the new behavior, the second phase removes the old behavior.
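(Purely as an illustration of the two-phase idea above, a hedged sketch; `setMinIncomingSnapshotVersion`, `Replica.Version`, and `forceReplicaGC` are all hypothetical names, not code from this PR:)

```go
package kvserver

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"golang.org/x/sync/errgroup"
)

// purgeOutdatedReplicas sketches the two-phase shape discussed above.
func (s *Store) purgeOutdatedReplicas(ctx context.Context, version roachpb.Version) error {
	// Phase 1: persist StoreMinIncomingSnapshotVersion = version, refusing
	// any future snapshots that predate it. This must be serialized with
	// snapshot application to avoid leaving a gap.
	if err := s.setMinIncomingSnapshotVersion(ctx, version); err != nil {
		return err
	}

	// Phase 2: force only the replicas still below `version` through the
	// replica GC queue (most of them will be GC'ed), and wait for them all.
	var g errgroup.Group
	newStoreReplicaVisitor(s).Visit(func(repl *Replica) bool {
		if repl.Version().Less(version) {
			g.Go(func() error {
				return s.forceReplicaGC(ctx, repl)
			})
		}
		return true // keep visiting
	})
	return g.Wait()
}
```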
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @tbg, and @TheSamHuang)
pkg/kv/kvserver/store.go, line 2772 at r11 (raw file):
> We need them to actually be processed (which maybeAdd below doesn't do)
But we're not just enqueuing these ranges, we're ensuring the queue is fully drained before returning. See the usage of DrainQueue within the RPC code path. As I type this out, we don't actually need to enqueue any ranges; we can simply drain the replica GC queue before returning to the migration manager.
I'm not opposed to adding a roachpb.Version to the replica state, but I didn't realize (?) it was necessary. After the application of the Migrate command, given the MLAI index moves forward and incoming snapshots never roll back past that point, I'm not sure I follow why we'd need additional ordering for the snapshot application? But attaching an active version along with the snapshot SGTM, and should be easy enough to handle.
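(For illustration of the handler being discussed, a hedged sketch; the receiver, message types, and `DrainReplicaGCQueue` helper are assumptions rather than the PR's actual wiring:)

```go
package server

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/kv/kvserver"
	"github.com/cockroachdb/cockroach/pkg/server/serverpb"
)

// GCReplicas sketches the node-side handler: drain the replica GC queue on
// each store before acking the migration manager.
func (n *Node) GCReplicas(
	ctx context.Context, _ *serverpb.GCReplicasRequest,
) (*serverpb.GCReplicasResponse, error) {
	err := n.stores.VisitStores(func(s *kvserver.Store) error {
		// Note the caveat raised in this thread: draining only processes
		// replicas that actually made it into the queue.
		s.DrainReplicaGCQueue(ctx)
		return nil
	})
	return &serverpb.GCReplicasResponse{}, err
}
```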
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @knz, @tbg, and @TheSamHuang)
pkg/kv/kvserver/store.go, line 2772 at r11 (raw file):
Previously, irfansharif (irfan sharif) wrote…

> > We need them to actually be processed (which maybeAdd below doesn't do)
>
> But we're not just enqueuing these ranges, we're ensuring the queue is fully drained before returning. See the usage of DrainQueue within the RPC code path. As I type this out, we don't actually need to enqueue any ranges; we can simply drain the replica GC queue before returning to the migration manager. I'm not opposed to adding a roachpb.Version to the replica state, but I didn't realize (?) it was necessary. After the application of the Migrate command, given the MLAI index moves forward and incoming snapshots never roll back past that point, I'm not sure I follow why we'd need additional ordering for the snapshot application? But attaching an active version along with the snapshot SGTM, and should be easy enough to handle.
Draining a queue processes the replicas in it, but not the ones that never got added. So if maybeAdd doesn't queue a replica, DrainQueue won't force it through the queue.
The MLAI argument works for initialized ranges, but what about ones getting their first snapshot? For example, if a learner is being added while the migrate command is rolled out, the learner might be getting a snapshot streamed that precedes the migrate command.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @knz, @tbg, and @TheSamHuang)
pkg/kv/kvserver/store.go, line 2772 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…

> Draining a queue processes the replicas in it, but not the ones that never got added. So if maybeAdd doesn't queue a replica, DrainQueue won't force it through the queue.
>
> The MLAI argument works for initialized ranges, but what about ones getting their first snapshot? For example, if a learner is being added while the migrate command is rolled out, the learner might be getting a snapshot streamed that precedes the migrate command.
btw: and even if maybeAdd does queue the replica, doing so might evict other replicas from the queue as the size is capped.
I've ended up pretty much scrapping this entire approach in favor of tracking versions in replicas; sending out in another PR.
The `Migrate` command forces all ranges overlapping with the request spans to execute the (below-raft) migrations corresponding to the specific, stated version. This has the effect of moving the ranges out of any legacy mode of operation they may currently be in. KV waits for this command to durably apply on all the replicas before returning, guaranteeing to the caller that all pre-migration state has been completely purged from the system.
We also introduce two new RPCs for the migration infrastructure to use. `GCReplicas` will be used to instruct the target node to process all GC-able replicas. `FlushAllEngines` will be used to instruct the target node to persist all in-memory state to disk. Both of these are necessary primitives for the migration infrastructure.
Specifically, this comes up in the context of wanting to ensure that ranges over which we've executed a ranged `Migrate` command have no way of ever surfacing pre-migrated state. This can happen with older replicas in the replica GC queue and with applied state that is not yet persisted. We elaborate on both of these below:
Motivation for `GCReplicas`: Currently we wait for the `Migrate` command to have applied on all replicas of a range before returning to the caller. This does not include earlier incarnations of the range, possibly sitting idle in the replica GC queue. These replicas can still request leases, and go through the request evaluation paths, possibly tripping up assertions that check that no pre-migrated state is found. For this reason we introduce the `GCReplicas` RPC, which the migration manager can use to ensure all GC-able replicas are processed before declaring the specific cluster version bump complete.
Motivation for `FlushAllEngines`: As we mentioned above, KV currently waits for the `Migrate` command to have applied on all replicas before returning. As for the applied state, there's no necessity to durably persist it (the representative version is already stored in the raft log). Out of an abundance of caution, and to really, really ensure that no pre-migrated state is ever seen in the system, we provide the migration manager a mechanism to flush out all in-memory state to disk. This way the manager can guarantee that by the time a specific cluster version is bumped, all pre-migrated state from prior to that cluster version will have been fully purged from the system.
The ideas here follow from our original prototype in #57445. Neither of these RPCs nor the `Migrate` command is currently wired up to anything. That'll happen in a future PR introducing the raft truncated state migration.
Only the last two commits here are of interest. All prior commits are
from #57650 and #57637 respectively.