Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

*: long-running migrations #39182

Closed
danhhz opened this issue Jul 30, 2019 · 25 comments
Closed

*: long-running migrations #39182

danhhz opened this issue Jul 30, 2019 · 25 comments
Assignees
Labels
A-kv Anything in KV that doesn't belong in a more specific category. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@danhhz
Copy link
Contributor

danhhz commented Jul 30, 2019

Backward compatibility is a necessary part of storage infrastructure, but we'd like to minimize the associated technical debt. Initially our migrations were one-off, required complicated reasoning, and under tested (example: the format version stuff that was added for column families and interleaved tables). Over time, we've added frameworks to help with this, but there's one notable gap in our story: long-running data migrations.

An example that's come up a number of times is rewriting all table descriptors in kv. An extreme example that we may have to do one day might be rewriting all data on disk.

Note that the format versions mentioned above do a mini-migration on each table descriptor as it enters the node (from kv, gossip, etc). Nothing guarantees that all table descriptors eventually get rewritten. So even though this has been around since before 1.0, the format version migrations have to stay around as permanent technical debt. The FK migration currently underway will have a similar problem.

The format version code is a small amount of debt, but it'd be nice to get rid of it. Other examples are not so simple. The learner replicas work in 19.2 replaces preemptive snapshots, but after we stop using preemptive snapshots, we need to completely flush them out of the system before the code can be deleted. One of these places is an interim state between when a preemptive snapshot has been written to rocksdb and when raft has caught up enough to make it into a real replica. To flush these out, after we stop sending preemptive snapshots, we'll need to iterate each range descriptor in kv and ensure that it is finished being added or is removed.

More examples from @tbg:

  • GenerationComparable (make sure that 20.1 can assume that there are no legacy generations out there) - not yet sure how exactly this is achieved but should be doable
  • the various Raft migrations (RangeAppliedState, unreplicated truncated state, etc) which all boil down to "run something through Raft until they're all active on all ranges but 99.99% sure they're all active already anyway"

We currently have two main frameworks for migrations. They go by different names in various places, but I'll call them startup migrations and cluster versions.

Startup Migrations

Startup migrations (in package pkg/sqlmigrations and called "Cluster Upgrade Tool" by the RFC) are used to ensure that some backwards-compatible hook is run before the code that needs it. An example of this is adding a new system table. A node that doesn't know about a system table simply doesn't use it, so it's safe to add one in a mixed version cluster.

When the first node of the new version starts up and tries to join the cluster, it notices that the migration hasn't run, runs it, and saves the result durably in kv. Any nodes that start after see that the migration has been run and skip it.

If a second node of the new version tries to join the cluster before the migration has completed, it blocks until the migration finishes. This means that startup migrations need to be relatively quick. In an ideal world, every user would do version upgrades with a rolling restart, waiting for each node to be healthy and settled before moving on. But it's not Making Data Easy if a non-rolling restart blocks every node on some extremely long startup migration, causing a total outage.

On the other hand, by running at startup, all code in the new version can assume that startup migrations have been run. This means there doesn't need to be a fallback path and that any old code can be deleted immediately.

Cluster Versions

Cluster versions are a special case of our gossip-based configuration settings. They allow for backward-incompatible behaviors to be gated by a promise from the user that they'll never downgrade. An example of this is adding a field to an existing rpc request protobuf. A node doesn't want to send such a request until it is sure that it will be handled correctly on the other end. Because of protobuf semantics, if it went to an old node, the field would be silently ignored.

Cluster versions tie together two concepts. First, a promise to the user that we don't have to worry about rpc or disk compatibility with old versions anymore. Second, a feature gate that is unlocked by that promise. There is an ordered list of these, one for each feature being gated.

Because the feature gate is initially off when a new version is rolled onto, each check of the gate needs to have a fallback. New features can return an error informing the user to finish the upgrade and bump the version, but other uses need to keep old code around for a release cycle.

Aside: To make it easier for users that don't want the control, the cluster version is automatically bumped some period of time after all nodes have been upgraded. A setting is available to disable this for users that want to keep the option to roll back until they've manually signed off on the new version.

Data Migrations

Summary: I propose a partial unification of these two systems. To avoid having three migration frameworks, they will be based on and replace Cluster Versions. Separate the two ClusterVersion concepts described above so that we can execute potentially long-running hooks in between.

The interface presented to the user is essentially the same, but slightly expanded. After rolling onto a new version of cockroach, some features are not available until the cluster version is bumped. Either the user does this manually or the auto-upgrade does. Instead of the version bump finishing immediately, a system job is created to do the upgrade, enabling the feature gates as it progresses. This is modeled as a system job both to make sure only one coordinator is running as well as exposing progress to the user. Once the job finishes (technically as each step finishes) the gated features become available to the user.

The interface presented to CockroachDB developers is also mostly the same. Each version in the list is now given the option of including a hook, which is allowed to take a while (TODO be more specific) to finish.

Details:

  • The hook, if given, runs after the feature gate is enabled. This simple building block can be used to build more complex migrations by using multiple gate+hook features.
  • To make it simpler to reason about the code that goes in the hook, the system guarantees that the gate has been pushed to every node in the system. That is, every IsActive(feature) check will always return true on every node from now on. This was not previously guaranteed and the implementation is mostly tricky around handling nodes that are unavailable when the gate is pushed.
  • Like startup migrations, the hook must be idempotent.
  • Complications such as flushing caches, etc are left to the hooks. If any compelling patterns emerge, they could be baked into the framework in the future.
  • Startup migrations then conceptually become very similar to this. They currently implement their own coordinator leasing system and there is an opportunity for code unification by moving them to this new system job type.

Side node: This is all very nearly the long-term vision laid out in https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20161024_cluster_upgrade_tool.md

Note to self: https://reviewable.io/reviews/cockroachdb/cockroach/38932#-LlaULyp9sd2JS_tyaKi:-LlaULyp9sd2JS_tyaKj:bfh2kib

@danhhz danhhz added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv Anything in KV that doesn't belong in a more specific category. labels Jul 30, 2019
@danhhz danhhz self-assigned this Jul 30, 2019
@danhhz danhhz changed the title CockroachDB Data Migrations *: long-running migrations Jul 30, 2019
@danhhz
Copy link
Contributor Author

danhhz commented Jul 30, 2019

@tbg Does this accurately reflect our conversation? @bdarnell This is still pretty handwave-y, but I'd love some initial feedback if you have a minute.

@tbg
Copy link
Member

tbg commented Jul 31, 2019

@danhhz yes it does!
Not sure what your thinking here was, but feels that there doesn't have to be anything special about "major" versions (those corresponding to releases) except that by convention the major version usually doesn't have a hook associated to it (because this hook won't have run by the time the major version is active, so it's better put in an earlier hook). For example, if we have the versions

19.1 (active)
19.1-1
19.1-2
19.1-3

(i.e. we're in the 19.2 cycle, and some migrations exist, but no 19.2 release version yet) then the data migration mechanism would move from 19.1-0 (the initial version) to 19.1-1, 19.1-2, 19.1-3 as you'd expect. Later if, that cluster got restarted into a binary that also has the release version 19.2 (=19.2-0), it'd move to that. I'm not sure whether the automatic upgrade mechanism does this today, but it should (and I hope it already does).

Were you planning to have an individual job for each transition (19.1-1 to 19.1-2 for example, and we only use jobs for transitions that have a hook)? This also would let us account for the duration of each step and make obvious to the user where we're at. This seems both most straightforward and conceptually easiest to me, better than trying to group multiple steps into one job.

As an aside, we're also planning to move to "negative" unstable versions at some point, so the above history would have taken the form

19.2-(-999)
19.2-(-998)
19.2-(-997)
19.2 (=19.2-0)

instead, so that a 19.2-alpha won't pretend to be compatible with a 19.1 release. I don't think that affects your plans here at all since in both you just move "step by step" through the unstable versions in the order in which they appear (and sort).

@bdarnell
Copy link
Contributor

The hook, if given, runs after the feature gate is enabled.

What happens at the end of the hook? That is, how does the code and/or the admin learn that everything is finished and it's possible to move on to the next version? Do we now gossip two cluster versions, one tracking the feature gates and one for the migration completion?

@danhhz
Copy link
Contributor Author

danhhz commented Jul 31, 2019

The admin runs one command that triggers a job for all necessary migrations. I've been imagining that the job steps through each version. For each step, it first communicates the version to the code, unblocking feature gates (I'd like to have stronger guarantees about this than we've had in the past, how this works is the part that is currently most handwave-y), then runs the hook, then moves on to the next one. Progress would be communicated to the user via the job. I don't think two versions are necessary.

@bdarnell
Copy link
Contributor

So then the new step in the upgrade process will be something like "check the jobs page to make sure the migration job from the previous upgrade has finished before starting a new upgrade"? We should also consider scripting for this for embedded users who need to do upgrades in a hands-off way (and might bundle multiple upgrades in a relatively short time frame, and/or have pathological conditions that could cause migrations to get hung for a very long time).

@danhhz
Copy link
Contributor Author

danhhz commented Jul 31, 2019

Given that this unblocks the shiny new features as it goes, I was imagining the upgrade docs would have a step at the end directing the user to monitor the job until it finishes. But you're right that we should mention it at the beginning, too, just in case.

I'm in favor of making it scriptable. There's some bikeshedding to be done about whether it's one SQL command that blocks for the entire time or one to kick off the job and one to wait on it (IIRC the bulk io team has been favoring the latter recently).

One blocking command (strawman, ignore the syntax):

> COMMIT TO UPGRADE;
<waits a while>

Two commands (strawman, ignore the syntax):

> COMMIT TO UPGRADE;
job_id
12345
> BLOCK ON JOB 12345;
<waits a while>

We could also wrap this SQL in a cli tool if it makes it easier for embedded uses.

I share your concerns about multiple upgrades in a short time frame. In fact, this design was largely worked backward by starting with these assumptions: a) we'll need to build long-running migrations at some point b) we should make it really hard to mess up when upgrading one version and c) there should be some sane story for upgrading multiple versions.

I'm not entirely happy with how much this achieves (c) but at some point we fundamentally have to choose between blocking on startup if they're going too quickly or requiring long-running migrations to work while spanning multiple major versions (which makes them less useful for eliminating tech debt). I should make this more explicit above, but my current thinking is that if you start on major version X, roll to Y, then start rolling to Z without the Y upgrade being committed, then the Z nodes would block on startup.

Hung migrations is a good concern, I hadn't thought of it. I'm not sure what we could do besides putting something in whatever alerting we build and your idea to document checking in on it at the beginning of the next upgrade. Any opinions here?

@danhhz
Copy link
Contributor Author

danhhz commented Jul 31, 2019

Nathan pointed out in our 1:1 that my thinking around nodes rejoining after being partitioned off when a feature gate bump is pushed out is probably too harsh. @nvanbenschoten do you mind writing up your concern here so I'm sure I have it right?

@bdarnell
Copy link
Contributor

bdarnell commented Aug 1, 2019

I like the two command "block on job" variant because it's extensible to other kinds of jobs. (syntax strawman: SHOW UPGRADE STATUS instead of COMMIT TO UPGRADE. Don't use the verb COMMIT for something other than transaction commits)

then start rolling to Z without the Y upgrade being committed, then the Z nodes would block on startup.

Maybe the Z nodes should crash instead of blocking, which is more likely to trigger rolling downgrades in the deployment tooling. Either crashing or blocking is probably the best we can do here. Blocking is only better than crashing if we can be confident that some of the Y nodes will stick around to complete the process.

I'm not sure what we could do besides putting something in whatever alerting we build and your idea to document checking in on it at the beginning of the next upgrade. Any opinions here?

I don't think there's much we can do besides documenting it and providing tools to check the status and wait.

@danhhz
Copy link
Contributor Author

danhhz commented Aug 15, 2019

I whiteboarded the details of this with @tbg today and we realized that a lot of the scope I was hoping to cut to get this into 19.2 is not able to be cut. So, I now think this is a long shot. I’m going to keep working on the prototype and see where I get

@danhhz
Copy link
Contributor Author

danhhz commented Sep 24, 2019

Note to self: another possible use of this from @ajkr

Also I forgot to mention this but Pebble might need to be compatible with multiple RocksDB versions, simply because upgrading to 20.1 doesn't necessarily mean the data was rewritten with Pebble by the time a user upgrades to 20.2. A full compaction would be needed to guarantee that. In fact maybe they turn on Pebble in 20.1 and it sees data written by whatever RocksDB version we were using in 1.0. Is there anything preventing this?

@irfansharif
Copy link
Contributor

Another possible use case for this would be the raft storage engine (#7807). If we’re introducing a dedicated storage engine for “raft stuff”, there has to be a cut over point for each node running a crdb version with this dedicated engine code to scoop up all existing raft data from the existing storage engine, and funneling it into the new one. I had put down some thoughts two years ago here how this could be done.

@nvanbenschoten
Copy link
Member

@irfansharif that can't just be done at startup? It's not clear from the RFC why this needs to be coordinated across replicas, given that the Raft log is currently stored in the unreplicated keyspace.

@irfansharif
Copy link
Contributor

It can/should be, by "cutover point" I was looking at a node-centric view, not a cluster wide one. Mentioned it all here because the proposal above talks about unifying startup migrations with cluster versions.

@danhhz
Copy link
Contributor Author

danhhz commented Nov 25, 2019

Can we do it at startup though? Some of the other motivating examples could be done at startup, but we avoid long-running migrations at startup to protect against users rolling onto a version too fast. I'd be uncomfortable reading all raft data from one engine and writing it into another at startup, if that's what's being discussed here.

@irfansharif
Copy link
Contributor

to protect against users rolling onto a version too fast.

Do you mind elaborating on this? I don't quite follow. We are discussing reading all raft data from one engine and writing it into another at startup.

@danhhz
Copy link
Contributor Author

danhhz commented Nov 25, 2019

If you're doing a vX to vX+1 migration, you roll each node onto the next version. Ideally, you roll one node, wait for it to health check, and then roll the next node, repeating until you've done them all.

The worrysome case is a user that rolls them without the health checks. If we do too much work when first starting a node in the new version, then rolling them too fast will result in a cluster that is down for as long as the migrations take. Obviously, the user should be doing the right thing here, but extended total unavailability when they mess up is pretty bad.

Another thing to consider here is that it's likely difficult to maintain the ability to roll back to vX with an at-startup migration to a new raft log engine.

tbg added a commit to tbg/cockroach that referenced this issue Dec 2, 2019
TODO: see if this migration is actually "short-running". That is,
in a sufficiently large cluster, does this cause significant load?

----

This is a baby version of cockroachdb#39182 that handles only short-running
migrations but is useful in itself because it allows us to migrate us
fully into the following two KV below-Raft migrations:

1. use the RangeAppliedState on all ranges
2. use the unreplicated TruncatedState on all ranges

These two migrations have been around for a while and it has been
getting in the way of things. While ideally we introduce cockroachdb#39182 in the
near future and use it to address a larger class of migration concerns,
this PR serves as a starting point and work done on cockroachdb#39182 should be
able to absorb this PR with relative ease.

With this PR, legacy code related to 1) and 2) above can be removed from
`master` once the 20.1 branch is cut, i.e. roughly in April 2020.

The main ingredients in this PR are

a) introduce a hook that is called during `SET CLUSTER SETTING version`,
   after the change has been validated but before it is made.
b) introduce a KV-level `Migrate` ranged write command that triggers
   the migrations for 1) and 2) above. It is called from the hook for
   all of the keyspace.
c) before returning to the client, `Migrate` waits for the command to
   be durably applied on all followers.

Trying to get this 100% correct has proven tricky, perhaps foreshadowing
similar issues that expect us once we try cockroachdb#39182 in earnest. For one,
the mechanism only reaches replicas that members of the raft group, that
is, it won't touch replicas which are gc'able. For the migrations at
hand this means that in 20.2 there may - in theory - still be replicas
that have a replicated truncated state. I believe that our recent
efforts around not re-using replicas across replicaIDs has made sure
that this isn't an issue for this particular migration, but in general
it will have to remain on the radar. Similarly, it seems hard to prove
conclusively that no snapshot is in-flight that would set up a new
follower with a state predating the explicit migration, though it would
be exceptionally rare in practice.

Release note (general change): version upgrades now perform maintenance
duties that may slightly delay the `SET CLUSTER SETTING version` command
and may cause a small amount of additional load on the cluster while an
upgrade is being finalized.
@rohany
Copy link
Contributor

rohany commented Dec 18, 2019

Is there anymore discussion about this? This would be very useful for the FK migration work right now (we want to upgrade all table descriptors as a migration).

@danhhz
Copy link
Contributor Author

danhhz commented Dec 18, 2019

I have it on my backburner to finish the prototype, but noone should block on me for anything. @irfansharif has expressed an interest in this area at one point, dunno if anything has changed there.

@irfansharif
Copy link
Contributor

The issue with table descriptors (for my own reference).

There's a new FK representation (as of 19.2) and we want to make sure all the table descriptors have it in 20.1. If old table descriptors (the 19.1 representation) are left lying around, we have to keep maintaining the "upgrade on read" code path introduced in 19.2. For a running cluster, there's currently the possibility that certain tables would have not been read since 19.1 and thus would still be identified using the old representation.

Having the migration story here ironed out would reduce this maintenance burden + build confidence in table descriptor version upgrades.

@spaskob
Copy link
Contributor

spaskob commented Jan 10, 2020

Handling jobs from 2.0 clusters may also be relevant:

if log.V(2) {

It will be good to remove case from the codebase.

irfansharif added a commit to irfansharif/cockroach that referenced this issue Feb 26, 2020
We split off ClusterVersion out of pkg/settings/cluster into a dedicated
pkg/clusterversion. This is to pave the way for cockroachdb#39182 where we
introduce long running version migration infrastructure.

Release note: None

Co-authored-by: Daniel Harrison <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
irfansharif added a commit to irfansharif/cockroach that referenced this issue Feb 27, 2020
We split off ClusterVersion out of pkg/settings/cluster into a dedicated
pkg/clusterversion. This is to pave the way for cockroachdb#39182 where we
introduce long running version migration infrastructure.

Release note: None

Co-authored-by: Daniel Harrison <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
irfansharif added a commit to irfansharif/cockroach that referenced this issue Feb 27, 2020
We split off ClusterVersion out of pkg/settings/cluster into a dedicated
pkg/clusterversion. This is to pave the way for cockroachdb#39182 where we
introduce long running version migration infrastructure.

Release note: None

Co-authored-by: Daniel Harrison <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
irfansharif added a commit to irfansharif/cockroach that referenced this issue Feb 28, 2020
We split off ClusterVersion out of pkg/settings/cluster into a dedicated
pkg/clusterversion. This is to pave the way for cockroachdb#39182 where we
introduce long running version migration infrastructure.

Release note: None

Co-authored-by: Daniel Harrison <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
craig bot pushed a commit that referenced this issue Feb 28, 2020
45143: ui: Release notes signup r=dhartunian a=koorosh

Resolves: #43912
Depends on: #44744, #44821, #44856 

- [x] WIP. Current branch has two branches merged which haven't been approved/merged in master branch yet.

- [x] rebase branch to remove merge commits from branches other than master.

Add Release notes signup form to Cluster Overview page
right after page title.
Release Notes signup view is created in `src/views/dashboard`
directory because it will be the main page where we display
this view. And Cluster Overview page is a temporary place
while Dashboard view doesn't exist.

These changes integrate three main parts: ReleaseNotesForm,
AlertNotification component and custom analytics saga.



45426: coldata: minor tweak of flat bytes r=yuzefovich a=yuzefovich

This commit changes `maybeBackfillOffsets` to update `maxSetIndex`
accordingly (this might be a minor performance improvement). In a sense,
when we're backfilling offsets, we *are* setting indices to point to
empty `[]byte` slices. Also, the logic for `Set` method is slightly
refactored.

Release note: None

45455: clusterversion: significantly rework cluster version handling r=tbg a=irfansharif

We split off ClusterVersion out of pkg/settings/cluster into a dedicated
pkg/clusterversion. This is to pave the way for #39182 where we
introduce long running version migration infrastructure.

We then introduce clusterversion.Handle as a read only view to the
active cluster version and this binary's version details. We similarly
fold in the actual global cluster version setting into
pkg/clusterversion, and enforce all external usage to go through
clusterversion.Handle. We can also hide away external usage of the baked
in cluster.Binary{,MinimumSupported}Version constants. Instead we have
the implementation of clusterversion.Handle allow for tests to override
binary and minimum supported versions as needed.

Along the way we clean up Settings.Initialized, as it is not really
used. We document all the various "versions" in play ("binary version",
    "minimum supported version", "active version") and change usage of what
was previously "serverVersion" to simply be "binaryVersion", because
that's what it is. We also clean up the Settings constructors into
Test/Non-Test types that differ in cluster version setting
initialization behaviour.

---

For reviewers: It's probably easiest to start with what 
pkg/settings/cluster/cluster_settings.go looks like, then following into
pkg/clusterversion/cluster_version.go and then pkg/clusterversion/setting.go.

---

I still don't like the following detail about our pkg structure:

- pkg/settings/cluster depends on it's "parent" pkg, pkg/settings
- pkg/settings/cluster depends pkg/clusterversion, which in turn depends
on pkg/settings

Naively, pkg/settings/cluster should be a top level package, but I'm not
making that change in this PR. For now I've renamed the settings.go file
to cluster_settings.go.

Release note: None


45535: Revert "storage,libroach: update MVCCIncrementalIterator to look at every updated key" r=pbardea a=pbardea

Reverts #45163

To stop the errors we're seeing on #45524. Will investigate further once it's off master.

Co-authored-by: Andrii Vorobiov <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Daniel Harrison <[email protected]>
Co-authored-by: Paul Bardea <[email protected]>
craig bot pushed a commit that referenced this issue Mar 2, 2020
45455: clusterversion: significantly rework cluster version handling r=irfansharif a=irfansharif

We split off ClusterVersion out of pkg/settings/cluster into a dedicated
pkg/clusterversion. This is to pave the way for #39182 where we
introduce long running version migration infrastructure.

We then introduce clusterversion.Handle as a read only view to the
active cluster version and this binary's version details. We similarly
fold in the actual global cluster version setting into
pkg/clusterversion, and enforce all external usage to go through
clusterversion.Handle. We can also hide away external usage of the baked
in cluster.Binary{,MinimumSupported}Version constants. Instead we have
the implementation of clusterversion.Handle allow for tests to override
binary and minimum supported versions as needed.

Along the way we clean up Settings.Initialized, as it is not really
used. We document all the various "versions" in play ("binary version",
    "minimum supported version", "active version") and change usage of what
was previously "serverVersion" to simply be "binaryVersion", because
that's what it is. We also clean up the Settings constructors into
Test/Non-Test types that differ in cluster version setting
initialization behaviour.

---

For reviewers: It's probably easiest to start with what 
pkg/settings/cluster/cluster_settings.go looks like, then following into
pkg/clusterversion/cluster_version.go and then pkg/clusterversion/setting.go.

---

I still don't like the following detail about our pkg structure:

- pkg/settings/cluster depends on it's "parent" pkg, pkg/settings
- pkg/settings/cluster depends pkg/clusterversion, which in turn depends
on pkg/settings

Naively, pkg/settings/cluster should be a top level package, but I'm not
making that change in this PR. For now I've renamed the settings.go file
to cluster_settings.go.

Release note: None


45515: sql: rationalize some output events from the connection state machine r=andreimatei a=andreimatei

See individual commits.

Release note: None

45582: colexec: plan disk-spilling enabled operators when vectorize=auto r=yuzefovich a=asubiotto

Each commit turns on one of the HashRouter, Sorter, and HashJoiner and includes the relevant test changes (all plan changes).

The only thing of note is that explain analyze plans include both row and column stats for wrapped operators. It seems like this is expected though (according to `vectorize_local`) and changing this naively results in other failures (`unexpectedly not collecting stats`), so will leave this up to discussion of whether we want to change this. Regardless, I think it is out of scope for this PR.

Closes #45172 

Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Daniel Harrison <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Alfonso Subiotto Marques <[email protected]>
@irfansharif irfansharif self-assigned this Mar 31, 2020
@irfansharif
Copy link
Contributor

RFC here: #48843.

irfansharif pushed a commit to irfansharif/cockroach that referenced this issue Aug 18, 2020
TODO: see if this migration is actually "short-running". That is,
in a sufficiently large cluster, does this cause significant load?

----

This is a baby version of cockroachdb#39182 that handles only short-running
migrations but is useful in itself because it allows us to migrate us
fully into the following two KV below-Raft migrations:

1. use the RangeAppliedState on all ranges
2. use the unreplicated TruncatedState on all ranges

These two migrations have been around for a while and it has been
getting in the way of things. While ideally we introduce cockroachdb#39182 in the
near future and use it to address a larger class of migration concerns,
this PR serves as a starting point and work done on cockroachdb#39182 should be
able to absorb this PR with relative ease.

With this PR, legacy code related to 1) and 2) above can be removed from
`master` once the 20.1 branch is cut, i.e. roughly in April 2020.

The main ingredients in this PR are

a) introduce a hook that is called during `SET CLUSTER SETTING version`,
   after the change has been validated but before it is made.
b) introduce a KV-level `Migrate` ranged write command that triggers
   the migrations for 1) and 2) above. It is called from the hook for
   all of the keyspace.
c) before returning to the client, `Migrate` waits for the command to
   be durably applied on all followers.

Trying to get this 100% correct has proven tricky, perhaps foreshadowing
similar issues that expect us once we try cockroachdb#39182 in earnest. For one,
the mechanism only reaches replicas that members of the raft group, that
is, it won't touch replicas which are gc'able. For the migrations at
hand this means that in 20.2 there may - in theory - still be replicas
that have a replicated truncated state. I believe that our recent
efforts around not re-using replicas across replicaIDs has made sure
that this isn't an issue for this particular migration, but in general
it will have to remain on the radar. Similarly, it seems hard to prove
conclusively that no snapshot is in-flight that would set up a new
follower with a state predating the explicit migration, though it would
be exceptionally rare in practice.

Release note (general change): version upgrades now perform maintenance
duties that may slightly delay the `SET CLUSTER SETTING version` command
and may cause a small amount of additional load on the cluster while an
upgrade is being finalized.
@irfansharif
Copy link
Contributor

the various Raft migrations (RangeAppliedState, unreplicated truncated state, etc) which all boil down to "run something through Raft until they're all active on all ranges but 99.99% sure they're all active already anyway"

#58088 uses the long running migrations infrastructure proposed in #48843 (x-ref linked PRs in #56107) to onboard exactly the above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv Anything in KV that doesn't belong in a more specific category. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants