
Incremental version checks during migration #2399

Open
Zash wants to merge 3 commits into main from ka/in-cluster-version
Conversation

@Zash (Contributor) commented Jan 17, 2025

Warning

This is a public repository, ensure not to disclose:

  • personal data beyond what is necessary for interacting with this pull request, nor
  • business confidential information, such as customer names.

What kind of PR is this?

Required: Mark one of the following that is applicable:

  • kind/feature
  • kind/improvement
  • kind/deprecation
  • kind/documentation
  • kind/clean-up
  • kind/bug
  • kind/other

Optional: Mark one or more of the following that are applicable:

Important

Breaking changes should be marked kind/admin-change or kind/dev-change depending on type
Critical security fixes should be marked with kind/security

  • kind/admin-change
  • kind/dev-change
  • kind/security
  • [kind/adr](set-me)

What does this PR do / why do we need this PR?

This uses a pair of ConfigMaps to keep track of progress during apps upgrades
in order to enforce order (prepare first, then apply, and for the same version)
and atomicity (only allow prepare once) per step.

  • apps-meta stores the current version of apps
    • .data.version the version number itself
  • apps-upgrade stores details about an in-progress upgrade, and is deleted to signify a completed migration
    • .data.prepare records that ck8s upgrade ... prepare has been started
    • .data.prepared records that ck8s upgrade ... prepare has completed
    • .data.prepared records that ck8s upgrade ... apply has started
    • .data.last_step records the last snippet run by ck8s upgrade ...

Information to reviewers

This is intended to behave as a state machine, going from start to prepare to
apply (for each step) to done, at which point the upgraded-to version is
recorded.
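
To make the intended transitions concrete, here is a rough sketch in plain kubectl terms. This is an illustration only, assuming the ConfigMap names and keys listed above; the actual implementation lives in the helpers in bin/upgrade.bash (kubectl_do, record_migration_prepare_begin, and friends) and runs once per cluster (sc/wc).

# Sketch only: names and keys from the description above, values are examples.
# start → prepare: creating apps-upgrade is atomic, so a concurrent or repeated
# 'prepare' fails with AlreadyExists instead of silently racing.
kubectl -n kube-system create configmap apps-upgrade \
  --from-literal=prepare="${CK8S_TARGET_VERSION}"

# prepare → prepared: record that 'prepare' completed for this version.
kubectl -n kube-system patch configmap apps-upgrade --type merge \
  -p "{\"data\":{\"prepared\":\"${CK8S_TARGET_VERSION}\"}}"

# apply → done: record the upgraded-to version in apps-meta and drop the
# in-progress marker, signalling a completed migration.
kubectl -n kube-system create configmap apps-meta \
  --from-literal=version="${CK8S_TARGET_VERSION}" \
  --dry-run=client -o yaml | kubectl -n kube-system apply -f -
kubectl -n kube-system delete configmap apps-upgrade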

How to test: I applied the changes on top of the release-0.42 branch and upgraded from 0.41.x to 0.42.1, then again to 0.43 by creating a local tag (not to be published).

Checklist

  • Proper commit message prefix on all commits
  • Change checks:
    • The change is transparent
    • The change is disruptive
    • The change requires no migration steps
    • The change requires migration steps
    • The change updates CRDs
    • The change updates the config and the schema
  • Documentation checks:
  • Metrics checks:
    • The metrics are still exposed and present in Grafana after the change
    • The metrics names didn't change (Grafana dashboards and Prometheus alerts required no updates)
    • The metrics names did change (Grafana dashboards and Prometheus alerts required an update)
  • Logs checks:
    • The logs do not show any errors after the change
  • PodSecurityPolicy checks:
    • Any changed Pod is covered by Kubernetes Pod Security Standards
    • Any changed Pod is covered by Gatekeeper Pod Security Policies
    • The change does not cause any Pods to be blocked by Pod Security Standards or Policies
  • NetworkPolicy checks:
    • Any changed Pod is covered by Network Policies
    • The change does not cause any dropped packets in the NetworkPolicy Dashboard
  • Audit checks:
    • The change does not cause any unnecessary Kubernetes audit events
    • The change requires changes to Kubernetes audit policy
  • Falco checks:
    • The change does not cause any alerts to be generated by Falco
  • Bug checks:
    • The bug fix is covered by regression tests

@Zash Zash force-pushed the ka/in-cluster-version branch 7 times, most recently from 505bcf1 to c3dca19 Compare January 22, 2025 10:33
This is meant to atomically record the state transition into a migration (by creation) and then out of it (by deletion).

The general apps version is recorded at the end, and compared to the config version on next upgrade.
@Zash Zash force-pushed the ka/in-cluster-version branch from c3dca19 to 21653d6 Compare January 22, 2025 10:58
@Zash Zash marked this pull request as ready for review January 22, 2025 10:58
@Zash Zash requested review from a team as code owners January 22, 2025 10:58
@aarnq (Contributor) commented Jan 22, 2025

What is the motivation for keeping the phase of prepare in-cluster?

Isn't it just critical to track the apply phase?

@Zash (Contributor, Author) commented Jan 22, 2025

What is the motivation for keeping the phase of prepare in-cluster?

Seemed a natural starting point for these checks.

Isn't it just critical to track the apply phase?

Yes.

@Zash (Contributor, Author) commented Jan 22, 2025

#2315 (comment) is probably where I got that from, to track the whole upgrade process from the start.

@simonklb (Contributor) left a comment:

This LGTM!

Regarding the tracking of the prepare phase: I like that we are able to make sure that the config has been properly updated and get rid of the ck8sVersion from the config. On the other hand, if there is no Kubernetes cluster to verify against, then we still have to rely on something external.

No strong opinion here though. It doesn't hurt but might add some extra complexity.

Comment on lines +57 to +64
if ! record_migration_prepare_begin sc; then
log_fatal "prepare already started in sc (kubectl delete -n kube-system configmap apps-upgrade to try again)"
fi
fi
if [[ "${CK8S_CLUSTER:-}" =~ ^(wc|both)$ ]]; then
if ! record_migration_prepare_begin wc; then
log_fatal "prepare already started in wc (kubectl delete -n kube-system configmap apps-upgrade to try again)"
fi
Contributor:

suggestion: Move the log_fatal inside the function instead of assuming what a non-zero exit code means.

Comment on lines +118 to +125
prepared_version="$(get_prepared_version sc || true)"
if [ -z "${prepared_version}" ]; then
log_fatal "'prepare' step does not appear to have been run, do so first"
fi

if [[ "${prepared_version}" != "${CK8S_TARGET_VERSION}" ]]; then
log_fatal "'prepare' step in sc appears to have been run for version ${prepared_version}, not ${CK8S_TARGET_VERSION}"
fi
Contributor:

suggestion: Extract into a separate function and reuse for both sc and wc.

Comment on lines +398 to +402
kubectl_do "${1}" get -n kube-system cm apps-upgrade -o yaml 2>/dev/null |
yq4 --exit-status 'select(.data.prepare == strenv(CK8S_TARGET_VERSION)) |
.data.prepared = strenv(CK8S_TARGET_VERSION) |
del(.data.last_prepare_step)' |
kubectl_do "${1}" replace -f - >/dev/null 2>/dev/null
Contributor:

suggestion: While one-liners are nice, I think reducing the magic a bit here could help with readability and make it less error-prone, something like:

current_version=$(kubectl -n kube-system get configmap apps-upgrade -o jsonpath='{.data.prepare}')
[ "${current_version}" = "${CK8S_TARGET_VERSION}" ] || log_fatal "version mismatch"
kubectl -n kube-system patch configmap apps-upgrade --type=json -p="[{\"op\": \"add\", \"path\": \"/data/prepared\", \"value\": \"${CK8S_TARGET_VERSION}\"}]"
kubectl -n kube-system patch configmap apps-upgrade --type=json -p='[{"op": "remove", "path": "/data/last_prepare_step"}]'

Contributor (Author):

Does that resist TOCTOU issues? I was hoping that doing this as a single pipeline would ensure that the version could not be changed between the test and the patching. Sadly, the JSON Patch subset used by Kubernetes does not support "op":"test" 😭
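
As background, the protection the single pipeline relies on here is Kubernetes' optimistic concurrency check rather than a JSON Patch test op: kubectl get -o yaml carries metadata.resourceVersion through the pipeline, and kubectl replace is rejected with a Conflict if the object was modified in between. A minimal sketch, assuming the apps-upgrade ConfigMap and keys from this PR:

# Sketch: the get → modify → replace round trip keeps metadata.resourceVersion,
# so the final replace fails if apps-upgrade changed between the read and the write.
# (CK8S_TARGET_VERSION must be exported for yq's strenv.)
kubectl -n kube-system get configmap apps-upgrade -o yaml |
  yq4 '.data.prepared = strenv(CK8S_TARGET_VERSION)' |
  kubectl -n kube-system replace -f - # 409 Conflict on a stale resourceVersion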

Contributor:

As long as record_migration_begin isn't suffering from it, I think we should be fine, right? Since that is essentially acting as a lock.

That is, if we aren't worried that the version is changed out of band, but then we have bigger issues than a race condition between the test and the patch. 😄

Contributor:

Speaking of resisting issues, tests would be nice!

Contributor (Author):

Tests would be nice! Figuring out how to test whole migrations feels like a task in itself, however, unless we already have something for that somewhere? Maybe @aarnq or @viktor-f have thoughts on that?

Contributor:

Actually testing migrations would be awesome but feels like too big of an undertaking here.

I think something as simple as testing that running two migrations concurrently is prohibited, and that migrations are actually being tracked, would be enough. The migrations in the tests could just be dummies.
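
A minimal sketch of such a test, assuming the record_migration_prepare_begin helper from the snippets in this PR, a log_fatal that exits, and kubectl pointed at a throwaway cluster; the surrounding test harness is left out:

# Dummy concurrency check: the second attempt to begin a migration must be
# rejected, because creating the apps-upgrade ConfigMap is atomic.
record_migration_prepare_begin sc || log_fatal "first 'prepare' begin should succeed"
if record_migration_prepare_begin sc; then
  log_fatal "second 'prepare' begin should have been rejected"
fi
# Clean up the dummy migration state so the test can be re-run.
kubectl -n kube-system delete configmap apps-upgrade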

@Zash Zash requested a review from simonklb January 23, 2025 13:06
@Zash Zash force-pushed the ka/in-cluster-version branch from 3162bf7 to 53ac173 Compare January 23, 2025 13:08
@simonklb (Contributor) left a comment:

Still approved after removing the prepare phase tracking and keeping the reliance on ck8sVersion in the config file. However, I still think that getting rid of the ck8sVersion config value and relying only on the apps-meta ConfigMap would have been neat.

@aarnq (Contributor) commented Jan 24, 2025

Regarding the tracking of the prepare phase: I like that we are able to make sure that the config has been properly updated and get rid of the ck8sVersion from the config. On the other hand, if there is no Kubernetes cluster to verify against, then we still have to rely on something external.

But if it is in the cluster, it doesn't say anything, since the config isn't attached to it. Then you must apply some sort of checksumming to assert that it is the config you mean.
Or we have to move the entire config in-cluster.
Otherwise, if someone has done prepare, then someone else without that config can run apply.
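
For illustration, one possible shape of that checksumming idea; the config_checksum key, the file names, and the use of CK8S_CONFIG_PATH here are assumptions and not part of this PR:

# Hypothetical: record a digest of the config used for 'prepare'...
config_checksum="$(cat "${CK8S_CONFIG_PATH}"/common-config.yaml "${CK8S_CONFIG_PATH}"/sc-config.yaml | sha256sum | cut -d' ' -f1)"
kubectl -n kube-system patch configmap apps-upgrade --type merge \
  -p "{\"data\":{\"config_checksum\":\"${config_checksum}\"}}"
# ...and refuse to 'apply' unless the local config produces the same digest.
stored="$(kubectl -n kube-system get configmap apps-upgrade -o jsonpath='{.data.config_checksum}')"
[ "${stored}" = "${config_checksum}" ] || log_fatal "config differs from the one used for 'prepare'"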

log_fatal "'prepare' step in wc appears to have been run for version ${prepared_version}, not ${CK8S_TARGET_VERSION}"
fi
fi

Contributor (Author):

Note from @Xartos - record beginning of apply step

bin/upgrade.bash Outdated
# Create a configmap. This fails if done twice.
if [[ "${CK8S_CLUSTER:-}" =~ ^(sc|both)$ ]]; then
if ! record_migration_begin sc; then
log_fatal "prepare already started in sc (kubectl delete -n kube-system configmap apps-upgrade to try again)"
Contributor:

Suggested change
log_fatal "prepare already started in sc (kubectl delete -n kube-system configmap apps-upgrade to try again)"
log_fatal "apply already started in sc (kubectl delete -n kube-system configmap apps-upgrade to try again)"

Also, would it make sense to add a ck8s upgrade <prefix> unlock-apply to do this?
(With some warning+confirm that you probably shouldn't unless you know what you are doing.)
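
A rough sketch of what such a subcommand's body could look like, reusing kubectl_do and log_fatal from this PR; the function name, prompt wording, and warning text are only illustrative:

# Hypothetical 'ck8s upgrade <prefix> unlock-apply': remove the in-cluster
# upgrade lock after an explicit confirmation.
unlock_apply() {
  local cluster="${1}" # sc or wc
  echo "WARNING: this deletes the apps-upgrade ConfigMap in ${cluster}, allowing prepare/apply to be retried." >&2
  read -r -p "Continue only if you know no other upgrade is running [y/N]: " reply
  [[ "${reply}" =~ ^[Yy]$ ]] || log_fatal "aborted"
  kubectl_do "${cluster}" delete -n kube-system configmap apps-upgrade
}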

bin/upgrade.bash Outdated
fi
if [[ "${CK8S_CLUSTER:-}" =~ ^(wc|both)$ ]]; then
if ! record_migration_begin wc; then
log_fatal "prepare already started in wc (kubectl delete -n kube-system configmap apps-upgrade to try again)"
Contributor:

Suggested change
log_fatal "prepare already started in wc (kubectl delete -n kube-system configmap apps-upgrade to try again)"
log_fatal "apply already started in wc (kubectl delete -n kube-system configmap apps-upgrade to try again)"

This reverts commit 53ac173.

Decided to go with the approach where tracking starts at the prepare stage.
@simonklb (Contributor) commented Jan 24, 2025

Regarding the tracking of the prepare phase: I like that we are able to make sure that the config has been properly updated and get rid of the ck8sVersion from the config. On the other hand, if there is no Kubernetes cluster to verify against, then we still have to rely on something external.

But if it is in the cluster, it doesn't say anything, since the config isn't attached to it. Then you must apply some sort of checksumming to assert that it is the config you mean. Or we have to move the entire config in-cluster. Otherwise, if someone has done prepare, then someone else without that config can run apply.

Yes, you're right, the inherent issue of working with distributed configuration files still applies. At least now we can make sure that the cluster has been migrated to a new version, which is a big win.

I think my main concern with the ck8sVersion config is that it leads to a false sense of security and a bad habit of thinking that you can just run ck8s init and be good to go. I've seen it countless times where the configuration validation succeeds and apply breaks because the migration was never run.

Labels: none yet
Projects: none yet
Development: successfully merging this pull request may close these issues:
  • [3] Add applied apps version to cluster state
3 participants