Incremental version checks during migration #2399
Force-pushed from 505bcf1 to c3dca19.
This is meant to atomically record the state transition into a migration (by creation) and then out of it (by deletion). The general apps version is recorded at the end, and compared to the config version on next upgrade.
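A minimal sketch of the create-as-lock idea described above, under the assumption that the PR's helpers work roughly this way (the exact function bodies here are hypothetical): `kubectl create` fails if the object already exists, so creating the ConfigMap doubles as an atomic "enter migration" marker and deleting it as the "exit migration" marker.

```bash
# Hypothetical sketch: kubectl_do is assumed to wrap kubectl with the right
# kubeconfig for the given cluster prefix (sc or wc).
record_migration_begin() {
  # Fails with a non-zero exit code if apps-upgrade already exists,
  # which is what makes the creation act as a lock.
  kubectl_do "${1}" create -n kube-system configmap apps-upgrade \
    --from-literal=prepare="${CK8S_TARGET_VERSION}"
}

record_migration_done() {
  # Deleting the ConfigMap marks the migration as completed.
  kubectl_do "${1}" delete -n kube-system configmap apps-upgrade
}
```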
Force-pushed from c3dca19 to 21653d6.
What is the motivation for keeping the prepare phase in-cluster? Isn't it just critical to track the apply phase?
Seemed a natural starting point for these checks.
Yes.
#2315 (comment) is probably where I got that from, to track the whole upgrade process from the start.
This LGTM!

Regarding the tracking of the prepare phase, I like that we are able to make sure that the config has been properly updated and get rid of the ck8sVersion from the config; on the other hand, if there is no Kubernetes cluster to verify against, then we still have to rely on something external.

No strong opinion here though. It doesn't hurt, but it might add some extra complexity.
```bash
  if ! record_migration_prepare_begin sc; then
    log_fatal "prepare already started in sc (kubectl delete -n kube-system configmap apps-upgrade to try again)"
  fi
fi
if [[ "${CK8S_CLUSTER:-}" =~ ^(wc|both)$ ]]; then
  if ! record_migration_prepare_begin wc; then
    log_fatal "prepare already started in wc (kubectl delete -n kube-system configmap apps-upgrade to try again)"
  fi
```
suggestion: Move the log_fatal inside the function instead of assuming what a non-zero exit code means.
```bash
prepared_version="$(get_prepared_version sc || true)"
if [ -z "${prepared_version}" ]; then
  log_fatal "'prepare' step does not appear to have been run, do so first"
fi

if [[ "${prepared_version}" != "${CK8S_TARGET_VERSION}" ]]; then
  log_fatal "'prepare' step in sc appears to have been run for version ${prepared_version}, not ${CK8S_TARGET_VERSION}"
fi
```
suggestion: Extract into a separate function and reuse for both sc and wc.
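A rough sketch of what that shared helper could look like, reusing get_prepared_version and log_fatal from the diff above; the function name assert_prepared_version is made up here, not taken from the PR:

```bash
# Hypothetical helper: verify that 'prepare' has been run for the target
# version in the given cluster (sc or wc) before allowing 'apply'.
assert_prepared_version() {
  local cluster="${1}"
  local prepared_version
  prepared_version="$(get_prepared_version "${cluster}" || true)"
  if [[ -z "${prepared_version}" ]]; then
    log_fatal "'prepare' step does not appear to have been run in ${cluster}, do so first"
  fi
  if [[ "${prepared_version}" != "${CK8S_TARGET_VERSION}" ]]; then
    log_fatal "'prepare' step in ${cluster} appears to have been run for version ${prepared_version}, not ${CK8S_TARGET_VERSION}"
  fi
}
```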
```bash
kubectl_do "${1}" get -n kube-system cm apps-upgrade -o yaml 2>/dev/null |
  yq4 --exit-status 'select(.data.prepare == strenv(CK8S_TARGET_VERSION)) |
    .data.prepared = strenv(CK8S_TARGET_VERSION) |
    del(.data.last_prepare_step)' |
  kubectl_do "${1}" replace -f - >/dev/null 2>/dev/null
```
suggestion: While one-liners are nice, I think reducing the magic a bit here could help with readability and make it less error-prone, something like:

```bash
current_version=$(kubectl get configmap apps-upgrade -o jsonpath='{.data.prepare}')
[ "${current_version}" == "${CK8S_TARGET_VERSION}" ] || log_fatal "version mismatch"
kubectl patch configmap apps-upgrade --type=json \
  -p="[{\"op\": \"replace\", \"path\": \"/data/prepared\", \"value\": \"${CK8S_TARGET_VERSION}\"}]"
kubectl patch configmap apps-upgrade --type=json \
  -p='[{"op": "remove", "path": "/data/last_prepare_step"}]'
```
Does that resist TOCTOU issues? I was hoping that doing this as a single pipeline would ensure that the version could not be changed between the test and the patching. Sadly the JSON Patch subset used by Kubernetes does not support "op": "test" 😭
As long as record_migration_begin isn't suffering from it I think we should be fine, right? Since that is essentially acting as a lock.

If we aren't worried that the version is changed out of bounds, that is, but then we have bigger issues than a race condition between the test and the patch. 😄
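It may also help to note that the get | yq4 | replace pipeline gets optimistic concurrency from the API server for free: kubectl replace sends back the resourceVersion captured by the get, and the write is rejected if the object changed in between. A minimal illustration, assuming a throwaway cluster where the apps-upgrade ConfigMap exists:

```bash
# Capture the ConfigMap, including its current resourceVersion.
kubectl -n kube-system get cm apps-upgrade -o yaml > /tmp/apps-upgrade.yaml
# Simulate a concurrent writer touching the ConfigMap in between.
kubectl -n kube-system patch cm apps-upgrade --type=merge -p '{"data":{"prepare":"tampered"}}'
# The saved manifest now carries a stale resourceVersion, so this is rejected
# by the API server with a Conflict error instead of silently overwriting.
kubectl -n kube-system replace -f /tmp/apps-upgrade.yaml
```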
Speaking of resisting issues: tests would be nice!
Actually testing migrations would be awesome but feels like too big of an undertaking here.

I think something as simple as testing that running two migrations concurrently is prohibited, and that migrations are actually being tracked, would be enough. The migrations in the tests could just be dummies.
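A rough sketch of such a dummy test, assuming the test harness can call record_migration_begin directly and that it returns non-zero when the lock ConfigMap already exists; the test name and harness conventions here are made up:

```bash
# Hypothetical test: a second "migration" must not be allowed to start while
# the first one still holds the apps-upgrade lock ConfigMap.
test_concurrent_migrations_are_prohibited() {
  record_migration_begin sc || {
    echo "FAIL: could not start first dummy migration" >&2
    return 1
  }

  if record_migration_begin sc; then
    echo "FAIL: second concurrent migration was allowed" >&2
    return 1
  fi

  echo "PASS: second migration attempt was rejected while the first was in progress"
}
```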
Force-pushed from 3162bf7 to 53ac173.
Still approved after removing the prepare phase tracking and keeping the reliance on ck8sVersion in the config file. However, I still do think that getting rid of the ck8sVersion config value and relying only on the apps-meta config map would have been neat.
But if it is in cluster it doesn't say anything, as the config isn't attached to it. Then you must apply some sort of checksumming to assert that it is that config you mean.
log_fatal "'prepare' step in wc appears to have been run for version ${prepared_version}, not ${CK8S_TARGET_VERSION}" | ||
fi | ||
fi | ||
|
Note from @Xartos - record beginning of apply step
bin/upgrade.bash
```bash
# Create a configmap. This fails if done twice.
if [[ "${CK8S_CLUSTER:-}" =~ ^(sc|both)$ ]]; then
  if ! record_migration_begin sc; then
    log_fatal "prepare already started in sc (kubectl delete -n kube-system configmap apps-upgrade to try again)"
```
log_fatal "prepare already started in sc (kubectl delete -n kube-system configmap apps-upgrade to try again)" | |
log_fatal "apply already started in sc (kubectl delete -n kube-system configmap apps-upgrade to try again)" |
Also, would it make sense to add a ck8s upgrade <prefix> unlock-apply command to do this? (With some warning+confirm that you probably shouldn't unless you know what you are doing.)
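A rough sketch of what such a subcommand could do; the command name, log_warning, and the ask_abort confirmation helper are assumptions for illustration, not part of this PR:

```bash
# Hypothetical unlock-apply: remove the in-cluster upgrade lock after an
# explicit warning and a confirmation prompt.
unlock_apply() {
  local cluster="${1}" # sc or wc
  log_warning "This removes the in-progress upgrade lock for ${cluster}."
  log_warning "Only do this if you are sure no other upgrade is running."
  ask_abort # assumed helper: prompt the user and abort unless they confirm
  kubectl_do "${cluster}" delete -n kube-system configmap apps-upgrade
}
```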
bin/upgrade.bash
```bash
fi
if [[ "${CK8S_CLUSTER:-}" =~ ^(wc|both)$ ]]; then
  if ! record_migration_begin wc; then
    log_fatal "prepare already started in wc (kubectl delete -n kube-system configmap apps-upgrade to try again)"
```
log_fatal "prepare already started in wc (kubectl delete -n kube-system configmap apps-upgrade to try again)" | |
log_fatal "apply already started in wc (kubectl delete -n kube-system configmap apps-upgrade to try again)" |
This reverts commit 53ac173. Decided to go with the approach where tracking starts at the prepare stage.
Yes, you're right, the inherent issue of working with distributed configuration files still applies. At least now we can make sure that the cluster has been migrated to a new version, which is a big win. I think my main concern with the …
What does this PR do / why do we need this PR?
This uses a pair of ConfigMaps to keep track of progress during apps upgrades, in order to enforce order (prepare first, then apply, and for the same version) and atomicity (only allow prepare once) per step; see the sketch after this list.

- `apps-meta` stores the current version of apps.
  - `data.version`: the version number itself
- `apps-upgrade` stores details about an in-progress upgrade, and is deleted to signify a completed migration.
  - `data.prepare` records that `ck8s upgrade ... prepare` has been started.
  - `data.prepared` records that `ck8s upgrade ... prepare` has completed.
  - `data.prepared` records that `ck8s upgrade ... apply` has started.
  - `data.last_step` records the last snippet run by `ck8s upgrade ...`
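A minimal sketch of how those two ConfigMaps might look mid-upgrade, written as the kubectl commands that would create them; the kube-system namespace for apps-meta, the version strings, and the last_step value are illustrative assumptions, not taken from the PR:

```bash
# Illustration only: apps-meta records the currently installed apps version...
kubectl -n kube-system create configmap apps-meta \
  --from-literal=version=v0.42.1

# ...while apps-upgrade exists only while an upgrade is in flight and is
# deleted once the migration has completed.
kubectl -n kube-system create configmap apps-upgrade \
  --from-literal=prepare=v0.43.0 \
  --from-literal=prepared=v0.43.0 \
  --from-literal=last_step=example-snippet.sh
```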
Information to reviewers
This is intended to behave as a state machine, going from start to prepare to apply (for each step) to done, at which point the upgraded-to version is recorded.
How to test: I applied the changes on top of the release-0.42 branch and upgraded from 0.41.x to 0.42.1, then again to 0.43 by creating a local tag (not to be published).