
Incremental version checks during migration #2399

Open
Zash wants to merge 3 commits into main from ka/in-cluster-version
Conversation

@Zash (Contributor) commented Jan 17, 2025

Warning

This is a public repository, ensure not to disclose:

  • personal data beyond what is necessary for interacting with this pull request, nor
  • business confidential information, such as customer names.

What kind of PR is this?

Required: Mark one of the following that is applicable:

  • kind/feature
  • kind/improvement
  • kind/deprecation
  • kind/documentation
  • kind/clean-up
  • kind/bug
  • kind/other

Optional: Mark one or more of the following that are applicable:

Important

Breaking changes should be marked kind/admin-change or kind/dev-change depending on type
Critical security fixes should be marked with kind/security

  • kind/admin-change
  • kind/dev-change
  • kind/security
  • [kind/adr](set-me)

What does this PR do / why do we need this PR?

This uses a pair of ConfigMaps to keep track of progress during apps upgrades
in order to enforce order (prepare first, then apply, and for the same version)
and atomicity (only allow prepare once) per step.

  • apps-meta stores the current version of apps
    • .data.version the version number itself
  • apps-upgrade stores details about an in-progress upgrade, and is deleted to signify a completed migration
    • .data.prepare records that ck8s upgrade ... prepare has been started
    • .data.prepared records that ck8s upgrade ... prepare has completed
    • .data.prepared records that ck8s upgrade ... apply has started
    • .data.last_step records the last snippet run by ck8s upgrade ...

Information to reviewers

This is intended to behave as a state machine, going from start to prepare to
apply (for each step) to done, at which point the upgraded-to version is
recorded.
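
To make the intended transitions concrete, here is a rough sketch in plain kubectl terms. This is an illustration only, assuming the ConfigMap names and keys listed above; the actual implementation lives in the helpers in bin/upgrade.bash (kubectl_do, record_migration_prepare_begin, and friends) and runs once per cluster (sc/wc).

# Sketch only: names and keys from the description above, values are examples.
# start → prepare: creating apps-upgrade is atomic, so a concurrent or repeated
# 'prepare' fails with AlreadyExists instead of silently racing.
kubectl -n kube-system create configmap apps-upgrade \
  --from-literal=prepare="${CK8S_TARGET_VERSION}"

# prepare → prepared: record that 'prepare' completed for this version.
kubectl -n kube-system patch configmap apps-upgrade --type merge \
  -p "{\"data\":{\"prepared\":\"${CK8S_TARGET_VERSION}\"}}"

# apply → done: record the upgraded-to version in apps-meta and drop the
# in-progress marker, signalling a completed migration.
kubectl -n kube-system create configmap apps-meta \
  --from-literal=version="${CK8S_TARGET_VERSION}" \
  --dry-run=client -o yaml | kubectl -n kube-system apply -f -
kubectl -n kube-system delete configmap apps-upgrade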

How to test: I applied the changes on top of the release-0.42 branch and upgraded from 0.41.x to 0.42.1, then again to 0.43 by creating a local tag (not to be published).

Checklist

  • Proper commit message prefix on all commits
  • Change checks:
    • The change is transparent
    • The change is disruptive
    • The change requires no migration steps
    • The change requires migration steps
    • The change updates CRDs
    • The change updates the config and the schema
  • Documentation checks:
  • Metrics checks:
    • The metrics are still exposed and present in Grafana after the change
    • The metrics names didn't change (Grafana dashboards and Prometheus alerts required no updates)
    • The metrics names did change (Grafana dashboards and Prometheus alerts required an update)
  • Logs checks:
    • The logs do not show any errors after the change
  • PodSecurityPolicy checks:
    • Any changed Pod is covered by Kubernetes Pod Security Standards
    • Any changed Pod is covered by Gatekeeper Pod Security Policies
    • The change does not cause any Pods to be blocked by Pod Security Standards or Policies
  • NetworkPolicy checks:
    • Any changed Pod is covered by Network Policies
    • The change does not cause any dropped packets in the NetworkPolicy Dashboard
  • Audit checks:
    • The change does not cause any unnecessary Kubernetes audit events
    • The change requires changes to Kubernetes audit policy
  • Falco checks:
    • The change does not cause any alerts to be generated by Falco
  • Bug checks:
    • The bug fix is covered by regression tests

@Zash Zash force-pushed the ka/in-cluster-version branch 7 times, most recently from 505bcf1 to c3dca19 Compare January 22, 2025 10:33
This is meant to atomically record the state transition into a migration (by creation) and then out of it (by deletion).

The general apps version is recorded at the end, and compared to the config version on next upgrade.
@Zash Zash force-pushed the ka/in-cluster-version branch from c3dca19 to 21653d6 Compare January 22, 2025 10:58
@Zash Zash marked this pull request as ready for review January 22, 2025 10:58
@Zash Zash requested review from a team as code owners January 22, 2025 10:58
@aarnq (Contributor) commented Jan 22, 2025

What is the motivation for keeping the phase of prepare in-cluster?

Isn't it just critical to track the apply phase?

@Zash (Contributor, Author) commented Jan 22, 2025

What is the motivation for keeping the phase of prepare in-cluster?

Seemed a natural starting point for these checks.

Isn't it just critical to track the apply phase?

Yes.

@Zash (Contributor, Author) commented Jan 22, 2025

#2315 (comment) is probably where I got that from, to track the whole upgrade process from the start.

@simonklb (Contributor) left a comment:

This LGTM!

Regarding the tracking of the prepare phase: I like that we are able to make sure that the config has been properly updated and get rid of the ck8sVersion from the config. On the other hand, if there is no Kubernetes cluster to verify against, then we still have to rely on something external.

No strong opinion here though. It doesn't hurt but might add some extra complexity.

Comment on lines +57 to +64
if ! record_migration_prepare_begin sc; then
log_fatal "prepare already started in sc (kubectl delete -n kube-system configmap apps-upgrade to try again)"
fi
fi
if [[ "${CK8S_CLUSTER:-}" =~ ^(wc|both)$ ]]; then
if ! record_migration_prepare_begin wc; then
log_fatal "prepare already started in wc (kubectl delete -n kube-system configmap apps-upgrade to try again)"
fi
Contributor:

suggestion: Move the log_fatal inside the function instead of assuming what a non-zero exit code means.

Comment on lines +118 to +125
prepared_version="$(get_prepared_version sc || true)"
if [ -z "${prepared_version}" ]; then
log_fatal "'prepare' step does not appear to have been run, do so first"
fi

if [[ "${prepared_version}" != "${CK8S_TARGET_VERSION}" ]]; then
log_fatal "'prepare' step in sc appears to have been run for version ${prepared_version}, not ${CK8S_TARGET_VERSION}"
fi
Contributor:

suggestion: Extract into a separate function and reuse for both sc and wc.

Comment on lines +398 to +402
kubectl_do "${1}" get -n kube-system cm apps-upgrade -o yaml 2>/dev/null |
yq4 --exit-status 'select(.data.prepare == strenv(CK8S_TARGET_VERSION)) |
.data.prepared = strenv(CK8S_TARGET_VERSION) |
del(.data.last_prepare_step)' |
kubectl_do "${1}" replace -f - >/dev/null 2>/dev/null
Contributor:

suggestion: While one-liners are nice, I think reducing the magic a bit here could help with readability and make it less error-prone, something like:

current_version=$(kubectl -n kube-system get configmap apps-upgrade -o jsonpath='{.data.prepare}')
[ "${current_version}" = "${CK8S_TARGET_VERSION}" ] || log_fatal "version mismatch"
kubectl -n kube-system patch configmap apps-upgrade --type=json -p="[{\"op\": \"add\", \"path\": \"/data/prepared\", \"value\": \"${CK8S_TARGET_VERSION}\"}]"
kubectl -n kube-system patch configmap apps-upgrade --type=json -p='[{"op": "remove", "path": "/data/last_prepare_step"}]'

Contributor (Author):

Does that resist TOCTOU issues? I was hoping that doing this as a single pipeline would ensure that the version could not be changed between the test and the patching. Sadly, the JSON Patch subset used by Kubernetes does not support "op":"test" 😭
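
As background, the protection the single pipeline relies on here is Kubernetes' optimistic concurrency check rather than a JSON Patch test op: kubectl get -o yaml carries metadata.resourceVersion through the pipeline, and kubectl replace is rejected with a Conflict if the object was modified in between. A minimal sketch, assuming the apps-upgrade ConfigMap and keys from this PR:

# Sketch: the get → modify → replace round trip keeps metadata.resourceVersion,
# so the final replace fails if apps-upgrade changed between the read and the write.
# (CK8S_TARGET_VERSION must be exported for yq's strenv.)
kubectl -n kube-system get configmap apps-upgrade -o yaml |
  yq4 '.data.prepared = strenv(CK8S_TARGET_VERSION)' |
  kubectl -n kube-system replace -f - # 409 Conflict on a stale resourceVersion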

Contributor:

As long as record_migration_begin isn't suffering from it, I think we should be fine, right? Since that is essentially acting as a lock.

That is, if we aren't worried that the version is changed out of band, but then we have bigger issues than a race condition between the test and the patch. 😄

Contributor:

Speaking of resisting issues, tests would be nice!

Contributor (Author):

Tests would be nice! Figuring out how to test whole migrations feels like a task in itself, however, unless we already have something for that somewhere? Maybe @aarnq or @viktor-f have thoughts on that?

Contributor:

Actually testing migrations would be awesome but feels like too big of an undertaking here.

I think something as simple as testing that running two migrations concurrently is prohibited, and that migrations are actually being tracked, would be enough. The migrations in the tests could just be dummies.
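
A minimal sketch of such a test, assuming the record_migration_prepare_begin helper from the snippets in this PR, a log_fatal that exits, and kubectl pointed at a throwaway cluster; the surrounding test harness is left out:

# Dummy concurrency check: the second attempt to begin a migration must be
# rejected, because creating the apps-upgrade ConfigMap is atomic.
record_migration_prepare_begin sc || log_fatal "first 'prepare' begin should succeed"
if record_migration_prepare_begin sc; then
  log_fatal "second 'prepare' begin should have been rejected"
fi
# Clean up the dummy migration state so the test can be re-run.
kubectl -n kube-system delete configmap apps-upgrade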

@Zash Zash requested a review from simonklb January 23, 2025 13:06
@Zash Zash force-pushed the ka/in-cluster-version branch from 3162bf7 to 53ac173 Compare January 23, 2025 13:08
@simonklb (Contributor) left a comment:

Still approved after removing the prepare phase tracking and keeping the reliance on ck8sVersion in the config file. However, I still think that getting rid of the ck8sVersion config value and relying only on the apps-meta ConfigMap would have been neat.

@aarnq (Contributor) commented Jan 24, 2025

Regarding the tracking of the prepare phase: I like that we are able to make sure that the config has been properly updated and get rid of the ck8sVersion from the config. On the other hand, if there is no Kubernetes cluster to verify against, then we still have to rely on something external.

But if it is in the cluster, it doesn't say anything, since the config isn't attached to it. Then you must apply some sort of checksumming to assert that it is the config you mean.
Or we have to move the entire config in-cluster.
Otherwise, if someone has done prepare, then someone else without that config can run apply.
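
For illustration, one possible shape of that checksumming idea; the config_checksum key, the file names, and the use of CK8S_CONFIG_PATH here are assumptions and not part of this PR:

# Hypothetical: record a digest of the config used for 'prepare'...
config_checksum="$(cat "${CK8S_CONFIG_PATH}"/common-config.yaml "${CK8S_CONFIG_PATH}"/sc-config.yaml | sha256sum | cut -d' ' -f1)"
kubectl -n kube-system patch configmap apps-upgrade --type merge \
  -p "{\"data\":{\"config_checksum\":\"${config_checksum}\"}}"
# ...and refuse to 'apply' unless the local config produces the same digest.
stored="$(kubectl -n kube-system get configmap apps-upgrade -o jsonpath='{.data.config_checksum}')"
[ "${stored}" = "${config_checksum}" ] || log_fatal "config differs from the one used for 'prepare'"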

log_fatal "'prepare' step in wc appears to have been run for version ${prepared_version}, not ${CK8S_TARGET_VERSION}"
fi
fi

Contributor (Author):

Note from @Xartos - record beginning of apply step

bin/upgrade.bash Outdated
# Create a configmap. This fails if done twice.
if [[ "${CK8S_CLUSTER:-}" =~ ^(sc|both)$ ]]; then
if ! record_migration_begin sc; then
log_fatal "prepare already started in sc (kubectl delete -n kube-system configmap apps-upgrade to try again)"
Contributor:

Suggested change
log_fatal "prepare already started in sc (kubectl delete -n kube-system configmap apps-upgrade to try again)"
log_fatal "apply already started in sc (kubectl delete -n kube-system configmap apps-upgrade to try again)"

Also, would it make sense to add a ck8s upgrade <prefix> unlock-apply to do this?
(With some warning+confirm that you probably shouldn't unless you know what you are doing.)
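
A rough sketch of what such a subcommand's body could look like, reusing kubectl_do and log_fatal from this PR; the function name, prompt wording, and warning text are only illustrative:

# Hypothetical 'ck8s upgrade <prefix> unlock-apply': remove the in-cluster
# upgrade lock after an explicit confirmation.
unlock_apply() {
  local cluster="${1}" # sc or wc
  echo "WARNING: this deletes the apps-upgrade ConfigMap in ${cluster}, allowing prepare/apply to be retried." >&2
  read -r -p "Continue only if you know no other upgrade is running [y/N]: " reply
  [[ "${reply}" =~ ^[Yy]$ ]] || log_fatal "aborted"
  kubectl_do "${cluster}" delete -n kube-system configmap apps-upgrade
}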

bin/upgrade.bash Outdated
fi
if [[ "${CK8S_CLUSTER:-}" =~ ^(wc|both)$ ]]; then
if ! record_migration_begin wc; then
log_fatal "prepare already started in wc (kubectl delete -n kube-system configmap apps-upgrade to try again)"
Contributor:

Suggested change
log_fatal "prepare already started in wc (kubectl delete -n kube-system configmap apps-upgrade to try again)"
log_fatal "apply already started in wc (kubectl delete -n kube-system configmap apps-upgrade to try again)"

This reverts commit 53ac173.

Decided to go with the approach where tracking starts at the prepare stage.
@simonklb (Contributor) commented Jan 24, 2025

Regarding the tracking of the prepare phase: I like that we are able to make sure that the config has been properly updated and get rid of the ck8sVersion from the config. On the other hand, if there is no Kubernetes cluster to verify against, then we still have to rely on something external.

But if it is in the cluster, it doesn't say anything, since the config isn't attached to it. Then you must apply some sort of checksumming to assert that it is the config you mean. Or we have to move the entire config in-cluster. Otherwise, if someone has done prepare, then someone else without that config can run apply.

Yes, you're right, the inherent issue of working with distributed configuration files still applies. At least now we can make sure that the cluster has been migrated to a new version, which is a big win.

I think my main concern with the ck8sVersion config is that it leads to a false sense of security and a bad habit of thinking that you can just run ck8s init and be good to go. I've seen it countless times where the configuration validation succeeds and apply breaks because the migration was never run.

Labels: none yet
Projects: none yet
Development: successfully merging this pull request may close these issues:
  • [3] Add applied apps version to cluster state
3 participants