✨ Upgrade cert-manager #3313
Conversation
/hold To wait for v0.3.7 to be released first
I can keep the old method name around if preferred, and maybe introduce the new name as an extra function and deprecate the old one?
The job is optional, I should have said something earlier. AFAIK these methods are only used within clusterctl, so it should be fine to change.
/milestone v0.3.x
Mostly looks good to me. Just some minor nits.
UPDATE: I also manually tested this using `clusterctl init` on a new kind cluster and everything was running smoothly. Yay!
Thanks for this PR!
// Nb. we are ignoring the error so this operation can support listing images even if there is no existing management cluster;
// in case there is no existing management cluster, we assume there is no web-hook installed in the cluster.
hasWebhook, _ := cm.hasWebhook()
if hasWebhook {
I'm not certain, but this bit of code may still be required? If a user runs 'init' again on an existing cluster, what should this area of code do?
log := logf.Log
// Checks if the cert-manager web-hook already exists, if yes, exit immediately |
I suspect we still want this check too? Again, to avoid anything actually happening if 'clusterctl init' is run 😄
I've re-added this check with an additional function that calls Get on each resource in the cluster to verify whether cert-manager resources already exist before creating them all again.
Now that I think about it some more though, it may be better to just call Create(), because currently if a single IsNotFound error is encountered, we go ahead and call Create anyway.
Whilst we may not want to address a proper upgrade process in this PR, because it is upgrading the manifest used, we probably want to implement some guards to avoid users downloading the new clusterctl and running `init` against an existing management cluster, as they could end up with orphaned resources left over from the previous release.
I'm actually not 100% certain on how best to address this - updating the manifests in place without a defined upgrade process is tough, because we don't have any means to understand whether an older version is installed (unless we retain both copies of the cert-manager install manifests and explicitly check whether all resources in an older version exist, but that gets... slippery).
We do have this annotation on resources: helm.sh/chart: 'cert-manager-v0.15.2', but I'm not sure relying on the cert-manager project to always set this in this format is wise. Perhaps we could annotate created objects (similar to what's done already), but with a version number too? For the 'transitionary period', for users coming from earlier clusterctl versions, we could check for the presence of the `cert-manager` value with no version annotation in the `clusterctl.cluster.x-k8s.io/core` annotation. But without recording exactly what resources are required to be installed in earlier versions of cert-manager, it won't be possible for us to be certain all resources exist (i.e. if a previous invocation of `clusterctl init` failed half way through, we'll struggle to recover).
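For illustration, here is a rough sketch of the kind of detection being discussed, assuming a controller-runtime client. The label key and the helper name are hypothetical; they mirror the idea above rather than the PR's actual code:

```go
package certmanager

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// detectInstalledVersion returns the cert-manager version recorded on the
// cert-manager namespace, or "" if cert-manager does not appear to be installed.
func detectInstalledVersion(ctx context.Context, c client.Client) (string, error) {
	ns := &corev1.Namespace{}
	if err := c.Get(ctx, client.ObjectKey{Name: "cert-manager"}, ns); err != nil {
		if apierrors.IsNotFound(err) {
			return "", nil // no namespace: assume cert-manager is not installed
		}
		return "", err
	}
	// An object created by an older clusterctl would carry clusterctl's markers
	// but no version label; that is what distinguishes "old install" from "fresh install".
	if v, ok := ns.Labels["certmanager.clusterctl.cluster.x-k8s.io/version"]; ok {
		return v, nil
	}
	return "unknown", nil
}
```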
// Issuer and Certificate).
// This ensures that the Kubernetes apiserver is ready to serve resources within the
// cert-manager API group.
func (cm *certManagerClient) waitForAPIReady(ctx context.Context, log logr.Logger) error {
Just for confirmation, in this func we are creating an issuer/a certificate, but we are not waiting for the certificate to be processed as documented at https://cert-manager.io/docs/installation/kubernetes/#verifying-the-installation.
Is this enough for assuming cert-manager is working?
Another question, as far as I know cert-manager, this test can work both with v0.11.0 and v0.15.2
> Is this enough for assuming cert-manager is working?
So it depends on how you want to handle this - you are correct that it may not yet be ready to issue certificates, as the controller could still be leader electing. However, presumably the parts of code that create the Certificate resource also then wait for those Certificates to be Ready before consuming the signed certificate/private key. Would you rather wait at the start for the controller to be ready, or wait until further into execution? Easy enough to change either way 😄
> Another question, as far as I know cert-manager, this test can work both with v0.11.0 and v0.15.2
Correct, this is a far more universal test - if this stops working, it'll be because:
- The v1alpha2 API version no longer exists in cert-manager
- We made a big mistake in cert-manager 😅
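As a reference for the approach being described, here is a minimal sketch of "keep creating the test resources until the API group is served", assuming a controller-runtime client and unstructured test objects decoded from an embedded manifest. The function and parameter names are placeholders; the PR's actual waitForAPIReady may be structured differently:

```go
package certmanager

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForCertManagerAPI repeatedly tries to create the test Issuer/Certificate
// objects. Creation only succeeds once the apiserver serves the cert-manager
// API group, i.e. once the CRDs are installed and established.
func waitForCertManagerAPI(ctx context.Context, c client.Client, testObjs []unstructured.Unstructured) error {
	return wait.PollImmediate(1*time.Second, 60*time.Second, func() (bool, error) {
		for i := range testObjs {
			o := testObjs[i]
			if err := c.Create(ctx, &o); err != nil {
				if apierrors.IsAlreadyExists(err) {
					continue // left over from a previous attempt: good enough
				}
				return false, nil // API not served yet, retry
			}
		}
		return true, nil
	})
}
```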
// This is required to check the web-hook is working and ready to accept requests.
func (cm *certManagerClient) isWebhookAvailable(webhook *unstructured.Unstructured) (bool, error) {
conditions, found, err := unstructured.NestedSlice(webhook.Object, "status", "conditions")
testObjs, err := getTestResourcesManifestObjs()
Ideally, we should randomize the namespace name, but this can be addressed in a follow up as well
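For context on the NestedSlice pattern used in the snippet above, here is a small self-contained sketch of reading readiness conditions from an unstructured object. The condition type checked ("Available", as a Deployment reports) is an assumption for illustration, not necessarily the condition the PR's webhook check inspects:

```go
package certmanager

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// isDeploymentAvailable walks status.conditions on an unstructured Deployment
// and reports whether the Available condition is True.
func isDeploymentAvailable(obj *unstructured.Unstructured) (bool, error) {
	conditions, found, err := unstructured.NestedSlice(obj.Object, "status", "conditions")
	if err != nil || !found {
		return false, err
	}
	for _, c := range conditions {
		cond, ok := c.(map[string]interface{})
		if !ok {
			continue
		}
		// A Deployment reports Available=True once its replicas are serving.
		if cond["type"] == "Available" && cond["status"] == "True" {
			return true, nil
		}
	}
	return false, nil
}
```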
// installs the web-hook
// Install all cert-manager manifests
Question. Should we start to add a version label to each component so we can properly plan for cert-manager upgrades?
This definitely seems sensible - but we will need to store that version number somewhere, and remember to update it if the YAML asset changes. I wonder if we can have an automated test for this...? Maybe even just take a hash of the YAML file and have a variable, `versionNumber`, which is a tuple of the version number and the hash of that file?
That way if you change the file, the hash will no longer match, forcing people to look at that area of code?
Perhaps you have other instances of this somewhere and have an existing pattern to solve it? 😄
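As a sketch of the "version number plus hash of the asset" idea, assuming the cert-manager YAML is available in-process as a byte slice; the variable and constant names and values below are illustrative placeholders, not what the PR ultimately uses:

```go
package certmanager

import (
	"crypto/sha256"
	"fmt"
	"testing"
)

var certManagerManifest = []byte("...embedded cert-manager YAML...") // placeholder for the real embedded asset

const (
	embeddedCertManagerManifestVersion = "v0.16.0"
	embeddedCertManagerManifestHash    = "<recorded sha256 of the YAML>" // regenerate whenever the asset changes
)

// If the embedded YAML changes, the recomputed hash no longer matches the
// recorded one, forcing whoever changed it to also revisit the version marker.
func Test_EmbeddedManifestMarker(t *testing.T) {
	recomputed := fmt.Sprintf("%x", sha256.Sum256(certManagerManifest))
	if recomputed != embeddedCertManagerManifestHash {
		t.Fatalf("cert-manager manifest changed (sha256 %s); update the %s version marker and the recorded hash",
			recomputed, embeddedCertManagerManifestVersion)
	}
}
```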
Done in 4aa658b - keen to hear your comments on the technique for keeping the version number up to date.
I personally like the idea of the hash, so we can manage adjustments to the manifest without bumping a release.
Addressed your feedback - should be good for another look! The only other thing I'm concerned about now is the limitations/requirements on label values - from what I can tell, every character we are using right now is part of the character set permitted for labels. But it may be better to use annotations for some of this data if indexing/selecting based on these values is not required :)
Seeing:
Before we go ahead and merge this, do we want to stick with MD5 for this hashing? I don't think we are too concerned about the cryptographic properties of the hashing function, but it'll be a lot easier to get it right now than it would be to change it later 😄
Let's just use sha256?
Done ✅
Seems like we might have to limit the hash to fit in a label; IIRC it can usually be limited to 10 chars.
So I ran into the following problem:
- Created kind cluster.
- `clusterctl init` on that cluster with older clusterctl (that is, clusterctl built from master branch).
- Built clusterctl using this branch and ran `clusterctl init`.
Got the following output:
Using configuration File="/Users/wfernandes/.cluster-api/clusterctl.yaml"
Fetching providers
Waiting for cert-manager to be available...
Creating Namespace="cert-manager-test"
Failed to create cert-manager test resource Namespace="cert-manager-test"
Installing cert-manager
Creating Namespace="cert-manager"
Operation failed, retrying with backoff Cause="failed to create cert-manager component: /v1, Kind=Namespace, /cert-manager: Namespace \"cert-manager\" is
invalid: metadata.labels: Invalid value: \"d7a0492b0c1345023d79be75f475b3670c4109a305bdd85a36bb78d142b8febd\": must be no more than 63 characters"
...
Error: action failed after 10 attempts: failed to create cert-manager component: /v1, Kind=Namespace, /cert-manager: Namespace "cert-manager" is invalid:
metadata.labels: Invalid value: "d7a0492b0c1345023d79be75f475b3670c4109a305bdd85a36bb78d142b8febd": must be no more than 63 characters
Seems like there is a constraint on label values to be at most 63 characters in length.
https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#syntax-and-character-set
UPDATE: LOL! Apologies...just noticed that the e2e's were failing for the same reason 🙂
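One way around the 63-character limit reported above is to truncate the digest used in the label value. A minimal sketch, assuming the sha256 approach chosen earlier; the truncation length here is illustrative, not the value the PR settles on:

```go
package certmanager

import (
	"crypto/sha256"
	"fmt"
)

// manifestHashLabelValue returns a short, label-safe digest of the embedded manifest.
func manifestHashLabelValue(manifest []byte) string {
	sum := sha256.Sum256(manifest)
	// A full hex sha256 is 64 characters, one over the 63-character label limit,
	// so truncate; 16 hex characters is still plenty to detect manifest drift.
	return fmt.Sprintf("%x", sum)[:16]
}
```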
Apologies for the multiple review comments.
// persist the version number of stored resources to make a
// future enhancement to add upgrade support possible.
labels["certmanager.clusterctl.cluster.x-k8s.io/version"] = embeddedCertManagerVersion()
labels["certmanager.clusterctl.cluster.x-k8s.io/hash"] = embeddedCertManagerHash()
I feel like we should have a test somewhere to ensure these are set if we are going to be depending on them in the future.
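A sketch of the kind of test being asked for: assert that every object produced from the embedded manifest carries the version and hash labels. The helper name is an assumption; a real test would first decode the embedded YAML into unstructured objects and pass them in:

```go
package certmanager

import (
	"testing"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// assertCertManagerLabels fails the test if any object from the embedded
// manifest is missing the version or hash label.
func assertCertManagerLabels(t *testing.T, objs []unstructured.Unstructured) {
	t.Helper()
	for _, key := range []string{
		"certmanager.clusterctl.cluster.x-k8s.io/version",
		"certmanager.clusterctl.cluster.x-k8s.io/hash",
	} {
		for _, o := range objs {
			if _, ok := o.GetLabels()[key]; !ok {
				t.Errorf("%s %q is missing label %q", o.GetKind(), o.GetName(), key)
			}
		}
	}
}
```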
if err := cm.deleteObj(obj); err != nil {
	log.V(5).Info("Failed to delete cert-manager test resource", logf.UnstructuredToValues(obj)...)
	return err
I feel like there may be some interesting behavior here. If we delete the namespace `cert-manager-test` and then delete the certificate/issuer, could we run into a NotFound error, since deleting the namespace would delete the resources inside it?
I haven't run into it while testing manually... but I'm wondering if we should ignore NotFound errors here. WDYT?
Maybe we can simplify and just delete the ns...
I've adjusted this to tolerate NotFound errors instead for now; given there is no specific namespace creation logic, this seems simpler than having to inspect the manifests and infer what namespace to delete. This area can be improved as you've noted above by randomising the namespace name, which I think will make deletion a little simpler then too 😄
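For reference, a minimal sketch of the "tolerate NotFound on cleanup" behaviour described here, assuming a controller-runtime client; the helper name is illustrative:

```go
package certmanager

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteTestObj deletes one cert-manager test resource and treats NotFound as
// success, since deleting the test namespace may already have removed the
// objects inside it.
func deleteTestObj(ctx context.Context, c client.Client, obj *unstructured.Unstructured) error {
	if err := c.Delete(ctx, obj); err != nil && !apierrors.IsNotFound(err) {
		return err
	}
	return nil
}
```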
Hey folks, now that we've released v0.3.7 I was wondering about the status of this PR; any particular updates or action items?
I'm +1 to rally and get this merged.
If we are looking to get this merged sooner rather than later, we can create separate issues and address some of the comments above. My comments aren't necessarily blockers.
/milestone v0.3.9
@munnerz is it ok for you to try to get this merged in this iteration?
Thank you @munnerz for finishing this up! Changes LGTM, can we squash commits? /assign @fabriziopandini @wfernandes
/approve
I'll do another pass on this now. Thanks @munnerz!
/lgtm
Waiting on commit squash.
"sigs.k8s.io/cluster-api/cmd/clusterctl/internal/scheme" | ||
"sigs.k8s.io/cluster-api/cmd/clusterctl/internal/test" | ||
) | ||
|
||
func Test_VersionMarkerUpToDate(t *testing.T) { |
nit: we can use `g := NewWithT(t)` and use gomega matchers here. This is not a blocker though.
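A small sketch of the gomega style being suggested; the values asserted here are placeholders purely to show the `NewWithT` pattern, not the real expectations of Test_VersionMarkerUpToDate:

```go
package certmanager

import (
	"testing"

	. "github.com/onsi/gomega"
)

func Test_Example_GomegaStyle(t *testing.T) {
	g := NewWithT(t)

	// Placeholder values; a real test would read the embedded version marker and hash.
	version, hash := "v0.16.0", "0123456789abcdef"

	// gomega matchers produce descriptive failure messages without hand-written formatting.
	g.Expect(version).To(HavePrefix("v"))
	g.Expect(hash).NotTo(BeEmpty())
}
```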
Thanks @munnerz! Looks great!
This also updates the installer to better test for API group readiness, as well as leaving version markers to make it possible to upgrade cert-manager in a future pull request.
Done! Squashed and reworded the commit message to summarise the changes.
@munnerz: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
For visibility on the apidiff failure, as discussed:
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: fabriziopandini, vincepri
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/hold cancel
What this PR does / why we need it:
Upgrades cert-manager to ~~v0.16.0-alpha.0~~ ~~v0.15.2~~ v0.16.0. ~~I can adjust this PR to install the latest stable version if preferred.~~ Done 🎉
The new 'wait for ready' logic should work more universally by actually attempting to create an Issuer and Certificate resource, as per the documentation we provide in the installation guide: https://cert-manager.io/docs/installation/kubernetes/#verifying-the-installation
Please note: This PR does not handle the upgrade process for cert-manager if a user already has it installed in their cluster (either through `clusterctl init` or through any other means).
Which issue(s) this PR fixes:
Fixes #2240