
✨ Upgrade cert-manager #3313

Merged
merged 1 commit into from
Aug 6, 2020

Conversation

munnerz
Member

@munnerz munnerz commented Jul 8, 2020

What this PR does / why we need it:

Upgrades cert-manager to ~~v0.16.0-alpha.0~~ ~~v0.15.2~~ v0.16.0.

I can adjust this PR to install the latest stable version if preferred. Done 🎉

The new 'wait for ready' logic should work more universally by actually attempting to create an Issuer and a Certificate resource, as per the documentation we provide in the installation guide: https://cert-manager.io/docs/installation/kubernetes/#verifying-the-installation

Please note: This PR does not handle the upgrade process for cert-manager if a user already has it installed in their cluster (either through clusterctl init or through any other means).

Which issue(s) this PR fixes:
Fixes #2240

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 8, 2020
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jul 8, 2020
@vincepri
Member

vincepri commented Jul 8, 2020

/hold

To wait for v0.3.7 to be released first

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 8, 2020
@munnerz
Member Author

munnerz commented Jul 8, 2020

apidiff failing due to:

  Incompatible changes:
  - CertManagerClient.EnsureCertManager: added
  - CertManagerClient.EnsureWebhook: removed

I can keep the old method name around if preferred, and maybe introduce the new name as an extra function and deprecate the old one? EnsureWebhook isn't really very descriptive, given the method actually installs cert-manager, not just the webhook :)
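
Had compatibility mattered, the deprecation approach floated above would be the usual shim pattern. A hypothetical sketch (the stub type and method body are illustrative, not clusterctl's actual implementation):

```go
package main

import "fmt"

// certManagerClient is a stub standing in for clusterctl's client type.
type certManagerClient struct{}

// EnsureCertManager installs cert-manager (stubbed out here).
func (c *certManagerClient) EnsureCertManager() error {
	fmt.Println("installing cert-manager")
	return nil
}

// Deprecated: EnsureWebhook is kept only for API compatibility. The
// method installs all of cert-manager, not just the webhook; use
// EnsureCertManager instead.
func (c *certManagerClient) EnsureWebhook() error {
	return c.EnsureCertManager()
}

func main() {
	var c certManagerClient
	_ = c.EnsureWebhook() // old callers keep working via the shim
}
```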

@vincepri
Member

vincepri commented Jul 8, 2020

The job is optional, I should have said something earlier. AFAIK these methods are only used within clusterctl, so it should be fine to change

@vincepri
Member

vincepri commented Jul 8, 2020

/milestone v0.3.x

@k8s-ci-robot k8s-ci-robot added this to the v0.3.x milestone Jul 8, 2020
Contributor

@wfernandes wfernandes left a comment


Mostly looks good to me. Just some minor nits.
UPDATE: I also manually tested this using clusterctl init on a new kind cluster and everything was Running smoothly. Yay!
Thanks for this PR!

cmd/clusterctl/client/cluster/cert_manager.go
// Nb. we are ignoring the error so this operation can support listing images even if there is no existing management cluster;
// in case there is no existing management cluster, we assume there is no web-hook installed in the cluster.
hasWebhook, _ := cm.hasWebhook()
if hasWebhook {
Member Author

I'm not certain, but this bit of code may still be required? If a user runs 'init' again on an existing cluster, what should this area of code do?

log := logf.Log

// Checks if the cert-manager web-hook already exists; if yes, exit immediately
Member Author

I suspect we still want this check too? Again, to avoid anything actually happening if 'clusterctl init' is run 😄

Member Author

I've re-added this check with an additional function that calls Get on each resource in the cluster to verify whether cert-manager resources already exist before Creating them all again.

Now that I think some more though, it may be better to just call Create(), because currently if a single IsNotFound error is encountered, we go ahead and call Create anyway.

Whilst we may not want to address a proper upgrade process in this PR, because it upgrades the manifest used we probably want to implement some guards to avoid users downloading the new clusterctl and running init against an existing management cluster, as they could end up with orphaned resources left over from the previous release.

I'm actually not 100% certain on how best to address this - updating the manifests in place without a defined upgrade process is tough, because we don't have any means to understand whether an older version is installed (unless we retain both copies of the cert-manager install manifests and explicitly check whether all resources in an older version exist, but that gets... slippery).

We do have this annotation on resources: helm.sh/chart: 'cert-manager-v0.15.2', but I'm not sure relying on the cert-manager project to always set this in this format is wise. Perhaps we could annotate created objects (similar to what's done already), but with a version number too? For the 'transitionary period', users coming from earlier clusterctl versions could be detected by checking for the cert-manager value with no version annotation in the clusterctl.cluster.x-k8s.io/core annotation. But without recording exactly which resources were required in earlier versions of cert-manager, it won't be possible for us to be certain all resources exist (i.e. if a previous invocation of clusterctl init failed halfway through, we'll struggle to recover).
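
The detection idea above can be sketched as a small pure function. The annotation keys here are hypothetical stand-ins mirroring the discussion, not necessarily the exact keys clusterctl ended up using:

```go
package main

import "fmt"

// Hypothetical annotation keys, mirroring the idea discussed above.
const (
	coreAnnotation    = "clusterctl.cluster.x-k8s.io/core"
	versionAnnotation = "certmanager.clusterctl.cluster.x-k8s.io/version"
)

// installedVersion inspects a resource's annotations and reports which
// cert-manager version clusterctl believes is installed. An object
// managed by an older clusterctl has the core annotation but no
// version annotation (the 'transitionary period' case).
func installedVersion(annotations map[string]string) (version string, legacy bool) {
	if v, ok := annotations[versionAnnotation]; ok {
		return v, false
	}
	if annotations[coreAnnotation] == "cert-manager" {
		return "", true // installed by an older clusterctl; version unknown
	}
	return "", false // not managed by clusterctl at all
}

func main() {
	v, legacy := installedVersion(map[string]string{versionAnnotation: "v0.16.0"})
	fmt.Println(v, legacy) // prints: v0.16.0 false
	_, legacy = installedVersion(map[string]string{coreAnnotation: "cert-manager"})
	fmt.Println(legacy) // prints: true
}
```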

@munnerz munnerz force-pushed the upgrade-cert-manager branch 2 times, most recently from 936562d to 465adf5 Compare July 9, 2020 09:15
cmd/clusterctl/client/cluster/cert_manager.go
// Issuer and Certificate).
// This ensures that the Kubernetes apiserver is ready to serve resources within the
// cert-manager API group.
func (cm *certManagerClient) waitForAPIReady(ctx context.Context, log logr.Logger) error {
Member

Just for confirmation: in this func we are creating an issuer/a certificate, but we are not waiting for the certificate to be processed as documented at https://cert-manager.io/docs/installation/kubernetes/#verifying-the-installation.
Is this enough for assuming cert-manager is working?

Member

Another question: as far as I know cert-manager, this test can work with both v0.11.0 and v0.15.2

Member Author

Is this enough for assuming cert-manager is working?

So it depends on how you want to handle this - you are correct, that it may not yet be ready to issue certificates as the controller could still be leader electing. However, presumably the parts of code that create the Certificate resource also then wait for those Certificates to be Ready before consuming the signed certificate/private key. Would you rather wait at the start for the controller to be ready, or wait until further into execution? Easy enough to change either way 😄

Another question: as far as I know cert-manager, this test can work with both v0.11.0 and v0.15.2

Correct, this is a far more universal test - if this stops working, it'll be because:

  1. The v1alpha2 API version no longer exists in cert-manager
  2. We made a big mistake in cert-manager 😅

// This is required to check the web-hook is working and ready to accept requests.
func (cm *certManagerClient) isWebhookAvailable(webhook *unstructured.Unstructured) (bool, error) {
conditions, found, err := unstructured.NestedSlice(webhook.Object, "status", "conditions")
testObjs, err := getTestResourcesManifestObjs()
Member

Ideally, we should randomize the namespace name, but this can be addressed in a follow up as well


// installs the web-hook
// Install all cert-manager manifests
Member

Question. Should we start to add a version label to each component so we can properly plan for cert-manager upgrades?

Member Author

This definitely seems sensible - but we will need to store that version number somewhere, and remember to update it if the YAML asset changes. I wonder if we can have an automated test for this..? Maybe even just take a hash of the YAML file and have a variable: versionNumber which is a tuple of the version number and the hash of that file?

That way if you change the file, the hash will no longer match, forcing people to look at that area of code?

Perhaps you have other instances of this somewhere and have an existing pattern to solve it? 😄
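
The version-plus-hash tuple idea can be sketched as follows. This is a simplified illustration, not the code merged in the PR: the manifest content, names, and the pinned version are placeholders, and here the pinned hash is computed at startup rather than hardcoded as a reviewed constant:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// embeddedCertManagerManifest stands in for the embedded YAML asset.
var embeddedCertManagerManifest = []byte("apiVersion: v1\nkind: Namespace\n")

// versionMarker pins a version number to a hash of the manifest it was
// reviewed against, so a test can fail loudly when the YAML changes but
// the version constant was not revisited.
type versionMarker struct {
	Version string
	Hash    string
}

var pinned = versionMarker{
	Version: "v0.16.0",
	Hash:    fmt.Sprintf("%x", sha256.Sum256(embeddedCertManagerManifest)),
}

// markerUpToDate recomputes the hash and compares it to the pinned one.
func markerUpToDate(manifest []byte) bool {
	return fmt.Sprintf("%x", sha256.Sum256(manifest)) == pinned.Hash
}

func main() {
	fmt.Println(markerUpToDate(embeddedCertManagerManifest))            // prints: true
	fmt.Println(markerUpToDate(append(embeddedCertManagerManifest, 'x'))) // prints: false
}
```

In the real setup the hash would live next to the version constant in source, so editing the YAML without touching that area of code breaks the test.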

Member Author

Done in 4aa658b - keen to hear your comments on the technique for keeping the version number up to date.

Member

I personally like the idea of the hash, so we can manage adjustment to the manifest without bumping a release.

cmd/clusterctl/config/assets/cert-manager.yaml
cmd/clusterctl/client/init.go
cmd/clusterctl/client/cluster/cert_manager.go
@munnerz
Member Author

munnerz commented Jul 9, 2020

Addressed your feedback - should be good for another look!

The only other thing I'm concerned about now is the limitations/requirements about label values - from what I can tell, every possible character we are using right now is valid/part of the character set permitted for labels. But it may be better to use annotations for some of this data if indexing/selecting based on these values is not required :)

@munnerz
Member Author

munnerz commented Jul 9, 2020

Seeing:

cmd/clusterctl/client/cluster/cert_manager_test.go:41:34: G401: Use of weak cryptographic primitive (gosec)
	actualHash := fmt.Sprintf("%x", md5.Sum(yaml))

Before we go ahead and merge this, do we want to stick with MD5 for this hashing? I don't think we're so concerned about the cryptographic properties of the hash function, but it'll be a lot easier to get it right now than to change it later 😄

@vincepri
Member

vincepri commented Jul 9, 2020

Let's just use sha256?

@munnerz
Member Author

munnerz commented Jul 9, 2020

Let's just use sha256?

Done ✅

@vincepri
Member

vincepri commented Jul 9, 2020

Seems like we might have to limit the hash to fit in a label; IIRC they can be limited to as few as 10 chars

Contributor

@wfernandes wfernandes left a comment


So I ran into the following problem.

  1. Created kind cluster
  2. clusterctl init on that cluster with older clusterctl (that is, clusterctl built from master branch).
  3. Built clusterctl using this branch and ran clusterctl init.
    Got the following output:
Using configuration File="/Users/wfernandes/.cluster-api/clusterctl.yaml"
Fetching providers
Waiting for cert-manager to be available...
Creating Namespace="cert-manager-test"
Failed to create cert-manager test resource Namespace="cert-manager-test"
Installing cert-manager
Creating Namespace="cert-manager"
Operation failed, retrying with backoff Cause="failed to create cert-manager component: /v1, Kind=Namespace, /cert-manager: Namespace \"cert-manager\" is
invalid: metadata.labels: Invalid value: \"d7a0492b0c1345023d79be75f475b3670c4109a305bdd85a36bb78d142b8febd\": must be no more than 63 characters"
...
Error: action failed after 10 attempts: failed to create cert-manager component: /v1, Kind=Namespace, /cert-manager: Namespace "cert-manager" is invalid:
metadata.labels: Invalid value: "d7a0492b0c1345023d79be75f475b3670c4109a305bdd85a36bb78d142b8febd": must be no more than 63 characters

Seems like there is a constraint that label values be no more than 63 characters in length.
https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#syntax-and-character-set

UPDATE: LOL! Apologies...just noticed that the e2e's were failing for the same reason 🙂
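
The fix implied by the error above is to truncate the digest before using it as a label value. A minimal sketch of that idea, assuming the hash only guards against accidental manifest drift (so truncating a sha256 hex digest is acceptable); the function name is illustrative, not clusterctl's:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// Kubernetes label values are capped at 63 characters.
const maxLabelLength = 63

// shortHash returns a label-safe, truncated sha256 hex digest of the
// manifest. A full sha256 hex digest is 64 characters, one over the
// limit, which is exactly the failure seen in the review above.
func shortHash(manifest []byte) string {
	full := fmt.Sprintf("%x", sha256.Sum256(manifest))
	if len(full) > maxLabelLength {
		return full[:maxLabelLength]
	}
	return full
}

func main() {
	h := shortHash([]byte("apiVersion: v1\n"))
	fmt.Println(len(h)) // prints: 63
}
```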

cmd/clusterctl/client/cluster/cert_manager.go
Contributor

@wfernandes wfernandes left a comment


Apologies for the multiple review comments.

cmd/clusterctl/client/cluster/cert_manager.go
Comment on lines 216 to 219
// persist the version number of stored resources to make a
// future enhancement to add upgrade support possible.
labels["certmanager.clusterctl.cluster.x-k8s.io/version"] = embeddedCertManagerVersion()
labels["certmanager.clusterctl.cluster.x-k8s.io/hash"] = embeddedCertManagerHash()
Contributor

I feel like we should have a test somewhere to ensure these are set if we are going to be depending on them in the future.

Comment on lines 288 to 302
if err := cm.deleteObj(obj); err != nil {
log.V(5).Info("Failed to delete cert-manager test resource", logf.UnstructuredToValues(obj)...)
return err
Contributor

I feel like there may be some interesting behavior here. If we delete the namespace cert-manager-test and then delete the certificate/issuer, could we run into a chance of getting a NotFound error, since deleting the namespace would delete the stuff inside it?

I haven't run into it while testing manually...but I'm wondering if we should ignore NotFound errors here. WDYT?

Member

Maybe we can simplify and just delete the ns...

Member Author

I've adjusted this to tolerate NotFound errors instead for now; given there is no specific namespace creation logic, this seems simpler than having to inspect the manifests and infer which namespace to delete. This area can be improved, as you've noted above, by randomising the namespace name, which I think will make deletion a little simpler too 😄

@vincepri
Member

Hey folks, now that we've released v0.3.7 I was wondering about the status of this PR, any particular updates or action items?

@fabriziopandini
Member

I'm +1 to rally and get this merged.
My comments are already addressed, and I have a WIP PR ready covering the upgrade of existing clusters as soon as this one merges.
But I see some comments from @wfernandes still to be addressed.

@wfernandes
Contributor

If we are looking to get this merged in sooner rather than later, we can create separate issues and address some of the comments above. My comments aren't necessarily blockers.
However, we still need to fix the cause of the failing e2e test.

@vincepri
Member

/milestone v0.3.9

@fabriziopandini
Member

@munnerz is it ok for you trying to get this merged in this iteration?

@vincepri
Member

vincepri commented Aug 5, 2020

Thank you @munnerz for finishing this up!

Changes LGTM, can we squash commits?

/assign @fabriziopandini @wfernandes

@vincepri
Member

vincepri commented Aug 5, 2020

/approve
/hold
Feel free to unhold after squash

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 5, 2020
@wfernandes
Contributor

I'll do another pass on this now. Thanks @munnerz!

Copy link
Contributor

@wfernandes wfernandes left a comment


/lgtm
Waiting on commit squash.

"sigs.k8s.io/cluster-api/cmd/clusterctl/internal/scheme"
"sigs.k8s.io/cluster-api/cmd/clusterctl/internal/test"
)

func Test_VersionMarkerUpToDate(t *testing.T) {
Contributor

nit: we can use g := NewWithT(t) and use gomega matchers here.
This is not a blocker though.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 5, 2020
@fabriziopandini
Member

Thanks @munnerz! looks great!
After squash, this is ready to go!!!

This also updates the installer to better test for API group readiness, as well
as leaving version markers to make it possible to upgrade cert-manager in a future
pull request.
@munnerz munnerz force-pushed the upgrade-cert-manager branch from effa8da to 19a37e6 Compare August 6, 2020 10:02
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 6, 2020
@munnerz
Member Author

munnerz commented Aug 6, 2020

Done! Squashed and reworded the commit message to summarise the changes.

@k8s-ci-robot
Contributor

@munnerz: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
pull-cluster-api-apidiff 19a37e6 link /test pull-cluster-api-apidiff


@munnerz
Member Author

munnerz commented Aug 6, 2020

For visibility on the apidiff failure, as discussed:

sigs.k8s.io/cluster-api/cmd/clusterctl/client/cluster
  Incompatible changes:
  - CertManagerClient.EnsureInstalled: added
  - CertManagerClient.EnsureWebhook: removed

@fabriziopandini
Member

/lgtm
/approve
Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 6, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fabriziopandini, vincepri

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fabriziopandini
Member

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 6, 2020
@k8s-ci-robot k8s-ci-robot merged commit ad631d7 into kubernetes-sigs:master Aug 6, 2020