Helm Operator fails when two different types of Custom Resources have same name #3357

Closed
pre opened this issue Jul 8, 2020 · 10 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. language/helm Issue is related to a Helm operator project lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@pre
Contributor

pre commented Jul 8, 2020

Bug Report

The Helm operator writes Helm-specific metadata about the deployed Custom Resource (an instance of a Chart) to a Kubernetes Secret.
The name of the Secret contains the name of the Custom Resource but not the name of the Custom Resource Definition. As a result, an instance of CRD_A and an instance of CRD_B will corrupt each other's K8s Secret if CR_A and CR_B have the same name.

For example:

  • Cluster has two Custom Resource Definitions (CRD): Lolcat and Doge.
  • Each CRD has a Helm Operator watching it.
  • When a Lolcat named control is deployed, its Helm Operator will create a Kubernetes Secret named sh.helm.release.v1.control.v1.
  • When a Doge named control is deployed, its Helm Operator will create a Kubernetes Secret using the same name, sh.helm.release.v1.control.v1.

This Secret contains Helm's own internal metadata. Since the metadata in sh.helm.release.v1.control.v1 was about Lolcat, the change set for Doge applied later will cause the Helm Operator to fail.
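For illustration, here is a minimal sketch of the naming convention these Secrets appear to follow, based on the names in the logs below; releaseSecretName is a hypothetical helper, not Helm's or the operator's actual code:

```go
// Hypothetical helper mirroring the release Secret naming seen in the logs
// below; this is an illustration, not Helm's or the operator's actual code.
package naming

import "fmt"

// releaseSecretName builds the per-revision release Secret name. Only the
// release name (taken from the CR name) and the revision appear in the key;
// the CRD/kind does not.
func releaseSecretName(releaseName string, revision int) string {
	return fmt.Sprintf("sh.helm.release.v1.%s.v%d", releaseName, revision)
}

// releaseSecretName("control", 1) -> "sh.helm.release.v1.control.v1"
// for the Lolcat CR named "control" and for the Doge CR named "control"
// alike, so both operators read and write the same Secret.
```

Since nothing in that key identifies the CRD or kind, the Lolcat and Doge operators end up sharing one Secret.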

As a corollary, the name of any Custom Resource backed by a Helm Operator must be unique in a given namespace.

While it is possible to have two independent and de-coupled Custom Resource Definitions (Lolcat and Doge), any instance of a Custom Resource backed by the Helm Operator must have a unique name in that namespace. This is surprising and not obvious to debug when you see the error message for the first time.

Even if this cannot be fixed due to Helm internals, it would save a significant amount of cumulative debugging time if the Helm Operator gave a sensible error message. The current error message is about Helm metadata belonging to the wrong Custom Resource instance.

In the snippet below, an instance of Doge named control fails because an instance of Lolcat named control had been deployed earlier. As you can see, the name of the Secret is sh.helm.release.v1.control.v6, since Lolcat was already at v5. However, it should have been v1, since this was the first deployment of Doge.

{"level":"error","ts":1594219438.7016752,"logger":"helm.controller","msg":"Release failed","namespace":"dev","name":"control","apiVersion":"rdx.net/v1alpha1","kind":"AccessConfig","release":"control","error":"failed update (update: failed to update: secrets \"sh.helm.release.v1.control.v6\" not found) and failed rollback: release: not found","stacktrace":[..] }

{"level":"error","ts":1594219438.7058475,"logger":"controller-runtime.controller","msg":"Reconciler error","controller":"accessconfig-controller","request":"dev/control","error":"failed update (update: failed to update: secrets \"sh.helm.release.v1.control.v6\" not found) and failed rollback: release: not found","stacktrace": [..] }

After this first message, the later error messages are about incorrectly trying to adopt existing resources (of the other CRD).
What happens is that Doge has now corrupted the Secret that belonged to Lolcat. As a result, the Helm Operator of Lolcat now sees the wrong metadata and starts failing with the error below. The problem is that meta.helm.sh/release-name now refers to Doge even though this was Lolcat's Helm metadata.

{"level":"error","ts":1594220624.2125375,"logger":"controller-runtime.controller","msg":"Reconciler error","controller":"lolcat-controller","request":"dev/control","error":"failed to get candidate release: rendered manifests contain a resource that already exists. Unable to continue with update: Deployment \"a-deployment\" in namespace \"dev\" exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: missing key \"meta.helm.sh/release-name\": must be set to \"control\"; annotation validation error: missing key \"meta.helm.sh/release-namespace\": must be set to \"dev\""

Environment

  • operator-sdk version: v0.18.2

  • Kubernetes cluster kind: Minikube

  • Are you writing your operator in ansible, helm, or go?
    Helm

Possible Solution

Possible fix: Include the name of the Custom Resource Definition in the Helm Secret's name?

Remedy: Provide a sensible error message.

@estroz estroz added language/helm Issue is related to a Helm operator project triage/support Indicates an issue that is a support question. labels Jul 9, 2020
@estroz
Member

estroz commented Jul 9, 2020

/cc @joelanford @camilamacedo86

@camilamacedo86 camilamacedo86 self-assigned this Jul 9, 2020
@camilamacedo86
Contributor

Hi @pre,

Thanks for raising this. I will be looking into it and will keep you updated.

@joelanford
Member

joelanford commented Jul 9, 2020

@pre Yeah this is a known issue caused by a limitation of Helm. The Helm operator uses Helm's install/upgrade/uninstall code to manage the releases. As a result, the Helm operator uses Helm's release secrets, and therefore the same limitation that applies to Helm releases (no two releases can share the same name in the same namespace) applies to CRs managed by the Helm operator.

Possible fix: Include the name of the Custom Resource Definition in the Helm Secret's name?

Helm controls the release secret name, which is based on the release name. And the release name is taken directly from the CR name. We used to include an encoded version of the CR's UID in the release name, but that caused other issues.

See #1094, #1818, #2918

We may be able to do this logic (assuming an existing release); see the sketch after the list:

  1. Get the release based on the CR name
  2. Verify that the chart associated with the release is the same chart that the reconciler is configured to use. Fail with a helpful error message if the reconciler chart is different than the release's chart. (Would we just check the chart name? or something else?)
  3. Continue with the upgrade-or-reconcile logic.
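To make the proposal concrete, here is a rough sketch of step 2, assuming the check is on the chart name only and that an initialized helm.sh/helm/v3 action.Configuration is available. This is not existing operator code; verifyReleaseChart is a hypothetical helper:

```go
// A rough sketch of step 2 of the proposal above; not existing operator code.
package reconcile

import (
	"fmt"

	"helm.sh/helm/v3/pkg/action"
	"helm.sh/helm/v3/pkg/release"
)

// verifyReleaseChart fetches the release stored under the CR's name and fails
// with a descriptive error if it was created from a different chart.
func verifyReleaseChart(cfg *action.Configuration, crName, expectedChart string) (*release.Release, error) {
	rel, err := action.NewGet(cfg).Run(crName)
	if err != nil {
		// e.g. driver.ErrReleaseNotFound: no existing release, proceed with install.
		return nil, err
	}
	if got := rel.Chart.Name(); got != expectedChart {
		return nil, fmt.Errorf(
			"release %q already exists but was created from chart %q, not %q; "+
				"CR names backed by Helm operators must be unique per namespace",
			crName, got, expectedChart)
	}
	return rel, nil
}
```

Checking only the chart name would not catch two different charts that happen to share a name, so name plus version, or a digest, might be a safer identity check.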

An even better option would be to run a ValidatingAdmissionWebhook that would handle CR create events and check that a release does not already exist with the same name as the CR. This would prevent the CR with the duplicate name from being created in the first place. The downside there is that it makes the operator a bit harder to deploy.
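As a rough illustration of that idea (not an existing operator-sdk feature), a controller-runtime admission handler could deny creation when a release with the CR's name already exists; releaseExists here is a hypothetical lookup, e.g. wrapping a Helm storage or action Get:

```go
// A rough sketch of the webhook idea; not an existing operator-sdk feature.
package webhooks

import (
	"context"
	"fmt"
	"net/http"

	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

type duplicateNameValidator struct {
	// releaseExists reports whether a Helm release with this name already
	// exists in the namespace; its implementation is left out of this sketch.
	releaseExists func(namespace, name string) (bool, error)
}

// Handle rejects CR CREATE requests whose metadata.name collides with an
// existing Helm release in the same namespace.
func (v *duplicateNameValidator) Handle(ctx context.Context, req admission.Request) admission.Response {
	exists, err := v.releaseExists(req.Namespace, req.Name)
	if err != nil {
		return admission.Errored(http.StatusInternalServerError, err)
	}
	if exists {
		return admission.Denied(fmt.Sprintf(
			"a Helm release named %q already exists in namespace %q; choose a different metadata.name",
			req.Name, req.Namespace))
	}
	return admission.Allowed("")
}
```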

@camilamacedo86 camilamacedo86 added kind/feature Categorizes issue or PR as related to a new feature. and removed triage/support Indicates an issue that is a support question. labels Jul 13, 2020
@kensipe kensipe added this to the Backlog milestone Jul 13, 2020
@camilamacedo86 camilamacedo86 removed their assignment Jul 15, 2020
@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 28, 2020
@pre
Contributor Author

pre commented Nov 2, 2020

Fail with a helpful error message if the reconciler chart is different than the release's chart.

As an MVP for a better user experience, a helpful error message would be great. The current error is very confusing: if you don't remember that this issue exists, you may end up spending quite a lot of time figuring out what's going on.

/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 2, 2020
@drewwells

The {uid} suffix fixes this problem (#1818). To make matters worse, you can't add OpenAPI patterns to enforce a naming scheme on the CR. This leaves us having to write an admission webhook to enforce a name.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 4, 2021
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 6, 2021
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
