Helm Operator fails when two different types of Custom Resources have same name #3357

Closed
pre opened this issue Jul 8, 2020 · 10 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. language/helm Issue is related to a Helm operator project lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@pre
Contributor

pre commented Jul 8, 2020

Bug Report

The Helm operator writes Helm-specific metadata about the deployed Custom Resource (an instance of a Chart) to a Kubernetes Secret.
The name of the Secret contains the name of the Custom Resource but not the name of the Custom Resource Definition. As a result, an instance of CRD_A and an instance of CRD_B will corrupt each other's K8s Secret if CR_A and CR_B have the same name.

For example:

  • Cluster has two Custom Resource Definitions (CRD): Lolcat and Doge.
  • Each CRD has a Helm Operator watching it.
  • When a Lolcat named control is deployed, its Helm Operator will create a Kubernetes Secret named sh.helm.release.v1.control.v1.
  • When a Doge named control is deployed, its Helm Operator will create a Kubernetes Secret using the same name, sh.helm.release.v1.control.v1.

This Secret contains Helm's own internal metadata. Since the metadata in sh.helm.release.v1.control.v1 was about Lolcat, the change set for Doge applied later will cause the Helm Operator to fail.
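For illustration, here is a minimal sketch of the naming convention these Secrets appear to follow, based on the names in the logs below; releaseSecretName is a hypothetical helper, not Helm's or the operator's actual code:

```go
// Hypothetical helper mirroring the release Secret naming seen in the logs
// below; this is an illustration, not Helm's or the operator's actual code.
package naming

import "fmt"

// releaseSecretName builds the per-revision release Secret name. Only the
// release name (taken from the CR name) and the revision appear in the key;
// the CRD/kind does not.
func releaseSecretName(releaseName string, revision int) string {
	return fmt.Sprintf("sh.helm.release.v1.%s.v%d", releaseName, revision)
}

// releaseSecretName("control", 1) -> "sh.helm.release.v1.control.v1"
// for the Lolcat CR named "control" and for the Doge CR named "control"
// alike, so both operators read and write the same Secret.
```

Since nothing in that key identifies the CRD or kind, the Lolcat and Doge operators end up sharing one Secret.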

As a corollary, the name of any Custom Resource backed by a Helm Operator must be unique in a given namespace.

While it is possible to have two independent and de-coupled Custom Resource Definitions (Lolcat and Doge), any instance of a Custom Resource backed by the Helm Operator must have a unique name in that namespace. This is surprising and not obvious to debug when you see the error message for the first time.

Even if this cannot be fixed due to Helm internals, it would save a significant amount of cumulative debugging time if the Helm Operator gave a sensible error message. The current error message is about Helm metadata belonging to the wrong Custom Resource instance.

In the snippet below, an instance of Doge named control fails because an instance of Lolcat named control had been deployed earlier. As you can see, the name of the Secret is sh.helm.release.v1.control.v6, since Lolcat was already at v5. However, it should have been v1, since this was the first deployment of Doge.

{"level":"error","ts":1594219438.7016752,"logger":"helm.controller","msg":"Release failed","namespace":"dev","name":"control","apiVersion":"rdx.net/v1alpha1","kind":"AccessConfig","release":"control","error":"failed update (update: failed to update: secrets \"sh.helm.release.v1.control.v6\" not found) and failed rollback: release: not found","stacktrace":[..] }

{"level":"error","ts":1594219438.7058475,"logger":"controller-runtime.controller","msg":"Reconciler error","controller":"accessconfig-controller","request":"dev/control","error":"failed update (update: failed to update: secrets \"sh.helm.release.v1.control.v6\" not found) and failed rollback: release: not found","stacktrace": [..] }

After this first message, the later error messages are about incorrectly trying to adopt existing resources (of the other CRD).
What happens is that Doge has now corrupted the Secret that belonged to Lolcat. As a result, the Helm Operator of Lolcat now sees the wrong metadata and starts failing with the error below. The problem is that meta.helm.sh/release-name now refers to Doge even though this was Lolcat's Helm metadata.

{"level":"error","ts":1594220624.2125375,"logger":"controller-runtime.controller","msg":"Reconciler error","controller":"lolcat-controller","request":"dev/control","error":"failed to get candidate release: rendered manifests contain a resource that already exists. Unable to continue with update: Deployment \"a-deployment\" in namespace \"dev\" exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: missing key \"meta.helm.sh/release-name\": must be set to \"control\"; annotation validation error: missing key \"meta.helm.sh/release-namespace\": must be set to \"dev\""

Environment

  • operator-sdk version: v0.18.2

  • Kubernetes cluster kind: Minikube

  • Are you writing your operator in ansible, helm, or go?
    Helm

Possible Solution

Possible fix: Include the name of the Custom Resource Definition in the Helm Secret's name?

Remedy: Provide a sensible error message.

@estroz estroz added language/helm Issue is related to a Helm operator project triage/support Indicates an issue that is a support question. labels Jul 9, 2020
@estroz
Member

estroz commented Jul 9, 2020

/cc @joelanford @camilamacedo86

@camilamacedo86 camilamacedo86 self-assigned this Jul 9, 2020
@camilamacedo86
Contributor

Hi @pre,

Thanks for raising this. I will be looking into it and will keep you updated.

@joelanford
Member

joelanford commented Jul 9, 2020

@pre Yeah this is a known issue caused by a limitation of Helm. The Helm operator uses Helm's install/upgrade/uninstall code to manage the releases. As a result, the Helm operator uses Helm's release secrets, and therefore the same limitation that applies to Helm releases (no two releases can share the same name in the same namespace) applies to CRs managed by the Helm operator.

Possible fix: Include the name of the Custom Resource Definition in the Helm Secret's name?

Helm controls the release secret name, which is based on the release name. And the release name is taken directly from the CR name. We used to include an encoded version of the CR's UID in the release name, but that caused other issues.

See #1094, #1818, #2918

We may be able to do this logic (assuming an existing release); see the sketch after the list:

  1. Get the release based on the CR name
  2. Verify that the chart associated with the release is the same chart that the reconciler is configured to use. Fail with a helpful error message if the reconciler chart is different than the release's chart. (Would we just check the chart name? or something else?)
  3. Continue with the upgrade-or-reconcile logic.
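To make the proposal concrete, here is a rough sketch of step 2, assuming the check is on the chart name only and that an initialized helm.sh/helm/v3 action.Configuration is available. This is not existing operator code; verifyReleaseChart is a hypothetical helper:

```go
// A rough sketch of step 2 of the proposal above; not existing operator code.
package reconcile

import (
	"fmt"

	"helm.sh/helm/v3/pkg/action"
	"helm.sh/helm/v3/pkg/release"
)

// verifyReleaseChart fetches the release stored under the CR's name and fails
// with a descriptive error if it was created from a different chart.
func verifyReleaseChart(cfg *action.Configuration, crName, expectedChart string) (*release.Release, error) {
	rel, err := action.NewGet(cfg).Run(crName)
	if err != nil {
		// e.g. driver.ErrReleaseNotFound: no existing release, proceed with install.
		return nil, err
	}
	if got := rel.Chart.Name(); got != expectedChart {
		return nil, fmt.Errorf(
			"release %q already exists but was created from chart %q, not %q; "+
				"CR names backed by Helm operators must be unique per namespace",
			crName, got, expectedChart)
	}
	return rel, nil
}
```

Checking only the chart name would not catch two different charts that happen to share a name, so name plus version, or a digest, might be a safer identity check.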

An even better option would be to run a ValidatingAdmissionWebhook that would handle CR create events and check that a release does not already exist with the same name as the CR. This would prevent the CR with the duplicate name from being created in the first place. The downside there is that it makes the operator a bit harder to deploy.
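As a rough illustration of that idea (not an existing operator-sdk feature), a controller-runtime admission handler could deny creation when a release with the CR's name already exists; releaseExists here is a hypothetical lookup, e.g. wrapping a Helm storage or action Get:

```go
// A rough sketch of the webhook idea; not an existing operator-sdk feature.
package webhooks

import (
	"context"
	"fmt"
	"net/http"

	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

type duplicateNameValidator struct {
	// releaseExists reports whether a Helm release with this name already
	// exists in the namespace; its implementation is left out of this sketch.
	releaseExists func(namespace, name string) (bool, error)
}

// Handle rejects CR CREATE requests whose metadata.name collides with an
// existing Helm release in the same namespace.
func (v *duplicateNameValidator) Handle(ctx context.Context, req admission.Request) admission.Response {
	exists, err := v.releaseExists(req.Namespace, req.Name)
	if err != nil {
		return admission.Errored(http.StatusInternalServerError, err)
	}
	if exists {
		return admission.Denied(fmt.Sprintf(
			"a Helm release named %q already exists in namespace %q; choose a different metadata.name",
			req.Name, req.Namespace))
	}
	return admission.Allowed("")
}
```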

@camilamacedo86 camilamacedo86 added kind/feature Categorizes issue or PR as related to a new feature. and removed triage/support Indicates an issue that is a support question. labels Jul 13, 2020
@kensipe kensipe added this to the Backlog milestone Jul 13, 2020
@camilamacedo86 camilamacedo86 removed their assignment Jul 15, 2020
@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 28, 2020
@pre
Contributor Author

pre commented Nov 2, 2020

Fail with a helpful error message if the reconciler chart is different than the release's chart.

As an MVP for a better user experience, a helpful error message would be great. The current error is very confusing: if you don't remember that this issue exists, you may end up spending quite a lot of time figuring out what's going on.

/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 2, 2020
@drewwells

The {uid} suffix fixes this problem (#1818). To make matters worse, you can't add OpenAPI patterns to enforce a naming scheme on the CR. This leaves us having to write an admission webhook to enforce a name.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 4, 2021
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 6, 2021
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
