Helm 3 resource installation order is not respected #2227
Labels
area/helm
kind/question
Questions about existing features
mro2
Monica's list of 2nd tier overlay related issues
resolution/fixed
This issue was fixed
What happened?
When deploying a Helm chart using the helm.v3.Chart resource, all resources are deployed in parallel. This causes issues when a Helm chart is designed to follow the expected deployment order documented here (and linked to the InstallerOrder var in the Helm code). For example, a Secret is deployed before the Pod or Deployment that uses it.
Steps to reproduce
Deploy any chart using multiple resources
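A minimal sketch of the kind of program that reproduces this, assuming TypeScript and using the Datadog chart purely as an example (chart name, repo URL, and values are illustrative; any chart that ships a Secret plus workloads consuming it will do):

```typescript
import * as k8s from "@pulumi/kubernetes";

// Example only: deploy a chart whose templates include both a Secret and
// workloads (Deployments/DaemonSets) that consume that Secret.
const chart = new k8s.helm.v3.Chart("datadog", {
    chart: "datadog",
    fetchOpts: { repo: "https://helm.datadoghq.com" },
});

// On `pulumi up`, the rendered resources are registered and created in
// parallel, not in Helm's documented install order, so a workload can be
// created before the Secret it mounts.
```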
Expected Behavior
Deploy resources in documented order
Actual Behavior
Resources are not deployed in documented order
Output of pulumi about
Additional context
This is probably related to #1671 in which @lblackstone said:
It does solve the issue in most situations; however, some charts, such as the Datadog Helm chart, rely on this mechanism to update a Secret shared across Pods and ensure the Pods are restarted, using this pattern:
- A Secret is shared between two Pods (A and B)
- A and B rely on this secret for communication
- A expects B to provide the shared secret value at runtime
- If B can't connect to A on startup, it fails

For this to work, we must deploy the Secret/ConfigMap before any Pod specs are applied; otherwise a new Pod may be created before the new Secret exists, resulting in some Pods using the old Secret and other Pods using the new one, causing havoc.
"Hey, but your pod won't be healthy and will be restarted using new secret". Not necessarily, considering that
A
is passively waiting forB
to provide the shared secret. What may happen is:A
is started using old secretB
is started using new secretB
fails to contactA
because of invalid auth (as their secrets do not match).B
is unealthy and restarted (still using new secret), again and again, ending-up in CrashLoopBackoffA
seems "healthy" but is in fact using old secret and will never be restartedA simple workaround is of course to force recreate (restart)
A
manually, but the issue will keep happening from time to time.To be more specific, we had this issue with Datadog Helm Charts because Datadog Cluster Agent expect an auth token from Node Agents (each Node Agent report to a single Cluster Agent), and it's exactly what happened: Cluster Agent was restarted too soon using old auth token, and all new Node Agents failed because they couldn't reach the Cluster Agent.
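For completeness, a hedged sketch of scripting that forced restart from the Pulumi side (the name match and annotation key are made up): a chart transformation can stamp A's pod template so every update rolls it, which papers over the problem without fixing the underlying ordering.

```typescript
import * as k8s from "@pulumi/kubernetes";

// Sketch only: force Deployment A (matched here by name) to roll on every
// update so it picks up the latest Secret, similar to what
// `kubectl rollout restart` does by changing a pod-template annotation.
const chart = new k8s.helm.v3.Chart("datadog", {
    chart: "datadog",
    fetchOpts: { repo: "https://helm.datadoghq.com" },
    transformations: [
        (obj: any) => {
            if (obj.kind === "Deployment" && obj.metadata?.name?.includes("cluster-agent")) {
                obj.spec.template.metadata = obj.spec.template.metadata ?? {};
                obj.spec.template.metadata.annotations = {
                    ...(obj.spec.template.metadata.annotations ?? {}),
                    // Made-up annotation key; any changing value forces a rollout.
                    "example.com/restarted-at": new Date().toISOString(),
                };
            }
        },
    ],
});
```

This keeps A and B's secrets in sync, but it forces a restart on every update, so it is a stopgap rather than a fix.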
I guess such situations may be common in other contexts as well, such as clustered systems like databases, Redis, etc.
Contributing
Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).