[opentelemetry-helm-charts/opentelemetry-operator] Allow deploying the OTEL Operator as a Daemonset #1023

Closed
AlecAttwood opened this issue Feb 1, 2024 · 7 comments
Labels
chart:operator Issue related to opentelemetry-operator helm chart

Comments

@AlecAttwood

Overview

Allow the option to deploy the OTEL Operator as a Daemonset.

Why do we want this?

Currently the only option for deploying the Operator is a Kubernetes Deployment. Adding the ability to deploy it as a DaemonSet would allow greater flexibility when choosing how to run OTEL. It would also address a very specific issue: when new nodes or AKS versions are rolled out, pods can start up before the operator does and therefore do not get the OTEL collector sidecar injected. Deploying the operator as a DaemonSet would ensure the operator is present on every node.

@TylerHelmuth added the chart:operator label Feb 1, 2024
@TylerHelmuth
Member

@AlecAttwood the OpenTelemetry Operator does not support being run as a DaemonSet.

Can you describe in more detail how pods are starting before the Operator? Is the Operator deployment getting evicted and spun up somewhere else? PodDisruptionBudgets may help.
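
Something like this, roughly (the namespace and label selector here are guesses and need to match whatever labels the chart puts on the operator Deployment):

```yaml
# Minimal sketch of a PodDisruptionBudget for the operator Deployment.
# The namespace and label selector are assumptions; adjust them to match
# the labels the chart actually applies.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: opentelemetry-operator
  namespace: opentelemetry-operator-system
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: opentelemetry-operator
```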

@AlecAttwood
Author

Hey @TylerHelmuth, sorry for the delayed response.
This is something niche I have noticed during Kubernetes version upgrades. Despite the issue being niche, I still think having the option to deploy the OTEL Operator as a DaemonSet would be good.

The issue occurs during a Kubernetes version upgrade, e.g. going from 1.26 to 1.27. During an upgrade, a new node is created on the new version. For some reason, communication between 1.26 and 1.27 nodes is interrupted during this time. In this scenario, if a new OTEL operator pod isn't created on the new 1.27 node, any app that starts there won't have the OTEL sidecar injected, because the operator only exists on 1.26 nodes. This can be worked around; pod topology spread constraints are one option. But I feel like a DaemonSet is a guaranteed way to avoid this particular issue. Even if people aren't experiencing our particular problem, a DaemonSet is still useful for managing deployments on large clusters.

Having an option in the Helm chart to deploy using either a Deployment or a DaemonSet would be a good feature in my opinion.
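
For illustration only, a hypothetical values.yaml toggle; this key does not exist in the chart today, it just sketches what I mean:

```yaml
# Hypothetical sketch only: "deploymentKind" is NOT an existing chart value,
# it just illustrates the requested option.
manager:
  deploymentKind: DaemonSet   # default would stay Deployment
```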

@TylerHelmuth
Member

I think topology spread constraints are definitely what you want. If the chart allowed installing the operator as a DaemonSet, we'd be deploying the operator in a pattern unsupported by the OpenTelemetry Operator SIG.
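
A rough sketch of what that could look like in the operator Deployment's pod spec; the replica count, topology key, and labels here are assumptions:

```yaml
# Fragment of a Deployment spec sketching topology spread constraints for the
# operator, so replicas land across zones during a rolling node upgrade.
# Replica count, topologyKey, and labels are assumptions.
spec:
  replicas: 2
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: opentelemetry-operator
```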

@AlecAttwood
Author

I'm curious why the DaemonSet pattern would be unsupported. What is the rationale behind that? Just asking out of curiosity; I can address my issues with pod topology and some other config, but DaemonSets are very commonly used. Is there any documentation, or are there any articles, explaining the decision not to support DaemonSets?

@TylerHelmuth
Member

Other @open-telemetry/operator-approvers please chime in, but I believe having multiple operators managing the same webhooks/custom resources would create a problem.

In general, I think K8s Operators are not run as DaemonSets because that makes it much harder to centralize the management of the objects they control. I don't know of any Operators that do.

@swiatekm
Contributor

The otel operator does not scale naturally. Increasing the number of replicas in a naive way only gives you hot standbys, as the replicas will elect a leader, and the leader will do all the work. It is possible to partition the workload manually, but not in a way that would make sense for a DaemonSet.
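
To make the hot-standby point concrete: controller-runtime based operators take a coordination.k8s.io Lease, and only the current holder reconciles while the other replicas wait to acquire it. Roughly what that looks like (the lease name, namespace, and holder identity below are illustrative assumptions):

```yaml
# Illustrative sketch of the leader-election Lease a controller-runtime based
# operator holds; only the pod in holderIdentity does the work, the other
# replicas are hot standbys. Names and values here are assumptions.
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: opentelemetry-operator-leader-election
  namespace: opentelemetry-operator-system
spec:
  holderIdentity: opentelemetry-operator-7d9f6c8b4d-x7k2p_leader
  leaseDurationSeconds: 15
  renewTime: "2024-02-14T10:00:00.000000Z"
```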

In general, running as a DaemonSet isn't much different from running as a Deployment with a lot of replicas, so allowing it would be giving users a footgun with no upside.

> The issue occurs during a Kubernetes version upgrade, e.g. going from 1.26 to 1.27. [...] In this scenario, if a new OTEL operator pod isn't created on the new 1.27 node, any app that starts there won't have the OTEL sidecar injected, because the operator only exists on 1.26 nodes.

That doesn't sound right to me. The operator doesn't care about Nodes at all, it talks directly to the API Server. Can you reliably reproduce this problem? I don't think running the operator as a DaemonSet would help with it, sounds more like a bug in K8s itself potentially.
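
One thing worth checking when sidecars are silently missing: the sidecar is injected by the operator's mutating admission webhook, and if the API server can't reach the webhook while its failurePolicy is Ignore, pods are admitted without the sidecar instead of being rejected. A sketch of the relevant fragment (object and webhook names are assumptions):

```yaml
# Sketch of the relevant fragment of the operator's MutatingWebhookConfiguration.
# Object and webhook names are assumptions; the key field is failurePolicy:
# with Ignore, a pod created while the webhook is unreachable is admitted
# without the sidecar, while Fail would reject the pod instead.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: opentelemetry-operator-mutation
webhooks:
  - name: mpod.kb.io
    failurePolicy: Ignore
    clientConfig:
      service:
        name: opentelemetry-operator-webhook
        namespace: opentelemetry-operator-system
        path: /mutate-v1-pod
    sideEffects: None
    admissionReviewVersions: ["v1"]
```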

@AlecAttwood
Author

AlecAttwood commented Feb 14, 2024

@swiatekm-sumo thanks for the context. I was unaware the operator elects a leader; that's very useful. I was wondering why a DaemonSet wasn't viable when a Deployment with replicas was.
I will take a deep dive into our upgrade issues. It is repeatable: I've reproduced it in test environments and we've encountered it on every EKS upgrade we've done. There's some combination of actions that causes the OTEL sidecar not to be injected into pods. More investigation to be done, thanks.

@AlecAttwood closed this as not planned Feb 14, 2024