[opentelemetry-helm-charts/opentelemetry-operator] Allow deploying the OTEL Operator as a Daemonset #1023

Closed
AlecAttwood opened this issue Feb 1, 2024 · 7 comments
Labels
chart:operator Issue related to opentelemetry-operator helm chart

Comments

@AlecAttwood

Overview

Allow the option to deploy the OTEL Operator as a Daemonset.

Why do we want this?

Currently the only option for deploying the Operator is a Kubernetes Deployment. Adding the ability to deploy it as a DaemonSet would allow greater flexibility when choosing how to run OTEL. It would also address a very specific issue: when new nodes or AKS versions are rolled out, pods can start up before the operator does and therefore do not get the OTEL collector sidecar injected. Deploying the operator as a DaemonSet would ensure the operator is present on every node.

@TylerHelmuth added the chart:operator label Feb 1, 2024
@TylerHelmuth
Member

@AlecAttwood the OpenTelemetry Operator does not support being run as a DaemonSet.

Can you describe in more detail how pods are starting before the Operator? Is the Operator deployment getting evicted and spun up somewhere else? PodDisruptionBudgets may help.
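
Something like this, roughly (the namespace and label selector here are guesses and need to match whatever labels the chart puts on the operator Deployment):

```yaml
# Minimal sketch of a PodDisruptionBudget for the operator Deployment.
# The namespace and label selector are assumptions; adjust them to match
# the labels the chart actually applies.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: opentelemetry-operator
  namespace: opentelemetry-operator-system
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: opentelemetry-operator
```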

@AlecAttwood
Author

Hey @TylerHelmuth, sorry for the delayed response.
This is something niche I have noticed during Kubernetes version upgrades. Despite the issue being niche, I still think having the option to deploy the OTEL Operator as a DaemonSet would be good.

The issue occurs during a Kubernetes version upgrade, e.g. going from 1.26 to 1.27. During an upgrade, a new node is created on the new version. For some reason, communication between 1.26 and 1.27 nodes is interrupted during this time. In this scenario, if a new OTEL operator pod isn't created on the new 1.27 node, any app that starts there won't have the OTEL sidecar injected, because the operator only exists on 1.26 nodes. This can be worked around; pod topology spread constraints are one option. But I feel like a DaemonSet is a guaranteed way to avoid this particular issue. Even if people aren't experiencing our particular problem, a DaemonSet is still useful for managing deployments on large clusters.

Having an option in the Helm chart to deploy using either a Deployment or a DaemonSet would be a good feature in my opinion.
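
For illustration only, a hypothetical values.yaml toggle; this key does not exist in the chart today, it just sketches what I mean:

```yaml
# Hypothetical sketch only: "deploymentKind" is NOT an existing chart value,
# it just illustrates the requested option.
manager:
  deploymentKind: DaemonSet   # default would stay Deployment
```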

@TylerHelmuth
Member

I think topology spread constraints are definitely what you want. If the chart allowed installing the operator as a DaemonSet, we'd be deploying the operator in a pattern unsupported by the OpenTelemetry Operator SIG.
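
A rough sketch of what that could look like in the operator Deployment's pod spec; the replica count, topology key, and labels here are assumptions:

```yaml
# Fragment of a Deployment spec sketching topology spread constraints for the
# operator, so replicas land across zones during a rolling node upgrade.
# Replica count, topologyKey, and labels are assumptions.
spec:
  replicas: 2
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: opentelemetry-operator
```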

@AlecAttwood
Author

I'm curious why the DaemonSet pattern would be unsupported. What is the rationale behind that? Just asking out of curiosity; I can address my issues with pod topology and some other config, but DaemonSets are very commonly used. Is there any documentation, or are there any articles, explaining the decision not to support DaemonSets?

@TylerHelmuth
Member

Other @open-telemetry/operator-approvers please chime in, but I believe having multiple operators managing the same webhooks/custom resources would create a problem.

In general, I think K8s Operators are not run as DaemonSets because that makes it much harder to centralize the management of the objects they control. I don't know of any Operators that do.

@swiatekm
Contributor

The otel operator does not scale naturally. Increasing the number of replicas in a naive way only gives you hot standbys, as the replicas will elect a leader, and the leader will do all the work. It is possible to partition the workload manually, but not in a way that would make sense for a DaemonSet.
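
To make the hot-standby point concrete: controller-runtime based operators take a coordination.k8s.io Lease, and only the current holder reconciles while the other replicas wait to acquire it. Roughly what that looks like (the lease name, namespace, and holder identity below are illustrative assumptions):

```yaml
# Illustrative sketch of the leader-election Lease a controller-runtime based
# operator holds; only the pod in holderIdentity does the work, the other
# replicas are hot standbys. Names and values here are assumptions.
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: opentelemetry-operator-leader-election
  namespace: opentelemetry-operator-system
spec:
  holderIdentity: opentelemetry-operator-7d9f6c8b4d-x7k2p_leader
  leaseDurationSeconds: 15
  renewTime: "2024-02-14T10:00:00.000000Z"
```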

In general, running as a DaemonSet isn't much different from running as a Deployment with a lot of replicas, so allowing it would be giving users a footgun with no upside.

> The issue occurs during a Kubernetes version upgrade, e.g. going from 1.26 to 1.27. [...] In this scenario, if a new OTEL operator pod isn't created on the new 1.27 node, any app that starts there won't have the OTEL sidecar injected, because the operator only exists on 1.26 nodes.

That doesn't sound right to me. The operator doesn't care about Nodes at all, it talks directly to the API Server. Can you reliably reproduce this problem? I don't think running the operator as a DaemonSet would help with it, sounds more like a bug in K8s itself potentially.
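
One thing worth checking when sidecars are silently missing: the sidecar is injected by the operator's mutating admission webhook, and if the API server can't reach the webhook while its failurePolicy is Ignore, pods are admitted without the sidecar instead of being rejected. A sketch of the relevant fragment (object and webhook names are assumptions):

```yaml
# Sketch of the relevant fragment of the operator's MutatingWebhookConfiguration.
# Object and webhook names are assumptions; the key field is failurePolicy:
# with Ignore, a pod created while the webhook is unreachable is admitted
# without the sidecar, while Fail would reject the pod instead.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: opentelemetry-operator-mutation
webhooks:
  - name: mpod.kb.io
    failurePolicy: Ignore
    clientConfig:
      service:
        name: opentelemetry-operator-webhook
        namespace: opentelemetry-operator-system
        path: /mutate-v1-pod
    sideEffects: None
    admissionReviewVersions: ["v1"]
```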

@AlecAttwood
Author

AlecAttwood commented Feb 14, 2024

@swiatekm-sumo thanks for the context. I was unaware the operator elects a leader; that's very useful. I was wondering why a DaemonSet wasn't viable when a Deployment with replicas was.
I will take a deep dive into our upgrade issues. It is repeatable: I've reproduced it in test environments and we've encountered it on every EKS upgrade we've done. There's some combination of actions that causes the OTEL sidecar not to be injected into pods. More investigation to be done, thanks.

@AlecAttwood closed this as not planned Feb 14, 2024