
OpenTelemetry webhook modifies orphaned pods in an invalid way #1514

Closed
andrewdinunzio opened this issue Feb 24, 2023 · 6 comments · Fixed by #2584
Labels
area:auto-instrumentation · bug

Comments

@andrewdinunzio
Contributor

andrewdinunzio commented Feb 24, 2023

Using the opentelemetry operator helm chart version 0.18.3.

This will be a bit difficult to describe or reproduce, but I wanted to open this issue to record what I found in the meantime.

Yesterday, we saw issues where a huge number of pods were stuck in the Terminating phase. The cause is still not entirely clear, but we figure one of two things happened:

We have finalizers on Pods for record-keeping purposes that our own operator removes once its work is complete. However, our operator was unable to remove the finalizer. It reported errors like:

"The Pod "{PODNAME}" is invalid: spec.initContainers: Forbidden: pod updates may not add or remove containers"

We also tried to remove the finalizer manually, both via kubectl patch, which failed with the same error, and via kubectl edit, which reported that the pod was invalid for the same reason.

So here's where I think there could be an issue with the opentelemetry webhook:

Our pod already has an init container, so I tried to remove it as part of the kubectl edit and was met with a different 'invalid pod' error. In this case, there was an init container entry with duplicate keys: two name fields, one with the name of our init container and one that appeared to have been injected by the OpenTelemetry webhook.

Seeing this, I modified the webhook service to point to a bogus port to effectively disable it. After that, I was able to successfully remove the finalizer.

Some other notes:

  • This pod was orphaned. We typically delete using a Foreground cascading deletion policy, but something must have deleted it with a different policy.
  • The pod had a deletionTimestamp set; it was just waiting for its finalizers to be removed.
  • We actually deployed a new node pool while diagnosing this and tore down the old one. The pod still persisted, and even kubectl delete pod --force --grace-period=0 hung until the finalizer was removed.
  • Because we also run Linkerd, I set nameOverride to include an aaaa prefix so the OpenTelemetry webhook runs before Linkerd's webhook. Without this, auto-instrumentation was modifying Linkerd's proxy container instead of our main app container; with the prefix, the OpenTelemetry webhook ran first.
  • The webhook appeared to run even for pods that were never marked for auto-instrumentation.

I'll try to add additional notes as I find them, but this is the gist of it.
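
For what it's worth, this whole failure mode would go away if the pod webhook simply skipped pods that are already terminating. A minimal sketch of such a guard (an illustration only, not the operator's current behavior):

package webhooks

import corev1 "k8s.io/api/core/v1"

// shouldMutate is an illustrative guard, not the operator's current logic:
// skip any pod that already has a deletionTimestamp, since the API server
// forbids adding or removing containers on existing pods and the only
// remaining writes are usually finalizer removals.
func shouldMutate(pod *corev1.Pod) bool {
    return pod.DeletionTimestamp == nil
}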

@svmaris

svmaris commented May 8, 2023

We're running into a similar issue with Jobs. When we auto-instrument them using the instrumentation.opentelemetry.io/inject-sdk: "true" annotation, the Pods are stuck in Terminating indefinitely after the Job completes.

Manually removing the batch.kubernetes.io/job-tracking finalizer is not allowed. Only after deleting the opentelemetry-operator Mutating/ValidatingWebhookConfigurations am I able to remove the finalizer and delete the stuck Pod.

Not sure what's happening here, but it seems like the Webhook is preventing the Job Controller from removing the finalizer and finishing the Job.

@pavolloffay added the area:auto-instrumentation and bug labels on May 9, 2023
@edude03

edude03 commented May 19, 2023

Looks like the same issue(s) as #1541, #1417, and #1366.

I'm having the same issue. One thing I noticed, which you can also see in #1541, is that the webhook attempts to re-inject variables that already exist. For example:

            Name: "OTEL_RESOURCE_ATTRIBUTES",                                                                                                                             
            Value: strings.Join({                                                                                                                                         
              ... // 131 identical bytes                                                                                                                                  
              "TEL_RESOURCE_ATTRIBUTES_NODE_NAME),k8s.pod.name=$(OTEL_RESOURCE_",                                                                                         
              "ATTRIBUTES_POD_NAME),k8s.replicaset.name=xxxxx",                                                                                         
+             ",k8s.node.name=$(OTEL_RESOURCE_ATTRIBUTES_NODE_NAME),k8s.pod.nam",                                                                                         
+             "e=$(OTEL_RESOURCE_ATTRIBUTES_POD_NAME),k8s.pod.uid=xxxx",                                                                                                                                                                                                                  
            }, ""),                             

Notice that k8s.pod.name already exists on the first line but is being added again on the third line. My guess is that the check for whether the pod was already injected is failing to match for some reason, so the webhook tries to inject again.

What I don't understand, however, is what is trying to add these variables, and why. As far as I can tell, the variables are added via a MutatingWebhook, so there shouldn't be any updates to the pod after creation 🤔
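
If the culprit really is a missing idempotency check, the fix would be a guard along these lines (a rough sketch under that assumption, not the operator's actual code; the helper name is made up):

package main

import (
    "fmt"
    "strings"

    corev1 "k8s.io/api/core/v1"
)

// appendResourceAttributes adds attrs to OTEL_RESOURCE_ATTRIBUTES only if they
// are not already present, so running the webhook twice on the same container
// is a no-op instead of duplicating k8s.pod.name and friends.
func appendResourceAttributes(c *corev1.Container, attrs string) {
    for i, env := range c.Env {
        if env.Name != "OTEL_RESOURCE_ATTRIBUTES" {
            continue
        }
        if strings.Contains(env.Value, attrs) {
            return // already injected, nothing to do
        }
        c.Env[i].Value = env.Value + "," + attrs
        return
    }
    c.Env = append(c.Env, corev1.EnvVar{Name: "OTEL_RESOURCE_ATTRIBUTES", Value: attrs})
}

func main() {
    c := corev1.Container{Name: "app"}
    attrs := "k8s.pod.name=$(OTEL_RESOURCE_ATTRIBUTES_POD_NAME),k8s.pod.uid=xxxx"
    appendResourceAttributes(&c, attrs)
    appendResourceAttributes(&c, attrs) // second pass must be a no-op
    fmt.Println(c.Env[0].Value)
}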

@andrewdinunzio
Contributor Author

andrewdinunzio commented Jul 6, 2023

I feel like I am getting close to hitting this with the debugger using Telepresence. If it helps anyone else look into this, I made the following changes to hook into the webhook and was able to hit some breakpoints in the webhook:

diff --git a/Makefile b/Makefile
index febd731..fad5ec6 100644
--- a/Makefile
+++ b/Makefile
@@ -13,6 +13,7 @@ AUTO_INSTRUMENTATION_DOTNET_VERSION ?= "$(shell grep -v '\#' versions.txt | grep
 AUTO_INSTRUMENTATION_APACHE_HTTPD_VERSION ?= "$(shell grep -v '\#' versions.txt | grep autoinstrumentation-apache-httpd | awk -F= '{print $$2}')"
 LD_FLAGS ?= "-X ${VERSION_PKG}.version=${VERSION} -X ${VERSION_PKG}.buildDate=${VERSION_DATE} -X ${VERSION_PKG}.otelCol=${OTELCOL_VERSION} -X ${VERSION_PKG}.targetAllocator=${TARGETALLOCATOR_VERSION} -X ${VERSION_PKG}.operatorOpAMPBridge=${OPERATOR_OPAMP_BRIDGE_VERSION} -X ${VERSION_PKG}.autoInstrumentationJava=${AUTO_INSTRUMENTATION_JAVA_VERSION} -X ${VERSION_PKG}.autoInstrumentationNodeJS=${AUTO_INSTRUMENTATION_NODEJS_VERSION} -X ${VERSION_PKG}.autoInstrumentationPython=${AUTO_INSTRUMENTATION_PYTHON_VERSION} -X ${VERSION_PKG}.autoInstrumentationDotNet=${AUTO_INSTRUMENTATION_DOTNET_VERSION}"
 ARCH ?= $(shell go env GOARCH)
+SHELL := /bin/bash
 
 # Image URL to use all building/pushing image targets
 IMG_PREFIX ?= ghcr.io/${USER}/opentelemetry-operator
@@ -45,7 +46,7 @@ GOBIN=$(shell go env GOBIN)
 endif
 
 # by default, do not run the manager with webhooks enabled. This only affects local runs, not the build or in-cluster deployments.
-ENABLE_WEBHOOKS ?= false
+ENABLE_WEBHOOKS ?= true

@@ -101,10 +102,24 @@ test: generate fmt vet ensure-generate-is-noop envtest
 manager: generate fmt vet
        go build -o bin/manager main.go

+.PHONY: copy-certs
+copy-certs:
+       @[ -d /tmp/k8s-webhook-server/serving-certs/ ] || { \
+       mkdir -p /tmp/k8s-webhook-server/serving-certs/;\
+       kubectl get secret monitoring-stack-aaaa-opentelemetry-operator-controller-manager-service-cert -o yaml | yq '.data["tls.crt"]' | base64 --decode > /tmp/k8s-webhook-server/serving-certs/tls.crt;\
+       kubectl get secret monitoring-stack-aaaa-opentelemetry-operator-controller-manager-service-cert -o yaml | yq '.data["tls.key"]' | base64 --decode > /tmp/k8s-webhook-server/serving-certs/tls.key;\
+       kubectl get secret monitoring-stack-aaaa-opentelemetry-operator-controller-manager-service-cert -o yaml | yq '.data["ca.crt"]' | base64 --decode > /tmp/k8s-webhook-server/serving-certs/ca.crt;\
+       }
+
 # Run against the configured Kubernetes cluster in ~/.kube/config
 .PHONY: run
-run: generate fmt vet manifests
-       ENABLE_WEBHOOKS=$(ENABLE_WEBHOOKS) go run -ldflags ${LD_FLAGS} ./main.go --zap-devel
+run: generate fmt vet manifests copy-certs
+       go build -o bin/otel-operator -gcflags='all=-N -l' -ldflags ${LD_FLAGS} ./main.go
+       ENABLE_WEBHOOKS=$(ENABLE_WEBHOOKS) telepresence intercept monitoring-stack-aaaa-opentelemetry-operator --port 9443:443 -- dlv exec bin/otel-operator --headless --listen=:2345 --log --api-version=2 --accept-multiclient --continue --only-same-user=false -- --zap-devel --webhook-port=9443 --metrics-addr=0.0.0.0:8080 --enable-leader-election --health-probe-addr=:8081 --collector-image=otel/opentelemetry-collector-contrib:0.71.0

Also had to add this to main.go (put whatever namespace the otel operator is in):

+               LeaderElectionNamespace: "monitoring",
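
For context, that field lands in the ctrl.Options passed to ctrl.NewManager. Roughly like this (a sketch only, not the operator's full main.go; the LeaderElectionID is made up):

package main

import (
    ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
    // When the manager runs outside the cluster (e.g. under Telepresence),
    // controller-runtime cannot infer the namespace for the leader-election
    // lease, so it has to be set explicitly.
    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        LeaderElection:          true,
        LeaderElectionID:        "opentelemetry-operator-lock", // illustrative ID
        LeaderElectionNamespace: "monitoring",                  // namespace the operator runs in
    })
    if err != nil {
        panic(err)
    }
    _ = mgr
}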

Also added this launch config to .vscode/launch.json:

{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Debug",
            "type": "go",
            "debugAdapter": "dlv-dap",
            "request": "attach",
            "mode": "remote",
            "port": 2345,
            "host": "127.0.0.1"
        }
    ]
}

Started debugging with

telepresence helm install # install telepresence into the cluster
kubectl config set-context --current --namespace {otel-namespace}
make run # start local instance with debugger and connect telepresence
# run the "Debug" launch configuration from the Debug tab

Note that when intercepting with Telepresence you need to use the name of the Deployment, not the Service (monitoring-stack-aaaa-opentelemetry-operator in my case); Telepresence figures out the Service from the ports, etc.

@GarrettGeorge

(Feel free to remove this if deemed too out of scope or off topic for this issue)

Have any related upstream issues been found? Although I'm not using this operator, I do have an Open Policy Agent mutating webhook that exhibits the same behavior described here.

Our mutation is at the .spec.containers[*].image and spec.initContainers[*].image level, to ensure all images come from our internal registry. In our "bad pod", the mutation doesn't seem to have ever been applied, so when the webhook pod receives the update request it tries to rewrite the image and fails for reasons similar to the above:

# pods "cronjob-***" was not valid:
 # * spec: Forbidden: pod updates may not change fields other than `spec.containers[*].image`, `spec.initContainers[*].image`, `spec.activeDeadlineSeconds`, `spec.tolerations` (only additions to existing tolerations) or `spec.terminationGracePeriodSeconds` (allow it to be set to 1 if it was previously negative)
 #   core.PodSpec{
 #     ... // 11 identical fields
 #     NodeName:        "aks-nodepool1-***",
 #     SecurityContext: &{},
 #     ImagePullSecrets: []core.LocalObjectReference{
 # -     {Name: "private-registry1"},
 # +     {Name: "private-registry2"},
 # -     {Name: "private-registry3"},
 #     },
 #     Hostname:  "",
 #     Subdomain: "",
 #     ... // 15 identical fields
 #   }

Even when you try to change something the API says you should be able to change, it complains. I'm not sure this is necessarily an upstream issue, but it's definitely not unique to opentelemetry-operator.

@kristian-schjelderup

It seems like the mutating webhook doesn't account for the lifecycle of the resource it receives. Modifications to the Pod definition should be made on the initial Pod CREATE, not attempted on anything after that.

Based on this, my take would be that the MutatingWebhookConfiguration should only trigger on CREATE operations for Pod resources, and not on both CREATE and UPDATE as it currently does.
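
A rough sketch of what that restriction could look like, assuming the webhook configuration is generated from kubebuilder markers (the path and webhook name below are illustrative, not the operator's actual values):

package webhooks

// Illustrative only: generating the pod-mutation webhook with verbs=create,
// so the webhook is never invoked on UPDATE and cannot interfere with
// finalizer removal on terminating pods.
// +kubebuilder:webhook:path=/mutate-v1-pod,mutating=true,failurePolicy=ignore,sideEffects=None,groups="",resources=pods,verbs=create,versions=v1,name=mpod.example.io,admissionReviewVersions=v1
type PodMutator struct{}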

@andrewdinunzio
Contributor Author

Something like this?
#2584
