OpenTelemetry webhook modifies orphaned pods in an invalid way #1514
Comments
We're running into a similar issue with Jobs when we auto-instrument them. We also tried manually removing the finalizer. Not sure what's happening here, but it seems like the webhook is preventing the Job Controller from removing the finalizer and finishing the Job.
Looks like the same issue(s) as #1541, #1417, and #1366. I'm having the same issue. One thing I noticed, which you can also see in #1541, is that the webhook attempts to re-inject variables that already exist; for example, k8s.pod.name is already present but gets added again. My guess is that the check for whether the pod was already injected is failing to match for some reason and thus injection runs again. What I don't understand, however, is what is trying to add these variables in the first place, and why?
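To illustrate, here is a purely hypothetical sketch (names and values are mine, not taken from an actual pod) of the kind of duplication a repeated injection could produce:

# Hypothetical container env entry (names and values are assumptions), showing
# the kind of duplication re-injection could produce: k8s.pod.name is already
# present in the value but gets appended again.
env:
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: k8s.pod.name=my-app-7d9f8,k8s.namespace.name=default,k8s.pod.name=my-app-7d9f8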
I feel like I am getting close to hitting this with the debugger using Telepresence. If it helps anyone else look into this, I made the following changes to hook into the webhook and was able to hit some breakpoints in it:

diff --git a/Makefile b/Makefile
index febd731..fad5ec6 100644
--- a/Makefile
+++ b/Makefile
@@ -13,6 +13,7 @@ AUTO_INSTRUMENTATION_DOTNET_VERSION ?= "$(shell grep -v '\#' versions.txt | grep
AUTO_INSTRUMENTATION_APACHE_HTTPD_VERSION ?= "$(shell grep -v '\#' versions.txt | grep autoinstrumentation-apache-httpd | awk -F= '{print $$2}')"
LD_FLAGS ?= "-X ${VERSION_PKG}.version=${VERSION} -X ${VERSION_PKG}.buildDate=${VERSION_DATE} -X ${VERSION_PKG}.otelCol=${OTELCOL_VERSION} -X ${VERSION_PKG}.targetAllocator=${TARGETALLOCATOR_VERSION} -X ${VERSION_PKG}.operatorOpAMPBridge=${OPERATOR_OPAMP_BRIDGE_VERSION} -X ${VERSION_PKG}.autoInstrumentationJava=${AUTO_INSTRUMENTATION_JAVA_VERSION} -X ${VERSION_PKG}.autoInstrumentationNodeJS=${AUTO_INSTRUMENTATION_NODEJS_VERSION} -X ${VERSION_PKG}.autoInstrumentationPython=${AUTO_INSTRUMENTATION_PYTHON_VERSION} -X ${VERSION_PKG}.autoInstrumentationDotNet=${AUTO_INSTRUMENTATION_DOTNET_VERSION}"
ARCH ?= $(shell go env GOARCH)
+SHELL := /bin/bash
# Image URL to use all building/pushing image targets
IMG_PREFIX ?= ghcr.io/${USER}/opentelemetry-operator
@@ -45,7 +46,7 @@ GOBIN=$(shell go env GOBIN)
endif
# by default, do not run the manager with webhooks enabled. This only affects local runs, not the build or in-cluster deployments.
-ENABLE_WEBHOOKS ?= false
+ENABLE_WEBHOOKS ?= true
@@ -101,10 +102,24 @@ test: generate fmt vet ensure-generate-is-noop envtest
manager: generate fmt vet
go build -o bin/manager main.go
+.PHONY: copy-certs
+copy-certs:
+ @[ -d /tmp/k8s-webhook-server/serving-certs/ ] || { \
+ mkdir -p /tmp/k8s-webhook-server/serving-certs/;\
+ kubectl get secret monitoring-stack-aaaa-opentelemetry-operator-controller-manager-service-cert -o yaml | yq '.data["tls.crt"]' | base64 --decode > /tmp/k8s-webhook-server/serving-certs/tls.crt;\
+ kubectl get secret monitoring-stack-aaaa-opentelemetry-operator-controller-manager-service-cert -o yaml | yq '.data["tls.key"]' | base64 --decode > /tmp/k8s-webhook-server/serving-certs/tls.key;\
+ kubectl get secret monitoring-stack-aaaa-opentelemetry-operator-controller-manager-service-cert -o yaml | yq '.data["ca.crt"]' | base64 --decode > /tmp/k8s-webhook-server/serving-certs/ca.crt;\
+ }
+
# Run against the configured Kubernetes cluster in ~/.kube/config
.PHONY: run
-run: generate fmt vet manifests
- ENABLE_WEBHOOKS=$(ENABLE_WEBHOOKS) go run -ldflags ${LD_FLAGS} ./main.go --zap-devel
+run: generate fmt vet manifests copy-certs
+ go build -o bin/otel-operator -gcflags='all=-N -l' -ldflags ${LD_FLAGS} ./main.go
+	ENABLE_WEBHOOKS=$(ENABLE_WEBHOOKS) telepresence intercept monitoring-stack-aaaa-opentelemetry-operator --port 9443:443 -- dlv exec bin/otel-operator --headless --listen=:2345 --log --api-version=2 --accept-multiclient --continue --only-same-user=false -- --zap-devel --webhook-port=9443 --metrics-addr=0.0.0.0:8080 --enable-leader-election --health-probe-addr=:8081 --collector-image=otel/opentelemetry-collector-contrib:0.71.0

Also had to add this to main.go (put whatever namespace the otel operator is in):

+	LeaderElectionNamespace: "monitoring",

Also added this launch config to VS Code's launch.json:

{
// Use IntelliSense to learn about possible attributes.
// Hover to view descriptions of existing attributes.
// For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
"version": "0.2.0",
"configurations": [
{
"name": "Debug",
"type": "go",
"debugAdapter": "dlv-dap",
"request": "attach",
"mode": "remote",
"port": 2345,
"host": "127.0.0.1"
}
]
}

Started debugging with:

telepresence helm install # install telepresence into the cluster
kubectl config set-context --current --namespace {otel-namespace}
make run # start local instance with debugger and connect telepresence
# run the "Debug" launch configuration from the Debug tab

Note that when intercepting with Telepresence you need to use the name of the Deployment, not the Service.
(Feel free to remove this if it's deemed too out of scope or off topic for this issue.) Have any related upstream issues been found? Although I'm not using this operator, I do have an Open Policy Agent mutating webhook which exhibits the same behavior described here. Even if you try to change something the API says you should be able to change, it complains. Not sure if this is necessarily an upstream issue, but it's definitely not unique to this operator.
Seems like the mutating webhook doesn't account for the life-cycle of the resource it receives. The modifications to the Pod definition should be done on the initial Pod CREATE, not by trying to modify anything after that. Based on this, my take would be that the MutatingWebhookConfiguration should only trigger on CREATE operations for Pod resources, and not on both CREATE and UPDATE as it currently does.
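For illustration, a minimal sketch of that restriction, assuming a webhook configuration roughly shaped like the operator's (all names and paths here are placeholders, not the chart's actual manifest):

# Hypothetical MutatingWebhookConfiguration fragment; object, webhook, and
# service names are placeholders. The key change is listing only CREATE
# under operations so existing pods are never rewritten.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: opentelemetry-operator-mutating-webhook-configuration
webhooks:
  - name: mpod.kb.io
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Ignore
    clientConfig:
      service:
        name: opentelemetry-operator-webhook-service
        namespace: opentelemetry-operator-system
        path: /mutate-v1-pod
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["pods"]
        operations: ["CREATE"]   # only CREATE; UPDATE is deliberately omitted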
Something like this?
Using the opentelemetry operator helm chart version 0.18.3.
This will be a bit difficult to describe or reproduce, but I wanted to open this to describe what I found in the meantime.
Yesterday, we saw issues where a huge number of pods were stuck in the Terminating phase. The cause of that is still somewhat unknown, but we figure one of two things happened:
We have finalizers on Pods for record-keeping purposes that our own operator removes once its work is complete. However, our operator was unable to remove the finalizer. It reported errors like:
We also tried to manually remove the finalizer, both via kubectl patch, which had the same error, and with kubectl edit, which had an error saying the pod was invalid for the same reason. So here's where I think there could be an issue with the opentelemetry webhook:
Our pod has an init container already, so I tried to remove that as part of the kubectl edit, and was met with a different 'invalid pod' error. In this case, there was an init container with multiple duplicate keys. Like, there were two name fields, one with the name of our init container and one that appeared to be injected by the opentelemetry webhook. Seeing this, I modified the webhook service to point to a bogus port to effectively disable it. After that, I was able to successfully remove the finalizer.
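In case it helps anyone reproduce the workaround, this is roughly the kind of edit I mean; the service name, namespace, selector, and ports below are placeholders for whatever your install uses, and note this only unblocks API writes if the webhook's failurePolicy is Ignore (an unreachable webhook with failurePolicy Fail would instead reject them):

# Hypothetical webhook Service edit (all names/ports are placeholders).
# Pointing targetPort at a port nothing listens on makes the webhook
# unreachable, so its mutations stop being applied.
apiVersion: v1
kind: Service
metadata:
  name: opentelemetry-operator-webhook-service
  namespace: opentelemetry-operator-system
spec:
  selector:
    app.kubernetes.io/name: opentelemetry-operator
  ports:
    - port: 443
      targetPort: 19443   # bogus port; the operator normally serves on 9443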
Some other notes:
- kubectl delete pod --force --grace-period=0 was hanging until the finalizer was removed.
- We changed the helm chart's nameOverride to include an aaaa prefix, so it runs before linkerd's webhook (see the values sketch after these notes). This is because in cases where I want to inject something with autoinstrumentation, it was modifying linkerd's proxy, not our main app container. By adding this prefix, the opentelemetry webhook ran first.

I'll try to add additional notes as I find them, but this is the gist of it.
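For reference, the override itself is just a chart value; a minimal sketch of what that change looks like (the exact value below is an assumption, ours is only described here as carrying an aaaa prefix):

# Hypothetical values.yaml fragment for the opentelemetry-operator helm chart.
# The aaaa prefix makes the generated webhook resources sort ahead of
# linkerd's, which is what made the opentelemetry webhook run first for us.
nameOverride: aaaa-opentelemetry-operator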