Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[processor/k8sattributes] Handle all resource deletion event types #27847

Merged

Conversation

c-kruse
Copy link
Contributor

@c-kruse c-kruse commented Oct 18, 2023

Description:

The k8s go client's cache expects OnDelete handlers to handle objects of type DeletedFinalStateUnknown when the cache's watch mechanism misses a delete and notices later. This changes the processor to handle such deletes as if they were normal, rather than logging an error and dropping the change.

Link to tracking Issue: #27632

Testing:

Only what you see in the unit tests. I am open to suggestions, but I don't see this being a code path we can reasonably cover in the e2e test suite.

Verified manually locally on a kind cluster.

  • Stood up two deployments loosely based off e2e testing resources, one w/ a collector built from this branch and the other docker.io/otel/opentelemetry-collector-contrib:latest.
    • Both included an additional container in the collector pod I used to fiddle with iptables rules.
  • Added rules to reject traffic to/from the kube api server
  • Deleted some namespaces containing deployments generating telemetry.
  • Restored connectivity by removing the iptables rules.
  • Observed the collector built from this branch was silent (aside from the junk the k8s client logs due to the broken connection)
  • Observed the latest (0.87.0) collector logged a handful of errors for the deleted resources (api_v1.Pod, and apps_v1.ReplicaSet. I probably just didn't wait long enough for Namespace.)
2023-10-19T02:18:37.781Z        error   kube/client.go:236      object received was not of type api_v1.Pod      {"kind": "processor", "name": "k8sattributes", "pipeline": "metrics", "received": {"Key":"src1/telemetrygen-patched-766d55cbcb-8zktr","Obj":{"metadata":{"name":"telemetrygen-patched-766d55cbcb-8zktr","namespace":"src1","uid":"be5d2268-c8b0-434d-b3b8-8b18083c7a8b","creat
ionTimestamp":"2023-10-19T02:01:08Z","labels":{"app":"telemetrygen-patched","pod-template-hash":"766d55cbcb"},"ownerReferences":[{"apiVersion":"apps/v1","kind":"ReplicaSet","name":"telemetrygen-patched-766d55cbcb","uid":"a887d67a-d5d6-4269-b520-45dbb4f1cd82","controller":true,"blockOwnerDeletion":true}]},"spec":{"containers":[{"name":"telemetrygen","image":"localhost/telemetrygen
:latest","resources":{}}],"nodeName":"manual-e2e-testing-control-plane"},"status":{"podIP":"10.244.0.56","startTime":"2023-10-19T02:01:08Z","containerStatuses":[{"name":"telemetrygen","state":{},"lastState":{},"ready":false,"restartCount":0,"image":"","imageID":"","containerID":"containerd://2821ef32cd8bf93a13414504c0f8f0c016c84be49d6ffdbd475d7e4681e90c51"}]}}}}
github.com/open-telemetry/opentelemetry-collector-contrib/processor/k8sattributesprocessor/internal/kube.(*WatchClient).handlePodDelete
        github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/internal/kube/client.go:236
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete
        k8s.io/[email protected]/tools/cache/controller.go:253
...

2023-10-19T02:19:03.970Z        error   kube/client.go:868      object received was not of type apps_v1.ReplicaSet      {"kind": "processor", "name": "k8sattributes", "pipeline": "metrics", "received": {"Key":"src1/telemetrygen-stable-5c444bb8b8","Obj":{"metadata":{"name":"telemetrygen-stable-5c444bb8b8","namespace":"src1","uid":"d37707ff-b308-4339-8543-a1caf5705ea8","creationTimestamp":null,"ownerReferences":[{"apiVersion":"apps/v1","kind":"Deployment","name":"telemetrygen-stable","uid":"c421276e-e1bf-40c5-85e1-e92e30363da5","controller":true,"blockOwnerDeletion":true}]},"spec":{"selector":null,"template":{"metadata":{"creationTimestamp":null},"spec":{"containers":null}}},"status":{"replicas":0}}}}
github.com/open-telemetry/opentelemetry-collector-contrib/processor/k8sattributesprocessor/internal/kube.(*WatchClient).handleReplicaSetDelete
        github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/internal/kube/client.go:868
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete
        k8s.io/[email protected]/tools/cache/controller.go:253
k8s.io/client-go/tools/cache.(*processorListener).run.func1
        k8s.io/[email protected]/tools/cache/shared_informer.go:979
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
...

Documentation: N/A - it is not clear to me whether or not this should land on the changelog. Its impact on users is marginal.

The k8s go client's cache expects OnDelete handlers to handle objects of
type DeletedFinalStateUnknown when the cache's watch mechanism misses a
delete and notices later. This changes the processor to handle such
deletes as if they were normal, rather than logging an error and
dropping the change.

Signed-off-by: Christian Kruse <[email protected]>
@github-actions github-actions bot added the processor/k8sattributes k8s Attributes processor label Oct 18, 2023
@c-kruse c-kruse marked this pull request as ready for review October 19, 2023 02:52
@c-kruse c-kruse requested a review from a team October 19, 2023 02:52
@dmitryax dmitryax added the Skip Changelog PRs that do not require a CHANGELOG.md entry label Oct 19, 2023
@dmitryax
Copy link
Member

Thanks @c-kruse!

@dmitryax dmitryax merged commit b0f0dda into open-telemetry:main Oct 19, 2023
89 of 90 checks passed
@github-actions github-actions bot added this to the next release milestone Oct 19, 2023
martin-majlis-s1 pushed a commit to scalyr/opentelemetry-collector-contrib that referenced this pull request Oct 20, 2023
…pen-telemetry#27847)

**Description:**

The k8s go client's cache expects OnDelete handlers to handle objects of
type DeletedFinalStateUnknown when the cache's watch mechanism misses a
delete and notices later. This changes the processor to handle such
deletes as if they were normal, rather than logging an error and
dropping the change.

**Link to tracking Issue:**
open-telemetry#27632

**Testing:**

Only what you see in the unit tests. I am open to suggestions, but I
don't see this being a code path we can reasonably cover in the e2e test
suite.

Verified manually locally on a kind cluster.
* Stood up two deployments loosely based off e2e testing resources, one
w/ a collector built from this branch and the other
docker.io/otel/opentelemetry-collector-contrib:latest.
* Both included an additional container in the collector pod I used to
fiddle with iptables rules.
* Added rules to reject traffic to/from the kube api server
* Deleted some namespaces containing deployments generating telemetry.
* Restored connectivity by removing the iptables rules.
* Observed the collector built from this branch was silent (aside from
the junk the k8s client logs due to the broken connection)
* Observed the latest
([0.87.0](https://hub.docker.com/layers/otel/opentelemetry-collector-contrib/0.87.0/images/sha256-77cdd395b828b09cb920c671966f09a87a40611aa6107443146086f2046f4a9a?context=explore))
collector logged a handful of errors for the deleted resources
(api_v1.Pod, and apps_v1.ReplicaSet. I probably just didn't wait long
enough for Namespace.)

```
2023-10-19T02:18:37.781Z        error   kube/client.go:236      object received was not of type api_v1.Pod      {"kind": "processor", "name": "k8sattributes", "pipeline": "metrics", "received": {"Key":"src1/telemetrygen-patched-766d55cbcb-8zktr","Obj":{"metadata":{"name":"telemetrygen-patched-766d55cbcb-8zktr","namespace":"src1","uid":"be5d2268-c8b0-434d-b3b8-8b18083c7a8b","creat
ionTimestamp":"2023-10-19T02:01:08Z","labels":{"app":"telemetrygen-patched","pod-template-hash":"766d55cbcb"},"ownerReferences":[{"apiVersion":"apps/v1","kind":"ReplicaSet","name":"telemetrygen-patched-766d55cbcb","uid":"a887d67a-d5d6-4269-b520-45dbb4f1cd82","controller":true,"blockOwnerDeletion":true}]},"spec":{"containers":[{"name":"telemetrygen","image":"localhost/telemetrygen
:latest","resources":{}}],"nodeName":"manual-e2e-testing-control-plane"},"status":{"podIP":"10.244.0.56","startTime":"2023-10-19T02:01:08Z","containerStatuses":[{"name":"telemetrygen","state":{},"lastState":{},"ready":false,"restartCount":0,"image":"","imageID":"","containerID":"containerd://2821ef32cd8bf93a13414504c0f8f0c016c84be49d6ffdbd475d7e4681e90c51"}]}}}}
github.com/open-telemetry/opentelemetry-collector-contrib/processor/k8sattributesprocessor/internal/kube.(*WatchClient).handlePodDelete
        github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/internal/kube/client.go:236
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete
        k8s.io/[email protected]/tools/cache/controller.go:253
...

2023-10-19T02:19:03.970Z        error   kube/client.go:868      object received was not of type apps_v1.ReplicaSet      {"kind": "processor", "name": "k8sattributes", "pipeline": "metrics", "received": {"Key":"src1/telemetrygen-stable-5c444bb8b8","Obj":{"metadata":{"name":"telemetrygen-stable-5c444bb8b8","namespace":"src1","uid":"d37707ff-b308-4339-8543-a1caf5705ea8","creationTimestamp":null,"ownerReferences":[{"apiVersion":"apps/v1","kind":"Deployment","name":"telemetrygen-stable","uid":"c421276e-e1bf-40c5-85e1-e92e30363da5","controller":true,"blockOwnerDeletion":true}]},"spec":{"selector":null,"template":{"metadata":{"creationTimestamp":null},"spec":{"containers":null}}},"status":{"replicas":0}}}}
github.com/open-telemetry/opentelemetry-collector-contrib/processor/k8sattributesprocessor/internal/kube.(*WatchClient).handleReplicaSetDelete
        github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/internal/kube/client.go:868
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete
        k8s.io/[email protected]/tools/cache/controller.go:253
k8s.io/client-go/tools/cache.(*processorListener).run.func1
        k8s.io/[email protected]/tools/cache/shared_informer.go:979
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
...
```
**Documentation:** N/A - it is not clear to me whether or not this
should land on the changelog. Its impact on users is marginal.

Signed-off-by: Christian Kruse <[email protected]>
sigilioso pushed a commit to carlossscastro/opentelemetry-collector-contrib that referenced this pull request Oct 27, 2023
…pen-telemetry#27847)

**Description:**

The k8s go client's cache expects OnDelete handlers to handle objects of
type DeletedFinalStateUnknown when the cache's watch mechanism misses a
delete and notices later. This changes the processor to handle such
deletes as if they were normal, rather than logging an error and
dropping the change.

**Link to tracking Issue:**
open-telemetry#27632

**Testing:**

Only what you see in the unit tests. I am open to suggestions, but I
don't see this being a code path we can reasonably cover in the e2e test
suite.

Verified manually locally on a kind cluster.
* Stood up two deployments loosely based off e2e testing resources, one
w/ a collector built from this branch and the other
docker.io/otel/opentelemetry-collector-contrib:latest.
* Both included an additional container in the collector pod I used to
fiddle with iptables rules.
* Added rules to reject traffic to/from the kube api server
* Deleted some namespaces containing deployments generating telemetry.
* Restored connectivity by removing the iptables rules.
* Observed the collector built from this branch was silent (aside from
the junk the k8s client logs due to the broken connection)
* Observed the latest
([0.87.0](https://hub.docker.com/layers/otel/opentelemetry-collector-contrib/0.87.0/images/sha256-77cdd395b828b09cb920c671966f09a87a40611aa6107443146086f2046f4a9a?context=explore))
collector logged a handful of errors for the deleted resources
(api_v1.Pod, and apps_v1.ReplicaSet. I probably just didn't wait long
enough for Namespace.)

```
2023-10-19T02:18:37.781Z        error   kube/client.go:236      object received was not of type api_v1.Pod      {"kind": "processor", "name": "k8sattributes", "pipeline": "metrics", "received": {"Key":"src1/telemetrygen-patched-766d55cbcb-8zktr","Obj":{"metadata":{"name":"telemetrygen-patched-766d55cbcb-8zktr","namespace":"src1","uid":"be5d2268-c8b0-434d-b3b8-8b18083c7a8b","creat
ionTimestamp":"2023-10-19T02:01:08Z","labels":{"app":"telemetrygen-patched","pod-template-hash":"766d55cbcb"},"ownerReferences":[{"apiVersion":"apps/v1","kind":"ReplicaSet","name":"telemetrygen-patched-766d55cbcb","uid":"a887d67a-d5d6-4269-b520-45dbb4f1cd82","controller":true,"blockOwnerDeletion":true}]},"spec":{"containers":[{"name":"telemetrygen","image":"localhost/telemetrygen
:latest","resources":{}}],"nodeName":"manual-e2e-testing-control-plane"},"status":{"podIP":"10.244.0.56","startTime":"2023-10-19T02:01:08Z","containerStatuses":[{"name":"telemetrygen","state":{},"lastState":{},"ready":false,"restartCount":0,"image":"","imageID":"","containerID":"containerd://2821ef32cd8bf93a13414504c0f8f0c016c84be49d6ffdbd475d7e4681e90c51"}]}}}}
github.com/open-telemetry/opentelemetry-collector-contrib/processor/k8sattributesprocessor/internal/kube.(*WatchClient).handlePodDelete
        github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/internal/kube/client.go:236
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete
        k8s.io/[email protected]/tools/cache/controller.go:253
...

2023-10-19T02:19:03.970Z        error   kube/client.go:868      object received was not of type apps_v1.ReplicaSet      {"kind": "processor", "name": "k8sattributes", "pipeline": "metrics", "received": {"Key":"src1/telemetrygen-stable-5c444bb8b8","Obj":{"metadata":{"name":"telemetrygen-stable-5c444bb8b8","namespace":"src1","uid":"d37707ff-b308-4339-8543-a1caf5705ea8","creationTimestamp":null,"ownerReferences":[{"apiVersion":"apps/v1","kind":"Deployment","name":"telemetrygen-stable","uid":"c421276e-e1bf-40c5-85e1-e92e30363da5","controller":true,"blockOwnerDeletion":true}]},"spec":{"selector":null,"template":{"metadata":{"creationTimestamp":null},"spec":{"containers":null}}},"status":{"replicas":0}}}}
github.com/open-telemetry/opentelemetry-collector-contrib/processor/k8sattributesprocessor/internal/kube.(*WatchClient).handleReplicaSetDelete
        github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/internal/kube/client.go:868
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete
        k8s.io/[email protected]/tools/cache/controller.go:253
k8s.io/client-go/tools/cache.(*processorListener).run.func1
        k8s.io/[email protected]/tools/cache/shared_informer.go:979
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
...
```
**Documentation:** N/A - it is not clear to me whether or not this
should land on the changelog. Its impact on users is marginal.

Signed-off-by: Christian Kruse <[email protected]>
jmsnll pushed a commit to jmsnll/opentelemetry-collector-contrib that referenced this pull request Nov 12, 2023
…pen-telemetry#27847)

**Description:**

The k8s go client's cache expects OnDelete handlers to handle objects of
type DeletedFinalStateUnknown when the cache's watch mechanism misses a
delete and notices later. This changes the processor to handle such
deletes as if they were normal, rather than logging an error and
dropping the change.

**Link to tracking Issue:**
open-telemetry#27632

**Testing:**

Only what you see in the unit tests. I am open to suggestions, but I
don't see this being a code path we can reasonably cover in the e2e test
suite.

Verified manually locally on a kind cluster.
* Stood up two deployments loosely based off e2e testing resources, one
w/ a collector built from this branch and the other
docker.io/otel/opentelemetry-collector-contrib:latest.
* Both included an additional container in the collector pod I used to
fiddle with iptables rules.
* Added rules to reject traffic to/from the kube api server
* Deleted some namespaces containing deployments generating telemetry.
* Restored connectivity by removing the iptables rules.
* Observed the collector built from this branch was silent (aside from
the junk the k8s client logs due to the broken connection)
* Observed the latest
([0.87.0](https://hub.docker.com/layers/otel/opentelemetry-collector-contrib/0.87.0/images/sha256-77cdd395b828b09cb920c671966f09a87a40611aa6107443146086f2046f4a9a?context=explore))
collector logged a handful of errors for the deleted resources
(api_v1.Pod, and apps_v1.ReplicaSet. I probably just didn't wait long
enough for Namespace.)

```
2023-10-19T02:18:37.781Z        error   kube/client.go:236      object received was not of type api_v1.Pod      {"kind": "processor", "name": "k8sattributes", "pipeline": "metrics", "received": {"Key":"src1/telemetrygen-patched-766d55cbcb-8zktr","Obj":{"metadata":{"name":"telemetrygen-patched-766d55cbcb-8zktr","namespace":"src1","uid":"be5d2268-c8b0-434d-b3b8-8b18083c7a8b","creat
ionTimestamp":"2023-10-19T02:01:08Z","labels":{"app":"telemetrygen-patched","pod-template-hash":"766d55cbcb"},"ownerReferences":[{"apiVersion":"apps/v1","kind":"ReplicaSet","name":"telemetrygen-patched-766d55cbcb","uid":"a887d67a-d5d6-4269-b520-45dbb4f1cd82","controller":true,"blockOwnerDeletion":true}]},"spec":{"containers":[{"name":"telemetrygen","image":"localhost/telemetrygen
:latest","resources":{}}],"nodeName":"manual-e2e-testing-control-plane"},"status":{"podIP":"10.244.0.56","startTime":"2023-10-19T02:01:08Z","containerStatuses":[{"name":"telemetrygen","state":{},"lastState":{},"ready":false,"restartCount":0,"image":"","imageID":"","containerID":"containerd://2821ef32cd8bf93a13414504c0f8f0c016c84be49d6ffdbd475d7e4681e90c51"}]}}}}
github.com/open-telemetry/opentelemetry-collector-contrib/processor/k8sattributesprocessor/internal/kube.(*WatchClient).handlePodDelete
        github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/internal/kube/client.go:236
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete
        k8s.io/[email protected]/tools/cache/controller.go:253
...

2023-10-19T02:19:03.970Z        error   kube/client.go:868      object received was not of type apps_v1.ReplicaSet      {"kind": "processor", "name": "k8sattributes", "pipeline": "metrics", "received": {"Key":"src1/telemetrygen-stable-5c444bb8b8","Obj":{"metadata":{"name":"telemetrygen-stable-5c444bb8b8","namespace":"src1","uid":"d37707ff-b308-4339-8543-a1caf5705ea8","creationTimestamp":null,"ownerReferences":[{"apiVersion":"apps/v1","kind":"Deployment","name":"telemetrygen-stable","uid":"c421276e-e1bf-40c5-85e1-e92e30363da5","controller":true,"blockOwnerDeletion":true}]},"spec":{"selector":null,"template":{"metadata":{"creationTimestamp":null},"spec":{"containers":null}}},"status":{"replicas":0}}}}
github.com/open-telemetry/opentelemetry-collector-contrib/processor/k8sattributesprocessor/internal/kube.(*WatchClient).handleReplicaSetDelete
        github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/internal/kube/client.go:868
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete
        k8s.io/[email protected]/tools/cache/controller.go:253
k8s.io/client-go/tools/cache.(*processorListener).run.func1
        k8s.io/[email protected]/tools/cache/shared_informer.go:979
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
...
```
**Documentation:** N/A - it is not clear to me whether or not this
should land on the changelog. Its impact on users is marginal.

Signed-off-by: Christian Kruse <[email protected]>
abhinavdahiya added a commit to abhinavdahiya/eks-node-viewer that referenced this pull request May 8, 2024
OnDelete event handler can receive
https://pkg.go.dev/k8s.io/client-go/tools/cache#DeletedFinalStateUnknown
in cases where delete event is missed during reconnection.

for example, observed this
```
created by main.startMonitor in goroutine 1
	/home/runner/work/eks-node-viewer/eks-node-viewer/cmd/eks-node-viewer/main.go:207 +0x728
panic: interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.Node [recovered]
	panic: interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.Node [recovered]
	panic: interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.Node
```

Followed similar solution to open-telemetry/opentelemetry-collector-contrib#27847
tzneal pushed a commit to awslabs/eks-node-viewer that referenced this pull request May 14, 2024
OnDelete event handler can receive
https://pkg.go.dev/k8s.io/client-go/tools/cache#DeletedFinalStateUnknown
in cases where delete event is missed during reconnection.

for example, observed this
```
created by main.startMonitor in goroutine 1
	/home/runner/work/eks-node-viewer/eks-node-viewer/cmd/eks-node-viewer/main.go:207 +0x728
panic: interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.Node [recovered]
	panic: interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.Node [recovered]
	panic: interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.Node
```

Followed similar solution to open-telemetry/opentelemetry-collector-contrib#27847
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
processor/k8sattributes k8s Attributes processor Skip Changelog PRs that do not require a CHANGELOG.md entry
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants