
Temporary network issue causes Target Allocator to get stuck #3216

Closed
sfc-gh-dguy opened this issue Aug 13, 2024 · 3 comments · Fixed by #3244
Labels: area:target-allocator (Issues for target-allocator), bug (Something isn't working)

Comments

@sfc-gh-dguy

Component(s)

target allocator

What happened?

Description

The Target Allocator in one of our k8s clusters got stuck due to network issues, causing metrics to stop being sent.
The error message we saw in the logs was "Failed to create namespace informer in promOperator CRD watcher", and when looking at the code it seems there's no retry/recovery mechanism for this situation. Instead, the code just logs the error and continues as usual.
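For illustration, the behaviour described above amounts to the following pattern (a simplified sketch with assumed names such as createNamespaceInformer, not the operator's actual code):

```go
// Simplified sketch of the log-and-continue pattern described above.
// All names here are assumptions for illustration, not the operator's code.
package watcher

import (
	"errors"

	"github.com/go-logr/logr"
	"k8s.io/client-go/tools/cache"
)

type PrometheusCRWatcher struct {
	nsInformer cache.SharedIndexInformer
}

// createNamespaceInformer is a hypothetical stand-in for the call that fails
// with a TLS handshake timeout during a transient network issue.
func createNamespaceInformer() (cache.SharedIndexInformer, error) {
	return nil, errors.New("net/http: TLS handshake timeout")
}

func newWatcher(logger logr.Logger) *PrometheusCRWatcher {
	nsInformer, err := createNamespaceInformer()
	if err != nil {
		logger.Error(err, "Failed to create namespace informer in promOperator CRD watcher")
		// No retry and no early return: the watcher is built with a nil
		// informer, so the allocator keeps running but never sees namespaces.
	}
	return &PrometheusCRWatcher{nsInformer: nsInformer}
}
```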

Steps to Reproduce

This is not easily reproducible, as we don't know what exactly in the network caused this situation.

Expected Result

The Target Allocator should recover from temporary network issues and not remain stuck until manual intervention.

Actual Result

The Target Allocator remains stuck until it is manually restarted.

Kubernetes Version

v1.28.11-eks-db838b0

Operator version

0.103.0

Collector version

0.103.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 14.2")

Log output

{"level":"error","ts":"2024-08-13T01:25:32Z","logger":"setup.prometheus-cr-watcher","msg":"Failed to create namespace informer in promOperator CRD watcher","error":"Get \"https://172.20.0.1:443/version\": net/http: TLS handshake timeout","stacktrace":"github.com/open-telemetry/opentelemetry-operator/cmd/otel-allocator/watcher.NewPrometheusCRWatcher\n\t/home/runner/work/opentelemetry-operator/opentelemetry-operator/cmd/otel-allocator/watcher/promOperator.go:99\nmain.main\n\t/home/runner/work/opentelemetry-operator/opentelemetry-operator/cmd/otel-allocator/main.go:119\nruntime.main\n\t/opt/hostedtoolcache/go/1.22.4/x64/src/runtime/proc.go:271"}

Additional context

No response

@sfc-gh-dguy sfc-gh-dguy added bug Something isn't working needs triage labels Aug 13, 2024
@jaronoff97 jaronoff97 added area:target-allocator Issues for target-allocator and removed needs triage labels Aug 26, 2024
@davidhaja
Contributor

I'm interested in working on this one.
After checking the mentioned code, I see two possible solutions:

  1. When we receive an error, we return nil and the error, which then terminates the whole process with a non-zero exit code.
  2. We implement a retry mechanism and exit with an error only after N failures (a rough sketch of this option is below).

@jaronoff97 Any thoughts? What would you prefer?
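
For illustration, option 2 could look roughly like the sketch below, using exponential backoff from k8s.io/apimachinery (the function name and backoff parameters are assumptions, not a definitive implementation):

```go
// Rough sketch of option 2: retry with exponential backoff, then give up.
// newNamespaceInformerWithRetry and the backoff parameters are assumptions.
package watcher

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// newNamespaceInformerWithRetry retries the informer setup and returns an
// error only once all attempts have failed, so main can exit non-zero
// instead of continuing with a broken watcher.
func newNamespaceInformerWithRetry(create func() error) error {
	backoff := wait.Backoff{
		Duration: 2 * time.Second, // delay before the first retry
		Factor:   2.0,             // double the delay on each attempt
		Steps:    5,               // give up after N = 5 attempts
	}
	var lastErr error
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		if lastErr = create(); lastErr != nil {
			// Transient network errors (like the TLS handshake timeout in
			// the log output above) land here; log and let the backoff retry.
			fmt.Printf("failed to create namespace informer, retrying: %v\n", lastErr)
			return false, nil
		}
		return true, nil // success, stop retrying
	})
	if err != nil {
		return fmt.Errorf("creating namespace informer failed after retries: %w", lastErr)
	}
	return nil
}
```

The caller would then propagate this error instead of only logging it, so the process exits rather than running in a broken state.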

@jaronoff97
Contributor

jaronoff97 commented Aug 26, 2024

I think doing a retry and then exiting makes sense to me. @swiatekm thoughts? Thanks for helping out :D

@swiatekm
Contributor

Generally speaking, when the target allocator doesn't have sufficient permissions to function correctly, the behaviour we want is to retry for a while, emitting error logs about the source of the problem, and then exit. So this sounds like a good idea to me. I think we originally didn't want to change this behaviour for backwards-compatibility reasons, but enough time has passed that all users should have the Namespace permission added.
