
Temporary network issue causes Target Allocator to get stuck #3216

Closed
sfc-gh-dguy opened this issue Aug 13, 2024 · 3 comments · Fixed by #3244
Labels: area:target-allocator (Issues for target-allocator), bug (Something isn't working)

Comments

@sfc-gh-dguy

Component(s)

target allocator

What happened?

Description

The Target Allocator in one of our k8s clusters got stuck due to network issues, causing metrics to stop being sent.
The error message we saw in the logs was "Failed to create namespace informer in promOperator CRD watcher", and when looking at the code it seems there's no retry/recovery mechanism for this situation. Instead, the code just logs the error and continues as usual.
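For illustration, the behaviour described above amounts to the following pattern (a simplified sketch with assumed names such as createNamespaceInformer, not the operator's actual code):

```go
// Simplified sketch of the log-and-continue pattern described above.
// All names here are assumptions for illustration, not the operator's code.
package watcher

import (
	"errors"

	"github.com/go-logr/logr"
	"k8s.io/client-go/tools/cache"
)

type PrometheusCRWatcher struct {
	nsInformer cache.SharedIndexInformer
}

// createNamespaceInformer is a hypothetical stand-in for the call that fails
// with a TLS handshake timeout during a transient network issue.
func createNamespaceInformer() (cache.SharedIndexInformer, error) {
	return nil, errors.New("net/http: TLS handshake timeout")
}

func newWatcher(logger logr.Logger) *PrometheusCRWatcher {
	nsInformer, err := createNamespaceInformer()
	if err != nil {
		logger.Error(err, "Failed to create namespace informer in promOperator CRD watcher")
		// No retry and no early return: the watcher is built with a nil
		// informer, so the allocator keeps running but never sees namespaces.
	}
	return &PrometheusCRWatcher{nsInformer: nsInformer}
}
```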

Steps to Reproduce

This is not easily reproducible, as we don't know what exactly in the network caused this situation.

Expected Result

The Target Allocator should recover from temporary network issues and not remain stuck until manual intervention.

Actual Result

The Target Allocator remains stuck until it is manually restarted.

Kubernetes Version

v1.28.11-eks-db838b0

Operator version

0.103.0

Collector version

0.103.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 14.2")

Log output

{"level":"error","ts":"2024-08-13T01:25:32Z","logger":"setup.prometheus-cr-watcher","msg":"Failed to create namespace informer in promOperator CRD watcher","error":"Get \"https://172.20.0.1:443/version\": net/http: TLS handshake timeout","stacktrace":"github.com/open-telemetry/opentelemetry-operator/cmd/otel-allocator/watcher.NewPrometheusCRWatcher\n\t/home/runner/work/opentelemetry-operator/opentelemetry-operator/cmd/otel-allocator/watcher/promOperator.go:99\nmain.main\n\t/home/runner/work/opentelemetry-operator/opentelemetry-operator/cmd/otel-allocator/main.go:119\nruntime.main\n\t/opt/hostedtoolcache/go/1.22.4/x64/src/runtime/proc.go:271"}

Additional context

No response

@sfc-gh-dguy sfc-gh-dguy added bug Something isn't working needs triage labels Aug 13, 2024
@jaronoff97 jaronoff97 added area:target-allocator Issues for target-allocator and removed needs triage labels Aug 26, 2024
@davidhaja
Contributor

I'm interested in working on this one.
After checking the mentioned code, I see two possible solutions:

  1. When we receive an error, we return nil and the error, which then terminates the whole process with a non-zero exit code.
  2. We implement a retry mechanism and exit with an error only after N failures (a rough sketch of this option is below).

@jaronoff97 Any thoughts? What would you prefer?
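
For illustration, option 2 could look roughly like the sketch below, using exponential backoff from k8s.io/apimachinery (the function name and backoff parameters are assumptions, not a definitive implementation):

```go
// Rough sketch of option 2: retry with exponential backoff, then give up.
// newNamespaceInformerWithRetry and the backoff parameters are assumptions.
package watcher

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// newNamespaceInformerWithRetry retries the informer setup and returns an
// error only once all attempts have failed, so main can exit non-zero
// instead of continuing with a broken watcher.
func newNamespaceInformerWithRetry(create func() error) error {
	backoff := wait.Backoff{
		Duration: 2 * time.Second, // delay before the first retry
		Factor:   2.0,             // double the delay on each attempt
		Steps:    5,               // give up after N = 5 attempts
	}
	var lastErr error
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		if lastErr = create(); lastErr != nil {
			// Transient network errors (like the TLS handshake timeout in
			// the log output above) land here; log and let the backoff retry.
			fmt.Printf("failed to create namespace informer, retrying: %v\n", lastErr)
			return false, nil
		}
		return true, nil // success, stop retrying
	})
	if err != nil {
		return fmt.Errorf("creating namespace informer failed after retries: %w", lastErr)
	}
	return nil
}
```

The caller would then propagate this error instead of only logging it, so the process exits rather than running in a broken state.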

@jaronoff97
Contributor

jaronoff97 commented Aug 26, 2024

I think doing a retry and then exiting makes sense to me. @swiatekm thoughts? Thanks for helping out :D

@swiatekm
Contributor

Generally speaking, when the target allocator doesn't have sufficient permissions to function correctly, the behaviour we want is to retry for a while, emitting error logs about the source of the problem, and then exit. So this sounds like a good idea to me. I think we originally didn't want to change this behaviour for backwards-compatibility reasons, but enough time has passed that all users should have the Namespace permission added.
