Component(s)
target allocator
What happened?
Description
A Target Allocator in one of our k8s clusters got stuck due to network issues, causing metrics to stop being sent.
The error message we saw in the logs was "Failed to create namespace informer in promOperator CRD watcher". Looking at the code, there appears to be no retry/recovery mechanism for this situation; the code just logs the error and continues as usual.
Steps to Reproduce
This is not easily reproducible, as we don't know exactly what in the network caused this situation.
Expected Result
The Target Allocator should recover from temporary network issues instead of staying stuck until manual intervention.
Actual Result
The Target Allocator stays stuck until it is restarted manually.
Kubernetes Version
v1.28.11-eks-db838b0
Operator version
0.103.0
Collector version
0.103.0
Environment information
Environment
OS: (e.g., "Ubuntu 20.04")
Compiler (if manually compiled): (e.g., "go 14.2")
Log output
Additional context
No response
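Generally speaking, the behaviour we want when the Target Allocator doesn't have sufficient permissions to function correctly is to retry for a while, logging errors about the source of the problem, and then exit. So this sounds like a good idea to me. I think we originally didn't want to change this behaviour for backwards-compatibility reasons, but enough time has passed that all users should have the Namespace permission added.

As an illustration of that retry-then-exit behaviour, here is a minimal sketch, not the operator's actual code: `getNamespaceInformer`, the attempt count, and the backoff values are hypothetical placeholders standing in for whatever the promOperator CRD watcher really does.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"os"
	"time"
)

// getNamespaceInformer is a hypothetical placeholder for the setup call that
// currently fails with "Failed to create namespace informer in promOperator
// CRD watcher".
func getNamespaceInformer() error {
	return errors.New("failed to create namespace informer")
}

// createNamespaceInformerWithRetry retries the informer setup with a simple
// exponential backoff, logging each failure, and gives up after maxAttempts
// so the process can exit instead of running in a silently broken state.
func createNamespaceInformerWithRetry(maxAttempts int, initialDelay time.Duration) error {
	delay := initialDelay
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := getNamespaceInformer()
		if err == nil {
			return nil
		}
		log.Printf("failed to create namespace informer (attempt %d/%d): %v", attempt, maxAttempts, err)
		if attempt < maxAttempts {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts", maxAttempts)
}

func main() {
	if err := createNamespaceInformerWithRetry(5, 2*time.Second); err != nil {
		log.Printf("namespace informer could not be created: %v", err)
		// Exit non-zero so Kubernetes restarts the pod instead of leaving a
		// Target Allocator that has stopped serving targets.
		os.Exit(1)
	}
}
```

Exiting on persistent failure lets the kubelet's restart policy handle recovery, which also makes the failure visible via pod restarts and events rather than only in the logs.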