Supervisor does not retry connecting to the OpAMP server forever #33408
Comments
You mean that when the connection to the OpAMP server fails, the supervisor should exit the process?
The opposite: it should never exit, but keep retrying to (re)connect forever.
+1. This is the intent.
Will that MR solve it? Seems like the review is on your side, @tigrannajaryan ;)
Component(s)
cmd/opampsupervisor
What happened?
Once started, the supervisor should not give up connecting to the OpAMP backend when it runs into connectivity errors.
At the very least, the timeout before giving up should be configurable.
For resilience in an unstable (e.g. cellular) network environment, the current behavior requires an external scheduler such as systemd to restart the supervisor, instead of the supervisor retrying on its own. Such external restarts also increase CPU load. An "endless" loop with a retry timeout is the best practice for client-to-server communication retries.
Despite the errors, the log indicates that retries are happening (e.g., a "will retry" message). If it nevertheless looks like the supervisor is not retrying, that might be due to:
Immediate failures: the connection attempts might be failing in such quick succession that it appears as if there is no retry mechanism.
Configuration settings: there might be settings that limit or control the retry behavior which I don't know about. Why does the supervisor have a fixed (instead of unlimited) retry policy or limit?
I feel the supervisor code was written this way to handle the error situation, but it should retry in a resilient manner.
Collector version
0.101
Environment information
No response
OpenTelemetry Collector configuration
No response
Log output
Additional context
No response