When an agent misses a checkin, why do we call it unhealthy vs. offline? #1148
Comments
Here are the states that Elastic Agent can report. I will investigate further to check how we can perform this change. I believe the Elastic Agent should still report as healthy even if it misses some check-ins, and as soon as the last check-in time is older than the check-in interval it should be shown as Offline. @joshdover @kpollich @craig do you agree that the agent should still report as healthy, and that only the Fleet UI should take care of showing it as Offline once that time has expired?
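For illustration, here is a minimal Go sketch of that idea: "offline" is derived purely from how long it has been since the last successful check-in, not from the status the agent last reported. The `displayStatus` helper and the interval/grace-period values are assumptions made up for this example, not the actual Fleet code.

```go
// A minimal sketch (not the actual Fleet code) of deriving "offline" from
// elapsed time since the last successful check-in, rather than from the
// status the agent reported.
package main

import (
	"fmt"
	"time"
)

// Assumed values for illustration only.
const (
	checkinInterval = 5 * time.Minute
	gracePeriod     = 1 * time.Minute
)

// displayStatus maps the agent's last-reported status and last check-in
// time to what the UI would show.
func displayStatus(reported string, lastCheckin, now time.Time) string {
	if now.Sub(lastCheckin) > checkinInterval+gracePeriod {
		// Too long since the last successful check-in: show Offline,
		// regardless of the status the agent last reported.
		return "offline"
	}
	// Otherwise surface whatever the agent reported (healthy, degraded, ...).
	return reported
}

func main() {
	now := time.Now()
	fmt.Println(displayStatus("healthy", now.Add(-2*time.Minute), now))  // healthy
	fmt.Println(displayStatus("healthy", now.Add(-10*time.Minute), now)) // offline
}
```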
I'm not sure we need to change how offline is determined. To me it makes sense that the only way to determine this is by how long it's been since the agent successfully checked in. IMO the issue is that the agent can report itself as degraded or in an error state if one of its previous check-in attempts failed. Presumably if it can check in with Fleet Server, it's no longer degraded. It seems that we report the agent's status during a check-in based on the last attempt, which isn't really relevant anymore if we've checked in successfully. @blakerouse I think this is related to what you brought up over Zoom on Friday. We may want the fleet_gateway code to disconnect and reconnect as soon as the health status changes so Fleet doesn't have an outdated view of the agent. Otherwise, Fleet's view may be up to 5 minutes out of date. So the flow here might be:
Some things to think about:
Here's the relevant code where we currently set the agent status to degraded due to a missed check-in: https://github.com/elastic/elastic-agent/blob/5ca0ae1c94ac35e5d458b32ec0ad715e9f08a83f/internal/pkg/agent/application/gateway/fleet/fleet_gateway.go/#L294-L312
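For reference, a rough sketch of the pattern described above (the names and the failure threshold here are illustrative, not the actual implementation): consecutive check-in failures flip the reported status to degraded, and a successful check-in should reset it back to healthy.

```go
// Illustrative sketch of the "degraded after missed check-ins" pattern.
// Names and the threshold are assumptions, not the real fleet_gateway code.
package main

import (
	"errors"
	"fmt"
)

const failuresBeforeDegraded = 2 // assumed threshold for illustration

type gateway struct {
	consecutiveFailures int
	status              string
}

func (g *gateway) afterCheckin(err error) {
	if err != nil {
		g.consecutiveFailures++
		if g.consecutiveFailures >= failuresBeforeDegraded {
			// This is the behavior being questioned: transient check-in
			// failures make the whole agent look unhealthy in Fleet.
			g.status = "degraded"
		}
		return
	}
	// On a successful check-in the previous failures are no longer
	// relevant, so the status returns to healthy.
	g.consecutiveFailures = 0
	g.status = "healthy"
}

func main() {
	g := &gateway{status: "healthy"}
	g.afterCheckin(errors.New("timeout"))
	g.afterCheckin(errors.New("timeout"))
	fmt.Println(g.status) // degraded
	g.afterCheckin(nil)
	fmt.Println(g.status) // healthy
}
```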
Answering the question from Julien: the way degraded is computed is somewhat, but not overly, complex. The agent consists of multiple components, each of which reports its health state to a reporter. If the gateway is broken, then the gateway is unhealthy. The proposed filter of gateway health could be implemented as a filter at the component health reporting level; it's up to the component, in this case the gateway, to report an unhealthy state to the reporter. So if we decide that fleet connectivity is not a reason for the agent becoming unhealthy, it is implementable, and not even that complex.
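A minimal sketch, under assumed names, of what such a filter at the component reporting level could look like: each component reports its state, and the rollup skips the fleet gateway so connectivity retries don't mark the whole agent unhealthy. The `overallStatus` function and component names are made up for this example.

```go
// Sketch of rolling component states up into an agent-level status while
// filtering out the fleet gateway's connectivity state. Assumed names only.
package main

import "fmt"

type componentState struct {
	name   string
	status string // "healthy", "degraded", or "failed"
}

// overallStatus aggregates component states into an agent-level status,
// skipping the fleet gateway so check-in retries don't mark the agent
// unhealthy; connectivity problems surface as Offline instead.
func overallStatus(states []componentState) string {
	worst := "healthy"
	for _, s := range states {
		if s.name == "fleet-gateway" {
			continue // filtered at the reporting level
		}
		if s.status == "failed" {
			return "failed"
		}
		if s.status == "degraded" {
			worst = "degraded"
		}
	}
	return worst
}

func main() {
	states := []componentState{
		{name: "filebeat", status: "healthy"},
		{name: "fleet-gateway", status: "degraded"}, // check-in retry in progress
	}
	fmt.Println(overallStatus(states)) // healthy
}
```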
Yes, let's remove fleet connectivity as a reason for the agent becoming unhealthy. The agent is either online or offline from a user's perspective. That we hit an error and had to retry a few times on a particular check-in is interesting to us in engineering, but should not be shown in the Fleet UI as long as the product keeps working.
Yes, I am in favour of a longer-term move to WebSockets or an HTTP/2-based streaming protocol. One thing we will likely find as we try to drive connectivity errors to zero is that the chance of something going wrong is highest when making new connections, and we will want to avoid making connections unless we have to. We will find situations where we can't make new connections but can keep the existing ones alive (transient DNS errors, for example).
I believe this is resolved by #1152. I have created a follow-up issue to discuss moving to a streaming protocol between agent and fleet server.
Would be great to revisit this code. We might be treating `offline` agents as `unhealthy` for no real reason...