
When an agent misses a checkin, why do we call it unhealthy vs. offline? #1148

Closed
amitkanfer opened this issue Sep 12, 2022 · 5 comments
Assignees
Labels
bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Comments

@amitkanfer
Contributor

Would be great to revisit this code.

We might be treating offline agents as unhealthy for no real reason...

@amitkanfer amitkanfer added bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels Sep 12, 2022
@jlind23
Contributor

jlind23 commented Sep 12, 2022

Here are the states that Elastic Agent can report.
Offline state does not exist inside Elastic Agent which means that the mapping is done on Kibana's end.

I will investigate further to check how we can perform this change.
Found the code here

I believe the Elastic Agent should still report as healthy even if it misses some checkins, and as soon as the last checkin time is older than the checkin interval, it should be shown as Offline.

@joshdover @kpollich @craig do you agree that the agent should still report healthy, and that only the Fleet UI should show it as offline once the checkin window has expired?

@joshdover
Contributor

I'm not sure we need to change how offline is determined. To me it makes sense that the only way to determine this is by how long it's been since the agent last checked in successfully.

IMO the issue is that the agent can report itself as degraded or in an error state if one of its previous checkin attempts failed. Presumably, if it can check in with Fleet Server, it's no longer degraded. It seems that we report the agent's status during a checkin based on the last attempt, which isn't really relevant anymore if we've checked in successfully.

@blakerouse I think this is related to what you brought up over Zoom on Friday. We may want the fleet_gateway code to disconnect and reconnect as soon as the health status changes so Fleet doesn't have an outdated view of the agent. Otherwise, Fleet's view may be up to 5 minutes out of date. So the flow here might be:

  • Agent misses a Fleet Server checkin, and sets itself to degraded
  • Agent reports its degraded status to Fleet with a successful checkin
  • Since the checkin was successful, the agent updates its internal status to healthy
  • A status change prompts the Agent to disconnect from Fleet Server intentionally and re-checkin with the updated status.
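The state transition in the flow above can be sketched roughly as follows (a sketch only — `nextStatus` and the string statuses are assumptions, not the actual fleet_gateway API):

```go
package main

import "fmt"

// nextStatus sketches the flow above: a failed checkin marks the agent
// degraded; the next successful checkin restores it to healthy and signals
// that an immediate re-checkin should push the update to Fleet, so Fleet's
// view isn't minutes out of date.
func nextStatus(current string, checkinOK bool) (status string, recheckinNow bool) {
	switch {
	case !checkinOK:
		return "degraded", false // missed checkin: degrade, retry on schedule
	case current == "degraded":
		return "healthy", true // recovered: reconnect immediately with the new status
	default:
		return current, false
	}
}

func main() {
	s, _ := nextStatus("healthy", false)
	fmt.Println(s) // degraded after a missed checkin
	s, re := nextStatus(s, true)
	fmt.Println(s, re) // recovered, and an immediate re-checkin is triggered
}
```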

Some things to think about:

  • This will create more incoming requests to Fleet Server, and it may happen precisely when Fleet Server is already under high load (which caused the initial missed checkin). We may want to explore some way for Agent to send updated status information on the same request, possibly with a streaming request over HTTP/2 when possible.
  • Another option could be to avoid sending a degraded status in the first place if the only reason for that degraded status is Fleet Server connectivity issues. I suspect this may be complex to implement though as it makes that type of status special from others.

Here's the relevant code where we currently set the agent status to degraded due to a missed checkin: https://github.com/elastic/elastic-agent/blob/5ca0ae1c94ac35e5d458b32ec0ad715e9f08a83f/internal/pkg/agent/application/gateway/fleet/fleet_gateway.go#L294-L312

@michalpristas
Contributor

michalpristas commented Sep 12, 2022

Answering the question from Julien: when the Agent misses some checkins, it changes its status to degraded. Why? What about keeping it healthy and letting the Fleet UI deal with the "offline" status?

The way degraded is computed is somewhat, but not overly, complex. The agent consists of multiple components, each of which reports its health state to a reporter. If the gateway is broken, then the gateway is unhealthy.
Overall agent health is the worst case of its components' health, so if the gateway is unhealthy, the agent is unhealthy. Offline should not be related to unhealthy; it should be related to the last checkin.
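The worst-case aggregation can be sketched like this (the `Status` ordering and component names are assumptions for illustration, not the actual elastic-agent reporter types):

```go
package main

import "fmt"

// Status values ordered from best to worst, so the numerically largest
// value across all components is the agent's overall health.
type Status int

const (
	Healthy Status = iota
	Degraded
	Failed
)

func (s Status) String() string {
	return [...]string{"healthy", "degraded", "failed"}[s]
}

// overallStatus returns the worst health among all components, so a single
// unhealthy component (e.g. the gateway) makes the whole agent unhealthy.
func overallStatus(components map[string]Status) Status {
	worst := Healthy
	for _, s := range components {
		if s > worst {
			worst = s
		}
	}
	return worst
}

func main() {
	components := map[string]Status{"gateway": Degraded, "filebeat": Healthy}
	fmt.Println(overallStatus(components))
}
```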

The proposed filter of gateway health could be implemented as a filter at the component health reporting level. It's up to the component, in this case the gateway, to report an unhealthy state to the reporter. So if we decide that Fleet connectivity is not a reason for the agent to become unhealthy, it is implementable, and not even that complex.
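Such a filter at the reporting level might look like this (a hypothetical sketch — `reportGatewayStatus` and the string statuses are illustrative, not the actual elastic-agent API):

```go
package main

import "fmt"

// reportGatewayStatus filters what the gateway reports to the health
// reporter: missed Fleet checkins alone no longer make the gateway report
// itself unhealthy, while other gateway failures still surface.
func reportGatewayStatus(missedCheckins int, otherFailure bool) string {
	if otherFailure {
		return "failed" // real gateway failures still mark the agent unhealthy
	}
	// Fleet connectivity problems are intentionally filtered out here;
	// the Fleet UI derives "offline" from the last checkin time instead.
	_ = missedCheckins
	return "healthy"
}

func main() {
	fmt.Println(reportGatewayStatus(3, false)) // healthy despite missed checkins
	fmt.Println(reportGatewayStatus(0, true))  // a non-connectivity failure
}
```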

@cmacknz
Member

cmacknz commented Sep 12, 2022

If we decide that fleet connectivity is not a reason for agent becoming unhealthy, it is implementable, and not even that complex.

Yes, let's remove fleet connectivity as a reason for agent becoming unhealthy. The agent is either online or offline from a user's perspective. That we hit an error and had to retry a few times on a particular checkin is interesting to us in engineering, but should not be shown in the Fleet UI as long as the product keeps working.

We may want to explore some way for Agent to send updated status information on the same request, possibly with a streaming request over HTTP/2 when possible.

Yes, I am in favour of a longer term move to WebSockets or an HTTP/2-based streaming protocol. One thing we will likely find as we try to drive connectivity errors to zero is that the chance of something going wrong is highest when making new connections, and we will want to avoid making connections unless we have to. We will find situations where we can't make new connections but can keep the existing ones alive (transient DNS errors, for example).

@cmacknz
Member

cmacknz commented Sep 12, 2022

I believe this is resolved by #1152.

I have created a follow-up issue to discuss moving to a streaming protocol between agent and fleet server.


5 participants