
When an agent misses a checkin, why do we call it unhealthy vs. offline? #1148

Closed
amitkanfer opened this issue Sep 12, 2022 · 5 comments
Assignees
Labels
bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Comments

@amitkanfer
Contributor

Would be great to revisit this code.

We might be treating offline agents as unhealthy for no real reason...

@amitkanfer amitkanfer added bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels Sep 12, 2022
@jlind23
Contributor

jlind23 commented Sep 12, 2022

Here are the states that Elastic Agent can report.
Offline state does not exist inside Elastic Agent which means that the mapping is done on Kibana's end.

I will investigate further to check how we can perform this change.
Found the code here

I believe the Elastic Agent should still report as healthy even if it misses some checkins, and as soon as the last checkin time is older than the checkin interval, it should be shown as Offline.

@joshdover @kpollich @craig do you agree that the agent should still report healthy, and that only the Fleet UI should show it as offline once the checkin window has expired?

@joshdover
Contributor

I'm not sure we need to change how offline is determined. To me it makes sense that the only way to determine this is by how long it's been since the agent last checked in successfully.

IMO the issue is that the agent can report itself as degraded or in an error state if one of its previous checkin attempts failed. Presumably, if it can check in with Fleet Server, it's no longer degraded. It seems that we report the agent's status during a checkin based on the last attempt, which isn't really relevant anymore if we've checked in successfully.

@blakerouse I think this is related to what you brought up over Zoom on Friday. We may want the fleet_gateway code to disconnect and reconnect as soon as the health status changes so Fleet doesn't have an outdated view of the agent. Otherwise, Fleet's view may be up to 5 minutes out of date. So the flow here might be:

  • Agent misses a Fleet Server checkin, and sets itself to degraded
  • Agent reports its degraded status to Fleet with a successful checkin
  • Since the checkin was successful, the agent updates its internal status to healthy
  • A status change prompts the Agent to disconnect from Fleet Server intentionally and re-checkin with the updated status.
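The state transition in the flow above can be sketched roughly as follows (a sketch only — `nextStatus` and the string statuses are assumptions, not the actual fleet_gateway API):

```go
package main

import "fmt"

// nextStatus sketches the flow above: a failed checkin marks the agent
// degraded; the next successful checkin restores it to healthy and signals
// that an immediate re-checkin should push the update to Fleet, so Fleet's
// view isn't minutes out of date.
func nextStatus(current string, checkinOK bool) (status string, recheckinNow bool) {
	switch {
	case !checkinOK:
		return "degraded", false // missed checkin: degrade, retry on schedule
	case current == "degraded":
		return "healthy", true // recovered: reconnect immediately with the new status
	default:
		return current, false
	}
}

func main() {
	s, _ := nextStatus("healthy", false)
	fmt.Println(s) // degraded after a missed checkin
	s, re := nextStatus(s, true)
	fmt.Println(s, re) // recovered, and an immediate re-checkin is triggered
}
```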

Some things to think about:

  • This will create more incoming requests to Fleet Server, and it may happen precisely when Fleet Server is already under high load (which caused the initial missed checkin). We may want to explore some way for Agent to send updated status information on the same request, possibly with a streaming request over HTTP/2 when possible.
  • Another option could be to avoid sending a degraded status in the first place if the only reason for that degraded status is Fleet Server connectivity issues. I suspect this may be complex to implement though as it makes that type of status special from others.

Here's the relevant code where we currently set the agent status to degraded due to a missed checkin: https://github.com/elastic/elastic-agent/blob/5ca0ae1c94ac35e5d458b32ec0ad715e9f08a83f/internal/pkg/agent/application/gateway/fleet/fleet_gateway.go#L294-L312

@michalpristas
Contributor

michalpristas commented Sep 12, 2022

Answering the question from Julien: when the Agent misses some checkins, it changes its status to degraded. Why? What about keeping it healthy and letting the Fleet UI deal with the "offline" status?

The way degraded is computed is somewhat, but not overly, complex. The agent consists of multiple components, each of which reports its health state to a reporter. If the gateway is broken, then the gateway is unhealthy.
Overall agent health is the worst case of its components' health, so if the gateway is unhealthy, the agent is unhealthy. Offline should not be related to unhealthy; it should be related to the last checkin.
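The worst-case aggregation can be sketched like this (the `Status` ordering and component names are assumptions for illustration, not the actual elastic-agent reporter types):

```go
package main

import "fmt"

// Status values ordered from best to worst, so the numerically largest
// value across all components is the agent's overall health.
type Status int

const (
	Healthy Status = iota
	Degraded
	Failed
)

func (s Status) String() string {
	return [...]string{"healthy", "degraded", "failed"}[s]
}

// overallStatus returns the worst health among all components, so a single
// unhealthy component (e.g. the gateway) makes the whole agent unhealthy.
func overallStatus(components map[string]Status) Status {
	worst := Healthy
	for _, s := range components {
		if s > worst {
			worst = s
		}
	}
	return worst
}

func main() {
	components := map[string]Status{"gateway": Degraded, "filebeat": Healthy}
	fmt.Println(overallStatus(components))
}
```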

The proposed filter of gateway health could be implemented as a filter at the component health reporting level. It's up to the component, in this case the gateway, to report an unhealthy state to the reporter. So if we decide that Fleet connectivity is not a reason for the agent to become unhealthy, it is implementable, and not even that complex.
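Such a filter at the reporting level might look like this (a hypothetical sketch — `reportGatewayStatus` and the string statuses are illustrative, not the actual elastic-agent API):

```go
package main

import "fmt"

// reportGatewayStatus filters what the gateway reports to the health
// reporter: missed Fleet checkins alone no longer make the gateway report
// itself unhealthy, while other gateway failures still surface.
func reportGatewayStatus(missedCheckins int, otherFailure bool) string {
	if otherFailure {
		return "failed" // real gateway failures still mark the agent unhealthy
	}
	// Fleet connectivity problems are intentionally filtered out here;
	// the Fleet UI derives "offline" from the last checkin time instead.
	_ = missedCheckins
	return "healthy"
}

func main() {
	fmt.Println(reportGatewayStatus(3, false)) // healthy despite missed checkins
	fmt.Println(reportGatewayStatus(0, true))  // a non-connectivity failure
}
```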

@cmacknz
Member

cmacknz commented Sep 12, 2022

If we decide that fleet connectivity is not a reason for agent becoming unhealthy, it is implementable, and not even that complex.

Yes, let's remove fleet connectivity as a reason for agent becoming unhealthy. The agent is either online or offline from a user's perspective. That we hit an error and had to retry a few times on a particular checkin is interesting to us in engineering, but should not be shown in the Fleet UI as long as the product keeps working.

We may want to explore some way for Agent to send updated status information on the same request, possibly with a streaming request over HTTP/2 when possible.

Yes, I am in favour of a longer term move to WebSockets or an HTTP/2-based streaming protocol. One thing we will likely find as we try to drive connectivity errors to zero is that the chance of something going wrong is highest when making new connections, and we will want to avoid making connections unless we have to. We will find situations where we can't make new connections but can keep the existing ones alive (transient DNS errors, for example).

@cmacknz
Member

cmacknz commented Sep 12, 2022

I believe this is resolved by #1152.

I have created a follow-up issue to discuss moving to a streaming protocol between agent and fleet server.


5 participants