[Fleet] Prioritize showing agents as offline #140477

Closed
joshdover opened this issue Sep 12, 2022 · 6 comments · Fixed by #140621
Labels
bug Fixes for quality problems that affect the customer experience Team:Fleet Team label for Observability Data Collection Fleet team

Comments

@joshdover
Contributor

Our agent status calculation code checks for some statuses before others, so earlier checks take priority over later ones. This is done here:

if (!agent.active) {
  return 'inactive';
}
if (agent.unenrollment_started_at && !agent.unenrolled_at) {
  return 'unenrolling';
}
if (!agent.last_checkin) {
  return 'enrolling';
}
const msLastCheckIn = new Date(lastCheckIn || 0).getTime();
const msSinceLastCheckIn = new Date().getTime() - msLastCheckIn;
const intervalsSinceLastCheckIn = Math.floor(msSinceLastCheckIn / AGENT_POLLING_THRESHOLD_MS);
if (agent.last_checkin_status === 'error') {
  return 'error';
}
if (agent.last_checkin_status === 'degraded') {
  return 'degraded';
}
const policyRevision =
  'policy_revision' in agent
    ? agent.policy_revision
    : 'policy_revision_idx' in agent
    ? agent.policy_revision_idx
    : undefined;
if (!policyRevision || (agent.upgrade_started_at && !agent.upgraded_at)) {
  return 'updating';
}
if (intervalsSinceLastCheckIn >= offlineTimeoutIntervalCount) {
  return 'offline';
}
return 'online';

Since the offline check comes second to last, an agent that cannot check in at all can end up showing a status other than offline. A few common situations:

  • An agent that was reporting as unhealthy when the user closed the laptop will stay unhealthy in the UI
  • An agent that was able to enroll but did not subsequently check in will show as "updating" indefinitely

These situations may give the false impression that the user can take action on this agent from the UI, when in fact they cannot, because the Agent is not able to check in with Fleet Server at all.

One simple solution could be to always show these agents as offline in the agent table view if they haven't checked in within the required window, but still show the other status information on the agent detail page.
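As a rough illustration of the reordering proposed above, here is a minimal sketch in which the offline check runs before the error, degraded, and updating checks. The `AgentLike` interface, the function name `getAgentStatusOfflineFirst`, the `now` parameter, and both constant values are assumptions made for this example; they are not taken from the Fleet codebase, and the actual fix may be structured differently.

```typescript
// Hypothetical, trimmed-down agent shape; the real Fleet types carry more fields.
interface AgentLike {
  active: boolean;
  last_checkin?: string;
  last_checkin_status?: string;
  unenrollment_started_at?: string;
  unenrolled_at?: string;
  upgrade_started_at?: string;
  upgraded_at?: string;
  policy_revision?: number;
}

type AgentStatus =
  | 'inactive'
  | 'unenrolling'
  | 'enrolling'
  | 'offline'
  | 'error'
  | 'degraded'
  | 'updating'
  | 'online';

// Assumed values, chosen only so the example runs.
const AGENT_POLLING_THRESHOLD_MS = 30000;
const OFFLINE_TIMEOUT_INTERVAL_COUNT = 10;

function getAgentStatusOfflineFirst(agent: AgentLike, now: number = Date.now()): AgentStatus {
  if (!agent.active) {
    return 'inactive';
  }
  if (agent.unenrollment_started_at && !agent.unenrolled_at) {
    return 'unenrolling';
  }
  if (!agent.last_checkin) {
    return 'enrolling';
  }

  const msSinceLastCheckIn = now - new Date(agent.last_checkin).getTime();
  const intervalsSinceLastCheckIn = Math.floor(msSinceLastCheckIn / AGENT_POLLING_THRESHOLD_MS);

  // Moved up: an agent that has stopped checking in is reported as offline
  // regardless of the last status it reported.
  if (intervalsSinceLastCheckIn >= OFFLINE_TIMEOUT_INTERVAL_COUNT) {
    return 'offline';
  }

  if (agent.last_checkin_status === 'error') {
    return 'error';
  }
  if (agent.last_checkin_status === 'degraded') {
    return 'degraded';
  }
  if (!agent.policy_revision || (agent.upgrade_started_at && !agent.upgraded_at)) {
    return 'updating';
  }
  return 'online';
}
```

With this ordering, both situations listed above (an unhealthy agent whose laptop was closed, and an enrolled agent that never checks in again) resolve to offline once the timeout window has passed.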

@joshdover joshdover added the Team:Fleet Team label for Observability Data Collection Fleet team label Sep 12, 2022
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@joshdover
Contributor Author

Related to #122206

@jen-huang jen-huang added the bug Fixes for quality problems that affect the customer experience label Sep 12, 2022
@jen-huang
Contributor

@nchaulet Putting this high priority bug on your list.

@nchaulet
Member

nchaulet commented Sep 12, 2022

@paul-tavares @kevinlog Before I start working on this: I know Endpoint uses the Fleet status in some way. Does this change make sense to you?

@kevinlog
Contributor

@nchaulet - apologies, I missed this ping earlier. Thank you for the heads up. We are using the Fleet status in some places, but just to show users what it is. We don't rely on the status for any logic. I think it's OK to change the priority of statuses in Fleet to make more sense for customers.

We have recently added features to bubble up Endpoint specific errors in the Fleet Agent details UI, but this, again, isn't dependent on Agent showing that it is "Unhealthy". We detect these errors from the Endpoint policy response itself. So changing the way we show Offline is fine.

@paul-tavares
Contributor

@nchaulet - @kevinlog is correct - I don't think we are impacted by this change.

We use a few pieces of data from the Agent in enriching our endpoint metadata API here:

return {
  metadata: endpointMetadata,
  host_status: fleetAgent
    ? // eslint-disable-next-line @typescript-eslint/no-non-null-assertion
      fleetAgentStatusToEndpointHostStatus(fleetAgent.status!)
    : DEFAULT_ENDPOINT_HOST_STATUS,
  policy_info: {
    agent: {
      applied: {
        revision: fleetAgent?.policy_revision ?? 0,
        id: fleetAgent?.policy_id ?? '',
      },
      configured: {
        revision: fleetAgentPolicy?.revision ?? 0,
        id: fleetAgentPolicy?.id ?? '',
      },
    },
    endpoint: {
      revision: endpointPackagePolicy?.revision ?? 0,
      id: endpointPackagePolicy?.id ?? '',
    },
  },
};

Specifically for the statuses, we take the AgentStatus returned from Fleet (agent API) and map it to a status value we display in our Endpoint list - code here:

// For an understanding of how fleet agent status is calculated:
// @see `x-pack/plugins/fleet/common/services/agent_status.ts`
const STATUS_MAPPING: ReadonlyMap<AgentStatus, HostStatus> = new Map([
  ['online', HostStatus.HEALTHY],
  ['offline', HostStatus.OFFLINE],
  ['inactive', HostStatus.INACTIVE],
  ['unenrolling', HostStatus.UPDATING],
  ['enrolling', HostStatus.UPDATING],
  ['updating', HostStatus.UPDATING],
  ['warning', HostStatus.UNHEALTHY],
  ['error', HostStatus.UNHEALTHY],
  ['degraded', HostStatus.UNHEALTHY],
]);
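To make the display-only dependency concrete, here is a small self-contained sketch of how a map like the one above could be consumed. The `HostStatus` enum values, the `toHostStatus` helper, and the choice of fallback are assumptions for illustration only, not the actual Endpoint code.

```typescript
// Hypothetical stand-ins for the Fleet and Endpoint types referenced above.
type AgentStatus =
  | 'online'
  | 'offline'
  | 'inactive'
  | 'unenrolling'
  | 'enrolling'
  | 'updating'
  | 'warning'
  | 'error'
  | 'degraded';

enum HostStatus {
  HEALTHY = 'healthy',
  OFFLINE = 'offline',
  INACTIVE = 'inactive',
  UPDATING = 'updating',
  UNHEALTHY = 'unhealthy',
}

// Assumed fallback for statuses missing from the map.
const DEFAULT_HOST_STATUS = HostStatus.UNHEALTHY;

const STATUS_MAPPING: ReadonlyMap<AgentStatus, HostStatus> = new Map<AgentStatus, HostStatus>([
  ['online', HostStatus.HEALTHY],
  ['offline', HostStatus.OFFLINE],
  ['inactive', HostStatus.INACTIVE],
  ['unenrolling', HostStatus.UPDATING],
  ['enrolling', HostStatus.UPDATING],
  ['updating', HostStatus.UPDATING],
  ['warning', HostStatus.UNHEALTHY],
  ['error', HostStatus.UNHEALTHY],
  ['degraded', HostStatus.UNHEALTHY],
]);

// Pure lookup: the Endpoint side only translates whatever status Fleet reports.
function toHostStatus(status: AgentStatus): HostStatus {
  return STATUS_MAPPING.get(status) ?? DEFAULT_HOST_STATUS;
}
```

Because the lookup only translates the status Fleet hands over, an agent that Fleet starts reporting as offline instead of error would simply appear as `HostStatus.OFFLINE` in the Endpoint list, with no other logic affected.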
