
/api/fleet/agent_status provides incorrect counts #134798

Closed
pjbertels opened this issue Jun 21, 2022 · 12 comments · Fixed by #135816
Assignees: juliaElastic
Labels: bug, Project:FleetScaling, Team:Fleet

Comments

@pjbertels

Kibana version:
8.3.0-6e69754c

Elasticsearch version:

Server OS version:

Browser version:

Browser OS version:

Original install method (e.g. download page, yum, from source, etc.):

Describe the bug:
I use /api/fleet/agent_status daily to determine what is going on in Fleet, and the counts it returns are often wrong. If you poll /api/fleet/agent_status while doing Fleet operations, these inconsistencies show up often.
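For reference, a poll of the endpoint looks roughly like this (a sketch only; the host, API key, and exact response shape are assumptions):

// Hypothetical polling sketch; KIBANA_URL, API_KEY, and POLICY_ID are placeholders.
const KIBANA_URL = 'https://kibana.example.com:5601';
const API_KEY = '<api-key>';
const POLICY_ID = '<agent-policy-id>';

const res = await fetch(
  `${KIBANA_URL}/api/fleet/agent_status?kuery=${encodeURIComponent(`policy_id : ${POLICY_ID}`)}`,
  { headers: { Authorization: `ApiKey ${API_KEY}`, 'kbn-xsrf': 'true' } }
);
const body = await res.json();
// The counts are expected under body.results: { total, inactive, online, error, offline, updating, other, events }
console.log(body.results);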

For example, the following call for a policy with 10000 agents reports the total as 10004.
FleetAgentStatus(total=10004, inactive=0, online=9811, error=0, offline=150, updating=43, other=43, events=0, run_id=None, timestamp=None, kuery='policy_id : f2fba850-f0f7-11ec-9c99-f30f5bda23da', cluster_name=None)

Steps to reproduce:
1.
2.
3.

Expected behavior:
I expect that summing some combination of the known status fields will always match the total.
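One candidate invariant (a sketch; field names are from the status response above, and the relationship is confirmed later in the thread):

// Sketch of the expected invariant: the active status buckets should sum to total.
function countsAreConsistent(s: {
  total: number;
  online: number;
  error: number;
  offline: number;
  updating: number;
}): boolean {
  return s.online + s.error + s.offline + s.updating === s.total;
}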

Screenshots (if relevant):

Errors in browser console (if relevant):

Provide logs and/or server output (if relevant):

Any additional context:
Accurate reporting for this interface is critical to testing and debugging Fleet Server. If the REST call is reporting the wrong counts, I would expect the Kibana UI to be affected as well.

pjbertels added the bug and Team:Fleet labels on Jun 21, 2022
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@pjbertels
Author

pjbertels commented Jun 21, 2022

In this case, after my drones had been up for a while, I found that Fleet was reporting 4 drones in the updating state and 10000 healthy, but the actual number of drones on my system was 10000, plus the default agent policy, so it should have been 10001.

@pjbertels
Author

Here is an example of a cluster with 10001 agents (the default agent and 10,000 Horde drones).
[screenshot]

@pjbertels
Author

It turns out the description was missing, which is unfortunate because we had dumped the JSON from Fleet.

httpx._client:_client.py:1002 HTTP Request: PUT https://perf-tqjrz-custom.kb.us-west2.gcp.elastic-cloud.com:9243/api/fleet/package_policies/elastic-cloud-fleet-server "HTTP/1.1 400 Bad Request" {"statusCode":400,"error":"Bad Request","message":"[request body.description]: expected value of type [string] but got [null]"}

We fixed the file by adding a description. This likely resulted in another issue ...
[screenshot]

I'll retest to verify the causality.

@juliaElastic juliaElastic self-assigned this Jul 5, 2022
@pjbertels
Author

pjbertels commented Jul 6, 2022

Recent examples while trying to bring up 25,000 agents:
FleetAgentStatus(total=18473, inactive=0, online=10733, error=0, offline=7850, updating=71, other=-110, events=0,
FleetAgentStatus(total=18492, inactive=-3, online=11475, error=0, offline=7064, updating=56, other=-50, events=0,
FleetAgentStatus(total=18496, inactive=-1, online=11050, error=0, offline=6133, updating=53, other=1312, events=0
These problems happen without the Fleet server going unhealthy. So just ignore my previous comment.

@juliaElastic
Contributor

juliaElastic commented Jul 6, 2022

I looked into this, and what happens in the code is that there are queries made in a loop for different statuses:

const [all, allActive, online, error, offline, updating] = await pMap(
  [
    undefined, // All agents, including inactive
    undefined, // All active agents
    AgentStatusKueryHelper.buildKueryForOnlineAgents(),
    AgentStatusKueryHelper.buildKueryForErrorAgents(),
    AgentStatusKueryHelper.buildKueryForOfflineAgents(),
    AgentStatusKueryHelper.buildKueryForUpdatingAgents(),
  ],
  (kuery, index) =>
    getAgentsByKuery(esClient, {
      // ... (truncated)

I think the reason for the discrepancies is that the status changes quickly between healthy and offline when the last_checkin time passes 5 min, so some agents show up in both the healthy and offline buckets.
I think opening a point in time before running the status queries would help with this.
I can put up a small PR for this.
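As a minimal sketch of that approach (using the Elasticsearch JS client's point-in-time APIs; the helper name and query plumbing here are illustrative, not the actual PR):

import { Client, estypes } from '@elastic/elasticsearch';

// Sketch only: open a point in time so every status query counts against the same
// snapshot of .fleet-agents instead of racing a live index between queries.
async function countAgentsByStatus(
  esClient: Client,
  statusQueries: estypes.QueryDslQueryContainer[]
): Promise<number[]> {
  const pit = await esClient.openPointInTime({ index: '.fleet-agents', keep_alive: '1m' });
  try {
    return await Promise.all(
      statusQueries.map(async (query) => {
        const res = await esClient.search({
          size: 0,
          track_total_hits: true,
          pit: { id: pit.id },
          query,
        });
        // hits.total is either a number or a { value, relation } object depending on settings
        return typeof res.hits.total === 'number' ? res.hits.total : res.hits.total?.value ?? 0;
      })
    );
  } finally {
    await esClient.closePointInTime({ id: pit.id });
  }
}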

For the other discrepancy, where the total > enrolled agents, I haven't found a reason yet. The query just counts the documents in the .fleet-agents index, so it shouldn't return more than the enrolled agent count.

EDIT: while testing the fix, I found one occurrence of the count discrepancy after starting an upgrade of agents and then halting them.
It seems that the current filters allow an overlap between the offline and updating (or online) statuses. I'm not sure how this can happen, as the offline query excludes the updating status.

{
  "total": 3188,
  "inactive": 0,
  "online": 3122,
  "error": 0,
  "offline": 4,
  "updating": 64,
  "other": 62,
  "events": 0
}

I couldn't reproduce this problem since. What I did notice is that when I start an upgrade and then unenroll agents (with horde halt), the agents keep showing up in the updating state even after the offline timeout (5m) has passed.
I think we could change the logic here so that agents go to offline after 5m even if they would otherwise be in the updating state (the upgrade hasn't finished), because the upgrade is unlikely to ever finish if the agent was stopped.
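A rough sketch of that precedence, assuming illustrative field names on the agent document (last_checkin, upgrade_started_at, upgraded_at); this is not the actual Fleet implementation:

// Sketch only: report a stopped agent as offline even if an upgrade was started and
// never completed, since the upgrade will likely never finish once the agent is down.
const OFFLINE_TIMEOUT_MS = 5 * 60 * 1000;

type AgentDoc = { last_checkin?: string; upgrade_started_at?: string; upgraded_at?: string };

function resolveStatus(agent: AgentDoc, now = Date.now()): 'offline' | 'updating' | 'online' {
  const lastCheckin = agent.last_checkin ? Date.parse(agent.last_checkin) : 0;
  if (now - lastCheckin > OFFLINE_TIMEOUT_MS) {
    return 'offline'; // offline takes precedence over updating
  }
  if (agent.upgrade_started_at && !agent.upgraded_at) {
    return 'updating';
  }
  return 'online';
}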

[screenshot]

@pjbertels
Author

I see this one in 8.4.0 when trying to bring up 25,000.
FleetAgentStatus(total=25020, inactive=0, online=25000, error=0, offline=0, updating=20, other=20, events=0

@juliaElastic
Contributor

juliaElastic commented Jul 7, 2022

I see this one in 8.4.0 when trying to bring up 25,000. FleetAgentStatus(total=25020, inactive=0, online=25000, error=0, offline=0, updating=20, other=20, events=0

This looks good, online + updating = total (total stands for all active).
For other, there is a calculation: other = all - online - error - offline. I am not sure why we need the other status; it is confusing that it is a different subset (it doesn't include online, but it does include inactive).
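As a rough illustration of how the reported fields appear to relate, based only on the variables from the query loop above (this is not the actual Kibana source):

// Illustration only: how the response fields appear to be derived from the six counts
// returned by the query loop shown earlier in this thread.
function buildResults(counts: {
  all: number;        // all agents, including inactive
  allActive: number;  // all active agents
  online: number;
  error: number;
  offline: number;
  updating: number;
}) {
  const { all, allActive, online, error, offline, updating } = counts;
  return {
    total: allActive,                      // "total stands for all active"
    inactive: all - allActive,             // assumption: inactive = everything minus active
    online,
    error,
    offline,
    updating,
    other: all - online - error - offline, // includes inactive and updating
  };
}
// Because each count comes from its own query against a live index, an agent whose status
// flips between two queries can be counted in two buckets (or neither), which is how
// `other` and `inactive` can come out negative, as in the examples above.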
@joshdover @jen-huang do you recall what is the use case for other status?

@pjbertels
Author

Thanks for clarifying how the counts work. In the case above, I got stuck in that state and timed out waiting for it to converge, so maybe it's a slightly different issue. Still, it seems like a variation of the same problem ... can total be 25020 when only 25000 drones are involved?

@juliaElastic
Contributor

Thanks for clarifying how the counts work. In the case above, I got stuck in that state and timed out waiting for it to converge, so maybe it's a slightly different issue. Still, it seems like a variation of the same problem ... can total be 25020 when only 25000 drones are involved?

Oh I see, I missed that the total was greater than the number of agents actually enrolled. Yes, this seems like a bug, though I do not have any clue yet what could cause it.

@jen-huang
Contributor

@joshdover @jen-huang do you recall what is the use case for other status?

I do not recall this, maybe @nchaulet knows?

@nchaulet
Member

I do not recall this, maybe @nchaulet knows?

I think other was used by the security solution at some point, but looking at the code, it looks like that is no longer the case.
