
/api/fleet/agent_status provides incorrect counts #134798

Closed
pjbertels opened this issue Jun 21, 2022 · 12 comments · Fixed by #135816
Assignees: juliaElastic
Labels: bug, Project:FleetScaling, Team:Fleet

Comments

@pjbertels

Kibana version:
8.3.0-6e69754c

Elasticsearch version:

Server OS version:

Browser version:

Browser OS version:

Original install method (e.g. download page, yum, from source, etc.):

Describe the bug:
I use /api/fleet/agent_status daily to determine what is going on in Fleet, and the counts it returns are often wrong. If you poll /api/fleet/agent_status while doing Fleet operations, these inconsistencies show up often.
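For reference, a poll of the endpoint looks roughly like this (a sketch only; the host, API key, and exact response shape are assumptions):

// Hypothetical polling sketch; KIBANA_URL, API_KEY, and POLICY_ID are placeholders.
const KIBANA_URL = 'https://kibana.example.com:5601';
const API_KEY = '<api-key>';
const POLICY_ID = '<agent-policy-id>';

const res = await fetch(
  `${KIBANA_URL}/api/fleet/agent_status?kuery=${encodeURIComponent(`policy_id : ${POLICY_ID}`)}`,
  { headers: { Authorization: `ApiKey ${API_KEY}`, 'kbn-xsrf': 'true' } }
);
const body = await res.json();
// The counts are expected under body.results: { total, inactive, online, error, offline, updating, other, events }
console.log(body.results);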

For example, the following call for a policy with 10000 agents reports the total as 10004.
FleetAgentStatus(total=10004, inactive=0, online=9811, error=0, offline=150, updating=43, other=43, events=0, run_id=None, timestamp=None, kuery='policy_id : f2fba850-f0f7-11ec-9c99-f30f5bda23da', cluster_name=None)

Steps to reproduce:
1.
2.
3.

Expected behavior:
I expect that summing some combination of the known status fields will always match the total.
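One candidate invariant (a sketch; field names are from the status response above, and the relationship is confirmed later in the thread):

// Sketch of the expected invariant: the active status buckets should sum to total.
function countsAreConsistent(s: {
  total: number;
  online: number;
  error: number;
  offline: number;
  updating: number;
}): boolean {
  return s.online + s.error + s.offline + s.updating === s.total;
}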

Screenshots (if relevant):

Errors in browser console (if relevant):

Provide logs and/or server output (if relevant):

Any additional context:
Accurate reporting for this interface is critical to testing and debugging Fleet Server. If the REST call is reporting the wrong counts, I would expect the Kibana UI to be affected as well.

pjbertels added the bug and Team:Fleet labels on Jun 21, 2022
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@pjbertels
Author

pjbertels commented Jun 21, 2022

In this case, after my drones had been up for a while, I found that Fleet was reporting 4 drones in the updating state and 10000 healthy, but the actual number of drones on my system was 10000, plus the default agent policy, so it should have been 10001.

@pjbertels
Author

Here is an example of a cluster with 10001 agents (the default agent and 10,000 Horde drones).
[screenshot]

@pjbertels
Author

It turns out the description was missing, which is unfortunate because we had dumped the JSON from Fleet.

httpx._client:_client.py:1002 HTTP Request: PUT https://perf-tqjrz-custom.kb.us-west2.gcp.elastic-cloud.com:9243/api/fleet/package_policies/elastic-cloud-fleet-server "HTTP/1.1 400 Bad Request" {"statusCode":400,"error":"Bad Request","message":"[request body.description]: expected value of type [string] but got [null]"}

We fixed the file by adding a description. This likely resulted in another issue ...
[screenshot]

I'll retest to verify the causality.

@juliaElastic juliaElastic self-assigned this Jul 5, 2022
@pjbertels
Author

pjbertels commented Jul 6, 2022

Recent examples while trying to bring up 25,000 agents:
FleetAgentStatus(total=18473, inactive=0, online=10733, error=0, offline=7850, updating=71, other=-110, events=0,
FleetAgentStatus(total=18492, inactive=-3, online=11475, error=0, offline=7064, updating=56, other=-50, events=0,
FleetAgentStatus(total=18496, inactive=-1, online=11050, error=0, offline=6133, updating=53, other=1312, events=0
These problems happen without the Fleet server going unhealthy. So just ignore my previous comment.

@juliaElastic
Contributor

juliaElastic commented Jul 6, 2022

I looked into this, and what happens in the code is that there are queries made in a loop for different statuses:

const [all, allActive, online, error, offline, updating] = await pMap(
  [
    undefined, // All agents, including inactive
    undefined, // All active agents
    AgentStatusKueryHelper.buildKueryForOnlineAgents(),
    AgentStatusKueryHelper.buildKueryForErrorAgents(),
    AgentStatusKueryHelper.buildKueryForOfflineAgents(),
    AgentStatusKueryHelper.buildKueryForUpdatingAgents(),
  ],
  (kuery, index) =>
    getAgentsByKuery(esClient, {
      // ... (truncated)

I think the reason for the discrepancies is that the status changes quickly between healthy and offline when the last_checkin time passes 5 min, so some agents show up in both the healthy and offline buckets.
I think opening a point in time before running the status queries would help with this.
I can put up a small PR for this.
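As a minimal sketch of that approach (using the Elasticsearch JS client's point-in-time APIs; the helper name and query plumbing here are illustrative, not the actual PR):

import { Client, estypes } from '@elastic/elasticsearch';

// Sketch only: open a point in time so every status query counts against the same
// snapshot of .fleet-agents instead of racing a live index between queries.
async function countAgentsByStatus(
  esClient: Client,
  statusQueries: estypes.QueryDslQueryContainer[]
): Promise<number[]> {
  const pit = await esClient.openPointInTime({ index: '.fleet-agents', keep_alive: '1m' });
  try {
    return await Promise.all(
      statusQueries.map(async (query) => {
        const res = await esClient.search({
          size: 0,
          track_total_hits: true,
          pit: { id: pit.id },
          query,
        });
        // hits.total is either a number or a { value, relation } object depending on settings
        return typeof res.hits.total === 'number' ? res.hits.total : res.hits.total?.value ?? 0;
      })
    );
  } finally {
    await esClient.closePointInTime({ id: pit.id });
  }
}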

For the other discrepancy, where the total > enrolled agents, I haven't found a reason yet. The query just counts the documents in the .fleet-agents index, so it shouldn't return more than the enrolled agent count.

EDIT: while testing the fix, I found one occurrence of the count discrepancy after starting an upgrade of agents and then halting them.
It seems that the current filters allow an overlap between the offline and updating (or online) statuses. I'm not sure how this can happen, as the offline query excludes the updating status.

{
  "total": 3188,
  "inactive": 0,
  "online": 3122,
  "error": 0,
  "offline": 4,
  "updating": 64,
  "other": 62,
  "events": 0
}

I couldn't reproduce this problem since. What I did notice is that when I start an upgrade and then unenroll agents (with horde halt), the agents keep showing up in the updating state even after the offline timeout (5m) has passed.
I think we could change the logic here so that agents go to offline after 5m even if they would otherwise be in the updating state (the upgrade hasn't finished), because the upgrade is unlikely to ever finish if the agent was stopped.
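A rough sketch of that precedence, assuming illustrative field names on the agent document (last_checkin, upgrade_started_at, upgraded_at); this is not the actual Fleet implementation:

// Sketch only: report a stopped agent as offline even if an upgrade was started and
// never completed, since the upgrade will likely never finish once the agent is down.
const OFFLINE_TIMEOUT_MS = 5 * 60 * 1000;

type AgentDoc = { last_checkin?: string; upgrade_started_at?: string; upgraded_at?: string };

function resolveStatus(agent: AgentDoc, now = Date.now()): 'offline' | 'updating' | 'online' {
  const lastCheckin = agent.last_checkin ? Date.parse(agent.last_checkin) : 0;
  if (now - lastCheckin > OFFLINE_TIMEOUT_MS) {
    return 'offline'; // offline takes precedence over updating
  }
  if (agent.upgrade_started_at && !agent.upgraded_at) {
    return 'updating';
  }
  return 'online';
}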

[screenshot]

@pjbertels
Author

I see this one in 8.4.0 when trying to bring up 25,000.
FleetAgentStatus(total=25020, inactive=0, online=25000, error=0, offline=0, updating=20, other=20, events=0

@juliaElastic
Contributor

juliaElastic commented Jul 7, 2022

I see this one in 8.4.0 when trying to bring up 25,000. FleetAgentStatus(total=25020, inactive=0, online=25000, error=0, offline=0, updating=20, other=20, events=0

This looks good, online + updating = total (total stands for all active).
For other, there is a calculation: other = all - online - error - offline. I am not sure why we need the other status; it is confusing that it is a different subset (it doesn't include online, but it does include inactive).
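As a rough illustration of how the reported fields appear to relate, based only on the variables from the query loop above (this is not the actual Kibana source):

// Illustration only: how the response fields appear to be derived from the six counts
// returned by the query loop shown earlier in this thread.
function buildResults(counts: {
  all: number;        // all agents, including inactive
  allActive: number;  // all active agents
  online: number;
  error: number;
  offline: number;
  updating: number;
}) {
  const { all, allActive, online, error, offline, updating } = counts;
  return {
    total: allActive,                      // "total stands for all active"
    inactive: all - allActive,             // assumption: inactive = everything minus active
    online,
    error,
    offline,
    updating,
    other: all - online - error - offline, // includes inactive and updating
  };
}
// Because each count comes from its own query against a live index, an agent whose status
// flips between two queries can be counted in two buckets (or neither), which is how
// `other` and `inactive` can come out negative, as in the examples above.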
@joshdover @jen-huang do you recall what is the use case for other status?

@pjbertels
Author

Thanks for clarifying how the counts work. In the case above, I got stuck in that state and timed out waiting for it to converge, so maybe it's a slightly different issue. Still, it seems like a variation of the same problem ... can total be 25020 when only 25000 drones are involved?

@juliaElastic
Contributor

Thanks for clarifying how the counts work. In the case above, I got stuck in that state and timed out waiting for it to converge, so maybe it's a slightly different issue. Still, it seems like a variation of the same problem ... can total be 25020 when only 25000 drones are involved?

Oh I see, I missed that the total was greater than the number of agents actually enrolled. Yes, this seems like a bug, though I do not have any clue yet what could cause it.

@jen-huang
Contributor

@joshdover @jen-huang do you recall what is the use case for other status?

I do not recall this, maybe @nchaulet knows?

@nchaulet
Member

I do not recall this, maybe @nchaulet knows?

I think other was used by the security solution at some point, but looking at the code, it looks like that is no longer the case.
