[8.6] Fleet Usage telemetry extension (#145353) #146105

Merged
merged 1 commit into from
Nov 23, 2022

kibanamachine

Backport

This will backport the following commits from main to 8.6:

Questions ?

Please refer to the Backport tool documentation

## Summary

Closes elastic/ingest-dev#1261

For each requirement, I included a snippet of the telemetry I added.
Please review and let me know if any changes are needed.
Also asked a few questions below. @jlind23 @kpollich

Item 6 is blocked by an [Elasticsearch change](elastic/elasticsearch#91701) that gives
`kibana_system` the missing privilege to read `logs-elastic_agent*` indices.

Took inspiration for task versioning from
https://github.com/elastic/kibana/pull/144494/files#diff-0c7c49bf5c55c45c19e9c42d5428e99e52c3a39dd6703633f427724d36108186

- [x] 1. Elastic Agent versions
Versions of all running Elastic Agents: the `agent.version` field on
`.fleet-agents` documents

```
"agent_versions": [
    "8.6.0"
  ],
```
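
For illustration, a minimal sketch of how the distinct versions could be derived once the agent documents have been fetched. The `AgentDoc` type and the `collectAgentVersions` name are hypothetical and not taken from the Fleet code; only the `agent.version` field comes from the description above.

```typescript
// Hypothetical subset of a `.fleet-agents` document; only `agent.version` is used.
interface AgentDoc {
  agent?: { version?: string };
}

// Returns the distinct agent versions, sorted as plain strings
// (lexicographic order, not semantic-version order).
function collectAgentVersions(docs: AgentDoc[]): string[] {
  const versions = new Set<string>();
  for (const doc of docs) {
    const version = doc.agent?.version;
    if (version) {
      versions.add(version);
    }
  }
  return Array.from(versions).sort();
}
```

Documents without a version are skipped rather than reported as an empty string.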

- [x] 2. Fleet server configuration
I think we can query `.fleet-policies` for policies where some `input` has
`type: 'fleet-server'`, and also use the `Fleet Server Hosts`
settings that we define via saved objects in Fleet

```
  "fleet_server_config": {
    "policies": [
      {
        "input_config": {
          "server": {
            "limits.max_agents": 10000
          },
          "server.runtime": "gc_percent:20"
        }
      }
    ]
  }
```
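
A rough sketch of the filtering step, assuming the policy documents are already loaded and expose their inputs as an `inputs` array. That document shape, the type names, and `getFleetServerConfig` are assumptions made for this example, not the actual implementation.

```typescript
// Hypothetical, simplified policy shapes (assumption: inputs live in `inputs`).
interface PolicyInput {
  type: string;
  [key: string]: unknown;
}
interface PolicyDoc {
  inputs?: PolicyInput[];
}
interface FleetServerConfig {
  policies: Array<{ input_config: Record<string, unknown> }>;
}

// Keeps only policies that contain a `fleet-server` input and reports
// that input's settings (everything except `type`) as `input_config`.
function getFleetServerConfig(policies: PolicyDoc[]): FleetServerConfig {
  const result: FleetServerConfig = { policies: [] };
  for (const policy of policies) {
    const input = (policy.inputs ?? []).find((i) => i.type === 'fleet-server');
    if (!input) {
      continue;
    }
    const { type: _type, ...inputConfig } = input;
    result.policies.push({ input_config: inputConfig });
  }
  return result;
}
```

Only the first `fleet-server` input per policy is reported here; whether a policy can carry more than one is not covered by this sketch.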

- [x] 3. Number of policies
Count of documents in the `.fleet-policies` index

To confirm, did we mean agent policies here?

```
 "agent_policies": {
    "count": 7,
```

- [x] 4. Output type contained in those policies
Collecting this in TypeScript logic, querying from the `.fleet-policies` index.
The alternative would be to write a Painless script (because
`outputs` is an object with dynamic keys, we can't run an aggregation
on it directly).

```
"agent_policies": {
    "output_types": [
      "elasticsearch"
    ]
  }
```

Did we mean to just collect the types here, or any other info? e.g.
output urls
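
To show why the dynamic keys are easy to handle in application code, here is a small sketch that walks the `outputs` object of each policy and collects the distinct `type` values. The `PolicyWithOutputs` shape and the `collectOutputTypes` name are hypothetical.

```typescript
// Hypothetical shapes: `outputs` is keyed by arbitrary output ids.
interface PolicyOutput {
  type?: string;
}
interface PolicyWithOutputs {
  outputs?: Record<string, PolicyOutput>;
}

// Collects the distinct output types across all policies,
// in order of first appearance.
function collectOutputTypes(policies: PolicyWithOutputs[]): string[] {
  const types = new Set<string>();
  for (const policy of policies) {
    for (const output of Object.values(policy.outputs ?? {})) {
      if (output.type) {
        types.add(output.type);
      }
    }
  }
  return Array.from(types);
}
```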

- [x] 5. Average number of checkin failures
We only have the most recent checkin status and timestamp on
`.fleet-agents`.

Do we mean here to publish the total last checkin failure count? E.g. 3
if 3 agents are in failure checkin status currently.
Or do we mean to publish specific info for all agents
(`last_checkin_status`, `last_checkin` time, `last_checkin_message`)?
Are `error` and `degraded` the only statuses that we want to send?

```
  "agent_last_checkin_status": {
    "error": 0,
    "degraded": 0
  },
```
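
The first interpretation (a count of agents per failing status) could be sketched as below. This assumes the agent documents are already in memory and that the status is stored in `last_checkin_status` with the literal values `error` and `degraded`; the type and function names are made up for the example.

```typescript
// Hypothetical subset of a `.fleet-agents` document.
interface AgentCheckinDoc {
  last_checkin_status?: string;
}
interface CheckinStatusCounts {
  error: number;
  degraded: number;
}

// Counts agents whose most recent checkin status is `error` or `degraded`.
// Any other status (or a missing one) is ignored.
function countLastCheckinStatus(agents: AgentCheckinDoc[]): CheckinStatusCounts {
  const counts: CheckinStatusCounts = { error: 0, degraded: 0 };
  for (const agent of agents) {
    const status = agent.last_checkin_status;
    if (status === 'error' || status === 'degraded') {
      counts[status] += 1;
    }
  }
  return counts;
}
```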

- [ ] 6. Top 3 most common errors in the Elastic Agent logs

Do we mean here elastic-agent logs only, or fleet-server logs as well
(maybe separately)?

I found an alternative way to query the `message` field, using the `sampler` and
`categorize_text` aggregations:
```
GET logs-elastic_agent*/_search
{
    "size": 0,
    "query": {
        "bool": {
            "must": [
                {
                    "term": {
                        "log.level": "error"
                    }
                },
                {
                    "range": {
                        "@timestamp": {
                            "gte": "now-1h"
                        }
                    }
                }
            ]
        }
    },
    "aggregations": {
        "message_sample": {
            "sampler": {
                "shard_size": 200
            },
            "aggs": {
                "categories": {
                    "categorize_text": {
                        "field": "message",
                        "size": 10
                    }
                }
            }
        }
    }
}
```
Example response:
```
"aggregations": {
    "message_sample": {
      "doc_count": 112,
      "categories": {
        "buckets": [
          {
            "doc_count": 73,
            "key": "failed to unenroll offline agents",
            "regex": ".*?failed.+?to.+?unenroll.+?offline.+?agents.*?",
            "max_matching_length": 36
          },
          {
            "doc_count": 7,
            "key": """stderr panic close of closed channel n ngoroutine running Stop ngithub.com/elastic/beats/v7/libbeat/cmd/instance Beat launch.func5 \n\t/go/src/github.com/elastic/beats/libbeat/cmd/instance/beat.go n
```
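
Given buckets like the ones in the response above, picking the top 3 could look like the following sketch. The `CategoryBucket` type only models the two fields used here, and `topErrorCategories` is a hypothetical helper name.

```typescript
// Subset of a `categorize_text` bucket as shown in the example response.
interface CategoryBucket {
  key: string;
  doc_count: number;
}

// Returns the keys of the `limit` most frequent categories,
// ordered by descending document count. The input array is not mutated.
function topErrorCategories(buckets: CategoryBucket[], limit = 3): string[] {
  return buckets
    .slice()
    .sort((a, b) => b.doc_count - a.doc_count)
    .slice(0, limit)
    .map((bucket) => bucket.key);
}
```

The aggregation asks for `"size": 10`, so trimming to three happens client side in this sketch; requesting fewer buckets in the query would be another option.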

- [x] 7. Number of checkin failures over the past period of time

I think this is almost the same as item 5. The difference would be whether to report
only new failures that happened in the last hour, or to report all agents in
a failure state (which would be an increasing number if agents stay
in a failed state).
Do we want these 2 separate telemetry fields?

EDIT: removed the last-1-hour query and instead added a new field to report
agents enrolled per policy (top 10). See comments below.

```
  "agent_checkin_status": {
    "error": 3,
    "degraded": 0
  },
  "agents_per_policy": [2, 1000],
```
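
A sketch of how the per-policy counts could be computed from agent documents, assuming each agent references its policy through a `policy_id` field. The field name, the types, and `agentsPerPolicy` are assumptions for illustration, and sorting in descending order is a choice made here, not something stated in this PR.

```typescript
// Hypothetical subset of a `.fleet-agents` document.
interface EnrolledAgentDoc {
  policy_id?: string;
}

// Counts agents per policy and returns the `limit` largest counts,
// in descending order. Agents without a policy id are skipped.
function agentsPerPolicy(agents: EnrolledAgentDoc[], limit = 10): number[] {
  const counts = new Map<string, number>();
  for (const agent of agents) {
    if (!agent.policy_id) {
      continue;
    }
    counts.set(agent.policy_id, (counts.get(agent.policy_id) ?? 0) + 1);
  }
  return Array.from(counts.values())
    .sort((a, b) => b - a)
    .slice(0, limit);
}
```

Only counts are returned, so no policy ids or names end up in the telemetry payload in this sketch.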

- [x] 8. Number of Elastic Agents and number of Fleet Servers

This is already there in the existing telemetry:
```
  "agents": {
    "total_enrolled": 0,
    "healthy": 0,
    "unhealthy": 0,
    "offline": 0,
    "total_all_statuses": 1,
    "updating": 0
  },
  "fleet_server": {
    "total_enrolled": 0,
    "healthy": 0,
    "unhealthy": 0,
    "offline": 0,
    "updating": 0,
    "total_all_statuses": 0,
    "num_host_urls": 1
  },
```

### Checklist

- [ ] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios

Co-authored-by: Kibana Machine <[email protected]>
(cherry picked from commit e00e26e)
@kibanamachine kibanamachine enabled auto-merge (squash) November 23, 2022 09:27
@botelastic botelastic bot added the Team:Fleet Team label for Observability Data Collection Fleet team label Nov 23, 2022
@elasticmachine

Pinging @elastic/fleet (Team:Fleet)

@kibana-ci

💚 Build Succeeded

Metrics [docs]

Unknown metric groups

ESLint disabled in files

| id | before | after | diff |
| --- | --- | --- | --- |
| osquery | 1 | 2 | +1 |

ESLint disabled line counts

| id | before | after | diff |
| --- | --- | --- | --- |
| enterpriseSearch | 19 | 21 | +2 |
| fleet | 59 | 65 | +6 |
| osquery | 108 | 113 | +5 |
| securitySolution | 442 | 448 | +6 |
| total | | | +19 |

Total ESLint disabled count

| id | before | after | diff |
| --- | --- | --- | --- |
| enterpriseSearch | 20 | 22 | +2 |
| fleet | 68 | 74 | +6 |
| osquery | 109 | 115 | +6 |
| securitySolution | 519 | 525 | +6 |
| total | | | +20 |

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @juliaElastic

@kibanamachine kibanamachine merged commit 7b99f4c into elastic:8.6 Nov 23, 2022
Labels
backport Team:Fleet Team label for Observability Data Collection Fleet team
4 participants