[Fleet] Added logs-elastic_agent* read privileges to kibana_system #91701
Conversation
Pinging @elastic/es-security (Team:Security)
Hi @juliaElastic, I've created a changelog YAML for you.
I see some ML tests failing, which are unrelated to my changes. Any idea how to fix this? I just did a merge from main as well.
@droberts195 do you have any insight here? I see the test file changed a few hours ago here: #91694
Jenkins, test this please
Jenkins run elasticsearch-ci/part-3-fips
Ping @elastic/kibana-security
...core/src/main/java/org/elasticsearch/xpack/core/security/authz/store/ReservedRolesStore.java
@ywangd are you ready to approve?
@@ -719,6 +719,8 @@ public static RoleDescriptor kibanaSystemRoleDescriptor(String name) {
// Fleet Server indices. Kibana create this indice before Fleet Server use them.
// Fleet Server indices. Kibana read and write to this indice to manage Elastic Agents
RoleDescriptor.IndicesPrivileges.builder().indices(".fleet*").allowRestrictedIndices(true).privileges("all").build(),
// Fleet telemetry queries Agent Logs indices in kibana task runner
RoleDescriptor.IndicesPrivileges.builder().indices("logs-elastic_agent*").privileges("read").build(),
question: I see @joshdover hasn't replied on this yet in https://github.com/elastic/ingest-dev/issues/1261#issuecomment-1322223139, did you folks sync over Slack/Zoom? I just want to be sure we're all aligned, if so - the change looks good from the Kibana Security perspective.
Not yet, I'll wait for his feedback.
Great, thanks for confirming
Is there anything else you need from me to approve?
Nope, everything looks good from the Kibana perspective; feel free to merge as soon as the ES team approves the change.
@elastic/elasticsearch-team Hey, could someone review and approve?
@juliaElastic We try to advise people not to use the @elastic/elasticsearch-team handle, since it pings the whole Elasticsearch team (116 people). I see that you already notified the correct area team (Team:Security) and that Yang (who is on PTO) did the initial review. Everyone on the Security team receives notifications for every comment, so you can either simply comment on this PR or ping @elastic/es-security if there is an urgent need for someone to have a look.
From my perspective, the changes look good! 🚀 If you need this to be backported to the 8.6 branch, you can simply apply the auto-backport-and-merge label (before merging this PR) and it will automatically open a new PR and merge it (otherwise you would have to do it manually).
Another small suggestion would be to update the description of this PR (as it serves as a future reference): remove the contributor-template bullet points and the link to a private GitHub repo.
Thank you for your suggestions! I wasn't sure which team to ping, as no team was assigned as PR reviewer. I'll keep this in mind.
I've made the same mistakes myself. It's not something you could have known, so no worries. :)
## Summary

Closes elastic/ingest-dev#1261

Added a snippet to the telemetry for each requirement. Please review and let me know if any changes are needed. I also asked a few questions below. @jlind23 @kpollich

Item 6 is blocked by the [elasticsearch change](elastic/elasticsearch#91701) to give kibana_system the missing privilege to read logs-elastic_agent* indices.

Took inspiration for task versioning from https://github.com/elastic/kibana/pull/144494/files#diff-0c7c49bf5c55c45c19e9c42d5428e99e52c3a39dd6703633f427724d36108186

- [x] 1. Elastic Agent versions

Versions of all the Elastic Agents running: the `agent.version` field on `.fleet-agents` documents.

```
"agent_versions": [
    "8.6.0"
  ],
```

- [x] 2. Fleet Server configuration

I think we can query `.fleet-policies` for policies where some `input` has `type: 'fleet-server'`, as well as use the `Fleet Server Hosts` settings that we define via saved objects in Fleet.

```
"fleet_server_config": {
    "policies": [
      {
        "input_config": {
          "server": {
            "limits.max_agents": 10000
          },
          "server.runtime": "gc_percent:20"
        }
      }
    ]
  }
```

- [x] 3. Number of policies

Count of the `.fleet-policies` index.

To confirm, did we mean agent policies here?

```
"agent_policies": {
    "count": 7,
```

- [x] 4. Output type contained in those policies

Collecting this from TS logic, querying the `.fleet-policies` index. The alternative would be to write a Painless script (because `outputs` is an object with dynamic keys, we can't do an aggregation directly).

```
"agent_policies": {
    "output_types": [
      "elasticsearch"
    ]
  }
```

Did we mean to collect just the types here, or other info as well, e.g. output URLs?

- [x] 5. Average number of checkin failures

We only have the most recent checkin status and timestamp on `.fleet-agents`.

Do we mean to publish the total last-checkin failure count, e.g. 3 if 3 agents are currently in a failure checkin status? Or do we mean to publish specific info for all agents (`last_checkin_status`, `last_checkin` time, `last_checkin_message`)? Are `error` and `degraded` the only statuses we want to send?

```
"agent_last_checkin_status": {
    "error": 0,
    "degraded": 0
  },
```

- [ ] 6. Top 3 most common errors in the Elastic Agent logs

Do we mean elastic-agent logs only, or fleet-server logs as well (maybe separately)?

I found an alternative way to query the message field, using the sampler and categorize_text aggregations (a TypeScript reduction sketch follows the example response):

```
GET logs-elastic_agent*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "term": { "log.level": "error" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "aggregations": {
    "message_sample": {
      "sampler": { "shard_size": 200 },
      "aggs": {
        "categories": {
          "categorize_text": { "field": "message", "size": 10 }
        }
      }
    }
  }
}
```

Example response (truncated):

```
"aggregations": {
    "message_sample": {
      "doc_count": 112,
      "categories": {
        "buckets": [
          {
            "doc_count": 73,
            "key": "failed to unenroll offline agents",
            "regex": ".*?failed.+?to.+?unenroll.+?offline.+?agents.*?",
            "max_matching_length": 36
          },
          {
            "doc_count": 7,
            "key": """stderr panic close of closed channel n ngoroutine running Stop ngithub.com/elastic/beats/v7/libbeat/cmd/instance Beat launch.func5 \n\t/go/src/github.com/elastic/beats/libbeat/cmd/instance/beat.go n
```
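A minimal sketch, assuming the bucket shape shown in the example response above (the helper name is illustrative), of reducing the categorize_text buckets to a top-3 error list on the Kibana side:

```
interface CategorizeTextBucket {
  doc_count: number;
  key: string;
}

// Pick the top N message categories by document count.
// categorize_text buckets arrive sorted by doc_count already;
// we sort defensively before slicing.
export function topErrorMessages(buckets: CategorizeTextBucket[], n = 3): string[] {
  return [...buckets]
    .sort((a, b) => b.doc_count - a.doc_count)
    .slice(0, n)
    .map((bucket) => bucket.key);
}
```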
- [x] 7. Number of checkin failures over the past period of time

I think this is almost the same as #5. The difference would be to report only new failures that happened in the last hour, or to report all agents in a failure state (which would be an increasing number if an agent stays in the failed state). Do we want these as 2 separate telemetry fields?

EDIT: removed the last-1hr query; instead added a new field to report agents enrolled per policy (top 10). See the comments below and the sketch after this description.

```
"agent_checkin_status": {
    "error": 3,
    "degraded": 0
  },
  "agents_per_policy": [2, 1000],
```

- [x] 8. Number of Elastic Agents and number of Fleet Servers

This is already there in the existing telemetry:

```
"agents": {
    "total_enrolled": 0,
    "healthy": 0,
    "unhealthy": 0,
    "offline": 0,
    "total_all_statuses": 1,
    "updating": 0
  },
  "fleet_server": {
    "total_enrolled": 0,
    "healthy": 0,
    "unhealthy": 0,
    "offline": 0,
    "updating": 0,
    "total_all_statuses": 0,
    "num_host_urls": 1
  },
```

### Checklist

- [ ] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

Co-authored-by: Kibana Machine <[email protected]>
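As referenced in the EDIT under item 7, agents enrolled per policy (top 10) can be collected with a terms aggregation on `.fleet-agents`. A minimal sketch, assuming the `policy_id` and `active` field names (the actual Fleet implementation may differ):

```
import type { ElasticsearchClient } from '@kbn/core/server';

// Hypothetical helper: number of enrolled agents per policy, top 10 policies.
export async function getAgentsPerPolicy(esClient: ElasticsearchClient): Promise<number[]> {
  const result = await esClient.search({
    index: '.fleet-agents',
    size: 0,
    query: { term: { active: true } }, // assumption: enrolled agents are flagged active
    aggs: {
      agents_per_policy: {
        terms: { field: 'policy_id', size: 10 },
      },
    },
  });
  const agg = result.aggregations?.agents_per_policy as unknown as
    | { buckets: Array<{ doc_count: number }> }
    | undefined;
  return (agg?.buckets ?? []).map((bucket) => bucket.doc_count);
}
```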
(cherry picked from commit e00e26e)
# Backport

This will backport the following commits from `main` to `8.6`:

- [Fleet Usage telemetry extension (#145353)](#145353)

### Questions?

Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

Co-authored-by: Julia Bardi <[email protected]>
…lastic#91701)

* Added logs-elastic_agent* read privileges to kibana_system
* Update docs/changelog/91701.yaml
* added unit test
* Fixed formatting
* removed read cross cluster role
## Summary

Closes elastic/ingest-dev#1261

Merged: [elasticsearch change](elastic/elasticsearch#91701) to give kibana_system the missing privilege to read logs-elastic_agent* indices.

## Top 3 most common errors in the Elastic Agent logs

Added the most common elastic-agent and fleet-server log errors to telemetry, querying the message field with the sampler and categorize_text aggregations. This is a workaround, as we can't do an aggregation directly on the `message` field.

```
GET logs-elastic_agent*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "term": { "log.level": "error" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "aggregations": {
    "message_sample": {
      "sampler": { "shard_size": 200 },
      "aggs": {
        "categories": {
          "categorize_text": { "field": "message", "size": 10 }
        }
      }
    }
  }
}
```

Tested with the latest Elasticsearch snapshot and verified that the logs are added to telemetry:

```
{
  "agent_logs_top_errors": [
    "failed to dispatch actions error failed reloading q q q nil nil config failed reloading artifact config for composed snapshot.downloader failed to generate snapshot config failed to detect remote snapshot repo proceeding with configured not an agent uri",
    "fleet-server stderr level info time message No applicable limit for agents using default \\n level info time message No applicable limit for agents using default \\n",
    "stderr panic close of closed channel n ngoroutine running Stop"
  ],
  "fleet_server_logs_top_errors": [
    "Dispatch abort response",
    "error while closing",
    "failed to take ownership"
  ]
}
```

Did some measurements locally, and the query took only a few ms. I'll try to check with larger datasets in Elastic Agent logs too.

### Checklist

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

Co-authored-by: Kibana Machine <[email protected]>
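Putting the pieces together, here is a sketch of how the task could run the aggregation above against agent and fleet-server logs separately. The `logs-elastic_agent.fleet_server*` index pattern and the helper names are assumptions based on the description, not the actual Kibana code.

```
import type { ElasticsearchClient } from '@kbn/core/server';

// Run the sampler + categorize_text aggregation against one index pattern
// and return the top 3 error message categories.
async function fetchTopErrors(esClient: ElasticsearchClient, index: string): Promise<string[]> {
  const result = await esClient.search({
    index,
    size: 0,
    query: {
      bool: {
        must: [
          { term: { 'log.level': 'error' } },
          { range: { '@timestamp': { gte: 'now-1h' } } },
        ],
      },
    },
    aggs: {
      message_sample: {
        sampler: { shard_size: 200 },
        aggs: {
          categories: { categorize_text: { field: 'message', size: 10 } },
        },
      },
    },
  });
  const sample = result.aggregations?.message_sample as unknown as
    | { categories: { buckets: Array<{ key: string }> } }
    | undefined;
  return (sample?.categories.buckets ?? []).slice(0, 3).map((b) => b.key);
}

// Usage sketch: collect agent and fleet-server log errors separately.
export async function collectLogErrorTelemetry(esClient: ElasticsearchClient) {
  return {
    agent_logs_top_errors: await fetchTopErrors(esClient, 'logs-elastic_agent-*'),
    fleet_server_logs_top_errors: await fetchTopErrors(esClient, 'logs-elastic_agent.fleet_server*'),
  };
}
```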
Required change for Fleet telemetry to read from the Elastic Agent log indices.
elastic/kibana#146107