[Elastic Agent] The system/metrics input should report itself as degraded when it encounters a permissions error #39737

cmacknz · 2024-05-24T23:16:25Z

Requires [Elastic Agent] Allow Metricbeat metricsets to report their status to the Elastic Agent #39736

When the system/metrics input in the Elastic Agent is run as part of an unprivileged agent, it will fail to collect metrics for some processes and fail to open some of the files it uses as data sources for certain metricsets. Today these problems are only visible in Elastic Agent logs. An example from the diagnostics in elastic/elastic-agent#4647 follows below.

{"log.level":"debug","@timestamp":"2024-05-02T05:49:00.137Z","message":"Error fetching PID info for 1216, skipping: GetInfoForPid: could not get all information for PID 1216: error fetching name: OpenProcess failed for pid=1216: Access is denied.\nerror fetching status: OpenProcess failed for pid=1216: Access is denied.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"log.logger":"processes","log.origin":{"file.line":173,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidIter"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}

Use the work done in #39736 to set the input to degraded when it encounters a permissions error like the one above attempting to read data for a metricset.
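
A minimal sketch of what this could look like in Go. The `StatusReporter` interface, the `Status` values, and the helper names are hypothetical stand-ins for whatever #39736 ends up exposing, not the actual Beats API. The only real library call relied on is `errors.Is(err, fs.ErrPermission)` to recognize a wrapped permission error.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
)

// Status is a hypothetical input health state; the names mirror the
// healthy/degraded states discussed in this issue.
type Status int

const (
	Healthy Status = iota
	Degraded
)

// StatusReporter is an assumed stand-in for the reporting hook from #39736.
type StatusReporter interface {
	UpdateStatus(s Status, msg string)
}

// recorder is a trivial StatusReporter that remembers the last update.
type recorder struct {
	status Status
	msg    string
}

func (r *recorder) UpdateStatus(s Status, msg string) { r.status, r.msg = s, msg }

// reportFetchResult marks the input degraded for permission errors and
// healthy on success. Other errors leave the reported status unchanged.
func reportFetchResult(r StatusReporter, metricset string, err error) {
	switch {
	case err == nil:
		r.UpdateStatus(Healthy, "")
	case errors.Is(err, fs.ErrPermission):
		r.UpdateStatus(Degraded, fmt.Sprintf("%s: permission error: %v", metricset, err))
	}
}

// simulatedDenied builds a wrapped permission error for demonstration only.
func simulatedDenied(pid int) error {
	return fmt.Errorf("OpenProcess failed for pid=%d: %w", pid, fs.ErrPermission)
}

func main() {
	r := &recorder{}
	reportFetchResult(r, "process", simulatedDenied(1216))
	fmt.Println(r.status == Degraded, r.msg)
}
```

Leaving non-permission errors out of the status update is a design choice in this sketch, so that transient fetch failures do not flip the input state.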


elasticmachine · 2024-05-24T23:16:26Z

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

cmacknz · 2024-05-24T23:25:25Z

One concern I have about this input is that we have seen failures to read permissions outside of the unprivileged agent use case, for example we were unable to read data from endpoint-security due to it running as a protected process on Windows.

We need to be careful we do not create a plague of degraded agents for benign or known errors that can't be fixed. We may need to make the reporting for this input optional, perhaps on a per metricset basis.

nimarezainia · 2024-05-27T01:22:07Z

> One concern I have about this input is that we have seen failures to read permissions outside of the unprivileged agent use case, for example we were unable to read data from endpoint-security due to it running as a protected process on Windows.
>
> We need to be careful we do not create a plague of degraded agents for benign or known errors that can't be fixed. We may need to make the reporting for this input optional, perhaps on a per metricset basis.

For benign I agree, but for known errors - I think these should be reported. Otherwise, we are showing that the agent is healthy but in actual fact there is an error.

cmacknz · 2024-05-27T14:03:09Z

Agree we should show the error but I think we'll want to be able to disable certain types of errors to prevent the system integration from making every agent degraded by default for weeks or months depending on where in the release schedule our fix lands.

In general not being able to read a system metricset or access a particular PID is worth reporting, but once known I don't think the agent needs to be reported as unhealthy continuously as this will make other, potentially more serious errors harder to notice.

For a recent example (that is now fixed), every agent with defend+system installed on Windows would have been reported as degraded permanently as the system integration failed to read information from Defend's PID. This is important to know, but doesn't need to be continuously flagged to the user for every agent they have once known.

nimarezainia · 2024-05-28T01:23:44Z

Is there a way we could identify and then throttle these continuous errors? Say, after the 10th error received, we could flag it at the agent/Fleet level for investigation but revert the agent to healthy? We would still know that there are persistent errors, but they may not necessarily be a reason for an agent degradation warning.

cmacknz · 2024-05-28T15:17:55Z

There are a few ways to approach this. One would be a configuration option that keeps the errors in the logs, but filters them from marking the agent as degraded in Fleet. I think this is reasonable and can be in scope for this issue.

We could additionally rate limit the error messages themselves, or perhaps log a periodic summary error for all metricsets that encountered permissions errors in a given interval. So rather than 10 metricsets generating 10 individual permissions error log lines, we write 1 log line that includes the 10 affected metricsets. If we want this, I think this needs to be a separate implementation issue.
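
To illustrate the periodic summary idea, here is a rough Go sketch. The `permissionSummary` type and its methods are hypothetical and not part of Metricbeat; the sketch only shows how several per-metricset permission errors could collapse into one log line per interval.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// permissionSummary collects the names of metricsets that hit permission
// errors so that one summary line can be logged per interval.
type permissionSummary struct {
	mu       sync.Mutex
	affected map[string]int // metricset name -> error count in this interval
}

func newPermissionSummary() *permissionSummary {
	return &permissionSummary{affected: make(map[string]int)}
}

// Record notes one permission error for the named metricset.
func (p *permissionSummary) Record(metricset string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.affected[metricset]++
}

// Flush returns a single summary line and resets the interval.
// It returns an empty string when nothing was recorded.
func (p *permissionSummary) Flush() string {
	p.mu.Lock()
	defer p.mu.Unlock()
	if len(p.affected) == 0 {
		return ""
	}
	names := make([]string, 0, len(p.affected))
	for name := range p.affected {
		names = append(names, name)
	}
	sort.Strings(names) // stable output regardless of map iteration order
	p.affected = make(map[string]int)
	return fmt.Sprintf("permission errors in %d metricsets: %s",
		len(names), strings.Join(names, ", "))
}

func main() {
	s := newPermissionSummary()
	for _, ms := range []string{"process", "diskio", "process"} {
		s.Record(ms)
	}
	// A real implementation would call Flush from a time.Ticker loop.
	fmt.Println(s.Flush())
}
```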

lucabelluccini · 2024-07-19T14:59:32Z

Please let's think about supportability.

If we enable debug logging, some of those errors might be buried under other errors. The "observability" window in the Elastic Agent diagnostic bundle is limited, especially when using a lot of integrations, so we might miss them.
IMHO we should take an approach similar to input_metrics, but with more "details" on the failures. Right now we have just in/out/failures for most inputs, but we do not know the reason for the failures, and metrics harvesting lacks input_metrics.
The failures/successes in harvesting data and metrics should have an associated observable metric in the Elastic Agent diagnostic, and we should not rely only on logs.
Thinking of serverless, logging efficiency is important and, again, we should be able to "know what's going on" without relying strongly on logs.

Looking at #40025 it might address all the points above.

TL;DR:

We need to be able to understand what's going on from an Elastic Agent Diagnostic
Users need to be able to understand what's going on from the Elastic Agent page without the need of collecting the logs or generating diagnostics

I wonder if this opens the door to introducing a new Elastic Agent state like degraded or limited...
E.g. A user can "acknowledge" an EA being limited due to permissions when the integration runs unprivileged.

cmacknz · 2024-07-19T20:25:33Z

We are using the agent input health reporting for this. There will be a single info level log when the state changes from healthy to degraded (this state already exists), and the input state will be visible in the state.yaml in diagnostics and directly in the Fleet UI. Permissions errors are likely to be permanent, so this avoids pointless repetitive logging.

Sharing a screenshot from @VihasMakwana on what it looks like with the various in progress PRs assembled into the product:

lucabelluccini · 2024-08-05T17:52:46Z

Thank you @cmacknz for the additional context.

Will we expand this to Filebeat and other beats?
Example: Filebeat is unable to read one file path. Can it report degraded with the error & the path?

pierrehilbert · 2024-08-06T06:58:44Z

Hey Luca,
This is already done for other Beats, more details in this meta issue: #39604

lucabelluccini · 2024-08-06T09:53:15Z

Thanks Pierre, I literally was reading it but my eyes were looking for Filebeat while I should have looked for the filestream, log, and winlog inputs and so on... Sorry for the useless question.

In general, if other input types (e.g. aws-s3 and other security owned inputs) need to report such detailed status, do we need to tell them to implement something?
Let's say we want to report unhealthy if the aws-s3 input gets "invalid credentials". Is that already covered, or would we need some change?

cmacknz added the Team:Elastic-Agent-Data-Plane label May 24, 2024

This was referenced May 24, 2024

[Meta] Enhance input Health reporting from agent to better convey issues related to installation of unprivileged agent #39604

Closed

[Elastic Agent] Allow Metricbeat metricsets to report their status to the Elastic Agent #39736

Closed

pierrehilbert assigned VihasMakwana Jun 10, 2024

pierrehilbert mentioned this issue Jun 21, 2024

[Windows] - system.diskio datastream missing on Kibana for unprivileged mode. elastic/elastic-agent#4982

Closed

VihasMakwana mentioned this issue Jun 26, 2024

[metricbeat] - Allow metricsets to report their status via v2 protocol #40025

Merged

6 tasks

VihasMakwana closed this as completed in #40025 Jul 24, 2024

mergify bot mentioned this issue Jul 24, 2024

[8.15](backport #40025) [metricbeat] - Allow metricsets to report their status via v2 protocol #40327

Closed

6 tasks

pierrehilbert reopened this Jul 25, 2024

VihasMakwana mentioned this issue Jul 31, 2024

[metricbeat] - Allow metricsets to report their status via v2 protocol #40400

Merged

6 tasks

VihasMakwana closed this as completed in #40400 Aug 6, 2024

mergify bot mentioned this issue Aug 6, 2024

[8.15](backport #40400) [metricbeat] - Allow metricsets to report their status via v2 protocol #40443

Closed

6 tasks
