[metric/system/process] - return errors encountered while monitoring set of processes #164

Closed
VihasMakwana opened this issue Jul 17, 2024 · 3 comments · Fixed by #166 or #172

@VihasMakwana
Contributor

  • While monitoring a set of processes, we don't return the errors we encounter to the caller; we only log them at debug level.
  • Examples of such logs:
{"log.level":"debug","@timestamp":"2024-07-16T21:04:30.148Z","message":"Non-fatal error fetching PID metrics for 30358, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30358/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:04:40.143Z","message":"Non-fatal error fetching PID metrics for 30357, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30357/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:05:50.143Z","message":"Non-fatal error fetching PID metrics for 30358, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30358/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:05:50.143Z","message":"Error fetching PID info for 30642, skipping: GetInfoForPid: no such process","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"processes","log.origin":{"file.line":198,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidIter"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:06:10.055Z","message":"Non-fatal error fetching PID metrics for 30356, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30356/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:06:10.056Z","message":"Non-fatal error fetching PID metrics for 30357, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30357/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:06:10.056Z","message":"Non-fatal error fetching PID metrics for 30358, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30358/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"ecs.version":"1.6.0","log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:06:10.056Z","message":"Non-fatal error fetching PID metrics for 30649, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30649/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:06:10.056Z","message":"Non-fatal error fetching PID metrics for 30650, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30650/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:06:10.056Z","message":"Non-fatal error fetching PID metrics for 30651, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30651/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:06:10.056Z","message":"Non-fatal error fetching PID metrics for 30652, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30652/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:06:10.056Z","message":"Non-fatal error fetching PID metrics for 30670, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30670/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"processes","ecs.version":"1.6.0"}
  • Such logs capture important information such as "permission denied", "partial metrics", etc.
  • With the recent introduction of the status reporter for metricsets, it is impossible to change the status to degraded if such errors are not passed to the caller.

Proposed Solution

  • Use multierr to combine the important errors, pass the combined error to the caller along with the metrics, and let the caller decide what to do.

Please share your thoughts on this!

cc: @cmacknz @pierrehilbert @jlind23

@VihasMakwana VihasMakwana added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Jul 17, 2024
@jlind23
Contributor

jlind23 commented Jul 17, 2024

@VihasMakwana once this is implemented I believe it will work right away without any more changes required on the Beats side thanks to elastic/beats#40025 (review) right?

@VihasMakwana
Contributor Author

> @VihasMakwana once this is implemented I believe it will work right away without any more changes required on the Beats side thanks to elastic/beats#40025 (review) right?

We will need a small tweak here
https://github.com/elastic/beats/blob/c00345ffc1cfea63f1ff46e6af981e0a0a19adf0/metricbeat/module/system/process/process.go#L113-L116
Apart from this, nothing else is required.

@jlind23
Contributor

jlind23 commented Jul 17, 2024

Well I guess we will have to create a new release of the elastic-agent-system-metrics but that was already part of the plan.

@VihasMakwana VihasMakwana self-assigned this Jul 17, 2024
@ycombinator ycombinator added the enhancement New feature or request label Jul 18, 2024
VihasMakwana added a commit that referenced this issue Aug 1, 2024
## What does this PR do?

- Previously, we weren't passing errors to the caller while monitoring
a set of processes.
- With the recent introduction of the status reporter for metricsets, it
is impossible to change the status to degraded if such errors are not
passed to the caller.
- Fix this by passing errors to the caller. We also populate the
process-related information on a best-effort basis.

## Checklist

- [x] My code follows the style guidelines of this project
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have added tests that prove my fix is effective or that my
feature works
- [x] I have added an entry in `CHANGELOG.md`

## Manual testing and general information

- See elastic/beats#40400 for testing it on
`metricbeat`
**NOTE**:
   - **Only applicable if you're using the `system/process` module**
   - Non-fatal errors are only received when you have insufficient
privileges.

Steps:
   - When you receive an error, check what kind of error it is:
   - Call `errors.Is(err, NonFatalErr{})` on the received error.
      - If true, the error is non-fatal and you can proceed (metrics will
be partially available, most probably because of insufficient privileges).
      - Else, log the error and stop execution (metrics will be empty).

General info related to the changes in this PR:
- While getting process-related information, you might also receive a
non-nil error.
   - Such errors come in two flavours:
       - Fatal errors:
          - This indicates that the error was fatal (e.g. `no process found`).
          - The caller should stop further execution if it receives a fatal error.
       - Non-fatal errors:
          - This indicates that the error was non-fatal (e.g. `not enough
privileges`).
          - It means that the metrics are partially filled.
          - Further execution can continue if non-fatal errors are encountered.

- Closes #164