Increment gather_errors for all errors emitted by inputs #2339
README.md updated (if adding a new plugin)

I'm not sure if the change made here is the correct way to go about this, but I figured a PR would be a better way to start a discussion than an issue or groups post.
Most Telegraf input plugins don't currently seem to provide a metric that can be used to determine if the gather operation is running successfully. For example, the prometheus input plugin logs an error if given a target address that returns HTTP 401, but won't return any metrics. That makes it difficult to tell whether a particular Prometheus client is "up".
I'd thought of deploying http_response inputs alongside httpjson/prometheus inputs, but that's a fair bit of extra configuration and doesn't handle endpoints that return 200 but with invalid data, etc. Adding a new metric to the prometheus input was another option but it looks like some other plugins behave similarly and it would be nice to have a more generic solution.
Telegraf 1.2 added the internal plugin, which exposes an `internal_agent_gather_errors` metric. That seems like a reasonable thing to monitor; however, as far as I can tell, it's only incremented by the SNMP plugin. This PR aims to increment the metric whenever any input emits an error. This won't catch all errors, as quite a number of plugins handle and log errors internally. I can update those too, but that's probably best done in a separate PR.
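
To make the idea concrete, here is a minimal, self-contained sketch of the behaviour described above. It is not the actual Telegraf code: `errorCounter`, `accumulator`, `gatherer`, `failingInput`, and `runGather` are hypothetical stand-ins, and the real agent and accumulator types may look quite different. The only point is that the counter is bumped in one central place, so every error an input emits is counted regardless of which plugin produced it.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"sync/atomic"
)

// errorCounter is a hypothetical stand-in for the counter behind
// internal_agent_gather_errors. It is not Telegraf's real stat type.
type errorCounter struct {
	n int64
}

func (c *errorCounter) Incr()      { atomic.AddInt64(&c.n, 1) }
func (c *errorCounter) Get() int64 { return atomic.LoadInt64(&c.n) }

// accumulator is a simplified, hypothetical accumulator that only
// models error handling.
type accumulator struct {
	gatherErrors *errorCounter
}

// AddError counts and logs any non-nil error emitted by an input.
// Doing the increment here means individual plugins need no changes.
func (a *accumulator) AddError(err error) {
	if err == nil {
		return
	}
	a.gatherErrors.Incr()
	log.Printf("E! Error in plugin: %v", err)
}

// gatherer is a hypothetical minimal input interface.
type gatherer interface {
	Gather(acc *accumulator) error
}

// failingInput simulates an input whose target rejects the request,
// like the HTTP 401 case described above.
type failingInput struct{}

func (failingInput) Gather(acc *accumulator) error {
	return errors.New("target returned HTTP 401")
}

// runGather runs one gather cycle and routes a returned error
// through the accumulator so that it is counted.
func runGather(in gatherer, acc *accumulator) {
	if err := in.Gather(acc); err != nil {
		acc.AddError(err)
	}
}

func main() {
	acc := &accumulator{gatherErrors: &errorCounter{}}
	runGather(failingInput{}, acc)
	fmt.Println("gather_errors:", acc.gatherErrors.Get())
}
```

Plugins that log errors internally instead of returning or emitting them would still bypass a counter like this, which is the limitation noted above.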