Proposal: Intelligent Health Monitoring for the Collector #2573

PettitWesley · 2021-03-01T22:52:07Z

Our goal is to make it easier to “observe the observers”

Currently the collector has a health extension: https://github.com/open-telemetry/opentelemetry-collector/tree/main/extension/healthcheckextension

However it is very rudimentary. We want to build a more sophisticated health extension that will emit a health status based on whether or not the collector is functioning properly and achieving the user’s end goal.

Principles/Goals

Intelligent: The health extension should be “smart”, it should emit healthy or healthy based on whether or not the collector is “working as expected”. The definition of “working as expected” is fuzzy; our goal is the health extension to be a best effort tool. It may not always show unhealthy when the user would want it to, but it should cover many useful cases. For example, a user might want to the collector to mark itself as unhealthy if the following happens:
- An exporter can not send data due to permissions, networking, or some configuration issue.
- A receiver has not collected any data
- A processor or any component has emitted X failures or has not returned success for Y minutes.
Configurable: The user should have control over which sorts of failures lead to an unhealthy status.

Minimum Viable Solution: Health Extension that integrates with the plugin framework

Our hope is that we could build an extension which simply integrates with the core of the collector. A user can configure criteria and then the collector will be marked as healthy or unhealthy. For example, if any exporter fails X retries then that could be an unhealthy condition.

Presumably this should be easy. The core code of the collector must track errors returned by exporters.

The health monitor must be sophisticated and store state on which components have failed. For example consider the following:

Exporter test CLA #1 fails retries => collector marked unhealthy
Exporter Create CODEOWNERS #2 fails retries as well => collector still unhealthy
Exporter test CLA #1 recovers => collector still unhealthy because exporter Create CODEOWNERS #2 is still failing
Exporter Create CODEOWNERS #2 recovers => collector now back to healthy

Nice-to-have Proposal 1: Allow plugins to interact with the health monitor

We are trying to figure out if the solution outlined above is sufficient. What if each plugin could directly interact with the health monitor and set the health status as needed?

For example, say could add the following line of code in an error case that you know would be unrecoverable/critical for your users:

monitor.SetUnhealthy()

At the moment, we can not think of any situations which require this... if the health system can monitor errors from components that should be sufficient criteria to mark it as healthy/unhealthy.

Long term work? Health Monitoring for Collector and SDK

This is just an idea that I want to throw out in case its good- could we monitor the health of the SDK + Collector as a system?

For example, may be the SDK and collector can establish a heart beat to verify their connection. And if that breaks, then the collector could be configured to mark itself unhealthy. This covers you in cases where the SDK isn’t sending any data because there’s nothing to send; the heart beat lets you know that this is safe/expected, rather than indicating a problem.

The text was updated successfully, but these errors were encountered:

PettitWesley · 2021-03-01T22:52:53Z

CC @JohnWu20 who will be working on this with me

bogdandrutu · 2021-03-02T00:51:37Z

Couple of questions:

Who is the consumer of this? Human or the machine?
What would be the protocol to expose this? Would a health per component be enough?

PettitWesley · 2021-03-02T00:59:10Z

@bogdandrutu I'm thinking the consumer would be the same as the existing health check extension. As in, it can just be an http endpoint on the collector that returns 200 or not to show the status. Just that now the criteria for healthy is more sophisticated. You can use it for your container health check. And then the health status can be consumed by humans or machines.

I think the current health extension is almost entirely useless/meaningless... the idea is that instead of some simple and relatively opaque "healthy means the collector is 'alive'", users can instead configure a custom health check criteria such that "collector is healthy" == data is being sent to my destination.

Would a health per component be enough?

I think the code probably needs to track the health of each component. The user should be exposed with a single collector health status I think. It might also be useful to let the health system log a message every time a components health changes just in case users are interested in diving deeper.

bogdandrutu · 2021-03-02T01:10:06Z

users can instead configure a custom health check criteria such that "collector is healthy" == data is being sent to my destination.

What are the available input data for this conditions?

FYI: Trying to better understand what we need to build in order to figure out what code changes we need :)

PettitWesley · 2021-03-02T01:36:11Z

@bogdandrutu I don't know the core functionality of the collector... but the core of the code must track errors from each plugin?

I came to OT from Fluent Bit, it has an "engine" which manages the plugins and sends them data and tracks their success/error/retry status for processing that data. At least for processors/exporters that's how it works. For receivers I'm a bit fuzzy on how that works, but you could definitely track if a receiver has sent you any data.

With those data inputs, you can build a health criteria. You can say for example, that for me healthy means OTLP receiver is uptaking data and the AWS X-Ray exporter is not returning errors for sends. (The receiver part needs some sort of timeout for how long not receiving data is "bad". Don't focus too much on the receiver use case actually... my thinking is that this is mainly useful for exporters).

bogdandrutu · 2021-03-02T02:06:12Z

but the core of the code must track errors from each plugin?

Yes I am thinking about something in these lines, expand Component interface to expose a getHealth() functionality which every component can implement.

We do have a "service" which manages the lifetime of all the "Components" so if every component exposes the health we can then make "Service" expose this smart health status.

But I am also thinking how a helper could be implemented to allow components to not have to implement all these "customizations".

PettitWesley · 2021-03-02T08:18:03Z

@bogdandrutu I have added this on the agenda for the next collector SIG meeting.

expand Component interface to expose a getHealth() functionality which every component can implement.

That might be ideal to allow each component to implement its own logic for setting its health... but I think there is also a way to implement it that does not require any component to implement anything new.

Again I don't know the collector internals deeply enough, but I saw here: https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter#data-ownership

Each exporter implements ConsumeTraces/ConsumeMetrics/ConsumeLogs- AFAICT each of those functions returns an error- does that error indicate whether they succeeded or not? If it does, then I presume there's some central code that manages calling those functions for each component as needed- could the health system integrate there and just track errors returned by components?

PettitWesley · 2021-03-02T20:05:22Z

Actually we'll probably have to discuss this at next week's Collector SIG, just realized I have a conflict tomorrow morning.

bogdandrutu · 2021-03-02T22:15:43Z

@PettitWesley for a rough idea, look into this packet https://github.com/open-telemetry/opentelemetry-collector/tree/main/obsreport we called this from every component.

PettitWesley · 2021-04-08T21:14:41Z

todo: Need to check out the zpages extension, it has much of the data we need, may be this health check is an enhancement to it?

https://github.com/open-telemetry/opentelemetry-collector/tree/main/extension/zpagesextension

PettitWesley · 2021-05-26T16:59:07Z

@bogdandrutu Please send us the details when you get a chance on how to use obsreport as discussed in the SIG:

Get the number of successes/failure calls for a exporter, not the number of metrics/traces/logs that failed/succeeded to be sent.
Configure an exporter in opencensus to consume these metrics in the code

bogdandrutu · 2021-05-26T17:03:14Z

See this about adding views #3039. Also see https://github.com/census-instrumentation/opencensus-go#views

What you need to do is to register a view for the "send_failed_spans" measure with the aggregation Count instead of Sum.

bogdandrutu · 2021-05-26T17:06:17Z

For exporter, see the example https://github.com/census-instrumentation/opencensus-go/blob/master/examples/gauges/gauge.go#L150 you can implement that interface and keep values in memory instead of in the logs.

rakyll · 2021-06-03T20:37:16Z

A naive comment about health check without much context about the past discussions.

When we think about health, we usually think about the case where the collector process is available and ready to accept incoming requests. We should tackle #3002 in order to determine whether collector is ready to "accept" requests at the OTLP endpoints.

@PettitWesley I can help with the OC views and export if you need a hand.

skyduo · 2021-06-14T17:58:33Z

Hey @bogdandrutu , we have a design doc for the health check for OT and a few questions include in it, may need you to take a look and approve the design. https://docs.google.com/document/d/1SpUMsWA2DeaoVazeQ8uEc1Wvu5LphmQU_TjzLmuJ4QM/edit#

skyduo · 2021-06-30T23:27:06Z

Hi @bogdandrutu ! As we want to add the awsecshealthcheckextension to the opentelemetry-collector-contrib repo due to this is AWS ECS specific extension, there is a issue that the obsreportconfig is located inside the internal package in the opentelemetry-collector, so i cannot get the obsreportconfig/obsmetrics which is need to build our own view, we have some options like:

could we move the obsreportconfig out of the internal folder so that we can directly use it? if you agree with this, it will be easy for us to implement
if we cannot directly use it, then we have to rewrite a lot of duplicate code in the obsreportconfig, which i don't think this pr is good to merge if write a lot of code same as the existing one.

so could you help us on that?

bogdandrutu · 2021-07-09T23:29:43Z

could we move the obsreportconfig out of the internal folder so that we can directly use it? if you agree with this, it will be easy for us to implement

Let's craft a small PR and add the views that you need in core, see how large that change is. Is that enough? Would that resolve your issue?

skyduo · 2021-07-10T01:15:17Z

Hey @bogdandrutu ! do you mean add our new view into the obsreportconfig in core? i'm not sure if my understand is correct but seems like the new view will still located in the internal folder, right? we still cannot access to it

jrcamp added enhancement New feature or request priority:p3 Lowest labels Mar 10, 2021

jrcamp assigned PettitWesley Mar 10, 2021

jrcamp added the need more info label Mar 10, 2021

skyduo mentioned this issue Jul 13, 2021

Create new AWS ECS health check extension. #3614

Closed

This was referenced Aug 5, 2021

Add new view for AWS ECS health check extension. #3776

Merged

Create a new AWS ECS Health Check Extension open-telemetry/opentelemetry-collector-contrib#4451

Closed

skyduo mentioned this issue Sep 7, 2021

New ECS health check extension open-telemetry/opentelemetry-collector-contrib#5144

Closed

skyduo mentioned this issue Oct 5, 2021

Modify health check extension to add new metrics health check functions open-telemetry/opentelemetry-collector-contrib#5643

Merged

hughesjj pushed a commit to hughesjj/opentelemetry-collector that referenced this issue Apr 27, 2023

Pin knife to 17.10.0 to fix chef release (open-telemetry#2573)

566b303

github-actions bot added the Stale label Jul 10, 2023

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Aug 10, 2023

mx-psi added waiting-for-author and removed need more info labels Sep 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal: Intelligent Health Monitoring for the Collector #2573

Proposal: Intelligent Health Monitoring for the Collector #2573

PettitWesley commented Mar 1, 2021 •

edited

Loading

PettitWesley commented Mar 1, 2021

bogdandrutu commented Mar 2, 2021

PettitWesley commented Mar 2, 2021

bogdandrutu commented Mar 2, 2021

PettitWesley commented Mar 2, 2021

bogdandrutu commented Mar 2, 2021

PettitWesley commented Mar 2, 2021

PettitWesley commented Mar 2, 2021

bogdandrutu commented Mar 2, 2021

PettitWesley commented Apr 8, 2021 •

edited

Loading

PettitWesley commented May 26, 2021

bogdandrutu commented May 26, 2021 •

edited

Loading

bogdandrutu commented May 26, 2021

rakyll commented Jun 3, 2021

skyduo commented Jun 14, 2021 •

edited

Loading

skyduo commented Jun 30, 2021

bogdandrutu commented Jul 9, 2021

skyduo commented Jul 10, 2021

Proposal: Intelligent Health Monitoring for the Collector #2573

Proposal: Intelligent Health Monitoring for the Collector #2573

Comments

PettitWesley commented Mar 1, 2021 • edited Loading

Principles/Goals

Minimum Viable Solution: Health Extension that integrates with the plugin framework

Nice-to-have Proposal 1: Allow plugins to interact with the health monitor

Long term work? Health Monitoring for Collector and SDK

PettitWesley commented Mar 1, 2021

bogdandrutu commented Mar 2, 2021

PettitWesley commented Mar 2, 2021

bogdandrutu commented Mar 2, 2021

PettitWesley commented Mar 2, 2021

bogdandrutu commented Mar 2, 2021

PettitWesley commented Mar 2, 2021

PettitWesley commented Mar 2, 2021

bogdandrutu commented Mar 2, 2021

PettitWesley commented Apr 8, 2021 • edited Loading

PettitWesley commented May 26, 2021

bogdandrutu commented May 26, 2021 • edited Loading

bogdandrutu commented May 26, 2021

rakyll commented Jun 3, 2021

skyduo commented Jun 14, 2021 • edited Loading

skyduo commented Jun 30, 2021

bogdandrutu commented Jul 9, 2021

skyduo commented Jul 10, 2021

PettitWesley commented Mar 1, 2021 •

edited

Loading

PettitWesley commented Apr 8, 2021 •

edited

Loading

bogdandrutu commented May 26, 2021 •

edited

Loading

skyduo commented Jun 14, 2021 •

edited

Loading