Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Intelligent Health Monitoring for the Collector #2573

Closed
PettitWesley opened this issue Mar 1, 2021 · 18 comments
Closed

Proposal: Intelligent Health Monitoring for the Collector #2573

PettitWesley opened this issue Mar 1, 2021 · 18 comments
Assignees

Comments

@PettitWesley
Copy link

PettitWesley commented Mar 1, 2021

Our goal is to make it easier to “observe the observers”

Currently the collector has a health extension: https://github.com/open-telemetry/opentelemetry-collector/tree/main/extension/healthcheckextension

However it is very rudimentary. We want to build a more sophisticated health extension that will emit a health status based on whether or not the collector is functioning properly and achieving the user’s end goal.

Principles/Goals

  • Intelligent: The health extension should be “smart”, it should emit healthy or healthy based on whether or not the collector is “working as expected”. The definition of “working as expected” is fuzzy; our goal is the health extension to be a best effort tool. It may not always show unhealthy when the user would want it to, but it should cover many useful cases. For example, a user might want to the collector to mark itself as unhealthy if the following happens:
    • An exporter can not send data due to permissions, networking, or some configuration issue.
    • A receiver has not collected any data
    • A processor or any component has emitted X failures or has not returned success for Y minutes.
  • Configurable: The user should have control over which sorts of failures lead to an unhealthy status.

Minimum Viable Solution: Health Extension that integrates with the plugin framework

Our hope is that we could build an extension which simply integrates with the core of the collector. A user can configure criteria and then the collector will be marked as healthy or unhealthy. For example, if any exporter fails X retries then that could be an unhealthy condition.

Presumably this should be easy. The core code of the collector must track errors returned by exporters.

The health monitor must be sophisticated and store state on which components have failed. For example consider the following:

  1. Exporter test CLA #1 fails retries => collector marked unhealthy
  2. Exporter Create CODEOWNERS #2 fails retries as well => collector still unhealthy
  3. Exporter test CLA #1 recovers => collector still unhealthy because exporter Create CODEOWNERS #2 is still failing
  4. Exporter Create CODEOWNERS #2 recovers => collector now back to healthy

Nice-to-have Proposal 1: Allow plugins to interact with the health monitor

We are trying to figure out if the solution outlined above is sufficient. What if each plugin could directly interact with the health monitor and set the health status as needed?

For example, say could add the following line of code in an error case that you know would be unrecoverable/critical for your users:

monitor.SetUnhealthy()

At the moment, we can not think of any situations which require this... if the health system can monitor errors from components that should be sufficient criteria to mark it as healthy/unhealthy.

Long term work? Health Monitoring for Collector and SDK

This is just an idea that I want to throw out in case its good- could we monitor the health of the SDK + Collector as a system?

For example, may be the SDK and collector can establish a heart beat to verify their connection. And if that breaks, then the collector could be configured to mark itself unhealthy. This covers you in cases where the SDK isn’t sending any data because there’s nothing to send; the heart beat lets you know that this is safe/expected, rather than indicating a problem.

@PettitWesley
Copy link
Author

CC @JohnWu20 who will be working on this with me

@bogdandrutu
Copy link
Member

Couple of questions:

  • Who is the consumer of this? Human or the machine?
  • What would be the protocol to expose this? Would a health per component be enough?

@PettitWesley
Copy link
Author

@bogdandrutu I'm thinking the consumer would be the same as the existing health check extension. As in, it can just be an http endpoint on the collector that returns 200 or not to show the status. Just that now the criteria for healthy is more sophisticated. You can use it for your container health check. And then the health status can be consumed by humans or machines.

I think the current health extension is almost entirely useless/meaningless... the idea is that instead of some simple and relatively opaque "healthy means the collector is 'alive'", users can instead configure a custom health check criteria such that "collector is healthy" == data is being sent to my destination.

Would a health per component be enough?

I think the code probably needs to track the health of each component. The user should be exposed with a single collector health status I think. It might also be useful to let the health system log a message every time a components health changes just in case users are interested in diving deeper.

@bogdandrutu
Copy link
Member

users can instead configure a custom health check criteria such that "collector is healthy" == data is being sent to my destination.

What are the available input data for this conditions?

FYI: Trying to better understand what we need to build in order to figure out what code changes we need :)

@PettitWesley
Copy link
Author

@bogdandrutu I don't know the core functionality of the collector... but the core of the code must track errors from each plugin?

I came to OT from Fluent Bit, it has an "engine" which manages the plugins and sends them data and tracks their success/error/retry status for processing that data. At least for processors/exporters that's how it works. For receivers I'm a bit fuzzy on how that works, but you could definitely track if a receiver has sent you any data.

With those data inputs, you can build a health criteria. You can say for example, that for me healthy means OTLP receiver is uptaking data and the AWS X-Ray exporter is not returning errors for sends. (The receiver part needs some sort of timeout for how long not receiving data is "bad". Don't focus too much on the receiver use case actually... my thinking is that this is mainly useful for exporters).

@bogdandrutu
Copy link
Member

but the core of the code must track errors from each plugin?

Yes I am thinking about something in these lines, expand Component interface to expose a getHealth() functionality which every component can implement.

We do have a "service" which manages the lifetime of all the "Components" so if every component exposes the health we can then make "Service" expose this smart health status.

But I am also thinking how a helper could be implemented to allow components to not have to implement all these "customizations".

@PettitWesley
Copy link
Author

@bogdandrutu I have added this on the agenda for the next collector SIG meeting.

expand Component interface to expose a getHealth() functionality which every component can implement.

That might be ideal to allow each component to implement its own logic for setting its health... but I think there is also a way to implement it that does not require any component to implement anything new.

Again I don't know the collector internals deeply enough, but I saw here: https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter#data-ownership

Each exporter implements ConsumeTraces/ConsumeMetrics/ConsumeLogs- AFAICT each of those functions returns an error- does that error indicate whether they succeeded or not? If it does, then I presume there's some central code that manages calling those functions for each component as needed- could the health system integrate there and just track errors returned by components?

@PettitWesley
Copy link
Author

Actually we'll probably have to discuss this at next week's Collector SIG, just realized I have a conflict tomorrow morning.

@bogdandrutu
Copy link
Member

@PettitWesley for a rough idea, look into this packet https://github.com/open-telemetry/opentelemetry-collector/tree/main/obsreport we called this from every component.

@jrcamp jrcamp added enhancement New feature or request priority:p3 Lowest labels Mar 10, 2021
@PettitWesley
Copy link
Author

PettitWesley commented Apr 8, 2021

todo: Need to check out the zpages extension, it has much of the data we need, may be this health check is an enhancement to it?

https://github.com/open-telemetry/opentelemetry-collector/tree/main/extension/zpagesextension

@PettitWesley
Copy link
Author

@bogdandrutu Please send us the details when you get a chance on how to use obsreport as discussed in the SIG:

  1. Get the number of successes/failure calls for a exporter, not the number of metrics/traces/logs that failed/succeeded to be sent.
  2. Configure an exporter in opencensus to consume these metrics in the code

@bogdandrutu
Copy link
Member

bogdandrutu commented May 26, 2021

See this about adding views #3039. Also see https://github.com/census-instrumentation/opencensus-go#views

What you need to do is to register a view for the "send_failed_spans" measure with the aggregation Count instead of Sum.

@bogdandrutu
Copy link
Member

For exporter, see the example https://github.com/census-instrumentation/opencensus-go/blob/master/examples/gauges/gauge.go#L150 you can implement that interface and keep values in memory instead of in the logs.

@rakyll
Copy link
Contributor

rakyll commented Jun 3, 2021

A naive comment about health check without much context about the past discussions.

When we think about health, we usually think about the case where the collector process is available and ready to accept incoming requests. We should tackle #3002 in order to determine whether collector is ready to "accept" requests at the OTLP endpoints.

@PettitWesley I can help with the OC views and export if you need a hand.

@skyduo
Copy link
Contributor

skyduo commented Jun 14, 2021

Hey @bogdandrutu , we have a design doc for the health check for OT and a few questions include in it, may need you to take a look and approve the design. https://docs.google.com/document/d/1SpUMsWA2DeaoVazeQ8uEc1Wvu5LphmQU_TjzLmuJ4QM/edit#

@skyduo
Copy link
Contributor

skyduo commented Jun 30, 2021

Hi @bogdandrutu ! As we want to add the awsecshealthcheckextension to the opentelemetry-collector-contrib repo due to this is AWS ECS specific extension, there is a issue that the obsreportconfig is located inside the internal package in the opentelemetry-collector, so i cannot get the obsreportconfig/obsmetrics which is need to build our own view, we have some options like:

  1. could we move the obsreportconfig out of the internal folder so that we can directly use it? if you agree with this, it will be easy for us to implement
  2. if we cannot directly use it, then we have to rewrite a lot of duplicate code in the obsreportconfig, which i don't think this pr is good to merge if write a lot of code same as the existing one.

so could you help us on that?

@bogdandrutu
Copy link
Member

could we move the obsreportconfig out of the internal folder so that we can directly use it? if you agree with this, it will be easy for us to implement

Let's craft a small PR and add the views that you need in core, see how large that change is. Is that enough? Would that resolve your issue?

@skyduo
Copy link
Contributor

skyduo commented Jul 10, 2021

Hey @bogdandrutu ! do you mean add our new view into the obsreportconfig in core? i'm not sure if my understand is correct but seems like the new view will still located in the internal folder, right? we still cannot access to it

@github-actions github-actions bot added the Stale label Jul 10, 2023
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Aug 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants