[component] component.go - when is component.Start returning errors? #9324
Comments
Thanks for opening this issue @atoulme. I don't think that we have well-defined rules for how components should handle errors during start, but now that we have component status reporting, and an in-progress health check extension that makes use of that data, it makes sense for us to come up with guidelines for how this should work together. This table has the definitions that I have in mind for each status:
Now that we have the permanent error state, I think it makes sense to ask whether or not we need fatal error any longer. The reason it was included with status reporting is that it was a preexisting error state. The intent of the permanent error state is to allow the collector to continue running in a degraded mode and alert users that there was a condition detected at runtime that will require human intervention to fix. I would propose the following guidelines for start:
Users can use the health check extension and / or logs to monitor permanent and recoverable errors. If we agree that a collector that starts without an error should continue running, albeit in a possibly degraded state, we can remove fatal errors altogether and communicate these via permanent error status events instead. In the examples you cite:
This should be reported as a recoverable error, and recovery should be monitored / assessed via the health check extension. The extension allows a user to set a recovery duration that must elapse before a recoverable error is considered unhealthy.
Presumably this could be discovered before start as a part of config validation. However, if there are complications that make that impossible, it should report a permanent error and let the collector run in a degraded mode. The permanent error can be monitored via the health check extension and / or logs.
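To make the monitoring side of this concrete, here is a minimal sketch of how a health check could treat the two error kinds. It assumes, as described elsewhere in this thread, that both kinds count as healthy unless the user opts in, and that a recoverable error only becomes unhealthy once the recovery duration has elapsed. The types `Status`, `Event`, `Settings` and the function `IsHealthy` are hypothetical stand-ins for illustration, not the extension's actual API.

```go
package main

import (
	"fmt"
	"time"
)

// Status is a simplified, hypothetical stand-in for component status values.
type Status int

const (
	StatusOK Status = iota
	StatusRecoverableError
	StatusPermanentError
)

// Event records a status and the time it was reported.
type Event struct {
	Status    Status
	Timestamp time.Time
}

// Settings models the opt-in knobs discussed in the thread.
// Field names are assumptions made for this sketch.
type Settings struct {
	IncludePermanent   bool          // treat permanent errors as unhealthy
	IncludeRecoverable bool          // treat recoverable errors as unhealthy
	RecoveryDuration   time.Duration // grace period for recoverable errors
}

// Fixed values so the example below is deterministic.
var demoReported = time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)

const demoRecovery = time.Minute

// IsHealthy reports whether an event counts as healthy at time now.
func IsHealthy(ev Event, s Settings, now time.Time) bool {
	switch ev.Status {
	case StatusPermanentError:
		return !s.IncludePermanent
	case StatusRecoverableError:
		if !s.IncludeRecoverable {
			return true
		}
		// Healthy only while still inside the recovery window.
		return now.Sub(ev.Timestamp) < s.RecoveryDuration
	default:
		return true
	}
}

func main() {
	ev := Event{Status: StatusRecoverableError, Timestamp: demoReported}
	s := Settings{IncludeRecoverable: true, RecoveryDuration: demoRecovery}
	fmt.Println(IsHealthy(ev, s, demoReported.Add(demoRecovery/2))) // inside the window
	fmt.Println(IsHealthy(ev, s, demoReported.Add(2*demoRecovery))) // past the window
}
```

With the zero-value `Settings`, every event is reported as healthy, which matches the opt-in default described later in the discussion.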
I'll propose an alternative to the guidelines I previously mentioned. The previous proposal is still on the table, as well as any other alternatives anyone wants to propose. For this alternative, I suggest that we remove the error return value from start and that we remove This would eliminate the confusion between a
Regardless of the In your alternative approach I like it keeps things unambiguous. Right now we also have no ambiguity - if a component returns an error during As you mentioned in your guideline proposal, if we try to write rules for when a component should report an error there are always going to be edge cases. We're also leaving the state of the collector up to component authors, hoping they use component status correctly. If a component reports With your alternative approach a misused status no longer causes the whole collector to stop. It also is unambiguous about what the collector does when a component has an error on startup - it keeps running. I think my one fear with the alternative proposal is that it would allow a collector to start with all components experiencing Maybe that's ok? Feels like it will lead to confusion with end users. Is it possible to detect this situation and stop the collector? Or maybe it is better to leave
This situation is one of the downsides to allowing a collector to run in a degraded state. While not impossible, having all components in a permanent error state should be an edge case.
The collector itself doesn't keep or aggregate component status. The (new) health check extension does and can surface this information, although it wouldn't shut the collector down.
In my mind the blockers for this idea are then:
If we allow configuring the capability then I think there is no problem with the edge case. My opinion is that we should allow it to be configurable. If the behavior is configurable, I think there are a couple of solutions:
I'm ultimately ok with allowing As things currently stand, the primary way to fail fast is returning an error from To sum up this secondary proposal:
After thinking about this a little more, I think we can retain the behavior of the previous proposal and simplify how errors are handled during start. I suggest that we remove the error return value from To clarify this proposal suggests we:
Reading through this issue and #9823, I have an alternative proposal to consider:
This approach has the benefits of removing any ambiguity between I think this proposal has a downside and that's how to stop the collector when async work from Start fails. I'd like to better understand this use case, are there any components you can share as examples? I am hoping the async work could report
There are two issues with this proposal. One is that The other issue is that the collector does not maintain or aggregate status for components. It dispatches status events to
@mwear in that specific zipkin example, would it be inappropriate to report a
Could it? Is there a reason why the collector's Run loop couldn't read from the Status channel and react accordingly?
The FatalError reported by the Zipkin receiver and FatalErrors generally are a result of the
It could, but I don't think that it should. This goes against the design of component status reporting. The system was designed with the idea that components report their status via events, the collector automates this where possible, and it dispatches the events to extensions (implementers of the StatusWatcher interface). The extensions can process events as they see fit. The collector provides the machinery to pass events between components and extensions. It makes no attempt to interpret the events. That responsibility is left to extensions.

As a concrete example, we can briefly mention the new health check extension. It has an event aggregation system in order to answer questions about pipeline health and overall collector health; however, this system is specific to the extension and would not likely be adopted into the core collector. The aggregation itself varies based on user configuration. There is a preliminary set of knobs to tune aggregation depending on what the user wants to consider healthy or not, and there will likely be many more to follow. By default both Permanent and Recoverable errors are considered healthy. Users have to opt in to consider either, or both, to be unhealthy. Recoverables have the additional feature that you can specify a recovery interval during which they should recover. They will be considered healthy until the interval has elapsed and unhealthy afterwards.

While the health check extension is an early example of a StatusWatcher, I expect more will be added over time. I believe we will eventually have an extension to export the raw events (as OTLP logs) to backends that will process them according to their own rules. Getting back on topic, component status reporting was not designed to determine what is healthy, only to facilitate the reporting of events between components and extensions; the extensions handle the events as they choose.
Moving aggregation into the collector, or having a parallel system, is not really in line with the original design and I don't think there is a good reason to do this if we keep FatalError as a fail-fast mechanism.
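The division of responsibility described in this comment can be illustrated with a small sketch: a dispatcher that only forwards events, and a watcher that decides what to do with them. Everything here is a simplified assumption for illustration. `Watcher`, `OnStatus`, `Dispatcher`, and `countingWatcher` are hypothetical names, not the collector's real interfaces or signatures.

```go
package main

import "fmt"

// Status is a simplified, hypothetical status value.
type Status string

const (
	StatusOK             Status = "OK"
	StatusPermanentError Status = "PermanentError"
)

// Event is a simplified status event emitted by a component.
type Event struct {
	Component string
	Status    Status
}

// Watcher plays the role the thread assigns to StatusWatcher implementers.
// The method name is an assumption made for this sketch.
type Watcher interface {
	OnStatus(ev Event)
}

// Dispatcher forwards events to watchers and never interprets them.
type Dispatcher struct {
	watchers []Watcher
}

// Register adds a watcher that will receive every subsequent event.
func (d *Dispatcher) Register(w Watcher) {
	d.watchers = append(d.watchers, w)
}

// Report delivers the event to each registered watcher, unchanged.
func (d *Dispatcher) Report(ev Event) {
	for _, w := range d.watchers {
		w.OnStatus(ev)
	}
}

// countingWatcher is a toy extension with its own aggregation rule:
// it simply counts events per status.
type countingWatcher struct {
	counts map[Status]int
}

func newCountingWatcher() *countingWatcher {
	return &countingWatcher{counts: map[Status]int{}}
}

func (c *countingWatcher) OnStatus(ev Event) {
	c.counts[ev.Status]++
}

func main() {
	d := &Dispatcher{}
	w := newCountingWatcher()
	d.Register(w)

	d.Report(Event{Component: "receiver/a", Status: StatusPermanentError})
	d.Report(Event{Component: "exporter/b", Status: StatusOK})

	fmt.Println(w.counts[StatusOK], w.counts[StatusPermanentError])
}
```

In this shape the aggregation policy lives entirely in the watcher, so a different extension could apply different rules to the same stream of events without any change to the dispatcher.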
#### Description
Adds an RFC for component status reporting. The main goal is to define what component status reporting is, our current implementation, and how such a system interacts with a 1.0 component. When merged, the following issues will be unblocked:
- #9823
- #10058
- #9957
- #9324
- #6506

Co-authored-by: Matthew Wear
Co-authored-by: Pablo Baeyens
With #10413 merged, the decision to keep |
The component.Start function returns an error. When is an error expected to be returned upon starting? Configuration validation happens outside of the lifecycle of the component.
The error is handled as reporting a permanent error:
opentelemetry-collector/service/internal/graph/graph.go
Lines 396 to 402 in c5a2c78
A few examples from contrib from working with lifecycle tests recently:
Typically, we see those when the components cannot pass lifecycle tests:
https://github.com/open-telemetry/opentelemetry-collector-contrib/issues?q=is%3Aissue+is%3Aopen+skip+lifecycle
Given that we have component status reporting now, would it make sense to ask component implementers to handle start failures via component status instead of returning an error?
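As a rough illustration of what that could look like, the sketch below has Start classify a failure and report it as a status instead of returning it. This is only one possible shape under stated assumptions: the `component` struct, the `reporter` callback, the `connect` hook, and the `errTransient` / `errBadCredentials` sentinels are all hypothetical and are not part of the collector's API.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Status is a simplified, hypothetical status value.
type Status string

const (
	StatusOK               Status = "OK"
	StatusRecoverableError Status = "RecoverableError"
	StatusPermanentError   Status = "PermanentError"
)

// Hypothetical sentinel errors used to classify start failures.
var (
	errTransient      = errors.New("transient failure")
	errBadCredentials = errors.New("bad credentials")
)

// reporter is a hypothetical callback standing in for status reporting.
type reporter func(Status, error)

// component is a toy component whose startup work is the connect hook.
type component struct {
	connect func(context.Context) error
	report  reporter
}

// Start reports failures as status events and returns nil, so a
// failing component leaves the rest of the collector running.
func (c *component) Start(ctx context.Context) error {
	if err := c.connect(ctx); err != nil {
		if errors.Is(err, errTransient) {
			c.report(StatusRecoverableError, err) // may heal without intervention
		} else {
			c.report(StatusPermanentError, err) // needs human intervention
		}
		return nil
	}
	c.report(StatusOK, nil)
	return nil
}

// runStart starts a component whose connect hook returns connectErr
// and returns the last reported status along with Start's own result.
func runStart(connectErr error) (Status, error) {
	var last Status
	c := &component{
		connect: func(context.Context) error { return connectErr },
		report:  func(s Status, _ error) { last = s },
	}
	err := c.Start(context.Background())
	return last, err
}

func main() {
	wrapped := fmt.Errorf("dial backend: %w", errTransient)
	for _, e := range []error{nil, wrapped, errBadCredentials} {
		status, err := runStart(e)
		fmt.Println(status, err)
	}
}
```

The trade-off discussed in this issue shows up directly: nothing in this shape stops the collector, so a fail-fast path would still need a returned error or some other mechanism.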