-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce component status reporting #6560
Introduce component status reporting #6560
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this better, much simpler change and cleaner design (at least for me).
// StatusSource component that reports a status about itself. | ||
// The implementation of this interface must be comparable to be useful as a map key. | ||
type StatusSource interface { | ||
ID() ID | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need this interface at all? Simply pass the ID?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need a key that uniquely identifies the component. ID is not a unique identification. You may have a processor and a receiver that have the same ID (we don't enforce that receivers and processors don't have the same type string - although in reality they are all unique in core and contrib).
Why it needs to be unique? So that the StatusWatcher can maintain a map[StatusSource]Status
.
Alternatively we should enforce uniqueness of ID across all component types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting point... I understand now the need, let me think a bit more, probably better to have this as a struct if we can define it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yet another alternate is to require the Host to track the aggregate component status. In that case StatusWatcher will only be notified of that single aggregate bool status. But in that case we have to hard-code the aggregation logic in the Host and lose flexibility in allowing the extension to decide what to do with individual component statuses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the "global component ID", I think we need something like:
package component
type GlobalID struct {
ID ID
Kind Kind
PipelineID ID // Not empty only if the Kind is Processor
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also requires #6540 otherwise you can still have duplicates even with GlobalID.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point about that. Now I am more confused after I looked at your implementation https://github.com/open-telemetry/opentelemetry-collector/pull/6560/files#diff-bcb3fc758a54acf0dda21a0d9c79b6cf6e6adcf58d0a282c93be50553523b09dR41 and https://github.com/open-telemetry/opentelemetry-collector/pull/6560/files#diff-19abf8e6f60ec275051d1447d4d97f4c55b449192f2cb946e3e4ea44ff49fcc0R69
At least in the statusReportingComponent
is not unique, so my suggestion at least improves that :))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think &statusReportingComponent{}
is unique. We create a new instance for every created component. It is not a value of statusReportingComponent{}
struct, we create a pointer to a new struct every time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a test that verifies the uniqueness of StatusSource
: https://github.com/tigrannajaryan/opentelemetry-collector/blob/2ac7d388cf548e538601796ec9fff25ae5eb6a78/service/service_test.go#L229
// StatusWatcher is an extra interface for Extension hosted by the OpenTelemetry | ||
// Collector that is to be implemented by extensions interested in changes to component | ||
// status. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any Component or just extensions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only extensions can watch. Any component can be a source.
We can add support for any component to watch but that requires more work and I don't know if we need it.
// Err returns the error associated with the ComponentEvent. | ||
func (ev *StatusEvent) Err() error { | ||
return ev.err | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be a "Description" instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean to rename to func (ev *StatusEvent) Description() error
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func (ev *StatusEvent) Description() string
not sure we need error
as type, just a Description
string. Idea being that Description
is a bit more generic, can apply to things like StatusRecovering
or other status type.
Codecov ReportBase: 91.35% // Head: 90.63% // Decreases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## main #6560 +/- ##
==========================================
- Coverage 91.35% 90.63% -0.72%
==========================================
Files 241 245 +4
Lines 13885 13988 +103
==========================================
- Hits 12684 12678 -6
- Misses 956 1070 +114
+ Partials 245 240 -5
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
0f0980b
to
09d0295
Compare
Yeah, I also like this one more. |
@open-telemetry/collector-approvers please take a look at this alternate. |
9d9b647
to
2ac7d38
Compare
// ReportComponentStatus can be used by a component to communicate its status to the Host. | ||
// The Host implementations may broadcast this information to interested parties via | ||
// StatusWatcher interface. | ||
ReportComponentStatus(event *StatusEvent) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addition of this method is a breaking change. See failure on Jaeger repo: https://github.com/open-telemetry/opentelemetry-collector/actions/runs/3481700391/jobs/5823123326
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bogdandrutu we have a problem with this.
Adding a method to this interface breaks external codebases (like Jaeger).
I do not see a good way to handle this gracefully. We can add the reporting capability to a different interface and probe for it in the component's, but that does not look very nice to me, e.g.
type ComponentStatusAcceptor interface {
ReportComponentStatus(event *StatusEvent)
}
func (c* MyComponent) Start(ctx context.Context, host component.Host) error {
if h, ok:=host.(ComponentStatusAcceptor); ok {
h.ReportComponentStatus(...)
}
}
Another approach would be to add a new optional Start2(ctx context.Context, host component.Host2)
to Component2 interface and components can opt in to it, but must continue to support Start(). This looks even uglier.
If we don't want to handle this gracefully then we must coordinate this change with Jaeger.
Any other ideas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I submitted a contrib PR that should remove the dependency on Jaeger collector, so we can make contrib tests pass, but I believe we will still likely need to run through a breaking change.
I see a few other problems with the current state of this interface:
- The set of methods doesn't seem to be complete and potentially can change after we release 1.0, e.g. we might want to introduce getters for receivers and processors, not only exporters and extensions.
- Most of the methods start with
Get
which goes against Go recommendations, maybe we want to change them and removeGet
prefix.
I believe we need to take the opportunity and make a bigger refactoring that would allow us to solve all the outlined problems at once. We discussed the problem with @bogdandrutu and came up with the following plan:
Keep only one "core" method ReportComponentStatus
the Host
interface and move all other "getter" methods to optional interfaces. The optional interfaces for exporter and extension getters can be moved to their own packages (exporter/extension), so we don't need to change returned values from Extension|Exporter
toComponent
as done in #6553 anymore.
I'll play with it, submit a draft PR, then we can combine it with this PR and try to migrate to the new interfaces as gracefully as feasible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm catching up on this PR. It looks like this is where it stalled out. Some thoughts on how to move forward:
The set of methods doesn't seem to be complete and potentially can change after we release 1.0, e.g. we might want to introduce getters for receivers and processors, not only exporters and extensions.
I'm proposing that we deprecate GetExporters
(#7370). I don't believe there are valid reasons to need this function or others like it, such as GetReceivers
, GetProcessors
, etc.
Most of the methods start with Get which goes against Go recommendations, maybe we want to change them and remove Get prefix.
Between renaming GetExtensions
to Extensions
, deprecating ReportFatalError
and GetExporters
, and adding ReportComponentStatus
, it looks like we want quite a different interface. Hard to believe we will just live with this as is, so ideally we just need to identify the best migration option. It looks like we no longer need to coordinate with Jaeger but presumably this will affect other components defined elsewhere.
I also tried to identify a graceful option. The best I came up with is quite ugly, but would require no unannounced breaking changes and does not require changes to any other interfaces (i.e. component.Component
):
- Add an optional interface so it is possible to use the new functions. I'll call it
component.Host2
. - Deprecate
ReportFatalError
,GetExporters
, andGetExtensions
to alert users of these specific functions. Recommend the ugly solution of asserting the optional interfaces (temporarily). - After some time Deprecate
component.Host
to alert all all affected. However, clearly indicate that this interface will be changed tocomponent.Host
, rather than removed. - Change
component.Host
to matchcomponent.Host2
. Undeprecatecomponent.Host
and deprecatecomponent.Host2
This is an alternate to open-telemetry#6550 ## Summary of changes - Add component status concept. Components can report their status via Host.ReportComponentStatus(). Interested extensions can watch status events if they implement StatusWatcher interface. This is similar to how PipelineWatcher works today. - Deprecate Host.ReportFatalError() in favour of Host.ReportComponentStatus(). ## TODO after this is merged - healthcheck extension must implement StatusWatcher. - Replace all ReportFatalError() calls by ReportComponentStatus() calls in core and contrib. ## Open Questions - StatusWatchers need to be able to tell if all current components are healthy. It is assumed that the listeners need to maintain a map of components and track the status of each component. This works only if we assume that the set of components cannot change during the lifetime of the listener. This assumption is true today but can change in the future if we introduce partial pipeline restarts where only modified/added/removed components are recreated (this will break listener's assumption and the map will become invalid). Should we instead keep track of this entire status map in the Host and broadcast the entire status to the listeners as a whole instead of (or in addition to) individual component events?
2ac7d38
to
444ed5d
Compare
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tigrannajaryan what's the status of this PR? Was there ever a workaround the breaking interface change or is it still blocked on that?
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
(still want to work on this, was on vacation for quite some time) |
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
Closing this since I don't work on it anymore. |
This picks up where open-telemetry#6560 left off. The first step is to get the code introduced in that PR working with the collector as it is today. There were significant changes to how pipelines are built and the component package was split into separate packages based on type (extension, processor, etc). This commit makes the necessary changes to get everything working, likely not in the most ideal way, but it's a start that we can iterate on.
This is an implementation based on the thread starting with this comment: open-telemetry#6560 (comment)
This picks up where open-telemetry#6560 left off. The first step is to get the code introduced in that PR working with the collector as it is today. There were significant changes to how pipelines are built and the component package was split into separate packages based on type (extension, processor, etc). This commit makes the necessary changes to get everything working, likely not in the most ideal way, but it's a start that we can iterate on.
This is an implementation based on the thread starting with this comment: open-telemetry#6560 (comment)
This picks up where open-telemetry#6560 left off. The first step is to get the code introduced in that PR working with the collector as it is today. There were significant changes to how pipelines are built and the component package was split into separate packages based on type (extension, processor, etc). This commit makes the necessary changes to get everything working, likely not in the most ideal way, but it's a start that we can iterate on.
This is an implementation based on the thread starting with this comment: open-telemetry#6560 (comment)
This picks up where open-telemetry#6560 left off. The first step is to get the code introduced in that PR working with the collector as it is today. There were significant changes to how pipelines are built and the component package was split into separate packages based on type (extension, processor, etc). This commit makes the necessary changes to get everything working, likely not in the most ideal way, but it's a start that we can iterate on.
This is an implementation based on the thread starting with this comment: open-telemetry#6560 (comment)
This picks up where open-telemetry#6560 left off. The first step is to get the code introduced in that PR working with the collector as it is today. There were significant changes to how pipelines are built and the component package was split into separate packages based on type (extension, processor, etc). This commit makes the necessary changes to get everything working, likely not in the most ideal way, but it's a start that we can iterate on.
This is an implementation based on the thread starting with this comment: open-telemetry#6560 (comment)
This picks up where open-telemetry#6560 left off. The first step is to get the code introduced in that PR working with the collector as it is today. There were significant changes to how pipelines are built and the component package was split into separate packages based on type (extension, processor, etc). This commit makes the necessary changes to get everything working, likely not in the most ideal way, but it's a start that we can iterate on.
This is an implementation based on the thread starting with this comment: open-telemetry#6560 (comment)
This PR introduces component status reporting. There have been several attempts to introduce this functionality previously, with the most recent being: #6560. This PR was orignally based off of #6560, but has evolved based on the feedback received and some additional enhancements to improve the ease of use of the `ReportComponentStatus` API. In earlier discussions (see #8169 (comment)) we decided to model status as a finite state machine with the following statuses: `Starting`, `OK`, `RecoverableError`, `PermanentError`, `FatalError`. `Stopping`, and `Stopped`. A benefit of this design is that `StatusWatcher`s will be notified on changes in status rather than on potentially repetitive reports of the same status. With the additional statuses and modeling them using a finite state machine, there are more statuses to report. Rather than having each component be responsible for reporting all of the statuses, I automated status reporting where possible. A component's status will automatically be set to `Starting` at startup. If the components `Start` returns an error, the status will automatically be set to `PermanentError`. A component is expected to report `StatusOK` when it has successfully started (if it has successfully started) and from there can report changes in status as it runs. It will likely be a common scenario for components to transition between `StatusOK` and `StatusRecoverableError` during their lifetime. In extenuating circumstances they can transition into terminal states of `PermanentError` and `FatalError` (where a fatal error initiates collector shutdown). Additionally, during component Shutdown statuses are automatically reported where possible. A component's status is set to `Stopping` when Shutdown is initially called, if Shutdown returns an error, the status will be set to `PermanentError` if it does not return an error, the status is set to `Stopped`. In #6560 ReportComponentStatus was implemented on the `Host` interface. I found that few components use the Host interface, and none of them save a handle to it (to be used outside of the `start` method). I found that many components keep a handle to the `TelemetrySettings` that they are initialized with, and this seemed like a more natural, convenient place for the `ReportComponentStatus` API. I'm ultimately flexible on where this method resides, but feel that `TelemetrySettings` a more user friendly place for it. Regardless of where the `ReportComponentStatus` method resides (Host or TelemetrySettings), there is a difference in the method signature for the API based on whether it is used from the service or from a component. As the service is not bound to a specific component, it needs to take the `instanceID` of a component as a parameter, whereas the component version of the method already knows the `instanceID`. In #6560 this led to having both `component.Host` and `servicehost.Host` versions of the Host interface to be used at the component or service levels. In this version, we have the same for TelemetrySettings. There is a `component.TelemetrySettings` and a `servicetelemetry.Settings` with the only difference being the method signature of `ReportComponentStatus`. Lastly, this PR sets up the machinery for report component status, and allows extensions to be `StatusWatcher`s, but it does not introduce any `StatusWatcher`s. We expect the OpAMP extension to be a `StatusWatcher` and use data from this system as part of its AgentHealth message (the message is currently being extended to accommodate more component level details). We also expect there to be a non-OpAMP `StatusWatcher` implementation, likely via the HealthCheck extension (or something similiar). **Link to tracking Issue:** #7682 cc: @tigrannajaryan @djaglowski @evan-bradley --------- Co-authored-by: Tigran Najaryan <[email protected]> Co-authored-by: Pablo Baeyens <[email protected]> Co-authored-by: Daniel Jaglowski <[email protected]> Co-authored-by: Evan Bradley <[email protected]> Co-authored-by: Tigran Najaryan <[email protected]> Co-authored-by: Alex Boten <[email protected]>
This is an alternate to #6550
Summary of changes
TODO after this is merged
Open Questions