Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce component status reporting #6560

Conversation

tigrannajaryan
Copy link
Member

@tigrannajaryan tigrannajaryan commented Nov 15, 2022

This is an alternate to #6550

Summary of changes

  • Add component status concept. Components can report their status via Host.ReportComponentStatus(). Interested extensions can watch status events if they implement StatusWatcher interface. This is similar to how PipelineWatcher works today.
  • Deprecate Host.ReportFatalError() in favour of Host.ReportComponentStatus().

TODO after this is merged

  • healthcheck extension must implement StatusWatcher.
  • Replace all ReportFatalError() calls by ReportComponentStatus() calls in core and contrib.

Open Questions

  • StatusWatchers need to be able to tell if all current components are healthy. It is assumed that the listeners need to maintain a map of components and track the status of each component. This works only if we assume that the set of components cannot change during the lifetime of the listener. This assumption is true today but can change in the future if we introduce partial pipeline restarts where only modified/added/removed components are recreated (this will break listener's assumption and the map will become invalid). Should we instead keep track of this entire status map in the Host and broadcast the entire status to the listeners as a whole instead of (or in addition to) individual component events?

Copy link
Member

@bogdandrutu bogdandrutu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this better, much simpler change and cleaner design (at least for me).

component/status.go Outdated Show resolved Hide resolved
Comment on lines +29 to +32
// StatusSource component that reports a status about itself.
// The implementation of this interface must be comparable to be useful as a map key.
type StatusSource interface {
ID() ID
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need this interface at all? Simply pass the ID?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a key that uniquely identifies the component. ID is not a unique identification. You may have a processor and a receiver that have the same ID (we don't enforce that receivers and processors don't have the same type string - although in reality they are all unique in core and contrib).

Why it needs to be unique? So that the StatusWatcher can maintain a map[StatusSource]Status.

Alternatively we should enforce uniqueness of ID across all component types.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting point... I understand now the need, let me think a bit more, probably better to have this as a struct if we can define it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yet another alternate is to require the Host to track the aggregate component status. In that case StatusWatcher will only be notified of that single aggregate bool status. But in that case we have to hard-code the aggregation logic in the Host and lose flexibility in allowing the extension to decide what to do with individual component statuses.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the "global component ID", I think we need something like:

package component

type GlobalID struct {
  ID ID
  Kind Kind
  PipelineID ID  // Not empty only if the Kind is Processor
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also requires #6540 otherwise you can still have duplicates even with GlobalID.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think &statusReportingComponent{} is unique. We create a new instance for every created component. It is not a value of statusReportingComponent{} struct, we create a pointer to a new struct every time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +100 to +83
// StatusWatcher is an extra interface for Extension hosted by the OpenTelemetry
// Collector that is to be implemented by extensions interested in changes to component
// status.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any Component or just extensions?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only extensions can watch. Any component can be a source.

We can add support for any component to watch but that requires more work and I don't know if we need it.

Comment on lines +50 to +46
// Err returns the error associated with the ComponentEvent.
func (ev *StatusEvent) Err() error {
return ev.err
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be a "Description" instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean to rename to func (ev *StatusEvent) Description() error?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

func (ev *StatusEvent) Description() string not sure we need error as type, just a Description string. Idea being that Description is a bit more generic, can apply to things like StatusRecovering or other status type.

component/status.go Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Nov 15, 2022

Codecov Report

Base: 91.35% // Head: 90.63% // Decreases project coverage by -0.71% ⚠️

Coverage data is based on head (2ac7d38) compared to base (a8d6280).
Patch coverage: 29.09% of modified lines in pull request are covered.

❗ Current head 2ac7d38 differs from pull request most recent head 444ed5d. Consider uploading reports for the commit 444ed5d to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6560      +/-   ##
==========================================
- Coverage   91.35%   90.63%   -0.72%     
==========================================
  Files         241      245       +4     
  Lines       13885    13988     +103     
==========================================
- Hits        12684    12678       -6     
- Misses        956     1070     +114     
+ Partials      245      240       -5     
Impacted Files Coverage Δ
component/componenttest/nop_host.go 90.00% <0.00%> (-10.00%) ⬇️
component/componenttest/statuswatcher_extension.go 0.00% <0.00%> (ø)
component/componenttest/unhealthy_processor.go 0.00% <0.00%> (ø)
component/status.go 0.00% <0.00%> (ø)
service/internal/servicehost/nop_host.go 40.00% <40.00%> (ø)
service/extensions/extensions.go 81.13% <64.28%> (-1.85%) ⬇️
service/internal/pipelines/pipelines.go 94.70% <85.71%> (-0.42%) ⬇️
service/collector.go 76.31% <100.00%> (ø)
service/host.go 100.00% <100.00%> (ø)
service/internal/components/host_wrapper.go 100.00% <100.00%> (ø)
... and 8 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@tigrannajaryan tigrannajaryan force-pushed the feature/tigran/status-via-interfaces branch from 0f0980b to 09d0295 Compare November 16, 2022 00:53
@tigrannajaryan
Copy link
Member Author

I like this better, much simpler change and cleaner design (at least for me).

Yeah, I also like this one more.

@tigrannajaryan
Copy link
Member Author

@open-telemetry/collector-approvers please take a look at this alternate.

@tigrannajaryan tigrannajaryan force-pushed the feature/tigran/status-via-interfaces branch 2 times, most recently from 9d9b647 to 2ac7d38 Compare November 16, 2022 17:34
// ReportComponentStatus can be used by a component to communicate its status to the Host.
// The Host implementations may broadcast this information to interested parties via
// StatusWatcher interface.
ReportComponentStatus(event *StatusEvent)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addition of this method is a breaking change. See failure on Jaeger repo: https://github.com/open-telemetry/opentelemetry-collector/actions/runs/3481700391/jobs/5823123326

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bogdandrutu we have a problem with this.

Adding a method to this interface breaks external codebases (like Jaeger).

I do not see a good way to handle this gracefully. We can add the reporting capability to a different interface and probe for it in the component's, but that does not look very nice to me, e.g.

type ComponentStatusAcceptor interface {
  ReportComponentStatus(event *StatusEvent)
}

func (c* MyComponent) Start(ctx context.Context, host component.Host) error {
  if h, ok:=host.(ComponentStatusAcceptor); ok {
    h.ReportComponentStatus(...)
  }
}

Another approach would be to add a new optional Start2(ctx context.Context, host component.Host2) to Component2 interface and components can opt in to it, but must continue to support Start(). This looks even uglier.

If we don't want to handle this gracefully then we must coordinate this change with Jaeger.

Any other ideas?

Copy link
Member

@dmitryax dmitryax Nov 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I submitted a contrib PR that should remove the dependency on Jaeger collector, so we can make contrib tests pass, but I believe we will still likely need to run through a breaking change.

I see a few other problems with the current state of this interface:

  1. The set of methods doesn't seem to be complete and potentially can change after we release 1.0, e.g. we might want to introduce getters for receivers and processors, not only exporters and extensions.
  2. Most of the methods start with Get which goes against Go recommendations, maybe we want to change them and remove Get prefix.

I believe we need to take the opportunity and make a bigger refactoring that would allow us to solve all the outlined problems at once. We discussed the problem with @bogdandrutu and came up with the following plan:
Keep only one "core" method ReportComponentStatus the Host interface and move all other "getter" methods to optional interfaces. The optional interfaces for exporter and extension getters can be moved to their own packages (exporter/extension), so we don't need to change returned values from Extension|Exporter toComponent as done in #6553 anymore.

I'll play with it, submit a draft PR, then we can combine it with this PR and try to migrate to the new interfaces as gracefully as feasible.

Copy link
Member

@djaglowski djaglowski Mar 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm catching up on this PR. It looks like this is where it stalled out. Some thoughts on how to move forward:

The set of methods doesn't seem to be complete and potentially can change after we release 1.0, e.g. we might want to introduce getters for receivers and processors, not only exporters and extensions.

I'm proposing that we deprecate GetExporters (#7370). I don't believe there are valid reasons to need this function or others like it, such as GetReceivers, GetProcessors, etc.

Most of the methods start with Get which goes against Go recommendations, maybe we want to change them and remove Get prefix.

Between renaming GetExtensions to Extensions, deprecating ReportFatalError and GetExporters, and adding ReportComponentStatus, it looks like we want quite a different interface. Hard to believe we will just live with this as is, so ideally we just need to identify the best migration option. It looks like we no longer need to coordinate with Jaeger but presumably this will affect other components defined elsewhere.

I also tried to identify a graceful option. The best I came up with is quite ugly, but would require no unannounced breaking changes and does not require changes to any other interfaces (i.e. component.Component):

  1. Add an optional interface so it is possible to use the new functions. I'll call it component.Host2.
  2. Deprecate ReportFatalError, GetExporters, and GetExtensions to alert users of these specific functions. Recommend the ugly solution of asserting the optional interfaces (temporarily).
  3. After some time Deprecate component.Host to alert all all affected. However, clearly indicate that this interface will be changed to component.Host, rather than removed.
  4. Change component.Host to match component.Host2. Undeprecate component.Host and deprecate component.Host2

This is an alternate to open-telemetry#6550

## Summary of changes

- Add component status concept. Components can report their status via
  Host.ReportComponentStatus(). Interested extensions can watch status events
  if they implement StatusWatcher interface.
  This is similar to how PipelineWatcher works today.
- Deprecate Host.ReportFatalError() in favour of Host.ReportComponentStatus().

## TODO after this is merged

- healthcheck extension must implement StatusWatcher.
- Replace all ReportFatalError() calls by ReportComponentStatus() calls in core and contrib.

## Open Questions

- StatusWatchers need to be able to tell if all current components are healthy.
  It is assumed that the listeners need to maintain a map of components and track
  the status of each component. This works only if we assume that the set of
  components cannot change during the lifetime of the listener. This assumption
  is true today but can change in the future if we introduce partial pipeline
  restarts where only modified/added/removed components are recreated (this will
  break listener's assumption and the map will become invalid). Should we
  instead keep track of this entire status map in the Host and broadcast the
  entire status to the listeners as a whole instead of (or in addition to)
  individual component events?
@tigrannajaryan tigrannajaryan force-pushed the feature/tigran/status-via-interfaces branch from 2ac7d38 to 444ed5d Compare November 21, 2022 14:02
@github-actions
Copy link
Contributor

github-actions bot commented Dec 6, 2022

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Dec 6, 2022
@dmitryax dmitryax removed the Stale label Dec 6, 2022
@github-actions
Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions
Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Jan 18, 2023
@dmitryax dmitryax removed the Stale label Jan 18, 2023
@github-actions
Copy link
Contributor

github-actions bot commented Feb 2, 2023

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Feb 2, 2023
@dmitryax dmitryax removed the Stale label Feb 2, 2023
@github-actions
Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Feb 17, 2023
@dmitryax dmitryax removed the Stale label Feb 17, 2023
Copy link
Contributor

@codeboten codeboten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tigrannajaryan what's the status of this PR? Was there ever a workaround the breaking interface change or is it still blocked on that?

@github-actions
Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Mar 14, 2023
@dmitryax dmitryax removed the Stale label Mar 14, 2023
@github-actions
Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Mar 30, 2023
@dmitryax dmitryax removed the Stale label Mar 30, 2023
@jaronoff97
Copy link
Contributor

(still want to work on this, was on vacation for quite some time)

@github-actions
Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions
Copy link
Contributor

github-actions bot commented May 9, 2023

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label May 9, 2023
@djaglowski djaglowski removed the Stale label May 9, 2023
@tigrannajaryan
Copy link
Member Author

Closing this since I don't work on it anymore.

mwear added a commit to mwear/opentelemetry-collector that referenced this pull request Aug 2, 2023
This picks up where open-telemetry#6560
left off. The first step is to get the code introduced in that PR working with the collector
as it is today. There were significant changes to how pipelines are built and the
component package was split into separate packages based on type (extension, processor, etc).
This commit makes the necessary changes to get everything working, likely not in the most
ideal way, but it's a start that we can iterate on.
@mwear mwear mentioned this pull request Aug 2, 2023
mwear added a commit to mwear/opentelemetry-collector that referenced this pull request Aug 3, 2023
This is an implementation based on the thread starting with this comment:
open-telemetry#6560 (comment)
mwear added a commit to mwear/opentelemetry-collector that referenced this pull request Aug 23, 2023
This picks up where open-telemetry#6560
left off. The first step is to get the code introduced in that PR working with the collector
as it is today. There were significant changes to how pipelines are built and the
component package was split into separate packages based on type (extension, processor, etc).
This commit makes the necessary changes to get everything working, likely not in the most
ideal way, but it's a start that we can iterate on.
mwear added a commit to mwear/opentelemetry-collector that referenced this pull request Aug 23, 2023
This is an implementation based on the thread starting with this comment:
open-telemetry#6560 (comment)
mwear added a commit to mwear/opentelemetry-collector that referenced this pull request Sep 10, 2023
This picks up where open-telemetry#6560
left off. The first step is to get the code introduced in that PR working with the collector
as it is today. There were significant changes to how pipelines are built and the
component package was split into separate packages based on type (extension, processor, etc).
This commit makes the necessary changes to get everything working, likely not in the most
ideal way, but it's a start that we can iterate on.
mwear added a commit to mwear/opentelemetry-collector that referenced this pull request Sep 10, 2023
This is an implementation based on the thread starting with this comment:
open-telemetry#6560 (comment)
mwear added a commit to mwear/opentelemetry-collector that referenced this pull request Sep 26, 2023
This picks up where open-telemetry#6560
left off. The first step is to get the code introduced in that PR working with the collector
as it is today. There were significant changes to how pipelines are built and the
component package was split into separate packages based on type (extension, processor, etc).
This commit makes the necessary changes to get everything working, likely not in the most
ideal way, but it's a start that we can iterate on.
mwear added a commit to mwear/opentelemetry-collector that referenced this pull request Sep 26, 2023
This is an implementation based on the thread starting with this comment:
open-telemetry#6560 (comment)
mwear added a commit to mwear/opentelemetry-collector that referenced this pull request Sep 29, 2023
This picks up where open-telemetry#6560
left off. The first step is to get the code introduced in that PR working with the collector
as it is today. There were significant changes to how pipelines are built and the
component package was split into separate packages based on type (extension, processor, etc).
This commit makes the necessary changes to get everything working, likely not in the most
ideal way, but it's a start that we can iterate on.
mwear added a commit to mwear/opentelemetry-collector that referenced this pull request Sep 29, 2023
This is an implementation based on the thread starting with this comment:
open-telemetry#6560 (comment)
mwear added a commit to mwear/opentelemetry-collector that referenced this pull request Oct 6, 2023
This picks up where open-telemetry#6560
left off. The first step is to get the code introduced in that PR working with the collector
as it is today. There were significant changes to how pipelines are built and the
component package was split into separate packages based on type (extension, processor, etc).
This commit makes the necessary changes to get everything working, likely not in the most
ideal way, but it's a start that we can iterate on.
mwear added a commit to mwear/opentelemetry-collector that referenced this pull request Oct 6, 2023
This is an implementation based on the thread starting with this comment:
open-telemetry#6560 (comment)
codeboten pushed a commit that referenced this pull request Oct 6, 2023
This PR introduces component status reporting. There have been several
attempts to introduce this functionality previously, with the most
recent being: #6560.

This PR was orignally based off of #6560, but has evolved based on the
feedback received and some additional enhancements to improve the ease
of use of the `ReportComponentStatus` API.

In earlier discussions (see
#8169 (comment))
we decided to model status as a finite state machine with the following
statuses: `Starting`, `OK`, `RecoverableError`, `PermanentError`,
`FatalError`. `Stopping`, and `Stopped`. A benefit of this design is
that `StatusWatcher`s will be notified on changes in status rather than
on potentially repetitive reports of the same status.

With the additional statuses and modeling them using a finite state
machine, there are more statuses to report. Rather than having each
component be responsible for reporting all of the statuses, I automated
status reporting where possible. A component's status will automatically
be set to `Starting` at startup. If the components `Start` returns an
error, the status will automatically be set to `PermanentError`. A
component is expected to report `StatusOK` when it has successfully
started (if it has successfully started) and from there can report
changes in status as it runs. It will likely be a common scenario for
components to transition between `StatusOK` and `StatusRecoverableError`
during their lifetime. In extenuating circumstances they can transition
into terminal states of `PermanentError` and `FatalError` (where a fatal
error initiates collector shutdown). Additionally, during component
Shutdown statuses are automatically reported where possible. A
component's status is set to `Stopping` when Shutdown is initially
called, if Shutdown returns an error, the status will be set to
`PermanentError` if it does not return an error, the status is set to
`Stopped`.

In #6560 ReportComponentStatus was implemented on the `Host` interface.
I found that few components use the Host interface, and none of them
save a handle to it (to be used outside of the `start` method). I found
that many components keep a handle to the `TelemetrySettings` that they
are initialized with, and this seemed like a more natural, convenient
place for the `ReportComponentStatus` API. I'm ultimately flexible on
where this method resides, but feel that `TelemetrySettings` a more user
friendly place for it.

Regardless of where the `ReportComponentStatus` method resides (Host or
TelemetrySettings), there is a difference in the method signature for
the API based on whether it is used from the service or from a
component. As the service is not bound to a specific component, it needs
to take the `instanceID` of a component as a parameter, whereas the
component version of the method already knows the `instanceID`. In #6560
this led to having both `component.Host` and `servicehost.Host` versions
of the Host interface to be used at the component or service levels. In
this version, we have the same for TelemetrySettings. There is a
`component.TelemetrySettings` and a `servicetelemetry.Settings` with the
only difference being the method signature of `ReportComponentStatus`.

Lastly, this PR sets up the machinery for report component status, and
allows extensions to be `StatusWatcher`s, but it does not introduce any
`StatusWatcher`s. We expect the OpAMP extension to be a `StatusWatcher`
and use data from this system as part of its AgentHealth message (the
message is currently being extended to accommodate more component level
details). We also expect there to be a non-OpAMP `StatusWatcher`
implementation, likely via the HealthCheck extension (or something
similiar).

**Link to tracking Issue:** #7682

cc: @tigrannajaryan @djaglowski @evan-bradley

---------

Co-authored-by: Tigran Najaryan <[email protected]>
Co-authored-by: Pablo Baeyens <[email protected]>
Co-authored-by: Daniel Jaglowski <[email protected]>
Co-authored-by: Evan Bradley <[email protected]>
Co-authored-by: Tigran Najaryan <[email protected]>
Co-authored-by: Alex Boten <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants