Advanced pipeline status based on data flow #425
Comments
As part of the investigations we decided that the main interface should be the module status indicating the error situations. With that, the status is accessible via UI and CLI automatically. On top of that, we need a generic way to enable the user to collect the status of each module listed in the Kyma CR in a consistent way, most probably without exposing custom metrics via the telemetry-manager. The story of collecting the status will be covered in a dedicated epic: #728
We planned to finish the open issues within the next 2 weeks, so that a promotion to the regular release can be started afterwards.
The feature will be part of tomorrow's release.
Description
The telemetry pipelines deploy active components to the cluster which dispatch data to third-party services. Any kind of interruption can prevent a successful delivery of the data. Retries and buffering introduce short-term resilience; still, there will be situations where data cannot be delivered, and the user needs to get notified about that in order to react.
The typical situations causing problems are:
These situations can be observed by collecting metrics of the relevant components; documentation for that was recently added.
However, that approach is cumbersome, as the user needs to know which services to scrape and has to filter out the relevant metrics. Also, these details are internals which might change in the future.
Goal
The telemetry-manager manages the lifecycle of the pipelines and should be the only place that knows how to interpret the metrics of the components. Any problematic situation should be reported as a warning in the pipeline and module status so that a user can easily detect the problem. Also, custom metrics should be exposed via a dedicated endpoint (returning only relevant user-facing metrics) which will be maintained long-term, even if internals change. Another typical channel for notifying about problems could be emitting a Kubernetes event whenever a pipeline turns unhealthy.
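As a rough sketch of what such a dedicated endpoint could look like, the operator could serve a separate Prometheus registry that holds only the curated, user-facing metrics. The metric name, label, and port below are illustrative assumptions, not decisions from this issue:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// userFacingRegistry holds only the curated, user-facing metrics so that
// internal operator metrics can change without breaking this contract.
var userFacingRegistry = prometheus.NewRegistry()

// pipelineDataLoss is a hypothetical example metric; the actual names are
// collected under "Potential metrics to expose" below.
var pipelineDataLoss = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "telemetry_pipeline_data_loss",
		Help: "1 if the pipeline is currently dropping data, 0 otherwise.",
	},
	[]string{"pipeline"},
)

func main() {
	userFacingRegistry.MustRegister(pipelineDataLoss)

	// Serve the curated metrics on a dedicated endpoint.
	http.Handle("/metrics", promhttp.HandlerFor(userFacingRegistry, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Keeping this registry separate from the operator's internal one is what allows internals to change without breaking the user-facing contract.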
Criteria
Ideas
The operator could scrape all active components and interpret the data in order to report status and custom metrics. Alternatively, a Prometheus sidecar could do the scrape job in a generic way, so that the operator can run a PromQL query at any time. An open question here is whether historical data would be beneficial.
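To make the sidecar idea concrete: if a Prometheus sidecar does the scraping, the operator can issue PromQL queries through the standard Prometheus HTTP API client during reconciliation. A minimal sketch; the sidecar address and the concrete query are assumptions for illustration:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

// queryExporterFailures asks the Prometheus sidecar whether any exporter
// is currently failing to send spans.
func queryExporterFailures(ctx context.Context) error {
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		return err
	}

	promAPI := promv1.NewAPI(client)
	result, warnings, err := promAPI.Query(ctx,
		`rate(otelcol_exporter_send_failed_spans[5m]) > 0`, time.Now())
	if err != nil {
		return err
	}
	if len(warnings) > 0 {
		fmt.Println("query warnings:", warnings)
	}
	fmt.Println("failing exporters:", result)
	return nil
}
```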
Potential metrics to expose
Potential status API
New reasons for the module conditions should be introduced for the new situations a pipeline can be in. For the TraceComponentsHealthy condition, new reasons could be:
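As a hypothetical illustration of how such a reason would surface in the status, the operator could set the condition via the apimachinery helpers; the reason name and message below are placeholders, not the agreed values:

```go
package status

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setTraceComponentsUnhealthy flips the TraceComponentsHealthy condition
// to False. The reason and message are hypothetical placeholders; the
// actual set of reasons is what this issue is about.
func setTraceComponentsUnhealthy(conditions *[]metav1.Condition, generation int64) {
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:               "TraceComponentsHealthy",
		Status:             metav1.ConditionFalse,
		Reason:             "SomeDataDropped", // hypothetical reason name
		Message:            "Trace data is buffered or dropped because the backend is unreachable",
		ObservedGeneration: generation,
	})
}
```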
Items
- Define which situations we want to detect and whether they can be detected with this approach (which metrics and labels we need to collect)
- Have a simple base setup deploying a Prometheus instance and start collecting the needed metrics
- Add metric collection to identify the typical notification situations and filter the relevant labels
- Add alert rules for those situations, including the important labels
- What is the query performance? Is it good enough to run queries as part of reconciliation?
- Can we base the queries on alerts only? (see the sketch after this list)
- Will recording rules help compared to raw queries?
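For the question of basing the queries on alerts only, a minimal sketch: instead of evaluating raw PromQL at reconcile time, the operator could ask the Prometheus sidecar for its currently firing alerts. The function and its use are an assumption, not an agreed design:

```go
package main

import (
	"context"

	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

// anyAlertFiring returns true if the Prometheus sidecar currently has a
// firing alert, which could be enough to flip the pipeline status to
// unhealthy without issuing raw PromQL queries.
func anyAlertFiring(ctx context.Context, promAPI promv1.API) (bool, error) {
	result, err := promAPI.Alerts(ctx)
	if err != nil {
		return false, err
	}
	for _, alert := range result.Alerts {
		if alert.State == promv1.AlertStateFiring {
			return true, nil
		}
	}
	return false, nil
}
```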