Advanced pipeline status based on data flow #425

Closed
18 tasks done
a-thaler opened this issue Sep 22, 2023 · 3 comments
Labels
area/manager Manager or module changes kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@a-thaler
Collaborator

a-thaler commented Sep 22, 2023

Description
The telemetry pipelines deploy active components to a cluster, and these components dispatch data to third-party services. Interruptions of any kind can prevent successful delivery of the data. Retries and buffering provide short-term resilience. Still, there will be situations where data cannot be delivered, and the user must be notified so that they can react.

The typical situations causing problems are:

  • Connectivity issues to the third-party service
  • Backpressure or throttling caused by overload of the third-party service
  • The ingestion limit of a pipeline is reached (even with auto-scaling, there will be a maxReplica setting)

These situations can be observed by collecting metrics from the relevant components; documentation for that was recently added.
However, that approach is cumbersome because the user needs to know which services to scrape and has to filter for the relevant metrics. Also, these details are internals that might change in the future.

Goal
The telemetry-manager manages the lifecycle of pipelines and should be the only place that knows how to interpret the metrics of the components. Any problematic situation should be reported as a warning in the pipeline and module status so that a user can easily detect the problem. Also, custom metrics should be exposed via a dedicated endpoint (returning only relevant user-facing metrics) that will be maintained long-term, even if internals change. Another typical notification channel could be a k8s event emitted when a pipeline turns unhealthy.

Criteria

  • As a user, I can detect the three described situations with my pipelines as part of the pipeline and module status. The description is good enough to identify the problem. The status is well documented.
  • As a user, I can collect metrics around the new status to observe the three situations and set up alerts in my monitoring backend. (To be covered by "Telemetry module status as metric input to enable dashboarding and alerting on it" #728.)

Ideas
The operator could scrape all active components and interpret the data to report status and custom metrics. For that, a Prometheus sidecar could also be used, which does the scraping in a generic way so that the operator can simply run a PromQL query at any time. An open question is whether historical data would be beneficial.

Potential metrics to expose

| Metric | Meaning | Goal |
| --- | --- | --- |
| telemetry_tracegateway_input_accepted | Counter indicating the successful ingestion rate | Observe if data is arriving |
| telemetry_tracegateway_input_refused | Counter indicating a refusal on the ingestion side | Observe if the component is overloaded and requires scaling |
| telemetry_tracegateway_output_dropped (pipeline=xxx) | Counter indicating a drop of data on the exporter side | Observe if the backend has backpressure |
| telemetry_tracegateway_output_failed (pipeline=xxx) | Counter indicating an unrecoverable failure of data on the exporter side | Observe if there are data consistency problems (400 responses) |
| telemetry_tracegateway_output_send (pipeline=xxx) | Counter indicating the export rate | Observe if data can be exported |
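
To illustrate how the proposed counters might be interpreted, here is a minimal sketch in Python. It is not the telemetry-manager implementation. It assumes two scrapes of the same component are available, and the `Sample`, `rate`, and `flow_flags` names as well as the flag names are hypothetical. The per-pipeline label is left out for brevity, and treating any non-zero rate as a signal is an assumption; a real check would likely use thresholds and a longer time window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One scraped counter value (hypothetical helper type)."""
    timestamp: float  # seconds
    value: float


def rate(earlier: Sample, later: Sample) -> float:
    """Per-second increase between two samples of a monotonic counter."""
    elapsed = later.timestamp - earlier.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in chronological order")
    increase = later.value - earlier.value
    if increase < 0:
        # The counter went backwards, so the component restarted;
        # count only what accumulated since the reset.
        increase = later.value
    return increase / elapsed


def flow_flags(earlier: dict, later: dict) -> dict:
    """Derive data-flow flags from two scrapes (metric name -> Sample)."""
    def r(name: str) -> float:
        return rate(earlier[name], later[name])

    return {
        "receiving": r("telemetry_tracegateway_input_accepted") > 0,
        "throttling": r("telemetry_tracegateway_input_refused") > 0,
        "dropping": r("telemetry_tracegateway_output_dropped") > 0,
        "failing": r("telemetry_tracegateway_output_failed") > 0,
        "exporting": r("telemetry_tracegateway_output_send") > 0,
    }
```

With a Prometheus sidecar, the same calculation would be expressed as a PromQL `rate()` query instead of being done by hand.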

Potential status API
New reasons for the module conditions should be introduced for the new situations a pipeline can be in. For the TraceComponentsHealthy condition, new reasons could be:

| Condition status | Condition reason | Message |
| --- | --- | --- |
| False | TraceGatewayIngestionThrottling | The gateway cannot handle the incoming load. Please scale up manually. |
| False | PipelineDropsData | The configured backend of pipeline XX drops data. Please check if the configured backend can handle the load. |
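
The mapping from observed problems to the proposed reasons could look roughly like the following Python sketch. The condition type, the two reasons, and the messages are taken from the table above; everything else is an assumption for illustration: the `Condition` type, the `trace_condition` function, the rule that throttling takes precedence over dropped data, and the `TraceComponentsRunning` reason used for the healthy case, which is purely a placeholder.

```python
from dataclasses import dataclass

CONDITION_TYPE = "TraceComponentsHealthy"


@dataclass(frozen=True)
class Condition:
    """Simplified stand-in for a Kubernetes status condition."""
    type: str
    status: str
    reason: str
    message: str


def trace_condition(throttling: bool, dropping_pipelines: list) -> Condition:
    """Pick the condition to report for the trace components."""
    if throttling:
        # Assumption: gateway overload is reported first because it
        # affects every pipeline, not only a single backend.
        return Condition(
            CONDITION_TYPE,
            "False",
            "TraceGatewayIngestionThrottling",
            "The gateway cannot handle the incoming load. "
            "Please scale up manually.",
        )
    if dropping_pipelines:
        # Report one pipeline deterministically (alphabetically first).
        name = sorted(dropping_pipelines)[0]
        return Condition(
            CONDITION_TYPE,
            "False",
            "PipelineDropsData",
            f"The configured backend of pipeline {name} drops data. "
            "Please check if the configured backend can handle the load.",
        )
    # Placeholder reason; the issue does not name one for the healthy case.
    return Condition(
        CONDITION_TYPE, "True", "TraceComponentsRunning", "All trace components are running."
    )
```

Because a condition carries only one reason at a time, some ordering between the two situations has to be chosen; the order used here is one possible choice.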

Items

@a-thaler a-thaler added kind/feature Categorizes issue or PR as related to a new feature. area/telemetry labels Sep 22, 2023
@a-thaler a-thaler changed the title Pipeline healthyness as part of status and business metric Pipeline healthyness as part of status, business metric and k8s event Sep 22, 2023
@a-thaler a-thaler added area/manager Manager or module changes and removed area/telemetry labels Oct 6, 2023
@skhalash skhalash self-assigned this Nov 6, 2023
@a-thaler a-thaler added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Jan 8, 2024
@a-thaler a-thaler changed the title Pipeline healthyness as part of status, business metric and k8s event Pipeline healthyness as part of pipeline status Mar 13, 2024
@a-thaler a-thaler changed the title Pipeline healthyness as part of pipeline status Advanced pipeline status based on data flow Mar 22, 2024
@a-thaler a-thaler assigned a-thaler and unassigned skhalash Mar 22, 2024
@a-thaler
Collaborator Author

As part of the investigations, we decided that the main interface should be the module status, which indicates the error situations. That way, the status is automatically accessible via UI and CLI.

On top of that, we need a generic way for the user to collect the status of each module listed in the Kyma CR consistently, most probably without exposing custom metrics via the telemetry-manager. Collecting the status will be covered in a dedicated epic: #728

@a-thaler
Collaborator Author

We planned to finish the open issues within the next two weeks so that promotion to the regular release can start afterwards.

@a-thaler
Collaborator Author

a-thaler commented Jun 3, 2024

The feature will be part of tomorrow's release.

@a-thaler a-thaler added this to the 1.17.0 milestone Jun 4, 2024
@a-thaler a-thaler closed this as completed Jun 5, 2024