Advanced pipeline status based on data flow #425

a-thaler · 2023-09-22T09:41:05Z

Description
The telemetry pipelines deploy active components to a cluster, and these components dispatch data to third-party services. Interruptions of any kind can prevent successful delivery of the data. Retries and buffering provide short-term resilience. Still, there will be situations where data cannot be delivered, and the user needs to be notified so that they can react.

The typical situations causing problems are:

Connectivity issues to the third-party service
Backpressure/throttling caused by overload of the third-party service
Ingestion limit of a pipeline is reached (even with auto-scaling, there will be a maxReplica setting)

These situations can be observed by collecting metrics of the relevant components; documentation for that was recently added.
However, that approach is cumbersome because the user needs to know which services to scrape and needs to filter the relevant metrics. Also, these details are internals that might change in the future.

Goal
The telemetry-manager manages the lifecycle of pipelines and should be the only place that knows how to interpret the metrics of the components. Any problematic situation should be reported as a warning in the pipeline and module status so that a user can easily detect the problem. Also, custom metrics should be exposed via a dedicated endpoint (returning only relevant user-facing metrics) that will be maintained long-term, even if internals change. Another typical notification channel could be to emit a k8s event when a pipeline turns unhealthy.

Criteria

As a user, I can detect the 3 described situations with my pipelines as part of the pipeline and module status. The description is good enough to identify the problem. The status is well documented.
As a user, I can collect metrics around the new status to observe the 3 situations and set up alerts in my monitoring backend (to be covered by "Telemetry module status as metric input to enable dashboarding and alerting on it" #728).

Ideas
The operator could scrape all active components and interpret the data to report status and custom metrics. For that, a Prometheus sidecar could also be used that does the scrape job in a generic way, so that the operator can just run a PromQL query at any time. An open question is whether historical data can be beneficial.
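A minimal sketch of the PromQL idea, assuming the sidecar exposes the standard Prometheus HTTP API. The sidecar address, the `pipeline` label, and the 5-minute window are assumptions for illustration only; the metric name is taken from the proposal in this issue. The response body is a hand-written sample, not real output.

```python
import json
from urllib.parse import urlencode

# Hypothetical in-cluster address of the Prometheus sidecar (an assumption).
PROMETHEUS_URL = "http://telemetry-self-monitor:9090"

# Assumed query: pipelines whose exporter dropped data in the last 5 minutes.
DROPPED_QUERY = (
    "sum by (pipeline) "
    "(rate(telemetry_tracegateway_output_dropped[5m])) > 0"
)


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def pipelines_with_results(response_body: str) -> set:
    """Return the `pipeline` label of every series in an instant-query response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return {
        series["metric"]["pipeline"]
        for series in payload["data"]["result"]
        if "pipeline" in series["metric"]
    }


# Hand-written sample in the shape of an instant-query vector response.
sample_response = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"pipeline": "pipeline-a"}, "value": [1700000000, "0.5"]},
        ],
    },
})

print(build_query_url(PROMETHEUS_URL, DROPPED_QUERY))
print(pipelines_with_results(sample_response))
```

Keeping the query as a string constant means the operator only parses label sets, which leaves the threshold logic in PromQL rather than in operator code.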

Potential metrics to expose

Metric	Meaning	Goal
telemetry_tracegateway_input_accepted	counter indicating successful ingestion rate	observe if data is arriving
telemetry_tracegateway_input_refused	counter indicating a refusal on ingestion side	observe if the component is overloaded and requires scaling
telemetry_tracegateway_output_dropped (pipeline=xxx)	counter indicating a drop of data on exporter side	observe if the backend has backpressure
telemetry_tracegateway_output_failed (pipeline=xxx)	counter indicating an unrecoverable failure of data on exporter side	observe if there are data consistency problems (400 responses)
telemetry_tracegateway_output_send (pipeline=xxx)	counter indicating the export rate	observe if data can be exported
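To illustrate how these counters could be interpreted, here is a rough sketch that compares two scrapes and reports which counters indicate a problem. The `CounterSnapshot` type, the finding names, and the "any increase is a finding" rule are assumptions for illustration; a real implementation would likely use rates over a time window and tuned thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterSnapshot:
    """Hypothetical container for one scrape of the proposed counters."""
    input_accepted: float
    input_refused: float
    output_dropped: float
    output_failed: float
    output_send: float


def increase(previous: float, current: float) -> float:
    """Counter increase between two scrapes.

    A decrease is treated as a counter reset, so the increase is the
    current value (the same convention Prometheus uses for counters).
    """
    return current if current < previous else current - previous


def findings(prev: CounterSnapshot, curr: CounterSnapshot) -> list:
    """List the problems suggested by the change between two scrapes."""
    result = []
    if increase(prev.input_accepted, curr.input_accepted) == 0:
        result.append("no_data_arriving")
    if increase(prev.input_refused, curr.input_refused) > 0:
        result.append("ingestion_refused")
    if increase(prev.output_dropped, curr.output_dropped) > 0:
        result.append("backend_backpressure")
    if increase(prev.output_failed, curr.output_failed) > 0:
        result.append("unrecoverable_export_failure")
    if increase(prev.output_send, curr.output_send) == 0:
        result.append("no_data_exported")
    return result


previous = CounterSnapshot(100, 0, 0, 0, 100)
current = CounterSnapshot(250, 5, 20, 0, 130)
print(findings(previous, current))
```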

Potential status API
New reasons for the module conditions should be introduced for the new situations a pipeline can be in. For the TraceComponentsHealthy condition, new reasons could be:

Condition status	Condition reason	Message
False	TraceGatewayIngestionThrottling	The gateway cannot handle the incoming load. Please scale up manually.
False	PipelineDropsData	The configured backend of pipeline XX drops data. Please check if the configured backend can handle the load.
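As a sketch of how such a condition could be assembled, the following builds a dictionary with the usual Kubernetes condition fields (`type`, `status`, `reason`, `message`, `lastTransitionTime`). The two unhealthy reasons and their messages come from the table above; the healthy reason `TraceComponentsRunning` and the function name are hypothetical and only serve the example.

```python
from datetime import datetime, timezone

# Reasons and messages proposed in this issue; {pipeline} is filled in at runtime.
UNHEALTHY_MESSAGES = {
    "TraceGatewayIngestionThrottling":
        "The gateway cannot handle the incoming load. Please scale up manually.",
    "PipelineDropsData":
        "The configured backend of pipeline {pipeline} drops data. "
        "Please check if the configured backend can handle the load.",
}

# Hypothetical reason for the healthy case (not defined in this issue).
HEALTHY_REASON = "TraceComponentsRunning"


def trace_components_condition(reason=None, pipeline="", now=None) -> dict:
    """Build a TraceComponentsHealthy condition for the module status."""
    now = now or datetime.now(timezone.utc)
    if reason is None:
        status, message = "True", "All trace components are running."
        reason = HEALTHY_REASON
    else:
        # Raises KeyError for a reason that is not in the table.
        status = "False"
        message = UNHEALTHY_MESSAGES[reason].format(pipeline=pipeline)
    return {
        "type": "TraceComponentsHealthy",
        "status": status,
        "reason": reason,
        "message": message,
        "lastTransitionTime": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


print(trace_components_condition("PipelineDropsData", pipeline="my-pipeline"))
```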

Items

a-thaler · 2024-03-25T15:56:04Z

As part of the investigations, we decided that the main interface should be the module status indicating the error situations. With that, the status is automatically accessible via UI and CLI.

In addition, we need a generic way to enable the user to collect the status of each module listed in the Kyma CR in a consistent way, most probably without exposing custom metrics via the telemetry-manager. The story of collecting the status will be covered in a dedicated epic: #728

a-thaler · 2024-05-14T08:35:48Z

We planned to finish the open issues within the next 2 weeks, so that a promotion to the regular release can be started afterwards.

a-thaler · 2024-06-03T08:19:40Z

The feature will be part of tomorrow's release.

a-thaler added kind/feature area/telemetry labels Sep 22, 2023

a-thaler changed the title ~~Pipeline healthyness as part of status and business metric~~ Pipeline healthyness as part of status, business metric and k8s event Sep 22, 2023

a-thaler added area/manager and removed area/telemetry labels Oct 6, 2023

rakesh-garimella assigned rakesh-garimella and unassigned rakesh-garimella Oct 25, 2023

skhalash self-assigned this Nov 6, 2023

This was referenced Nov 6, 2023

docs: ADR: Trace/Metric Pipeline status based on Otel Collector metrics #516

Merged

PoC: Assess the impact of direct Prometheus queries on reconciliation duration #526

Closed

skhalash mentioned this issue Dec 5, 2023

Align Telemetry Pipeline Statuses with Kubernetes API Conventions #601

Closed

a-thaler added the lifecycle/frozen label Jan 8, 2024

a-thaler mentioned this issue Jan 15, 2024

Spike for pipeline healthyness as part of the status #713

Closed

a-thaler mentioned this issue Feb 5, 2024

Spike for hardened prometheus setup #770

Closed

This was referenced Feb 12, 2024

Understand and design kubernetes events for pipeline status #787

Open

Deploy the self-monitor on pipeline activations #807

Closed

a-thaler mentioned this issue Mar 4, 2024

Report self-monitor alerts to the pipeline status #822

Closed

a-thaler mentioned this issue Mar 11, 2024

Advanced status feature on development release with custom image for self-monitor #871

Closed

a-thaler changed the title ~~Pipeline healthyness as part of status, business metric and k8s event~~ Pipeline healthyness as part of pipeline status Mar 13, 2024

This was referenced Mar 18, 2024

Add webhook to trigger status reconciliations faster #903

Closed

Metrics and alerts for Fluent Bit in advanced status #904

Closed

a-thaler changed the title ~~Pipeline healthyness as part of pipeline status~~ Advanced pipeline status based on data flow Mar 22, 2024

a-thaler assigned a-thaler and unassigned skhalash Mar 22, 2024

This was referenced Mar 25, 2024

Support for fluentbit in advanced status handling #917

Closed

Document advanced status API #918

Closed

Full test coverage for advanced status #919

Closed

Self-Monitor monitoring #920

Closed

a-thaler mentioned this issue Apr 8, 2024

fix: Set content-security-policy header in webhook handler #949

Merged

8 tasks

a-thaler mentioned this issue Apr 16, 2024

Reflect agent scrape problems in pipeline status #976

Open

9 tasks

a-thaler mentioned this issue May 17, 2024

Advanced pipeline status based on data flow as part of regular release #1093

Closed

a-thaler added this to the 1.17.0 milestone Jun 4, 2024

a-thaler closed this as completed Jun 5, 2024
