Spike for pipeline healthyness as part of the status #713

a-thaler · 2024-01-15T16:29:26Z

Description
To realize #425, a concrete concept has been agreed on. To determine whether the solution fits the problem and to find obstacles, a spike should be done that brings all building blocks together and allows verification of the e2e flow.

Run an e2e spike focusing on the business functionality, collect metrics, and leverage them to serve the new API:

- Have a Prometheus instance running (no automation needed yet)
- Configure Prometheus so that well-defined alerts are available for the error situations
  - Every alert should represent one error situation, per pipeline if applicable
- Implement the alert webhook to trigger reconciliation, triggering only the needed reconciliations (do not reconcile traces if only log alerts are firing)
- Define a concrete status API (for the new Status API of metrics)
- Have a look into the old status API and get a feeling for integrating it as well
- Implement the reconcile logic and set the status (for the new Status API of metrics)
- Show the status in the dashboard
- Reduce the number of time series as much as possible
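
The webhook item above could be sketched roughly as follows. This is only an illustration, not the implementation from the spike: the payload shape (a JSON array of alerts with a `labels` map), the `signal` label, and the names `Alert`, `signalsToReconcile`, and `newAlertHandler` are all assumptions made for this example.

```go
package main

import (
	"encoding/json"
	"io"
	"net/http"
	"sort"
)

// Alert is a hypothetical, minimal view of one alert in the webhook payload.
type Alert struct {
	Labels map[string]string `json:"labels"`
}

// signalsToReconcile returns the sorted, de-duplicated signal types
// (for example "logs", "metrics", "traces") named by the assumed "signal" label.
// Alerts without that label are ignored.
func signalsToReconcile(alerts []Alert) []string {
	seen := map[string]bool{}
	for _, a := range alerts {
		if s := a.Labels["signal"]; s != "" {
			seen[s] = true
		}
	}
	signals := make([]string, 0, len(seen))
	for s := range seen {
		signals = append(signals, s)
	}
	sort.Strings(signals)
	return signals
}

// newAlertHandler decodes the payload and calls trigger once per affected
// signal type, so a firing log alert does not reconcile trace pipelines.
func newAlertHandler(trigger func(signal string)) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, "cannot read body", http.StatusBadRequest)
			return
		}
		var alerts []Alert
		if err := json.Unmarshal(body, &alerts); err != nil {
			http.Error(w, "invalid payload", http.StatusBadRequest)
			return
		}
		for _, s := range signalsToReconcile(alerts) {
			trigger(s)
		}
		w.WriteHeader(http.StatusOK)
	})
}
```

In a controller, `trigger` would typically enqueue a reconcile request for the pipelines of that signal type. It is left as a plain callback here so that the routing logic stays independent of any controller framework.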

Reasons

Attachments

Release Notes


rakesh-garimella · 2024-01-30T20:49:25Z

ADR for not making Prometheus part of the service mesh: #758

rakesh-garimella · 2024-01-30T20:52:56Z

Required Istio settings

Prometheus won't be part of the Istio service mesh, so no settings are required.

What if the status is in an alert and Prometheus restarts: will a recovery from the alert be detected?

While Prometheus is restarting, we see the error: failed to query Prometheus alerts: Get "http://prometheus-server.default:80/api/v1/alerts". Once Prometheus is healthy again, the alerts are still shown.
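
To illustrate the behaviour described above, here is a minimal sketch of how the response of the Prometheus `/api/v1/alerts` endpoint could be evaluated. The response fields used (`status`, `data.alerts`, `labels`, `state`) follow the Prometheus HTTP API. The function names and the policy of keeping the previous result while the query fails are assumptions for this example, not the behaviour confirmed by the spike.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// alertsResponse mirrors only the parts of the /api/v1/alerts response used here.
type alertsResponse struct {
	Status string `json:"status"`
	Data   struct {
		Alerts []struct {
			Labels map[string]string `json:"labels"`
			State  string            `json:"state"`
		} `json:"alerts"`
	} `json:"data"`
}

// firingAlertNames returns the "alertname" label of every alert in state "firing".
func firingAlertNames(body []byte) ([]string, error) {
	var resp alertsResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, fmt.Errorf("failed to decode Prometheus alerts: %w", err)
	}
	if resp.Status != "success" {
		return nil, fmt.Errorf("unexpected response status %q", resp.Status)
	}
	names := []string{}
	for _, a := range resp.Data.Alerts {
		if a.State == "firing" {
			names = append(names, a.Labels["alertname"])
		}
	}
	return names, nil
}

// nextFiring applies a hypothetical policy: if the query failed (for example
// while Prometheus restarts), keep the previously known firing alerts instead
// of treating the failure as a recovery.
func nextFiring(previous, current []string, queryErr error) []string {
	if queryErr != nil {
		return previous
	}
	return current
}
```

With such a policy, a failed query during a restart neither clears nor raises an alert. The status only changes once Prometheus answers again and reports the current alert state.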

Can there be flaky alerts?

We could not reproduce this. With the current query, once an alert is triggered it takes at least 5 minutes for it to be resolved. So if the interval over which the alert is evaluated is long enough, we should not see flakiness.

What is the actual footprint of Prometheus?

rakesh-garimella · 2024-02-28T09:12:23Z

The ADR for the health API (#802) has been merged, so closing the issue.

a-thaler added area/metrics MetricPipeline area/manager Manager or module changes labels Jan 15, 2024

a-thaler mentioned this issue Jan 15, 2024

Advanced pipeline status based on data flow #425

Closed

18 tasks

a-thaler added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 22, 2024

rakesh-garimella self-assigned this Jan 24, 2024

rakesh-garimella mentioned this issue Jan 30, 2024

docs: ADR: Do not make prometheus part of the Istio service mesh #758

Merged

8 tasks

rakesh-garimella mentioned this issue Jan 30, 2024

feat: Set metricpipeline status based on alerts #753

Closed

8 tasks

skhalash assigned skhalash and unassigned rakesh-garimella Feb 5, 2024

skhalash mentioned this issue Feb 16, 2024

docs: ADR: Telemetry Flow Healthiness Status API #802

Merged

8 tasks

skhalash added this to the 1.10.0 milestone Feb 16, 2024

a-thaler modified the milestones: 1.10.0, 1.11.0 Feb 26, 2024

rakesh-garimella closed this as completed Feb 28, 2024
