Spike for pipeline healthyness as part of the status #713

Closed
a-thaler opened this issue Jan 15, 2024 · 3 comments
Labels
area/manager Manager or module changes area/metrics MetricPipeline kind/feature Categorizes issue or PR as related to a new feature.

@a-thaler
Collaborator

a-thaler commented Jan 15, 2024

Description
To realize #425, a concrete concept has been agreed on. To determine whether the solution fits the problem and to uncover obstacles, a spike should be done that brings all building blocks together and allows verification of the e2e flow.

Run an e2e spike focusing on the business functionality, collect metrics, and use them to serve the new API:

  • Have a Prometheus instance running (no automation needed yet)
  • Configure Prometheus so that well-defined alerts are available for the error situations
    • every alert should represent exactly one error situation, per pipeline if applicable
  • Implement the alert webhook to trigger reconciliation, triggering only the needed reconciliations (do not reconcile traces if only log alerts are firing)
  • Define a concrete status API (for the new Status API of metrics)
  • Look into the old status API and get a feeling for integrating it as well
  • Implement the reconcile logic and set the status (for the new Status API of metrics)
  • Show the status in the dashboard
  • Reduce the number of time series as much as possible
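
The "trigger only the needed reconciliations" step could be sketched roughly as follows. This is a minimal illustration, not the spike's actual implementation: it assumes the webhook body is a JSON array of alerts that each carry an `alertname` label, and that alert names start with a prefix identifying the signal type. The alert names, the prefix convention, `kindByPrefix`, and `kindsToReconcile` are all hypothetical.

```go
package main

import (
	"encoding/json"
	"sort"
	"strings"
)

// alert is the assumed shape of one entry in the webhook body.
type alert struct {
	Labels map[string]string `json:"labels"`
}

// kindByPrefix maps a hypothetical alert-name prefix to the pipeline kind
// whose reconciliation should be triggered.
var kindByPrefix = map[string]string{
	"Log":    "LogPipeline",
	"Metric": "MetricPipeline",
	"Trace":  "TracePipeline",
}

// kindsToReconcile returns the sorted, de-duplicated pipeline kinds that are
// affected by the alerts in body. Kinds without a matching alert are omitted,
// so a log alert never triggers a trace reconciliation.
func kindsToReconcile(body []byte) ([]string, error) {
	var alerts []alert
	if err := json.Unmarshal(body, &alerts); err != nil {
		return nil, err
	}

	seen := map[string]bool{}
	for _, a := range alerts {
		name := a.Labels["alertname"]
		for prefix, kind := range kindByPrefix {
			if strings.HasPrefix(name, prefix) {
				seen[kind] = true
			}
		}
	}

	kinds := make([]string, 0, len(seen))
	for kind := range seen {
		kinds = append(kinds, kind)
	}
	sort.Strings(kinds)
	return kinds, nil
}
```

A webhook handler would call `kindsToReconcile` on the request body and enqueue a reconcile event only for the returned kinds. Encoding the signal type in a dedicated alert label instead of the name prefix would work the same way.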

Reasons

Attachments

Release Notes


@a-thaler a-thaler added area/metrics MetricPipeline area/manager Manager or module changes labels Jan 15, 2024
@a-thaler a-thaler added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 22, 2024
@rakesh-garimella rakesh-garimella self-assigned this Jan 24, 2024
@rakesh-garimella
Contributor

ADR for not making Prometheus part of the service mesh: #758

@rakesh-garimella
Contributor

rakesh-garimella commented Jan 30, 2024

Required Istio settings

Prometheus won't be part of the Istio service mesh, so no settings are required.

What if the status is in an alert and Prometheus restarts? Will a recovery from the alert be detected?

While Prometheus is restarting, we see the error failed to query Prometheus alerts: Get \"http://prometheus-server.default:80/api/v1/alerts\". Once Prometheus is healthy again, the alerts are still shown.
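
The query behind that error could look roughly like the sketch below. It is an assumption-laden illustration rather than the spike's code: `queryFiringAlerts` and `parseFiringAlerts` are hypothetical names, and only the `status`, `state`, and `labels` fields of the Prometheus `/api/v1/alerts` response are decoded. A failed request is returned as an error, so the caller can keep the last known status instead of treating an unreachable Prometheus as "no alerts".

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// alertsResponse covers the subset of the /api/v1/alerts response used here.
type alertsResponse struct {
	Status string `json:"status"`
	Data   struct {
		Alerts []struct {
			Labels map[string]string `json:"labels"`
			State  string            `json:"state"`
		} `json:"alerts"`
	} `json:"data"`
}

// parseFiringAlerts returns the names of all alerts in state "firing".
func parseFiringAlerts(body []byte) ([]string, error) {
	var resp alertsResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, fmt.Errorf("failed to decode Prometheus alerts: %w", err)
	}
	if resp.Status != "success" {
		return nil, fmt.Errorf("unexpected Prometheus response status %q", resp.Status)
	}

	var names []string
	for _, a := range resp.Data.Alerts {
		if a.State == "firing" {
			names = append(names, a.Labels["alertname"])
		}
	}
	return names, nil
}

// queryFiringAlerts fetches the alerts from baseURL, for example
// "http://prometheus-server.default:80". A connection failure (such as during
// a Prometheus restart) surfaces as an error rather than an empty list.
func queryFiringAlerts(client *http.Client, baseURL string) ([]string, error) {
	res, err := client.Get(baseURL + "/api/v1/alerts")
	if err != nil {
		return nil, fmt.Errorf("failed to query Prometheus alerts: %w", err)
	}
	defer res.Body.Close()

	body, err := io.ReadAll(res.Body)
	if err != nil {
		return nil, fmt.Errorf("failed to read Prometheus alerts response: %w", err)
	}
	return parseFiringAlerts(body)
}
```

Keeping the error distinct from an empty result matters for the restart scenario: the reconciler can leave the pipeline status untouched while the query fails and update it only after a successful response.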

Can there be flaky alerts?

Could not reproduce this: with the current query, once an alert is triggered it takes at least 5 minutes for the alert to be resolved. So if we keep the interval over which the alert is evaluated long enough, we should not see flakiness.

What is the actual footprint of Prometheus?

@rakesh-garimella
Contributor

The ADR for the health API (#802) has been merged, so closing the issue.
