Improve watchdog alert #2467

Draft
wants to merge 1 commit into
base: main

Conversation

a-Tell

@a-Tell a-Tell commented Jul 29, 2024

Description

This expression fires the Watchdog alert only if the TSDB is up to date and therefore checks the functionality of the full stack.

According to the runbook, the intention of the Watchdog alert is, quote:

> This is an alert meant to ensure that the entire alerting pipeline is functional.

`vector(1)` fires as long as the Alertmanager and Prometheus Pods are up and running. When either of them is down, the Watchdog with this expression serves its purpose as intended: its absence signals the problem.
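For reference, the existing rule is shaped roughly like this (simplified for illustration; the labels and annotations in the generated manifests may differ):

```yaml
# Roughly the shape of the current Watchdog rule (illustrative; the generated
# manifests may carry additional labels and annotations):
- alert: Watchdog
  expr: vector(1)
  labels:
    severity: none
  annotations:
    summary: >-
      An alert that should always be firing to certify that Alertmanager
      is working properly.
```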

However, when the TSDB storage runs full, the Watchdog with `vector(1)` will still fire, but all alerts whose expressions depend on a metric will no longer trigger, because those metrics are now missing. In short, a dysfunctional Prometheus stack goes completely unnoticed.

This may also happen in the (admittedly unlikely) case that the storage fills up faster than Prometheus can fire a KubePersistentVolumeFillingUp alert.
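The exact expression proposed in this PR is not shown in the conversation; as a rough sketch of the direction described above (not necessarily the PR's expression), the Watchdog could be tied to TSDB freshness by requiring that samples have been appended recently:

```yaml
# Illustrative sketch only -- not necessarily the expression this PR proposes.
# The Watchdog fires only while Prometheus has appended samples to the TSDB
# head over the last five minutes, so a Prometheus that has stopped ingesting
# (e.g. because its storage is full) also silences the Watchdog.
- alert: Watchdog
  expr: |
    vector(1) and on() (rate(prometheus_tsdb_head_samples_appended_total[5m]) > 0)
  labels:
    severity: none
```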

Type of change

What type of change does your code introduce to kube-prometheus? Put an x in the boxes that apply.

  • CHANGE (fix or feature that would cause existing functionality to not work as expected)
  • FEATURE (non-breaking change which adds functionality)
  • BUGFIX (non-breaking change which fixes an issue)
  • ENHANCEMENT (non-breaking change which improves existing functionality)
  • NONE (if none of the other choices apply. Example, tooling, build system, CI, docs, etc.)

Changelog entry

Please put a one-line changelog entry below. Later this will be copied to the changelog file.

Improve Watchdog alert by checking if TSDB is up-to-date

@a-Tell a-Tell requested a review from a team as a code owner July 29, 2024 07:33
@a-Tell a-Tell marked this pull request as draft July 29, 2024 08:24
@a-Tell a-Tell marked this pull request as ready for review July 29, 2024 08:31
This expression makes the Watchdog alert fire only if the TSDB is up to date and therefore checks the functionality of the full stack. With the former `vector(1)`, only Alertmanager needs to be functional for it to fire. So in the case of a full TSDB storage, the Watchdog still fires and the lack of new metrics goes unnoticed.
@a-Tell
Author

a-Tell commented Aug 29, 2024

Hm, I honestly don't know why the Ubuntu pipeline exited with 1. Only the rule expression changed. Maybe it's an issue with the pipeline itself?

@paulfantom
Member

> Hm, I honestly don't know why the Ubuntu pipeline exited with 1

You need to run `make generate` and commit the generated code changes too.


The initial purpose of the Watchdog alert was to ensure that the general ability to send alerts is operational. For this reason, `vector(1)`, which does not touch the Prometheus TSDB, seemed like a good choice. This is somewhat confirmed by looking at the alert summary, which says nothing about Prometheus itself: *An alert that should always be firing to certify that Alertmanager is working properly.* If the runbook states something else, then I do believe it is a bug in the runbook itself.

As for testing whether Prometheus can generate alerts, I think this should be part of the prometheus mixin, and if I am correct, one of `PrometheusMissingRuleEvaluations`, `PrometheusRuleFailures`, or `PrometheusNotIngestingSamples` would fire in that case. This in turn gives the full picture: a simple Watchdog for alert sending and additional alerts for rule evaluation.
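For context, the ingestion-side check mentioned above lives in the prometheus mixin; a simplified sketch of the idea behind `PrometheusNotIngestingSamples` (not the mixin's exact rule, and with an assumed `job` selector) could look like:

```yaml
# Simplified sketch of the idea behind PrometheusNotIngestingSamples; the real
# mixin rule is more elaborate (it also checks that there are targets or rules
# that should be producing samples). The job selector is an assumption based on
# kube-prometheus defaults.
- alert: PrometheusNotIngestingSamples
  expr: |
    rate(prometheus_tsdb_head_samples_appended_total{job="prometheus-k8s"}[5m]) <= 0
  for: 10m
  labels:
    severity: warning
```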

@a-Tell
Author

a-Tell commented Sep 2, 2024

Hey @paulfantom, thanks a lot for the feedback!

In my years of experiencing "dead" Prometheus instances, I never saw one of the alerts you mentioned firing. I'm going to check on them.

And I agree with you: keeping the Watchdog alert as it is (only reporting on a down Alertmanager) and having another alert that reports an issue with rule evaluation would be a nicer solution.

Thanks again!
