Improve watchdog alert #2467

Draft
wants to merge 1 commit into
base: main

Conversation

a-Tell

@a-Tell a-Tell commented Jul 29, 2024

Description

This expression fires the Watchdog alert only if the TSDB is up to date and therefore checks the functionality of the full stack.

According to the runbook, the intention of the Watchdog alert is, quote:

> This is an alert meant to ensure that the entire alerting pipeline is functional.

`vector(1)` fires as long as the Alertmanager and Prometheus Pods are up and running. When either of them is down, the Watchdog with this expression serves its purpose as intended: its absence signals the problem.
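For reference, the existing rule is shaped roughly like this (simplified for illustration; the labels and annotations in the generated manifests may differ):

```yaml
# Roughly the shape of the current Watchdog rule (illustrative; the generated
# manifests may carry additional labels and annotations):
- alert: Watchdog
  expr: vector(1)
  labels:
    severity: none
  annotations:
    summary: >-
      An alert that should always be firing to certify that Alertmanager
      is working properly.
```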

However, when the TSDB storage runs full, the Watchdog with `vector(1)` will still fire, but all alerts whose expressions depend on a metric will no longer trigger, because those metrics are now missing. In short, a dysfunctional Prometheus stack goes completely unnoticed.

This may also happen in the (admittedly unlikely) case that the storage fills up faster than Prometheus can fire a KubePersistentVolumeFillingUp alert.
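The exact expression proposed in this PR is not shown in the conversation; as a rough sketch of the direction described above (not necessarily the PR's expression), the Watchdog could be tied to TSDB freshness by requiring that samples have been appended recently:

```yaml
# Illustrative sketch only -- not necessarily the expression this PR proposes.
# The Watchdog fires only while Prometheus has appended samples to the TSDB
# head over the last five minutes, so a Prometheus that has stopped ingesting
# (e.g. because its storage is full) also silences the Watchdog.
- alert: Watchdog
  expr: |
    vector(1) and on() (rate(prometheus_tsdb_head_samples_appended_total[5m]) > 0)
  labels:
    severity: none
```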

Type of change

What type of change does your code introduce to kube-prometheus? Put an x in the boxes that apply.

  • CHANGE (fix or feature that would cause existing functionality to not work as expected)
  • FEATURE (non-breaking change which adds functionality)
  • BUGFIX (non-breaking change which fixes an issue)
  • ENHANCEMENT (non-breaking change which improves existing functionality)
  • NONE (if none of the other choices apply. Example, tooling, build system, CI, docs, etc.)

Changelog entry

Please put a one-line changelog entry below. Later this will be copied to the changelog file.

Improve Watchdog alert by checking if TSDB is up-to-date

@a-Tell a-Tell requested a review from a team as a code owner July 29, 2024 07:33
@a-Tell a-Tell marked this pull request as draft July 29, 2024 08:24
@a-Tell a-Tell marked this pull request as ready for review July 29, 2024 08:31
This expression makes the Watchdog alert fire only if the TSDB is up to date and therefore checks the functionality of the full stack. With the former `vector(1)`, only Alertmanager needs to be functional for it to fire. So in the case of a full TSDB storage, the Watchdog still fires and the lack of new metrics goes unnoticed.
@a-Tell
Author

a-Tell commented Aug 29, 2024

Hm, I honestly don't know why the Ubuntu pipeline exited with 1. Only the rule expression changed. Maybe it's an issue with the pipeline itself?

@paulfantom
Member

> Hm, I honestly don't know why the Ubuntu pipeline exited with 1

You need to run `make generate` and commit the generated code changes too.


The initial purpose of the Watchdog alert was to ensure that the general ability to send alerts is operational. For this reason, `vector(1)`, which does not touch the Prometheus TSDB, seemed like a good choice. This is somewhat confirmed by looking at the alert summary, which says nothing about Prometheus itself: *An alert that should always be firing to certify that Alertmanager is working properly.* If the runbook states something else, then I do believe it is a bug in the runbook itself.

As for testing whether Prometheus can generate alerts, I think this should be part of the prometheus mixin, and if I am correct, one of `PrometheusMissingRuleEvaluations`, `PrometheusRuleFailures`, or `PrometheusNotIngestingSamples` would fire in that case. This in turn gives the full picture: a simple Watchdog for alert sending and additional alerts for rule evaluation.
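For context, the ingestion-side check mentioned above lives in the prometheus mixin; a simplified sketch of the idea behind `PrometheusNotIngestingSamples` (not the mixin's exact rule, and with an assumed `job` selector) could look like:

```yaml
# Simplified sketch of the idea behind PrometheusNotIngestingSamples; the real
# mixin rule is more elaborate (it also checks that there are targets or rules
# that should be producing samples). The job selector is an assumption based on
# kube-prometheus defaults.
- alert: PrometheusNotIngestingSamples
  expr: |
    rate(prometheus_tsdb_head_samples_appended_total{job="prometheus-k8s"}[5m]) <= 0
  for: 10m
  labels:
    severity: warning
```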

@a-Tell
Author

a-Tell commented Sep 2, 2024

Hey @paulfantom, thanks a lot for the feedback!

In my years of experiencing "dead" Prometheus instances, I never saw one of the alerts you mentioned firing. I'm going to check on them.

And I agree with you: keeping the Watchdog alert as it is (only reporting on a down Alertmanager) and having another alert that reports an issue with rule evaluation would be a nicer solution.

Thanks again!
