PromtailDown alert should only page when nodes are ready #1207

Merged
QuentinBisson merged 21 commits into main from page-for-missing-promtail-on-ready-nodes-only on Jul 11, 2024

Conversation

QuentinBisson
Contributor

Before adding a new alerting rule to this repository, you should consider creating SLO rules instead.
SLOs help you both improve the quality of your monitoring and reduce alert noise.


Towards: fixing an alert :(

This PR ensures that PromtailDown only pages when the node the promtail pod is scheduled on is ready. This avoids false alerts.

@giantswarm/team-atlas I will create a silence for the ongoing alert, because I have no idea why the tests are failing and I cannot read promtool syntax.

Checklist

QuentinBisson self-assigned this on Jun 3, 2024
QuentinBisson requested a review from a team as a code owner on June 3, 2024 13:32
expr: count(up{container="promtail"} == 0) by (cluster_id, installation, provider, pipeline) > 0
expr: |-
(
# List promtail pods to be able to get the node label and join with the node status to not alert if the node is not ready
Member

I feel like having another alert/inhibition for node not ready would make this expression easier to read. WDYT?

Contributor Author

It would replace the last section, yes:

    * on (node) group_left()
    (
      kube_node_status_condition{condition="Ready", status="true"} == 1
    )

But I'm really keen on using inhibition without a paging alert behind it
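
To make the shape of that last section concrete, here is a minimal sketch of a rule that only counts down promtail targets on Ready nodes. It is not the rule from this PR: it assumes the `up` series for promtail already carries a `node` label (the PR's expression lists promtail pods first to obtain that label), and the `for` duration is an illustrative value.

```yaml
# Sketch only: assumes up{container="promtail"} carries a `node` label.
- alert: PromtailDown
  expr: |-
    count(
      (
        # promtail targets that are currently down
        up{container="promtail"} == 0
      )
      # keep only the targets whose node reports Ready=true;
      # targets on nodes without a matching Ready series drop out of the result
      * on (node) group_left()
      (
        kube_node_status_condition{condition="Ready", status="true"} == 1
      )
    ) by (cluster_id, installation, provider, pipeline) > 0
  for: 30m  # illustrative value, not taken from the PR
```

The multiplication is used purely as a filter: the right-hand side is always 1, so only the label matching matters. If node names can repeat across clusters, the `on (...)` clause would probably need `cluster_id` as well to keep the match many-to-one.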

Contributor

An inhibition would be better to:

  • make the alert easier to read
  • make the "check if node is ready" part easy to share with other alerts, because PromtailDown is probably not the only alert that needs this inhibition

However, the caveats of the inhibition are:

  • you must ensure the inhibition alert fires before the inhibited alert
  • you must ensure the inhibition alert keeps firing longer than the inhibited alert
  • even when this timing is good, sometimes alerts get through the inhibition

So you may still get false positives.
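
The first two timing caveats can be sketched with rule durations. This is an assumption-laden illustration, not configuration from this repository: `NodeNotReady` and the group name are hypothetical, all durations are placeholders, and `keep_firing_for` requires Prometheus 2.42 or later.

```yaml
groups:
  - name: inhibition-timing-sketch  # hypothetical group name
    rules:
      # Inhibiting (source) alert: a short `for` so it fires first, and
      # `keep_firing_for` so it outlives the alert it inhibits.
      - alert: NodeNotReady  # hypothetical alert name
        expr: kube_node_status_condition{condition="Ready", status="true"} == 0
        for: 2m               # illustrative
        keep_firing_for: 15m  # illustrative
      # Inhibited (target) alert: a longer `for`, so the source alert
      # is already firing by the time this one would page.
      - alert: PromtailDown
        expr: up{container="promtail"} == 0
        for: 10m              # illustrative
```

Tuning the durations this way narrows the window but does not address the third caveat: an alert can still occasionally slip past the inhibition.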

Also, even if we use an inhibition, the join with the node is still required.

If we plan to do something similar for other alerts (i.e. inhibit on node not ready), I prefer the inhibition.
If this is the only alert requiring this, I don't mind having the condition hardcoded in the alert query.
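
For reference, an inhibition of this kind could look roughly like the Alertmanager fragment below. It is a sketch under assumptions: `NodeNotReady` is a hypothetical name for the inhibiting alert, and both alerts are assumed to expose `cluster_id` and `node` labels.

```yaml
# Alertmanager configuration fragment (sketch).
inhibit_rules:
  - source_matchers:
      - 'alertname = "NodeNotReady"'  # hypothetical inhibiting alert
    target_matchers:
      - 'alertname = "PromtailDown"'
    # Only inhibit when both alerts agree on these labels, which is why
    # PromtailDown still needs the join that gives it a `node` label.
    equal: ['cluster_id', 'node']
```

Because the rule is keyed on labels rather than on a specific query, the same source alert could inhibit other per-node alerts by extending `target_matchers`.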

@QuentinBisson
Contributor Author

@giantswarm/team-atlas and specifically @TheoBrigitte what do you think now that the alert is fixed?

@TheoBrigitte
Member

> @giantswarm/team-atlas and specifically @TheoBrigitte what do you think now that the alert is fixed?

Still there's no inhibition here.

@QuentinBisson
Contributor Author

Inhibition work is being done in giantswarm/prometheus-meta-operator#1679, as we will need it for all our daemonsets.

QuantumEnigmaa and others added 6 commits July 3, 2024 14:42
* add mimir support for resource usage estimation recording rules

* update mimir query

* update mimir query

* update query for mimir
* Add Atlas app-configuration alerts

* Adjust rules

* Update atlas-app-configuration alerts

* Update atlas-app-configuration alerts
* Release v4.4.2

* Update CHANGELOG.md

---------

Co-authored-by: Marie Roque <[email protected]>
Signed-off-by: QuentinBisson <[email protected]>
QuentinBisson force-pushed the page-for-missing-promtail-on-ready-nodes-only branch from 1cda66b to ebaa164 on July 3, 2024 12:47
@QuentinBisson
Contributor Author

@TheoBrigitte we have an inhibition now :)

QuentinBisson merged commit 5b14adf into main on Jul 11, 2024
7 checks passed
QuentinBisson deleted the page-for-missing-promtail-on-ready-nodes-only branch on July 11, 2024 13:55
6 participants