PromtailDown alert should only page when nodes are ready #1207
expr: count(up{container="promtail"} == 0) by (cluster_id, installation, provider, pipeline) > 0
expr: |-
(
# List promtail pods to be able to get the node label and join with the node status to not alert if the node is not ready
I feel like having another alert/inhibition for node not ready would make this expression easier. WDYT?
It would replace the last section yes:
* on (node) group_left()
(
kube_node_status_condition{condition="Ready", status="true"} == 1
)
But I'm really keen on using inhibition without a paging alert behind it
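For context, here is a minimal sketch of how the whole rule might look with that last section in place. It is not the expression merged in this PR. It assumes that `up{container="promtail"}` carries a `pod` label, that `kube_pod_info` exposes one series per pod with a `node` label, and that node names are unique within the evaluated scope, so joining on `node` alone is enough. The group name is hypothetical.

```yaml
# Hypothetical sketch, not the rule from this PR.
groups:
  - name: promtail.sketch          # hypothetical group name
    rules:
      - alert: PromtailDown
        expr: |-
          count(
            (
              # Down promtail targets; assumes `up` carries a `pod` label.
              up{container="promtail"} == 0
            )
            # Copy the `node` label over from kube_pod_info (assumed one series per pod).
            * on (pod) group_left (node)
            (
              kube_pod_info{pod=~"promtail.*"}
            )
            # Keep only pods whose node reports Ready; other series drop out of the join.
            * on (node) group_left ()
            (
              kube_node_status_condition{condition="Ready", status="true"} == 1
            )
          ) by (cluster_id, installation, provider, pipeline) > 0
```

Because `count` counts series regardless of their value, multiplying a `0` sample from `up` by the joined metrics does not change the result; the joins only filter series and add labels.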
An inhibition would be better to:
- make the alert easier to read
- make the "check if node is ready" part easy to share for other alerts, because promtail is probably not the only alert requiring this inhibition
However, the caveats of the inhibition are:
- you must ensure the inhibition alert fires before the inhibited alert
- you must ensure the inhibition alert keeps firing longer than the inhibited alert
- even when this timing is good, sometimes alerts get through the inhibition
So you may still get false positives.
Also, even if we use an inhibition, the join with node is still required.
If we plan to do something similar for some other alerts (i.e. inhibit on node not ready), I prefer the inhibition.
If this is the only alert requiring this, I don't mind having the condition hardcoded in the alert query.
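To make the trade-off concrete, here is a rough sketch of what the inhibition variant could look like with plain Prometheus and Alertmanager configuration. The alert name `NodeNotReady`, the `severity: none` label, and the choice of `equal` labels are assumptions for illustration, not the setup used in this repository. It also shows why the join with node is still needed: the inhibited alert must carry a `node` label for `equal` to match it against the source alert.

```yaml
# Prometheus rule file (sketch): the non-paging source alert.
groups:
  - name: node.inhibitions.sketch      # hypothetical group name
    rules:
      - alert: NodeNotReady            # hypothetical alert name
        expr: kube_node_status_condition{condition="Ready", status="true"} == 0
        labels:
          severity: none               # assumption: routed so that it never pages
---
# Alertmanager configuration (sketch): mute PromtailDown while the node is not ready.
inhibit_rules:
  - source_matchers:
      - alertname = NodeNotReady
    target_matchers:
      - alertname = PromtailDown
    # Both alerts must carry identical values for these labels,
    # so PromtailDown has to expose a `node` label for this to work.
    equal: [cluster_id, node]
```

The timing caveats listed above still apply: the source alert has to be firing before the target alert is evaluated for notification and has to stay firing at least as long as the target does.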
@giantswarm/team-atlas and specifically @TheoBrigitte what do you think now that the alert is fixed?
Still there's no inhibition here.
Inhibition work is being done here giantswarm/prometheus-meta-operator#1679, as we will need it for all our daemonsets.
* add mimir support for resource usage estimation recording rules * update mimir query * update mimir query * update query for mimir
* Add Atlas app-configuration alerts * Adjust rules * Update atlas-app-configuration alerts * Update atlas-app-configuration alerts
* Release v4.4.2 * Update CHANGELOG.md --------- Co-authored-by: Marie Roque
Signed-off-by: QuentinBisson
@TheoBrigitte we have an inhibition now :)
Before adding a new alerting rule to this repository, you should consider creating an SLO rule instead.
SLOs help you both increase the quality of your monitoring and reduce alert noise.
Towards: fixing an alert :(
This PR ensures PromtailDown only pages when the node promtail is scheduled on is ready. This will avoid false alerts.
@giantswarm/team-atlas I will create a silence for the ongoing alert: I have no idea why the tests are failing, and I cannot read promtool syntax.
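As a reading aid for the promtool syntax, here is a small sketch of a rule unit test for the behaviour described above: promtail is down, but its node is not ready, so no alert is expected. The rule file path, label values, and evaluation time are hypothetical, and the test assumes a `PromtailDown` rule that joins `up`, `kube_pod_info`, and `kube_node_status_condition` on `pod` and `node`; it is not taken from this repository's test suite.

```yaml
# Run with: promtool test rules <this-file>
rule_files:
  - promtail.rules.yml               # hypothetical path to the rule under test
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # `a+bxn` expands to a, a+b, ... (n+1 samples); `0+0x40` is 41 samples of 0.
      - series: 'up{container="promtail", pod="promtail-abc", cluster_id="c1"}'
        values: '0+0x40'
      - series: 'kube_pod_info{pod="promtail-abc", node="node-1"}'
        values: '1+0x40'
      # The node never becomes Ready in this scenario.
      - series: 'kube_node_status_condition{condition="Ready", status="true", node="node-1"}'
        values: '0+0x40'
    alert_rule_test:
      - alertname: PromtailDown
        eval_time: 35m
        exp_alerts: []               # empty list: no alert may be firing at eval_time
```

Changing the last series to `1+0x40` would describe the opposite case, where the node is ready and the alert is expected; the matching `exp_alerts` entry would then need to list the exact labels and annotations the rule produces.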