Add InhibitionClusterWithoutWorkerNodes for CAPA #1397
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,34 @@ | ||
{{- if eq .Values.managementCluster.provider.kind "capa" }} | ||
apiVersion: monitoring.coreos.com/v1 | ||
kind: PrometheusRule | ||
metadata: | ||
creationTimestamp: null | ||
labels: | ||
{{- include "labels.common" . | nindent 4 }} | ||
cluster_type: "management_cluster" | ||
name: capa.inhibitions.rules | ||
namespace: {{ .Values.namespace }} | ||
spec: | ||
groups: | ||
- name: capa.inhibitions | ||
rules: | ||
- alert: InhibitionClusterWithoutWorkerNodes | ||
annotations: | ||
description: '{{`Cluster ({{ $labels.cluster_id }}) has no worker nodes.`}}' | ||
expr: |- | ||
label_replace( | ||
capi_cluster_status_condition{type="ControlPlaneReady", status="True"}, | ||
"cluster_id", | ||
"$1", | ||
"name", | ||
"(.*)" | ||
) == 1 | ||
unless on (cluster_id) ( | ||
sum(capi_machinepool_spec_replicas{} > 0) by (cluster_id) | ||

Do we accept only 1 worker node?

Prometheus-agent/Alloy could run on a 1-node WC, so that's potentially OK.

Are we sure we want to use
Do you think it would make sense to use one of those instead?

The current inhibition works when the cluster has been purposely scaled down. If the cluster should have nodes but none is ready/available, I think we should get a page. It seems to me that there's quite a gap between vintage AWS and CAPA alerts, but I don't think it's Atlas's responsibility to fix it. So I took the quickest route to solving my actual issue 😅

OK, I thought you wanted the inhibition to avoid paging when the cluster was having other issues, i.e. no ready nodes, meaning there was nothing wrong with your component.

That would be ideal. But my first expectation is to not get paged when a cluster has no issues, for now 🤣

) | ||
labels: | ||
area: kaas | ||
has_worker_nodes: "false" | ||
team: phoenix | ||
topic: status | ||
{{- end }} |
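
For context, an alert like this is meant to be consumed by an Alertmanager inhibition rule keyed on the `has_worker_nodes: "false"` label. The PR does not show that side, so the fragment below is only a sketch under stated assumptions: the target label `cancel_if_cluster_has_no_workers` is a hypothetical name, and matching source and target on `cluster_id` is an assumption about how the alerts are labelled.

```yaml
# Alertmanager configuration fragment (illustrative sketch, not part of this PR).
inhibit_rules:
  # Source: the inhibition alert added in this PR, identified by its label.
  - source_matchers:
      - 'has_worker_nodes="false"'
    # Target: alerts that opt in to being muted.
    # The label name below is hypothetical.
    target_matchers:
      - 'cancel_if_cluster_has_no_workers="true"'
    # Only mute alerts that belong to the same cluster as the source alert.
    equal:
      - cluster_id
```

With a rule of this shape, a page for a cluster is muted only while `InhibitionClusterWithoutWorkerNodes` fires for the same `cluster_id`.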
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,47 @@ | ||
--- | ||
rule_files: | ||
- capa.inhibition.rules.yml | ||
|
||
tests: | ||
# Tests for `InhibitionClusterWithoutWorkerNodes` inhibition alert | ||
- interval: 1m | ||
input_series: | ||
- series: 'capi_cluster_status_condition{cluster_id="golem", cluster_type="management_cluster", name="golem", pipeline="testing", status="True", type="ControlPlaneReady"}' | ||
values: "1+0x300" | ||
- series: 'capi_machinepool_spec_replicas{cluster_id="golem", cluster_name="golem", cluster_type="management_cluster", customer="giantswarm", installation="golem", organization="giantswarm", pipeline="testing", provider="capa"}' | ||
values: "_x60 0x60 3x60" | ||
alert_rule_test: | ||
- alertname: InhibitionClusterWithoutWorkerNodes | ||
eval_time: 30m | ||
exp_alerts: | ||
- exp_labels: | ||
area: kaas | ||
cluster_id: "golem" | ||
cluster_type: "management_cluster" | ||
has_worker_nodes: "false" | ||
name: "golem" | ||
pipeline: "testing" | ||
status: "True" | ||
team: "phoenix" | ||
topic: "status" | ||
type: "ControlPlaneReady" | ||
exp_annotations: | ||
description: "Cluster (golem) has no worker nodes." | ||
- alertname: InhibitionClusterWithoutWorkerNodes | ||
eval_time: 90m | ||
exp_alerts: | ||
- exp_labels: | ||
area: kaas | ||
cluster_id: "golem" | ||
cluster_type: "management_cluster" | ||
has_worker_nodes: "false" | ||
name: "golem" | ||
pipeline: "testing" | ||
status: "True" | ||
team: "phoenix" | ||
topic: "status" | ||
type: "ControlPlaneReady" | ||
exp_annotations: | ||
description: "Cluster (golem) has no worker nodes." | ||
- alertname: InhibitionClusterWithoutWorkerNodes | ||
eval_time: 150m |
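
The last case lists no `exp_alerts`, which promtool reads as "no alert expected at this evaluation time". That matches the input series: `capi_machinepool_spec_replicas` is missing for roughly the first hour, 0 for the second, and 3 for the third, so the `unless` clause only suppresses the alert at 150m. A sketch of the same case with the expectation spelled out explicitly (equivalent, assuming the standard promtool unit-test format):

```yaml
# Third test case with the "no alert" expectation made explicit (sketch).
- alertname: InhibitionClusterWithoutWorkerNodes
  eval_time: 150m
  # Replicas are > 0 at this point, so the inhibition alert must not fire.
  exp_alerts: []
```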

Why is this for CAPA only? Otherwise it looks good.

I would think that with capi_machinepool_spec_replicas and capi_machinedeployment_spec_replicas we should be good?

Because the metrics are only available for CAPA.
Ref: https://gigantic.slack.com/archives/C02HLSDH3DZ/p1729502169569249?thread_ts=1729498455.102279&cid=C02HLSDH3DZ

Makes sense :)
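
For illustration only, the suggestion in this thread (counting replicas from both machine pools and machine deployments) could look roughly like the fragment below. It is a sketch, not the rule that was merged: it assumes `capi_machinedeployment_spec_replicas` is available on the installation and carries a `cluster_id` label, neither of which the PR confirms.

```yaml
# Hypothetical variant of the rule expression (not part of this PR).
expr: |-
  label_replace(
    capi_cluster_status_condition{type="ControlPlaneReady", status="True"},
    "cluster_id",
    "$1",
    "name",
    "(.*)"
  ) == 1
  unless on (cluster_id) (
    # A cluster counts as having workers if either source reports replicas > 0.
    sum(capi_machinepool_spec_replicas > 0) by (cluster_id)
    or
    sum(capi_machinedeployment_spec_replicas > 0) by (cluster_id)
  )
```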