Add InhibitionClusterWithoutWorkerNodes for CAPA #1397

hervenicol · 2024-10-21T16:24:58Z

Towards: https://github.com/giantswarm/giantswarm/issues/31390

This PR adds an inhibition for clusters that have no worker nodes.

Checklist

Update CHANGELOG.md
Add Unit tests
Follow Alert structure
Consider creating a dashboard (guidelines) (if it does not exist already) to help oncallers monitor the status of the issue.
Request review from oncall area, as well as team (e.g: oncall-kaas-cloud GitHub group).

QuentinBisson · 2024-10-21T19:14:40Z

helm/prometheus-rules/templates/kaas/phoenix/alerting-rules/capa.inhibition.rules.yml

+      annotations:
+        description: '{{`Cluster ({{ $labels.cluster_id }}) has no worker nodes.`}}'
+      expr: |-
+        label_replace(


Why is this for CAPA only? Otherwise it looks good.

I would think that with capi_machinepool_spec_replicas and capi_machinedeployment_spec_replicas we should be good?

Because the metric is only available for CAPA.
Ref: https://gigantic.slack.com/archives/C02HLSDH3DZ/p1729502169569249?thread_ts=1729498455.102279&cid=C02HLSDH3DZ

Makes sense :)
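For reference, the reviewer's suggestion of combining both replica metrics could be sketched as the operand below. This is only an illustration of the idea, not what was merged: it assumes that capi_machinedeployment_spec_replicas carries a cluster_id label in the same way as capi_machinepool_spec_replicas, and it does not address the CAPA-only metric mentioned in the reply.

```yaml
# Sketch of the `unless on (cluster_id)` operand only (assumption: both
# metrics expose a cluster_id label). It returns one series per cluster
# that declares at least one worker replica in either a MachinePool or a
# MachineDeployment.
expr: |-
  sum(capi_machinepool_spec_replicas{} > 0) by (cluster_id)
  or
  sum(capi_machinedeployment_spec_replicas{} > 0) by (cluster_id)
```

Because the `> 0` filter is applied before `sum`, a cluster whose pools all request zero replicas produces no series here, so an `unless` against this operand would leave that cluster on the left-hand side.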

QuentinBisson · 2024-10-21T19:15:29Z

helm/prometheus-rules/templates/kaas/phoenix/alerting-rules/capa.inhibition.rules.yml

+                "(.*)"
+            ) == 1
+        unless on (cluster_id) (
+            sum(capi_machinepool_spec_replicas{} > 0) by (cluster_id)


Do we accept clusters with only 1 worker node?

Prometheus-agent/Alloy could run on a 1-node WC, so that's potentially ok.

Are we sure we want to use capi_machinepool_spec_replicas? That's the number of replicas defined in the MachinePool spec, but it doesn't necessarily represent the current number of nodes. For the inhibition, I thought we would prefer to use the current number of replicas. There are other metrics, such as:

capi_machinepool_status_replicas: Replicas is the most recently observed number of replicas.

capi_machinepool_status_replicas_ready: The number of ready replicas for this MachinePool. A machine is considered ready when the node has been created and is "Ready".

capi_machinepool_status_replicas_available: The number of available replicas (ready for at least minReadySeconds) for this MachinePool.

Do you think it would make sense to use one of those instead?

The current inhibition applies when the cluster has been purposely scaled down.

If the cluster should have nodes but none is ready/available, I think we should get a page.
I don't think the current state of CAPI monitoring handles that case, so I'd rather have a "prometheus-agent down" alert than no alert at all.

It seems to me that there's quite a gap between vintage AWS and CAPA alerts, but I don't think it's Atlas's responsibility to fix it. So I went for the quickest way to solve my actual issue 😅

OK, I thought you wanted the inhibition to avoid paging when the cluster was having other issues (i.e. no ready nodes), meaning there was nothing wrong with your component.

That would be ideal. But my first expectation is to not get paged when a cluster has no issues, for now 🤣
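The trade-off discussed in this thread can be illustrated by comparing two candidate operands for `unless on (cluster_id)`. This is a hedged sketch: the first expression is taken from the diff excerpt, while the second is a hypothetical variant that was raised in review but not adopted, and it assumes capi_machinepool_status_replicas_ready also exposes a cluster_id label.

```yaml
# From the diff excerpt: clusters whose MachinePool spec requests workers.
# The inhibition is then active only for clusters purposely scaled to zero.
expr: |-
  sum(capi_machinepool_spec_replicas{} > 0) by (cluster_id)
---
# Hypothetical alternative (not merged): clusters with at least one ready
# worker. The inhibition would then also be active when workers are
# requested but none is ready, which would suppress pages in that case.
expr: |-
  sum(capi_machinepool_status_replicas_ready{} > 0) by (cluster_id)
```

With the spec-based operand, a cluster that should have nodes but has none ready is not inhibited, so alerts such as "prometheus-agent down" can still page, which matches the author's stated preference.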

hervenicol self-assigned this Oct 21, 2024

hervenicol requested review from a team as code owners October 21, 2024 16:24

Add InhibitionClusterWithoutWorkerNodes for CAPA

3a24c8c

hervenicol force-pushed the capa-inhibit-noworker-clusters branch from 08fe221 to 3a24c8c Compare October 21, 2024 16:41

QuentinBisson reviewed Oct 21, 2024


Merge branch 'main' into capa-inhibit-noworker-clusters

34040cb

QuentinBisson approved these changes Oct 22, 2024


hervenicol merged commit c3d9f2a into main Oct 22, 2024
7 checks passed

hervenicol deleted the capa-inhibit-noworker-clusters branch October 22, 2024 07:55


hervenicol commented Oct 21, 2024

QuentinBisson Oct 21, 2024

QuentinBisson Oct 21, 2024

hervenicol Oct 22, 2024

QuentinBisson Oct 22, 2024

QuentinBisson Oct 21, 2024

hervenicol Oct 22, 2024

fiunchinho Oct 22, 2024

hervenicol Oct 29, 2024

fiunchinho Oct 29, 2024

hervenicol Oct 29, 2024
