review phoenix alerts #1211

QuentinBisson · 2024-06-05T15:22:44Z

Before adding a new alerting rule to this repository, consider creating an SLO rule instead.
SLOs help you both improve the quality of your monitoring and reduce alert noise.

How to create an SLO rule: https://github.com/giantswarm/sloth-rules#how-to-create-a-slo
Documentation: https://intranet.giantswarm.io/docs/monitoring/slo-alerting/

This PR reviews the phoenix alerts as part of the Mimir/CAPI migration.

@giantswarm/oncall-kaas-cloud this is WIP, but I would love a first review on it

Checklist

Update CHANGELOG.md
Add Unit tests
Follow Alert structure
Consider creating a dashboard (guidelines) (if it does not exist already) to help oncallers monitor the status of the issue.
Request review from oncall area, as well as team (e.g: oncall-kaas-cloud GitHub group).

QuentinBisson · 2024-06-05T15:23:33Z

helm/prometheus-rules/templates/shared/alerting-rules/job.rules.yml

@@ -21,16 +21,3 @@ spec:
        severity: notify
        team: {{ include "providerTeam" . }}
        topic: managementcluster
-{{- if eq .Values.managementCluster.provider.kind "aws" }}


This has been moved to the aws.job.rules.yml file

QuentinBisson · 2024-06-05T15:24:24Z

helm/prometheus-rules/templates/platform/cabbage/alerting-rules/network.all.rules.yml

@@ -59,7 +59,6 @@ spec:
        cancel_if_cluster_with_scaling_nodepools: "true"
        cancel_if_outside_working_hours: {{ include "workingHoursOnly" . }}
        cancel_if_cluster_has_no_workers: "true"
-        cancel_if_nodes_down: "true"


nodes_down is never set so ...

QuentinBisson · 2024-06-05T15:25:14Z

helm/prometheus-rules/templates/shared/alerting-rules/certificate.management-cluster.rules.yml

@@ -23,13 +23,13 @@ spec:
        area: kaas
        cancel_if_outside_working_hours: "true"
        severity: page
-        team: phoenix


This should page the mc team and not phoenix right?

QuentinBisson · 2024-06-05T15:25:24Z

helm/prometheus-rules/templates/shared/alerting-rules/certificate.management-cluster.rules.yml

        topic: security
-    - alert: ManagementClusterAWSCertificateWillExpireInLessThanOneMonth
+    - alert: ManagementClusterCertificateWillExpireInLessThanOneMonth
      annotations:


This alert should also page for onprem

QuentinBisson · 2024-06-05T15:25:38Z

helm/prometheus-rules/templates/platform/atlas/alerting-rules/inhibit.oncall.rules.yml

+  groups:
+  - name: inhibit.oncall
+    rules:
+    - alert: InhibitionOutsideWorkingHours


This should not be a phoenix alert

QuentinBisson · 2024-06-05T15:25:49Z

helm/prometheus-rules/templates/kaas/turtles/alerting-rules/vertical-pod-autoscaler.rules.yml

@@ -27,5 +27,5 @@ spec:
        cancel_if_scrape_timeout: "true"
        cancel_if_outside_working_hours: "true"
        severity: page
-        team: phoenix
+        team: turtles


Changing ownership

QuentinBisson · 2024-06-05T15:26:04Z

helm/prometheus-rules/templates/kaas/turtles/alerting-rules/inhibit.kubelet.rules.yml

+  - name: inhibit.all
+    rules:
+    - alert: InhibitionKubeletDown
+      expr: label_replace(up{app="kubelet"}, "ip", "$1", "instance", "(.+):\\d+") == 0


Kubelets are turtles right?
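
A minimal sketch of how the `label_replace` expression above could be checked with `promtool test rules`. The instance value `10.0.0.1:10250` is an invented sample, not taken from this PR. Since `label_replace` regexes are fully anchored, `(.+):\d+` should capture everything before the final port into the new `ip` label.

```yaml
# Sketch only: hypothetical promtool unit test, not part of this PR.
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Invented kubelet target that is down (up == 0) for three samples.
      - series: 'up{app="kubelet", instance="10.0.0.1:10250"}'
        values: '0 0 0'
    promql_expr_test:
      # Same expression as the InhibitionKubeletDown rule in the diff above.
      - expr: label_replace(up{app="kubelet"}, "ip", "$1", "instance", "(.+):\\d+") == 0
        eval_time: 1m
        exp_samples:
          # The port is stripped from `instance` and stored in `ip`.
          - labels: 'up{app="kubelet", instance="10.0.0.1:10250", ip="10.0.0.1"}'
            value: 0
```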

QuentinBisson · 2024-06-05T15:26:52Z

If it's too much, I can try to split this out into a bunch of PRs :D

QuentinBisson · 2024-06-06T09:27:23Z

...rometheus-rules/templates/kaas/phoenix/alerting-rules/aws-load-balancer-controller.rules.yml

@@ -1,4 +1,4 @@
-{{- if eq .Values.managementCluster.provider.kind "aws" }}
+# This rule applies to vintage aws and capa workload clusters


This alert is not vintage aws only

QuentinBisson · 2024-06-06T09:28:07Z

helm/prometheus-rules/templates/kaas/phoenix/alerting-rules/aws.management-cluster.rules.yml

@@ -161,32 +163,4 @@ spec:
        severity: page
        team: phoenix
        topic: kubernetes
-    - alert: IRSATooManyErrors


This was moved to the irsa.rules.yml to avoid duplication

QuentinBisson · 2024-06-06T09:28:54Z

helm/prometheus-rules/templates/kaas/phoenix/alerting-rules/aws.workload-cluster.rules.yml

@@ -1,4 +1,4 @@
-{{- if eq .Values.managementCluster.provider.kind "aws" }}
+{{- if or (eq .Values.managementCluster.provider.kind "aws") (eq .Values.managementCluster.provider.kind "capa") }}


The review on this one is not finished. I need to move ownership to turtles

QuentinBisson · 2024-06-06T09:29:42Z

helm/prometheus-rules/templates/kaas/phoenix/alerting-rules/capa.management-cluster.rules.yml

@@ -62,18 +63,4 @@ spec:
        severity: page
        team: phoenix
        topic: kubernetes
-    - alert: IRSATooManyErrors


Was moved to the irsa.rules.yml

QuentinBisson · 2024-06-06T09:31:18Z

helm/prometheus-rules/templates/kaas/phoenix/alerting-rules/irsa.rules.yml

@@ -0,0 +1,49 @@
+# This rule applies to vintage aws or capa management clusters
+{{- if or (eq .Values.managementCluster.provider.kind "aws") (eq .Values.managementCluster.provider.kind "capa") }}


I think this does not respect the multi-provider MC setup if we run AWS WCs from an MC on another provider 🤔

@giantswarm/team-phoenix does this only run on an AWS/CAPA MC, or can it run on any other MC if we run CAPA/EKS WCs?
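
As a side note on the condition itself: Go templates allow `eq` to take more than two arguments, comparing the first against each of the others. A sketch of an equivalent, shorter guard, which is a possible simplification and not what this PR does:

```yaml
# Sketch only: same condition as the `or (eq ...) (eq ...)` form above,
# written with multi-argument `eq` (true if kind is "aws" or "capa").
{{- if eq .Values.managementCluster.provider.kind "aws" "capa" }}
# ... rule definitions would go here ...
{{- end }}
```

This does not answer the multi-provider question; it only changes how the same check is spelled.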

QuentinBisson · 2024-06-11T14:44:01Z

helm/prometheus-rules/templates/kaas/phoenix/alerting-rules/capa.management-cluster.rules.yml

-        or absent(kube_deployment_status_condition{namespace="giantswarm", condition="Available", deployment="capa-controller-manager", cluster_type="management_cluster"})
-        or absent(kube_deployment_status_condition{namespace="giantswarm", condition="Available", deployment="capa-iam-operator", cluster_type="management_cluster"})
-        or absent(kube_deployment_status_condition{namespace="giantswarm", condition="Available", deployment="irsa-operator", cluster_type="management_cluster"})
+        absent(kube_deployment_status_condition{namespace="giantswarm", condition="Available", deployment="aws-resolver-rules-operator", cluster_id="{{ .Values.managementCluster.name }}", installation="{{ .Values.managementCluster.name }}", provider="{{ .Values.managementCluster.provider.kind }}", pipeline="{{ .Values.managementCluster.pipeline }}"})


These labels ensure we can use absent safely with Mimir.
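
A minimal sketch of why the extra matchers matter, assuming series from several installations are queried together in Mimir. `absent()` returns a result only when no series matches the selector, and it builds the result labels from the equality matchers. The cluster names `mine` and `other` are invented for illustration, and the selector is trimmed to two labels.

```yaml
# Sketch only: hypothetical promtool unit test, not part of this PR.
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Only the invented installation "other" reports the deployment.
      - series: 'kube_deployment_status_condition{deployment="aws-resolver-rules-operator", cluster_id="other"}'
        values: '1 1 1'
    promql_expr_test:
      # Without a cluster matcher, the series from "other" hides the gap:
      # the selector still matches something, so absent() returns nothing.
      - expr: absent(kube_deployment_status_condition{deployment="aws-resolver-rules-operator"})
        eval_time: 1m
      # With cluster_id pinned, the missing series on "mine" is detected,
      # and the result carries the labels taken from the matchers.
      - expr: absent(kube_deployment_status_condition{deployment="aws-resolver-rules-operator", cluster_id="mine"})
        eval_time: 1m
        exp_samples:
          - labels: '{cluster_id="mine", deployment="aws-resolver-rules-operator"}'
            value: 1
```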

QuentinBisson · 2024-06-11T14:44:47Z

helm/prometheus-rules/templates/kaas/phoenix/alerting-rules/dns-operator-azure.rules.yml

@@ -23,20 +23,20 @@ spec:
            area: kaas
            cancel_if_outside_working_hours: {{include "workingHoursOnly" .}}
            severity: notify
-            team: {{include "providerTeam" .}}


@giantswarm/team-phoenix this operator can only run on a CAPZ MC, right?

QuentinBisson · 2024-06-12T12:06:00Z

Waiting for this one #1238

QuentinBisson · 2024-06-12T12:06:44Z

helm/prometheus-rules/templates/kaas/phoenix/alerting-rules/aws.node.workload-cluster.rules.yml

@@ -28,15 +29,16 @@ spec:
        severity: page
        team: phoenix
        topic: kubernetes
+    {{- end }}
    - alert: WorkloadClusterNodeUnexpectedTaintNodeWithImpairedVolumes


Impaired nodes can happen on capa/eks

QuentinBisson · 2024-06-12T12:07:22Z

helm/prometheus-rules/templates/kaas/phoenix/alerting-rules/aws.workload-cluster.rules.yml

  name: aws.workload-cluster.rules
  namespace: {{ .Values.namespace  }}
 spec:
  groups:
-  - name: aws
+  - name: aws.workload-cluster


Just making sure we're not replacing another rule group

QuentinBisson · 2024-06-12T12:07:44Z

helm/prometheus-rules/templates/kaas/phoenix/alerting-rules/aws.workload-cluster.rules.yml

    rules:
    - alert: WorkloadClusterContainerIsRestartingTooFrequentlyAWS
      annotations:
        description: '{{`Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting too often.`}}'
        opsrecipe: container-is-restarting-too-often/
-      expr: label_join(increase(kube_pod_container_status_restarts_total{container=~"aws-node.*|kiam-agent.*|kiam-server.*|cluster-autoscaler.*|ebs-plugin.*|aws-pod-identity-webhook.*|etcd-kubernetes-resources-count-exporter.*"}[1h]),"service","/","namespace","pod") > 10
+      ## TODO Review this list once all vintage installations are gone


Moved cluster-autoscaler and etcd-kubernetes-resources-count-exporter.* to turtles

QuentinBisson · 2024-06-12T12:08:06Z

helm/prometheus-rules/templates/kaas/phoenix/alerting-rules/aws.workload-cluster.rules.yml

        severity: page
        team: phoenix
-        topic: kubernetes
-    - alert: WorkloadClusterControlPlaneNodeMissingAWS


Moved to turtles

QuentinBisson · 2024-06-12T12:14:46Z

CHANGELOG.md

 - Reviewed turtles alerts labels.
 - Use `ready` replicas for Kyverno webhooks alert.
 - Sort out shared alert ownership by distributing them all to teams.
 - Review and fix phoenix alerts towards Mimir and multi-provider MCs.
-  - Move cluster-autoscaler and vpa alerts to turtles.
+  - Move core components alerts from phoenix to turtles (cluster-autoscaler, vertical-pod-autoscaler, kubelet, etcd-kubernetes-resources-count-exporter, certificates)


Improved changelog

Signed-off-by: QuentinBisson <[email protected]>

QuentinBisson · 2024-06-12T12:36:12Z

And we're done

QuentinBisson requested a review from a team June 5, 2024 15:22

QuentinBisson self-assigned this Jun 5, 2024

QuentinBisson requested a review from a team as a code owner June 5, 2024 15:22

T-Kukawka requested a review from a team June 5, 2024 15:34

This was referenced Jun 5, 2024

review-phoenix-inhibitions #1212

Merged

Reorganize the job rules and the management-cluster-certificate alerts #1213

Merged

QuentinBisson force-pushed the start-reviewing-phoenix-alerts branch 5 times, most recently from 99fe8a1 to 43d2cce Compare June 11, 2024 14:42

QuentinBisson force-pushed the start-reviewing-phoenix-alerts branch 2 times, most recently from a2ad61e to d629825 Compare June 12, 2024 09:57

QuentinBisson requested a review from a team as a code owner June 12, 2024 09:57

Gacko approved these changes Jun 12, 2024

QuentinBisson force-pushed the start-reviewing-phoenix-alerts branch 2 times, most recently from 5537873 to 697c8b2 Compare June 12, 2024 12:03

QuentinBisson force-pushed the start-reviewing-phoenix-alerts branch from 697c8b2 to 8e44ce2 Compare June 12, 2024 12:12

QuentinBisson force-pushed the start-reviewing-phoenix-alerts branch 2 times, most recently from 2d8b660 to a5eae40 Compare June 12, 2024 12:23

Review phoenix alerts

9eddb45

Signed-off-by: QuentinBisson <[email protected]>

QuentinBisson force-pushed the start-reviewing-phoenix-alerts branch from a5eae40 to 9eddb45 Compare June 12, 2024 12:29

QuentinBisson merged commit eb49861 into main Jun 12, 2024
7 checks passed

QuentinBisson deleted the start-reviewing-phoenix-alerts branch June 12, 2024 12:36
