Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

review phoenix alerts #1211

Merged
merged 1 commit into from
Jun 12, 2024
Merged

review phoenix alerts #1211

merged 1 commit into from
Jun 12, 2024

Conversation

QuentinBisson
Copy link
Contributor

Before adding a new alerting rule into this repository you should consider creating an SLO rules instead.
SLO helps you both increase the quality of your monitoring and reduce the alert noise.


This PR is a review of the phoenix alerts towards mimir/capi migration

@giantswarm/oncall-kaas-cloud this is WIP but I would love a first review on this

Checklist

@QuentinBisson QuentinBisson requested a review from a team June 5, 2024 15:22
@QuentinBisson QuentinBisson self-assigned this Jun 5, 2024
@QuentinBisson QuentinBisson requested a review from a team as a code owner June 5, 2024 15:22
@@ -21,16 +21,3 @@ spec:
severity: notify
team: {{ include "providerTeam" . }}
topic: managementcluster
{{- if eq .Values.managementCluster.provider.kind "aws" }}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been moved to the aws.job.rules.yml file

@@ -59,7 +59,6 @@ spec:
cancel_if_cluster_with_scaling_nodepools: "true"
cancel_if_outside_working_hours: {{ include "workingHoursOnly" . }}
cancel_if_cluster_has_no_workers: "true"
cancel_if_nodes_down: "true"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nodes_down is never set so ...

@@ -23,13 +23,13 @@ spec:
area: kaas
cancel_if_outside_working_hours: "true"
severity: page
team: phoenix
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should page the mc team and not phoenix right?

topic: security
- alert: ManagementClusterAWSCertificateWillExpireInLessThanOneMonth
- alert: ManagementClusterCertificateWillExpireInLessThanOneMonth
annotations:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This alert should also page for onprem

groups:
- name: inhibit.oncall
rules:
- alert: InhibitionOutsideWorkingHours
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be a phoenix alert

@@ -27,5 +27,5 @@ spec:
cancel_if_scrape_timeout: "true"
cancel_if_outside_working_hours: "true"
severity: page
team: phoenix
team: turtles
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing ownership

- name: inhibit.all
rules:
- alert: InhibitionKubeletDown
expr: label_replace(up{app="kubelet"}, "ip", "$1", "instance", "(.+):\\d+") == 0
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kubelets are turtles right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep

@QuentinBisson
Copy link
Contributor Author

If it's too much, I can try to split this out into a bunch of PRs :D

@@ -1,4 +1,4 @@
{{- if eq .Values.managementCluster.provider.kind "aws" }}
# This rule applies to vintage aws and capa workload clusters
Copy link
Contributor Author

@QuentinBisson QuentinBisson Jun 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This alert is not vintage aws only

@@ -161,32 +163,4 @@ spec:
severity: page
team: phoenix
topic: kubernetes
- alert: IRSATooManyErrors
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was moved to the irsa.rules.yml to avoid duplication

@@ -1,4 +1,4 @@
{{- if eq .Values.managementCluster.provider.kind "aws" }}
{{- if or (eq .Values.managementCluster.provider.kind "aws") (eq .Values.managementCluster.provider.kind "capa") }}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The review on this one is not finished. I need to move ownership to turtles

@@ -62,18 +63,4 @@ spec:
severity: page
team: phoenix
topic: kubernetes
- alert: IRSATooManyErrors
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was moved to the irsa.rules.yml

@@ -0,0 +1,49 @@
# This rule applies to vintage aws or capa management clusters
{{- if or (eq .Values.managementCluster.provider.kind "aws") (eq .Values.managementCluster.provider.kind "capa") }}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is not respecting the multi-provider mc things if we run aws WCs from another provider 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@giantswarm/team-phoenix tis only runs on an AWS/CAPA MC or can it run on any other MCs if we run CAPA/EKS WCs?

@QuentinBisson QuentinBisson force-pushed the start-reviewing-phoenix-alerts branch 5 times, most recently from 99fe8a1 to 43d2cce Compare June 11, 2024 14:42
or absent(kube_deployment_status_condition{namespace="giantswarm", condition="Available", deployment="capa-controller-manager", cluster_type="management_cluster"})
or absent(kube_deployment_status_condition{namespace="giantswarm", condition="Available", deployment="capa-iam-operator", cluster_type="management_cluster"})
or absent(kube_deployment_status_condition{namespace="giantswarm", condition="Available", deployment="irsa-operator", cluster_type="management_cluster"})
absent(kube_deployment_status_condition{namespace="giantswarm", condition="Available", deployment="aws-resolver-rules-operator", cluster_id="{{ .Values.managementCluster.name }}", installation="{{ .Values.managementCluster.name }}", provider="{{ .Values.managementCluster.provider.kind }}", pipeline="{{ .Values.managementCluster.pipeline }}"})
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those label ensure we can use absent safely with mimir

@@ -23,20 +23,20 @@ spec:
area: kaas
cancel_if_outside_working_hours: {{include "workingHoursOnly" .}}
severity: notify
team: {{include "providerTeam" .}}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@giantswarm/team-phoenix this operator can only run on a capz MC right?

@QuentinBisson QuentinBisson force-pushed the start-reviewing-phoenix-alerts branch 2 times, most recently from a2ad61e to d629825 Compare June 12, 2024 09:57
@QuentinBisson QuentinBisson requested a review from a team as a code owner June 12, 2024 09:57
@QuentinBisson QuentinBisson force-pushed the start-reviewing-phoenix-alerts branch 2 times, most recently from 5537873 to 697c8b2 Compare June 12, 2024 12:03
@QuentinBisson
Copy link
Contributor Author

Waiting for this one #1238

@@ -28,15 +29,16 @@ spec:
severity: page
team: phoenix
topic: kubernetes
{{- end }}
- alert: WorkloadClusterNodeUnexpectedTaintNodeWithImpairedVolumes
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impaired nodes can happen on capa/eks

name: aws.workload-cluster.rules
namespace: {{ .Values.namespace }}
spec:
groups:
- name: aws
- name: aws.workload-cluster
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just making sure we're not replacing another rule group

rules:
- alert: WorkloadClusterContainerIsRestartingTooFrequentlyAWS
annotations:
description: '{{`Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting too often.`}}'
opsrecipe: container-is-restarting-too-often/
expr: label_join(increase(kube_pod_container_status_restarts_total{container=~"aws-node.*|kiam-agent.*|kiam-server.*|cluster-autoscaler.*|ebs-plugin.*|aws-pod-identity-webhook.*|etcd-kubernetes-resources-count-exporter.*"}[1h]),"service","/","namespace","pod") > 10
## TODO Review this list once all vintage installations are gone
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved cluster-autoscaler and etcd-kubernetes-resources-count-exporter.* to turtles

severity: page
team: phoenix
topic: kubernetes
- alert: WorkloadClusterControlPlaneNodeMissingAWS
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to turtles

@QuentinBisson QuentinBisson force-pushed the start-reviewing-phoenix-alerts branch from 697c8b2 to 8e44ce2 Compare June 12, 2024 12:12
CHANGELOG.md Outdated
- Reviewed turtles alerts labels.
- Use `ready` replicas for Kyverno webhooks alert.
- Sort out shared alert ownership by distributing them all to teams.
- Review and fix phoenix alerts towards Mimir and multi-provider MCs.
- Move cluster-autoscaler and vpa alerts to turtles.
- Move core components alerts from phoenix to turtles (cluster-autoscaler, vertical-pod-autoscaler, kubelet, etcd-kubernetes-resources-count-exporter, certificates)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improved changelog

@QuentinBisson QuentinBisson force-pushed the start-reviewing-phoenix-alerts branch 2 times, most recently from 2d8b660 to a5eae40 Compare June 12, 2024 12:23
Signed-off-by: QuentinBisson <[email protected]>
@QuentinBisson QuentinBisson force-pushed the start-reviewing-phoenix-alerts branch from a5eae40 to 9eddb45 Compare June 12, 2024 12:29
@QuentinBisson
Copy link
Contributor Author

And we're done

@QuentinBisson QuentinBisson merged commit eb49861 into main Jun 12, 2024
7 checks passed
@QuentinBisson QuentinBisson deleted the start-reviewing-phoenix-alerts branch June 12, 2024 12:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants