review phoenix alerts #1211
Conversation
@@ -21,16 +21,3 @@ spec:
severity: notify
team: {{ include "providerTeam" . }}
topic: managementcluster
{{- if eq .Values.managementCluster.provider.kind "aws" }}
This has been moved to the aws.job.rules.yml file
@@ -59,7 +59,6 @@ spec:
cancel_if_cluster_with_scaling_nodepools: "true"
cancel_if_outside_working_hours: {{ include "workingHoursOnly" . }}
cancel_if_cluster_has_no_workers: "true"
cancel_if_nodes_down: "true"
nodes_down is never set so ...
@@ -23,13 +23,13 @@ spec:
area: kaas
cancel_if_outside_working_hours: "true"
severity: page
team: phoenix
This should page the MC team and not phoenix, right?
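For illustration only, a minimal sketch of what routing this page to the provider/MC team could look like, reusing the `providerTeam` helper that already appears elsewhere in this chart. This is an assumption about the fix, not the final change:

```yaml
# Sketch only: hand the page to the owning provider/MC team instead of hardcoding phoenix.
labels:
  area: kaas
  severity: page
  team: {{ include "providerTeam" . }}  # helper already used by other rules in this PR
```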
topic: security
- alert: ManagementClusterAWSCertificateWillExpireInLessThanOneMonth
- alert: ManagementClusterCertificateWillExpireInLessThanOneMonth
annotations:
This alert should also page for onprem
groups:
- name: inhibit.oncall
rules:
- alert: InhibitionOutsideWorkingHours
This should not be a phoenix alert
@@ -27,5 +27,5 @@ spec:
cancel_if_scrape_timeout: "true"
cancel_if_outside_working_hours: "true"
severity: page
team: phoenix
team: turtles
Changing ownership
- name: inhibit.all
rules:
- alert: InhibitionKubeletDown
expr: label_replace(up{app="kubelet"}, "ip", "$1", "instance", "(.+):\\d+") == 0
Kubelets are turtles right?
yep
If it's too much, I can try to split this out into a bunch of PRs :D
@@ -1,4 +1,4 @@
{{- if eq .Values.managementCluster.provider.kind "aws" }}
# This rule applies to vintage aws and capa workload clusters
This alert is not vintage aws only
@@ -161,32 +163,4 @@ spec:
severity: page
team: phoenix
topic: kubernetes
- alert: IRSATooManyErrors
This was moved to the irsa.rules.yml to avoid duplication
@@ -1,4 +1,4 @@
{{- if eq .Values.managementCluster.provider.kind "aws" }}
{{- if or (eq .Values.managementCluster.provider.kind "aws") (eq .Values.managementCluster.provider.kind "capa") }}
The review on this one is not finished. I need to move ownership to turtles
@@ -62,18 +63,4 @@ spec:
severity: page
team: phoenix
topic: kubernetes
- alert: IRSATooManyErrors
Was moved to the irsa.rules.yml
@@ -0,0 +1,49 @@
# This rule applies to vintage aws or capa management clusters
{{- if or (eq .Values.managementCluster.provider.kind "aws") (eq .Values.managementCluster.provider.kind "capa") }}
I think this does not respect the multi-provider MC setup if we run AWS WCs from another provider 🤔
@giantswarm/team-phoenix this only runs on an AWS/CAPA MC, or can it run on any other MC if we run CAPA/EKS WCs?
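For illustration of the open question above: the current guard is evaluated once per MC at render time, so the rules disappear entirely on MCs of other providers even if they host AWS/CAPA workload clusters. A hedged sketch of the two options, where `some_aws_metric` is a hypothetical metric name:

```yaml
# Option A (current): only render the rule file when the MC itself is aws/capa.
{{- if or (eq .Values.managementCluster.provider.kind "aws") (eq .Values.managementCluster.provider.kind "capa") }}
# ... rules ...
{{- end }}

# Option B (sketch): always render, and scope the query to AWS/CAPA/EKS clusters through
# labels on the metrics themselves (assuming a per-cluster provider label is available).
expr: some_aws_metric{provider=~"aws|capa|eks"} > 0
```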
Force-pushed from 99fe8a1 to 43d2cce.
or absent(kube_deployment_status_condition{namespace="giantswarm", condition="Available", deployment="capa-controller-manager", cluster_type="management_cluster"})
or absent(kube_deployment_status_condition{namespace="giantswarm", condition="Available", deployment="capa-iam-operator", cluster_type="management_cluster"})
or absent(kube_deployment_status_condition{namespace="giantswarm", condition="Available", deployment="irsa-operator", cluster_type="management_cluster"})
absent(kube_deployment_status_condition{namespace="giantswarm", condition="Available", deployment="aws-resolver-rules-operator", cluster_id="{{ .Values.managementCluster.name }}", installation="{{ .Values.managementCluster.name }}", provider="{{ .Values.managementCluster.provider.kind }}", pipeline="{{ .Values.managementCluster.pipeline }}"})
Those labels ensure we can use absent() safely with Mimir.
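Context for the comment above, as a hedged sketch rather than the exact rule: `absent()` only propagates the equality matchers written inside its selector to the series it returns, so without the templated `cluster_id`/`installation`/`provider`/`pipeline` matchers the alert would fire with no cluster identity once every cluster ships to the same Mimir tenant. The alert name below is hypothetical:

```yaml
# Sketch: the equality matchers inside absent() become the labels of the series it returns,
# so the firing alert still carries cluster_id / installation / provider / pipeline.
# Without those matchers the result would only carry
# {namespace="giantswarm", condition="Available", deployment="aws-resolver-rules-operator"}.
- alert: ExampleOperatorDeploymentMissing
  expr: |
    absent(kube_deployment_status_condition{namespace="giantswarm", condition="Available",
      deployment="aws-resolver-rules-operator",
      cluster_id="{{ .Values.managementCluster.name }}",
      installation="{{ .Values.managementCluster.name }}",
      provider="{{ .Values.managementCluster.provider.kind }}",
      pipeline="{{ .Values.managementCluster.pipeline }}"})
```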
@@ -23,20 +23,20 @@ spec:
area: kaas
cancel_if_outside_working_hours: {{include "workingHoursOnly" .}}
severity: notify
team: {{include "providerTeam" .}}
@giantswarm/team-phoenix this operator can only run on a capz MC right?
Force-pushed from a2ad61e to d629825.
Force-pushed from 5537873 to 697c8b2.
Waiting for this one #1238
@@ -28,15 +29,16 @@ spec:
severity: page
team: phoenix
topic: kubernetes
{{- end }}
- alert: WorkloadClusterNodeUnexpectedTaintNodeWithImpairedVolumes
Impaired nodes can happen on capa/eks
name: aws.workload-cluster.rules
namespace: {{ .Values.namespace }}
spec:
groups:
- name: aws
- name: aws.workload-cluster
Just making sure we're not replacing another rule group
rules:
- alert: WorkloadClusterContainerIsRestartingTooFrequentlyAWS
annotations:
description: '{{`Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting too often.`}}'
opsrecipe: container-is-restarting-too-often/
expr: label_join(increase(kube_pod_container_status_restarts_total{container=~"aws-node.*|kiam-agent.*|kiam-server.*|cluster-autoscaler.*|ebs-plugin.*|aws-pod-identity-webhook.*|etcd-kubernetes-resources-count-exporter.*"}[1h]),"service","/","namespace","pod") > 10
## TODO Review this list once all vintage installations are gone
Moved cluster-autoscaler and etcd-kubernetes-resources-count-exporter.* to turtles
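For illustration, the phoenix-side selector might shrink to something like the sketch below once the two moved containers are alerted on from the turtles rule files; the exact split is whatever landed in those files, not this snippet:

```yaml
# Sketch: phoenix keeps only the AWS networking / IAM / storage containers.
expr: label_join(increase(kube_pod_container_status_restarts_total{container=~"aws-node.*|kiam-agent.*|kiam-server.*|ebs-plugin.*|aws-pod-identity-webhook.*"}[1h]),"service","/","namespace","pod") > 10
```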
severity: page
team: phoenix
topic: kubernetes
- alert: WorkloadClusterControlPlaneNodeMissingAWS
Moved to turtles
Force-pushed from 697c8b2 to 8e44ce2.
CHANGELOG.md (outdated)
- Reviewed turtles alerts labels.
- Use `ready` replicas for Kyverno webhooks alert.
- Sort out shared alert ownership by distributing them all to teams.
- Review and fix phoenix alerts towards Mimir and multi-provider MCs.
- Move cluster-autoscaler and vpa alerts to turtles.
- Move core components alerts from phoenix to turtles (cluster-autoscaler, vertical-pod-autoscaler, kubelet, etcd-kubernetes-resources-count-exporter, certificates)
Improved changelog
Force-pushed from 2d8b660 to a5eae40.
Signed-off-by: QuentinBisson <[email protected]>
Force-pushed from a5eae40 to 9eddb45.
And we're done
Before adding a new alerting rule to this repository, consider creating SLO rules instead.
SLOs help you both increase the quality of your monitoring and reduce alert noise.
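For context, a minimal sketch of what an SLO-style alternative could look like (multi-window burn-rate alerting); every metric and rule name below is hypothetical and not defined in this repository:

```yaml
groups:
  - name: slo.example.rules
    rules:
      # Hypothetical recording rules: error ratio over a short and a long window.
      - record: slo:requests:error_ratio_rate5m
        expr: sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
      - record: slo:requests:error_ratio_rate1h
        expr: sum(rate(http_requests_total{code=~"5.."}[1h])) / sum(rate(http_requests_total[1h]))
      # Page only when the error budget of a 99.9% SLO burns ~14x too fast in both windows,
      # so pages fire on sustained breakage instead of every transient error spike.
      - alert: ExampleSLOErrorBudgetBurn
        expr: >
          slo:requests:error_ratio_rate5m > (14.4 * 0.001)
          and
          slo:requests:error_ratio_rate1h > (14.4 * 0.001)
        labels:
          severity: page
```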
This PR is a review of the phoenix alerts towards the Mimir/CAPI migration.
@giantswarm/oncall-kaas-cloud this is a WIP, but I would love a first review on this.
Checklist
(… oncall-kaas-cloud GitHub group).