Commit
Merge branch 'main' into start-reviewing-phoenix-alerts
QuentinBisson authored Jun 6, 2024
2 parents ba55ada + e583b80 commit c1999e2
Showing 29 changed files with 90 additions and 77 deletions.
14 changes: 14 additions & 0 deletions CHANGELOG.md
@@ -10,6 +10,20 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Added

- Added a new alerting rule to `falco.rules.yml` to fire an alert for XZ-backdoor.
- Added the `CiliumAPITooSlow` alert.

### Changed

- Reviewed phoenix alerts in preparation for Mimir.
- Moved cluster-autoscaler and VPA alerts to team turtles.

### Fixed

- Fixed cabbage alerts for multi-provider workload clusters (WCs).

### Removed

- Removed scrape timeout inhibition leftovers (documentation and labels).

## [4.1.2] - 2024-05-31

4 changes: 2 additions & 2 deletions README.md
@@ -302,9 +302,9 @@ In order for Alertmanager inhibition to work we need 3 elements:
- an Inhibition definition mapping source labels to target labels in the alertmanager config file
- an Alert rule with some target labels
An alert carrying a target label is inhibited whenever the condition named by that label is fulfilled. This is why target label names are usually prefixed with "cancel_if_" (e.g. "cancel_if_scrape_timeout").
An alert carrying a target label is inhibited whenever the condition named by that label is fulfilled. This is why target label names are usually prefixed with "cancel_if_" (e.g. "cancel_if_outside_working_hours").
An alert with a source label defines the conditions under which the target label takes effect. For example, if an alert with the "scrape_timeout" label fires, all other alerts carrying the corresponding target label, i.e. "cancel_if_scrape_timeout", are inhibited.
An alert with a source label defines the conditions under which the target label takes effect. For example, if an alert with the "outside_working_hours" label fires, all other alerts carrying the corresponding target label, i.e. "cancel_if_outside_working_hours", are inhibited.
This is possible thanks to the alertmanager config file stored in the Prometheus-Meta-operator which defines the target/source labels coupling.
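For illustration, a minimal Alertmanager inhibition rule coupling these labels could look like the sketch below. This is an assumption-laden example, not the actual operator-generated configuration: the syntax follows Alertmanager's `inhibit_rules`, and the `equal` clause on `cluster_id` is a guess at how alerts are scoped.

```yaml
# Sketch only: couples the source label "outside_working_hours" to the
# target label "cancel_if_outside_working_hours". The real rules are
# generated by the Prometheus-Meta-operator and may differ.
inhibit_rules:
  - source_matchers:
      - outside_working_hours="true"
    target_matchers:
      - cancel_if_outside_working_hours="true"
    # Assumption: only inhibit alerts from the same cluster as the source alert.
    equal:
      - cluster_id
```

While an alert carrying the `outside_working_hours="true"` label is firing, every alert labelled `cancel_if_outside_working_hours="true"` (and sharing the same `cluster_id`) is inhibited rather than routed to receivers.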
8 changes: 0 additions & 8 deletions helm/prometheus-rules/templates/_helpers.tpl
@@ -53,14 +53,6 @@ true
{{- end -}}
{{- end -}}

{{- define "isClusterServiceInstalled" -}}
{{ not (eq .Values.managementCluster.provider.flavor "capi") }}
{{- end -}}

{{- define "isVaultBeingMonitored" -}}
{{ not (eq .Values.managementCluster.provider.flavor "capi") }}
{{- end -}}

{{- define "isBastionBeingMonitored" -}}
{{ not (eq .Values.managementCluster.provider.flavor "capi") }}
{{- end -}}
@@ -1,3 +1,4 @@
## TODO Remove with vintage
{{- if eq .Values.managementCluster.provider.flavor "vintage" }}
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
@@ -21,7 +22,7 @@ spec:
area: kaas
cancel_if_outside_working_hours: {{ include "workingHoursOnly" . }}
severity: notify
team: {{ include "providerTeam" . }}
team: phoenix
topic: kubernetes
- alert: CalicoNodeMemoryHighUtilization
annotations:
@@ -36,6 +37,6 @@ spec:
area: kaas
cancel_if_outside_working_hours: {{ include "workingHoursOnly" . }}
severity: notify
team: {{ include "providerTeam" . }}
team: phoenix
topic: kubernetes
{{- end }}
@@ -1,4 +1,5 @@
{{- if eq (include "isClusterServiceInstalled" .) "true" }}
## TODO Remove with vintage
{{- if eq .Values.managementCluster.provider.flavor "vintage" }}
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
@@ -23,7 +24,7 @@ spec:
labels:
area: storage
severity: page
team: {{ include "providerTeam" . }}
team: phoenix
topic: managementcluster
{{- if eq .Values.managementCluster.pipeline "testing" }}
- alert: TestClusterTooOld
@@ -33,5 +34,6 @@ spec:
for: 5m
labels:
severity: notify
team: phoenix
{{- end }}
{{- end }}
@@ -1,3 +1,4 @@
## TODO Remove with vintage
{{- if eq .Values.managementCluster.provider.flavor "vintage" }}
# newer clusters don't use docker anymore
apiVersion: monitoring.coreos.com/v1
@@ -22,6 +23,6 @@ spec:
area: kaas
cancel_if_outside_working_hours: "true"
severity: page
team: {{ include "providerTeam" . }}
team: phoenix
topic: observability
{{- end }}
@@ -1,3 +1,5 @@
## TODO Remove with vintage
# This rule applies to vintage aws clusters
{{- if eq .Values.managementCluster.provider.flavor "vintage" }}
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
@@ -1,4 +1,5 @@
{{- if eq (include "isVaultBeingMonitored" .) "true" }}
## TODO Remove with vintage
{{- if eq .Values.managementCluster.provider.flavor "vintage" }}
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
@@ -23,7 +24,7 @@ spec:
area: kaas
cancel_if_outside_working_hours: {{ include "workingHoursOnly" . }}
severity: page
team: {{ include "providerTeam" . }}
team: phoenix
topic: vault
- alert: VaultIsSealed
annotations:
@@ -35,7 +36,7 @@ spec:
area: kaas
cancel_if_outside_working_hours: {{ include "workingHoursOnly" . }}
severity: page
team: {{ include "providerTeam" . }}
team: phoenix
topic: vault
- alert: ClusterServiceVaultTokenAlmostExpired
annotations:
@@ -47,7 +48,7 @@ spec:
area: kaas
cancel_if_outside_working_hours: "true"
severity: page
team: {{ include "providerTeam" . }}
team: phoenix
topic: vault
- alert: ClusterServiceVaultTokenAlmostExpiredMissing
annotations:
@@ -60,7 +61,7 @@ spec:
cancel_if_outside_working_hours: {{ include "workingHoursOnly" . }}
cancel_if_prometheus_agent_down: "true"
severity: page
team: {{ include "providerTeam" . }}
team: phoenix
topic: vault
- alert: CertOperatorVaultTokenAlmostExpired
annotations:
@@ -72,7 +73,7 @@ spec:
area: kaas
cancel_if_outside_working_hours: "true"
severity: page
team: {{ include "providerTeam" . }}
team: phoenix
topic: vault
- alert: CertOperatorVaultTokenAlmostExpiredMissing
annotations:
@@ -84,7 +85,7 @@ spec:
area: kaas
cancel_if_outside_working_hours: {{ include "workingHoursOnly" . }}
severity: page
team: {{ include "providerTeam" . }}
team: phoenix
topic: vault
- alert: VaultLatestEtcdBackupTooOld
annotations:
@@ -96,7 +97,7 @@ spec:
area: kaas
cancel_if_outside_working_hours: "true"
severity: page
team: {{ include "providerTeam" . }}
team: phoenix
topic: vault
- alert: VaultLatestEtcdBackupMetricsMissing
annotations:
@@ -108,7 +109,6 @@ spec:
area: kaas
cancel_if_outside_working_hours: "true"
severity: page
team: {{ include "providerTeam" . }}
team: phoenix
topic: vault

{{- end }}
@@ -1,4 +1,5 @@
# This rule applies to all capi management clusters
{{- if eq .Values.managementCluster.provider.flavor "capi" }}
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
@@ -19,10 +20,11 @@ spec:
description: '{{`Control plane of cluster {{ $labels.cluster_id }} is not healthy.`}}'
expr: |-
capi_kubeadmcontrolplane_status_condition{cluster_type="management_cluster", type="ControlPlaneComponentsHealthy", status="False"} == 1
or capi_kubeadmcontrolplane_status_condition{cluster_type="management_cluster", type="EtcdClusterHealthy", status="False"} == 1
or capi_kubeadmcontrolplane_status_condition{cluster_type="management_cluster", type="Available", status="False"} == 1
or capi_kubeadmcontrolplane_status_condition{cluster_type="management_cluster", type="EtcdClusterHealthy", status="False"} == 1
or capi_kubeadmcontrolplane_status_condition{cluster_type="management_cluster", type="Available", status="False"} == 1
labels:
area: kaas
cluster_control_plane_unhealthy: "true"
team: turtles
topic: status
{{- end }}
@@ -1,3 +1,4 @@
# This rule applies to all clusters
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
@@ -8,15 +9,14 @@ metadata:
namespace: {{ .Values.namespace }}
spec:
groups:
- name: inhibit.all
- name: inhibit.kubelet
rules:
- alert: InhibitionKubeletDown
annotations:
description: '{{`Kubelet ({{ $labels.instance }}) is down.`}}'
expr: label_replace(up{app="kubelet"}, "ip", "$1", "instance", "(.+):\\d+") == 0
labels:
kubelet_down: "true"
area: kaas
topic: kubernetes
team: turtles
annotations:
description: '{{`Kubelet ({{ $labels.instance }}) is down.`}}'

@@ -24,8 +24,7 @@ spec:
cancel_if_cluster_status_creating: "true"
cancel_if_cluster_status_deleting: "true"
cancel_if_cluster_status_updating: "true"
cancel_if_scrape_timeout: "true"
cancel_if_outside_working_hours: "true"
severity: page
team: turtles
topic: observability
topic: autoscaling
@@ -46,7 +46,6 @@ spec:
cancel_if_cluster_status_creating: "true"
cancel_if_cluster_status_deleting: "true"
cancel_if_cluster_status_updating: "true"
cancel_if_scrape_timeout: "true"
cancel_if_outside_working_hours: "true"
severity: page
team: atlas
@@ -63,7 +62,6 @@ spec:
cancel_if_cluster_status_creating: "true"
cancel_if_cluster_status_deleting: "true"
cancel_if_cluster_status_updating: "true"
cancel_if_scrape_timeout: "true"
cancel_if_outside_working_hours: "true"
severity: page
team: atlas
@@ -81,7 +79,6 @@ spec:
cancel_if_cluster_status_creating: "true"
cancel_if_cluster_status_deleting: "true"
cancel_if_cluster_status_updating: "true"
cancel_if_scrape_timeout: "true"
cancel_if_outside_working_hours: "true"
severity: page
team: atlas
@@ -43,7 +43,6 @@ spec:
cancel_if_cluster_status_creating: "true"
cancel_if_cluster_status_deleting: "true"
cancel_if_cluster_status_updating: "true"
cancel_if_scrape_timeout: "true"
cancel_if_cluster_has_no_workers: "true"
cancel_if_outside_working_hours: {{ include "workingHoursOnly" . }}
severity: page
@@ -21,7 +21,6 @@ spec:
cancel_if_cluster_status_creating: "true"
cancel_if_cluster_status_deleting: "true"
cancel_if_cluster_status_updating: "true"
cancel_if_scrape_timeout: "true"
cancel_if_outside_working_hours: "true"
severity: page
team: atlas
@@ -17,7 +17,7 @@ spec:
expr: avg(cilium_bpf_map_pressure) by (cluster_id, installation, pipeline, provider, map_name) * 100 > 80
for: 15m
labels:
area: managedservices
area: platform
cancel_if_outside_working_hours: "true"
severity: page
team: cabbage
@@ -29,7 +29,19 @@ spec:
expr: avg(cilium_bpf_map_pressure) by (cluster_id, installation, pipeline, provider, map_name) * 100 > 95
for: 15m
labels:
area: managedservices
area: platform
severity: page
team: cabbage
topic: cilium
- alert: CiliumAPITooSlow
annotations:
description: '{{`Cilium API processing time is >50s pod="{{ $labels.pod }}" node="{{ $labels.node }}" method="{{ $labels.method }}" path="{{ $labels.path }}"`}}'
opsrecipe: cilium-performance-issues/#slow-cilium-api
expr: avg(rate(cilium_agent_api_process_time_seconds_sum{}[5m])/rate(cilium_agent_api_process_time_seconds_count{}[5m]) > 50) by (cluster_id, node, pod, method, path, installation, pipeline, provider)
for: 20m
labels:
area: platform
cancel_if_outside_working_hours: "true"
severity: page
team: cabbage
topic: cilium
@@ -42,7 +54,7 @@ spec:
expr: max(rate(cilium_policy_change_total{outcome=~"fail.*"}[20m]) OR rate(cilium_policy_import_errors_total[20m])) by (cluster_id, installation, pipeline, provider) > 0
for: 10m
labels:
area: managedservices
area: platform
cancel_if_outside_working_hours: "true"
severity: page
team: cabbage
@@ -18,7 +18,7 @@ spec:
sum(kube_deployment_status_replicas_available{deployment=~"coredns.*"}) by (cluster_id, deployment, installation, namespace, pipeline, provider) / (sum(kube_deployment_status_replicas_available{deployment=~"coredns.*"}) by (cluster_id, deployment, installation, namespace, pipeline, provider) + sum(kube_deployment_status_replicas_unavailable{deployment=~"coredns.*"}) by (cluster_id, deployment, installation, namespace, pipeline, provider))* 100 < 51
for: 10m
labels:
area: empowerment
area: platform
cancel_if_cluster_status_creating: "true"
cancel_if_cluster_status_deleting: "true"
cancel_if_outside_working_hours: {{ include "workingHoursOnly" . }}
@@ -41,7 +41,7 @@ spec:
)
for: 120m
labels:
area: empowerment
area: platform
cancel_if_outside_working_hours: "true"
severity: page
team: cabbage
@@ -1,4 +1,3 @@
{{- if (eq .Values.managementCluster.provider.kind "aws") }}
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
@@ -18,10 +17,10 @@ spec:
annotations:
description: '{{`external-dns in namespace {{ $labels.namespace }} can''t access registry (cloud service provider DNS service).`}}'
opsrecipe: external-dns-cant-access-registry/
expr: rate(external_dns_registry_errors_total[2m]) > 0
expr: rate(external_dns_registry_errors_total{provider=~"aws|capa|capz|eks"}[2m]) > 0
for: 15m
labels:
area: managedservices
area: platform
cancel_if_cluster_status_creating: "true"
cancel_if_cluster_status_deleting: "true"
cancel_if_outside_working_hours: {{ include "workingHoursOnly" . }}
@@ -33,10 +32,10 @@ spec:
annotations:
description: '{{`external-dns in namespace {{ $labels.namespace }} can''t access source (Service or Ingress resource).`}}'
opsrecipe: external-dns-cant-access-source/
expr: rate(external_dns_source_errors_total[2m]) > 0
expr: rate(external_dns_source_errors_total{provider=~"aws|capa|capz|eks"}[2m]) > 0
for: 15m
labels:
area: managedservices
area: platform
cancel_if_cluster_status_creating: "true"
cancel_if_cluster_status_deleting: "true"
cancel_if_outside_working_hours: {{ include "workingHoursOnly" . }}
@@ -48,10 +47,10 @@ spec:
annotations:
description: '{{`external-dns in namespace {{ $labels.namespace }} is down.`}}'
opsrecipe: external-dns-down/
expr: label_replace(up{app=~"external-dns-(app|monitoring)"}, "ip", "$1.$2.$3.$4", "node", "ip-(\\d+)-(\\d+)-(\\d+)-(\\d+).*") == 0
expr: label_replace(up{app=~"external-dns-(app|monitoring)", provider=~"aws|capa|capz|eks"}, "ip", "$1.$2.$3.$4", "node", "ip-(\\d+)-(\\d+)-(\\d+)-(\\d+).*") == 0
for: 15m
labels:
area: managedservices
area: platform
cancel_if_outside_working_hours: "true"
cancel_if_cluster_status_creating: "true"
cancel_if_cluster_status_deleting: "true"
@@ -60,4 +59,3 @@ spec:
severity: page
team: cabbage
topic: external-dns
{{- end }}