Add alerts for azure cloud components HelmReleases
fiunchinho committed Nov 19, 2024
1 parent 0c2a896 commit 5d9f2a8
Showing 3 changed files with 36 additions and 3 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -10,6 +10,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Added

- Add `aws-cloud-components.rules` to monitor the AWS cloud-controller and the ebs-csi-driver.
- Add `azure-cloud-components.rules` to monitor the Azure cloud-controller and the Azure CSI drivers.

## [4.26.1] - 2024-11-19

@@ -10,12 +10,12 @@ metadata:
namespace: {{ .Values.namespace }}
spec:
groups:
- name: aws-cloud-controller-manager
- name: aws-cloud-components
rules:
- alert: FluxHelmReleaseFailed
annotations:
description: |-
{{`Flux HelmRelease {{ $labels.name }} in ns {{ $labels.exported_namespace }} on {{ $labels.installation }}/{{ $labels.cluster_id }} is stuck in Failed state.`}}
{{`Flux HelmRelease {{ $labels.name }} in ns {{ $labels.exported_namespace }} on {{ $labels.installation }}/{{ $labels.cluster_id }} is stuck in Failed state.`}}
opsrecipe: fluxcd-failing-helmrelease/
expr: gotk_reconcile_condition{type="Ready", status="False", kind="HelmRelease", cluster_type="management_cluster", exported_namespace!="flux-giantswarm", name=~".*(aws-ebs-csi-driver|cloud-provider-aws)"} > 0
for: 20m
@@ -28,5 +28,5 @@ spec:
team: phoenix
topic: managementcluster
namespace: |-
{{`{{ $labels.exported_namespace }}`}}
{{`{{ $labels.exported_namespace }}`}}
{{- end }}
@@ -0,0 +1,32 @@
{{- if eq .Values.managementCluster.provider.kind "capz" }}
# This rule applies to capz management clusters only
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
creationTimestamp: null
labels:
{{- include "labels.common" . | nindent 4 }}
name: azure-cloud-components.rules
namespace: {{ .Values.namespace }}
spec:
groups:
- name: azure-cloud-components
rules:
- alert: FluxHelmReleaseFailed
annotations:
description: |-
{{`Flux HelmRelease {{ $labels.name }} in ns {{ $labels.exported_namespace }} on {{ $labels.installation }}/{{ $labels.cluster_id }} is stuck in Failed state.`}}
opsrecipe: fluxcd-failing-helmrelease/
expr: gotk_reconcile_condition{type="Ready", status="False", kind="HelmRelease", cluster_type="management_cluster", exported_namespace!="flux-giantswarm", name=~".*(azure-cloud-controller-manager|azure-cloud-node-manager|azuredisk-csi-driver|azurefile-csi-driver)"} > 0
for: 20m
labels:
area: kaas
cancel_if_outside_working_hours: "true"
cancel_if_kube_state_metrics_down: "true"
cancel_if_monitoring_agent_down: "true"
severity: page
team: phoenix
topic: managementcluster
namespace: |-
{{`{{ $labels.exported_namespace }}`}}
{{- end }}
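
Review note: the `name=~` matcher in the new rule is fully anchored by Prometheus, so the leading `.*` allows any prefix but nothing may follow the component name. The sketch below emulates that behavior in Python with `re.fullmatch` to check which HelmRelease names would be selected. The names in `examples` are hypothetical and used for illustration only; they are not taken from a real installation.

```python
import re

# Regex copied from the `name=~` matcher in the azure-cloud-components rule.
NAME_PATTERN = (
    r".*(azure-cloud-controller-manager|azure-cloud-node-manager"
    r"|azuredisk-csi-driver|azurefile-csi-driver)"
)


def matches_alert(name: str) -> bool:
    """Return True if `name` would be selected by the alert's name matcher.

    Prometheus anchors regex matchers at both ends, so re.fullmatch is
    used here rather than re.search.
    """
    return re.fullmatch(NAME_PATTERN, name) is not None


if __name__ == "__main__":
    # Hypothetical HelmRelease names (assumptions, for illustration only).
    examples = [
        "mycluster-azuredisk-csi-driver",
        "azure-cloud-node-manager",
        "mycluster-azuredisk-csi-driver-extra",
        "mycluster-cloud-provider-aws",
    ]
    for name in examples:
        print(f"{name}: {matches_alert(name)}")
```

Under these assumptions, a release whose name carries a suffix after the component name would not trigger the alert, which may be worth keeping in mind if the naming scheme changes.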
