Releases: giantswarm/prometheus-rules
v4.3.2
Added
- Added new alerting rules to monitor the Prometheus instance that reads data from Mimir and sends it to Grafana Cloud.
- Added a recording rule to send Mimir memory usage and the number of metrics to Grafana Cloud.
v4.3.1
Changed
- Increase time in volume filled related alerts to allow node-problem-detector to shut down nodes properly.
Fixed
- Fix cert-exporter alerts to render the secret namespace and not the cert-exporter namespace in the alert description.
Removed
- Remove old kaas daemonset slos as they are now in sloth slos.
- Remove old cilium daemonset slos as they are now in sloth slos.
v4.3.0
Removed
- Remove old cloud-api slos as they are now in sloth slos.
- Remove old `Heartbeat` and `MatchingNumberOfPrometheusAndCluster` on mimir-equipped installations.
v4.2.1
Fixed
- Removed duplicate slo-target on AWS.
Changed
- Finish reviewing `turtles` alerts for multi-provider MCs and Mimir.
  - Prefix all vintage alerts with `vintage` to facilitate maintenance.
  - Fix kubelet container runtime alerts.
  - Fix pod_name label to use pod instead.
v4.2.0
Added
- Added a new alerting rule to `falco.rules.yml` to fire an alert for XZ-backdoor.
- Added `CiliumAPITooSlow`.
- Added `CODEOWNERS` files.
Changed
- Restrict `grafana-agent-rules` CiliumNetworkPolicy.
- Use `ready` replicas for Kyverno webhooks alert.
- Sort out shared alert ownership by distributing them all to teams.
- Review and fix phoenix alerts towards Mimir and multi-provider MCs.
  - Move core components alerts from phoenix to turtles (`cluster-autoscaler`, `vertical-pod-autoscaler`, `kubelet`, `etcd-kubernetes-resources-count-exporter`, `certificates`)
  - Split the phoenix job alert into 2:
    - Add the aws specific job alerts in the `vintage.aws.management-cluster.rules` file.
    - Move the rest of `job.rules` to turtles because it is provider independent
  - Prefix all vintage alerts with `vintage` to facilitate maintenance.
  - Merge `kiam` and `inhibit.kiam` into one file.
  - Support any AWS WC in the aws-load-balancer-controller alerts.
  - Create a shared IRSA alerts rule file to avoid duplication between capa and vintage aws.
- Review and fix cabbage alerts for multi-provider MCs and Mimir.
- Review and fix shield alerts for multi-provider MCs and Mimir.
- Review and fix honeybadger alerts for multi-provider MCs and Mimir.
- Review and fix bigmac alerts for multi-provider MCs and Mimir.
  - Fix `ManagementClusterDexAppMissing` use of absent for mimir.
  - Update team bigmac rules based on the label changes
- Review and fix atlas alerts for multi-provider MCs and Mimir.
  - Fix alerts using absent metrics for Mimir.
- Review and fix turtles alerts for multi-provider MCs and Mimir.
  - Fix alerts using absent metrics for Mimir.
  - Reviewed turtles alerts labels.
Fixed
- Fixed usage of yq and jq in check-opsrecipes.sh
- Fetch jq with make install-tools
- Fixed and improved the check-opsrecipes.sh script to support /_index.md based ops-recipes.
- Fixed all area alert labels.
- Fixed `cert-exporter` alerts to page on all providers.
- Fixed `cilium` SLO recording rule, setting a proper threshold for the alert.
Removed
- cleanup: get rid of microendpoint alerts as they never fired and probably never will
- cleanup: remove scrape timeout inhibition leftovers (documentation and labels)
v4.1.2
Changed
- Updated `ContainerdVolumeSpaceTooLow`, `KubeletVolumeSpaceTooLow` and `LogVolumeSpaceTooLow` alerts to not trigger when the node-problem-detector is already remediating the issue.
v4.1.1
Changed
- Get rid of the `app`, `role` and `node` external labels in Atlas rules.
v4.1.0
Added
- Add `aggregation:capi_infrastructure_crd_versions` metric to Grafana Cloud.
Fixed
- Fix remaining pint issues.
Removed
- Remove api-server from old SLO framework.
v4.0.0
Changed
- ! Breaking change !
- Folder architecture for rules changed to fit with areas and teams for a better overview (giantswarm/giantswarm#30769)
Added
- Add new alert to detect old and new prometheus-operator kubelet services in the same cluster (giantswarm/giantswarm#30888).
v3.15.0
Removed
- Remove atlas old slo alerts in favor of sloth alerts.
Changed
- pint tests: run automatically on CI. Also, target names have changed.
Fixed
- Fix node load alerts for CAPI clusters.
- Remove trailing spaces in rules.
- Fix `KubeStateMetricsNotRetrievingMetrics` on mimir.
- Fix nginx ingress controller opsrecipe link.