Commit

Merge branch 'master' into cert-manager-handover-
ubergesundheit authored Sep 27, 2023
2 parents 539e7fb + fe1411d commit babbd48
Showing 31 changed files with 588 additions and 195 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/alert_tests.yaml
Original file line number Diff line number Diff line change
@@ -7,15 +7,15 @@ jobs:
promtool-unit-tests:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4
with:
fetch-depth: "0"
- name: run promtool unit tests
run: make test-rules
inhibition-tests:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4
with:
fetch-depth: "0"
- name: run inhibition tests
2 changes: 1 addition & 1 deletion .github/workflows/zz_generated.check_values_schema.yaml
@@ -1,6 +1,6 @@
# DO NOT EDIT. Generated with:
#
# devctl@6.5.0
# devctl@6.9.0
#
name: 'Values and schema'
on:
12 changes: 6 additions & 6 deletions .github/workflows/zz_generated.create_release.yaml
@@ -1,6 +1,6 @@
# DO NOT EDIT. Generated with:
#
# devctl@6.5.0
# devctl@6.9.0
#
name: Create Release
on:
@@ -15,7 +15,7 @@ on:
jobs:
debug_info:
name: Debug info
runs-on: ubuntu-20.04
runs-on: ubuntu-22.04
steps:
- name: Print github context JSON
run: |
@@ -24,7 +24,7 @@ jobs:
EOF
gather_facts:
name: Gather facts
runs-on: ubuntu-20.04
runs-on: ubuntu-22.04
outputs:
project_go_path: ${{ steps.get_project_go_path.outputs.path }}
ref_version: ${{ steps.ref_version.outputs.refversion }}
@@ -84,7 +84,7 @@ jobs:
echo "refversion=${refversion}" >> $GITHUB_OUTPUT
update_project_go:
name: Update project.go
runs-on: ubuntu-20.04
runs-on: ubuntu-22.04
if: ${{ needs.gather_facts.outputs.version != '' && needs.gather_facts.outputs.project_go_path != '' && needs.gather_facts.outputs.ref_version != 'true' }}
needs:
- gather_facts
@@ -146,7 +146,7 @@ jobs:
hub pull-request -f -m "${{ env.title }}" -b ${{ env.base }} -h ${{ env.branch }} -r ${{ github.actor }}
create_release:
name: Create release
runs-on: ubuntu-20.04
runs-on: ubuntu-22.04
needs:
- gather_facts
if: ${{ needs.gather_facts.outputs.version }}
@@ -194,7 +194,7 @@ jobs:

create-release-branch:
name: Create release branch
runs-on: ubuntu-20.04
runs-on: ubuntu-22.04
needs:
- gather_facts
if: ${{ needs.gather_facts.outputs.version }}
8 changes: 4 additions & 4 deletions .github/workflows/zz_generated.create_release_pr.yaml
@@ -1,6 +1,6 @@
# DO NOT EDIT. Generated with:
#
# devctl@6.5.0
# devctl@6.9.0
#
name: Create Release PR
on:
@@ -30,7 +30,7 @@ on:
jobs:
debug_info:
name: Debug info
runs-on: ubuntu-20.04
runs-on: ubuntu-22.04
steps:
- name: Print github context JSON
run: |
@@ -39,7 +39,7 @@ jobs:
EOF
gather_facts:
name: Gather facts
runs-on: ubuntu-20.04
runs-on: ubuntu-22.04
outputs:
repo_name: ${{ steps.gather_facts.outputs.repo_name }}
branch: ${{ steps.gather_facts.outputs.branch }}
@@ -136,7 +136,7 @@ jobs:
fi
create_release_pr:
name: Create release PR
runs-on: ubuntu-20.04
runs-on: ubuntu-22.04
needs:
- gather_facts
if: ${{ needs.gather_facts.outputs.skip != 'true' }}
2 changes: 1 addition & 1 deletion .github/workflows/zz_generated.gitleaks.yaml
@@ -1,6 +1,6 @@
# DO NOT EDIT. Generated with:
#
# devctl@6.5.0
# devctl@6.9.0
#
name: gitleaks

73 changes: 71 additions & 2 deletions CHANGELOG.md
@@ -11,6 +11,68 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

- Handover cert-manager alerts to BigMac

## [2.134.1] - 2023-09-26

### Fixed

- Improve InhibitionClusterIsNotRunningPrometheusAgent to keep paging if the kube-state-metrics metric is missing for 5 minutes (avoid flapping of inhibitions).

## [2.134.0] - 2023-09-21

### Changed

- Split `KubeStateMetricsDown` alert into two alerts: `KubeStateMetricsDown` and `KubeStateMetricsNotRetrievingMetrics`

## [2.133.0] - 2023-09-19

### Changed

- Add missing prometheus-agent inhibition to `KubeStateMetricsDown` alert
- Change time duration before `ManagementClusterDeploymentMissingAWS` pages because it is dependent on the `PrometheusAgentFailing` alert.

### Fixed

- Remove `cancel_if_outside_working_hours` from PrometheusAgent alerts.

## [2.132.0] - 2023-09-15

### Changed

- `PrometheusAgentFailing` and `PrometheusAgentShardsMissing`: keep alerts for 5min after it's solved

## [2.131.0] - 2023-09-12

### Changed

- Remove `DNSRequestDurationTooSlow` in favor of SLO alerting.

## [2.130.0] - 2023-09-12

### Changed

- Refactor the Kyverno policy reports recording rule to include missing apps from Team Overview dashboard.
- Change `ClusterUnhealthyPhase` severity to page, so that we get paged when a cluster is not working properly.

## [2.129.0] - 2023-09-11

### Changed

- Unit tests for `PrometheusAgentShardsMissing`
- Fixes for `PrometheusAgentShardsMissing`

## [2.128.0] - 2023-09-05

### Added

- Unit tests for KubeStateMetricsDown

### Changed

- Loki alerts only during working hours
- `PrometheusAgentFailing` does not rely on KSM metrics anymore
- Prometheus-agent inhibition rework, run on the MC
- `ManagementClusterApp` alerts now check for default catalog as well

## [2.127.0] - 2023-08-21

### Changed
@@ -133,7 +195,6 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [2.115.0] - 2023-07-20


### Added

- New alert `KubeStateMetricsSlow` that inhibits KSM related alerts.
@@ -2135,7 +2196,15 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

- Add existing rules from https://github.com/giantswarm/prometheus-meta-operator/pull/637/commits/bc6a26759eb955de92b41ed5eb33fa37980660f2

[Unreleased]: https://github.com/giantswarm/prometheus-rules/compare/v2.127.0...HEAD
[Unreleased]: https://github.com/giantswarm/prometheus-rules/compare/v2.134.1...HEAD
[2.134.1]: https://github.com/giantswarm/prometheus-rules/compare/v2.134.0...v2.134.1
[2.134.0]: https://github.com/giantswarm/prometheus-rules/compare/v2.133.0...v2.134.0
[2.133.0]: https://github.com/giantswarm/prometheus-rules/compare/v2.132.0...v2.133.0
[2.132.0]: https://github.com/giantswarm/prometheus-rules/compare/v2.131.0...v2.132.0
[2.131.0]: https://github.com/giantswarm/prometheus-rules/compare/v2.130.0...v2.131.0
[2.130.0]: https://github.com/giantswarm/prometheus-rules/compare/v2.129.0...v2.130.0
[2.129.0]: https://github.com/giantswarm/prometheus-rules/compare/v2.128.0...v2.129.0
[2.128.0]: https://github.com/giantswarm/prometheus-rules/compare/v2.127.0...v2.128.0
[2.127.0]: https://github.com/giantswarm/prometheus-rules/compare/v2.126.1...v2.127.0
[2.126.1]: https://github.com/giantswarm/prometheus-rules/compare/v2.126.0...v2.126.1
[2.126.0]: https://github.com/giantswarm/prometheus-rules/compare/v2.125.0...v2.126.0
2 changes: 1 addition & 1 deletion Makefile
@@ -1,6 +1,6 @@
# DO NOT EDIT. Generated with:
#
# devctl@6.5.0
# devctl@6.9.0
#

include Makefile.*.mk
2 changes: 1 addition & 1 deletion Makefile.gen.app.mk
@@ -1,6 +1,6 @@
# DO NOT EDIT. Generated with:
#
# devctl@6.5.0
# devctl@6.9.0
#

##@ App
4 changes: 2 additions & 2 deletions helm/prometheus-rules/templates/alerting-rules/app.rules.yml
@@ -15,7 +15,7 @@ spec:
annotations:
description: '{{`Management Cluster App {{ $labels.name }}, version {{ $labels.version }} is {{if $labels.status }} in {{ $labels.status }} state. {{else}} not installed. {{end}}`}}'
opsrecipe: app-failed/
expr: app_operator_app_info{status!~"(?i:(deployed|cordoned))", catalog=~"control-plane-.*",team!~"^$|noteam"}
expr: app_operator_app_info{status!~"(?i:(deployed|cordoned))", catalog=~"(control-plane-.*|default)",team!~"^$|noteam", namespace=~".*giantswarm"}
for: 30m
labels:
area: managedservices
@@ -30,7 +30,7 @@ spec:
annotations:
description: 'Current version of {{`App {{ $labels.name }} is {{ $labels.deployed_version }} but it should be {{ $labels.version }}.`}}'
opsrecipe: app-pending-update/
expr: app_operator_app_info{catalog=~"control-plane-.*", deployed_version!="", status="deployed", version_mismatch="true" ,team!~"^$|noteam"}
expr: app_operator_app_info{catalog=~"(control-plane-.*|default)", deployed_version!="", status="deployed", version_mismatch="true" ,team!~"^$|noteam", namespace=~".*giantswarm"}
for: 40m
labels:
area: managedservices
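
The widened `catalog` matcher in the first changed expression above depends on Prometheus anchoring regex matchers at both ends, so `default` matches only a catalog named exactly `default`. The following is a minimal Python sketch of that selection logic, not the project's own tooling: the `selected` helper and the sample label sets are hypothetical, `re.fullmatch` stands in for Prometheus' anchored matching, and the `namespace` matcher is deliberately left out.

```python
import re

# Regexes copied from the new expression. Prometheus anchors regex matchers
# at both ends; re.fullmatch mimics that here (assumption of this sketch).
CATALOG = re.compile(r"(control-plane-.*|default)")
HEALTHY_STATUS = re.compile(r"(?i:(deployed|cordoned))")
NO_TEAM = re.compile(r"^$|noteam")


def selected(labels: dict) -> bool:
    """Return True if a series with these labels would pass the catalog,
    status and team matchers. A missing label is treated as an empty string,
    as Prometheus does."""
    return (
        CATALOG.fullmatch(labels.get("catalog", "")) is not None
        and HEALTHY_STATUS.fullmatch(labels.get("status", "")) is None
        and NO_TEAM.fullmatch(labels.get("team", "")) is None
    )


if __name__ == "__main__":
    # Hypothetical sample series.
    print(selected({"catalog": "default", "status": "failed", "team": "atlas"}))
    print(selected({"catalog": "default-test", "status": "failed", "team": "atlas"}))
```

Under these assumptions, an app from the `default` catalog in a non-deployed state is selected, while a catalog that merely starts with `default` is not.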
Original file line number Diff line number Diff line change
@@ -149,7 +149,7 @@ spec:
description: '{{`Deployment {{ $labels.deployment }} is missing.`}}'
opsrecipe: management-cluster-deployment-is-missing/
expr: absent(kube_deployment_status_condition{namespace="giantswarm", condition="Available", deployment="aws-admission-controller"})
for: 5m
for: 15m
labels:
area: kaas
cancel_if_prometheus_agent_down: "true"
Original file line number Diff line number Diff line change
@@ -15,7 +15,7 @@ spec:
labels:
area: kaas
cancel_if_outside_working_hours: {{include "workingHoursOnly" .}}
severity: notify
severity: page
team: {{include "providerTeam" .}}
topic: managementcluster
annotations:
14 changes: 0 additions & 14 deletions helm/prometheus-rules/templates/alerting-rules/coredns.rules.yml
@@ -35,17 +35,3 @@ spec:
topic: dns
annotations:
description: '{{`CoreDNS Deployment {{ $labels.namespace}}/{{ $labels.deployment }} has been scaled to its maximum replica count for too long.`}}'
- alert: DNSRequestDurationTooSlow
expr: histogram_quantile(0.99, sum(irate(coredns_dns_request_duration_seconds_bucket{app="coredns"}[5m])) by (le)) > 1
for: 15m
labels:
area: empowerment
severity: page
team: cabbage
topic: dns
annotations:
description: '{{`CoreDNS requests are taking more than 1 second to be responded.`}}'
opsrecipe: dns-request-duration-too-slow/
dashboard: Yu9tkufmk/dns


Original file line number Diff line number Diff line change
@@ -12,14 +12,31 @@ spec:
- name: inhibit.prometheus-agent
rules:
# this inhibition fires when a cluster is not running prometheus-agent.
# If we have prometheus-agent statefulset, it means prometheus-agent is installed on this cluster
# so, raise an inhibition unless prometheus-agent runs on the cluster
# we retrieve the list of existing cluster IDs from `kube_namespace_created`
# excluding the MC's one, because it's always using prometheus-agent and namespace is not named after cluster name
# then compare it with the list of deployed prometheus-agents from `app_operator_app_info`
#
# Will produce data (and inhibitions) on MC/WC.
# Will only produce data (and inhibitions) on MC because it's where app_operator is running
# but that's enough to have the inhibitions on the installation-global alertmanager
- alert: InhibitionClusterIsNotRunningPrometheusAgent
annotations:
description: '{{`Cluster ({{ $labels.cluster_id }}) is not running Prometheus Agent.`}}'
expr: (count by (cluster_id) (prometheus_build_info{app="prometheus"}) unless count by (cluster_id) (kube_statefulset_created{namespace="kube-system",statefulset=~"prometheus-prometheus-agent.*"} > 0))
expr: |-
count(
label_replace(
sum_over_time(
kube_namespace_created{namespace!="{{ .Values.managementCluster.name }}-prometheus", namespace=~".+-prometheus"}[5m]
), "cluster_id", "$1", "namespace", "(.+)-prometheus"
)
) by (cluster_id)
unless
count(
label_replace(
sum_over_time(
app_operator_app_info{app="prometheus-agent"}[5m]
), "cluster_id", "$1", "namespace", "(.*)"
)
) by (cluster_id)
labels:
cluster_is_not_running_prometheus_agent: "true"
area: empowerment
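
The new `InhibitionClusterIsNotRunningPrometheusAgent` expression above amounts to a set difference: cluster IDs derived from `<cluster>-prometheus` namespaces, minus cluster IDs that have a `prometheus-agent` app. The following Python sketch illustrates only that idea. The function name `clusters_without_agent` and the sample values are hypothetical, and the 5-minute `sum_over_time` window, which keeps a series present shortly after it disappears, is not modelled.

```python
import re


def clusters_without_agent(namespaces, agent_app_namespaces, mc_name):
    """Sketch of the rule's logic.

    namespaces: namespace names seen in kube_namespace_created.
    agent_app_namespaces: namespaces of app_operator_app_info series
        with app="prometheus-agent" (the rule copies them into cluster_id).
    mc_name: management cluster name, whose namespace is excluded.
    """
    expected = set()
    for ns in namespaces:
        if ns == f"{mc_name}-prometheus":
            continue  # the MC namespace is excluded by the != matcher
        match = re.fullmatch(r"(.+)-prometheus", ns)
        if match:
            # label_replace(..., "cluster_id", "$1", "namespace", "(.+)-prometheus")
            expected.add(match.group(1))
    running = set(agent_app_namespaces)
    # PromQL "unless": keep left-hand cluster IDs with no right-hand match.
    return expected - running


if __name__ == "__main__":
    # Hypothetical installation with one MC and two workload clusters.
    print(clusters_without_agent(
        ["mc1-prometheus", "wc1-prometheus", "wc2-prometheus", "kube-system"],
        ["wc1"],
        "mc1",
    ))
```

With this sample data only `wc2` remains, which corresponds to the cluster the inhibition alert would fire for.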
Original file line number Diff line number Diff line change
@@ -10,6 +10,34 @@ spec:
groups:
- name: kube-state-metrics
rules:
- alert: KubeStateMetricsDown
annotations:
description: '{{`KubeStateMetrics ({{ $labels.instance }}) is down.`}}'
opsrecipe: kube-state-metrics-down/
expr: |-
(
# modern clusters
label_replace(up{app="kube-state-metrics",instance=~".*:8080"}, "ip", "$1.$2.$3.$4", "node", "ip-(\\d+)-(\\d+)-(\\d+)-(\\d+).*") == 0 or absent(up{app="kube-state-metrics",instance=~".*:8080"} == 1)
)
and
(
# vintage clusters without servicemonitor
label_replace(up{app="kube-state-metrics",container=""}, "ip", "$1.$2.$3.$4", "node", "ip-(\\d+)-(\\d+)-(\\d+)-(\\d+).*") == 0 or absent(up{app="kube-state-metrics",container=""} == 1)
)
for: 15m
labels:
area: kaas
cancel_if_apiserver_down: "true"
cancel_if_cluster_status_creating: "true"
cancel_if_cluster_status_deleting: "true"
cancel_if_cluster_has_no_workers: "true"
inhibit_kube_state_metrics_down: "true"
cancel_if_prometheus_agent_down: "true"
cancel_if_kubelet_down: "true"
cancel_if_outside_working_hours: "false"
severity: page
team: atlas
topic: observability
- alert: KubeStateMetricsSlow
annotations:
description: '{{`KubeStateMetrics ({{ $labels.instance }}) is too slow.`}}'
@@ -28,6 +56,27 @@ spec:
severity: page
team: atlas
topic: observability
- alert: KubeStateMetricsNotRetrievingMetrics
annotations:
description: '{{`KubeStateMetrics ({{ $labels.instance }}) is not retrieving metrics.`}}'
opsrecipe: kube-state-metrics-down/
expr: |-
# When it looks up but we don't have metrics
count({app="kube-state-metrics"}) < 10
for: 20m
labels:
area: kaas
cancel_if_apiserver_down: "true"
cancel_if_cluster_status_creating: "true"
cancel_if_cluster_status_deleting: "true"
cancel_if_cluster_has_no_workers: "true"
inhibit_kube_state_metrics_down: "true"
cancel_if_kubelet_down: "true"
cancel_if_kube_state_metrics_down: "true"
cancel_if_outside_working_hours: "true"
severity: page
team: atlas
topic: observability
- alert: KubeConfigMapCreatedMetricMissing
annotations:
description: '{{`kube_configmap_created metric is missing for cluster {{ $labels.cluster_id }}.`}}'
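
The `KubeStateMetricsDown` expression above uses `label_replace` to derive an `ip` label from AWS-style node names such as `ip-10-0-5-12...`. The following Python sketch shows the behaviour that call relies on. It is an illustration under assumptions, not Prometheus' implementation: the node names are hypothetical, and Prometheus' handling of an empty replacement value is ignored.

```python
import re


def label_replace(labels, dst, replacement, src, regex):
    """Mimic PromQL label_replace for a single label set.

    If regex fully matches labels[src], write the expanded replacement
    ($1, $2, ... refer to capture groups) into labels[dst]. Otherwise the
    label set is returned unchanged.
    """
    match = re.fullmatch(regex, labels.get(src, ""))
    if match is None:
        return dict(labels)
    value = re.sub(
        r"\$(\d+)",
        lambda ref: match.group(int(ref.group(1))),
        replacement,
    )
    result = dict(labels)
    result[dst] = value
    return result


if __name__ == "__main__":
    # Same regex as in the rule, written without PromQL string escaping.
    node_regex = r"ip-(\d+)-(\d+)-(\d+)-(\d+).*"
    series = {"app": "kube-state-metrics", "node": "ip-10-0-5-12.eu-west-1.compute.internal"}
    print(label_replace(series, "ip", "$1.$2.$3.$4", "node", node_regex))
```

A node name that does not follow the `ip-a-b-c-d` pattern leaves the series untouched, so no `ip` label is added in that case.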