prometheus: add builtin alertmanager, labels.level for builtin alerts #784

Merged 2 commits on Jul 8, 2020
base/prometheus/prometheus.ConfigMap.yaml (23 additions, 0 deletions)

@@ -13,6 +13,10 @@ data:
- source_labels: [__meta_kubernetes_service_name]
regex: alertmanager
action: keep
# bundled alertmanager, started by prom-wrapper
- static_configs:
- targets: ['127.0.0.1:9093']
path_prefix: /alertmanager

rule_files:
- '*_rules.yml'
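Note (not part of this diff): the `path_prefix: /alertmanager` and the `127.0.0.1:9093` target only line up if the bundled Alertmanager is started with a matching loopback listen address and route prefix. A minimal sketch of the assumed invocation, written as an args list; the real flags live in the prom-wrapper, which this change does not touch, so treat every value here as an illustration:

```yaml
# Hypothetical args for the bundled Alertmanager (assumption, not in this PR):
alertmanager_args:
  - '--config.file=/alertmanager/alertmanager.yml'  # illustrative path
  - '--web.listen-address=127.0.0.1:9093'           # matches the static target above
  - '--web.route-prefix=/alertmanager'              # matches path_prefix and metrics_path
```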
@@ -226,6 +230,11 @@ data:
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name

- job_name: 'builtin-alertmanager'
metrics_path: /alertmanager/metrics
static_configs:
- targets: ['127.0.0.1:9093']
alert_rules.yml: |
groups:
- name: alert.rules
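Aside (not added by this PR): the new `builtin-alertmanager` scrape job also makes it easy to alert on the bundled Alertmanager itself going down. A hypothetical rule, with an invented name and thresholds, that could sit in the same group:

```yaml
# Hypothetical rule using the new scrape job (assumption, not part of this change):
- alert: BuiltinAlertmanagerDown
  expr: up{job="builtin-alertmanager"} == 0
  for: 5m
  labels:
    level: critical
  annotations:
    help: Alerts when the bundled Alertmanager stops responding to scrapes.
    summary: builtin Alertmanager on 127.0.0.1:9093 is down
```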
@@ -234,6 +243,7 @@
expr: app:up:ratio{app!=""} < 0.9
for: 10m
labels:
level: critical
severity: page
annotations:
description: 'Pods missing from {{`{{`}} $labels.app {{`}}`}}: {{`{{`}} $value
@@ -244,6 +254,7 @@
expr: app:up:ratio{app!=""} < 0.1
for: 2m
labels:
level: critical
severity: page
annotations:
description: 'No pods are running for {{`{{`}} $labels.app {{`}}`}}: {{`{{`}}
@@ -254,6 +265,7 @@
expr: histogram_quantile(0.9, sum by(le) (rate(src_http_request_duration_seconds_bucket{job="sourcegraph-frontend",route!="raw"}[10m])))
> 20
labels:
level: critical
severity: page
annotations:
description: 'Page load latency > 20s (90th percentile over all routes; current
@@ -263,6 +275,8 @@
- alert: GoroutineLeak
expr: go_goroutines >= 10000
for: 10m
labels:
level: warn
annotations:
description: '{{`{{`}} $labels.app {{`}}`}} has more than 10k goroutines. This
is probably a regression causing a goroutine leak'
@@ -271,6 +285,7 @@
- alert: FSINodesRemainingLow
expr: sum by(instance) (container_fs_inodes_total{pod_name!=""}) > 3e+06
labels:
level: critical
severity: page
annotations:
description: '{{`{{`}}$labels.instance{{`}}`}} is using {{`{{`}}humanize $value{{`}}`}}
@@ -279,13 +294,16 @@
summary: '{{`{{`}}$labels.instance{{`}}`}} remaining fs inodes is low'
- alert: DiskSpaceLow
expr: node:k8snode_filesystem_avail_bytes:ratio < 0.1
labels:
level: warn
annotations:
help: Alerts when a node has less than 10% available disk space.
summary: '{{`{{`}}$labels.exported_name{{`}}`}} has less than 10% available
disk space'
- alert: DiskSpaceLowCritical
expr: node:k8snode_filesystem_avail_bytes:ratio{exported_name=~".*prod.*"} < 0.05
labels:
level: critical
severity: page
annotations:
help: Alerts when a node has less than 5% available disk space.
@@ -299,19 +317,24 @@
- alert: GitserverDiskSpaceLowCritical
expr: src_gitserver_disk_space_available / src_gitserver_disk_space_total < 0.05
labels:
level: critical
severity: page
annotations:
help: Alerts when gitserver disk space is critically low.
summary: Critical! gitserver {{`{{`}}$labels.instance{{`}}`}} disk space is less than 5% of available disk space
- alert: SearcherErrorRatioTooHigh
expr: searcher_errors:ratio10m > 0.1
for: 20m
labels:
level: warn
annotations:
help: Alerts when the search service has more than 10% of requests failing.
summary: Error ratio exceeds 10%
- alert: PrometheusMetricsBloat
expr: http_response_size_bytes{handler="prometheus",job!="kubernetes-apiservers",job!="kubernetes-nodes",quantile="0.5"}
> 20000
labels:
level: warn
annotations:
help: Alerts when a service is probably leaking metrics (unbounded attribute).
summary: '{{`{{`}}$labels.job{{`}}`}} in {{`{{`}}$labels.ns{{`}}`}} is probably
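Follow-up context (not included in this PR): the `level` label (critical / warn) added to every builtin rule gives the bundled Alertmanager a uniform label to route on, independent of the older `severity: page` convention. A hypothetical `alertmanager.yml` fragment, with invented receiver names, showing how such routing could look:

```yaml
# Hypothetical Alertmanager routing on the new label (assumption, not part of this change):
route:
  receiver: default        # warn-level alerts fall through to the default receiver
  routes:
    - match:
        level: critical    # label added to the builtin rules in this PR
      receiver: pager
receivers:
  - name: default
  - name: pager
    # pagerduty_configs / slack_configs etc. would go here
```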