prometheus: add builtin alertmanager, labels.level for builtin alerts #784

Merged · 2 commits · Jul 8, 2020 · Changes from 1 commit
29 changes: 23 additions & 6 deletions base/prometheus/prometheus.ConfigMap.yaml
@@ -13,6 +13,10 @@ data:
- source_labels: [__meta_kubernetes_service_name]
regex: alertmanager
action: keep
# bundled alertmanager, started by prom-wrapper
- static_configs:
- targets: ['127.0.0.1:9093']
path_prefix: /alertmanager

rule_files:
- '*_rules.yml'
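
For context, the added static_configs entry sits under the existing alerting.alertmanagers section, so Prometheus now sends alerts both to any Alertmanager found via Kubernetes service discovery and to the bundled Alertmanager that prom-wrapper starts in the same container. A minimal sketch of the resulting stanza, assuming the surrounding keys (and the endpoints discovery role, which is outside the diff context) match the rest of this ConfigMap:

    alerting:
      alertmanagers:
        # existing Alertmanager discovered via Kubernetes service discovery
        # (role: endpoints is an assumption; it is not shown in this hunk)
        - kubernetes_sd_configs:
            - role: endpoints
          relabel_configs:
            - source_labels: [__meta_kubernetes_service_name]
              regex: alertmanager
              action: keep
        # bundled Alertmanager, started by prom-wrapper
        - static_configs:
            - targets: ['127.0.0.1:9093']
          path_prefix: /alertmanager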
@@ -226,6 +230,11 @@ data:
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name

- job_name: 'builtin-alertmanager'
metrics_path: /alertmanager/metrics
static_configs:
- targets: ['127.0.0.1:9093']
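
The two new references to 127.0.0.1:9093 play different roles: path_prefix: /alertmanager in the alerting section is where Prometheus delivers fired alerts, while the builtin-alertmanager scrape job reads the same process's own metrics from /alertmanager/metrics. A short side-by-side sketch, assuming the bundled Alertmanager serves everything under the /alertmanager route prefix:

    # alert delivery (alerting.alertmanagers): fired alerts are sent under the /alertmanager prefix
    - static_configs:
        - targets: ['127.0.0.1:9093']
      path_prefix: /alertmanager

    # self-monitoring (scrape_configs): the Alertmanager's own metrics are scraped from here
    - job_name: 'builtin-alertmanager'
      metrics_path: /alertmanager/metrics
      static_configs:
        - targets: ['127.0.0.1:9093']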
alert_rules.yml: |
groups:
- name: alert.rules
@@ -234,7 +243,7 @@
expr: app:up:ratio{app!=""} < 0.9
for: 10m
labels:
severity: page
level: critical
annotations:
description: 'Pods missing from {{`{{`}} $labels.app {{`}}`}}: {{`{{`}} $value
{{`}}`}}'
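
Assembled from the fragments above, a rule in the new style looks roughly like the sketch below; the only change in this PR is the labels block, which now carries level instead of severity. The alert name is hypothetical, since it sits above the hunk context shown here:

    - alert: PodsMissing  # hypothetical name for illustration
      expr: app:up:ratio{app!=""} < 0.9
      for: 10m
      labels:
        level: critical
      annotations:
        description: 'Pods missing from {{`{{`}} $labels.app {{`}}`}}: {{`{{`}} $value {{`}}`}}'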
@@ -244,7 +253,7 @@
expr: app:up:ratio{app!=""} < 0.1
for: 2m
labels:
severity: page
level: critical
annotations:
description: 'No pods are running for {{`{{`}} $labels.app {{`}}`}}: {{`{{`}}
$value {{`}}`}}'
@@ -254,7 +263,7 @@
expr: histogram_quantile(0.9, sum by(le) (rate(src_http_request_duration_seconds_bucket{job="sourcegraph-frontend",route!="raw"}[10m])))
> 20
labels:
severity: page
level: critical
annotations:
description: 'Page load latency > 20s (90th percentile over all routes; current
value: {{`{{`}}$value{{`}}`}}s)'
@@ -263,6 +272,8 @@
- alert: GoroutineLeak
expr: go_goroutines >= 10000
for: 10m
labels:
level: warn
annotations:
description: '{{`{{`}} $labels.app {{`}}`}} has more than 10k goroutines. This
is probably a regression causing a goroutine leak'
@@ -271,22 +282,24 @@
- alert: FSINodesRemainingLow
expr: sum by(instance) (container_fs_inodes_total{pod_name!=""}) > 3e+06
labels:
severity: page
level: critical
annotations:
description: '{{`{{`}}$labels.instance{{`}}`}} is using {{`{{`}}humanize $value{{`}}`}}
inodes'
help: Alerts when a node's remaining FS inodes are low.
summary: '{{`{{`}}$labels.instance{{`}}`}} remaining fs inodes is low'
- alert: DiskSpaceLow
expr: node:k8snode_filesystem_avail_bytes:ratio < 0.1
labels:
level: warn
annotations:
help: Alerts when a node has less than 10% available disk space.
summary: '{{`{{`}}$labels.exported_name{{`}}`}} has less than 10% available
disk space'
- alert: DiskSpaceLowCritical
expr: node:k8snode_filesystem_avail_bytes:ratio{exported_name=~".*prod.*"} < 0.05
labels:
severity: page
level: critical
annotations:
help: Alerts when a node has less than 5% available disk space.
summary: Critical! {{`{{`}}$labels.exported_name{{`}}`}} has less than 5% available
@@ -299,19 +312,23 @@
- alert: GitserverDiskSpaceLowCritical
expr: src_gitserver_disk_space_available / src_gitserver_disk_space_total < 0.05
labels:
severity: page
level: critical
annotations:
help: Alerts when gitserver disk space is critically low.
summary: Critical! gitserver {{`{{`}}$labels.instance{{`}}`}} disk space is less than 5% of available disk space
- alert: SearcherErrorRatioTooHigh
expr: searcher_errors:ratio10m > 0.1
for: 20m
labels:
level: warn
annotations:
help: Alerts when the search service has more than 10% of requests failing.
summary: Error ratio exceeds 10%
- alert: PrometheusMetricsBloat
expr: http_response_size_bytes{handler="prometheus",job!="kubernetes-apiservers",job!="kubernetes-nodes",quantile="0.5"}
> 20000
labels:
level: warn
annotations:
help: Alerts when a service is probably leaking metrics (unbounded attribute).
summary: '{{`{{`}}$labels.job{{`}}`}} in {{`{{`}}$labels.ns{{`}}`}} is probably
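
The move from severity: page to level: critical / level: warn only has an effect once the bundled Alertmanager routes on the new label; that configuration lives outside this file. Purely as an illustrative sketch (receiver names and grouping are assumptions, not part of this PR), a route tree keyed on level could look like:

    route:
      receiver: default
      group_by: ['alertname', 'level']
      routes:
        - match:
            level: critical
          receiver: critical-page   # hypothetical receiver
        - match:
            level: warn
          receiver: warn-notify     # hypothetical receiver
    receivers:
      - name: default
      - name: critical-page
      - name: warn-notify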