Mixin: Update sidecar alert (thanos-io#2002)
* Mixin: update sidecar alert

Signed-off-by: Xiang Dai <[email protected]>

* remove nonexistent alert

Signed-off-by: Xiang Dai <[email protected]>

* feedback

Signed-off-by: Xiang Dai <[email protected]>

* feedback

Signed-off-by: Xiang Dai <[email protected]>
daixiang0 authored Feb 4, 2020
1 parent 56a1fb6 commit 7c02430
Showing 3 changed files with 34 additions and 26 deletions.
39 changes: 13 additions & 26 deletions examples/alerts/alerts.md
@@ -236,39 +236,26 @@ rules:
 ## Sidecar
-[//]: # "TODO(kakkoyun): Generate sidecar rules using thanos-mixin."
-<!-- [embedmd]:# (../tmp/thanos-sidecar.rules.yaml yaml) -->
+[embedmd]:# (../tmp/thanos-sidecar.rules.yaml yaml)
 ```yaml
 name: thanos-sidecar.rules
 rules:
 - alert: ThanosSidecarPrometheusDown
-  expr: thanos_sidecar_prometheus_up{name="prometheus"} == 0
-  for: 5m
-  labels:
-    team: TEAM
   annotations:
-    summary: Thanos Sidecar cannot connect to Prometheus
-    impact: Prometheus configuration is not being refreshed
-    action: Check {{ $labels.kubernetes_pod_name }} pod logs in {{ $labels.kubernetes_namespace}} namespace
-    dashboard: SIDECAR_URL
-- alert: ThanosSidecarBucketOperationsFailed
-  expr: rate(thanos_objstore_bucket_operation_failures_total{name="prometheus"}[5m]) > 0
+    message: Thanos Sidecar {{$labels.job}} {{$labels.pod}} cannot connect to Prometheus.
+  expr: |
+    sum by (job, pod) (thanos_sidecar_prometheus_up{job=~"thanos-sidecar.*"} == 0)
   for: 5m
   labels:
-    team: TEAM
+    severity: critical
+- alert: ThanosSidecarUnhealthy
   annotations:
-    summary: Thanos Sidecar bucket operations are failing
-    impact: We will lose metrics data if not fixed in 24h
-    action: Check {{ $labels.kubernetes_pod_name }} pod logs in {{ $labels.kubernetes_namespace}} namespace
-    dashboard: SIDECAR_URL
-- alert: ThanosSidecarGrpcErrorRate
-  expr: rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable",name="prometheus"}[5m]) > 0
-  for: 5m
+    message: Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for {{ $value
+      }} seconds.
+  expr: |
+    count(time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"thanos-sidecar.*"}) by (job, pod) >= 300) > 0
   labels:
-    team: TEAM
-  annotations:
-    summary: Thanos Sidecar is returning Internal/Unavailable errors
-    impact: Prometheus queries are failing
-    action: Check {{ $labels.kubernetes_pod_name }} pod logs in {{ $labels.kubernetes_namespace}} namespace
-    dashboard: SIDECAR_URL
+    severity: critical
 ```
 ## Query
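
For reviewers reading the `ThanosSidecarUnhealthy` expression in this hunk: a rough Python sketch of what it evaluates, under the assumption that each sample is a `(job, pod, last_heartbeat_unix_seconds)` tuple (a hypothetical input shape, not anything Thanos exposes). Since `-` binds tighter than `>=` in PromQL, the inner part keeps every `(job, pod)` group whose newest heartbeat is at least 300 seconds old. The outer `count(...)` then collapses those groups into a single number, so the alert value is the count of stale groups and the result carries no `job` or `pod` labels.

```python
def count_stale_sidecars(samples, now, threshold=300):
    """Mimic: count(time() - max(heartbeat) by (job, pod) >= threshold).

    samples: iterable of (job, pod, last_heartbeat_unix_seconds) tuples
             (hypothetical input shape, for illustration only).
    """
    # max(...) by (job, pod): keep the newest heartbeat per group.
    latest = {}
    for job, pod, ts in samples:
        key = (job, pod)
        latest[key] = max(latest.get(key, ts), ts)

    # time() - max(...) >= threshold: keep only the stale groups.
    stale = {key: now - ts for key, ts in latest.items() if now - ts >= threshold}

    # count(...): the number of remaining groups, with no labels attached.
    return len(stale)


def unhealthy_alert_fires(samples, now, threshold=300):
    """Mimic the trailing `> 0`: fire only if at least one group is stale."""
    return count_stale_sidecars(samples, now, threshold) > 0
```

Unlike `ThanosSidecarPrometheusDown`, this rule has no `for:` clause in the hunk, so in this sketch a single evaluation with a stale group is enough to fire.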
8 changes: 8 additions & 0 deletions examples/alerts/alerts.yaml
@@ -212,6 +212,14 @@ groups:
       severity: warning
 - name: thanos-sidecar.rules
   rules:
+  - alert: ThanosSidecarPrometheusDown
+    annotations:
+      message: Thanos Sidecar {{$labels.job}} {{$labels.pod}} cannot connect to Prometheus.
+    expr: |
+      sum by (job, pod) (thanos_sidecar_prometheus_up{job=~"thanos-sidecar.*"} == 0)
+    for: 5m
+    labels:
+      severity: critical
   - alert: ThanosSidecarUnhealthy
     annotations:
       message: Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for {{
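
A note on the added `ThanosSidecarPrometheusDown` expression, with a small Python sketch of its filter-then-aggregate behaviour. The input shape (a list of `(labels, value)` pairs) is an assumption made for illustration. The `== 0` comparison acts as a filter that keeps only series whose value is 0, and `sum by (job, pod)` then groups what is left. The summed value is therefore always 0, which is fine: an alerting rule produces an alert for every element of the result vector, whatever its value, once the condition has held for the `for: 5m` period (not simulated here).

```python
import re
from collections import defaultdict

# Prometheus regex matchers are fully anchored, hence fullmatch below.
JOB_PATTERN = re.compile(r"thanos-sidecar.*")


def prometheus_down_groups(series):
    """Mimic: sum by (job, pod) (thanos_sidecar_prometheus_up{job=~"thanos-sidecar.*"} == 0).

    series: list of (labels_dict, value) pairs (hypothetical input shape).
    Returns {(job, pod): summed_value} for groups that would raise the alert.
    """
    groups = defaultdict(float)
    for labels, value in series:
        if not JOB_PATTERN.fullmatch(labels.get("job", "")):
            continue  # label matcher: job=~"thanos-sidecar.*"
        if value != 0:
            continue  # comparison filter: == 0
        groups[(labels.get("job", ""), labels.get("pod", ""))] += value
    return dict(groups)
```

Because the aggregation keeps `job` and `pod`, both labels remain available to the `{{$labels.job}}` and `{{$labels.pod}}` placeholders in the alert message.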
13 changes: 13 additions & 0 deletions mixin/thanos/alerts/sidecar.libsonnet
@@ -9,6 +9,19 @@
       {
         name: 'thanos-sidecar.rules',
         rules: [
+          {
+            alert: 'ThanosSidecarPrometheusDown',
+            annotations: {
+              message: 'Thanos Sidecar {{$labels.job}} {{$labels.pod}} cannot connect to Prometheus.',
+            },
+            expr: |||
+              sum by (job, pod) (thanos_sidecar_prometheus_up{%(selector)s} == 0)
+            ||| % thanos.sidecar,
+            'for': '5m',
+            labels: {
+              severity: 'critical',
+            },
+          },
           {
             alert: 'ThanosSidecarUnhealthy',
             annotations: {
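
The `||| % thanos.sidecar` line fills the `%(selector)s` placeholder from a field of the `thanos.sidecar` object. Jsonnet's `%` operator follows Python-style string formatting, so the substitution can be sketched directly in Python. The selector value below is an assumption inferred from the rendered rule in `examples/alerts/alerts.yaml`. The definition of `thanos.sidecar` itself is not part of this diff.

```python
# Template text copied from the `expr` block added in sidecar.libsonnet.
EXPR_TEMPLATE = "sum by (job, pod) (thanos_sidecar_prometheus_up{%(selector)s} == 0)\n"

# Assumed stand-in for `thanos.sidecar`; only the `selector` key is used.
# The value is inferred from the rendered alerts.yaml, not taken from the mixin.
sidecar_config = {"selector": 'job=~"thanos-sidecar.*"'}


def render_expr(template, config):
    """Substitute named %(key)s placeholders, as the Jsonnet `%` operator does."""
    return template % config


rendered = render_expr(EXPR_TEMPLATE, sidecar_config)
```

With the assumed selector, `rendered` matches the `expr` line that this commit adds to `examples/alerts/alerts.yaml`, which is how the same rule appears in both the generated YAML and the mixin source.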