[DPE-5658] Update COS alert rules (#651)
* Update postgresql_rules.yaml

* Update pgbouncer_rules.yaml

* Update patroni_rules.yaml

* Update tox.ini to skip linter checks on alert rules

* Revert tox.ini change

* Fix PostgreSQL spelling
a-velasco authored Oct 26, 2024
1 parent 8e98f55 commit fe49ea5
Showing 3 changed files with 117 additions and 53 deletions.
15 changes: 10 additions & 5 deletions src/prometheus_alert_rules/patroni_rules.yaml
@@ -7,13 +7,15 @@ groups:
rules:

- alert: PatroniPostgresqlDown
expr: "patroni_postgres_running == 0"
expr: 'patroni_postgres_running == 0'
for: 0m
labels:
severity: critical
annotations:
summary: Patroni Posrgresql Down (instance {{ $labels.instance }})
description: "Patroni Postgresql instance is down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: Patroni PostgreSQL instance {{ $labels.instance }} is down.
description: |
Check for errors in the Loki logs.
LABELS = {{ $labels }}
# 2.4.1
- alert: PatroniHasNoLeader
@@ -22,5 +24,8 @@ groups:
labels:
severity: critical
annotations:
summary: Patroni has no Leader (instance {{ $labels.instance }})
description: "A leader node (neither primary nor standby) cannot be found inside the cluster {{ $labels.scope }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: Patroni instance {{ $labels.instance }} has no leader node.
description: |
A leader node (neither primary nor standby) cannot be found inside the cluster {{ $labels.scope }}.
Check for errors in the Loki logs.
LABELS = {{ $labels }}
18 changes: 12 additions & 6 deletions src/prometheus_alert_rules/pgbouncer_rules.yaml
@@ -13,8 +13,10 @@ groups:
labels:
severity: warning
annotations:
summary: PGBouncer active connections (instance {{ $labels.instance }})
description: "PGBouncer pools are filling up\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: PgBouncer instance {{ $labels.instance }} has > 200 active connections
description: |
Consider checking the client application responsible for generating those additional connections.
LABELS = {{ $labels }}
# 2.5.2
# 10 -> 3
@@ -24,8 +26,10 @@ groups:
labels:
severity: warning
annotations:
summary: PGBouncer errors (instance {{ $labels.instance }})
description: "PGBouncer is logging errors. This may be due to a a server restart or an admin typing commands at the pgbouncer console.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: PgBouncer instance {{ $labels.instance }} is logging errors.
description: |
This may be due to a a server restart or an admin typing commands at the PgBouncer console.
LABELS = {{ $labels }}
# 2.5.3
- alert: PgbouncerMaxConnections
@@ -34,5 +38,7 @@ groups:
labels:
severity: critical
annotations:
summary: PGBouncer max connections (instance {{ $labels.instance }})
description: "The number of PGBouncer client connections has reached max_client_conn.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: PgBouncer instance {{ $labels.instance }} has reached `max_client_conn`.
description: |
Consider checking how many connections the client application is opening.
LABELS = {{ $labels }}
137 changes: 95 additions & 42 deletions src/prometheus_alert_rules/postgresql_rules.yaml
@@ -13,8 +13,10 @@ groups:
labels:
severity: critical
annotations:
summary: Postgresql down (instance {{ $labels.instance }})
description: "Postgresql instance is down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: PostgreSQL instance {{ $labels.instance }} is down.
description: |
If you are not upgrading or configuring cross-region async replication clusters, check for errors in the Loki logs.
LABELS = {{ $labels }}
# 2.2.2
# critical -> info
@@ -24,8 +26,10 @@ groups:
labels:
severity: info
annotations:
summary: Postgresql restarted (instance {{ $labels.instance }})
description: "Postgresql restarted\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: PostgreSQL instance {{ $labels.instance }} has restarted.
description: |
If you are not enabling/disabling TLS or upgrading or configuring cross-region async replication clusters, check for errors in the Loki logs.
LABELS = {{ $labels }}
# 2.2.3
- alert: PostgresqlExporterError
@@ -34,8 +38,10 @@ groups:
labels:
severity: critical
annotations:
summary: Postgresql exporter error (instance {{ $labels.instance }})
description: "Postgresql exporter is showing errors. A query may be buggy in query.yaml\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: PostgreSQL instance {{ $labels.instance }} is showing an exporter error.
description: |
There may be a buggy query in query.yaml
LABELS = {{ $labels }}
# 2.2.4
# 10 days -> 7 days
@@ -45,8 +51,11 @@ groups:
labels:
severity: warning
annotations:
summary: Postgresql table not auto vacuumed (instance {{ $labels.instance }})
description: "Table {{ $labels.relname }} has not been auto vacuumed for 7 days\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: A PostgreSQL table in instance {{ $labels.instance }} is not auto vacuumed.
description: |
Table {{ $labels.relname }} has not been auto vacuumed for 7 days.
Double-check your VACUUM settings.
LABELS = {{ $labels }}
# 2.2.5
# 10 days -> 7 days
@@ -56,8 +65,11 @@ groups:
labels:
severity: warning
annotations:
summary: Postgresql table not auto analyzed (instance {{ $labels.instance }})
description: "Table {{ $labels.relname }} has not been auto analyzed for 7 days\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: A PostgreSQL table in instance {{ $labels.instance }} is not auto analyzed.
description: |
Table {{ $labels.relname }} has not been auto analyzed for 7 days.
Double-check your AUTOVACUUM ANALYZE settings.
LABELS = {{ $labels }}
# 2.2.6
- alert: PostgresqlTooManyConnections
@@ -66,8 +78,10 @@ groups:
labels:
severity: warning
annotations:
summary: Postgresql too many connections (instance {{ $labels.instance }})
description: "PostgreSQL instance has too many connections (> 80%).\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: PostgreSQL instance {{ $labels.instance }} is using > 80% of the maximum connections.
description: |
Consider checking how many connections the client application is opening, or using PgBouncer in front of the database.
LABELS = {{ $labels }}
# 2.2.7
# warning -> info
@@ -77,8 +91,11 @@ groups:
labels:
severity: info
annotations:
summary: Postgresql not enough connections (instance {{ $labels.instance }})
description: "PostgreSQL instance should have more connections (> 5)\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: PostgreSQL instance {{ $labels.instance }} does not have enough connections.
description: |
PostgreSQL instance {{ $labels.instance }} should have more connections (> 5).
Consider double-checking how many connections the client application is opening and/or using PgBouncer in front of the database.
LABELS = {{ $labels }}
# 2.2.8
- alert: PostgresqlDeadLocks
@@ -87,8 +104,10 @@ groups:
labels:
severity: warning
annotations:
summary: Postgresql dead locks (instance {{ $labels.instance }})
description: "PostgreSQL has dead-locks\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: PostgreSQL instance {{ $labels.instance }} has dead locks.
description: |
See more details with the pg_locks view.
LABELS = {{ $labels }}
# 2.2.9
- alert: PostgresqlHighRollbackRate
@@ -97,8 +116,11 @@ groups:
labels:
severity: warning
annotations:
summary: Postgresql high rollback rate (instance {{ $labels.instance }})
description: "Ratio of transactions being aborted compared to committed is > 2 %\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: PostgreSQL instance {{ $labels.instance }} has a high rollback rate instance.
description: |
The ratio of transactions being aborted compared to committed is > 2 %.
This is probably happening due to unoptimized configurations related to commit delay, connections, memory, and WAL files.
LABELS = {{ $labels }}
# 2.2.10
# critical -> info
@@ -108,8 +130,11 @@ groups:
labels:
severity: info
annotations:
summary: Postgresql commit rate low (instance {{ $labels.instance }})
description: "Postgresql seems to be processing very few transactions\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: PostgreSQL instance {{ $labels.instance }} has a low commit rate.
description: |
PostgreSQL seems to be processing very few transactions.
Please check for long-running queries and configuration issues, like insufficient cache size.
LABELS = {{ $labels }}
# 2.2.11
# warning -> info
@@ -119,8 +144,11 @@ groups:
labels:
severity: info
annotations:
summary: Postgresql low XID consumption (instance {{ $labels.instance }})
description: "Postgresql seems to be consuming transaction IDs very slowly\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: PostgreSQL instance {{ $labels.instance }} shows low XID consumption.
description: |
PostgreSQL seems to be consuming transaction IDs very slowly.
Run ANALYZE to update the optimizer statistics, ensure that query plans are correct, and double-check your VACUUM settings.
LABELS = {{ $labels }}
# 2.2.12
- alert: PostgresqlHighRateStatementTimeout
@@ -129,8 +157,11 @@ groups:
labels:
severity: critical
annotations:
summary: Postgresql high rate statement timeout (instance {{ $labels.instance }})
description: "Postgres transactions showing high rate of statement timeouts\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: PostgreSQL instance {{ $labels.instance }} shows a high rate of statement timeout.
description: |
Either tune `statement_timeout` when sending queries or use EXPLAIN ANALYZE to understand how the queries can be improved.
VALUE = {{ $value }}
LABELS = {{ $labels }}
# 2.2.13
# critical -> warning
@@ -140,8 +171,10 @@ groups:
labels:
severity: warning
annotations:
summary: Postgresql high rate deadlock (instance {{ $labels.instance }})
description: "Postgres detected deadlocks\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: PostgreSQL instance {{ $labels.instance }} shows a high deadlock rate.
description: |
More details can be obtained through the pg_locks view.
LABELS = {{ $labels }}
# 2.2.14
# warning -> info
@@ -151,8 +184,10 @@ groups:
labels:
severity: info
annotations:
summary: Postgresql unused replication slot (instance {{ $labels.instance }})
description: "Unused Replication Slots\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: PostgreSQL instance {{ $labels.instance }} has unused replication slots.
description: |
Check if a replica is not using any of them before deleting it.
LABELS = {{ $labels }}
# 2.2.15
- alert: PostgresqlTooManyDeadTuples
@@ -161,8 +196,10 @@ groups:
labels:
severity: warning
annotations:
summary: Postgresql too many dead tuples (instance {{ $labels.instance }})
description: "PostgreSQL dead tuples is too large\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: PostgreSQL instance {{ $labels.instance }} has too many dead tuples.
description: |
Double-check your VACUUM settings.
LABELS = {{ $labels }}
# 2.2.16
- alert: PostgresqlConfigurationChanged
@@ -171,8 +208,10 @@ groups:
labels:
severity: info
annotations:
summary: Postgresql configuration changed (instance {{ $labels.instance }})
description: "Postgres Database configuration change has occurred\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: PostgreSQL instance {{ $labels.instance }} configuration has changed.
description: |
PostgreSQL database configuration has changed.
LABELS = {{ $labels }}
# 2.2.17
# critical -> warning
@@ -182,8 +221,11 @@ groups:
labels:
severity: warning
annotations:
summary: Postgresql SSL compression active (instance {{ $labels.instance }})
description: "Database connections with SSL compression enabled. This may add significant jitter in replication delay. Replicas should turn off SSL compression via `sslcompression=0` in `recovery.conf`.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: PostgreSQL instance {{ $labels.instance }} SSL compression is active.
description: |
Database connections with SSL compression are enabled. This may add significant jitter in replication delay.
Replicas should turn off SSL compression via `sslcompression=0` in `recovery.conf`.
LABELS = {{ $labels }}
# 2.2.18
# critical -> warning
@@ -193,8 +235,10 @@ groups:
labels:
severity: warning
annotations:
summary: Postgresql too many locks acquired (instance {{ $labels.instance }})
description: "Too many locks acquired on the database. If this alert happens frequently, we may need to increase the postgres setting max_locks_per_transaction.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: PostgreSQL instance {{ $labels.instance }} has acquired too many locks.
description: |
If this alert happens frequently, you may need to increase the PostgreSQL setting max_locks_per_transaction.
LABELS = {{ $labels }}
# 2.2.19
- alert: PostgresqlBloatIndexHigh(>80%)
@@ -203,8 +247,11 @@ groups:
labels:
severity: warning
annotations:
summary: Postgresql bloat index high (> 80%) (instance {{ $labels.instance }})
description: "The index {{ $labels.idxname }} is bloated. You should execute `REINDEX INDEX CONCURRENTLY {{ $labels.idxname }};`\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: PostgreSQL instance {{ $labels.instance }} has a high bloat index (> 80%).
description: |
The index {{ $labels.idxname }} is bloated.
Consider running `REINDEX INDEX CONCURRENTLY {{ $labels.idxname }};`
LABELS = {{ $labels }}
# 2.2.20
- alert: PostgresqlBloatTableHigh(>80%)
@@ -213,8 +260,11 @@ groups:
labels:
severity: warning
annotations:
summary: Postgresql bloat table high (> 80%) (instance {{ $labels.instance }})
description: "The table {{ $labels.relname }} is bloated. You should execute `VACUUM {{ $labels.relname }};`\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: PostgreSQL instance {{ $labels.instance }} has a high bloat table (> 80%).
description: |
The table {{ $labels.relname }} is bloated.
Consider running `VACUUM {{ $labels.relname }};`
LABELS = {{ $labels }}
# 2.2.21
# warning -> critical
@@ -224,5 +274,8 @@ groups:
labels:
severity: critical
annotations:
summary: Postgresql invalid index (instance {{ $labels.instance }})
description: "The table {{ $labels.relname }} has an invalid index: {{ $labels.indexrelname }}. You should execute `DROP INDEX {{ $labels.indexrelname }};`\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
summary: PostgreSQL instance {{ $labels.instance }})= has an invalid index.
description: |
The table {{ $labels.relname }} has an invalid index: {{ $labels.indexrelname }}.
Consider running `DROP INDEX {{ $labels.indexrelname }};`
LABELS = {{ $labels }}
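
For reference, the annotation pattern this commit applies across all three files (a one-line summary that names the instance, plus a literal block scalar `|` description in place of a quoted string with `\n` escapes) can be sketched as a complete rule. The rule body below is reassembled from the first hunk of patroni_rules.yaml. The group name and the indentation are assumptions, since the page extraction flattened whitespace and the hunk does not show the group header.

```yaml
groups:
  # Hypothetical group name; the real one is outside the hunks shown above.
  - name: patroni-example
    rules:
      - alert: PatroniPostgresqlDown
        expr: 'patroni_postgres_running == 0'
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Patroni PostgreSQL instance {{ $labels.instance }} is down.
          # Literal block scalar: line breaks are kept as written,
          # so no "\n" escapes are needed.
          description: |
            Check for errors in the Loki logs.
            LABELS = {{ $labels }}
```

With `|`, each line of the description is kept as a separate line and the value ends with a single trailing newline, which is presumably why the new descriptions no longer carry explicit `\n` sequences.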
