A Vault node exposes telemetry information that can be used to monitor and alert on the health and performance of a Vault cluster.
By default, the Vault operator configures each Vault pod to publish statsd metrics. The operator also runs a statsd-exporter container inside each Vault pod to convert those metrics and expose them in Prometheus format.
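As a quick sanity check (the command below is only illustrative and relies on the app and vault_cluster labels that the operator sets on the pods, visible in the service selector further down), you can list the containers of each Vault pod and verify that a statsd-exporter container runs alongside the vault container:
$ kubectl -n default get pods -l app=vault,vault_cluster=example \
    -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.containers[*].name}{"\n"}{end}'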
Curl the /metrics endpoint on port 9102 of any Vault pod to get the Prometheus metrics:
$ VPOD=$(kubectl -n default get vault example -o jsonpath='{.status.vaultStatus.active}')
$ kubectl -n default exec -ti ${VPOD} --container=vault -- curl localhost:9102/metrics
# HELP vault_core_unseal Metric autogenerated by statsd_exporter.
# TYPE vault_core_unseal summary
vault_core_unseal{quantile="0.5"} NaN
vault_core_unseal{quantile="0.9"} NaN
vault_core_unseal{quantile="0.99"} NaN
vault_core_unseal_sum 2.077112
vault_core_unseal_count 1
. . .
The Vault operator also creates a service with the same name as the Vault cluster that exposes the /metrics endpoint of the Vault nodes via the prometheus port. So for a Vault cluster named example, the following service exists:
$ kubectl -n default get service example -o yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app: vault
    vault_cluster: example
  name: example
  namespace: default
  ...
spec:
  ports:
  - name: vault-client
    port: 8200
    protocol: TCP
    targetPort: 8200
  - name: vault-cluster
    port: 8201
    protocol: TCP
    targetPort: 8201
  - name: prometheus
    port: 9102
    protocol: TCP
    targetPort: 9102
  selector:
    app: vault
    vault_cluster: example
  type: ClusterIP
...
The above service can be scraped to consume the Prometheus metrics for the Vault cluster.
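For a one-off check from outside the cluster, you can also port-forward the service and fetch the metrics locally; this is just an illustrative verification step, not required for the Prometheus setup below:
$ kubectl -n default port-forward service/example 9102:9102
$ curl localhost:9102/metrics   # run in a second terminal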
Consult the Prometheus operator docs on how to set up and configure Prometheus with a ServiceMonitor to consume the metrics of a target service. A ServiceMonitor with the following spec can be created to describe the above Vault service as a target for Prometheus.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  ...
spec:
  selector:
    matchLabels:
      app: vault
      vault_cluster: example
  namespaceSelector:
    matchNames:
    - default
  endpoints:
  - interval: 30s
    path: /metrics
    port: prometheus
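Once Prometheus has picked up the ServiceMonitor, a simple way to confirm that the Vault endpoints are being scraped is to query the up series for the service's job label, which the alert rules below assume is example (matching the service name):
up{job="example"}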
The following alert rules for some key metrics are provided as a guide to alerting best practices for Vault. The sample alert rules assume Prometheus is configured to monitor a Vault service named example.
alert: VaultLeadershipLoss
expr: sum(increase(vault_core_leadership_lost_count{job="example"}[1h])) > 5
for: 1m
labels:
  severity: critical
annotations:
  summary: High frequency of Vault leadership losses
  description: There have been more than 5 Vault leadership losses in the past 1h

alert: VaultLeadershipStepDowns
expr: sum(increase(vault_core_step_down_count{job="example"}[1h])) > 5
for: 1m
labels:
  severity: critical
annotations:
  summary: High frequency of Vault leadership step downs
  description: There have been more than 5 Vault leadership step downs in the past 1h

alert: VaultLeadershipSetupFailures
expr: sum(increase(vault_core_leadership_setup_failed{job="example"}[1h])) > 5
for: 1m
labels:
  severity: critical
annotations:
  summary: High frequency of Vault leadership setup failures
  description: There have been more than 5 Vault leadership setup failures in the past 1h
The queries and parameters of the above alert rules should be tuned for your particular use case. Read the Prometheus documentation on queries and alerting rules to learn how to write additional alerting rules as needed.
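If alerting rules are managed through the Prometheus operator, the sample rules above could be packaged as a PrometheusRule resource, assuming the operator version in use supports that CRD. The resource name and the labels used to match your Prometheus ruleSelector are placeholders in this sketch:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vault-example-alerts      # placeholder name
  namespace: default
  labels:
    role: alert-rules             # assumption: must match the ruleSelector of your Prometheus resource
spec:
  groups:
  - name: vault-example.rules
    rules:
    - alert: VaultLeadershipLoss
      expr: sum(increase(vault_core_leadership_lost_count{job="example"}[1h])) > 5
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: High frequency of Vault leadership losses
        description: There have been more than 5 Vault leadership losses in the past 1h
The other two sample alerts can be added to the same rules list in the same way.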