Faulty drive monitoring
1.- Check mc admin info for faulty drives
A faulty drive is reported as "faulty" in the output of mc admin info. To identify it, run:
mc admin info acme --json | jq . | grep -i faulty -B 2
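For unattended checks it can help to reduce the same search to a number that a cron job or alert can compare against zero. The following is a minimal sketch; the function name count_faulty_lines is hypothetical, and it makes no assumption about the JSON layout beyond the word "faulty" appearing somewhere in the output.

```shell
# Print how many lines of the text read from stdin mention "faulty"
# (case-insensitive). Prints 0 when nothing matches.
count_faulty_lines() {
  awk 'tolower($0) ~ /faulty/ { n++ } END { print n + 0 }'
}

# Usage (requires mc, jq, and the "acme" alias used in the text):
#   n=$(mc admin info acme --json | jq . | count_faulty_lines)
#   if [ "$n" -gt 0 ]; then echo "faulty drives reported ($n matching lines)"; fi
```

The count is a number of matching lines, not necessarily a number of drives, so treat any non-zero value as a prompt to inspect the full output.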
2.- env var _MINIO_DISK_MAX_TIMEOUT
A failed drive is reported as "offline" in mc admin info. The env var _MINIO_DISK_MAX_TIMEOUT controls when that happens: it is the maximum time a drive may take to respond before it is taken offline.
For example, if you set _MINIO_DISK_MAX_TIMEOUT=20s and a drive takes more than 20 seconds to respond, the drive is taken offline. The default value for this timeout is 2 minutes.
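As a sketch of where the setting could live: on systemd-based installs the MinIO service commonly reads its environment from a file such as /etc/default/minio. That path is an assumption about the deployment and is not stated in this document; the value 20s is the example from the text, not a recommendation.

```shell
# Environment file entry (assumed location: /etc/default/minio).
# Drives that take longer than this to respond are taken offline.
_MINIO_DISK_MAX_TIMEOUT=20s
```

The server has to be restarted for a changed environment file to take effect.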
To derive a suitable value for _MINIO_DISK_MAX_TIMEOUT, check the minio_node_drive_latency_us metric: https://min.io/docs/minio/linux/operations/monitoring/metrics-and-alerts.html#drive-metrics
The Grafana dashboard minio-node.json (https://github.com/minio/minio/blob/master/docs/metrics/prometheus/grafana/node/minio-node.json) contains a graph for this metric; see the graph titled "Drive Latency (micro sec)".
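Because the latency metric is reported in microseconds while the timeout is usually written in seconds, a small conversion helps when comparing the two. The sketch below reads metrics in the Prometheus text format from stdin and prints the largest minio_node_drive_latency_us sample in seconds. The function name max_drive_latency_seconds is hypothetical, and how the metrics text is obtained (scrape, saved file) is left open.

```shell
# Print the largest minio_node_drive_latency_us sample found on stdin,
# converted from microseconds to seconds. Prints 0.000000 if none is found.
max_drive_latency_seconds() {
  awk '
    /^minio_node_drive_latency_us[{ ]/ {
      rest = $0
      brace = index(rest, "}")              # end of the label set, if any
      if (brace > 0) rest = substr(rest, brace + 1)
      else rest = substr(rest, length("minio_node_drive_latency_us") + 1)
      split(rest, parts, " ")               # parts[1] is the sample value
      value = parts[1] + 0
      if (value > max) max = value
    }
    END { printf "%.6f\n", max / 1000000 }
  '
}

# Usage (node-metrics.txt is a hypothetical saved scrape):
#   max_drive_latency_seconds < node-metrics.txt
```

A single snapshot only shows the latency at scrape time; the dashboard graph over a longer window is the better basis for choosing the timeout.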
3.- Check dmesg for problematic drives
On each node, run dmesg -T and save the output in a directory named dmesg-output (one file per node). Then iterate over the saved outputs by running:
for i in dmesg-output/*; do found=$(sed -e 's/\x1b\[[0-9;]*m//g' "$i" | grep -i "critical medium" | awk '{print $11}' | sort -u | tr '\n' ' '); if [ -n "$found" ]; then echo "$i : ${found}"; fi; done
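The loop above picks the device name by field position, which only works when the kernel message has a fixed number of words before the device. As a hedged alternative, the sketch below takes the token that follows the word "dev" instead. It assumes the matching lines contain "dev <name>," as in messages like "critical medium error, dev sda, sector ..."; the function name critical_medium_devices is hypothetical.

```shell
# For every file in the given directory, list the unique device names that
# appear after "dev" on lines mentioning "critical medium".
critical_medium_devices() {
  dir="$1"
  for f in "$dir"/*; do
    [ -f "$f" ] || continue
    # Strip ANSI colour codes, keep matching lines, extract the device name.
    found=$(sed -e 's/\x1b\[[0-9;]*m//g' "$f" \
      | grep -i "critical medium" \
      | sed -n 's/.*[[:space:]]dev[[:space:]]\{1,\}\([^,[:space:]]\{1,\}\).*/\1/p' \
      | sort -u | paste -sd ' ' -)
    if [ -n "$found" ]; then
      echo "$f : $found"
    fi
  done
}

# Usage: critical_medium_devices <directory holding the saved dmesg outputs>
```

Each output line names a file and the devices flagged in it, so a file per node maps directly to a node and its suspect drives.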
4.- Prometheus
docs/metrics/prometheus/list.md:| minio_node_drive_errors_timeout | Total number of drive timeout errors since server start |
heuristic: >50 means bad drive
docs/metrics/prometheus/list.md:| minio_node_drive_errors_availability | Total number of drive I/O errors, permission denied and timeouts since server start |
docs/metrics/prometheus/list.md:| minio_node_drive_io_waiting | Total number I/O operations waiting on drive |
pending - separate out _ioerror from minio_node_drive_errors_availability
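The ">50" heuristic for minio_node_drive_errors_timeout can be applied to a metrics dump with a short filter. This is a sketch under stated assumptions: the input is Prometheus text format on stdin, each sample line ends with its value (no trailing timestamp), and the function name drives_over_timeout_threshold is hypothetical. The threshold defaults to 50 as in the heuristic and can be overridden.

```shell
# Print every minio_node_drive_errors_timeout sample whose value exceeds
# the threshold (first argument, default 50).
drives_over_timeout_threshold() {
  threshold="${1:-50}"
  awk -v threshold="$threshold" '
    /^minio_node_drive_errors_timeout[{ ]/ {
      value = $NF + 0          # assumes the value is the last field
      if (value > threshold) print
    }
  '
}

# Usage (node-metrics.txt is a hypothetical saved scrape):
#   drives_over_timeout_threshold 50 < node-metrics.txt
```

Since the counter is cumulative since server start, a drive on a long-running server can cross the threshold slowly; comparing two dumps taken some time apart gives a better sense of whether errors are still accumulating.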