
Faulty drive monitoring

Allan Roger Reid edited this page Apr 4, 2024 · 1 revision

1.- Check mc admin info for faulty drives

A faulty drive shows up as "faulty" in mc admin info. To identify it, run the following, where acme is the mc alias of the deployment:

mc admin info acme --json | jq . | grep -i faulty -B 2

2.- env var _MINIO_DISK_MAX_TIMEOUT

A failed drive shows up as "offline" in mc admin info. The env var _MINIO_DISK_MAX_TIMEOUT sets the maximum time a drive may take to respond before it is taken offline. For example, with _MINIO_DISK_MAX_TIMEOUT=20s, a drive that takes more than 20 seconds to respond is taken offline. The default value for this timeout is 2 minutes.
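As a minimal sketch of the example above, the variable can be exported in the environment the server starts from. Where the variable is set depends on the deployment; the placement shown here is an assumption, not something stated on this page.

```shell
# Assumption: MinIO is started from this shell. For a service-managed install,
# the same KEY=VALUE pair would go in the service's environment configuration.
export _MINIO_DISK_MAX_TIMEOUT=20s
# Restart the server afterwards; environment variables are read at startup.
```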

To derive a suitable value for _MINIO_DISK_MAX_TIMEOUT, check the minio_node_drive_latency_us metric described at https://min.io/docs/minio/linux/operations/monitoring/metrics-and-alerts.html#drive-metrics. The Grafana dashboard minio-node.json (https://github.com/minio/minio/blob/master/docs/metrics/prometheus/grafana/node/minio-node.json) contains a graph for this metric (see the graph titled "Drive Latency (micro sec)").
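Since minio_node_drive_latency_us is reported in microseconds and the timeout is given in seconds, a unit conversion is needed. The sketch below shows one possible way to turn an observed peak latency into a candidate timeout. The helper name suggest_timeout and the default safety factor of 10 are hypothetical and are not a MinIO recommendation; pick a factor that fits your own latency graphs.

```shell
# Hypothetical helper: convert a peak drive latency in microseconds into a
# candidate timeout in whole seconds, scaled by a safety factor (default 10).
suggest_timeout() {
  st_latency_us=$1
  st_factor=${2:-10}
  # Round up to whole seconds: ceil(latency_us * factor / 1000000).
  st_secs=$(( (st_latency_us * st_factor + 999999) / 1000000 ))
  # Never suggest less than one second.
  if [ "$st_secs" -lt 1 ]; then
    st_secs=1
  fi
  echo "${st_secs}s"
}

# Example: a peak latency of 1.5 seconds (1500000 us) with the default factor.
suggest_timeout 1500000   # prints 15s
```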

3.- Check dmesg for problematic drives

On each node, run dmesg -T and save the output in the directory dmseg-output, one file per node. Then iterate over the saved outputs by running:

for i in dmseg-output/*; do found=$(sed -e 's/\x1b\[[0-9;]*m//g' "$i" | grep -i "critical medium" | awk '{print $11}' | sort -u | tr '\n' ' '); if [ -n "$found" ]; then echo "$i : ${found}"; fi; done
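The loop above takes the device name from a fixed field position (field 11), which only holds if every "critical medium" line has the same layout. As a hedged alternative, the sketch below picks the token that follows the word "dev" instead. It assumes kernel messages of the form "... critical medium error, dev sdb, sector ..."; the function name medium_error_devices is hypothetical, and the escape-sequence stripping relies on GNU sed, as in the original command.

```shell
# Print the unique device names found in "critical medium" lines.
# Reads saved dmesg output on stdin.
medium_error_devices() {
  # 1. strip ANSI colour codes, 2. keep only "critical medium" lines,
  # 3. extract the token after "dev" (without the trailing comma),
  # 4. de-duplicate and join on one line.
  sed -e 's/\x1b\[[0-9;]*m//g' \
    | grep -i "critical medium" \
    | sed -n 's/.*[, ]dev \([^, ]*\).*/\1/p' \
    | sort -u \
    | tr '\n' ' '
}

# Usage (not executed here): medium_error_devices < dmseg-output/<file>
```

Unlike the field-based version, this prints the device name without the trailing comma.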

4.- Prometheus

Relevant metrics, as listed in docs/metrics/prometheus/list.md:

minio_node_drive_errors_timeout: Total number of drive timeout errors since server start. Heuristic: >50 means bad drive.

minio_node_drive_errors_availability: Total number of drive I/O errors, permission denied and timeouts since server start.

minio_node_drive_io_waiting: Total number of I/O operations waiting on drive.
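The ">50" heuristic for minio_node_drive_errors_timeout can be applied to a raw metrics scrape. The sketch below assumes Prometheus text-format input on stdin, one sample per line with the value as the last field and no trailing timestamp. The function name flag_timeout_errors and the label names in the usage comment are hypothetical.

```shell
# Print every minio_node_drive_errors_timeout sample whose value exceeds the
# threshold (default 50, per the heuristic above). Reads metrics on stdin.
flag_timeout_errors() {
  awk -v limit="${1:-50}" '
    # Match only sample lines for this metric; "# HELP" and "# TYPE" lines
    # start with "#" and are skipped by the ^ anchor.
    /^minio_node_drive_errors_timeout[{ ]/ {
      if ($NF + 0 > limit + 0) print
    }'
}

# Usage (not executed here), with a scrape saved to a file:
#   flag_timeout_errors < node-metrics.txt
#   flag_timeout_errors 100 < node-metrics.txt   # custom threshold
# A matching line would look like:
#   minio_node_drive_errors_timeout{drive="/mnt/disk2",server="node1:9000"} 120
```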

pending - separate out _ioerror from minio_node_drive_errors_availability
