Add debug output to understand calculation of failure probability (FP) #16

saladpanda · 2023-10-13T15:57:45Z

Please don't actually merge this. I just wanted to publish it here for others who run into the same confusion I did.

When running snapraid smart on my machine, I was surprised to see a failure probability of 84% even though all 5 SMART values mentioned in the SnapRAID FAQ suggested a perfectly healthy drive.
To understand where this high value came from, I added some debug output along the code path for the calculation and saw that it was based on Load_Cycle_Count (193), which is mentioned neither in the FAQ nor in the Backblaze blog posts.

With this patch the output looks like this:

SnapRAID SMART report:

   Temp  Power   Error   FP Size
      C OnDays   Count        TB  Serial          Device    Disk
 -----------------------------------------------------------------------
     41    569       0
|value=0||value/step=0||value=0||result=0.031633| calculated AFR for SMART value 5: 0.031633 (3.113823%)
|value=0||value/step=0||value=0||result=0.047450|calculated AFR for SMART value 187: 0.047450 (4.634185%)
|value=235466||value/step=362||value=255||result=1.822567|calculated AFR for SMART value 193: 1.822567 (83.838958%)
|value=0||value/step=0||value=0||result=0.034067|calculated AFR for SMART value 197: 0.034067 (3.349293%)
|value=0||value/step=0||value=0||result=0.036500|calculated AFR for SMART value 198: 0.036500 (3.584191%)
                        84%  2.0  Z4Z4JCWX        /dev/sda  disk1
     32   2719       0
|value=0||value/step=0||value=0||result=0.031633| calculated AFR for SMART value 5: 0.031633 (3.113823%)
|value=5093425||value/step=7848||value=255||result=1.822567|calculated AFR for SMART value 193: 1.822567 (83.838958%)
|value=0||value/step=0||value=0||result=0.034067|calculated AFR for SMART value 197: 0.034067 (3.349293%)
|value=0||value/step=0||value=0||result=0.036500|calculated AFR for SMART value 198: 0.036500 (3.584191%)
                        84%  0.5  71UTC68RT       /dev/sdd  disk2
     44     39       0
|value=0||value/step=0||value=0||result=0.031633| calculated AFR for SMART value 5: 0.031633 (3.113823%)
|value=0||value/step=0||value=0||result=0.047450|calculated AFR for SMART value 187: 0.047450 (4.634185%)
|value=162||value/step=0||value=0||result=0.000000|calculated AFR for SMART value 193: 0.000000 (0.000000%)
|value=0||value/step=0||value=0||result=0.034067|calculated AFR for SMART value 197: 0.034067 (3.349293%)
|value=0||value/step=0||value=0||result=0.036500|calculated AFR for SMART value 198: 0.036500 (3.584191%)
                         5%  6.0  ZF200GE8        /dev/sdb  parity
      -      -       -  n/a    -  -               /dev/sdh  -
      -      -       -  n/a    -  -               /dev/sdg  -
      -      -       -  n/a    -  -               /dev/sde  -
      -      -       -  n/a    -  -               /dev/sdf  -
     30    606       0  SSD  0.2  Y5IB61BCKNSX    /dev/sdc  -
     31     36       0
|value=0||value/step=0||value=0||result=0.031633| calculated AFR for SMART value 5: 0.031633 (3.113823%)
|value=0||value/step=0||value=0||result=0.034067|calculated AFR for SMART value 197: 0.034067 (3.349293%)
|value=0||value/step=0||value=0||result=0.036500|calculated AFR for SMART value 198: 0.036500 (3.584191%)
                         4%  1.0  S2RXJ9FCB07612  /dev/sdi  -

The FP column is the estimated probability (in percentage) that the disk
is going to fail in the next year.

Probability that at least one disk is going to fail in the next year is 98%.
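The numbers in the debug lines above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not code from SnapRAID itself: the function names are hypothetical, and the step of 649 and cap of 255 for attribute 193 are inferred from the printed values (235466 -> 362 -> 255), not read from the source. The percentages printed next to each AFR match 1 - exp(-AFR), the per-disk FP matches the largest per-attribute probability, and the 98% total matches the chance that at least one of the four reported disks fails.

```python
import math

# Assumption: inferred from the debug output, not taken from the SnapRAID source.
STEP_193 = 649  # raw Load_Cycle_Count units per table step
CAP = 255       # upper bound applied after dividing by the step


def quantize(raw, step=STEP_193, cap=CAP):
    """Integer-divide the raw SMART value by the step, then clamp to the cap."""
    return min(raw // step, cap)


def afr_to_probability(afr):
    """Chance of at least one failure in a year for a failure rate `afr`
    (Poisson assumption; matches the percentages in the debug lines)."""
    return 1.0 - math.exp(-afr)


def disk_fp(afrs):
    """Per-disk FP as the largest per-attribute probability
    (consistent with the FP column in the report above)."""
    return max(afr_to_probability(a) for a in afrs)


def array_fp(disk_fps):
    """Chance that at least one disk fails, treating disks as independent."""
    survive = 1.0
    for p in disk_fps:
        survive *= 1.0 - p
    return 1.0 - survive


# AFR values copied from the report above, one list per disk.
disk1 = [0.031633, 0.047450, 1.822567, 0.034067, 0.036500]
disk2 = [0.031633, 1.822567, 0.034067, 0.036500]
parity = [0.031633, 0.047450, 0.000000, 0.034067, 0.036500]
sdi = [0.031633, 0.034067, 0.036500]

fps = [disk_fp(d) for d in (disk1, disk2, parity, sdi)]
print("quantized 193 for disk1:", quantize(235466))
print("per-disk FP (%):", [round(100 * p) for p in fps])
print("at least one failure (%):", round(100 * array_fp(fps)))
```

Because the quantized value saturates at the cap, both disk1 (235466 cycles) and disk2 (5093425 cycles) end up with the same AFR of 1.822567 in this output.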

saladpanda · 2023-10-13T16:03:38Z

To compile on Ubuntu:

apt install build-essential autoconf
autoreconf -i
./configure
make

# run the binary:
./snapraid

amadvance · 2023-10-20T08:06:05Z

Hi @saladpanda

If I read the output correctly, the Load Cycle Count of your disk is 235,466, which is indeed a high value. In the data I analyzed, this appears to be an indicator of potential failure. However, it's important to note that this doesn't necessarily mean your disk will fail, as each case is unique. It's a good idea to check your hard drive's specifications to see what it's rated for, just to be sure. Hard drives are typically rated for around 600,000 cycles.

See for example this discussion:

https://superuser.com/questions/840851/how-much-load-cycle-count-can-my-hard-drive-hypotethically-sustain
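To put the load cycle count in perspective, a rough back-of-the-envelope check can compare it with a rated figure. This is only a sketch: the helper name is hypothetical, the 600,000 default is the "typical" figure quoted above rather than the specification of any particular drive, and the inputs are the Load_Cycle_Count (235466) and OnDays (569) shown for /dev/sda in the report.

```python
def load_cycle_budget(load_cycles, power_on_days, rated_cycles=600_000):
    """Return (cycles per day, fraction of the rating used, days until the
    rating is reached at the current pace).

    rated_cycles is an assumption; check the drive's datasheet for the real value.
    """
    per_day = load_cycles / power_on_days
    used = load_cycles / rated_cycles
    remaining = max(rated_cycles - load_cycles, 0)
    days_left = remaining / per_day if per_day > 0 else float("inf")
    return per_day, used, days_left


per_day, used, days_left = load_cycle_budget(235_466, 569)
print(f"{per_day:.0f} cycles/day, {used:.0%} of rating used, "
      f"about {days_left:.0f} days until the rating is reached")
```

A high cycles-per-day figure usually points at aggressive head parking; the rating itself is a design figure, not a point at which the drive is expected to fail.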

Quickly added some debug output to understand high FP values.

ac536e4

saladpanda mentioned this pull request Oct 13, 2023

Stick to the SMART values mentioned in the FAQ for failure probability calculation #17

Open

amadvance force-pushed the master branch 2 times, most recently from 9222213 to 79e8794 Compare January 10, 2024 13:17
