S.M.A.R.T input plugin: Power On Hours on NVME device not reported #10907

Closed
coccobill1 opened this issue Mar 29, 2022 · 17 comments · Fixed by #10923
Labels
area/smart bug unexpected problem or unintended behavior platform/windows

Comments

@coccobill1

Relevant telegraf.conf

[[inputs.smart]]
    path_smartctl = "/Program Files/smartmontools/bin/smartctl.exe"
    interval = "1m"
    enable_extensions = ["auto-on"]
    attributes = true

Logs from Telegraf

C:\Program Files\Telegraf>telegraf --input-filter=smart --test
2022-03-29T07:00:43Z I! Using config file: C:\Program Files\Telegraf\telegraf.conf
> smart_attribute,device=sde,host=CHRONUS,model=Samsung\ SSD\ 970\ EVO\ 500GB,name=Critical_Warning,serial_no=xxx raw_value=0i 1648537244000000000
> smart_attribute,device=sde,host=CHRONUS,id=194,model=Samsung\ SSD\ 970\ EVO\ 500GB,name=Temperature_Celsius,serial_no=xxx raw_value=47i 1648537244000000000
> smart_attribute,device=sde,host=CHRONUS,model=Samsung\ SSD\ 970\ EVO\ 500GB,name=Available_Spare,serial_no=xxx raw_value=83i 1648537244000000000
> smart_attribute,device=sde,host=CHRONUS,model=Samsung\ SSD\ 970\ EVO\ 500GB,name=Available_Spare_Threshold,serial_no=xxx raw_value=10i 1648537244000000000
> smart_attribute,device=sde,host=CHRONUS,model=Samsung\ SSD\ 970\ EVO\ 500GB,name=Percentage_Used,serial_no=xxx raw_value=5i 1648537244000000000
> smart_attribute,device=sde,host=CHRONUS,id=12,model=Samsung\ SSD\ 970\ EVO\ 500GB,name=Power_Cycle_Count,serial_no=xxx raw_value=44i 1648537244000000000
> smart_attribute,device=sde,host=CHRONUS,model=Samsung\ SSD\ 970\ EVO\ 500GB,name=Unsafe_Shutdowns,serial_no=xxx raw_value=28i 1648537244000000000
> smart_attribute,device=sde,host=CHRONUS,model=Samsung\ SSD\ 970\ EVO\ 500GB,name=Media_and_Data_Integrity_Errors,serial_no=xxx raw_value=43i 1648537244000000000
> smart_attribute,device=sde,host=CHRONUS,model=Samsung\ SSD\ 970\ EVO\ 500GB,name=Warning_Temperature_Time,serial_no=xxx raw_value=0i 1648537244000000000
> smart_attribute,device=sde,host=CHRONUS,model=Samsung\ SSD\ 970\ EVO\ 500GB,name=Critical_Temperature_Time,serial_no=xxx raw_value=0i 1648537244000000000
> smart_attribute,device=sde,host=CHRONUS,model=Samsung\ SSD\ 970\ EVO\ 500GB,name=Temperature_Sensor_1,serial_no=xxx raw_value=47i 1648537244000000000
> smart_attribute,device=sde,host=CHRONUS,model=Samsung\ SSD\ 970\ EVO\ 500GB,name=Temperature_Sensor_2,serial_no=xxx raw_value=48i 1648537244000000000
> smart_device,device=sde,host=CHRONUS,model=Samsung\ SSD\ 970\ EVO\ 500GB,serial_no=xxx exit_status=0i,health_ok=true,temp_c=47i 1648537244000000000

> smart_attribute,device=sdd,host=CHRONUS,model=Samsung\ SSD\ 980\ PRO\ 2TB,name=Critical_Warning,serial_no=xxx raw_value=0i 1648537244000000000
> smart_attribute,device=sdd,host=CHRONUS,id=194,model=Samsung\ SSD\ 980\ PRO\ 2TB,name=Temperature_Celsius,serial_no=xxx raw_value=39i 1648537244000000000
> smart_attribute,device=sdd,host=CHRONUS,model=Samsung\ SSD\ 980\ PRO\ 2TB,name=Available_Spare,serial_no=xxx raw_value=100i 1648537244000000000
> smart_attribute,device=sdd,host=CHRONUS,model=Samsung\ SSD\ 980\ PRO\ 2TB,name=Available_Spare_Threshold,serial_no=xxx raw_value=10i 1648537244000000000
> smart_attribute,device=sdd,host=CHRONUS,model=Samsung\ SSD\ 980\ PRO\ 2TB,name=Percentage_Used,serial_no=xxx raw_value=0i 1648537244000000000
> smart_attribute,device=sdd,host=CHRONUS,id=12,model=Samsung\ SSD\ 980\ PRO\ 2TB,name=Power_Cycle_Count,serial_no=xxx raw_value=14i 1648537244000000000
> smart_attribute,device=sdd,host=CHRONUS,id=9,model=Samsung\ SSD\ 980\ PRO\ 2TB,name=Power_On_Hours,serial_no=xxx raw_value=496i 1648537244000000000
> smart_attribute,device=sdd,host=CHRONUS,model=Samsung\ SSD\ 980\ PRO\ 2TB,name=Unsafe_Shutdowns,serial_no=xxx raw_value=5i 1648537244000000000
> smart_attribute,device=sdd,host=CHRONUS,model=Samsung\ SSD\ 980\ PRO\ 2TB,name=Media_and_Data_Integrity_Errors,serial_no=xxx raw_value=0i 1648537244000000000
> smart_attribute,device=sdd,host=CHRONUS,model=Samsung\ SSD\ 980\ PRO\ 2TB,name=Error_Information_Log_Entries,serial_no=xxx raw_value=0i 1648537244000000000
> smart_attribute,device=sdd,host=CHRONUS,model=Samsung\ SSD\ 980\ PRO\ 2TB,name=Warning_Temperature_Time,serial_no=xxx raw_value=0i 1648537244000000000
> smart_attribute,device=sdd,host=CHRONUS,model=Samsung\ SSD\ 980\ PRO\ 2TB,name=Critical_Temperature_Time,serial_no=xxx raw_value=0i 1648537244000000000
> smart_attribute,device=sdd,host=CHRONUS,model=Samsung\ SSD\ 980\ PRO\ 2TB,name=Temperature_Sensor_1,serial_no=xxx raw_value=39i 1648537244000000000
> smart_attribute,device=sdd,host=CHRONUS,model=Samsung\ SSD\ 980\ PRO\ 2TB,name=Temperature_Sensor_2,serial_no=xxx raw_value=44i 1648537244000000000
> smart_device,device=sdd,host=CHRONUS,model=Samsung\ SSD\ 980\ PRO\ 2TB,serial_no=xxx exit_status=0i,health_ok=true,temp_c=39i 1648537244000000000

System info

Telegraf 1.22.0, smartctl 7.3, Windows 10 (19044,1620)

Docker

No response

Steps to reproduce

  1. Run Telegraf with the inputs.smart plugin and a Samsung NVMe drive (not sure about other vendors)

Expected behavior

Telegraf reports the Power On Hours smart_attribute correctly on all NVME drives.

Actual behavior

My Samsung 980 PRO:
Power On Hours: 496

My Samsung 970 EVO:
Power On Hours: 22 371

The 980 value is reported correctly; for the 970 the whole attribute is missing. It looks like smartctl uses a blank as the thousands separator in the reported value, which might be causing the issue?

Additional info

No response

@coccobill1 coccobill1 added the bug unexpected problem or unintended behavior label Mar 29, 2022
@powersj
Contributor

powersj commented Mar 29, 2022

Hi,

Is the actual behavior output directly from smartctl, and is that where you noticed the extra space? If not, can you provide the smartctl output for your 970 EVO? On Linux I would run something like:

smartctl /dev/nvme0n1 -a

I do have two NVMe drives with 1k+ hours that report correctly, but as you noticed, they use a comma as the separator. The code handles the comma, but not the space:

Power On Hours:                     3,767
Power On Hours:                     3,745
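
A minimal Python sketch of the situation described above, assuming the existing logic only strips commas before converting the value. The function name `parse_comma_grouped` is hypothetical, and this is an illustration, not Telegraf's Go code:

```python
def parse_comma_grouped(value):
    """Convert a comma-grouped count such as '3,767' to an int.

    Hypothetical stand-in for comma-only handling: any other
    grouping character makes the conversion fail.
    """
    return int(value.replace(",", ""))


if __name__ == "__main__":
    print(parse_comma_grouped("3,767"))
    try:
        parse_comma_grouped("22 373")
    except ValueError as err:
        # A blank inside the number is rejected by int().
        print("not parsed:", err)
```

Under that assumption, a comma-grouped value converts cleanly while a space-grouped one raises an error and would be dropped.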

I do happen to have a 970 EVO, but it has not passed the 1000 hours mark yet :(

@powersj powersj added the waiting for response waiting for response from contributor label Mar 29, 2022
@coccobill1
Author

coccobill1 commented Mar 29, 2022

Yup that's how it's reported, here's the smartctl output:

C:\Program Files\Telegraf>smartctl -a /dev/sde
smartctl 7.3 2022-02-28 r5338 [x86_64-w64-mingw32-w10-21H2] (sf-7.3-1)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 970 EVO 500GB
Serial Number:                      xxx
Firmware Version:                   2B2QEXE7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 500 107 862 016 [500 GB]
Unallocated NVM Capacity:           0
Controller ID:                      4
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          500 107 862 016 [500 GB]
Namespace 1 Utilization:            370 363 879 424 [370 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 5a81b10565
Local Time is:                      Tue Mar 29 16:41:34 2022 FLEST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     85 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.20W       -        -    0  0  0  0        0       0
 1 +     4.30W       -        -    1  1  1  1        0       0
 2 +     2.10W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        61 Celsius
Available Spare:                    83%
Available Spare Threshold:          10%
Percentage Used:                    5%
Data Units Read:                    117 888 857 [60,3 TB]
Data Units Written:                 109 879 515 [56,2 TB]
Host Read Commands:                 1 332 906 459
Host Write Commands:                1 908 074 847
Controller Busy Time:               7 271
Power Cycles:                       44
Power On Hours:                     22 373
Unsafe Shutdowns:                   28
Media and Data Integrity Errors:    43
Error Information Log Entries:      13 857
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               61 Celsius
Temperature Sensor 2:               63 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0      13857     0  0x003f  0x4004      -            0     1     -
  1      13856     0  0x0027  0x4004      -            0     1     -
  2      13855     0  0x0001  0x4004      -            0     1     -
  3      13854     0  0x006a  0x4004      -            0     1     -
  4      13853     0  0x0058  0x4004      -            0     1     -
  5      13852     0  0x0042  0x4004      -            0     1     -
  6      13851     0  0x0024  0x4004      -            0     1     -
  7      13850     0  0x0010  0x4004      -            0     1     -
  8      13849     0  0x0079  0x4004      -            0     1     -
  9      13848     0  0x005d  0x4004      -            0     1     -
 10      13847     0  0x0044  0x4004      -            0     1     -
 11      13846     0  0x0034  0x4004      -            0     1     -
 12      13845     0  0x001d  0x4004      -            0     1     -
 13      13844     0  0x007d  0x4004      -            0     1     -
 14      13843     0  0x006a  0x4004      -            0     1     -
 15      13842     0  0x0055  0x4004      -            0     1     -
... (48 entries not read)

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Mar 29, 2022
@powersj
Contributor

powersj commented Mar 29, 2022

Thanks for that output. I'll look at putting something together for this and can re-use this output in a test case.

@powersj
Contributor

powersj commented Mar 29, 2022

I built a test case using your output and was able to parse the fields successfully. It looks like Telegraf uses this regex to parse these fields: ^([^:]+):\s+(.+)$. The second group is greedy and will match everything, including white space.
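
To see what that regex captures from a space-grouped line, here is a minimal Python sketch. Telegraf itself is written in Go, and the helper name `split_line` is hypothetical; the sketch only exercises the pattern quoted above, not the plugin's actual code path.

```python
import re

# The pattern quoted in the comment above: "<name>: <value>".
LINE_RE = re.compile(r"^([^:]+):\s+(.+)$")


def split_line(line):
    """Return (name, value) for a smartctl-style line, or None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    # The space-grouped value is captured whole, inner blank included.
    print(split_line("Power On Hours:                     22 373"))
```

Because the whole value lands in the second group, the match itself does not lose the field; any failure would have to come from a later conversion step.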

By chance, do you have the nvme binary installed on your system as well? I'm wondering if that is the output that is not getting parsed correctly.

@coccobill1
Author

I don't have nvme-cli installed, so it must be the smartctl output it uses. The smartctl output format for the 980 PRO is identical, so I thought maybe the blank character is messing things up. The 980 PRO is brand new, so it is still under 1000 hours.

@zak-pawel
Collaborator

It seems that output format from smartctl is region-specific (at least for Windows).

Here is my output for Samsung SSD 970 EVO 1TB for English (United States) regional format in Windows:

C:\Users\pzak>"C:\Program Files\smartmontools\bin\smartctl.exe" -a /dev/sdd
smartctl 7.3 2022-02-28 r5338 [x86_64-w64-mingw32-w10-20H2] (sf-7.3-1)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 970 EVO 1TB
Serial Number:                      xxx
Firmware Version:                   2B2QEXE7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 1,000,204,886,016 [1.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      4
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]
Namespace 1 Utilization:            732,793,585,664 [732 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 590141cfa4
Local Time is:                      Wed Mar 30 13:25:09 2022
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     85 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.20W       -        -    0  0  0  0        0       0
 1 +     4.30W       -        -    1  1  1  1        0       0
 2 +     2.10W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        46 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    16,626,860 [8.51 TB]
Data Units Written:                 16,828,820 [8.61 TB]
Host Read Commands:                 205,867,941
Host Write Commands:                228,469,146
Controller Busy Time:               686
Power Cycles:                       779
Power On Hours:                     1,289
Unsafe Shutdowns:                   9
Media and Data Integrity Errors:    0
Error Information Log Entries:      979
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               46 Celsius
Temperature Sensor 2:               73 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0        979     0  0x002a  0x4212  0x028            0     -     -

And here is my output for Samsung SSD 970 EVO 1TB for Polish regional format in Windows:

C:\Users\pzak>"C:\Program Files\smartmontools\bin\smartctl.exe" -a /dev/sdd
smartctl 7.3 2022-02-28 r5338 [x86_64-w64-mingw32-w10-20H2] (sf-7.3-1)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 970 EVO 1TB
Serial Number:                      xxx
Firmware Version:                   2B2QEXE7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 1 000 204 886 016 [1,00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      4
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1 000 204 886 016 [1,00 TB]
Namespace 1 Utilization:            732 789 374 976 [732 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 590141cfa4
Local Time is:                      Wed Mar 30 13:30:12 2022
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     85 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.20W       -        -    0  0  0  0        0       0
 1 +     4.30W       -        -    1  1  1  1        0       0
 2 +     2.10W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        47 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    16 626 888 [8,51 TB]
Data Units Written:                 16 829 004 [8,61 TB]
Host Read Commands:                 205 868 508
Host Write Commands:                228 472 943
Controller Busy Time:               686
Power Cycles:                       779
Power On Hours:                     1 290
Unsafe Shutdowns:                   9
Media and Data Integrity Errors:    0
Error Information Log Entries:      979
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               47 Celsius
Temperature Sensor 2:               68 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0        979     0  0x002a  0x4212  0x028            0     -     -

Fun fact, json output format doesn't seem to be region-specific, because numbers are treated as numbers without any separators:

C:\Users\pzak>"C:\Program Files\smartmontools\bin\smartctl.exe" -a /dev/sdd -j
{
  "json_format_version": [
    1,
    0
  ],
  "smartctl": {
    "version": [
      7,
      3
    ],
    "svn_revision": "5338",
    "platform_info": "x86_64-w64-mingw32-w10-20H2",
    "build_info": "(sf-7.3-1)",
    "argv": [
      "smartctl",
      "-a",
      "/dev/sdd",
      "-j"
    ],
    "exit_status": 0
  },
  "local_time": {
    "time_t": 1648639922,
    "asctime": "Wed Mar 30 13:32:02 2022 "
  },
  "device": {
    "name": "/dev/sdd",
    "info_name": "/dev/sdd",
    "type": "nvme",
    "protocol": "NVMe"
  },
  "model_name": "Samsung SSD 970 EVO 1TB",
  "serial_number": "xxx",
  "firmware_version": "2B2QEXE7",
  "nvme_pci_vendor": {
    "id": 5197,
    "subsystem_id": 5197
  },
  "nvme_ieee_oui_identifier": 9528,
  "nvme_total_capacity": 1000204886016,
  "nvme_unallocated_capacity": 0,
  "nvme_controller_id": 4,
  "nvme_version": {
    "string": "1.3",
    "value": 66304
  },
  "nvme_number_of_namespaces": 1,
  "nvme_namespaces": [
    {
      "id": 1,
      "size": {
        "blocks": 1953525168,
        "bytes": 1000204886016
      },
      "capacity": {
        "blocks": 1953525168,
        "bytes": 1000204886016
      },
      "utilization": {
        "blocks": 1431234648,
        "bytes": 732792139776
      },
      "formatted_lba_size": 512,
      "eui64": {
        "oui": 9528,
        "ext_id": 382273179556
      }
    }
  ],
  "user_capacity": {
    "blocks": 1953525168,
    "bytes": 1000204886016
  },
  "logical_block_size": 512,
  "smart_support": {
    "available": true,
    "enabled": true
  },
  "smart_status": {
    "passed": true,
    "nvme": {
      "value": 0
    }
  },
  "nvme_smart_health_information_log": {
    "critical_warning": 0,
    "temperature": 43,
    "available_spare": 100,
    "available_spare_threshold": 10,
    "percentage_used": 0,
    "data_units_read": 16626990,
    "data_units_written": 16829057,
    "host_reads": 205869988,
    "host_writes": 228474699,
    "controller_busy_time": 686,
    "power_cycles": 779,
    "power_on_hours": 1290,
    "unsafe_shutdowns": 9,
    "media_errors": 0,
    "num_err_log_entries": 979,
    "warning_temp_time": 0,
    "critical_comp_time": 0,
    "temperature_sensors": [
      43,
      61
    ]
  },
  "temperature": {
    "current": 43
  },
  "power_cycle_count": 779,
  "power_on_time": {
    "hours": 1290
  }
}
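
Since the JSON form carries plain numbers, a consumer could read the value without any separator handling. A minimal Python sketch, assuming the output of `smartctl -a <device> -j` has already been captured as a string. The key names are taken from the output above, while `SAMPLE` and `power_on_hours` are hypothetical names used only for illustration:

```python
import json

# Trimmed excerpt of the JSON shown above (assumed captured as text).
SAMPLE = """
{
  "nvme_smart_health_information_log": {
    "data_units_read": 16626990,
    "power_on_hours": 1290
  },
  "power_on_time": {
    "hours": 1290
  }
}
"""


def power_on_hours(smartctl_json):
    """Return power-on hours from the NVMe health log section."""
    doc = json.loads(smartctl_json)
    return doc["nvme_smart_health_information_log"]["power_on_hours"]


if __name__ == "__main__":
    print(power_on_hours(SAMPLE))
```

The value arrives as an integer already, so the regional format never enters the picture.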

@powersj
Contributor

powersj commented Mar 30, 2022

@zak-pawel thanks for taking a look at this. Since you also show space-separated values, could you try reproducing with Telegraf under Windows with Polish regional settings? It isn't clear to me yet whether this is a parsing issue or something else.

@zak-pawel
Collaborator

@powersj Sure, here is my telegraf.conf:

[[inputs.smart]]
  path_smartctl = "C:\\Program Files\\smartmontools\\bin\\smartctl.exe"
  enable_extensions = ["auto-on"]
  attributes = true
  devices = ["/dev/sdd"]

And here is output from Telegraf:

C:\Users\pzak\Downloads\telegraf-1.22.0>telegraf --config telegraf.conf
2022-03-30T16:11:32Z I! Starting Telegraf 1.22.0
2022-03-30T16:11:32Z I! Loaded inputs: smart
2022-03-30T16:11:32Z I! Loaded aggregators:
2022-03-30T16:11:32Z I! Loaded processors:
2022-03-30T16:11:32Z I! Loaded outputs: file
2022-03-30T16:11:32Z I! Tags enabled: host=zak
2022-03-30T16:11:32Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"zak", Flush Interval:10s
2022-03-30T16:11:32Z W! [inputs.smart] nvme not found: verify that nvme is installed and it is in your PATH (or specified in config) to gather vendor specific attributes: provided path does not exist: []
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Critical_Warning,serial_no=xxx raw_value=0i 1648656700000000000
smart_attribute,device=sdd,host=zak,id=194,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Temperature_Celsius,serial_no=xxx raw_value=47i 1648656700000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Available_Spare,serial_no=xxx raw_value=100i 1648656700000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Available_Spare_Threshold,serial_no=xxx raw_value=10i 1648656700000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Percentage_Used,serial_no=xxx raw_value=0i 1648656700000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Controller_Busy_Time,serial_no=xxx raw_value=687i 1648656700000000000
smart_attribute,device=sdd,host=zak,id=12,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Power_Cycle_Count,serial_no=xxx raw_value=780i 1648656700000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Unsafe_Shutdowns,serial_no=xxx raw_value=9i 1648656700000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Media_and_Data_Integrity_Errors,serial_no=xxx raw_value=0i 1648656700000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Error_Information_Log_Entries,serial_no=xxx raw_value=980i 1648656700000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Warning_Temperature_Time,serial_no=xxx raw_value=0i 1648656700000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Critical_Temperature_Time,serial_no=xxx raw_value=0i 1648656700000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Temperature_Sensor_1,serial_no=xxx raw_value=47i 1648656700000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Temperature_Sensor_2,serial_no=xxx raw_value=73i 1648656700000000000
smart_device,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,serial_no=xxx exit_status=0i,health_ok=true,temp_c=47i 1648656700000000000

@powersj
Contributor

powersj commented Mar 30, 2022

Thanks @zak-pawel. With English, do you get the field, or is it also missing?

@coccobill1
Author

Lo and behold, switched to English UK region and:

smart_attribute,device=nvme0,host=CHRONUS,id=9,model=Samsung\ SSD\ 970\ EVO\ 500GB,name=Power_On_Hours,serial_no=xxx raw_value=22384i 1648661203000000000

@zak-pawel
Collaborator

@powersj @coccobill1
In my case these additional metrics appeared after switching to English US region:

smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Data_Units_Read,serial_no=xxx raw_value=16628754i 1648667050000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Data_Units_Written,serial_no=xxx raw_value=16845546i 1648667050000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Host_Read_Commands,serial_no=xxx raw_value=205906036i 1648667050000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Host_Write_Commands,serial_no=xxx raw_value=228674306i 1648667050000000000
smart_attribute,device=sdd,host=zak,id=9,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Power_On_Hours,serial_no=xxx raw_value=1290i 1648667050000000000

All of them have raw_value >= 1000 :)

@powersj
Contributor

powersj commented Mar 30, 2022

All of them have raw_value >= 1000 :)

Fantastic, thanks for this. I clearly see an issue with parseDataUnits so I'll go through these and add test cases when values are space separated.

@zak-pawel
Collaborator

@powersj Remember that not only dot, comma or space can be used as a thousands separator.

Look at locale for Liechtenstein:

Data Units Read:                    16'628'976 [8.51 TB]
Data Units Written:                 16'846'735 [8.62 TB]
Host Read Commands:                 205'909'958
Host Write Commands:                228'710'711
Power On Hours:                     1'291

There may be other locales with yet another thousands separator.

powersj added a commit to powersj/telegraf that referenced this issue Mar 31, 2022
The smartctl output can vary based on the localization set in Windows.
This means that some numbers can show up comma-separated (English),
space-separated, or even apostrophe-separated. There are probably
others.

This updates one function that was always assuming comma-separated to
remove any non-numeric value and no longer split on white space, which
could be used as a separator.

Fixes: influxdata#10907
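
The approach described in that commit message can be sketched in Python as follows. This is an illustration under assumptions, not the code from the PR: the name `parse_grouped_count` is hypothetical, the input is treated as raw bytes so that code-page-specific separators survive untouched, and a bracketed suffix such as `[60,3 TB]` is cut off first so its digits are not merged into the count.

```python
import re

# Anything that is not an ASCII digit counts as a grouping character.
NON_DIGITS = re.compile(rb"[^0-9]")


def parse_grouped_count(raw):
    """Parse a locale-grouped decimal count from raw smartctl bytes.

    Hex values such as '0x00' are out of scope for this sketch.
    """
    head = raw.split(b"[", 1)[0]        # drop "[60,3 TB]"-style suffixes
    digits = NON_DIGITS.sub(b"", head)  # remove every grouping character
    if not digits:
        raise ValueError("no digits in %r" % (raw,))
    return int(digits.decode("ascii"))


if __name__ == "__main__":
    # Separators seen in this thread: blank, comma, apostrophe.
    for sample in (b"22 373", b"1,289", b"1'291", b"117 888 857 [60,3 TB]"):
        print(sample, "->", parse_grouped_count(sample))
```

Stripping by exclusion avoids having to enumerate separators per locale, at the cost of accepting any stray non-digit character silently.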
@powersj
Contributor

powersj commented Mar 31, 2022

I have put up #10923, which should fix some of the fields that were not parsing correctly. It also adds a debug line that prints the fields it finds and what it parsed them into. Could one of you grab a PR artifact, run it on a non-English locale, and provide the output?

Thanks!

@zak-pawel
Collaborator

@powersj Here is output with Polish locale enabled:

C:\Users\pzak\Downloads\telegraf-1.23.0_31e3e06c_windows_amd64\telegraf-1.23.0>telegraf --config telegraf.conf
2022-03-31T19:01:06Z I! Starting Telegraf 1.23.0-31e3e06c
2022-03-31T19:01:06Z I! Loaded inputs: smart
2022-03-31T19:01:06Z I! Loaded aggregators:
2022-03-31T19:01:06Z I! Loaded processors:
2022-03-31T19:01:06Z I! Loaded outputs: file
2022-03-31T19:01:06Z I! Tags enabled: host=zak
2022-03-31T19:01:06Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"zak", Flush Interval:10s
2022-03-31T19:01:06Z W! [inputs.smart] nvme not found: verify that nvme is installed and it is in your PATH (or specified in config) to gather vendor specific attributes: provided path does not exist: []
Critical_Warning                  : '0x00' -> '0'
Temperature_Celsius               : '47 Celsius' -> '47'
Available_Spare                   : '100%' -> '100'
Available_Spare_Threshold         : '10%' -> '10'
Percentage_Used                   : '0%' -> '0'
Data_Units_Read                   : '16�681�841 [8,54 TB]' -> '16681841'
Data_Units_Written                : '16�872�671 [8,63 TB]' -> '16872671'
error parsing Host_Read_Commands: '206�397�114': strconv.ParseInt: parsing "206\xa0397\xa0114": invalid syntax
error parsing Host_Write_Commands: '229�063�886': strconv.ParseInt: parsing "229\xa0063\xa0886": invalid syntax
Controller_Busy_Time              : '688' -> '688'
Power_Cycle_Count                 : '783' -> '783'
error parsing Power_On_Hours: '1�292': strconv.ParseInt: parsing "1\xa0292": invalid syntax
Unsafe_Shutdowns                  : '9' -> '9'
Media_and_Data_Integrity_Errors   : '0' -> '0'
Error_Information_Log_Entries     : '983' -> '983'
Warning_Temperature_Time          : '0' -> '0'
Critical_Temperature_Time         : '0' -> '0'
Temperature_Sensor_1              : '47 Celsius' -> '47'
Temperature_Sensor_2              : '74 Celsius' -> '74'
2022-03-31T19:01:15Z I! [agent] Hang on, flushing any cached metrics before shutdown
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Critical_Warning,serial_no=xxx raw_value=0i 1648753271000000000
smart_attribute,device=sdd,host=zak,id=194,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Temperature_Celsius,serial_no=xxx raw_value=47i 1648753271000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Available_Spare,serial_no=xxx raw_value=100i 1648753271000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Available_Spare_Threshold,serial_no=xxx raw_value=10i 1648753271000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Percentage_Used,serial_no=xxx raw_value=0i 1648753271000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Data_Units_Read,serial_no=xxx raw_value=16681841i 1648753271000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Data_Units_Written,serial_no=xxx raw_value=16872671i 1648753271000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Controller_Busy_Time,serial_no=xxx raw_value=688i 1648753271000000000
smart_attribute,device=sdd,host=zak,id=12,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Power_Cycle_Count,serial_no=xxx raw_value=783i 1648753271000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Unsafe_Shutdowns,serial_no=xxx raw_value=9i 1648753271000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Media_and_Data_Integrity_Errors,serial_no=xxx raw_value=0i 1648753271000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Error_Information_Log_Entries,serial_no=xxx raw_value=983i 1648753271000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Warning_Temperature_Time,serial_no=xxx raw_value=0i 1648753271000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Critical_Temperature_Time,serial_no=xxx raw_value=0i 1648753271000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Temperature_Sensor_1,serial_no=xxx raw_value=47i 1648753271000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Temperature_Sensor_2,serial_no=xxx raw_value=74i 1648753271000000000
smart_device,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,serial_no=xxx temp_c=47i,exit_status=0i,health_ok=true 1648753271000000000
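
The `\xa0` in those errors is a single byte that is not valid UTF-8 on its own, which would explain the replacement characters in the debug lines. A small Python sketch for identifying such a byte, assuming the smartctl output was encoded in Windows code page 1250 (a common default for Polish, but an assumption here). The helper `separator_names` is hypothetical:

```python
import unicodedata


def separator_names(raw, codepage):
    """Decode raw bytes with the given code page and name each non-digit character."""
    text = raw.decode(codepage)
    return sorted({unicodedata.name(ch) for ch in text if not ch.isdigit()})


if __name__ == "__main__":
    raw = b"1\xa0292"  # bytes quoted in the Power_On_Hours error above
    print(separator_names(raw, "cp1250"))
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # A lone 0xA0 byte cannot start a UTF-8 sequence.
        print("not UTF-8:", err.reason)
```

Under that assumption the separator is a no-break space rather than an ordinary blank (0x20), so code that only handles the ordinary blank would still miss it.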

@zak-pawel
Collaborator

Italian (Switzerland):

C:\Users\pzak\Downloads\telegraf-1.23.0_31e3e06c_windows_amd64\telegraf-1.23.0>telegraf --config telegraf.conf
2022-03-31T19:12:29Z I! Starting Telegraf 1.23.0-31e3e06c
2022-03-31T19:12:29Z I! Loaded inputs: smart
2022-03-31T19:12:29Z I! Loaded aggregators:
2022-03-31T19:12:29Z I! Loaded processors:
2022-03-31T19:12:29Z I! Loaded outputs: file
2022-03-31T19:12:29Z I! Tags enabled: host=zak
2022-03-31T19:12:29Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"zak", Flush Interval:10s
2022-03-31T19:12:29Z W! [inputs.smart] nvme not found: verify that nvme is installed and it is in your PATH (or specified in config) to gather vendor specific attributes: provided path does not exist: []
Critical_Warning                  : '0x00' -> '0'
Temperature_Celsius               : '44 Celsius' -> '44'
Available_Spare                   : '100%' -> '100'
Available_Spare_Threshold         : '10%' -> '10'
Percentage_Used                   : '0%' -> '0'
Data_Units_Read                   : '16�681�860 [8.54 TB]' -> '16681860'
Data_Units_Written                : '16�872�882 [8.63 TB]' -> '16872882'
error parsing Host_Read_Commands: '206�397�906': strconv.ParseInt: parsing "206\x92397\x92906": invalid syntax
error parsing Host_Write_Commands: '229�071�956': strconv.ParseInt: parsing "229\x92071\x92956": invalid syntax
Controller_Busy_Time              : '688' -> '688'
Power_Cycle_Count                 : '783' -> '783'
error parsing Power_On_Hours: '1�292': strconv.ParseInt: parsing "1\x92292": invalid syntax
Unsafe_Shutdowns                  : '9' -> '9'
Media_and_Data_Integrity_Errors   : '0' -> '0'
Error_Information_Log_Entries     : '983' -> '983'
Warning_Temperature_Time          : '0' -> '0'
Critical_Temperature_Time         : '0' -> '0'
Temperature_Sensor_1              : '44 Celsius' -> '44'
Temperature_Sensor_2              : '69 Celsius' -> '69'
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Critical_Warning,serial_no=xxx raw_value=0i 1648753961000000000
smart_attribute,device=sdd,host=zak,id=194,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Temperature_Celsius,serial_no=xxx raw_value=44i 1648753961000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Available_Spare,serial_no=xxx raw_value=100i 1648753961000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Available_Spare_Threshold,serial_no=xxx raw_value=10i 1648753961000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Percentage_Used,serial_no=xxx raw_value=0i 1648753961000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Data_Units_Read,serial_no=xxx raw_value=16681860i 1648753961000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Data_Units_Written,serial_no=xxx raw_value=16872882i 1648753961000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Controller_Busy_Time,serial_no=xxx raw_value=688i 1648753961000000000
smart_attribute,device=sdd,host=zak,id=12,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Power_Cycle_Count,serial_no=xxx raw_value=783i 1648753961000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Unsafe_Shutdowns,serial_no=xxx raw_value=9i 1648753961000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Media_and_Data_Integrity_Errors,serial_no=xxx raw_value=0i 1648753961000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Error_Information_Log_Entries,serial_no=xxx raw_value=983i 1648753961000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Warning_Temperature_Time,serial_no=xxx raw_value=0i 1648753961000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Critical_Temperature_Time,serial_no=xxx raw_value=0i 1648753961000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Temperature_Sensor_1,serial_no=xxx raw_value=44i 1648753961000000000
smart_attribute,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,name=Temperature_Sensor_2,serial_no=xxx raw_value=69i 1648753961000000000
smart_device,device=sdd,host=zak,model=Samsung\ SSD\ 970\ EVO\ 1TB,serial_no=xxx exit_status=0i,health_ok=true,temp_c=44i 1648753961000000000

@powersj
Contributor

powersj commented Mar 31, 2022

Perfect, that looks like it fixes some of the fields.

error parsing Power_On_Hours: '1�292': strconv.ParseInt: parsing "1\xa0292": invalid syntax
error parsing Power_On_Hours: '1�292': strconv.ParseInt: parsing "1\x92292": invalid syntax

This explains why these fields are not showing up. I will look at this next.
