Add plugin for SMART data from disk #1880

Closed
bricewge opened this issue Oct 11, 2016 · 20 comments
Labels
feature request Requests for new plugin and for new features to existing plugins
Comments

@bricewge

It would be nice to have a plugin that supports SMART data for monitoring disk health. Other collectors, such as collectd and Diamond, already have a plugin for it.

@sparrc
Contributor

sparrc commented Oct 11, 2016

Looks like Diamond just parses the output of smartctl -A: https://github.com/python-diamond/Diamond/blob/master/src/collectors/smart/smart.py. It should be a fairly simple plugin to write.

@sparrc sparrc added help wanted Request for community participation, code, contribution plugin request labels Oct 11, 2016
@j-vizcaino
Contributor

j-vizcaino commented Nov 15, 2016

@sparrc Be aware that the Diamond implementation lacks proper support for SAS drives (at least in my personal experience with Dell hardware). Since SAS drives are enterprise-grade disks, I think this should be taken into consideration.

@sparrc
Contributor

sparrc commented Nov 15, 2016

@j-vizcaino do you have any tips or docs for supporting SAS drives?

to me it seems reasonable to say that we would only support drives which smartctl can profile. I believe this is the only reasonable option because we can't make the build system depend on a third-party C library.

@j-vizcaino
Contributor

j-vizcaino commented Nov 15, 2016

@sparrc The problem does not come from smartctl. smartctl can talk to SAS drives perfectly. The thing is, S.M.A.R.T. attributes from SATA drives have a properly defined list of entries, whereas SAS drives tend to have no real normalization. I have working Python code that handles SAS, SATA HDD and SATA SSD drives. I will try to find some time to port it to Go and submit a PR.

@j-vizcaino
Contributor

To further help the discussion, I have uploaded 4 examples of output generated by smartctl for different kinds of drives.
Outputs can be found here: https://gist.github.com/j-vizcaino/092f43d6c45e347919e37f287a4e5264
You can see that SATA drives have a nice, well formatted output (that's the info the Diamond scraper is expecting), whereas SAS output is more diverse...

@sebito91
Contributor

sebito91 commented Jan 24, 2017

@j-vizcaino, @bricewge I have an initial version I'm about to submit as a PR (Linux-only, sorry). From your examples I'm pulling things like the read/write/verify error counters, etc. I'm not pulling in any of the self-test data; is that something people would care about?

Also, in the last two examples you provided, which data matters? The threshold information?

Here's some sample output...

NOTE: data tested with include set to "/dev/bus/0 -d megaraid,8"
[root@localhost sebtest]# ./telegraf --test --config /etc/telegraf/telegraf_seb.conf --input-filter smartctl
* Plugin: inputs.smartctl, Collection 1
> smartctl,name=/dev/bus/0_megaraid_8,transport=SAS,host=localhost,vendor=HGST,product=HUC101212CSS600,block_size=512,writeback=Disabled,env=production,read_cache=Enabled,sr=metrics,dc=carf,bu=linux,serial=L0G77T9G,rpm=10000,cls=server,trd=false ecc_corr_delay_write=0,total_err_corr_write=0,data_write=9372.49,ecc_corr_delay_verify=2,ecc_reverify=0,current_temp=26,ecc_reread=0,corr_algo_verify=412069,uncorr_err_verify=0,total_err_corr_read=0,data_read=41322.953,ecc_corr_fast_read=0,corr_algo_read=573068,uncorr_err_read=0,ecc_corr_fast_write=0,ecc_corr_fast_verify=0,data_verify=42.444,health=1,max_temp=85,corr_algo_write=10933,uncorr_err_write=0,total_err_corr_verify=2,ecc_corr_delay_read=0,ecc_rewrite=0 1485206631000000000

@j-vizcaino
Contributor

@sebito91 My best guess would be that the SATA SMART attributes to be extracted should be configurable per drive. For example, SSDs have the remaining_lifetime_perc information, which is crucial and needs to be extracted. HDDs do not have it, obviously.
Extracting attributes should be easy to implement, since the ID, Attribute_Name and Raw Value columns in SATA drive output have a predictable format.
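
To illustrate the point about the predictable SATA format, here is a minimal sketch that pulls ID, attribute name and raw value out of a `smartctl -A` style table. The sample text and its values are made up for illustration, and `parse_attributes` is a hypothetical helper, not code from Diamond or from any Telegraf plugin. It assumes the usual ten-column ATA layout and only keeps the leading integer of the raw value, so raw values in other formats would need extra handling.

```python
import re

# Illustrative text modelled on the ATA attribute table printed by
# `smartctl -A`. The numbers are invented, not read from a real drive.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       4242
194 Temperature_Celsius     0x0022   036   045   000    Old_age   Always       -       36
"""

# ID, name, hex flag, six columns we skip (VALUE .. WHEN_FAILED),
# then the leading integer of RAW_VALUE.
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]+\s+"
    r"(?:\S+\s+){6}(?P<raw>\d+)"
)


def parse_attributes(text):
    """Return {attribute_id: (attribute_name, raw_value)} for every table row."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:  # the header and any non-table lines simply do not match
            attrs[int(m.group("id"))] = (m.group("name"), int(m.group("raw")))
    return attrs


if __name__ == "__main__":
    for attr_id, (name, raw) in sorted(parse_attributes(SAMPLE).items()):
        print(attr_id, name, raw)
```

Keying the result by numeric ID rather than by name is a deliberate choice here, since the thread notes that smartctl may print a generic name for attributes it does not recognise.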

@sebito91
Contributor

sebito91 commented Jan 24, 2017

Interesting, I'm not seeing remaining lifetime in any of the standard attributes. In your SSD example it shows up in the SMART self-test results, but not in any of the standard attributes.

Here is an example of what I'm working from...

@j-vizcaino
Contributor

j-vizcaino commented Jan 24, 2017

@sebito91 I have updated my Gist (mentioned above) with an additional example taken from my local SSD. If you compare its attributes with the ones from the Intel SSD, you will see that information about the lifecycle of the drive is expressed using different attributes.

My point is: this is far too chaotic to hardcode the attributes of interest for the end user, because they are vendor (and model) dependent. Therefore, this should be configurable.

Also, you may find this link of interest. It helped me figure out what an ID refers to when smartctl prints Unknown_SSD_Attribute (for example).

@sebito91
Contributor

Hmm, fair 'nuff...maybe the v2 version of the plugin. Would be good to get SOMETHING up and running and then take the plugin to the next level later.

@sparrc
Contributor

sparrc commented Jan 24, 2017

@j-vizcaino it would be impossible to use configuration options to be able to parse arbitrary blocks of text. In that case the user should write their own shell/python/ruby script to parse and use the exec plugin.

If it's possible we should parse the most common formats similar to the existing collectd & diamond collectors.
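
For readers taking the script route suggested here, a minimal sketch of the output side: formatting already-parsed values as InfluxDB line protocol, the same format as the sample output earlier in the thread, so that a script run through the exec plugin (assuming it is configured with the influx data format) can print it. The measurement name `smart`, the `device` tag and the helper names are hypothetical, and the escaping covers only the basic cases.

```python
def escape_key(value):
    """Backslash-escape commas, equals signs and spaces in tag/field keys and tag values."""
    out = str(value)
    for ch in (",", "=", " "):
        out = out.replace(ch, "\\" + ch)
    return out


def format_field(value):
    """Render a field value: booleans, integers (with 'i' suffix), floats, quoted strings."""
    if isinstance(value, bool):  # must be checked before int, since bool is an int subclass
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(measurement, tags, fields, timestamp_ns=None):
    """Build one line: measurement,tag=value field=value [timestamp]."""
    tag_part = "".join(
        f",{escape_key(k)}={escape_key(v)}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{escape_key(k)}={format_field(v)}" for k, v in sorted(fields.items())
    )
    line = f"{measurement}{tag_part} {field_part}"
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line


if __name__ == "__main__":
    # Hypothetical values; a real script would fill these from smartctl output.
    print(to_line_protocol("smart", {"device": "/dev/sda"}, {"temperature": 36}))
```

Tags and fields are sorted only to make the output deterministic.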

@j-vizcaino
Contributor

j-vizcaino commented Jan 24, 2017

@sparrc As far as SATA is concerned, this should be manageable using regular expressions and a configured attribute ID. All the user has to configure is, for example, the ID number, the expected name, and the target tag.
Using that, one would build a regexp matching ^\s+<id>\s+<attr>\s+[^ ]+0x\d+\s+([^ ]+\s+){6}(\d+) and extract the raw value of the configured attribute using the capturing parentheses.
Example of config entry: id=194, attr=Temperature_Celsius, tag=temperature

@sparrc
Contributor

sparrc commented Jan 24, 2017

Sorry @j-vizcaino, but I disagree. I would absolutely reject any plugin that required users to build a regex that looked like that. If you need absolute configurability then why not write/build/script your own plugin?

In version 1.3 we'll also be adding the ability to have external plugins that are built independently of telegraf so that will make it even easier for users to write custom plugins.

@j-vizcaino
Contributor

j-vizcaino commented Jan 25, 2017

I realize I wasn't clear in my previous comment. I completely agree that no user should have to write such a regexp in order for the SMART plugin to extract SATA attributes. I was suggesting that the plugin use the given template regexp to build a regexp for each attribute that the user needs to extract.

For example:
In the config file:

{ "temperature": { "id": 194, "attr": "Temperature_Celsius" } }

The plugin would generate the regexp ^\s+(194)\s+(Temperature_Celsius)\s+[^ ]+0x\d+\s+([^ ]+\s+){6}(\d+) and try to match it against the output of smartctl -a. If it matches, it adds a temperature field with the extracted value.
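
A rough sketch of this idea, assuming the JSON config shape shown above: the plugin fills a regexp template per configured attribute and the user never sees the pattern. Note that the template below is adapted rather than copied from the comment: it allows a row with no leading whitespace (three-digit IDs start at column 0), accepts hex digits in the FLAG column, and drops the stray `[^ ]+` before `0x`. The sample text is invented, and `build_patterns` and `extract_fields` are hypothetical names.

```python
import json
import re

# Illustrative text modelled on `smartctl -A` output; values are invented.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       4242
194 Temperature_Celsius     0x002f   036   045   000    Old_age   Always       -       36
"""

# Same config shape as in the comment above.
CONFIG = json.loads('{ "temperature": { "id": 194, "attr": "Temperature_Celsius" } }')

# Adapted template (an assumption, see the note above). The doubled braces
# survive str.format() as the literal quantifier {6}.
TEMPLATE = r"^\s*{id}\s+{attr}\s+0x[0-9a-fA-F]+\s+(?:\S+\s+){{6}}(\d+)"


def build_patterns(config):
    """Compile one regexp per configured field name."""
    return {
        field: re.compile(
            TEMPLATE.format(id=int(spec["id"]), attr=re.escape(spec["attr"]))
        )
        for field, spec in config.items()
    }


def extract_fields(config, smartctl_output):
    """Return {field_name: raw_value} for every configured attribute that matches."""
    patterns = build_patterns(config)
    fields = {}
    for line in smartctl_output.splitlines():
        for field, pattern in patterns.items():
            m = pattern.match(line)
            if m:
                fields[field] = int(m.group(1))
    return fields


if __name__ == "__main__":
    print(extract_fields(CONFIG, SAMPLE))
```

Matching line by line keeps a pattern from spanning two rows, and attributes that are configured but absent from the output are simply left out of the result.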

@sebito91
Contributor

FYI @j-vizcaino, @sparrc... I'm committing the plugin as-is today. If you take a look and let me know your thoughts (and any additions), we can bake them in.

Remember, this is a first-pass at a plugin to satisfy the request. We can always build on it from here!

@danielnelson danielnelson removed the help wanted Request for community participation, code, contribution label Apr 18, 2017
@liquidox

So sad this was pushed to 1.4.0 but still anxiously waiting for SMART support in Telegraf!

@stemwinder

Yeah a bit of a disappointment. I'll probably just write my own Python script - I have plenty of those already anyway. Would be easier than moving everything over to a different collector.

@sebito91
Contributor

sebito91 commented May 30, 2017 via email

@danielnelson danielnelson added feature request Requests for new plugin and for new features to existing plugins and removed plugin request labels Aug 12, 2017
@liquidox

Nooo moved to 1.5.0 now? Sad :(

@danielnelson danielnelson added this to the 1.5.0 milestone Oct 4, 2017
@danielnelson
Contributor

Thanks @rickard-von-essen and everyone who helped with this plugin, will be in the 1.5 release.

7 participants