adds an alternate windows performance counter input plugin #1629

Closed
wants to merge 1 commit into from


dzrw
Contributor

@dzrw dzrw commented Aug 12, 2016

After getting Telegraf installed as a Windows Service earlier today, I noticed that the win_perf_counters plugin was generating a large number of difficult-to-query series. So, I built this alternate input plugin that works to minimize the number of series generated and simplify queries.

From the README.md,

wpc vs win_perf_counters

The win_perf_counters plugin generates tags and fields using native Windows names. This can make it difficult to compare common measurements across heterogeneous environments because Windows names tend towards complexity. For example, on Windows the performance counter "\Processor(*)\%% User Time" is equivalent to the Linux metric "cpu.usage_user" - good luck displaying both series on the same plot in Grafana.

Additionally, win_perf_counters can generate a large number of series in an InfluxDB database due to the inclusion of the Windows Performance Counter Object Name (e.g. Processor, Processor Information, Memory, etc.) in the tag list. According to the Hardware Sizing Guidelines, series cardinality strongly affects the amount of RAM required by the InfluxDB server. Therefore, there is a risk that heavily instrumented Windows machines can unduly impact the provisioning requirements of the InfluxDB server simply due to the use of win_perf_counters.

The wpc plugin mitigates these two potential issues by making the field names of Performance Counter queries explicit, and by transparently regrouping fully-qualified Performance Counter queries by instance to minimize the number of points generated.
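
To illustrate the regrouping idea, here is a minimal Go sketch. It is not the plugin's actual code: the `counterQuery` type and the `instanceOf` and `groupByInstance` helpers are hypothetical names invented for this example, and the sampled values are made up. It assumes the instance is the parenthesised part of the object segment of a counter path, and it simply files counters without an instance under an empty instance name; how the real plugin treats that case is not stated here.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// counterQuery pairs an explicit field name with a fully-qualified
// performance counter path (hypothetical type, for illustration only).
type counterQuery struct {
	Field string
	Path  string
}

// instanceOf returns the instance part of a path such as
// `\Processor(_Total)\% User Time`. Paths without an instance,
// such as `\Memory\Available Bytes`, yield the empty string.
func instanceOf(path string) string {
	trimmed := strings.TrimPrefix(path, `\`)
	sep := strings.LastIndex(trimmed, `\`)
	if sep < 0 {
		return ""
	}
	object := trimmed[:sep] // e.g. "Processor(_Total)"
	open := strings.Index(object, "(")
	end := strings.LastIndex(object, ")")
	if open < 0 || end < open {
		return ""
	}
	return object[open+1 : end]
}

// groupByInstance folds sampled counter values into one field set per
// instance, so each instance becomes a single point with many fields.
func groupByInstance(queries []counterQuery, values map[string]float64) map[string]map[string]float64 {
	points := make(map[string]map[string]float64)
	for _, q := range queries {
		v, ok := values[q.Path]
		if !ok {
			continue // counter was not sampled; skip it
		}
		inst := instanceOf(q.Path)
		if points[inst] == nil {
			points[inst] = make(map[string]float64)
		}
		points[inst][q.Field] = v
	}
	return points
}

func main() {
	queries := []counterQuery{
		{"usage_user", `\Processor(_Total)\% User Time`},
		{"usage_idle", `\Processor(_Total)\% Idle Time`},
		{"available_bytes", `\Memory\Available Bytes`},
	}
	// Made-up sample values standing in for real counter readings.
	values := map[string]float64{
		`\Processor(_Total)\% User Time`: 12.5,
		`\Processor(_Total)\% Idle Time`: 80,
		`\Memory\Available Bytes`:        1024,
	}
	points := groupByInstance(queries, values)
	instances := make([]string, 0, len(points))
	for inst := range points {
		instances = append(instances, inst)
	}
	sort.Strings(instances)
	for _, inst := range instances {
		fmt.Printf("instance=%q fields=%v\n", inst, points[inst])
	}
}
```

Grouping on the instance alone is what lets counters from different objects share a point; the trade-off is that the object name is no longer available as a tag.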

I'm open to suggestions for improving it further.

Required for all PRs:

  • CHANGELOG.md updated
  • Sign CLA (if not already signed)
  • README.md updated (if adding a new plugin)

@sparrc
Contributor

sparrc commented Aug 12, 2016

@politician thank you for the contribution but I don't think I can merge this in its current state.

If you would like to change the windows perf counters, we will need to do a straight replacement. Windows support is still in an "experimental" state so currently it's OK to make breaking changes to the measurement schema.

FWIW, I completely support the idea behind this PR, but I was under the impression that windows users prefer the verbose and complicated names. This being said, we will need to open up a discussion and get input from other win_perf_counters users.

cc @TheFlyingCorpse @butitsnotme @ricardclau @steverweber @elvarb @cwegener @G-regL please let us know your thoughts on normalizing the win_perf_counters field and measurement names for simplicity of querying, and to be more similar to the schema of the linux plugins.

@G-regL
Contributor

G-regL commented Aug 12, 2016

...but I was under the impression that windows users prefer the verbose and complicated names.

I'm inclined to agree with you Cam. I'm a user of both platforms in my environment, and I'm happy to use the names provided by each.

I can't speak to the use of Influx as a TSDB, but with Graphite, I use a relay to rewrite metrics names I don't like into ones I do. I also use tagexclude on the agent end to drop some of the more redundant or useless tags.

I actually rather like the current plugin.

That said, if I had to make a change, it would be geared towards the source of the metrics.
Instead of the current library, which uses the Performance Data Helper, I'd move to the StackExchange WMI library so that you can collect the output of raw WMI calls to any class. Sadly I don't have sufficient Go-fu to change the current code-base to use that. Besides, it would be a huge breaking change and I'm not sure the value of such a large change is really there.

@sparrc
Contributor

sparrc commented Aug 12, 2016

@G-regL if you use the regular plugins (inputs.cpu, inputs.mem, etc.), those should use WMI.

The reason I decided to default to win_perf_counters on windows is that many other windows users told me (and there is plenty to read on the internet about this) that WMI is notorious for using large amounts of system resources itself.

@G-regL
Contributor

G-regL commented Aug 12, 2016

@sparrc
I'll admit that I haven't even tried the regular plugins on Windows, but it still doesn't offer the level of control over which metrics, from which classes are being pulled.

win_perf_counters does that better, but being able to build your own WQL queries to be run against the system would be the ultimate.

@dzrw
Contributor Author

dzrw commented Aug 12, 2016

@G-regL have you tried using a PowerShell script from the exec plugin?

@G-regL
Contributor

G-regL commented Aug 12, 2016

@politician, no. I hadn't thought of it, and now that I have, I think it would be slower than having something built-in. Something to test though I suppose.

@steverweber

steverweber commented Aug 12, 2016

We run a mixed environment of Mac, Windows, and Linux systems... it would be best if the metric names were uniform across all OS types. This will simplify queries to display the data... +1

@butitsnotme
Contributor

I am inclined to believe that telegraf should produce as close to the same set of metrics across all platforms as possible, including using the same names. This means less re-work of the data to be able to compare across platforms.

@politician I have a pull request in progress which will remove carriage returns allowing the data from a powershell script (or other program) to be processed on Windows. See pull request #1606.
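
For readers unfamiliar with the problem, the Go sketch below shows the general idea of dropping carriage returns from a program's output before parsing it. It is only an illustration and makes no claim about how pull request #1606 is implemented; `stripCarriageReturns` is a hypothetical helper name and the sample output is made up.

```go
package main

import (
	"bytes"
	"fmt"
)

// stripCarriageReturns removes every '\r' byte, so Windows-style
// "\r\n" line endings become plain "\n" before the output is parsed.
func stripCarriageReturns(out []byte) []byte {
	return bytes.ReplaceAll(out, []byte("\r"), nil)
}

func main() {
	// Made-up output of a script that emits CRLF line endings.
	raw := []byte("cpu usage_user=12.5\r\nmem available_bytes=1024\r\n")
	fmt.Printf("%q\n", stripCarriageReturns(raw))
}
```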

@dzrw
Contributor Author

dzrw commented Aug 12, 2016

Thanks for the quick replies - it seems like there are two camps forming: folks who like the original names or can post-process metrics, and folks who prefer uniform names or don't want to post-process metrics.

This discussion might be raising the need for general transform plugins that can alter/regroup metrics before they're emitted to an output plugin. But short of that sort of large change, here are a couple of other ideas that I was playing around with before settling on the wpc approach:

  • a dedicated iis plugin with support for a PreserveWindowsNames boolean (default: false)
  • a win_lua plugin that exposes PDH to Lua scripts. I tend to agree with @G-regL's observation that invoking PowerShell or WSH every 10s is probably really slow; however, Lua contexts can be cached.
  • writing a server that gathers and transforms PDH counters and makes them available to a telegraf TCP input (which kind of usurps the role of telegraf, but provides maximum flexibility to support my needs).

That said, I suppose I could make the field rewriting aspect optional. It doesn't sound like the series minimization code is contentious.
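
As a rough picture of what a general transform step could look like, here is a minimal Go sketch that rewrites field names on a metric before it reaches an output. Nothing here is an existing Telegraf interface: the `metric` type, the `renameFields` function, and the rule table are all hypothetical, and the Windows-style field names are illustrative.

```go
package main

import "fmt"

// metric is a simplified stand-in for a collected metric
// (hypothetical type, not Telegraf's own).
type metric struct {
	Name   string
	Tags   map[string]string
	Fields map[string]float64
}

// renameFields returns a copy of m whose field names are rewritten
// according to rules; fields without a rule keep their name.
// The Tags map is shared with the input, not copied.
func renameFields(m metric, rules map[string]string) metric {
	out := metric{
		Name:   m.Name,
		Tags:   m.Tags,
		Fields: make(map[string]float64, len(m.Fields)),
	}
	for name, v := range m.Fields {
		if newName, ok := rules[name]; ok {
			name = newName
		}
		out.Fields[name] = v
	}
	return out
}

func main() {
	// Illustrative rule table mapping Windows-style names to uniform ones.
	rules := map[string]string{
		"Percent_User_Time": "usage_user",
		"Percent_Idle_Time": "usage_idle",
	}
	in := metric{
		Name:   "win_cpu",
		Tags:   map[string]string{"instance": "_Total"},
		Fields: map[string]float64{"Percent_User_Time": 12.5, "Percent_Idle_Time": 80},
	}
	fmt.Println(renameFields(in, rules).Fields)
}
```

Keeping the rewrite as a separate, optional step would let both camps use the same input plugin: those who like the original names simply configure no rules.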

@steverweber

https://github.com/mozilla-services/heka
seems to have most of that already... perhaps should re-evaluate who is creating the wheel.

@dzrw
Contributor Author

dzrw commented Aug 12, 2016

@steverweber I'll admit to never getting heka to actually work. On Windows or Linux, even with the default "Hello, World" example. On the other hand, telegraf worked right out of the box.

@steverweber

i also found heka kinda frustrating to get working... that's why i'm here :) Seemed more simple.
However, it does seem telegraf is starting to be reworked to support some of the more advanced things that heka has.

@elvarb

elvarb commented Aug 12, 2016

Graphite powershell https://github.com/MattHodge/Graphite-PowerShell-Functions has this feature of renaming metrics, it's just in the powershell code but very easy to modify there.

It is one way of doing this, have telegraf rename metrics before they are sent.

Regarding unified naming conventions between platforms, I would be extremely cautious. Not all platforms report basic metrics in the same format, CPU load for example.

@sparrc
Contributor

sparrc commented Aug 15, 2016

@steverweber please keep on-topic, your opinion about merging telegraf & heka has been heard many times by the telegraf committee (of one).....I think you can guess by now that it's not going to happen.

@sparrc
Contributor

sparrc commented Aug 15, 2016

@politician what is an iis plugin?

I would support having the ability to specify arbitrary WQL statements

and lastly, remember that most of the regular system plugins work and produce the same names as the linux plugins (inputs.cpu, inputs.mem, etc). These were not made the default because WMI is resource-intensive.

if anyone has time & expertise to rewrite the code behind these to use windows perf counters instead of WMI, I'm sure that @shirou would appreciate it a lot: https://github.com/shirou/gopsutil

@dzrw
Contributor Author

dzrw commented Aug 15, 2016

@politician what is an iis plugin?

@sparrc A primary use case is monitoring the standard Windows HTTP server, IIS. So, I briefly considered building a dedicated plugin for it (cf. nginx, etc).

@sparrc
Contributor

sparrc commented Aug 15, 2016

My preference is for the format suggested here, but we would need to replace win_perf_counters rather than maintaining two plugins doing essentially the same thing.

Having a plugin-like interface for modifying metrics as they pass through the system is in the pipeline, and a high priority.

@ricardclau
Contributor

Sorry about the delay answering here

In our case, we never compare Linux and Windows metrics as they do completely different things in our setup so win_perf_counters is totally fine for us. I agree the names are a bit cumbersome but this is just how Windows stores them.

On the other hand, I agree, it is very difficult to show the same metric (even something as simple as Free Memory) for both Win and Linux hosts in the same Grafana dashboard.

If you ask me, I am happy with the way win_perf_counters plugin works but if you go ahead with this new plugin (which makes total sense, as Windows support is experimental) I would appreciate some comments with an easy migration guide for the telegraf.conf files.

We have hundreds of servers reporting metrics to our Grafana / InfluxDB setups and CI/CD pipelines to generate and install dashboards and this change can be a bit tricky for us :)

@toni-moreno
Contributor

Hi to everybody.

I would like to contribute in this discussion.

We are currently working with Graphite Powershell https://github.com/MattHodge/Graphite-PowerShell-Functions . We are now renaming metric names to something more user friendly.

We would also very much like to have this capability in telegraf. (I think it is really important for users like us who will need to migrate from Graphite PowerShell to telegraf in the future.)

We also need a way to get data from other Windows sources, like WMI. For example, we need to get the total physical memory in the system (not available in the native performance counters on Windows 2008/2012 servers).

Thank you very much.

@shirou
Contributor

shirou commented Aug 16, 2016

Hi all, gopsutil author here.

I noticed that lxn/win no longer uses cgo. Because gopsutil has a "pure golang" policy, I could not use lxn/win before, but that now seems to have changed.

I am thinking about changing gopsutil to use lxn/win. But if someone makes a PR, I would really appreciate it.

(Sorry not directly related to telegraf itself)

@dzrw
Contributor Author

dzrw commented Aug 16, 2016

@sparrc It sounds like there is a general consensus in favor of the following:

  1. We want an optional mechanism for rewriting metrics in an output plugin independent way.
  2. We want win_perf_counters to remain as a general tool for querying Windows Performance Counters.
  3. We want a different windows plugin that supports querying WMI.

There hasn't been enough discussion to develop a consensus around the following questions:

  1. Should we add support to coalesce related points?
  2. Should we change the configuration of win_perf_counters to use fully-qualified counter queries or leave it as is?

Having a plugin-like interface for modifying metrics as they pass through the system is in the pipeline, and a high-priority.

I'd love to take a look at the progress on this - can you point me to any commits?

@sparrc
Contributor

sparrc commented Aug 17, 2016

Should we add support to coalesce related points?

Can't this be done already via configuration?

Should we change the configuration of win_perf_counters to use fully-qualified counter queries or leave it as is?

I'm not sure... what would be the benefit? Can you provide an example of how that would look vs. the current plugin?

I'd love to take a look at the progress on this - can you point me to any commits?

there is none so far

@dzrw
Contributor Author

dzrw commented Aug 19, 2016

coalesce related points

I should have been more specific. The current plugin will coalesce points by objectname via configuration, but wpc goes further by discarding objectname and compressing queries based on instance alone. In this way, I can jam more data into each metric yet reduce series cardinality by, in some cases, 20%. That's important for me because I'm using InfluxDB.

Should we change the configuration of win_perf_counters to use fully-qualified counter queries or leave it as is?
I'm not sure... what would be the benefit? Can you provide an example of how that would look vs. the current plugin?

The proposed wpc plugin uses fully-qualified queries as a means to jam more metrics into fewer series. The current win_perf_counters plugin issues queries by Object Name which means that the same metrics are spread out over a larger number of series. This potentially complicates comparisons or requires post-processing to ameliorate. Fixing the issue at the source seemed like a cheap win.

The sample below is slightly modified from the README.md.

# A plugin to collect stats from Windows Performance Counters
[[inputs.wpc]]
  ## If the system being polled for data does not have a particular Counter at startup 
  ## of the Telegraf agent, it will not be gathered.
  # Prints all matching performance counters (useful for debugging)
  # PrintValid = false

  [[inputs.wpc.template]]
    # Processor usage, alternative to native.
    Counters = [
      # Use double-backslashes to work around a TOML parsing issue.
      [ "usage_idle", "\\Processor(_Total)\\%% Idle Time" ],
      [ "usage_user", "\\Processor(_Total)\\%% User Time" ],
      [ "usage_system", "\\Processor(_Total)\\%% Processor Time" ],
      [ "available_bytes", "\\Memory\\Available Bytes" ]
    ]
    Measurement = "win_system"
    # Print out when the performance counter is missing from object, counter or instance.
    # WarnOnMissing = false

The current win_perf_counters plugin cannot mix Memory and Processor objects into the same metric (without post-processing). There are more substantial gains to be had when querying the IIS & .NET performance counters. The number of series is negatively correlated with InfluxDB performance (in fact, the documentation says that it's exponential), so that seems like something to avoid.
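
To make the cardinality argument concrete, here is a small Go sketch that counts distinct series for a handful of points, with and without an `objectname` tag. It rests on a simplifying assumption that a series is identified by the measurement name plus its tag set, which may not match InfluxDB's real accounting. The `point` type, the helper names, the host name, and the use of an `instance` tag are all hypothetical choices made for this example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// point is a simplified metric: a measurement name plus a tag set.
type point struct {
	Measurement string
	Tags        map[string]string
}

// seriesKey builds a canonical key from the measurement and sorted tags.
// Assumption: one distinct key corresponds to one series.
func seriesKey(p point) string {
	keys := make([]string, 0, len(p.Tags))
	for k := range p.Tags {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	b.WriteString(p.Measurement)
	for _, k := range keys {
		b.WriteString("," + k + "=" + p.Tags[k])
	}
	return b.String()
}

// countSeries returns the number of distinct series keys in points.
func countSeries(points []point) int {
	seen := make(map[string]struct{})
	for _, p := range points {
		seen[seriesKey(p)] = struct{}{}
	}
	return len(seen)
}

// dropTag returns copies of the points with the named tag removed.
func dropTag(points []point, tag string) []point {
	out := make([]point, 0, len(points))
	for _, p := range points {
		tags := make(map[string]string, len(p.Tags))
		for k, v := range p.Tags {
			if k != tag {
				tags[k] = v
			}
		}
		out = append(out, point{Measurement: p.Measurement, Tags: tags})
	}
	return out
}

func main() {
	// Made-up points; host and tag values are illustrative.
	withObject := []point{
		{"win_system", map[string]string{"host": "web01", "objectname": "Processor", "instance": "_Total"}},
		{"win_system", map[string]string{"host": "web01", "objectname": "Processor Information", "instance": "_Total"}},
		{"win_system", map[string]string{"host": "web01", "objectname": "Memory", "instance": ""}},
	}
	fmt.Println("with objectname:", countSeries(withObject))
	fmt.Println("without objectname:", countSeries(dropTag(withObject, "objectname")))
}
```

In this toy data set the two `_Total` points collapse into one series once `objectname` is dropped, so three series become two. The actual saving depends on how many objects share instance names.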

@sparrc
Contributor

sparrc commented Aug 31, 2016

The current win_perf_counters plugin cannot mix Memory and Processor objects into the same metric (without post-processing). There are more substantial gains to be had when querying the IIS & .NET performance counters. The number of series is negatively correlated with InfluxDB performance (in fact, the documentation says that it's exponential), so that seems like something to avoid.

@politician putting all of your fields into a single measurement is not an encouraged way to set up your InfluxDB schema, and in fact fields do contribute to cardinality in InfluxDB. I believe the documentation might be a bit inaccurate because a "measurement" is considered the combination of the "measurement name" + "field name".

It's important to note that it's exponential, but where the exponent is between one and two.

So adding just a few more series to separate CPU and memory usage shouldn't have any significant impact.

The other consideration is that if the fields are part of the same series, then they can never be differentiated from each other, meaning that you can't separate out CPU usage of the various CPUs. It works if you are only differentiating based on the hostname, but it falls apart if you want any more granularity beyond that.

@sparrc
Contributor

sparrc commented Sep 5, 2016

I'm closing this for now as I don't want to merge duplicate plugins.

If there is something lacking in the current win_perf_counters plugin, the proper way to go about requesting/discussing changes would be to open an issue. Then we can come to a consensus over whether we can introduce breaking changes if they would be of use to the community.

@sparrc sparrc closed this Sep 5, 2016
@dzrw
Contributor Author

dzrw commented Sep 5, 2016

@sparrc Is there a contrib repository for plugins like this?

@sparrc
Contributor

sparrc commented Sep 5, 2016

not at the moment, no, Go doesn't have a very good facility for doing this unfortunately.
