Telegraf issues with Windows multi instance processes #1827

Closed
willemdh opened this issue Sep 29, 2016 · 12 comments · Fixed by #2352
Labels
bug unexpected problem or unintended behavior help wanted Request for community participation, code, contribution platform/windows

Comments

@willemdh

Bug report

I noticed in Grafana that the % Processor Time reported for one Windows process on one server was incorrect. The process, UserMonitor, was shown as using 100 % CPU almost all the time.

image

Double-checking with perfmon on the Windows server showed that this process was actually using only about 8 % CPU.

image

The config in Grafana:

image

And a query showing the UserMonitor process data coming in as 100%:

image

Relevant telegraf.conf:

  [[inputs.win_perf_counters.object]]
    ObjectName = "Process"
    Counters = ["% Processor Time","Pool Nonpaged Bytes","Pool Paged Bytes","Working Set - Private"]
    Instances = ["*"]
    Measurement = "win_processes"
    IncludeTotal=true

System info:

Telegraf 1.0.0
Windows Server 2012 R2

@sparrc sparrc added bug unexpected problem or unintended behavior help wanted Request for community participation, code, contribution platform/windows labels Sep 29, 2016
@sparrc sparrc added this to the Future Milestone milestone Sep 29, 2016
@willemdh
Author

I'm seeing more abnormal Windows process results. One of my colleagues pointed out, for example, that Telegraf.exe was shown in Grafana as using more than 10 % CPU on some servers. I double-checked this on the server and it was only using 1 % CPU.
Is someone able to reproduce these issues?

@francoishill

When there are multiple processes with the same name, it seems to pick only one of them. My config is as follows:

  [[inputs.win_perf_counters.object]]
    # Process metrics, in this case for IIS only
    ObjectName = "Process"
    Counters = ["% Processor Time","Handle Count","Private Bytes","Thread Count","Virtual Bytes","Working Set"]
    # Instances = ["gitkraken"]
    Instances = ["*"]
    Measurement = "win_proc"
    IncludeTotal=true

@willemdh
Author

willemdh commented Jan 11, 2017

Posted this problem on Google forums, but no answer. https://groups.google.com/forum/#!topic/influxdb/k2njOfflGmo

It seems there is currently no way to monitor processes which have multiple instances. Although perfmon is able to show data for each process instance, Telegraf fails to separate processes with the same name.

Changing the title of this issue for more clarity.

@willemdh willemdh changed the title Telegraf Windows process CPU usage sometimes incorrect? Telegraf issues with multi instance processes? Jan 11, 2017
@willemdh willemdh changed the title Telegraf issues with multi instance processes? Telegraf issues with Windows multi instance processes? Jan 11, 2017
@sparrc
Contributor

sparrc commented Jan 11, 2017

That does appear to be the issue. PRs would be appreciated if anyone could find a solution for tagging the process with the Windows instance #. In unix we would call this a PID (Process ID), not sure what it's referred to in the Windows world.

@francoishill

francoishill commented Jan 11, 2017

@sparrc, in Windows it is also called PID.
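
One note on tagging: the #N suffix that shows up in the instance names in this thread (svchost#1, svchost#2) looks like an index that perfmon assigns to same-named processes, not a PID. A minimal Go sketch of separating the two parts of such a name, in case someone wants a separate tag for the index. The helper name splitInstance is hypothetical and is not part of Telegraf.

```go
package main

import (
	"strconv"
	"strings"
)

// splitInstance separates a perfmon-style instance name such as "svchost#2"
// into its base name and numeric index. A name without a "#N" suffix is
// treated as index 0. Hypothetical helper, not Telegraf code.
func splitInstance(instance string) (base string, index int) {
	i := strings.LastIndex(instance, "#")
	if i < 0 {
		return instance, 0
	}
	n, err := strconv.Atoi(instance[i+1:])
	if err != nil || n < 0 {
		// The "#" belongs to the name itself and is not an index suffix.
		return instance, 0
	}
	return instance[:i], n
}
```

With this, "svchost#2" yields the base "svchost" and index 2, while "spoolsv" yields "spoolsv" and index 0.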

@sparrc sparrc changed the title Telegraf issues with Windows multi instance processes? Telegraf issues with Windows multi instance processes Jan 11, 2017
@discoduck2x

+1 on getting this solved !

@bullshit
Contributor

Hi, I managed to debug it on Windows, and it turned out that the filledBuf variable, after UTF16PtrToString(c.SzName), contains only a few characters. For example, I configured my instance to be chrome#1 and c.SzName was ch. That is why it is never added: with metric.instance == s, the name must match exactly.
For now I can't tell whether the problem is in the second PdhGetFormattedCounterArrayDouble call or in UTF16PtrToString, and with the syscalls and pointer casting I don't think I'll get any deeper information on it.

@bullshit
Contributor

bullshit commented Mar 20, 2017

Hi, could anybody test my changes? I would really appreciate your help: #2352 https://github.com/bullshit/telegraf/tree/fix-1827-win_perf_counters

@lucadistefano

@bullshit, I tried the fix and it seems to work.
This is the output with the following config:

[[inputs.win_perf_counters]]
	PrintValid=true

	[[inputs.win_perf_counters.object]]
		ObjectName = "Process"
		Counters = ["Handle Count"]
		Instances = ["spoolsv","svchost","svchost#1","svchost#2"]
		Measurement = "win_proc1"
C:\Program Files\Telegraf>ptelegraf.exe --debug --config test-instancedash.conf
2017-03-23T19:46:51Z D! Attempting connection to output: influxdb
2017-03-23T19:46:51Z D! Successfully connected to output: influxdb
2017-03-23T19:46:51Z I! Starting Telegraf (version dev-92-g616b66f)
2017-03-23T19:46:51Z I! Loaded outputs: influxdb
2017-03-23T19:46:51Z I! Loaded inputs: inputs.win_perf_counters
2017-03-23T19:46:51Z I! Tags enabled: host=WIN-MONTEST
2017-03-23T19:46:51Z I! Agent Config: Interval:5s, Quiet:false, Hostname:"WIN-MONTEST", Flush Interval:10s
Valid: \Process(spoolsv)\Handle Count
Valid: \Process(svchost)\Handle Count
Valid: \Process(svchost#1)\Handle Count
Valid: \Process(svchost#2)\Handle Count
win_proc1,instance=spoolsv,objectname=Process,host=WIN-MONTEST Handle_Count=368 1490298415000000000
win_proc1,objectname=Process,host=WIN-MONTEST,instance=svchost Handle_Count=458 1490298415000000000
win_proc1,instance=svchost#1,objectname=Process,host=WIN-MONTEST Handle_Count=359 1490298415000000000
win_proc1,instance=svchost#2,objectname=Process,host=WIN-MONTEST Handle_Count=613 1490298415000000000
win_proc1,instance=spoolsv,objectname=Process,host=WIN-MONTEST Handle_Count=368 1490298420000000000
win_proc1,instance=svchost,objectname=Process,host=WIN-MONTEST Handle_Count=458 1490298420000000000
win_proc1,objectname=Process,host=WIN-MONTEST,instance=svchost#1 Handle_Count=359 1490298420000000000
win_proc1,instance=svchost#2,objectname=Process,host=WIN-MONTEST Handle_Count=613 1490298420000000000
win_proc1,instance=svchost,objectname=Process,host=WIN-MONTEST Handle_Count=458 1490298425000000000
win_proc1,objectname=Process,host=WIN-MONTEST,instance=svchost#1 Handle_Count=359 1490298425000000000
win_proc1,instance=svchost#2,objectname=Process,host=WIN-MONTEST Handle_Count=615 1490298425000000000
win_proc1,instance=spoolsv,objectname=Process,host=WIN-MONTEST Handle_Count=368 1490298425000000000

2017-03-23T19:47:05Z D! Output [influxdb] buffer fullness: 12 / 1000 metrics.
2017-03-23T19:47:05Z D! Output [influxdb] wrote batch of 12 metrics in 5.0058ms
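
Note that the tag order differs from line to line in the output, so anyone checking it by script should look the instance tag up by key rather than by position. A rough Go sketch of that check, assuming no escaped commas or spaces in the measurement or tags (true for the lines above, not for line protocol in general). The function name instanceTag is made up for this example.

```go
package main

import "strings"

// instanceTag extracts the value of the "instance" tag from one line of
// InfluxDB line protocol. Simplified: it assumes no escaped commas or
// spaces in the measurement name, tag keys, or tag values.
func instanceTag(line string) (string, bool) {
	// The first space separates "measurement,tags" from the fields.
	head, _, ok := strings.Cut(line, " ")
	if !ok {
		return "", false
	}
	parts := strings.Split(head, ",")
	// parts[0] is the measurement name; the rest are key=value tags.
	for _, kv := range parts[1:] {
		k, v, found := strings.Cut(kv, "=")
		if found && k == "instance" {
			return v, true
		}
	}
	return "", false
}
```

Run against the lines above, this returns svchost, svchost#1 and svchost#2 as separate values, which is what confirms that the instances are no longer collapsed into one.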

By the way, the following issues may be related to this one: #2210, #1546

thanks,
Luca

@danielnelson
Contributor

Reopening based on #2879 (comment)

@barbarajoost

It seems that the problem still exists. I have this issue on a Server 2008 R2 with Telegraf 1.8.

Windows sees

process
process#1
process#2
process#3

but in Grafana I found only process.

@danielnelson
Contributor

This issue should be mostly solved by enabling wildcard expansion:

[[inputs.win_perf_counters]]
  UseWildcardsExpansion = true

  [[inputs.win_perf_counters.object]]
    ObjectName = "Process"
    Instances = ["chrome*"]
    Counters = [
      "% Processor Time"
    ]

This option has the side effect of localizing the counter names, which is usually unwanted. That is discussed in a separate issue, #4280.
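
As a rough illustration of what the chrome* pattern is meant to select, here is a Go sketch that applies a glob to a list of instance names. It only shows the pattern semantics using the standard library's path.Match; it is not how Telegraf or Windows performs the expansion, and expandInstances is a hypothetical name.

```go
package main

import "path"

// expandInstances returns the instance names that match a glob pattern.
// Illustration only: it relies on path.Match semantics, where "*" matches
// any run of characters other than "/".
func expandInstances(pattern string, available []string) ([]string, error) {
	var out []string
	for _, name := range available {
		ok, err := path.Match(pattern, name)
		if err != nil {
			// The pattern itself is malformed.
			return nil, err
		}
		if ok {
			out = append(out, name)
		}
	}
	return out, nil
}
```

Given chrome, chrome#1, chrome#2 and svchost, the pattern chrome* selects the three chrome instances and leaves svchost out.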
