Memory Leak with procstat #6807

Closed · mafinn opened this issue Dec 17, 2019 · 31 comments · Fixed by #7641
Labels: area/procstat, bug (unexpected problem or unintended behavior)

mafinn commented Dec 17, 2019

Relevant telegraf.conf:

[global_tags]
  sc = "daf"
  p = "32"
  custom_version = "1.x"
  os_type = "win2016s"

[agent]
  interval = "1s"
  round_interval = false
  metric_batch_size = 1000
  metric_buffer_limit = 5000
  collection_jitter = "1s"
  flush_interval = "1s"
  flush_jitter = "1s"
  precision = "s"
  debug = true
  quiet = false
  logfile = "/Program Files/Telegraf/telegraf.log"
  hostname = "win2016s"
  omit_hostname = false

[[outputs.influxdb]]
  urls = [ "http://1.1.1.1:8086" ]
  database = "telegraf"
  retention_policy = "24hours"
  precision = "m"

[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = false

[[inputs.mem]]  

[[inputs.procstat]]
  interval = "1s"
  exe = ".*"
  pid_finder = "native"

[[inputs.internal]]

System info:

[screenshots: system info]

Steps to reproduce:

No special steps; the memory leak appears to be related to the procstat input plugin:

[screenshots: memory usage over time]

@danielnelson (Contributor)

It looks like there are about as many frees (16 MB) as allocs (16.1 MB). What is the query behind procstat.mean (process_name: telegraf.exe)?

danielnelson added the area/procstat, bug (unexpected problem or unintended behavior), and need more info labels on Dec 17, 2019

mafinn commented Dec 17, 2019

The drops in this graph are due to restarting telegraf for troubleshooting:
[screenshot]

Here I'm just trying to show how much memory accumulates over time:
[screenshot]


mafinn commented Dec 17, 2019

Eventually, telegraf will consume enough memory that it will be stopped:
[screenshot]


mafinn commented Dec 17, 2019

It's interesting to see the change in slope of memory growth before and after the restart at ~10am. I modified the config from exe = ".*" to 7 instances of procstat, each matching a unique process name (a config sketch follows the screenshots below). This also decreased the gather time from 17 seconds to 3 seconds.

[screenshot]

Before/After: [screenshot]

Before: [screenshot]

After: [screenshot]
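
For reference, a rough sketch of that kind of split, using the same agent settings as above; the process names here are placeholders, not the ones actually used:

[[inputs.procstat]]
  interval = "1s"
  pid_finder = "native"
  exe = "sqlservr"   # placeholder process name; one block per monitored process

[[inputs.procstat]]
  interval = "1s"
  pid_finder = "native"
  exe = "w3wp"       # placeholder process name

# ...and so on, one [[inputs.procstat]] block per process instead of a single exe = ".*"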

@danielnelson (Contributor)

I've been running this configuration for about 4 hours and I'm not seeing this pattern; however, I don't have any real load on the system.

Could you add the --pprof-addr=:6060 option when starting Telegraf and, after the process RSS doubles from startup, go to http://localhost:6060/debug/pprof/heap and attach that file?
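
For example, roughly like this (paths and filenames are illustrative; the --pprof-addr flag and the /debug/pprof/heap endpoint are the parts that matter):

  # start Telegraf with the profiling endpoint enabled
  .\telegraf.exe --config telegraf.conf --pprof-addr=:6060

  # once RSS has roughly doubled, save the heap profile (PowerShell)
  Invoke-WebRequest -Uri http://localhost:6060/debug/pprof/heap -OutFile heap.pprof

  # optional: inspect it locally with the Go toolchain
  go tool pprof -top heap.pprof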


mafinn commented Dec 18, 2019

@danielnelson can do. Can I send it directly to you?

@danielnelson (Contributor)

Yes, email address is on my profile page.

@danielnelson (Contributor)

Can you show a couple hours of internal_memstats sys_bytes?


mafinn commented Dec 18, 2019

Mostly fixed at 19.81 MB

24 hours: [screenshot]

3 hours: [screenshot]

@danielnelson (Contributor)

The RSS is rising but Go doesn't seem to know about it; I didn't see any interesting objects in the memory profiles either. Just thinking aloud, but could the memory be invisible to Go, or perhaps leaked in a DLL call? I'm not sure. I'm still unable to replicate it, even when spawning new processes on the system.

Do you know if this is a new issue with Telegraf 1.13? Could you compare against 1.12.6 and 1.11.5?


mafinn commented Dec 19, 2019

[screenshot]


mafinn commented Dec 19, 2019

[screenshot]

@danielnelson (Contributor)

For the record, here are my numbers with Telegraf 1.13.0 (Windows 7):

[screenshot: 2019-12-19-144203_1370x918_scrot]

At first it trends up, but it levels off around 55 MB RSS. It seems to take about 3 hours before it maxes out.


mafinn commented Dec 19, 2019

So far, looking across our different instances of Windows, I've only seen this occur on Windows Server 2016.

@danielnelson (Contributor)

Related? go-ole/go-ole#135


mafinn commented Dec 19, 2019

2016: [screenshot]

2012: [screenshot]

@danielnelson (Contributor)

What should I be looking at here?


mafinn commented Dec 19, 2019

Just trying to find a way to capture the difference in the DLLs in use between 2016 and non-2016 systems.

@danielnelson (Contributor)

I'm going to try installing WMF 5.1 to see if it causes the error as suggested in go-ole/go-ole#135 (comment).


mafinn commented Dec 20, 2019

Quick update: I installed WMF 5.1 on a Windows 2012 (non-R2) box, and it didn't cause the memory leak.

@danielnelson (Contributor)

Same on my Windows 7 system; WMF 5.1 had no effect.

@danielnelson (Contributor)

Still not reproducing the leak with a Windows 10 Pro VM:

[screenshot: 2020-01-08-220402_1332x874_scrot]

@danielnelson (Contributor)

I have been able to reproduce this on a Windows 2016 VM running in Azure. Will update if I can find a way to reduce or eliminate the leaked memory.


romanblachman commented Feb 24, 2020

This continues to be an issue with the latest Windows Server 2016: WMI-based metrics cause a memory leak. We have been using WMI queries and are able to work around this by calling CoInitializeEx only once per thread, but it seems that Telegraf's leak scales with the number of WMI metrics, probably because CoInitializeEx is called for every query. Has anyone submitted a bug to Microsoft about this?

[screenshot]
In the attached screenshot, the win_proc measurement for telegraf (the Working Set Private metric) increases to 100 MB within 7 days.

@romanblachman

@danielnelson how does win_proc access WMI metrics? We fixed the issue in our code by calling CoInitializeEx once per thread instead of once per WMI query, which avoids the leak when using WMI metrics.
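
For what it's worth, here is a minimal sketch of that per-thread pattern using the go-ole bindings; the worker/queue structure is purely illustrative and is not Telegraf's actual code:

package wmiutil

import (
	"runtime"

	ole "github.com/go-ole/go-ole"
)

// wmiWorker drains a queue of WMI queries on a single OS thread, initializing
// COM once for the thread's lifetime instead of once per query.
func wmiWorker(queries <-chan string, run func(query string)) {
	// Pin the goroutine to an OS thread so the COM apartment stays valid.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	// Initialize COM a single time for this thread.
	if err := ole.CoInitializeEx(0, ole.COINIT_MULTITHREADED); err != nil {
		return
	}
	defer ole.CoUninitialize()

	// Every query issued from this thread reuses the same initialization.
	for q := range queries {
		run(q)
	}
}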

danielnelson added this to the planned milestone on Mar 27, 2020
@romanblachman

@danielnelson Looks like Datadog had the same issue with their WMI sampler (DataDog/integrations-core#3987), which clearly points to a Windows 2016 memory leak when CoInitialize is called for each WMI query.

After reviewing the Telegraf code, it seems like you rely on the win_pdh library for the actual Win32 calls, and I couldn't find the call to CoInitialize, so I'm not sure how to help.


Jaeyo commented May 14, 2020

Any update, @danielnelson?

@danielnelson (Contributor)

Can you retest with this build telegraf-1.15.0~d78dfac1_windows_amd64.zip?

@romanblachman

Will do!

@romanblachman

Looks like this has been resolved; the WMI leak is gone on Windows Server 2016!

48 hours running with 1.12.6: [screenshot]

24 hours running with 1.15.0: [screenshot]

Thank you, @danielnelson.

What's the ETA on 1.15.0?

@danielnelson (Contributor)

Great news, thanks for testing.

I expect 1.15.0 to be released sometime in the first half of July.
