Memory Leak with procstat #6807

Closed · mafinn opened this issue Dec 17, 2019 · 31 comments · Fixed by #7641
Labels: area/procstat, bug (unexpected problem or unintended behavior)

mafinn commented Dec 17, 2019

Relevant telegraf.conf:

[global_tags]
  sc = "daf"
  p = "32"
  custom_version = "1.x"
  os_type = "win2016s"

[agent]
  interval = "1s"
  round_interval = false
  metric_batch_size = 1000
  metric_buffer_limit = 5000
  collection_jitter = "1s"
  flush_interval = "1s"
  flush_jitter = "1s"
  precision = "s"
  debug = true
  quiet = false
  logfile = "/Program Files/Telegraf/telegraf.log"
  hostname = "win2016s"
  omit_hostname = false

[[outputs.influxdb]]
  urls = [ "http://1.1.1.1:8086" ]
  database = "telegraf"
  retention_policy = "24hours"
  precision = "m"

[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = false

[[inputs.mem]]  

[[inputs.procstat]]
  interval = "1s"
  exe = ".*"
  pid_finder = "native"

[[inputs.internal]]

System info:

[screenshots: system info]

Steps to reproduce:

No special steps; the memory leak appears to be related to the procstat input plugin:

[screenshots: memory usage over time]

@danielnelson (Contributor)

It looks like there are about as many frees (16 MB) as allocs (16.1 MB). What is the query behind procstat.mean (process_name: telegraf.exe)?

danielnelson added the area/procstat, bug (unexpected problem or unintended behavior), and need more info labels on Dec 17, 2019

mafinn commented Dec 17, 2019

The drops in this graph are due to restarting telegraf for troubleshooting:
[screenshot]

Here I'm just trying to show how much memory accumulates over time:
[screenshot]


mafinn commented Dec 17, 2019

Eventually, telegraf will consume enough memory that it will be stopped:
[screenshot]


mafinn commented Dec 17, 2019

It's interesting to see the change in slope of memory growth before and after the restart at ~10am. I modified the config from exe = ".*" to 7 instances of procstat, each matching a unique process name (a config sketch follows the screenshots below). This also decreased the gather time from 17 seconds to 3 seconds.

[screenshot]

Before/After: [screenshot]

Before: [screenshot]

After: [screenshot]
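
For reference, a rough sketch of that kind of split, using the same agent settings as above; the process names here are placeholders, not the ones actually used:

[[inputs.procstat]]
  interval = "1s"
  pid_finder = "native"
  exe = "sqlservr"   # placeholder process name; one block per monitored process

[[inputs.procstat]]
  interval = "1s"
  pid_finder = "native"
  exe = "w3wp"       # placeholder process name

# ...and so on, one [[inputs.procstat]] block per process instead of a single exe = ".*"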

@danielnelson (Contributor)

I've been running this configuration for about 4 hours and I'm not seeing this pattern; however, I don't have any real load on the system.

Could you add the --pprof-addr=:6060 option when starting Telegraf and, after the process RSS doubles from startup, go to http://localhost:6060/debug/pprof/heap and attach that file?
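
For example, roughly like this (paths and filenames are illustrative; the --pprof-addr flag and the /debug/pprof/heap endpoint are the parts that matter):

  # start Telegraf with the profiling endpoint enabled
  .\telegraf.exe --config telegraf.conf --pprof-addr=:6060

  # once RSS has roughly doubled, save the heap profile (PowerShell)
  Invoke-WebRequest -Uri http://localhost:6060/debug/pprof/heap -OutFile heap.pprof

  # optional: inspect it locally with the Go toolchain
  go tool pprof -top heap.pprof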


mafinn commented Dec 18, 2019

@danielnelson can do. Can I send it directly to you?

@danielnelson (Contributor)

Yes, email address is on my profile page.

@danielnelson (Contributor)

Can you show a couple hours of internal_memstats sys_bytes?


mafinn commented Dec 18, 2019

Mostly fixed at 19.81 MB

24 hours: [screenshot]

3 hours: [screenshot]

@danielnelson (Contributor)

The RSS is rising but Go doesn't seem to know about it; I didn't see any interesting objects in the memory profiles either. Just thinking aloud, but could the memory be invisible to Go, or perhaps leaked in a DLL call? I'm not sure. I'm still unable to replicate it, even when spawning new processes on the system.

Do you know if this is a new issue with Telegraf 1.13? Could you compare against 1.12.6 and 1.11.5?


mafinn commented Dec 19, 2019

[screenshot]


mafinn commented Dec 19, 2019

[screenshot]

@danielnelson (Contributor)

For the record, here are my numbers with Telegraf 1.13.0 (Windows 7):

[screenshot: 2019-12-19-144203_1370x918_scrot]

At first it trends up, but it levels off around 55 MB RSS. It seems to take about 3 hours before it maxes out.


mafinn commented Dec 19, 2019

So far, looking across our different instances of Windows, I've only seen this occur on Windows Server 2016.

@danielnelson (Contributor)

Related? go-ole/go-ole#135


mafinn commented Dec 19, 2019

2016: [screenshot]

2012: [screenshot]

@danielnelson (Contributor)

What should I be looking at here?


mafinn commented Dec 19, 2019

Just trying to find a way to capture the difference in the DLLs in use between 2016 and non-2016 systems.

@danielnelson (Contributor)

I'm going to try installing WMF 5.1 to see if it causes the error as suggested in go-ole/go-ole#135 (comment).


mafinn commented Dec 20, 2019

Quick update: I installed WMF 5.1 on a Windows 2012 (non-R2) box, and it didn't cause the memory leak.

@danielnelson (Contributor)

Same on my Windows 7 system; WMF 5.1 had no effect.

@danielnelson (Contributor)

Still not reproducing the leak with a Windows 10 Pro VM:

[screenshot: 2020-01-08-220402_1332x874_scrot]

@danielnelson (Contributor)

I have been able to reproduce this on a Windows 2016 VM running in Azure. Will update if I can find a way to reduce or eliminate the leaked memory.


romanblachman commented Feb 24, 2020

This continues to be an issue with the latest Windows Server 2016: WMI-based metrics cause a memory leak. We have been using WMI queries and are able to work around this by calling CoInitializeEx only once per thread, but it seems that Telegraf's leak scales with the number of WMI metrics, probably because CoInitializeEx is called for every query. Has anyone submitted a bug to Microsoft about this?

[screenshot]
In the attached screenshot, the win_proc measurement for telegraf (the Working Set Private metric) increases to 100 MB within 7 days.

@romanblachman

@danielnelson how does win_proc access WMI metrics? We fixed the issue in our code by calling CoInitializeEx once per thread instead of once per WMI query, which avoids the leak when using WMI metrics.
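
For what it's worth, here is a minimal sketch of that per-thread pattern using the go-ole bindings; the worker/queue structure is purely illustrative and is not Telegraf's actual code:

package wmiutil

import (
	"runtime"

	ole "github.com/go-ole/go-ole"
)

// wmiWorker drains a queue of WMI queries on a single OS thread, initializing
// COM once for the thread's lifetime instead of once per query.
func wmiWorker(queries <-chan string, run func(query string)) {
	// Pin the goroutine to an OS thread so the COM apartment stays valid.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	// Initialize COM a single time for this thread.
	if err := ole.CoInitializeEx(0, ole.COINIT_MULTITHREADED); err != nil {
		return
	}
	defer ole.CoUninitialize()

	// Every query issued from this thread reuses the same initialization.
	for q := range queries {
		run(q)
	}
}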

danielnelson added this to the planned milestone on Mar 27, 2020
@romanblachman

@danielnelson Looks like Datadog had the same issue with their WMI sampler (DataDog/integrations-core#3987), which clearly points to a Windows 2016 memory leak when CoInitialize is called for each WMI query.

After reviewing the Telegraf code, it seems like you rely on the win_pdh library for the actual Win32 calls, and I couldn't find the call to CoInitialize, so I'm not sure how to help.


Jaeyo commented May 14, 2020

Any update, @danielnelson?

@danielnelson (Contributor)

Can you retest with this build telegraf-1.15.0~d78dfac1_windows_amd64.zip?

@romanblachman

Will do!

@romanblachman

Looks like this has been resolved; the WMI leak is gone on Windows Server 2016!

48 hours running with 1.12.6: [screenshot]

24 hours running with 1.15.0: [screenshot]

Thank you, @danielnelson.

What's the ETA on 1.15.0?

@danielnelson (Contributor)

Great news, thanks for testing.

I expect 1.15.0 to be released sometime in the first half of July.
