
High memory usage on 0.16.0 #813

Closed
jesus-velez opened this issue Jul 1, 2021 · 54 comments · Fixed by #891

Comments

@jesus-velez

jesus-velez commented Jul 1, 2021

Hello!

I have a couple of servers reporting high memory usage by the windows_exporter. I am using the default configuration/collectors. Is there any way to limit the resource usage of the exporter?

[screenshot: windows_exporter memory usage]

Exporter version:
windows_exporter, version 0.16.0 (branch: master, revision: f316d81d50738eb0410b0748c5dcdc6874afe95a) build user: appveyor-vm\appveyor@appveyor-vm build date: 20210225-10:52:19 go version: go1.15.6 platform: windows/amd64

OS Name: Microsoft Windows Server 2016 Standard
OS Version: 10.0.14393 N/A Build 14393

Handles NPM(K) PM(K)    WS(K)    CPU(s)     Id    SI ProcessName
1260    142    12748860 12496424 110,102.69 11388 0  windows_exporter

@breed808
Contributor

breed808 commented Jul 2, 2021

How long has the exporter been running on the affected hosts? When initially started, how much memory does the exporter use?

Graphing go_memstats_alloc_bytes, go_memstats_alloc_bytes_total and go_memstats_frees_total for this host should provide more info.

I'd also recommend navigating to http://hostname:9182/debug/pprof/heap and uploading the output here.
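
For reference, a minimal sketch of grabbing that profile programmatically so nothing re-encodes it along the way (the hostname and output filename below are placeholders):

// Sketch only: download the heap profile from the exporter's pprof endpoint
// and write it to disk unmodified.
package main

import (
	"io"
	"net/http"
	"os"
)

func main() {
	resp, err := http.Get("http://hostname:9182/debug/pprof/heap")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("heap.pprof")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		panic(err)
	}
}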

@jesus-velez
Author

Thanks for following up with me. The box has been up for 49 days, so at least that long. Is there any metric that records that information?

go_memstats_alloc_bytes

[screenshot]

go_memstats_alloc_bytes_total

[screenshot]

go_memstats_frees_total

[screenshot]

heap.zip

@datamuc

datamuc commented Jul 7, 2021

I'm seeing a similar thing, but we are on 0.15.
The go_memstats_* metrics look pretty normal to me:

go_memstats.*bytes are very constant during the week:
[screenshot]

go_memstats_alloc_bytes and frees_total:

[screenshots]

What goes through the roof is windows_process_private_bytes:
[screenshot]

And for what it's worth, pprof/heap and pprof/allocs:
pprof.zip

@jesus-velez
Author

I have both versions deployed, and 0.15 also displays the same behavior. Whatever I can do to help, please let me know, @breed808.

@breed808
Contributor

breed808 commented Jul 8, 2021

Thanks all, this is really helpful info. I'll get through the heap/alloc dumps, and will try to reproduce the issue.

@datamuc

datamuc commented Jul 9, 2021

It doesn't happen on every machine, but when it does, it seems to be the service collector that is leaking memory.

In general the windows_exporter leaks memory anyway; it's just that with the service collector it can leak a huge amount.

@jesus-velez
Author

@breed808, I don't know what @datamuc's experience is, but it seems like mostly servers with SQL Server installed get this issue. I have the exporter on application servers and database servers, and I have yet to notice anything on the application servers. Another point to add: I have seen it on both VMs and physical servers.

@breed808
Contributor

Unfortunately the heap dumps aren't too helpful: they're showing similar results to the go_memstats_alloc_bytes_total metric. @datamuc I wasn't able to view the dumps you provided, could you re-generate and upload them? Yours are ASCII text files, while the dump @jesus-velez submitted was gzip compressed data.

Are we able to confirm if the excessive memory consumption occurs when disabling all collectors that use WMI as a metric source? These are currently cpu_info, fsrmquota, msmq, mssql, remote_fx, service, system, terminal_services, and thermalzone

@datamuc

datamuc commented Jul 14, 2021

Unfortunately the heap dumps aren't too helpful: they're showing similar results to the go_memstats_alloc_bytes_total metric. @datamuc I wasn't able to view the dumps you provided, could you re-generate and upload them? Yours are ASCII text files, while the dump @jesus-velez submitted was gzip compressed data.

OK, I had accessed the URLs with a browser, which gave me the text files. Now I used Invoke-WebRequest and got the binary data:

pprof.zip

I did some more tests; in every case there was an infinite PowerShell loop running that hit the /metrics endpoint.

In the zip there are two directories; host2 contains the debug/pprof data of a host that doesn't leak as much. I took two snapshots in every test.

host1 is a host with a pretty severe leak. pprof1 and pprof2 were taken with the following config:

collectors:
    enabled: "cpu,cs,logical_disk,net,os,system,textfile,memory,logon,tcp,terminal_services,thermalzone,iis,process"

and pprof3 and pprof4 were taken with:

collectors:
    enabled: "cs,logical_disk,net,os,textfile,memory,logon,tcp,iis,process"

Are we able to confirm if the excessive memory consumption occurs when disabling all collectors that use WMI as a metric source? These are currently cpu_info, fsrmquota, msmq, mssql, remote_fx, service, system, terminal_services, and thermalzone

I've disabled all the mentioned collectors (pprof3/4) but it was still leaking.

windows_process_private_bytes while the endless loop was running:
[screenshot]

And here without the loop, just normal prometheus scraping:

[screenshot]

@breed808
Contributor

Hmm, those heap dumps are similar to those submitted earlier, with only a few MB allocated 😞

@datamuc from previous comments, it appears the service collector may be responsible, but host1 in your previous comment doesn't have it enabled?
It would be worth identifying the collectors responsible for the leaking. Are you free to do that?

I'll see if I can identify the commit or exporter version where the leaking was introduced.

@datamuc

datamuc commented Jul 15, 2021

Hmm, those heap dumps are similar to those submitted earlier, with only a few MB allocated 😞

@datamuc from previous comments, it appears the service collector may be responsible, but host1 in your previous comment doesn't have it enabled?
It would be worth identifying the collectors responsible for the leaking. Are you free to do that?

I'll see if I can identify the commit or exporter version where the leaking was introduced.

Yes, we have deactivated the service collector globally because it contributed a lot to the leaks. It is better now, but still not good, so my guess is that it has something to do with the number of metrics returned? The service collector has a lot of metrics.

I can do some more testing tomorrow I guess.

@datamuc

datamuc commented Jul 15, 2021

The config:

log:
  level: fatal
collectors:
  enabled: "process"
collector:
  process:
    whitelist: "windows_exporter|exporter_exporter"
    blacklist: ""

And I ran this PowerShell script for an hour or so:

# Hammer the /metrics endpoint in a loop and print the exporter's own
# windows_process_private_bytes value from each scrape.
while($true) {
   $request = Invoke-Webrequest -URI http://127.77.1.1:5758/metrics
   $line = $request.RawContent -split "[`r`n]" | select-string -Pattern 'windows_process_private_bytes.*windows_exporter'
   ($line -split " ")[-1]
}

That resulted in the following graph of windows_process_private_bytes:
[screenshot]

So if it is a collector, then the process collector is one of them.

@breed808
Contributor

I've been able to reproduce this with the script. The memory leak is present in the process collector prior to 4f89133, so I'm inclined to believe it may be an exporter-wide leak that isn't specific to one collector.

Alternatively, it may be a leak on the Windows side when certain WMI and/or Perflib queries are submitted. The Go memory stats aren't showing any leaks, which isn't too helpful.

I'll continue testing to see if I can identify the commit that introduced the leak.

@datamuc

datamuc commented Jul 20, 2021

I did some more testing. If I only enable the textfile collector, then there is no leak.

I tried to investigate a bit. I've seen that the windows_exporter uses a library from the perflib_exporter to access most of the performance values. Maybe both implementations, perflib and the old WMI one, are leaking?

What follows may be wrong; I've just tried to reason about the code:

I've looked into what telegraf does, because it does something similar. They open pdh.dll and call functions from that library. Before they open a new query, they close the old one?

The Close call leads to this code, where they explicitly mention freeing some memory.

Then I looked into the perflib_exporter; it comes with a perflib package that provides access to the performance counters. Unlike telegraf, they are not using pdh.dll, because it is too high level?

Anyway, it reads from syscall.HKEY_PERFORMANCE_DATA to get the counters. So I googled a bit and found this page:

To obtain performance data from a remote system, call the RegConnectRegistry function. Use the computer name of the remote system and use HKEY_PERFORMANCE_DATA as the key. This call retrieves a key representing the performance data for the remote system. Use this key rather than HKEY_PERFORMANCE_DATA key to retrieve the data.

Be sure to use the RegCloseKey function to close the handle to the key when you are finished obtaining the performance data. This is important for both the local and remote cases:

  • RegCloseKey(HKEY_PERFORMANCE_DATA) does not actually close a registry handle, but it clears all cached data and frees the loaded performance DLLs.
  • RegCloseKey(hkeyRemotePerformanceData) closes the handle to the remote machine's registry.

But I cannot find the word "close" or "Close" anywhere in perflib_exporter's codebase. So I thought, hooray, I've found the culprit, and started a perflib_exporter, which doesn't seem to leak at all. Long story short, I still have no clue what is going on. I think it is still possible that the perflib_exporter uses the perflib library a bit differently? Like reusing some data structures, while the windows_exporter always asks for a new one?

@datamuc

datamuc commented Sep 13, 2021

Does anybody have an idea? We are considering moving to telegraf and scraping that for the Windows metrics.

@jesus-velez
Author

My priorities have shifted for the time being, but I have to come back to this issue pretty soon. In the meantime I resorted to scheduling a daily exporter restart. I can say that this issue has been around since the wmi_exporter days: I recently logged into a server that we somehow missed in our exporter upgrades, and the memory usage was really high.

@carlpett
Collaborator

@datamuc That looks like a very good find, and even if the perflib exporter appears not to leak, I'd argue it is still incorrect not to close the key. It might require a bit of locking to do this right, though; otherwise I think overlapping scrapes might lead to issues.
I'm not sure when I could commit to trying this out, but if anyone else wants to give it a go without the locking bits, I think we'd basically just need to add defer syscall.RegCloseKey(syscall.HKEY_PERFORMANCE_DATA) here.
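
For anyone who wants to try that, here is a rough sketch of where the call would sit in a perflib-style query function (illustrative only, not the actual perflib_exporter code; the function name, buffer handling, and sizes are made up):

// Sketch of the suggested cleanup: read raw counter data from
// HKEY_PERFORMANCE_DATA and always clear the cached data afterwards.
package perflibsketch

import "syscall"

func queryPerfData(object string) ([]byte, error) {
	// Per the Windows docs, this does not close a real registry handle; it
	// clears cached performance data and unloads the performance DLLs.
	defer syscall.RegCloseKey(syscall.HKEY_PERFORMANCE_DATA)

	name, err := syscall.UTF16PtrFromString(object)
	if err != nil {
		return nil, err
	}

	size := uint32(1 << 20) // start with 1 MiB and grow until the data fits
	for {
		buf := make([]byte, size)
		bufLen := size
		err := syscall.RegQueryValueEx(
			syscall.HKEY_PERFORMANCE_DATA, name, nil, nil, &buf[0], &bufLen)
		if err == syscall.ERROR_MORE_DATA {
			size *= 2
			continue
		}
		if err != nil {
			return nil, err
		}
		return buf[:bufLen], nil
	}
}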

@datamuc

datamuc commented Oct 8, 2021

@carlpett I've added the mentioned line and compiled a binary. The change didn't help anything regarding the leak. 😞

@datamuc

datamuc commented Oct 14, 2021

What do you think? Should we try to bring this issue up in the Prometheus & The Ecosystem Community Meeting? Maybe somebody there can help or knows somebody who can help?

@datamuc

datamuc commented Oct 20, 2021

Are we able to confirm if the excessive memory consumption occurs when disabling all collectors that use WMI as a metric source? These are currently cpu_info, fsrmquota, msmq, mssql, remote_fx, service, system, terminal_services, and thermalzone

This is not quite true; other collectors also use WMI. I've started the exporter with

windows_exporter.exe --collectors.enabled=memory,os,net,time,cpu,cache

And this doesn't leak at all. So I'm now pretty sure that it leaks somewhere in github.com/StackExchange/wmi or github.com/go-ole/go-ole. The first one is definitely unmaintained; I'm not so sure about the second one.
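
To illustrate the pattern those collectors use, here is a minimal sketch of a query through github.com/StackExchange/wmi (simplified; the exporter wraps this in its own helper, and the struct below only mirrors two fields of the real Win32_OperatingSystem class):

// Sketch only: build a WQL query from a struct and run it via
// github.com/StackExchange/wmi, which goes through go-ole COM calls
// under the hood.
package main

import (
	"fmt"

	"github.com/StackExchange/wmi"
)

// Win32_OperatingSystem mirrors a subset of the WMI class of the same name.
type Win32_OperatingSystem struct {
	FreePhysicalMemory     uint64
	TotalVisibleMemorySize uint64
}

func main() {
	var dst []Win32_OperatingSystem
	q := wmi.CreateQuery(&dst, "") // SELECT ... FROM Win32_OperatingSystem
	if err := wmi.Query(q, &dst); err != nil {
		panic(err)
	}
	for _, entry := range dst {
		fmt.Println(entry.FreePhysicalMemory, entry.TotalVisibleMemorySize)
	}
}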

@breed808
Contributor

That's a good find! I'll see if I can identify which of those libraries is the cause of the leak.

datamuc added a commit to datamuc/perflib_exporter that referenced this issue Nov 11, 2021
this should fix a leak in windows_exporter:

https://docs.microsoft.com/en-us/windows/win32/perfctrs/using-the-registry-functions-to-consume-counter-data

> Be sure to use the RegCloseKey function to close the handle to the key
> when you are finished obtaining the performance data. This is
> important for both the local and remote cases:

prometheus-community/windows_exporter#813 (comment)
@geraudster

geraudster commented Nov 18, 2021

Hi, same problem here.
So I built a new package with @datamuc's commit on perflib and ran some tests (100 req/s on /metrics for 30 minutes) to compare both versions; the configured collectors are cpu,cs,logical_disk,net,os,system,textfile,process.

The patched version (the curve on the right) seems to do better than v0.16.0 (the one on the left) in terms of memory usage:
[screenshot]

But a leak is still there. Do you have any other ideas to fix this?

Edit: the test was performed on the master branch, so only the process collector was using StackExchange/wmi, which explains the lower memory usage. After removing the process collector, memory usage is constant at around 20-30 MB.

@datamuc

datamuc commented Nov 19, 2021

Hi, same problem here. So I built a new package with @datamuc's commit on perflib and ran some tests (100 req/s on /metrics for 30 minutes) to compare both versions; the configured collectors are cpu,cs,logical_disk,net,os,system,textfile,process.

But a leak is still there. Do you have any other ideas to fix this?

We removed the process and service collectors from our configuration (and added the tcp collector, so it is: [defaults] - service + tcp), and now the memory usage is stable. It seems that every collector that makes use of github.com/StackExchange/wmi leaks.

leoluk pushed a commit to leoluk/perflib_exporter that referenced this issue Dec 4, 2021
this should fix a leak in windows_exporter:

https://docs.microsoft.com/en-us/windows/win32/perfctrs/using-the-registry-functions-to-consume-counter-data

> Be sure to use the RegCloseKey function to close the handle to the key
> when you are finished obtaining the performance data. This is
> important for both the local and remote cases:

prometheus-community/windows_exporter#813 (comment)
@datamuc

datamuc commented Dec 5, 2021

The patch above was merged into the perflib_exporter library with
leoluk/perflib_exporter#34.

Can somebody take care of updating the dependency in windows_exporter, please?

@jesus-velez
Author

@datamuc Thanks for all the work that you are putting in on this.

@jesus-velez
Author

@datamuc

I found that telegraf was having a similar issue on Windows Server 2016. influxdata/telegraf#6807 (comment)

I don't know enough to be able to diagnose this myself, but looking at github.com/StackExchange/wmi, they are indeed using CoInitializeEx. What are your thoughts about this?

@datamuc

datamuc commented Jan 13, 2022

It sounds promising to look deeper, but to be honest I have no idea. I was just lucky in finding the leak in perflib_exporter; I have no idea about Windows-related programming at all, and I only know a little bit of Go...

@audriuz

audriuz commented Jan 20, 2022

The same issue has kept happening in our environment since we rolled the exporter out; the latest windows_exporter version didn't solve it. I was not able to identify which exact software, feature, or service makes this happen on some random servers, but some of our domain controllers (Windows Server 2016) are affected for sure. Those servers are impacted heavily; see the attachment.
[screenshot]
It looks scary, but those gigabytes of memory are mostly taken not from physical RAM but from virtual memory. As mentioned above in this conversation, it seems that removing the "process" and "service" collectors solves the issue, or at least improves the situation significantly. Even removing the "service" collector alone improves the situation a lot; please check the other attachment.
[screenshot]

@YoByMe

YoByMe commented May 24, 2022

Same problem here... I have 2 servers with this problem.
[screenshot]
Any news?

@breed808
Contributor

It's been a while since I last looked at this, but I think disabling the WMI queries in the process collector by default may improve the situation.

Would anyone be able to test the branch in #998 to see if the process collector continues to experience memory leaks?

@exe-r

exe-r commented Sep 9, 2022

I'm experiencing a similar issue with leaks on the latest 0.19.0, on Windows Server 2016. Is there any permanent fix yet, or should we disable the process collector? @breed808, I think in my case #998 does not solve the issue, because I have two servers now, one which does not leak and one that does, and collector.process.iis is enabled on both.

I have the following enabled on both:

collectors:
  enabled: cpu,cs,logical_disk,net,os,service,system,memory,tcp,vmware,process,iis

The only difference is collector.process.whitelist; the affected server has more processes included.

Non-affected:

    whitelist: "xService.?|windows_exporter.?"

Affected:

    whitelist: "xService.?|windows_exporter.?|app.+|antivirus.+|w3wp|Scheduler|xConnector|Ccm.+|xClient|inetinfo|.+agent|.+Agent"

I will try to remove a few processes and see if it improves.

Image of the leak:

[screenshot]

@breed808
Contributor

It's been some time since I last looked at this, but I believe my intention with #998 was to remove a potential leak in the process collector.
I suspect we have multiple leaks in each of the collectors. A fix for one won't resolve them all.

I've reopened #998 in #1062 if anyone would like to test the branch.

@exe-r

exe-r commented Sep 14, 2022

It's been some time since I last looked at this, but I believe my intention with #998 was to remove a potential leak in the process collector. I suspect we have multiple leaks in each of the collectors. A fix for one won't resolve them all.

I've reopened #998 in #1062 if anyone would like to test the branch.

@breed808 I made a few tests, and the most stable config was disabling the scheduled_task collector completely. It's been running OK for 1 day; I will report back if that changes. We only monitor 2 scheduled tasks, so I'm not sure why such an aggressive leak is happening.

# working config
collectors:
  enabled: cpu,cs,logical_disk,net,os,service,system,memory,tcp,vmware,process,iis,netframework_clrexceptions
collector:
  process:
    whitelist: "xService.?|windows_exporter.?|app.+|antivirus.+|w3wp|Scheduler|xConnector|Ccm.+|xClient|inetinfo|.+agent|.+Agent"
  service:
    services-where: Name LIKE 'appname%'

@exe-r

exe-r commented Sep 28, 2022

Coming back to the test, @breed808. After 2 weeks of having 6 servers running the above config, I can confirm we are stable at around 40-50 MB of memory usage with no leaks. As soon as we enable the scheduled_task collector, the leak is very aggressive.

7-day usage example:
[screenshot]

@audriuz

audriuz commented Sep 28, 2022

We're not using the "scheduled_task" collector; our main problem with memory growth here is the "service" collector. Only some of the servers are affected. I tried versions 0.18.1 and 0.19.0 with the same results. After removing the "service" collector there's a huge difference; check the screenshot:

[screenshot]

@breed808
Contributor

The scheduled_task memory leak has been addressed in #1080. I'll try to investigate the service collector when there's time.

@breed808
Contributor

I've not been able to reproduce the service collector memory leak. The graph below shows resident memory with the service collector enabled and curl hitting the metrics endpoint constantly:

@kryztoval

@breed808 I can't replicate it reliably, but I am able to replicate it. I am seeing a bit of a trend where one of the instances is getting more and more memory as time passes. This started exactly when I added a Raspberry Pi as a Prometheus server and targeted the windows_exporter with it to collect data every second. I am also hitting the same windows_exporter every second with a local instance of Prometheus and a netdata collector on another Raspberry Pi. Having a single instance did not cause the memory leak, but having both made it start happening. My guess would be that something is leaving the connection open, or it is forcefully closed without letting the server clean up.

So for me to reproduce it, I had to hit the windows_exporter more than 3 times per second at the very least.

@linuxlzj

On Windows 2016, I also find that using WMI queries leaks memory.

github-actions bot commented Nov 25, 2023

This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.

github-actions bot added the Stale label on Nov 25, 2023
@kryztoval

This is not stale; the process leaks memory, especially (and faster) if it is being polled more than 3 times per second, for whatever reason.

github-actions bot removed the Stale label on Nov 25, 2023
@datamuc

datamuc commented Nov 29, 2023

Just an idea:

One option that might relieve the situation would be for the exporter to have a flag like --max-memory-consumption and monitor itself; if it reaches that limit, it just restarts itself, or something like that.
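
A very rough sketch of that idea (nothing like this exists in the exporter today; the flag name comes from the suggestion above, the threshold is arbitrary, and a real version would have to watch the process's private/working-set bytes rather than the Go runtime's own stats, since the leaked memory lives outside the Go heap):

// Sketch only: a watchdog goroutine that exits the process once memory use
// crosses a configurable limit, relying on the service manager to restart it.
package main

import (
	"flag"
	"log"
	"os"
	"runtime"
	"time"
)

func watchMemory(limitBytes uint64) {
	var m runtime.MemStats
	for range time.Tick(30 * time.Second) {
		runtime.ReadMemStats(&m)
		// m.Sys only covers memory the Go runtime obtained from the OS; a
		// real implementation would read the process's private bytes instead.
		if m.Sys > limitBytes {
			log.Printf("memory limit exceeded (%d > %d bytes), exiting so the service manager can restart us", m.Sys, limitBytes)
			os.Exit(1)
		}
	}
}

func main() {
	limit := flag.Uint64("max-memory-consumption", 200<<20, "self-restart above this many bytes (hypothetical flag)")
	flag.Parse()
	go watchMemory(*limit)
	select {} // stand-in for the exporter's normal HTTP serving loop
}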

@jkroepke
Member

jkroepke commented Nov 30, 2023

Can someone provide some details?

Version of windows_exporter

Output of

as file attachment.

If possible, generate a trace:

curl -o trace.out http://localhost:9182/debug/pprof/trace?seconds=30

@datamuc

datamuc commented Dec 1, 2023

We already had this here: #813 (comment)

But it doesn't help, because the memory is lost somewhere when interacting with the Windows API. We found the leak in perflib, but there is at least another one in WMI. I can tell because the leaking stops if I disable all the collectors that make use of WMI:

#813 (comment)

@jkroepke
Member

jkroepke commented Dec 1, 2023

@datamuc I saw that pprof dumps already exist.

However, they are from 0.16, and I would like to start at least from the latest releases. In the meantime, we have included the perflib libraries in windows_exporter and no longer depend on external contributors (https://github.com/prometheus-community/windows_exporter/tree/master/pkg/perflib)

@datamuc

datamuc commented Dec 1, 2023

In the meantime, we have included the perflib libraries in windows_exporter and no longer depend on external contributors (https://github.com/prometheus-community/windows_exporter/tree/master/pkg/perflib)

Interesting, I will try to come up with something next week.

@datamuc

datamuc commented Dec 5, 2023

In the meantime, we have included the perflib libraries in windows_exporter and no longer depend on external contributors (https://github.com/prometheus-community/windows_exporter/tree/master/pkg/perflib)

Interesting, I will try to come up with something next week.

Sorry, I don't have access to many Windows machines anymore. The two that are left don't leak. So if somebody still has leaks, they'll have to do #813 (comment) themselves.


github-actions bot commented Mar 5, 2024

This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.

github-actions bot added the Stale label on Mar 5, 2024
github-actions bot closed this as not planned on Apr 4, 2024