windows_exporter service failed to start on reboot #551

Closed
f1-outsourcing opened this issue Jun 21, 2020 · 117 comments

@f1-outsourcing

f1-outsourcing commented Jun 21, 2020

After updates and rebooting the server, the windows_exporter service was not running

The windows_exporter service failed to start due to the following error:
The service did not respond to the start or control request in a timely fashion.

When I look at the recovery options of the windows_exporter service, they are not like those of other 'standard' Windows services. It looks like none of the others has 'Reset fail count after' set to 0 and 'Restart service after' set to 0.

exporter: [screenshot: exporter]

other examples: [screenshots: workstation, server, firewall, nla]

I am not really an expert on service recovery settings, but maybe someone should look at these. Maybe it would be better to set these delays to 3 or 5 minutes?

https://docs.microsoft.com/en-us/archive/blogs/jcalev/some-tricks-with-service-restart-logic
https://social.microsoft.com/Forums/ro-RO/3db76753-4607-4a20-97a0-790c73e379cc/the-actions-after-system-service-failure?forum=winserver8gen

@vvvvoid

vvvvoid commented Jun 25, 2020

> After updates and rebooting the server, the windows_exporter service was not running

+1, we had to restart the service after updates

I think startup type should be Automatic(Delayed start) instead of Automatic

@carlpett
Collaborator

@f1-outsourcing Note that those services have a "Subsequent failures" set to "Take no action", meaning it will simply stop trying if it fails to start twice. The first number, reset after, doesn't matter much when subsequent failures is set to restart. We could possibly set the restart interval to something higher to space restarts out, but before this report, we've never heard of this being a problem.
As I've asked in the other issue, any logs that can be found about why it is failing are crucial to solving this, rather than attempting to work around it by changing settings. The exporter really shouldn't need anything else to be running to be able to start, so without any indication of what is going wrong, we can't really troubleshoot it.

@daltonjprice

@carlpett I've also noticed this behavior. I'm able to reproduce this consistently by rebooting one of the servers I manage. I'd fully expect to see logs in the "Application" event queue from the source "windows_exporter" when the service fails to start, but I don't. All I see is the same thing reported by @f1-outsourcing. Events are created for the service failing to start due to a timeout.

It's also worth noting that I've seen this issue on pretty much all of the ~200 Windows machines we have.

See the following screenshots:

The service fails to start due to timeout:
[screenshot]

The service manager fails the service:
[screenshot]

The application event queue has no windows_exporter entries in this time period:
[screenshot]

Should I circumvent event viewer? I know stdout is a logger option but I didn't see an option to log to a flat file. If you've got some ideas for troubleshooting this I'd be willing to run whatever is needed. This issue has been quite troublesome for us during patching.

@babunatarajan

I do have the same issue on Windows Server 2016.
EventViewer Warning: " Collection timed out, still waiting for [cs os service] "
windows_exporter (version=0.13.0, branch=master, revision=c62fe4477fb5072e569abb44144b77f1c6154016)

@cb3inco

cb3inco commented Jun 27, 2020

Same issue here.

@bpickhardt

Same issue on Server 2019. Fresh installed machines running the windows_exporter agent do not start the agent on reboot. Playing with the automatic restart options did not resolve the issue.

@JDA88
Contributor

JDA88 commented Jul 3, 2020

> I think startup type should be Automatic(Delayed start) instead of Automatic

I agree. At the very least, the first and second failure should be set to "Restart the Service" with a delay of 1 minute.

@advorsky73

Same issue on Windows 8.1.

@carlpett
Collaborator

carlpett commented Jul 6, 2020

I've still been unable to reproduce this, unfortunately, so anything you can find about why it is happening on your systems, but not all, would be useful.
The only thing that can fail during startup in the exporter code is really where we bind to the network interface, so potentially if the network hasn't come up yet. That'd lead to the exporter exiting though, not a timeout...

@babunatarajan You seem to have a completely different issue, since your error is a timeout during metric collection from a running exporter.

@advorsky73

@carlpett If I see this correctly, it works with Delayed Start, so my best guess is that the windows_exporter service starts and immediately exits again on its first try, probably because a dependency is not yet available at that early stage of boot.
Maybe the network, I don't know. However, after installation the service shows no dependencies and no restart options are set, so one failure during start and it stays off, which is not good...
My suggestion: change the installer to make the service Automatic (Delayed Start) and set the restart options to reset the fail count after 1 day and retry every 5 minutes. Then this will work.

@babunatarajan

I already set Delayed Start as soon as the service failed to start at boot, but never really tested it because it is a prod environment.
Has anyone set Delayed Start and rebooted the server? If it works, we can keep this as a workaround.

Thanks

@bpickhardt

I set my servers to delayed start and it seemed to at least start correctly when Windows started up. I'm unsure if it would restart on failure correctly or not though.

@carlpett
Collaborator

carlpett commented Jul 6, 2020

There's a lot of different threads flying here, and a few misconceptions.
First off, regarding restarts. We already configure the service to restart on failure, and delay the restarts by five seconds. This is visible via `sc qfailure windows_exporter`, but the Services UI appears to only work with minutes, so it shows zero (it would probably make sense to bump this to 60 seconds to reduce confusion).
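
The zero in the Services UI is what a whole-minute display of a five-second delay would produce. A minimal Go sketch of that arithmetic, assuming the UI truncates a millisecond value to whole minutes (an assumption; the thread only says the UI appears to work in minutes). `delayShownInMinutes` is a hypothetical helper, not part of the exporter or of Windows:

```go
package main

import "fmt"

// delayShownInMinutes is a hypothetical helper: it converts a restart delay
// in milliseconds to whole minutes, discarding the remainder (assumed UI
// behaviour, not confirmed in the thread).
func delayShownInMinutes(delayMs int) int {
	return delayMs / 60000
}

func main() {
	// 5000 ms is the delay described above; 60000 ms is the proposed bump.
	for _, ms := range []int{5000, 60000} {
		fmt.Printf("%d ms is shown as %d minute(s)\n", ms, delayShownInMinutes(ms))
	}
}
```

Under that assumption a 5-second delay displays as 0 minutes, while a 60-second delay displays as 1 minute, which is why bumping the value would reduce confusion.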

Then, on the topic of Delayed starts. I'm not in principle against it (it will mean you will not have metrics for ~2 minutes longer than otherwise after a reboot, but that is probably not a huge deal in most cases), but there seems to be a mixed bag of experiences reported on whether it helps or not. I've now tried booting completely without networking and related services enabled, and it does not appear to prevent the windows_exporter from starting. So there's something deeper going on. Are any of you overriding the service account for the service, so you could have a dependency on Active Directory being available?

@bpickhardt

The 2019 machines I was seeing the problem on are AD joined and hardened with the CIS guidelines. Last year, when I was still using Windows Server 2012 R2 and an older version of the exporter, I never had issues and the service started correctly on reboot, so maybe it's a Server 2019 issue?

@SupraOva

SupraOva commented Jul 8, 2020

Hi everyone,

I was able to get past this issue by running these commands:

Delayed start

sc config windows_exporter start= delayed-auto

Restart option

sc failure windows_exporter actions= restart/60000/restart/60000/""/60000 reset= 86400

Tested on Windows Server 2012 R2 / 2016 / 2019.

Hope it helps.
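
The `actions=` value in the second command is a list of alternating action/delay fields separated by `/`, with delays in milliseconds, while `reset=` is in seconds. A minimal Go sketch that spells the value out; `describeActions` is a hypothetical helper written for illustration, not an existing tool:

```go
package main

import (
	"fmt"
	"strings"
)

// describeActions is a hypothetical helper that explains the value passed to
// `sc failure ... actions=`: alternating action/delay fields separated by "/".
// An empty action (written "" on the command line) means "take no action".
func describeActions(actions string) []string {
	fields := strings.Split(actions, "/")
	var out []string
	for i := 0; i+1 < len(fields); i += 2 {
		action := strings.Trim(fields[i], `"`)
		if action == "" {
			action = "take no action"
		}
		out = append(out, fmt.Sprintf("failure %d: %s (delay %s ms)", i/2+1, action, fields[i+1]))
	}
	return out
}

func main() {
	for _, line := range describeActions(`restart/60000/restart/60000/""/60000`) {
		fmt.Println(line)
	}
	// reset= is given in seconds: 86400 s is 24 hours.
	fmt.Println("failure counter resets after", 86400/3600, "hours")
}
```

Read this way, the command restarts the service one minute after each of the first two failures, takes no action on the third, and clears the failure count after a day.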

@dry4ng

dry4ng commented Jul 16, 2020

I have the same problem on freshly provisioned Azure Windows VMs: windows_exporter fails to start after VM reboot.

enabled_collectors: "cpu,cs,iis,logical_disk,memory,net,os,service,system"

@josephB

josephB commented Aug 18, 2020

Solved for me with a folder exclusion rule in Windows Defender.
Using windows_exporter v0.13.
The problem appeared with the August Windows update on Windows Server 2016 machines.

@chinhodado

chinhodado commented Aug 26, 2020

Same issue here.

The windows_exporter service failed to start due to the following error:
The service did not respond to the start or control request in a timely fashion.

A timeout was reached (180000 milliseconds) while waiting for the windows_exporter service to connect.

I can confirm setting the service to Delayed Start fixed the issue. Why can't this be set to Delayed Start by default?

@majerus1223
Contributor

@josephB Good call on the exclusion. In our case it looks like our AV tools needed an exception following the August updates.

@carlpett
Collaborator

@chinhodado As I mention in my comment above, it doesn't seem to solve it very reliably. If we could figure out why it fixed it for you, that'd be a big step forward towards making a change. If it is related to antivirus starting up, as indicated by some other commenters lately, we'd be much better served by setting the correct service dependency.

@majerus1223
Contributor

I'll see if I can get more detail.

@dry4ng

dry4ng commented Sep 3, 2020

Setting delayed start doesn't help. Until it's fixed, I'm using a scheduled task that runs every 5 minutes and starts windows_exporter if it's not running.
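
The check-then-start logic of such a scheduled task can be sketched in Go. This is a minimal sketch under stated assumptions, not the commenter's actual task: `ensureRunning` and both callbacks are hypothetical, and the service state is simulated so the example runs on any platform.

```go
package main

import "fmt"

// ensureRunning is a hypothetical watchdog step: it starts the service only
// when the supplied check reports that it is not running. Both callbacks are
// placeholders for whatever mechanism queries and starts the real service.
// It returns true only if it actually started the service.
func ensureRunning(isRunning func() (bool, error), start func() error) (bool, error) {
	running, err := isRunning()
	if err != nil {
		return false, fmt.Errorf("query service state: %w", err)
	}
	if running {
		return false, nil
	}
	if err := start(); err != nil {
		return false, fmt.Errorf("start service: %w", err)
	}
	return true, nil
}

func main() {
	// Simulated service state; a real task would query the service manager.
	running := false
	started, err := ensureRunning(
		func() (bool, error) { return running, nil },
		func() error { running = true; return nil },
	)
	fmt.Println("started:", started, "error:", err)
}
```

Keeping the query and start steps as injected functions makes the step safe to run repeatedly: a second run against an already running service does nothing.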

@carlpett
Collaborator

carlpett commented Sep 3, 2020

@dry4ng It'd be interesting to see if your case is solved with an exception in Windows Defender as mentioned above?

@bpickhardt

In my case, almost all my Windows Server 2016/2019 machines will start the service with the automatic delayed startup after a reboot. I seem to always have a few that do not and I have to go manually start them once I get alerted. I can confirm that I've removed the Windows Defender feature from my Windows 2019 servers because I am using a third-party AV software. I was also thinking of having some kind of work around to start up the service when it is stopped but had been hesitant to put one in place so far.

@chinhodado

Is there any log that we can look at to debug why the service doesn't start? AFAIK the service doesn't generate any log file.

@bpickhardt

I installed 0.15 yesterday because I noticed it added a dependency for the Windows service on the WMI service. I experienced the same problem with 0.15: the service would not start when the startup type was set to Automatic. When I changed the startup type to Automatic (Delayed Start) after upgrading to 0.15, the service did start correctly after a reboot.

I noticed looking in the event viewer that the windows_exporter service did start but had problems collecting metrics, and I guess stopped itself, before the event that says the "Windows Management Instrumentation" service was started. Maybe this is the service that should be the dependency instead of or in addition to "WMI Performance Adapter"?

@safster123

> Hi all, #1047 has been merged. Would anyone here be able to run a test deployment from master and confirm if this has resolved the issue? I'd prefer to test this across a larger sample size before claiming this has been resolved
>
> I can provide pre-built EXEs or MSIs if that is preferable.

I'm happy to test this, but I'm not great with Git, so it would be great if you could provide an MSI.

@breed808
Contributor

See attached. I've included both the EXE and MSI built from master.
Let me know if there are any issues.

windows_exporter.zip

@JsBergbau

JsBergbau commented Aug 31, 2022

Can confirm it works now, especially on an older server where even delayed start and setting `sc.exe config windows_exporter depend= Winmgmt` didn't help.

@safster123

Finally got a chance to test this. Happy to say that it appears fixed from my testing. I was able to get it to consistently fail with previous versions but the provided version above seems to have done the trick and it now starts successfully. Thanks to all involved in getting this over the line.

@breed808
Contributor

Thanks all. I'll aim to get a new release with this fix out in the next few days, then hopefully we can close this one off 🤞

@matthewsc05

Hi @breed808, I tested the windows_exporter.zip provided above. It fixed the CPU usage issues and timeouts I was having. However, I noticed that I am experiencing a memory leak. At one point the agent hit 1 GB of RAM usage.

@breed808
Contributor

@matthewsc05 is the memory leak present on the latest version or just on the build I provided earlier?

@Andy-Techical

Hi @breed808 @matthewsc05

I've had the build from the post above (on 26th August by breed808) installed for the last week or so on 3 servers (Windows Server 2016) and can't see any high memory from it. I've compared it to the rest of my servers running an older version of Windows Exporter and the memory levels look similar across the versions.

Thanks

@matthewsc05

Hi @breed808, it's with the previous build provided above (windows_exporter.zip).

Could this be related to a particular Windows version? This was tested on Windows Server 2019. We had to remove the agent due to the high memory usage.

@breed808
Contributor

@matthewsc05 it's more likely to be the collectors you have enabled. We've identified some problem collectors using WMI as a metric source in #813, and there's been a recently identified leak in the scheduled_task collector in #1063.

That said, if you're running the same collectors between versions and there's a noticeable difference in the new version, we'll need to investigate.

I'm concerned that we may be introducing a new issue in the next release while trying to fix this one.

@matthewsc05

> @matthewsc05 it's more likely to be the collectors you have enabled. We've identified some problem collectors using WMI as a metric source in #813, and there's been a recently identified leak in the scheduled_task collector in #1063.
>
> That said, if you're running the same collectors between versions and there's a noticeable difference in the new version, we'll need to investigate.
>
> I'm concerned that we may be introducing a new issue in the next release while trying to fix this one.

Hi @breed808, I agree. I was using the default configuration, so everything was enabled. I am moving to a dedicated configuration, so this outcome might change for me soon.

The fix is still important in my opinion, as in extreme cases the 30s timeout is being hit. When I had that problem, using the package provided above and deleting the registry keys of the previous installation fixed my issues, until I hit this memory issue, which I am now looking into.

@breed808
Contributor

Fair enough. If we can't identify the cause of the issue in the next few days, I'll cut a release and list it as a known bug.

Let me know if you find anything while using a dedicated configuration.

@fsiler

fsiler commented Sep 27, 2022

@breed808 really appreciate your attention on this. Do you have any timeframe on an update? Thanks!

@breed808
Contributor

breed808 commented Oct 6, 2022

Apologies for the delay, life got in the way again. I've released v0.20.0, but I'll keep this issue open for a week or so in case anything has been missed.

@breed808 breed808 closed this as completed Dec 5, 2022
breed808 added a commit to breed808/windows_exporter that referenced this issue Dec 21, 2022
Behaviour of init functions has been centralised in `collector/init.go`,
and can be called during exporter startup. This allows the exporter to
control the timing of collector initialisation, rather than relying on
the import & `init()` method.

This should reduce unexpected behaviour arising from the use of
`init()`, such as prometheus-community#551.

Signed-off-by: Ben Reedy <[email protected]>
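
The difference the commit message describes can be illustrated with a minimal Go sketch. It is not the exporter's actual code: `initCollectors` and the `events` slice are hypothetical names used only to show that `init()` runs automatically before `main`, while an explicit initialiser runs at a point the program chooses.

```go
package main

import "fmt"

// events records the order in which initialisation steps run.
var events []string

// init runs automatically before main, so the program cannot choose its timing.
func init() {
	events = append(events, "implicit init()")
}

// initCollectors is a hypothetical explicit initialiser: it runs only when
// the caller invokes it, at a point the program chooses.
func initCollectors() {
	events = append(events, "explicit initCollectors()")
}

func main() {
	events = append(events, "main started")
	initCollectors()
	for i, e := range events {
		fmt.Printf("%d. %s\n", i+1, e)
	}
}
```

Any slow work placed in `init()` happens before `main` gets a chance to do anything else, which is the kind of timing the explicit call lets the program control.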
breed808 added a commit to breed808/windows_exporter that referenced this issue Mar 25, 2023
breed808 added a commit to breed808/windows_exporter that referenced this issue Apr 1, 2023
@breed808 breed808 unpinned this issue Apr 19, 2023
mansikulkarni96 pushed a commit to mansikulkarni96/prometheus-community-windows_exporter that referenced this issue Oct 31, 2023
anubhavg-icpl pushed a commit to anubhavg-icpl/windows_exporter that referenced this issue Sep 22, 2024