Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

windows_exporter v0.14 don't auto start #637

Closed
hieu15 opened this issue Oct 22, 2020 · 28 comments
Closed

windows_exporter v0.14 don't auto start #637

hieu15 opened this issue Oct 22, 2020 · 28 comments
Labels

Comments

@hieu15
Copy link

hieu15 commented Oct 22, 2020

Windows_exporter v0.14 don't auto start service when restart server. I must start service manual.

@fischerman
Copy link
Contributor

I'm experiencing the same problem. I didn't use any version before 0.14 so I don't know whether this version introduced the problem.

A timeout was reached (30000 milliseconds) while waiting for the "windows_exporter" service to connect. 

followed by

The "windows_exporter" service failed to start due to the following error:
The service did not respond to the start or control request in a timely fashion.

Unfortunately, the Windows exporter does not log anything at that time, so there is not much to work with. Starting it manually afterwards has always succeeded. This happens consistently on Windows 2012R2 (scheduled restarts over night) while 2016 and 2019 seem to work fine. I will try this on a clean 2012R2.

Might this be a missing dependency?

@fischerman
Copy link
Contributor

@hieu15 Can you verify if you see the same error messages in the event viewer? Otherwise this might be a different problem.

@fischerman
Copy link
Contributor

I can confirm that this happens on a fresh 2012R2

I've tried changing the start up type to "automatic delayed". The exporter starts fine in that case.

@fischerman
Copy link
Contributor

I've tested on a fresh 2016 and it doesn't show any problems, so I've tested on 2012R2 a bit more. This is where things get a bit crazy. Version 0.13 shows the same problems. However, v0.12 works perfectly fine. I've rebooted multiple times, it's always consistent. I did a clean uninstall of the exporter before installing a new one.

According to the release notes the only change in 0.13 is the project name. This seems crazy, but could the name have an effect on the start up order?

@frittentheke
Copy link

frittentheke commented Oct 22, 2020

@fischerman any relation to #544 or #599 maybe? They all seem to be about not starting exporters (after reboot).

@carlpett
Copy link
Collaborator

@fischerman That's very odd, but thanks for the exhaustive testing!
I just did a check on the full diff between those versions, and there really isn't anything apart from the name changes. There could potentially be some differences coming from the CI build machines updating something (Go compiler could have incremented some patch version, possibly).
Would you be able to run a test with the latest version on master? #630 bumped the Go version in order to fix a compiler bug which shouldn't be related to this, but...

@fischerman
Copy link
Contributor

@carlpett I didn't find an installer of master so I installed from 0.14 and swapped the .exe with https://ci.appveyor.com/project/prometheus-community/windows-exporter/builds/35870173/artifacts

As suspected this is not the cause.

@frittentheke I don't think they are related. But it gave me the idea to test 32 bit. Same problem.

@carlpett
Copy link
Collaborator

carlpett commented Oct 22, 2020

Thanks for checking! Your comment

This seems crazy, but could the name have an effect on the start up order?

... has been bouncing around in the back of my head during the day. And after managing a repro I now have a working theory. Could you test if running sc config windows_exporter depend= wmiApSrv fixes it for you?
I can't say this cleanly explains every report, but my thinking goes like this:

Some reasons it could work on other OSes would be that they have something earlier in the startup phase which starts whatever it is we need to be started.

I'm not 100% confident the wmiApSrv is the right dependency, or if it just happens to initialize something as a side effect. It'd had been a much neater explanation if it was also in the Auto start group (it's manual), and if it'd stayed running (on my repro-machine at least it stops quite quickly).
Reason I point to the logging initialization is that I can't seem to get anything to log even if I throw it before initWbem, so we must get stuck earlier. It seems a bit odd that logging wouldn't work, though.

But would be super interesting to see if this fixes it for you as well!

(As a side note, renaming the service to eg xwindows_exporter also seems to fix it on my VM)

@fischerman
Copy link
Contributor

I've added the dependency and did 3 restarts. The service came up fine every time. After adding the dependency, the WMI Performance adapter stays running in my case, possibly indefinitely (also in manual). However, in the field I see the latter service constantly flapping.

If the initialization of the event logger is the problem can we temporarily write the error message into a file?

@carlpett
Copy link
Collaborator

carlpett commented Nov 1, 2020

Thanks for the test! It's a bit unsatisfying to not really understand why, but since it does seem to fix the problem, I think we should probably add it to the next release.
The current logger implementation doesn't have a file logger, but we could probably add one fairly easy.

carlpett added a commit that referenced this issue Nov 1, 2020
This appears to fix service startup issues on certain systems, eg #637

Signed-off-by: Calle Pettersson <[email protected]>
carlpett added a commit that referenced this issue Nov 1, 2020
This appears to fix service startup issues on certain systems, eg #637

Signed-off-by: Calle Pettersson <[email protected]>
breed808 pushed a commit that referenced this issue Nov 1, 2020
This appears to fix service startup issues on certain systems, eg #637

Signed-off-by: Calle Pettersson <[email protected]>
@fischerman
Copy link
Contributor

I did a little bit more testing. I manually wrote stuff in log files. In all the test cases, no log file was written during Windows start up, but were showing up fine when starting the service afterwards. The only exception was a hello world program (built with promu) with which I replaced the windows exporter. That one always produced logs, so filesystem is generally available.

I believe the problem occurs before the main function, hence in one of init functions. I had my eye on the MapCounterToIndex which calls the perflib library internally (the perflib library may panic) but it appears that the problem is even before that (no call to registerCollectors occurs). The only init function in the project that is special is cpu.go, but even after removing the version check, no log is produced. So my best guess is that there is a panic in the init function of one of the dependencies.

I also wrapped the exporter in nssm to get the stderr, but that one starts fine...

@pandatracs
Copy link

@fischerman I had this same issue with windows_exporter v0.13 on Win2012R2. I found two culprits which contributed to the issue.

  1. Set the windows_exporter service startup to Automatic (Delayed Start).
  2. Apply all Windows Update patches on Win2012R2
    In most cases Perf: Add volume name label #1 is all it took, but I also discovered stability issues if Perf: Make exclusion of _Total label hardcoded #2 had not been done. I was unable to determine which patch was affecting windows_exporter, but since fully patching things have worked fine.

@fischerman
Copy link
Contributor

Delayed start is also my current workaround. I'm still hoping we can find the root cause. However, if my suspicion is correct and an init function of a dependency causes the crash it will be hard to discover. One idea I just thought of is to generate a core dump using GOTRACEBACK=crash.

@antong
Copy link

antong commented Nov 25, 2020

I also have this issue. Thank you @fischerman for all the testing! I was going to suggest moving the windows service handler stuff to the beginning of main() as I believe it is the control messages from the service that makes Windows understand that the service is there. But, if it really crashes before main(), then it is no use.

One idea I just thought of is to generate a core dump using GOTRACEBACK=crash.

Good idea! I think Go doesn't yet produce minidumps without patching:
golang/go#20498

It could perhaps be possible to use GOTRACEBACK=crash (or =system) and in some way get stderr redirected to a file to see the tracebacks. I know there are wrappers like winsw that can do that, but it sounds like quite a hassle.

@theelog
Copy link
Contributor

theelog commented Dec 17, 2020

I experience the issue too but what's interesting is that the service crashed within a second but the log message show that it tried for 30 seconds.

A timeout was reached (30000 milliseconds) while waiting for the windows_exporter service to connect.

The windows_exporter service failed to start due to the following error: The service did not respond to the start or control request in a timely fashion.

Update: I could only get it fixed by restarting the vm.

@fischerman
Copy link
Contributor

@basroovers Which version do you use? How do you know that the service crashed after one second?

@theelog
Copy link
Contributor

theelog commented Dec 17, 2020

@fischerman I have hundreds of instances running on v0.15.0.

Most of these didn't autostart after the last Windows update cycle (reboots).
One of them couldn't even get the service started and needed the reboot

Here is an export of the Windows event log of that VM:

Level,Date and Time,Source,Event ID,Task Category
Error,17-12-2020 08:19:45,Service Control Manager,7000,None,"The windows_exporter service failed to start due to the following error: 
The service did not respond to the start or control request in a timely fashion."
Error,17-12-2020 08:19:45,Service Control Manager,7009,None,A timeout was reached (30000 milliseconds) while waiting for the windows_exporter service to connect.

@EmmaTinten
Copy link

The problem occured yesterday on a 2016 server with 0.14. System has now been configured to use the delayed start.

@olvsa
Copy link

olvsa commented Feb 12, 2021

I also have same issue from time to time using or not "delayed start" – it doesn't matter.
And I have an idea about this "time to time".
I found that exporter's start failure is more likely to occur in case of reboot after installed updates.
Taking into account that process of installing updates continues after restart I think that exporter can not start because some exporter's dependencies are still updating or even not yet started.

Exporter didn't start tonight after automatic reboot in reason of installing updates for Windows Server 2019 Std, but later, after manual reboot it started well.

Updates were installed:
February 9, 2021-KB4601887 Cumulative Update for .NET Framework 3.5, 4.7.2 and 4.8 for Windows 10, version 1809 and Windows Server, version 2019
http://support.microsoft.com/kb/4601887
February 9, 2021—KB4601345 (OS Build 17763.1757)
https://support.microsoft.com/help/4601345
KB4601393: Servicing stack update for Windows 10, version 1809: February 9, 2021
https://support.microsoft.com/help/4601393

Hope it helps.

@theelog
Copy link
Contributor

theelog commented Apr 19, 2021

We've updated to v0.16.0, but we experienced the same thing again last weekend. Several Windows updates were installed after which the windows_exporter service was not running.

Same messages:

The windows_exporter service failed to start due to the following error: 
The service did not respond to the start or control request in a timely fashion.

A timeout was reached (30000 milliseconds) while waiting for the windows_exporter service to connect.

Perhaps good to check if this is a VMware specific issue? We are running VMware.

@tbiles
Copy link

tbiles commented Apr 19, 2021 via email

@davideverall
Copy link

Just to confirm, I ran into this issue with v0.16.0

Exporter failed to start on 2 out of 10 servers after cumulative updates.

  • Server2012R2
  • Vmware virtual machine
  • Service set to Automatic (Delayed Start)

David

@tbiles
Copy link

tbiles commented May 12, 2021 via email

@Andy-Techical
Copy link

I see this issue frequently, occurring typically when there are alot of Windows updates released and available for install.

After the updates are installed and the server is restarted, if the server is under heavy CPU load whilst the updates are installing, then the Windows_exporter service does not start up.

Usually, it is the Windows Modules Installer Worker(TiWorker.exe) service which is using up high CPU on the server to install the updates. I use mainly Windows Server 2016.

Would it be possible to put some logic in whereby the "Windows_exporter" service is checked every x minutes to check its up and running, and if not, to start it up?

@fsemti
Copy link

fsemti commented Apr 13, 2022

Just to report it, it seems to be happen with v0.16. still, on server 2019...

Get-Package *exporter*

Name                           Version          Source                           ProviderName                                                                                                                                                                     
                                                                                                                                                                                                                 
----                           -------          ------                           ------------                                                                                                                                                                     
                                                                                                                                                                                                                 
windows_exporter               0.16.0   
Get-ComputerInfo | select WindowsProductName, WindowsVersion, OsHardwareAbstractionLayer

WindowsProductName           WindowsVersion OsHardwareAbstractionLayer
------------------           -------------- --------------------------
Windows Server 2019 Standard 1809           10.0.17763.2686   
TimeCreated                     Id LevelDisplayName Message                                                                                                                                                                                                       
                                                                                                                                                                                                                 
-----------                     -- ---------------- -------                                                                                                                                                                                                       
                                                                                                                                                                                                                 
13/04/2022 09:02:53           7036 Information      The windows_exporter service entered the running state.                                                                                                                                                       
                                                                                                                                                                                                                 
13/04/2022 03:35:04           7000 Error            The windows_exporter service failed to start due to the following error: ...                                                                                                                                  
                                                                                                                                                                                                                 
13/04/2022 03:35:04           7009 Error            A timeout was reached (30000 milliseconds) while waiting for the windows_exporter service to connect.                                                                                                         
                                                                                                                                                                                                                 
12/04/2022 13:23:43           7036 Information      The windows_exporter service entered the running state.                                                                                                                                                       
                                                                                                                                                                                                                 
12/04/2022 06:11:32           7036 Information      The windows_exporter service entered the stopped state.                                                                                                                                                       
                                                                                                                                                                                                                 
15/03/2022 11:37:37           7036 Information      The windows_exporter service entered the running state.                                                                                                                                                       
                                                                                                                                                                                                                 
09/03/2022 03:32:47           7000 Error            The windows_exporter service failed to start due to the following error: ...                                                                                                                                  
                                                                                                                                                                                                                 
09/03/2022 03:32:47           7009 Error            A timeout was reached (30000 milliseconds) while waiting for the windows_exporter service to connect.                                                                                                         
                                                                                                                                                                                                                 
09/03/2022 03:30:42           7036 Information      The windows_exporter service entered the stopped state.                                                                                                                                                       
                                                                                                            

Event at 12/04/2022 13:23:43 was a system start, event at 13/04/2022 09:02:53 is me starting the service. Apparently there were at least one earlier occurance too.

Also it is not isolated, at least one others server shows this behaviour (that one is the same OS version):

TimeCreated                     Id LevelDisplayName Message                                                                                                                                                                                                      
-----------                     -- ---------------- -------                                                                                                                                                                                                      
17/03/2022 10:28:10           7036 Information      The windows_exporter service entered the running state.                                                                                                                                                      
17/03/2022 00:32:36           7000 Error            The windows_exporter service failed to start due to the following error: ...                                                                                                                                 
17/03/2022 00:32:36           7009 Error            A timeout was reached (30000 milliseconds) while waiting for the windows_exporter service to connect.                                                                                                        
17/03/2022 00:31:40           7036 Information      The windows_exporter service entered the stopped state. 

I realise we do not run the latest release, so if this issue supposed to be fixed by that, can you confirm - maybe point me to the right direction - so we can consider upgrade?

@darknesssheep
Copy link

Hello,

Same issue for me on Windows server 2019:

-> A timeout was reached (30000 milliseconds) while waiting for the windows_exporter service to connect.
-> The windows_exporter service failed to start due to the following error: 
The service did not respond to the start or control request in a timely fashion.

Exporter: 0.18.1

time="2022-08-18T08:08:59Z" level=warning msg="No where-clause specified for service collector. This will generate a very large number of metrics!" source="service.go:48"
time="2022-08-18T08:08:59Z" level=info msg="Enabled collectors: net, os, service, system, textfile, cpu, cs, logical_disk" source="exporter.go:348"
time="2022-08-18T08:08:59Z" level=info msg="Starting windows_exporter ( version=0.18.1, branch=heads/tags/v0.18.1, revision=e07b2053af3ca708db3489f41e2fcde9941860f4)" source="exporter.go:400"
time="2022-08-18T08:08:59Z" level=info msg="Build context (go=go1.17.6, user=runneradmin@xxxxx date=20220205-08:18:35)" source="exporter.go:401"
time="2022-08-18T08:08:59Z" level=info msg="Starting server on :9182" source="exporter.go:404"
time="2022-08-18T08:08:59Z" level=info msg="TLS is disabled." source="gokit_adapter.go:38"

Copy link

This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.

@github-actions github-actions bot added the Stale label Nov 25, 2023
@breed808
Copy link
Contributor

This should be resolved as of #1047. Let me know if it's still an issue.

anubhavg-icpl pushed a commit to anubhavg-icpl/windows_exporter that referenced this issue Sep 22, 2024
This appears to fix service startup issues on certain systems, eg prometheus-community#637

Signed-off-by: Calle Pettersson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests