Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improper shutdown on Windows can cause Elastic Defend to rescan all files on reboot #1525

Closed
AndersonQ opened this issue Oct 14, 2022 · 10 comments · Fixed by #2434
Closed

Improper shutdown on Windows can cause Elastic Defend to rescan all files on reboot #1525

AndersonQ opened this issue Oct 14, 2022 · 10 comments · Fixed by #2434
Assignees
Labels
bug Something isn't working Team:Elastic-Agent Label for the Agent team v8.7.0

Comments

@AndersonQ
Copy link
Member

AndersonQ commented Oct 14, 2022

The agent listens for a shutdown command from a few different places, one of them is signal.Notify which on Windows will capture CTRL_CLOSE_EVENT, CTRL_LOGOFF_EVENT or CTRL_SHUTDOWN_EVENT and send as a syscall.SIGTERM. Another is the Windows SCM (Service Control Manager), it's analogous to systemd on linux. The SCM keeps the agent running, and if the agent isn't running, it restarts it.

The correct shutdown procedure is the agent to update its status with the SCM, so the SCM won't restart the agent on a legit shutdown.
Even thought the agent uses the elastic-agent-libs.service#HandleSignals to manage its status with the SCM, a race condition exists, which may cause the agent to do not update its status correctly, leading the SCM to try to restart the agent during a system shutdown, which in turns might cause problems with some of the agent integrations. Also the when the agent exits it calls os.Exit which does not run defered functions. The agent and elastic-agent-libs uses defer as part of its shutdown process, which includes managing its state with the SCM, thus failing to correctly notify its status to the SCM.

Last, but not least, the agent listen multiple times, in different parts of the code, to the same shutdown signals through signal.Notify. It does it on its on and thorough elastic-agent-libs.service#HandleSignals, which might cause the agent to never try to update its status with the SCM.

The agent needs to act only on the first shutdown signal it receives, regardless from where it comes, signal.Notify, SCM or the CLI, and ignore any shutdown signal after that. Also it needs to ensure to correctly update its status with the SCM.

For confirmed bugs, please report:

  • Version: 7.17, 8.4.x, 8.5.x, main
  • Operating System: Windows
  • Steps to Reproduce:

Relates to:

@AndersonQ AndersonQ added bug Something isn't working backport-v8.4.0 Automated backport with mergify backport-v8.5.0 Automated backport with mergify labels Oct 14, 2022
@softengchick
Copy link

Thanks @bjmcnic for seeing this through all the way to the root cause and not just putting a bandaid on Endpoint.

@gabriellandau
Copy link

I am encountering this problem with 8.7.0 BC3 on my Windows workstation when restarting the machine.

Sometimes during Windows shutdown, Agent terminates itself:

signal "terminated" received
Shutting down Elastic Agent and sending last events...
failed accept conn info connection: accept tcp 127.0.0.1:6788: use of closed network connection
Possible transient error during checkin with fleet-server, retrying
checkin retry loop was stopped
Shutting down completed.
Stats endpoint (127.0.0.1:6791) finished: accept tcp 127.0.0.1:6791: use of closed network connection

Because Agent doesn't inform the Windows Service Control Manager (SCM) of its intent to shut down, the SCM perceives this as an unexpected service termination and restarts Agent. This results in an error in the Windows Event Log:
image

A new Agent instance then starts up and attempts to uninstall Endpoint while the system is shutting down. Because the SCM is shutting down, several service-related APIs used internally by elastic-endpoint.exe verify fail.

APM instrumentation disabled
Gathered system information
Detected available inputs and outputs
Capabilities file not found in C:\Program Files\Elastic\Agent\capabilities.yml
Determined allowed capabilities
Parsed configuration and determined agent is managed by Fleet
Starting stats endpoint
Metrics endpoint listening on: 127.0.0.1:6791 (configured: http://localhost:6791)
Docker provider skipped, unable to connect: protocol not available
restoring current policy from disk
Source URI changed from "https://artifacts.elastic.co/downloads/" to "https://artifacts.elastic.co/downloads/"
Updating running component model
Spawned new component endpoint-default: Starting: endpoint service runtime
Spawned new unit endpoint-default: Starting: endpoint service runtime
Spawned new unit endpoint-default-e1c7bcc4-4df6-43b1-bf9d-83ba60c3e8b9: Starting: endpoint service runtime
2023-02-27 17:17:21: info: Main.cpp:390 Verifying existing installation
2023-02-27 17:17:21: info: InstallLib.cpp:630 Running [C:\Program Files\Elastic\Endpoint\elastic-endpoint.exe] [version --log stdout]
2023-02-27 17:17:21: info: InstallLib.cpp:669 Installed endpoint is expected version (version: 8.7.0, compiled: Tue Feb 21 22:00:00 2023, branch: 8.7, commit: b28e1d429625af2c534fc3479616c9c6ead91bd6)
2023-02-27 17:17:21: info: Main.cpp:364 Upgrading existing installation (protected)
2023-02-27 17:17:21: info: InstallLib.cpp:511 Attempting to create a rollback package
2023-02-27 17:17:21: info: InstallLib.cpp:348 Running [C:\Program Files\Elastic\Agent\data\elastic-agent-e88a5b\componentspreviouselastic-endpoint.exe] [uninstall --rollback --log stdout]
Fleet gateway started
Updating running component model
signal "terminated" received
Shutting down Elastic Agent and sending last events...

This sequence of events results in Endpoint being uninstalled and reinstalled, destroying Endpoint's logs and file cache (C:\Program Files\Elastic\Endpoint\state\cache.db). The loss of this cache results in Endpoint computing hashes and signatures for files that it has previously checked, leading to unnecessary CPU and I/O from elastic-endpoint.exe, and a less-responsive system for the user. In two tests on my workstation, this was approximately 2GB of unnecessary I/O, and system startup was laggy.

Can we fix this in 8.7? I have the full logs available with timestamps if they will provide any additional clarity.

cc @AndersonQ @cmacknz @pierrehilbert @bjmcnic

@gabriellandau
Copy link

@intxgo highlighted another problem - uninstalling Endpoint also deletes its quarantine store. Users lose the ability to restore any files that were previously quarantined.

@cmacknz
Copy link
Member

cmacknz commented Feb 28, 2023

Will see if we can get this prioritized for 8.7. This seems like it is significantly annoying for Windows users running Elastic Defend.

@cmacknz cmacknz added v8.7.0 and removed backport-v8.6.0 Automated backport with mergify labels Feb 28, 2023
@cmacknz cmacknz changed the title Improper shutdown on Windows causing issues with integrations Improper shutdown on Windows can cause Elastic Defend to rescan all files on reboot Mar 24, 2023
@gabriellandau
Copy link

Thanks @michalpristas @cmacknz

@michalpristas
Copy link
Contributor

@gabriellandau if you see this happening again please reopen this issue

@cmacknz
Copy link
Member

cmacknz commented Apr 5, 2023

Going to reopen this because the fix in elastic/elastic-agent-libs#113 hasn't been ported to the agent itself yet, only the elastic-agent-libs dependency.

The agent itself is still using v0.3.6 which does not include this fix:

github.com/elastic/elastic-agent-libs v0.3.6

We can also update Beats which are also able to run as Windows services, although the consequences there are less severe. Beats is using v0.3.3:
https://github.com/elastic/beats/blob/804e9c521aaee92de4321bcbbd3b920c33a03976/go.mod#L197

We should also make sure we manually test that this is fixed. We can have our QA team do this if we can give them detailed steps to reproduce the problem here.

@cmacknz cmacknz reopened this Apr 5, 2023
@michalpristas
Copy link
Contributor

closing as 0.3.6 is a tag currently pointing to top main.

@michalpristas michalpristas added the QA:Ready For Testing Code is merged and ready for QA to validate label Apr 5, 2023
@cmacknz
Copy link
Member

cmacknz commented Apr 5, 2023

🤦‍♂️ yes I can't read apparently.

@amolnater-qasource
Copy link

Hi @michalpristas

We have collected below logs as per our understanding of this issue on latest 8.8.0 SNAPSHOT and had below observations:

  • On restarting Elastic Agent from Services, the logs are collected under Event Viewer:
    11
    12

  • Further an error after few minutes of installation under Event viewer:
    8
    10

Build details:
VERSION: 8.8.0 SNAPSHOT
BUILD: 62040
COMMIT: 608886ff862f89ca30d1d4126a64877168f4976e
Artifact Link: https://snapshots.elastic.co/8.8.0-e41a7b29/downloads/beats/elastic-agent/elastic-agent-8.8.0-SNAPSHOT-windows-x86_64.zip

Agent Logs:
elastic-agent-diagnostics-2023-04-06T10-19-33Z-00.zip

Could you please share the detailed steps to reproduce this scenario so that we can share the exact observation details with you.

cc: @cmacknz
Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Team:Elastic-Agent Label for the Agent team v8.7.0
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants