
Upgrade Watcher's Crash Checker is not detecting the correct PID for the Agent process from systemd #3124

Closed
Tracked by #2176
ycombinator opened this issue Jul 25, 2023 · 5 comments · Fixed by #3166
Labels
bug (Something isn't working), Team:Elastic-Agent (Label for the Agent team)

Comments

@ycombinator
Contributor

This bug was noticed when writing an integration test where the upgraded Agent failed to start up (PR).

For confirmed bugs, please report:

  • Version: main / 8.10.0-SNAPSHOT
  • Operating System: Linux
  • Steps to Reproduce:
  1. To reproduce this bug, we need to upgrade to an Agent whose binary crashes upon start. The integration test in this PR builds such a failing, fake Agent binary and packages it up.

  2. Try to upgrade to the Agent on Linux using elastic-agent upgrade <version> --source-uri file:///path/to/fake/failing/agent/package.tgz.

  3. While the upgrade is in progress, monitor the status of the Elastic Agent service:

    $ watch -n1 systemctl status elastic-agent.service
    
    Every 1.0s: systemctl status elastic-agent.service                                                                  tdwijr: Fri Jul 14 18:51:38 2023
    
    ● elastic-agent.service - Elastic Agent is a unified agent to observe, monitor and protect your system.
         Loaded: loaded (/etc/systemd/system/elastic-agent.service; enabled; vendor preset: enabled)
         Active: deactivating (stop-sigterm) (Result: exit-code) since Fri 2023-07-14 18:50:50 UTC; 48s ago
        Process: 97938 ExecStart=/usr/bin/elastic-agent (code=exited, status=101)
       Main PID: 97938 (code=exited, status=101)
          Tasks: 6 (limit: 4637)
         Memory: 105.8M
            CPU: 3.386s
         CGroup: /system.slice/elastic-agent.service
                 └─98040 /opt/Elastic/Agent/data/elastic-agent-903287/elastic-agent watch --path.config /opt/Elastic/Agent --path.home /opt/Elastic/Agent
    
  4. Note the Main PID in the systemctl status elastic-agent.service command's output. In the above example, it is 97938.

  5. Now look at the Crash Checker's entries in the Upgrade Watcher's log and note the PID mentioned there:

    $ grep -i 'crash.*service pid' /opt/Elastic/Agent/data/elastic-agent-903287/logs/elastic-agent-watcher-20230714-1.ndjson
    {"log.level":"debug","@timestamp":"2023-07-14T18:51:00.213Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
    {"log.level":"debug","@timestamp":"2023-07-14T18:51:10.214Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
    {"log.level":"debug","@timestamp":"2023-07-14T18:51:20.215Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
    {"log.level":"debug","@timestamp":"2023-07-14T18:51:30.216Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
    {"log.level":"debug","@timestamp":"2023-07-14T18:51:40.217Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
    {"log.level":"debug","@timestamp":"2023-07-14T18:51:50.218Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
    {"log.level":"debug","@timestamp":"2023-07-14T18:52:00.219Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
    {"log.level":"debug","@timestamp":"2023-07-14T18:52:10.220Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
    {"log.level":"debug","@timestamp":"2023-07-14T18:52:20.222Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
    

    The PID in the Crash Checker logs is 0, which is not the Agent process's PID (it should match the Main PID shown in the systemctl status elastic-agent.service output). The Crash Checker needs to detect the Agent's actual PID, since it decides whether the Agent process has crashed based on how many times that PID changes within a certain time interval.

@ycombinator ycombinator added the bug (Something isn't working) and Team:Elastic-Agent (Label for the Agent team) labels on Jul 25, 2023
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@ycombinator ycombinator changed the title from "Upgrade Watcher's Crash Checker is not detecting the correct PID from systemd for the Agent process" to "Upgrade Watcher's Crash Checker is not detecting the correct PID for the Agent process from systemd" on Jul 25, 2023
@cmacknz
Member

cmacknz commented Jul 27, 2023

@pierrehilbert pulling into the current sprint as this blocks #2176

@ycombinator
Contributor Author

ycombinator commented Aug 1, 2023

When the Agent service is managed by systemd, the Upgrade Watcher's Crash Checker reads the Agent process's PID from the MainPID service property:

prop, err := p.dbusConn.GetServicePropertyContext(ctx, sn, "MainPID")

However, the "Main PID" shown in the output of systemctl status elastic-agent.service actually appears to come from the ExecMainPID service property.

Reading the dbus documentation:

That means that ExecMainPID and MainPID differ in the way that the latter immediately reflects whether a main process is currently running while the latter possible contains information collected from the last run even if the process is no longer around.

(Confusingly, I think the second "latter" should actually be "former", referring to ExecMainPID).

In any case, I verified this theory by running systemctl show elastic-agent.service | grep MainPID when the Agent process had returned an error and quit. Here's the output:

systemctl show elastic-agent.service | grep MainPID
GuessMainPID=yes
MainPID=0
ExecMainPID=279275
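
The same two properties can also be read programmatically through the D-Bus API the Crash Checker uses. Here is a minimal standalone sketch for comparison, assuming the github.com/coreos/go-systemd/v22/dbus package (the connection method and output format are illustrative, not the actual Crash Checker code):

package main

import (
    "context"
    "fmt"

    "github.com/coreos/go-systemd/v22/dbus"
)

func main() {
    ctx := context.Background()

    // Connect to the system bus; the actual Watcher code may establish its
    // connection differently.
    conn, err := dbus.NewSystemConnectionContext(ctx)
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    // Compare the property the Crash Checker reads (MainPID) with the one
    // systemctl status appears to display (ExecMainPID).
    for _, name := range []string{"MainPID", "ExecMainPID"} {
        prop, err := conn.GetServicePropertyContext(ctx, "elastic-agent.service", name)
        if err != nil {
            panic(err)
        }
        fmt.Printf("%s = %v\n", name, prop.Value.Value())
    }
}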

From my testing, it appears that when the Agent process returns an error and quits, systemd waits for 2 minutes and then tries to spawn a new Agent process. During this 2-minute interval, MainPID is set to 0 while ExecMainPID is set to the PID of the last Agent process that ran (and, in our case, errored and quit).
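
For reference, the unit's restart policy can be inspected via standard systemd properties (which setting actually produces the 2-minute delay may vary with how the service file is packaged):

systemctl show -p Restart,RestartSec elastic-agent.service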

Given our implementation today, the Crash Checker keeps reading the MainPID as 0 every 10s and, since that value doesn't change in 6 consecutive reads (so over a period of 60s, which is less than the 2-minute interval between systemd's restarts of the Agent process), thinks that the Agent process is stable, i.e. hasn't crashed.
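
To make the failure mode concrete, here is a small illustrative sketch (not the actual crash_checker.go code) of why consecutive reads of MainPID=0 never look like a crash:

package main

import "fmt"

// pidChanges counts how many times the PID differs between consecutive reads.
func pidChanges(pids []int) int {
    changes := 0
    for i := 1; i < len(pids); i++ {
        if pids[i] != pids[i-1] {
            changes++
        }
    }
    return changes
}

func main() {
    // Six reads taken 10s apart while systemd waits to respawn the crashed Agent:
    // MainPID is 0 the whole time, so the change counter barely moves.
    reads := []int{0, 0, 0, 0, 0, 0}
    fmt.Printf("service PID changed %d times within %d\n", pidChanges(reads), len(reads))
    // Prints "service PID changed 0 times within 6", well under the crash
    // threshold, so the Agent is treated as stable.
}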

We have two ways to fix this problem:

  1. Treat a PID of 0 as suspect and, if it remains at 0 for 6 consecutive reads, consider the Agent crashed.
  2. Read from the ExecMainPID property instead of the MainPID property in the GetPID() method of the systemd/dbus PID provider.

I'm leaning towards option 1, given that 0 is a special PID — both on *nix and Windows systems — and also because I believe the intent of the Crash Checker is to read the PID of the currently-running Agent process, assuming there is one.
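
For illustration, a minimal sketch of what option 1 could look like (the function and its parameters are hypothetical, not the actual Crash Checker code):

package main

import (
    "errors"
    "fmt"
)

// evaluate treats a PID that stays at 0 for every consecutive read as a crash,
// since 0 can never be the PID of a running Agent process, in addition to the
// existing "too many PID changes" check.
func evaluate(pids []int, maxChangesAllowed int) error {
    changes, zeroReads := 0, 0
    for i, pid := range pids {
        if pid == 0 {
            zeroReads++
        }
        if i > 0 && pid != pids[i-1] {
            changes++
        }
    }
    if zeroReads == len(pids) {
        return errors.New("service PID stayed at 0 for every check: considering the Agent crashed")
    }
    if changes > maxChangesAllowed {
        return fmt.Errorf("service PID changed %d times within %d checks", changes, len(pids))
    }
    return nil
}

func main() {
    // Six consecutive reads of MainPID=0 while the crashed Agent waits to be respawned.
    fmt.Println(evaluate([]int{0, 0, 0, 0, 0, 0}, 2)) // now reported as a crash
}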

But I'd love to hear others' opinions on which option to implement for the fix, particularly from @cmacknz and @michalpristas.

@michalpristas
Contributor

michalpristas commented Aug 1, 2023

What you're describing was actually intentional: treating 0 as a valid state in case we were slow on restarts (we used to restart the Agent, e.g. on a log level change or as part of the upgrade (now we re-exec into the new process instead), and a restart may also come from the user). It was just an effort to avoid false positives.

We were seeing some slow starts on Windows after a system reboot, but that should not affect the Watcher, since in that case the Watcher is spawned after the main process.

Considering this, I think it would be safe to treat 0 as a crash and evaluate it that way. We currently allow 2 crashes out of 6 checks, so if we keep that, it would give us 20s to restart. In normal circumstances this should be plenty.

@cmacknz
Member

cmacknz commented Aug 1, 2023

Agree with Michal, treating PID 0 as a crash makes sense here.
