
Upgrade Watcher's Crash Checker is not detecting the correct PID for the Agent process from systemd #3124

Closed
Tracked by #2176
ycombinator opened this issue Jul 25, 2023 · 5 comments · Fixed by #3166
Labels
bug (Something isn't working), Team:Elastic-Agent (Label for the Agent team)

Comments

@ycombinator
Contributor

This bug was noticed when writing an integration test where the upgraded Agent failed to start up (PR).

For confirmed bugs, please report:

  • Version: main / 8.10.0-SNAPSHOT
  • Operating System: Linux
  • Steps to Reproduce:
  1. To reproduce this bug, we need to upgrade to an Agent whose binary crashes upon start. The integration test in this PR builds such a failing, fake Agent binary and packages it up.

  2. Try to upgrade to the Agent on Linux using elastic-agent upgrade <version> --source-uri file:///path/to/fake/failing/agent/package.tgz.

  3. While the upgrade is in progress, monitor the status of the Elastic Agent service:

    $ watch -n1 systemctl status elastic-agent.service
    
    Every 1.0s: systemctl status elastic-agent.service                                                                  tdwijr: Fri Jul 14 18:51:38 2023
    
    ● elastic-agent.service - Elastic Agent is a unified agent to observe, monitor and protect your system.
         Loaded: loaded (/etc/systemd/system/elastic-agent.service; enabled; vendor preset: enabled)
         Active: deactivating (stop-sigterm) (Result: exit-code) since Fri 2023-07-14 18:50:50 UTC; 48s ago
        Process: 97938 ExecStart=/usr/bin/elastic-agent (code=exited, status=101)
       Main PID: 97938 (code=exited, status=101)
          Tasks: 6 (limit: 4637)
         Memory: 105.8M
            CPU: 3.386s
         CGroup: /system.slice/elastic-agent.service
                 └─98040 /opt/Elastic/Agent/data/elastic-agent-903287/elastic-agent watch --path.config /opt/Elastic/Agent --path.home /opt/Elastic/Agent
    
  4. Note the Main PID in the systemctl status elastic-agent.service command's output. In the above example, it is 97938.

  5. Now look at the Crash Checker's entries in the Upgrade Watcher's log and note the PID mentioned there:

    $ grep -i 'crash.*service pid' /opt/Elastic/Agent/data/elastic-agent-903287/logs/elastic-agent-watcher-20230714-1.ndjson
    {"log.level":"debug","@timestamp":"2023-07-14T18:51:00.213Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
    {"log.level":"debug","@timestamp":"2023-07-14T18:51:10.214Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
    {"log.level":"debug","@timestamp":"2023-07-14T18:51:20.215Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
    {"log.level":"debug","@timestamp":"2023-07-14T18:51:30.216Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
    {"log.level":"debug","@timestamp":"2023-07-14T18:51:40.217Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
    {"log.level":"debug","@timestamp":"2023-07-14T18:51:50.218Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
    {"log.level":"debug","@timestamp":"2023-07-14T18:52:00.219Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
    {"log.level":"debug","@timestamp":"2023-07-14T18:52:10.220Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
    {"log.level":"debug","@timestamp":"2023-07-14T18:52:20.222Z","log.origin":{"file.name":"upgrade/crash_checker.go","file.line":82},"message":"retrieved service PID [0] changed 1 times within 6","ecs.version":"1.6.0"}
    

    The PID in the Crash Checker logs is 0, which is not the Agent process's PID (it should match the Main PID shown in the systemctl status elastic-agent.service output). The Crash Checker needs to detect the Agent's actual PID, since it decides whether the Agent process has crashed based on how many times that PID changes within a certain time interval.

@ycombinator ycombinator added the bug (Something isn't working) and Team:Elastic-Agent (Label for the Agent team) labels on Jul 25, 2023
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@ycombinator ycombinator changed the title from "Upgrade Watcher's Crash Checker is not detecting the correct PID from systemd for the Agent process" to "Upgrade Watcher's Crash Checker is not detecting the correct PID for the Agent process from systemd" on Jul 25, 2023
@cmacknz
Member

cmacknz commented Jul 27, 2023

@pierrehilbert pulling into the current sprint as this blocks #2176

@ycombinator
Contributor Author

ycombinator commented Aug 1, 2023

When the Agent service is managed by systemd, the Upgrade Watcher's Crash Checker reads the Agent process's PID from the MainPID service property:

prop, err := p.dbusConn.GetServicePropertyContext(ctx, sn, "MainPID")

However, the "Main PID" shown in the output of systemctl status elastic-agent.service actually appears to come from the ExecMainPID service property.

Reading the dbus documentation:

That means that ExecMainPID and MainPID differ in the way that the latter immediately reflects whether a main process is currently running while the latter possible contains information collected from the last run even if the process is no longer around.

(Confusingly, I think the second "latter" should actually be "former", referring to ExecMainPID).

In any case, I verified this theory by running systemctl show elastic-agent.service | grep MainPID when the Agent process had returned an error and quit. Here's the output:

systemctl show elastic-agent.service | grep MainPID
GuessMainPID=yes
MainPID=0
ExecMainPID=279275
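
The same two properties can also be read programmatically through the D-Bus API the Crash Checker uses. Here is a minimal standalone sketch for comparison, assuming the github.com/coreos/go-systemd/v22/dbus package (the connection method and output format are illustrative, not the actual Crash Checker code):

package main

import (
    "context"
    "fmt"

    "github.com/coreos/go-systemd/v22/dbus"
)

func main() {
    ctx := context.Background()

    // Connect to the system bus; the actual Watcher code may establish its
    // connection differently.
    conn, err := dbus.NewSystemConnectionContext(ctx)
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    // Compare the property the Crash Checker reads (MainPID) with the one
    // systemctl status appears to display (ExecMainPID).
    for _, name := range []string{"MainPID", "ExecMainPID"} {
        prop, err := conn.GetServicePropertyContext(ctx, "elastic-agent.service", name)
        if err != nil {
            panic(err)
        }
        fmt.Printf("%s = %v\n", name, prop.Value.Value())
    }
}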

From my testing, it appears that when the Agent process returns an error and quits, systemd waits for 2 minutes and then tries to spawn a new Agent process. During this 2-minute interval, MainPID is set to 0 while ExecMainPID is set to the PID of the last Agent process that ran (and, in our case, errored and quit).
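
For reference, the unit's restart policy can be inspected via standard systemd properties (which setting actually produces the 2-minute delay may vary with how the service file is packaged):

systemctl show -p Restart,RestartSec elastic-agent.service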

Given our implementation today, the Crash Checker keeps reading the MainPID as 0 every 10s and, since that value doesn't change in 6 consecutive reads (so over a period of 60s, which is less than the 2-minute interval between systemd's restarts of the Agent process), thinks that the Agent process is stable, i.e. hasn't crashed.
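
To make the failure mode concrete, here is a small illustrative sketch (not the actual crash_checker.go code) of why consecutive reads of MainPID=0 never look like a crash:

package main

import "fmt"

// pidChanges counts how many times the PID differs between consecutive reads.
func pidChanges(pids []int) int {
    changes := 0
    for i := 1; i < len(pids); i++ {
        if pids[i] != pids[i-1] {
            changes++
        }
    }
    return changes
}

func main() {
    // Six reads taken 10s apart while systemd waits to respawn the crashed Agent:
    // MainPID is 0 the whole time, so the change counter barely moves.
    reads := []int{0, 0, 0, 0, 0, 0}
    fmt.Printf("service PID changed %d times within %d\n", pidChanges(reads), len(reads))
    // Prints "service PID changed 0 times within 6", well under the crash
    // threshold, so the Agent is treated as stable.
}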

We have two ways to fix this problem:

  1. Treat a PID of 0 as suspect and, if it remains at 0 for 6 consecutive reads, consider the Agent crashed.
  2. Read from the ExecMainPID property instead of the MainPID property in the GetPID() method of the systemd/dbus PID provider.

I'm leaning towards option 1, given that 0 is a special PID — both on *nix and Windows systems — and also because I believe the intent of the Crash Checker is to read the PID of the currently-running Agent process, assuming there is one.
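
For illustration, a minimal sketch of what option 1 could look like (the function and its parameters are hypothetical, not the actual Crash Checker code):

package main

import (
    "errors"
    "fmt"
)

// evaluate treats a PID that stays at 0 for every consecutive read as a crash,
// since 0 can never be the PID of a running Agent process, in addition to the
// existing "too many PID changes" check.
func evaluate(pids []int, maxChangesAllowed int) error {
    changes, zeroReads := 0, 0
    for i, pid := range pids {
        if pid == 0 {
            zeroReads++
        }
        if i > 0 && pid != pids[i-1] {
            changes++
        }
    }
    if zeroReads == len(pids) {
        return errors.New("service PID stayed at 0 for every check: considering the Agent crashed")
    }
    if changes > maxChangesAllowed {
        return fmt.Errorf("service PID changed %d times within %d checks", changes, len(pids))
    }
    return nil
}

func main() {
    // Six consecutive reads of MainPID=0 while the crashed Agent waits to be respawned.
    fmt.Println(evaluate([]int{0, 0, 0, 0, 0, 0}, 2)) // now reported as a crash
}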

But I'd love to hear others' opinions on which option to implement for the fix, particularly from @cmacknz and @michalpristas.

@michalpristas
Contributor

michalpristas commented Aug 1, 2023

What you're describing was actually intentional: treating 0 as a valid state in case we were slow on restarts (we used to restart the Agent, e.g. on a log level change or as part of the upgrade (now we re-exec into the new process instead), and a restart may also come from the user). It was just an effort to avoid false positives.

We were seeing some slow starts on Windows after a system reboot, but that should not affect the Watcher, since in that case the Watcher is spawned after the main process.

Considering this, I think it would be safe to treat 0 as a crash and evaluate it that way. We currently allow 2 crashes out of 6 checks, so if we keep that, it would give us 20s to restart. In normal circumstances this should be plenty.

@cmacknz
Member

cmacknz commented Aug 1, 2023

Agree with Michal, treating PID 0 as a crash makes sense here.
