Upgrade Watcher's Crash Checker is not detecting the correct PID for the Agent process from systemd #3124
Comments
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
@pierrehilbert pulling into the current sprint as this blocks #2176
The Upgrade Watcher's Crash Checker reads the Agent process's PID from the
However, the "Main PID" that is shown in the output of Reading the
(Confusingly, I think the second "latter" should actually be "former", referring to In any case, I verified this theory by running
From my testing, it appears that when the Agent process returns an error and quits, Given our implementation today, the Crash Checker keeps reading the We have two ways to fix this problem:
I'm leaning towards option 1, given that But I'd love to hear others' opinions on which option to implement for the fix, particularly @cmacknz @michalpristas.
What you're describing was actually the idea: treating 0 as a valid state in case we are slow on restarts (we were restarting Agents e.g. on log level change or as part of the upgrade (now we re-exec into the new process), or a restart may come from the user). It was just an effort to avoid false positives. We were seeing some slow starts on Windows after system reboot, but this should not affect the Watcher, as in this case the Watcher is spawned after the main process. Considering this, I think it could be safe to treat 0 as a crash and evaluate it in that way. Now we allow 2 crashes out of 6, so if we keep it this way it would mean we have 20s to restart. In normal circumstances this should be plenty.
Agree with Michal, treating PID 0 as a crash makes sense here.
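To make the proposal above concrete, here is a minimal Go sketch of a checker that counts a PID of 0 as a crash. It is not the Agent's actual Crash Checker: the `pidWindow` type, its methods, and the exact counting rule are hypothetical names and assumptions chosen for illustration. It only assumes what the thread states, namely that the last 6 PID samples are evaluated and up to 2 crashes are tolerated.

```go
package main

// pidWindow is a hypothetical sliding window over the most recent PID samples.
// It is an illustration only, not the Agent's real Crash Checker type.
type pidWindow struct {
	size    int
	samples []int
}

func newPIDWindow(size int) *pidWindow {
	return &pidWindow{size: size}
}

// add records a new PID sample and drops the oldest one once the window is full.
func (w *pidWindow) add(pid int) {
	w.samples = append(w.samples, pid)
	if len(w.samples) > w.size {
		w.samples = w.samples[1:]
	}
}

// restarts counts crash evidence in the window (assumed rule):
//   - every sample with PID 0 counts as one crash (no main process running);
//   - every change from one non-zero PID to a different non-zero PID counts as one crash.
//
// A non-zero PID that follows a 0 is not counted again, because the 0 sample
// already accounted for that restart.
func (w *pidWindow) restarts() int {
	count := 0
	for i, pid := range w.samples {
		if pid == 0 {
			count++
			continue
		}
		if i > 0 && w.samples[i-1] != 0 && w.samples[i-1] != pid {
			count++
		}
	}
	return count
}

// crashed reports whether the window holds more crashes than are tolerated.
func (w *pidWindow) crashed(maxRestarts int) bool {
	return w.restarts() > maxRestarts
}
```

With `newPIDWindow(6)` and `crashed(2)`, an Agent that never comes back (PID stays 0) is flagged on the third consecutive 0 sample, while a single slow restart (one or two 0 samples followed by a stable PID) is still tolerated.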
This bug was noticed when writing an integration test where the upgraded Agent failed to start up (PR).
For confirmed bugs, please report:
main/8.10.0-SNAPSHOT
To reproduce this bug, we need to upgrade to an Agent whose binary crashes upon start. The integration test in this PR builds such a failing, fake Agent binary, and packages it up.
Try to upgrade to the Agent on Linux using `elastic-agent upgrade <version> --source-uri file:///path/to/fake/failing/agent/package.tgz`.
While the upgrade is in progress, monitor the status of the Elastic Agent service:
Note the `Main PID` in the `systemctl status elastic-agent.service` command's output. In the above example, it is `97938`.
Now look at the Crash Checker's logs in the Upgrade Watcher's log and note the PID mentioned in there:
The PID in the Crash Checker logs is `0`, which is not the PID of the Agent process (which should be the value of `Main PID` in the `systemctl status elastic-agent.service` command's output). The PID detected by the Crash Checker needs to be the Agent's PID since the Crash Checker detects that the Agent process has crashed based on how many times its PID changes within a certain time interval.