SIGHUP multiple times results in supervising process N which is not our child errors and no connectivity. #11052

jdconti · 2022-03-11T01:09:31Z

Description

What happened: RPM upgrade from teleport 8.3.1 to teleport 8.3.4 prevented any new sessions (old sessions remained unaffected). This node was reloaded/SIGHUP'd previously and the latest upgrade + reload resulted in errors on any new connections.

$ tsh ssh foo-bar
fork/exec /usr/local/bin/teleport;622a7e9a: no such file or directory
ERROR: ssh: could not start shell

What you expected to happen: New connections to the upgraded node should always be successful.

Reproduction Steps

As minimally and precisely as possible, describe step-by-step how to reproduce the problem.

Connect 10+ times to node in question (# likely doesn't matter).
systemctl reload teleport
yum update teleport-ent-8.3.1 to teleport-ent-8.3.4
systemctl reload teleport
Attempt a new connection.

Server Details

Teleport version (run teleport version): Teleport Enterprise v8.3.4 git:v8.3.4-0-g010bea1 go1.17.3
Server OS (e.g. from /etc/os-release): Red Hat Enterprise Linux Server 7.7 (Maipo)
Where are you running Teleport? (e.g. AWS, GCP, Dedicated Hardware): Dedicated Hardware
Additional details: N/A

Client Details

Tsh version (tsh version): Teleport v8.3.1 git:v8.3.1-0-g64812df go1.17.3
Computer OS (e.g. Linux, macOS, Windows): Linux and Mac
Browser version (for UI-related issues): N/A
Installed via (e.g. apt, yum, brew, website download): yum
Additional details: N/A

Debug Logs

Please include or attach debug logs, when appropriate. Obfuscate sensitive information!

Feb 24 16:47:16 foo-bar systemd[1]: Reloading Teleport SSH Service.
Feb 24 16:47:16 foo-bar systemd[1]: Reloaded Teleport SSH Service.
Feb 24 16:47:17 foo-bar teleport[95058]: [NODE]         Service 8.3.1:v8.3.1-0-g64812df is starting on 0.0.0.0:3022.
Mar 10 16:47:53 foo-bar systemd[1]: Reloading Teleport SSH Service.
Mar 10 16:47:53 foo-bar systemd[1]: teleport.service: Supervising process 63333 which is not our child. We'll most likely not notice when it exits.
Mar 10 16:47:53 foo-bar systemd[1]: Reloaded Teleport SSH Service.
Mar 10 16:49:00 foo-bar systemd[1]: teleport.service: Supervising process 63333 which is not our child. We'll most likely not notice when it exits.
Mar 10 16:49:00 foo-bar systemd[1]: teleport.service: Supervising process 63333 which is not our child. We'll most likely not notice when it exits.
Mar 10 16:49:59 foo-bar systemd[1]: teleport.service: Supervising process 63333 which is not our child. We'll most likely not notice when it exits.
Mar 10 16:49:59 foo-bar systemd[1]: teleport.service: Supervising process 63333 which is not our child. We'll most likely not notice when it exits.
Mar 10 16:50:04 foo-bar systemd[1]: teleport.service: Supervising process 63333 which is not our child. We'll most likely not notice when it exits.
Mar 10 16:50:04 foo-bar systemd[1]: teleport.service: Supervising process 63333 which is not our child. We'll most likely not notice when it exits.
Mar 10 19:01:20 foo-bar systemd[1]: Stopping Teleport SSH Service...
Mar 10 19:02:50 foo-bar systemd[1]: teleport.service stop-sigterm timed out. Killing.
Mar 10 19:02:50 foo-bar systemd[1]: Stopped Teleport SSH Service.
Mar 10 19:02:50 foo-bar systemd[1]: Unit teleport.service entered failed state.
Mar 10 19:02:50 foo-bar systemd[1]: teleport.service failed.
Mar 10 19:02:50 foo-bar systemd[1]: Started Teleport SSH Service.
Mar 10 19:02:50 foo-bar teleport[53924]: [NODE]         Service 8.3.4:v8.3.4-0-g010bea1 is starting on 0.0.0.0:3022.

USR1 trace available upon request.

The text was updated successfully, but these errors were encountered:

espadolini · 2022-03-18T14:06:44Z

Was this Teleport instance running only the node (i.e. ssh_service) role?

espadolini · 2022-03-18T15:19:12Z

@jdconti do you happen to also have the process logs around the time of one of the failed attempts to tsh ssh?

espadolini · 2022-03-18T18:45:15Z

I'm not entirely sure how Teleport ended up with a /proc/self/exe pointing to /usr/local/bin/teleport;622a7e9a - that ;random suffix is used by RPM during the installation/upgrade process, but the sequence (on the RPM on centos 7 that I've tested, at least) is that the file is first written to path;random and then renamed onto the correct path - so there should never be a situation in which we try to execve a path with the ;random suffix. Perhaps after the file with the intended path got deleted (during the upgrade) the linux VFS decided that the correct name for the /proc/self/exe symlink was the one with the suffix?

Either way, that should be fixed on Linux by setting the command path to /proc/self/exe in lib/srv/reexec_linux.go:reexecCommandOSTweaks(), which will always reexec the currently running Teleport binary rather than one that might have been replaced as part of an upgrade - which would also avoid problems in case we happen to change the behavior of the internal teleport exec/teleport forward/teleport checkhomedir commands across versions.

The messages about Supervising process N which is not our child. We'll most likely not notice when it exits. are somewhat harmless - they happen when systemd notices that the pidfile has changed (and thus the new MainPID for the service should be updated) but the new process is still a child of the old one; systemd will be able to start tracking the new process correctly as soon as the old process terminates - either way, that's not actually what caused this issue. We could directly interact with systemd through its notify socket mechanism so it (and the user) gets informed about reloads and PID changes, but the pidfile system works fine - and part of the automatic upgrades system (wip, #10690) might involve running a process-level supervisor for teleport subprocesses, which would mean that no more PID confusion would ensue, anyway.

jdconti · 2022-03-22T16:01:57Z

@espadolini I don't have the process logs handy but can reproduce this readily if you still need them or can't reproduce this yourself. Thanks!

espadolini · 2022-03-25T10:37:05Z

@jdconti While I'm still baffled by how RPM could've actually caused this, the general class of problems caused by the teleport binary not being accessible where we think it is is fixed in #11283 in master, and will be fixed in the next point releases of teleport - even deleting the binary will not prevent tsh ssh from working, now.

jdconti added the bug label Mar 11, 2022

russjones assigned espadolini Mar 11, 2022

russjones added platform robustness Resistance to crashes and reliability and removed platform labels Mar 19, 2022

espadolini mentioned this issue Mar 21, 2022

Reexec with /proc/self/exe on Linux #11283

Merged

espadolini closed this as completed in #11283 Mar 25, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SIGHUP multiple times results in supervising process N which is not our child errors and no connectivity. #11052

SIGHUP multiple times results in supervising process N which is not our child errors and no connectivity. #11052

jdconti commented Mar 11, 2022

espadolini commented Mar 18, 2022

espadolini commented Mar 18, 2022

espadolini commented Mar 18, 2022 •

edited

Loading

jdconti commented Mar 22, 2022

espadolini commented Mar 25, 2022

SIGHUP multiple times results in supervising process N which is not our child errors and no connectivity. #11052

SIGHUP multiple times results in supervising process N which is not our child errors and no connectivity. #11052

Comments

jdconti commented Mar 11, 2022

Description

Reproduction Steps

Server Details

Client Details

Debug Logs

espadolini commented Mar 18, 2022

espadolini commented Mar 18, 2022

espadolini commented Mar 18, 2022 • edited Loading

jdconti commented Mar 22, 2022

espadolini commented Mar 25, 2022

espadolini commented Mar 18, 2022 •

edited

Loading