Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SIGHUP multiple times results in supervising process N which is not our child errors and no connectivity. #11052

Closed
jdconti opened this issue Mar 11, 2022 · 5 comments · Fixed by #11283
Assignees
Labels
bug robustness Resistance to crashes and reliability

Comments

@jdconti
Copy link

jdconti commented Mar 11, 2022

Description

What happened: RPM upgrade from teleport 8.3.1 to teleport 8.3.4 prevented any new sessions (old sessions remained unaffected). This node was reloaded/SIGHUP'd previously and the latest upgrade + reload resulted in errors on any new connections.

$ tsh ssh foo-bar
fork/exec /usr/local/bin/teleport;622a7e9a: no such file or directory
ERROR: ssh: could not start shell

What you expected to happen: New connections to the upgraded node should always be successful.

Reproduction Steps

As minimally and precisely as possible, describe step-by-step how to reproduce the problem.

  1. Connect 10+ times to node in question (# likely doesn't matter).
  2. systemctl reload teleport
  3. yum update teleport-ent-8.3.1 to teleport-ent-8.3.4
  4. systemctl reload teleport
  5. Attempt a new connection.

Server Details

  • Teleport version (run teleport version): Teleport Enterprise v8.3.4 git:v8.3.4-0-g010bea1 go1.17.3
  • Server OS (e.g. from /etc/os-release): Red Hat Enterprise Linux Server 7.7 (Maipo)
  • Where are you running Teleport? (e.g. AWS, GCP, Dedicated Hardware): Dedicated Hardware
  • Additional details: N/A

Client Details

  • Tsh version (tsh version): Teleport v8.3.1 git:v8.3.1-0-g64812df go1.17.3
  • Computer OS (e.g. Linux, macOS, Windows): Linux and Mac
  • Browser version (for UI-related issues): N/A
  • Installed via (e.g. apt, yum, brew, website download): yum
  • Additional details: N/A

Debug Logs

Please include or attach debug logs, when appropriate. Obfuscate sensitive information!

Feb 24 16:47:16 foo-bar systemd[1]: Reloading Teleport SSH Service.
Feb 24 16:47:16 foo-bar systemd[1]: Reloaded Teleport SSH Service.
Feb 24 16:47:17 foo-bar teleport[95058]: [NODE]         Service 8.3.1:v8.3.1-0-g64812df is starting on 0.0.0.0:3022.
Mar 10 16:47:53 foo-bar systemd[1]: Reloading Teleport SSH Service.
Mar 10 16:47:53 foo-bar systemd[1]: teleport.service: Supervising process 63333 which is not our child. We'll most likely not notice when it exits.
Mar 10 16:47:53 foo-bar systemd[1]: Reloaded Teleport SSH Service.
Mar 10 16:49:00 foo-bar systemd[1]: teleport.service: Supervising process 63333 which is not our child. We'll most likely not notice when it exits.
Mar 10 16:49:00 foo-bar systemd[1]: teleport.service: Supervising process 63333 which is not our child. We'll most likely not notice when it exits.
Mar 10 16:49:59 foo-bar systemd[1]: teleport.service: Supervising process 63333 which is not our child. We'll most likely not notice when it exits.
Mar 10 16:49:59 foo-bar systemd[1]: teleport.service: Supervising process 63333 which is not our child. We'll most likely not notice when it exits.
Mar 10 16:50:04 foo-bar systemd[1]: teleport.service: Supervising process 63333 which is not our child. We'll most likely not notice when it exits.
Mar 10 16:50:04 foo-bar systemd[1]: teleport.service: Supervising process 63333 which is not our child. We'll most likely not notice when it exits.
Mar 10 19:01:20 foo-bar systemd[1]: Stopping Teleport SSH Service...
Mar 10 19:02:50 foo-bar systemd[1]: teleport.service stop-sigterm timed out. Killing.
Mar 10 19:02:50 foo-bar systemd[1]: Stopped Teleport SSH Service.
Mar 10 19:02:50 foo-bar systemd[1]: Unit teleport.service entered failed state.
Mar 10 19:02:50 foo-bar systemd[1]: teleport.service failed.
Mar 10 19:02:50 foo-bar systemd[1]: Started Teleport SSH Service.
Mar 10 19:02:50 foo-bar teleport[53924]: [NODE]         Service 8.3.4:v8.3.4-0-g010bea1 is starting on 0.0.0.0:3022.

USR1 trace available upon request.

@espadolini
Copy link
Contributor

Was this Teleport instance running only the node (i.e. ssh_service) role?

@espadolini
Copy link
Contributor

@jdconti do you happen to also have the process logs around the time of one of the failed attempts to tsh ssh?

@espadolini
Copy link
Contributor

espadolini commented Mar 18, 2022

I'm not entirely sure how Teleport ended up with a /proc/self/exe pointing to /usr/local/bin/teleport;622a7e9a - that ;random suffix is used by RPM during the installation/upgrade process, but the sequence (on the RPM on centos 7 that I've tested, at least) is that the file is first written to path;random and then renamed onto the correct path - so there should never be a situation in which we try to execve a path with the ;random suffix. Perhaps after the file with the intended path got deleted (during the upgrade) the linux VFS decided that the correct name for the /proc/self/exe symlink was the one with the suffix?

Either way, that should be fixed on Linux by setting the command path to /proc/self/exe in lib/srv/reexec_linux.go:reexecCommandOSTweaks(), which will always reexec the currently running Teleport binary rather than one that might have been replaced as part of an upgrade - which would also avoid problems in case we happen to change the behavior of the internal teleport exec/teleport forward/teleport checkhomedir commands across versions.

The messages about Supervising process N which is not our child. We'll most likely not notice when it exits. are somewhat harmless - they happen when systemd notices that the pidfile has changed (and thus the new MainPID for the service should be updated) but the new process is still a child of the old one; systemd will be able to start tracking the new process correctly as soon as the old process terminates - either way, that's not actually what caused this issue. We could directly interact with systemd through its notify socket mechanism so it (and the user) gets informed about reloads and PID changes, but the pidfile system works fine - and part of the automatic upgrades system (wip, #10690) might involve running a process-level supervisor for teleport subprocesses, which would mean that no more PID confusion would ensue, anyway.

@russjones russjones added platform robustness Resistance to crashes and reliability and removed platform labels Mar 19, 2022
@jdconti
Copy link
Author

jdconti commented Mar 22, 2022

@espadolini I don't have the process logs handy but can reproduce this readily if you still need them or can't reproduce this yourself. Thanks!

@espadolini
Copy link
Contributor

@jdconti While I'm still baffled by how RPM could've actually caused this, the general class of problems caused by the teleport binary not being accessible where we think it is is fixed in #11283 in master, and will be fixed in the next point releases of teleport - even deleting the binary will not prevent tsh ssh from working, now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug robustness Resistance to crashes and reliability
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants