Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stuck forever in down command (despite shutdown.timeout_seconds) #258

Closed
mprimi opened this issue Sep 30, 2024 · 8 comments
Closed

Stuck forever in down command (despite shutdown.timeout_seconds) #258

mprimi opened this issue Sep 30, 2024 · 8 comments
Labels
bug Something isn't working done Done, awaiting release

Comments

@mprimi
Copy link

mprimi commented Sep 30, 2024

This is a follow-up to #251 where i was seeing call to list getting stuck. Since then I have added a shutdown.timeout_seconds parameter to ensure timely shutdown.
However I am now seeing down getting stuck forever, e.g..:

$ ps aux | grep process-compose
myuser  480743  0.1  1.1 1917956 44684 pts/1   Sl+  15:48   0:13 process-compose -L /[...]/process-compose.log --config ./config/[...].yml --port 18080 --tui=false up
myuser  481605  0.0  0.9 1622260 37120 pts/1   Sl+  15:53   0:00 process-compose --port 18080 down

In this configuration, i am running 6 processes.

  • 5 of them shut down without issue
  • 1 of them is still alive under that up command.

I expected shutdown.timeout_seconds (set to 5 seconds) to kick in and KILL the culprit.
However the down process has been stuck where it is for the last hour.


$ sudo cat /proc/480743/stack
[<0>] futex_wait_queue+0x67/0xa0
[<0>] futex_wait+0x151/0x270
[<0>] do_futex+0x178/0x230
[<0>] __x64_sys_futex+0x78/0x1f0
[<0>] x64_sys_call+0x1c47/0x2570
[<0>] do_syscall_64+0x56/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x73/0xdd
 sudo $(which gdb) -batch -ex bt $(which process-compose) 480743
[New LWP 481115]
[New LWP 480892]
[New LWP 480751]
[New LWP 480750]
[New LWP 480749]
[New LWP 480748]
[New LWP 480747]
[New LWP 480746]
[New LWP 480745]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/nix/store/3dyw8dzj9ab4m8hv5dpyx7zii8d0w6fi-glibc-2.39-52/lib/libthread_db.so.1".
0x0000000000476c43 in ?? ()
#0  0x0000000000476c43 in ?? ()
#1  0x000000000043b830 in ?? ()
#2  0x0000000001cc2020 in ?? ()
#3  0x0000000000000080 in ?? ()
#4  0x0000000000000000 in ?? ()
[Inferior 1 (process 480743) detached]

Unfortunately not much to go on here...


Current repro (maybe you can repro this yourself):

  • Create configuration with multiple processes
  • Set shutdown.timeout_seconds for all of them
  • Modify one of the processes to ignore SIGTERM (e.g. force the shutdown to timeout and issue a SIGKILL)
  • Start-stop a few times and possibly observe down getting stuck

This is reproducing for me with v1.27 on Linux.

Happy to capture any more data you think is interesting, or to try other configuration settings.

@mprimi mprimi changed the title Stuck forever in down command Stuck forever in down command (despite shutdown.timeout_seconds) Oct 1, 2024
@F1bonacc1
Copy link
Owner

Hi @mprimi,

Thank you for such a detailed analysis, this is very helpful.
I created this project:

version: "0.5"
log_level: info
log_length: 300

processes:
  pc_log:
    command: "tail -f -n100 process-compose-${USER}.log"
    working_dir: "/tmp"
    shutdown:
      timeout_seconds: 5

  sigterm_resistant:
    command: "trap '' SIGTERM && sleep 20"
    shutdown:
      timeout_seconds: 5

  sleeper:
    command: "sleep 60"
    replicas: 5
    shutdown:
      timeout_seconds: 5

Started it about 20 times and stopped it with the down command. Each time the project stopped as expected.
This is the script I used:

for i in {1..20} ; do ./bin/process-compose up -f issues/issue_258/process-compose.yaml -t=false & sleep 1 && ./bin/process-compose down ; done

Are there any additional details we might have missed?
Maybe there is something specific to your process-compose.yaml? Would you be able to publish it or try to repro with mine?

@F1bonacc1 F1bonacc1 added bug Something isn't working need more info labels Oct 2, 2024
@mprimi
Copy link
Author

mprimi commented Oct 4, 2024

Thank you for looking into this.

I also have been trying to repro with a variant of TestSystem_TestProcShutDownWithConfiguredTimeOut, unsuccessfully so far.

Nevertheless our CI pipes (1.27 on Linux via nix) get stuck daily on up and down blocked on one process that doesn't shut down.

I'll keep trying to see if i can learn more from it and report back.

Here's a snip of the configuration I'm using:

  Server_s1:
    namespace: "cluster"
    command: "[...]/server -c ./config/s1.conf"
    log_location: /tmp/s1.log
    working_dir: "."
    availability:
      restart: "always"
    liveness_probe:
      initial_delay_seconds: 5
      period_seconds: 1
      timeout_seconds: 3
      success_threshold: 1
      failure_threshold: 3
      http_get:
        host: 127.0.0.1
        scheme: http
        path: "/"
        port: 18221
    readiness_probe:
      initial_delay_seconds: 5
      period_seconds: 1
      timeout_seconds: 2
      success_threshold: 1
      failure_threshold: 8
      http_get:
        host: 127.0.0.1
        scheme: http
        path: "/health"
        port: 18221
    shutdown:
      timeout_seconds: 5

(more servers with the same configuration omitted)

@F1bonacc1
Copy link
Owner

This is very helpful.
I suspect it might be related to the readiness probe not being Ready when the down command is issued.
I will find time soon to confirm and fix it.

@Lagoja
Copy link

Lagoja commented Oct 15, 2024

We're seeing a similar issue with Devbox using our process-compose.yaml for Postgresql:

version: "0.5"

processes:
  postgresql:
    command: "pg_ctl start -o \"-k $PGHOST\""
    is_daemon: true
    shutdown:
      command: "pg_ctl stop -m fast"
    availability:
      restart: "always"
    readiness_probe:
      exec:
        command: "pg_isready"

It looks like the progresql process here gets stuck in Launching, and then process-compose never exits, similar to what's described above.

Trying different versions, it looks like this bug was introduced sometime after 1.24.2 -- If I run the process-compose file above with 1.24.2, I get a Launched status and everything works fine. On 1.27 and higher, the process gets stuck at Launching

@F1bonacc1
Copy link
Owner

F1bonacc1 commented Oct 26, 2024

Hi @Lagoja,
This was very helpful in pinpointing the issues (yes there is more than one):

Issue 1

Process Compose skipped termination of a Launching daemon. It's an easy issue to fix and this will prevent PC from getting stuck when trying to stop it while in the Launching state.

Issue 2

But why did pc_ctl remain in the Launching state in the first place?
The root cause is a fix introduced as part of #237.
In this fix, Process Compose waits for stdout and stderr to finish (EOF) before waiting for the process to end.
In this sense, pc_ctl behaves differently than other processes. It keeps its stdout open even after the process is terminated but is not waited on (this is what collects its status to prevent zombies).
Please notice the timestamps:
image
5 minutes after the process is "completed" it still sends output to stdout.

>ps -aef | grep 80415
eugene     80415   80400  0 14:25 pts/1    00:00:00 [pg_ctl] <defunct> # pg_ctl is a zombie

Possible Fixes

  1. I can add a new timeout variable for daemon starters that will wait for std[out,err] to close and continue if they didn't in that given time. Default value 5s if not specified.
  2. Revert to the previous logic: don't wait for stdout to close, but sleep. I can make it longer, to reduce the (very low) risk of missing log lines.
  3. I looked at the pg_ctl source code to find if there are any flags that can change this behavior but didn't find anything helpful.

Personally, I prefer the 1st option, but I will be happy to get feedback from the users.

CC:
@thenonameguy, @mkenigs

@F1bonacc1 F1bonacc1 added done Done, awaiting release and removed need more info labels Nov 2, 2024
@mprimi
Copy link
Author

mprimi commented Nov 5, 2024

@F1bonacc1, thank you for investigating this.

While it seems likely the problems are related, the issue I originally reported is not exactly what you identified:

  • I am not using daemon processes
  • The process I'm shutting down is not in a zombie state

Since I can reproduce, please let me know of any additional info I can capture.

The configuration I posted above is still current.

Overview:

  1. Launch 5+1 processes with up
  2. Some of my processes get killed or die, that's expected. I rely on process-compose to bring them back up
  3. At some point I issue a down which gets stuck forever

While in this state, process list also hangs forever.
If I try to attach, it shows 0 processes, but it's probably stuck waiting on the daemon to respond.
Looking at ps/pstree shows that up is still running along with one of my processes:

$  ps aux | grep process-compose
myuser  230939  0.1  1.1 1844196 45872 ?       Sl   00:08   1:06 process-compose -L /tmp/[...]/process-compose.log --config ./config/process-compose/5-nodes.yml --port 18080 --tui=false up
myuser  243648  0.0  0.9 1622488 38016 ?       Sl   04:12   0:00 process-compose --port 18080 down

The down command above has been stuck there for hours.

Looking with pstree shows 1 of the 5 servers still running:

$ pstree -p -T 230939
process-compose(230939)───my-server(231308)

Here's the tail of process-compose log:

24-11-05 00:08:45.471 INF Restarting Server_s3 in 1 second(s)... Restarts: 1
24-11-05 00:08:46.472 INF Started command=["bash","-c","/tmp/my-server/bin/my-server -c ./config/[...]/s3.conf"] process=Server_s3
24-11-05 00:08:52.495 INF Exited exit_code=-1 process=Server_s2
24-11-05 00:08:52.495 INF Restarting Server_s2 in 1 second(s)... Restarts: 1
24-11-05 00:08:53.497 INF Started command=["bash","-c","/tmp/my-server/bin/my-server -c ./config/[...]/s2.conf"] process=Server_s2
24-11-05 00:09:05.545 INF Exited exit_code=-1 process=Server_s5
24-11-05 00:09:05.545 INF Restarting Server_s5 in 1 second(s)... Restarts: 1
24-11-05 00:09:06.547 INF Started command=["bash","-c","/tmp/my-server/bin/my-server -c ./config/[...]/s5.conf"] process=Server_s5
24-11-05 00:09:13.562 INF Exited exit_code=-1 process=Server_s1
24-11-05 00:09:13.562 INF Restarting Server_s1 in 1 second(s)... Restarts: 1
24-11-05 00:09:14.564 INF Started command=["bash","-c","/tmp/my-server/bin/my-server -c ./config/[...]/s1.conf"] process=Server_s1
24-11-05 00:09:20.598 INF Exited exit_code=-1 process=Server_s2
24-11-05 00:09:20.598 INF Restarting Server_s2 in 1 second(s)... Restarts: 2
24-11-05 00:09:20.878 INF Started command=["bash","-c","/tmp/my-server/bin/my-server -c ./config/[...]/s2.conf"] process=Server_s2
24-11-05 00:09:20.899 INF Exited exit_code=1 process=Server_s3
24-11-05 00:09:20.929 INF Exited exit_code=1 process=Server_s4
24-11-05 00:09:20.936 INF Exited exit_code=1 process=Server_s5
24-11-05 00:09:20.941 INF Exited exit_code=0 process=Proxy
24-11-05 00:09:20.952 INF Exited exit_code=1 process=Server_s1
24-11-05 01:09:00.422 INF Caught terminated - Shutting down the running processes...

Notice processes are dying and getting restarted (as expected).

Very interesting that the one last restarted (Server_s2 just 20ms before everything is torn down) is the one still active (i.e. pid 231308 above).

Seems likely that this process was not "ready" when the down command was issued.

I tried to peek a the a stack trace (via SIGQUIT), but could not find where stderr went (not in syslog).
Although if you are stumped I can probably figure out how to capture one.

@mprimi
Copy link
Author

mprimi commented Nov 10, 2024

Confirming issue is fixed in v1.40.0. Thank you @F1bonacc1

@mkenigs
Copy link
Contributor

mkenigs commented Nov 12, 2024

CC: @thenonameguy, @mkenigs

@F1bonacc1 thanks for the ping and sorry I missed that earlier! Glad to see you already merged a fix

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working done Done, awaiting release
Projects
None yet
Development

No branches or pull requests

4 participants