StopSignal SIGTERM failed to stop container in 10 seconds #20196

Closed

edsantiago opened this issue Sep 28, 2023 · 21 comments
Labels: flakes (Flakes from Continuous Integration), locked - please file new issue/PR

Comments

@edsantiago
Member

Weird new flake, where "new" means "since ExitCleanly() started checking stderr":

# podman [options] run --http-proxy=false -d quay.io/libpod/alpine:latest top
89a002f65e0abc37edb7dd74e7402aa5d3c8e44b348be9b8173b0b4b62c84e7e
# podman [options] stop --ignore foobar 89a002f65e0abc37edb7dd74e7402aa5d3c8e44b348be9b8173b0b4b62c84e7e
time="2023-09-28T10:44:05-05:00" level=warning msg="StopSignal SIGTERM failed to stop container lucid_dubinsky in 10 seconds, resorting to SIGKILL"

Happens almost exclusively in this one test. (The other one, the "pod something" test, probably flakes because its entrypoint defaults to sh. Probably.)

This shouldn't happen, because top is quick to exit upon signal. And okay, maybe not always, maybe 2-3 seconds, but ten??

I've tried reproducing, with no luck:

$ while :;do cid=$(bin/podman --events-backend=file  run --http-proxy=false  -d quay.io/libpod/alpine:latest top);bin/podman stop --ignore foobar $cid;bin/podman rm $cid;done

Anything obvious I've missed?

  • fedora-37 : int podman fedora-37 rootless host sqlite
    • 09-22 12:10 in Podman stop podman stop --ignore bogus container
  • fedora-38 : int podman fedora-38 root host sqlite
    • 09-28 12:00 in Podman stop podman stop --ignore bogus container
  • fedora-39 : int podman fedora-39 root host sqlite
    • 09-27 21:49 in Podman stop podman stop --ignore bogus container
  • fedora-39 : int podman fedora-39 rootless host boltdb
    • 09-27 15:37 in Podman stop podman stop --ignore bogus container
  • fedora-39 : int podman fedora-39 rootless host sqlite
    • 09-27 08:58 in Podman stop podman stop --ignore bogus container
  • fedora-39β : int podman fedora-39β root host boltdb
    • 09-28 13:13 in Podman stop podman stop --ignore bogus container
  • rawhide : int podman rawhide root host sqlite
    • 09-27 21:53 in Podman stop podman stop --ignore bogus container
    • 09-27 15:38 in Podman prune podman system prune with running, exited pod and volume prune set true

Seen in: fedora-37/fedora-38/fedora-39/fedora-39β/rawhide root/rootless boltdb/sqlite

@edsantiago edsantiago added the flakes Flakes from Continuous Integration label Sep 28, 2023
@rhatdan
Member

rhatdan commented Sep 29, 2023

Could it be that the system is so slow and overburdened that it is taking more than 10 seconds?

@edsantiago
Member Author

It could be, but by my count there are >20 other podman stop commands in e2e tests that do not have -t, and I would expect to see took-too-long errors in at least one of those. What I'm seeing is consistently in the --ignore test.


github-actions bot commented Nov 2, 2023

A friendly reminder that this issue had no activity for 30 days.

@edsantiago
Member Author

edsantiago commented Nov 2, 2023

  • fedora-37 : int podman fedora-37 rootless host sqlite
  • fedora-38 : int podman fedora-38 root host sqlite
    • 10-31 17:22 in Podman stop podman stop --ignore bogus container
    • 10-31 11:36 in Podman prune podman system prune pods
    • 10-26 19:05 in Podman stop podman stop --ignore bogus container
    • 10-26 19:05 in Podman stop podman stop container by id
    • 10-25 18:39 in Podman stop podman stop --ignore bogus container
    • 09-28-2023 12:00 in Podman stop podman stop --ignore bogus container
  • fedora-38 : int podman fedora-38 rootless host boltdb
  • fedora-38 : int podman fedora-38 rootless host sqlite
    • 10-11 10:52 in Podman stop podman stop --ignore bogus container
  • fedora-39 : int podman fedora-39 root host sqlite
  • fedora-39 : int podman fedora-39 rootless host boltdb
  • fedora-39 : int podman fedora-39 rootless host sqlite
  • fedora-39β : int podman fedora-39β root host boltdb
  • rawhide : int podman rawhide root host sqlite
    • 09-27-2023 21:53 in Podman stop podman stop --ignore bogus container
    • 09-27-2023 15:38 in Podman prune podman system prune with running, exited pod and volume prune set true
  • rawhide : int podman rawhide rootless host sqlite
    • 10-18 23:04 in Podman stop podman stop --ignore bogus container

Seen in: int podman fedora-37+fedora-38+fedora-39+fedora-39β+rawhide root+rootless host boltdb+sqlite

@edsantiago edsantiago changed the title podman stop --ignore: failed to stop container in 10 seconds StopSignal SIGTERM failed to stop container in 10 seconds Nov 28, 2023
@edsantiago
Member Author

Still happening

Seen in: int podman fedora-37+fedora-38+fedora-39+fedora-39β+rawhide root+rootless host boltdb+sqlite

@Luap99
Member

Luap99 commented Dec 6, 2023

[+1003s] not ok 298 [160] podman volume rm --force
         # (from function `bail-now' in file test/system/[helpers.bash, line 227](https://github.com/containers/podman/blob/3367fd6095bd2db5c0e60e72abeeae71f43b4e8f/test/system/helpers.bash#L227),
         #  from function `die' in file test/system/[helpers.bash, line 790](https://github.com/containers/podman/blob/3367fd6095bd2db5c0e60e72abeeae71f43b4e8f/test/system/helpers.bash#L790),
         #  from function `run_podman' in file test/system/[helpers.bash, line 435](https://github.com/containers/podman/blob/3367fd6095bd2db5c0e60e72abeeae71f43b4e8f/test/system/helpers.bash#L435),
         #  in test file test/system/[160-volumes.bats, line 158](https://github.com/containers/podman/blob/3367fd6095bd2db5c0e60e72abeeae71f43b4e8f/test/system/160-volumes.bats#L158))
         #   `run_podman volume rm myvol --force' failed
         #
<+     > # # podman rm -t 0 --all --force --ignore
         #
<+051ms> # # podman ps --all --external --format {{.ID}} {{.Names}}
         #
<+052ms> # # podman images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
<+046ms> # quay.io/libpod/testimage:20221018 f5a99120db64
         #
<+026ms> # # podman volume rm -a
         #
<+050ms> # # podman run -d --volume myvol:/myvol quay.io/libpod/testimage:20221018 top
<+376ms> # 7e8d417666f6f695e575285be058534f41ec6f0f3858fdda7cd632da9bda66fe
         #
<+011ms> # # podman volume rm myvol
<+045ms> # Error: volume myvol is being used by the following container(s): 7e8d417666f6f695e575285be058534f41ec6f0f3858fdda7cd632da9bda66fe: volume is being used
<+005ms> # [ rc=2 (expected) ]
         #
<+014ms> # # podman volume rm myvol --force
<+0010s> # time="2023-12-05T11:50:40-06:00" level=warning msg="StopSignal SIGTERM failed to stop container frosty_hopper in 10 seconds, resorting to SIGKILL"
         # myvol
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #| FAIL: Command succeeded, but issued unexpected warnings
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
         #
<+011ms> # # podman rm -a --volumes

Also seen in sys tests: https://api.cirrus-ci.com/v1/artifact/task/6106357493923840/html/sys-podman-fedora-38-root-host-boltdb.log.html

@edsantiago
Member Author

Thanks. I have one other system test failure, on November 22. Here is the catalog so far:

Seen in: int(29)+sys(1) podman(30) fedora-37(1)+fedora-38(20)+fedora-39(3)+fedora-39β(2)+rawhide(4) root(19)+rootless(11) host(30) boltdb(16)+sqlite(14)

That is: never (yet) seen in remote or containerized; no difference between boltdb/sqlite; and mostly VFS, but not all.

@Luap99
Member

Luap99 commented Dec 6, 2023

Keep in mind the logrus errors/warnings are on the server side (unless they are logged on the client, which most of them aren't), so it makes sense that you do not see these in remote CI logs.
Maybe we can teach CI to capture server-side logs and error on unexpected stderr there as well. At least in e2e that should be possible, as we have one server per test. System tests are unlikely to work, as the server is spawned outside of the testing setup.

@edsantiago
Member Author

Eek. Yes, e2e tests run one server per test, but most tests run a number of podman commands, some with ExitCleanly(), some with Exit(N). I could envision setting up a per-test tracker of those calls, checking it in teardown, and doing a warning check iff there are zero Exit() checks... but I'm not sure it's worth the effort or complexity? Worth discussing, though. Thanks for the idea.

@Luap99
Member

Luap99 commented Dec 6, 2023

I have an idea what could be wrong: as PID 1, the program must register a signal handler for SIGTERM, otherwise the signal is ignored by default. This is what top does, but because signal handlers are of course part of the program and can only be installed after top has started, it could mean that podman stop ran before top was given enough time to install said handlers.
It is a very small race, but given we have a bunch of weird CI flakes, I could imagine that this might be the cause.
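
A minimal way to observe that race, sketched here for illustration (not part of the original comment): PID 1's caught-signal mask in /proc/1/status only gains the SIGTERM bit once top has installed its handler.

$ cid=$(podman run -d quay.io/libpod/alpine:latest top)
$ podman exec $cid grep SigCgt /proc/1/status   # immediately after start, the SIGTERM bit (0x4000) may still be unset
$ sleep 1
$ podman exec $cid grep SigCgt /proc/1/status   # once top is up, its handler shows in the mask
$ podman rm -f -t0 $cid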

@edsantiago
Member Author

Good hypothesis. Oh how I hate podman run -d. Oh how I wish someone would take #18952 seriously.

@Luap99
Member

Luap99 commented Dec 6, 2023

A naive reproducer: podman run -d --name test alpine sh -c 'sleep 0.05; top' && podman stop test && podman rm test
Of course it's not the same thing, but the sleep 0.05 alone is enough to make it flaky on my laptop.

I guess we need the same podman logs fix to make sure top has already printed output before we run podman stop.

@Luap99
Member

Luap99 commented Dec 6, 2023

I guess we need the same podman logs fix to make sure top has already printed output before we run podman stop.

Or in cases where we really do not care about the stop behaviour, we could just create the container with --stop-signal KILL to always kill right away.
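
An illustrative sketch of that alternative (not from the original comment), using the same alpine/top image as above:

$ podman run -d --name test --stop-signal KILL quay.io/libpod/alpine:latest top
$ podman stop test   # sends SIGKILL right away; no 10-second SIGTERM grace period
$ podman rm test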

edsantiago added a commit to edsantiago/libpod that referenced this issue Dec 14, 2023
A number of tests start a container then immediately run podman stop.
This frequently flakes with:

   StopSignal SIGTERM failed to stop [...] in 10 seconds, resorting to SIGKILL

Likely reason: container is still initializing, and its process
has not yet set up its signal handlers.

Solution: if possible (containers running "top"), wait for "Mem:"
to indicate that top is running. If not possible (pods / catatonit),
sleep half a second.

Intended to fix some of the flakes cataloged in containers#20196 but I'm
leaving that open in case we see more. These are hard to identify
just by looking in the code.

Signed-off-by: Ed Santiago <[email protected]>
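
For reference, a rough shell equivalent of the wait-for-"Mem:" approach described in the commit message above (an illustrative sketch only; the actual change lives in the test helpers):

$ cid=$(podman run -d quay.io/libpod/alpine:latest top)
$ until podman logs $cid | grep -q Mem:; do sleep 0.2; done   # top prints a "Mem:" header once it is running
$ podman stop $cid
$ podman rm $cid
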
@edsantiago
Member Author

#21011 has had no effect. This is still one of the most prevalent flakes I'm seeing, and not just in my no-retry PR:

Seen in: int(13)+sys(4) podman(17) fedora-38(17) root(10)+rootless(7) host(17) boltdb(17)

Seems interesting that it's only f38.

I will start writing PRs to run stop -t0.

@Luap99
Member

Luap99 commented Feb 6, 2024

I wouldn't say #21011 has had no effect. I have not looked at all the logs, but the ones I have looked at all point to the mentioned race, so the fix was just not complete. If anything, I would argue the fix worked, as none of the old test names are mentioned in your new comment.

@mheon
Member

mheon commented Feb 14, 2024

Looking to expand/generalize the solution here - I think most of these are fairly obvious fixes, but podman run --replace is not. We're not directly invoking podman stop there; it's getting invoked from within podman run as part of removing the old container. I don't think the logic for doing that is incorrect - we want to obey the usual stop timeout - but that means we have a race we can't close. Maybe we should ignore that warning message just for run --replace?

@edsantiago
Member Author

You mean, ignore in tests? Yes, that's actually my plan. I have a PR in the works to clean up more of these flakes, but (sigh) other more urgent issues keep coming up. Maybe I'll just submit what I have now as a stepping stone.

@edsantiago
Member Author

#21661

edsantiago added a commit to edsantiago/libpod that referenced this issue Feb 15, 2024
Continuing to see CI failures of the form "StopSignal SIGTERM
failed to stop container in 10 seconds". Work around those,
either by adding "-t0" to podman stop, or by using Expect(Exit(0))
instead of ExitCleanly().

Addresses, but does not close, containers#20196

Signed-off-by: Ed Santiago <[email protected]>
@bonjour-py

I have the same thing here; podman version is 4.3.1.
It seems the program cannot receive any StopSignal.
[image attachment]

@rhatdan
Member

rhatdan commented Mar 8, 2024

This just means your script is ignoring the SIGTERM signal.
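
For anyone hitting this with a shell-script entrypoint, an illustrative sketch (not from the original comment) of handling SIGTERM explicitly:

#!/bin/sh
# Install a SIGTERM handler so "podman stop" can terminate the container
# cleanly instead of falling back to SIGKILL after the timeout.
trap 'echo "caught SIGTERM, shutting down"; exit 0' TERM
while :; do
    sleep 1 &
    wait $!    # wait is interrupted by the trapped signal, so shutdown is prompt
done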

@chris42

chris42 commented Apr 17, 2024

I am not sure if this helps and is correct here: I just had a similar issue, with a container "ignoring" SIGTERM. However, when checking, the developer figured out that without giving an explicit --stop-signal=SIGTERM, podman assigned "37" as the stop signal to the container, which did nothing for the included scripts.
You find the issue here (in German): jens-maus/RaspberryMatic#2717

Hence I would advise checking which stop signal is actually set in the container before trying to amend the running processes. I had this issue on Podman 4.9.3 in a rootful container. My rootless ones are all getting "15" by default correctly; no clue where the 37 comes from.
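
A quick way to do that check (illustrative; the template field comes from podman container inspect output and its exact form may vary between versions):

$ podman inspect --format '{{.Config.StopSignal}}' <container-name>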

@Luap99 Luap99 closed this as completed Jun 15, 2024
@stale-locking-app stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 14, 2024
@stale-locking-app stale-locking-app bot locked as resolved and limited conversation to collaborators Sep 14, 2024