StopSignal SIGTERM failed to stop container in 10 seconds #20196
Could it be that the system is so slow and overburdened that it is taking more than 10 seconds?
It could be, but by my count there are >20 other …
A friendly reminder that this issue had no activity for 30 days.
Seen in: int podman fedora-37+fedora-38+fedora-39+fedora-39β+rawhide root+rootless host boltdb+sqlite
Still happening
Seen in: int podman fedora-37+fedora-38+fedora-39+fedora-39β+rawhide root+rootless host boltdb+sqlite
Also seen in sys tests: https://api.cirrus-ci.com/v1/artifact/task/6106357493923840/html/sys-podman-fedora-38-root-host-boltdb.log.html
Thanks. I have one other system test failure, on November 22. Here is the catalog so far:
That is: never (yet) seen in remote nor in containerized; no difference between boltdb/sqlite; and mostly VFS, but not all.
Keep in mind that the logrus errors/warnings are on the server side (unless they are logged on the client, which most of them aren't), so it makes sense that you do not see these in remote CI logs.
Eek. Yes, e2e tests run one server per test, but most tests run a number of podman commands, some with …
I have an idea what could be wrong: as PID 1, the program must register a signal handler for SIGTERM, otherwise the signal is ignored by default. This is what top does, but because signal handlers are part of the program and can only be installed after it has started, it could mean that podman stop was run before top was given enough time to install said handlers.
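A sketch of the suspected timeline (hypothetical commands, using the alpine image from the reproducer further down):

```sh
cid=$(podman run -d quay.io/libpod/alpine:latest top)

# If this lands before top (PID 1 in the container) has installed its
# SIGTERM handler, the signal is silently ignored, podman waits out the
# 10-second grace period, and then resorts to SIGKILL.
podman stop "$cid"
```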
Good hypothesis. Oh how I hate …
A naive reproducer: I guess we need the same podman logs fix, to make sure top has already printed output before we run podman stop.
Or, in cases where we really do not care about the stop behaviour, we could just create the container with …
A number of tests start a container then immediately run podman stop. This frequently flakes with:

    StopSignal SIGTERM failed to stop [...] in 10 seconds, resorting to SIGKILL

Likely reason: container is still initializing, and its process has not yet set up its signal handlers.

Solution: if possible (containers running "top"), wait for "Mem:" to indicate that top is running. If not possible (pods / catatonit), sleep half a second.

Intended to fix some of the flakes cataloged in containers#20196 but I'm leaving that open in case we see more. These are hard to identify just by looking in the code.

Signed-off-by: Ed Santiago <[email protected]>
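As a plain-shell sketch of that solution (the real fix lives in the test helpers; this just illustrates the idea, assuming the same alpine top image):

```sh
cid=$(podman run -d quay.io/libpod/alpine:latest top)

# Wait until top has produced output; once it is printing, its SIGTERM
# handler must already be installed, so podman stop is safe to run.
while ! podman logs "$cid" | grep -q 'Mem:'; do
    sleep 0.1
done

podman stop "$cid"
```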
#21011 has had no effect. This is still one of the most prevalent flakes I'm seeing, and not just in my no-retry PR:
Seems interesting that it's only f38. I will start writing PRs to run …
I wouldn't say #21011 has had no effect. I have not looked at all the logs, but the ones I have looked at all point to the mentioned race, so the fix was just not complete. If anything, I would argue the fix worked, as none of the old test names are mentioned in your new comment.
Looking to expand/generalize the solution here - I think most of these are fairly obvious fixes, but …
You mean, ignore in tests? Yes, that's actually my plan. I have a PR in the works to clean up more of these flakes, but (sigh) other more urgent issues keep coming up. Maybe I'll just submit what I have now as a stepping stone.
Continuing to see CI failures of the form "StopSignal SIGTERM failed to stop container in 10 seconds". Work around those, either by adding "-t0" to podman stop, or by using Expect(Exit(0)) instead of ExitCleanly().

Addresses, but does not close, containers#20196

Signed-off-by: Ed Santiago <[email protected]>
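For illustration, the first workaround looks like this (assuming $cid holds a container ID):

```sh
# No SIGTERM grace period: the container is killed right away, so the
# outcome no longer depends on whether signal handlers are installed yet.
podman stop -t0 "$cid"
```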
This just means your script is ignoring the SIGTERM signal.
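A minimal sketch of a container entrypoint that handles SIGTERM (hypothetical script, for illustration only):

```sh
#!/bin/sh
# As PID 1, a process gets no default SIGTERM disposition from the kernel;
# without this trap, podman stop would time out and fall back to SIGKILL.
trap 'echo "caught SIGTERM, exiting"; exit 0' TERM

while :; do
    sleep 1   # the trap fires once the current sleep returns
done
```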
I am not sure if this helps and is correct here: I just had a similar issue, with a container "ignoring" SIGTERM. However, when checking, the developer figured out that without an explicit --stop-signal=SIGTERM, podman assigned "37" as the stop signal to the container, which did nothing for the included scripts. Hence I would advise checking which stop signal is actually set in the container before trying to amend the running processes. I had this issue on Podman 4.9.3 in a rootful container. My rootless ones all get "15" by default, correctly; no clue where the 37 comes from.
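One way to do that check (the .Config.StopSignal format path below is my understanding of podman's container-inspect output, so verify against your version):

```sh
# Print the stop signal recorded for the container; expect 15/SIGTERM
# unless the image or --stop-signal set something else (e.g. the "37"
# seen above).
podman inspect --format '{{.Config.StopSignal}}' "$cid"
```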
Weird new flake, where "new" means "since ExitCleanly() started checking stderr":
Happens almost exclusively in this one test. (The other one, the "pod something" one, is probably because the entrypoint defaults to `sh`. Probably.) This shouldn't happen, because `top` is quick to exit upon signal. And okay, maybe not always, maybe 2-3 seconds, but ten?? I've tried reproducing, with no luck:
```sh
$ while :; do cid=$(bin/podman --events-backend=file run --http-proxy=false -d quay.io/libpod/alpine:latest top); bin/podman stop --ignore foobar $cid; bin/podman rm $cid; done
```
Anything obvious I've missed?
Seen in: fedora-37/fedora-38/fedora-39/fedora-39β/rawhide root/rootless boltdb/sqlite