restart=always not properly working with podman stop #18259

Closed
Luap99 opened this issue Apr 18, 2023 · 6 comments · Fixed by #18267
Assignees: Luap99
Labels: flakes (Flakes from Continuous Integration), kind/bug (Categorizes issue or PR as related to a bug.), locked - please file new issue/PR (Assist humans wanting to comment on an old issue or PR with locked comments.)

Comments

@Luap99
Member

Luap99 commented Apr 18, 2023

Issue Description

Another bug found in my ginkgo v2 work. podman kube play seems to configure the pod/container in a different way, which causes podman stop --all to fail. It is a race condition I see quite a lot in the CI logs; however, I can reproduce it locally in about 1 out of 3 tries.

Steps to reproduce the issue


  1. Create the yaml file:
$ cat > t.yaml <<EOF
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2023-04-18T16:58:12Z"
  labels:
    app: test
  name: test
spec:
  containers:
  - command:
    - ip
    - a
    image: docker.io/library/alpine:latest
    name: laughinghaibt
EOF
  2. Run bin/podman kube play t.yaml && bin/podman stop --all && bin/podman rm -fa
    Note this is a flake, so you may need to run it several times until it fails.

Describe the results you received

podman stop fails and exits with 125

Pod:
2cde8e2fc327ebc2444339244ead6b7bb31693c99e8d4f45cb8fc4711b239822
Container:
01552f3338e7a0f9d04b6519d213983dccceefdb4ab00a5708cda081173b0311

4bc7eecf0816790bec3675f2f68950f8a49b0430a855f7a5706209b927f249c1
Error: cannot get namespace path unless container 4bc7eecf0816790bec3675f2f68950f8a49b0430a855f7a5706209b927f249c1 is running: container is stopped

Describe the results you expected

Podman stop should work.

podman info output

latest main branch

Podman in a container

No

Privileged Or Rootless

None

Upstream Latest Release

Yes

Additional environment details

No response

Additional information

CI log: https://api.cirrus-ci.com/v1/artifact/task/5261073197039616/html/int-podman-debian-12-root-host-boltdb.log.html#t--podman-generate-kube-privileged-container--1
Search for "cannot get namespace path unless container".

Interestingly enough, I was not able to reproduce this with pod create and run; the CI logs also seem to only show it with play kube.

@Luap99 Luap99 added the kind/bug (Categorizes issue or PR as related to a bug.) and flakes (Flakes from Continuous Integration) labels Apr 18, 2023
@edsantiago
Member

The weird thing about this one, IIRC (from memory; I haven't looked at all log files), is that if it flakes once, it will also fail on all retries. Any explanation for that?

@Luap99
Member Author

Luap99 commented Apr 18, 2023

The problem seems to be caused by play kube defaulting to the restart policy always. Thus the short-lived container is constantly restarted. If the container is started while the infra container is stopped, this error happens.
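
For what it's worth, one way to confirm which policy kube play applied is a quick inspect (a rough sketch; the container name test-laughinghaibt is an assumption based on the pod and container names in the yaml above):

```
# After `podman kube play t.yaml`, inspect the restart policy on the created
# container. kube play names containers <pod>-<container>, so "test-laughinghaibt"
# here (assumption based on the yaml above).
podman container inspect --format '{{.HostConfig.RestartPolicy.Name}}' test-laughinghaibt
# With the current kube play default this prints "always".
```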

@edsantiago
Member

Ohhhh.... I remember something like that. Could my fix in #18169 be used as inspiration for fixing this? By adding a magic restart-no string to the yaml?
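
For reference, a sketch of what an explicit restart policy could look like in the reproducer yaml (this uses the standard Kubernetes restartPolicy field rather than a magic string; Never is just one non-restarting choice):

```
# Hypothetical variant of t.yaml from the reproducer, with an explicit
# restartPolicy so the short-lived container is not restarted after it exits.
cat > t-norestart.yaml <<EOF
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: test
  name: test
spec:
  restartPolicy: Never
  containers:
  - command:
    - ip
    - a
    image: docker.io/library/alpine:latest
    name: laughinghaibt
EOF
podman kube play t-norestart.yaml && podman stop --all && podman rm -fa
```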

@Luap99
Member Author

Luap99 commented Apr 18, 2023

I could, but then again the above command is a simple reproducer with (for me) a valid yaml. The fact that restart always is the default seems odd to me, but OK, I do not care about k8s defaults.

Regardless of this, the above error clearly shows that there is a bug somewhere; we do not lock correctly. An explicit podman stop must always ignore the restart policy and keep the container stopped until a user starts it again.

@edsantiago
Member

I too find the restart-forever policy troublesome, but am not qualified to comment on it. However: what does podman stop -a even mean on a forever-spinning container? If it happens when the container is stopped, stop -a will not see the container. If it happens when the container is up, "stop" will presumably stop it, but the container will start right back up. (This is trivial to reproduce with podman run -d --restart=always ...).
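
A rough sketch of that loop with plain shell commands (the container name spinner is arbitrary):

```
# A short-lived container with --restart=always oscillates between running and
# exited, because it is restarted as soon as it exits.
podman run -d --restart=always --name spinner alpine true
podman stop spinner                                        # may race with a restart
sleep 2
podman ps -a --filter name=spinner --format '{{.Status}}'  # frequently "Up ..." again
podman rm -f spinner                                       # rm -f reliably removes it
```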

Maybe a better solution would be for generate_kube_test.go:AfterEach() to do pod rm -f -a?

@Luap99
Member Author

Luap99 commented Apr 19, 2023

Sure, there are ways to make it work for me in the tests, but the real problem is that this is a valid bug any user can run into.

If podman stop is run on a container with restart=always, it must stop it and ignore the restart policy. If it does not do that, we have a big problem.

@Luap99 Luap99 changed the title pod started with kube play throws weird error on podman stop --all restart=always not properly working with podman stop --all Apr 19, 2023
@Luap99 Luap99 changed the title restart=always not properly working with podman stop --all restart=always not properly working with podman stop Apr 19, 2023
@Luap99 Luap99 self-assigned this Apr 19, 2023
Luap99 added a commit to Luap99/libpod that referenced this issue Apr 20, 2023
Commit 1ab833f improved the situation but it is still not enough.
If you run short-lived containers with --restart=always, podman is
basically permanently restarting them. The only way to stop this is
podman stop. However, podman stop does not do anything when the
container is already in a non-running state. While this makes sense, we
should still mark the container as explicitly stopped by the user.

Together with the change in shouldRestart(), which now checks for
StoppedByUser, this makes sure the cleanup process is not going to start
it back up again.

A simple reproducer is:
```
podman run --restart=always --name test -d alpine true
podman stop test
```
then check if the container is still running. The behavior is very
flaky; it took me about 20 podman stop tries before I finally hit the
correct window where it was stopped permanently.
With this patch it worked on the first try.

Fixes containers#18259

[NO NEW TESTS NEEDED] This is super flaky and hard to correctly test
in CI. My ginkgo v2 work seems to trigger this in the play kube tests, so
that should catch at least some regressions. Also, this may be something
that should be tested at podman test days by users (containers#17912).

Signed-off-by: Paul Holzinger <[email protected]>
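
One way to do the "check if the container is still running" step from the reproducer in the commit message above (a sketch using standard podman inspect state fields):

```
# Run the reproducer from the commit message, then watch the container state.
podman inspect --format '{{.State.Status}}' test
# Without the fix this frequently flips back to "running" shortly after the stop;
# with the fix it should stay "exited" until the user starts the container again.
```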
Luap99 added a commit to Luap99/libpod that referenced this issue May 23, 2023
@github-actions github-actions bot added the locked - please file new issue/PR label Aug 26, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 26, 2023