rawhide: podman hangs, unstoppable #21504
I also just saw it on one of my PRs: https://cirrus-ci.com/task/5624583085096960. This looks really strange. Reading the code, I would most likely suspect the kernel here: we SIGKILL the process, then wait up to 5s for it to be dead (checked with kill -0). We could try bumping the 5s timeout, although the fact that all the following rm attempts fail in the same way leads me to believe something is wrong with sending SIGKILL and then waiting for the pid to disappear. I mean, looking at this there is an obvious pid-reuse bug here, i.e. we kill the old pid, and by the time we check whether the pid is alive it could already have been reassigned, although that sounds super unlikely to me. It would be a good start if we could capture
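For reference, a minimal bash sketch of the kill-then-wait pattern described above; the real logic lives in podman's Go code, and only the 5s timeout and the kill -0 liveness check are taken from the comment, everything else is illustrative:

```bash
#!/usr/bin/env bash
# Illustrative only: SIGKILL a pid, then poll up to 5s for it to disappear,
# using `kill -0` as the liveness check. A pid-reuse race would mean
# `kill -0` succeeds against a *new* process handed the same pid.
pid=$1

kill -KILL "$pid" 2>/dev/null

for _ in $(seq 1 50); do          # 50 x 0.1s = 5s
    if ! kill -0 "$pid" 2>/dev/null; then
        echo "pid $pid is gone"
        exit 0
    fi
    sleep 0.1
done

echo "timed out waiting for pid $pid to exit" >&2
exit 1
```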
FINALLY! I instrumented #17831 and have been hammering at it. Finally got a failure in the right place. It's just the
So now I really need to see the inspect output of the container; the only thing I can imagine here is that the pid is set to 0. I find it unlikely that kill would work otherwise: something like kill(0, 0) would always succeed, and thus we might think waiting for the pid timed out. You could also try adding

Either way, I think the root cause is something wrong with conmon, and to debug that I fear we need to attach gdb to see where it hangs, and for that we would need a reproducer. Although, given that per your table conmon hasn't really changed, and given that so far all cases linked here were seen on rawhide, I think it may be the kernel change which is the underlying cause.

The minimum you can try is to build new images to get a newer kernel; maybe this is already fixed.
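For what it's worth, one generic way to capture where conmon is stuck, once a reproducer exists, is to attach gdb to the live process and dump backtraces; this is a sketch of that idea, not an established recipe from this thread:

```bash
# Attach gdb to a (presumably hung) conmon, dump all thread backtraces,
# then detach without killing it. Needs gdb plus debuginfo for useful symbols.
pid=$(pidof -s conmon)
gdb -p "$pid" -batch \
    -ex 'set pagination off' \
    -ex 'thread apply all bt'
```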
I don't think I can run

To build new VMs, I'm waiting for netavark to go stable on f38/f39. Maybe tomorrow. And I haven't tried to reproduce it in a VM because it's so unreliable, but I'll try to make time tomorrow.
Another one.
Ah, the remote logs are much more cluttered due to the stupid extra conmon exit delay; anyhow, this is the relevant container part:
The container process (touch) is alive and ps reports 92% CPU usage, WTF? At this point the command has been running for at least 90 seconds per the log. And ps says touch is actually running, not in a sleeping state.
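Assuming access to the host while the hang is live, a quick sanity check of what ps is claiming might look like this (plain procfs/procps usage, nothing podman-specific):

```bash
#!/usr/bin/env bash
# Sanity-check the state of the stuck process.
pid=$1   # the pid of the container's `touch`, taken from ps output

# STAT column: R = running, D = uninterruptible sleep, S = sleeping.
# WCHAN shows the kernel function the task is blocked in, if any.
ps -o pid,stat,%cpu,wchan:32,cmd -p "$pid"

# Kernel-side stack of the task (needs root); a task genuinely spinning
# in userspace or in the kernel will show little or nothing useful here.
sudo cat "/proc/$pid/stack"
```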
Quick observation: a large fraction of the failures are in
Since this is MUCH easier to instrument than e2e tests, I've been running a loop test on it. No failures in many hours, but I'll keep it going. It's not a kernel diff; the kernel is

Flake catalog so far:
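For reference, a minimal sketch of the sort of loop test mentioned above; the test file path is a guess and the real harness may invoke bats differently:

```bash
# Re-run one system test file until it fails, counting clean passes.
i=0
while bats test/system/070-build.bats; do
    i=$((i + 1))
    echo "pass #$i"
done
echo "failed after $i clean passes"
```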
Yeah, hard to tell if they all happen due to the same cause; it could also be two different problems. On the plus side, build doesn't use conmon at all, which is further proof that it must be the kernel misbehaving. I also have a 1mt VM looping over the system tests to try to reproduce, but so far no luck either.
New symptom: complete test hang, no useful output whatsoever. The interesting thing is that this is the system tests, which I instrumented as follows:

```bash
# FIXME FIXME FIXME 2024-02-07 for hang
run_podman '?' rm -t 0 --all --force --ignore
if [[ $status -ne 0 ]]; then
    foo=$(expr "$output" : "container \(.*\) as it could not be stopped")
    if [[ -n "$foo" ]]; then
        echo "# BARF BARF BARF"
        run_podman '?' container inspect $foo
        echo "$_LOG_PROMPT ps auxww --forest"
        ps auxww --forest
    fi
fi
```

Hypothesis: the flake triggered, and
But the podman commands shouldn't hang forever, because we run them through the run_podman function which uses the 90s timeout. So maybe ps, bash, bats, or the timeout command itself hangs? Looking at these weird symptoms, I really start to wonder if any process could just start hanging forever with this kernel.
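For context, a minimal sketch of the timeout-wrapper pattern described above; the 90s figure comes from the comment, while the function name and the extra kill grace period are illustrative, not run_podman's actual implementation:

```bash
# Wrap a command in coreutils `timeout`: SIGTERM after 90s, SIGKILL 10s
# later if it ignores SIGTERM. `timeout` exits 124 when the limit is hit.
run_with_timeout() {
    timeout --foreground --kill-after=10 90 "$@"
    local rc=$?
    if [[ $rc -eq 124 ]]; then
        echo "# command timed out: $*" >&2
    fi
    return $rc
}

run_with_timeout podman ps -a
```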
Latest list:
Here is the current package-version table, possibly useful to see what is different between rawhide and all the others:
I concur with @Luap99, this smells like a kernel issue. I'll look into it.
Rawhide kernel
Seen in OpenQA
OK, that is a good pointer. I downloaded the logs and ran journalctl on it and found

That certainly doesn't sound good at all. Going back through our journal logs, I do indeed see the same warning in some of them.
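A sketch of the kind of journal search being described; the exact warning text is not quoted above, so the grep patterns here (including the "stack going" string mentioned later in the thread) are placeholders, and the log directory name is made up:

```bash
# On a live machine: scan kernel messages for warnings or call traces.
journalctl -k --no-pager | grep -Ei 'warn|call trace|stack going'

# On downloaded CI logs: same idea over a directory of journal dumps.
grep -RniE 'warn|call trace|stack going' ./ci-journal-logs/
```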
Oh, nice catch. I don't fetch journal logs in my flake db but it's easy to fetch them. I will start working on it and report back.
Well, mixed news. Out of 78 flakes in my DB, the string "stack going" only appears in the journal logs for 7 of them:

Next: that string appears in all instances of #21749, the new oom flake. This does not surprise me: I had a hunch that these were connected.

Next: that string DOES NOT appear in any rawhide journals that I looked at for flakes other than the hang. (This is admittedly a poor and non-thorough data point, because I just hand-picked logs.)

Next: no other relevant matches (in these logs) for 'kernel.*warn' (I am ignoring

HTH.
Another one in OpenQA
Sigh. I am sorry to report that the flake lives on in today's new VMs, kernel
@Luap99 I realized that I do have access to the CID and test environment, and can run podman inspect. It doesn't seem very useful any more given your reproducer, but is there something else I can do to instrument and debug? Like get
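In case it becomes useful again, the pid-related fields can be pulled straight out of inspect; this is a sketch that assumes podman's usual inspect schema (.State.Pid / .State.ConmonPid) and a CID taken from the failing test:

```bash
#!/usr/bin/env bash
# Pull just the pids relevant to the kill/wait theory out of inspect:
# the container process pid and the conmon pid.
cid=$1   # the CID from the failing test

podman container inspect \
    --format '{{.State.Pid}} {{.State.ConmonPid}}' "$cid"

# Or dump the whole State block (ExitCode, OOMKilled, timestamps, ...).
podman container inspect --format '{{json .State}}' "$cid" | jq .
```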
Nothing needed for now.
Build
Are we comfortable closing this given that the new kernel has arrived? We can always re-open if the demons reappear in the shadows.
It seems pretty likely that rc6 fixes the hang.
Weird new flake, manifesting in many different tests but the common factors are:
I'm pretty sure it has to do with the new CI VMs. The table below shows the diffs between old-VMs-without-the-hang and new-VMs. I include all OSes, not just rawhide, in order to help isolate what-changed-in-rawhide-only vs what-changed-everywhere. The most suspicious changes IMO are `conmon` and `containers-common`, but it could also be `systemd` or kernel. (Sorry, I don't have an easy way to show kernel versions.)

I think I've seen more of these flakes but they're really hard to categorize, so I'm filing as-is right now and will follow up over time.
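A hedged sketch of one way such a package diff can be produced by hand on two VM images; the file names here are made up, and the real CI table is generated by the image-build tooling:

```bash
# On each VM (old image and new image), dump the installed package set:
rpm -qa --qf '%{NAME}-%{VERSION}-%{RELEASE}\n' | sort > packages.txt

# Then compare the two dumps to spot what changed between images:
diff -u packages-old-vm.txt packages-new-vm.txt

# And the kernel version, which the table admittedly doesn't show:
uname -r
```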