rawhide: podman hangs, unstoppable #21504
I also just saw it on one of my PRs: https://cirrus-ci.com/task/5624583085096960. This looks really strange. Reading the code, I would most likely suspect the kernel here: we SIGKILL the process, then wait up to 5s for it to be dead (checked with kill -0). We could try bumping the 5s timeout, although the fact that all the following rm attempts fail in the same way leads me to believe something is wrong with sending SIGKILL and then waiting for the pid to disappear. I mean, looking at this there is an obvious pid-reuse bug here, i.e. we kill the old pid, and by the time we check whether the pid is alive it could already have been reassigned, although that sounds super unlikely to me. It would be a good start if we could capture
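For reference, a minimal bash sketch of the kill-then-wait pattern described above; the real logic lives in podman's Go code, and only the 5s timeout and the kill -0 liveness check are taken from the comment, everything else is illustrative:

```bash
#!/usr/bin/env bash
# Illustrative only: SIGKILL a pid, then poll up to 5s for it to disappear,
# using `kill -0` as the liveness check. A pid-reuse race would mean
# `kill -0` succeeds against a *new* process handed the same pid.
pid=$1

kill -KILL "$pid" 2>/dev/null

for _ in $(seq 1 50); do          # 50 x 0.1s = 5s
    if ! kill -0 "$pid" 2>/dev/null; then
        echo "pid $pid is gone"
        exit 0
    fi
    sleep 0.1
done

echo "timed out waiting for pid $pid to exit" >&2
exit 1
```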
FINALLY! I instrumented #17831 and have been hammering at it. Finally got a failure in the right place. It's just the
So now I really need to see the inspect output of the container; the only thing I can imagine here is that the pid is set to 0. I find it unlikely that kill would work otherwise: something like kill(0, 0) would always succeed, and thus we might think waiting for the pid timed out. You could also try adding

Either way, I think the root cause is something wrong with conmon, and to debug that I fear we need to attach gdb to see where it hangs, and for that we would need a reproducer. Although, given that per your table conmon hasn't really changed, and given that so far all cases linked here were seen on rawhide, I think it may be the kernel change which is the underlying cause.

The minimum you can try is to build new images to get a newer kernel; maybe this is already fixed.
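For what it's worth, one generic way to capture where conmon is stuck, once a reproducer exists, is to attach gdb to the live process and dump backtraces; this is a sketch of that idea, not an established recipe from this thread:

```bash
# Attach gdb to a (presumably hung) conmon, dump all thread backtraces,
# then detach without killing it. Needs gdb plus debuginfo for useful symbols.
pid=$(pidof -s conmon)
gdb -p "$pid" -batch \
    -ex 'set pagination off' \
    -ex 'thread apply all bt'
```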
I don't think I can run

To build new VMs, I'm waiting for netavark to go stable on f38/f39. Maybe tomorrow. And I haven't tried to reproduce it in a VM because it's so unreliable, but I'll try to make time tomorrow.
Another one.
Ah, the remote logs are much more cluttered due to the stupid extra conmon exit delay; anyhow, this is the relevant container part:
The container process (touch) is alive and ps reports 92% CPU usage, WTF? At this point the command has been running for at least 90 seconds per the log. And ps says touch is actually running, not in a sleeping state.
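Assuming access to the host while the hang is live, a quick sanity check of what ps is claiming might look like this (plain procfs/procps usage, nothing podman-specific):

```bash
#!/usr/bin/env bash
# Sanity-check the state of the stuck process.
pid=$1   # the pid of the container's `touch`, taken from ps output

# STAT column: R = running, D = uninterruptible sleep, S = sleeping.
# WCHAN shows the kernel function the task is blocked in, if any.
ps -o pid,stat,%cpu,wchan:32,cmd -p "$pid"

# Kernel-side stack of the task (needs root); a task genuinely spinning
# in userspace or in the kernel will show little or nothing useful here.
sudo cat "/proc/$pid/stack"
```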
Quick observation: a large fraction of the failures are in
Since this is MUCH easier to instrument than e2e tests, I've been running a loop test on it. No failures in many hours, but I'll keep it going. It's not a kernel diff; the kernel is

Flake catalog so far:
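For reference, a minimal sketch of the sort of loop test mentioned above; the test file path is a guess and the real harness may invoke bats differently:

```bash
# Re-run one system test file until it fails, counting clean passes.
i=0
while bats test/system/070-build.bats; do
    i=$((i + 1))
    echo "pass #$i"
done
echo "failed after $i clean passes"
```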
Yeah, hard to tell if they all happen due to the same cause; it could also be two different problems. On the plus side, build doesn't use conmon at all, which is further proof that it must be the kernel misbehaving. I also have a 1mt VM looping over the system tests to try to reproduce, but so far no luck either.
New symptom: complete test hang, no useful output whatsoever. The interesting thing is that this is the system tests, which I instrumented as follows:

```bash
# FIXME FIXME FIXME 2024-02-07 for hang
run_podman '?' rm -t 0 --all --force --ignore
if [[ $status -ne 0 ]]; then
    foo=$(expr "$output" : "container \(.*\) as it could not be stopped")
    if [[ -n "$foo" ]]; then
        echo "# BARF BARF BARF"
        run_podman '?' container inspect $foo
        echo "$_LOG_PROMPT ps auxww --forest"
        ps auxww --forest
    fi
fi
```

Hypothesis: the flake triggered, and
But the podman commands shouldn't hang forever, because we run them through the run_podman function which uses the 90s timeout. So maybe ps, bash, bats, or the timeout command itself hangs? Looking at these weird symptoms, I really start to wonder if any process could just start hanging forever with this kernel.
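For context, a minimal sketch of the timeout-wrapper pattern described above; the 90s figure comes from the comment, while the function name and the extra kill grace period are illustrative, not run_podman's actual implementation:

```bash
# Wrap a command in coreutils `timeout`: SIGTERM after 90s, SIGKILL 10s
# later if it ignores SIGTERM. `timeout` exits 124 when the limit is hit.
run_with_timeout() {
    timeout --foreground --kill-after=10 90 "$@"
    local rc=$?
    if [[ $rc -eq 124 ]]; then
        echo "# command timed out: $*" >&2
    fi
    return $rc
}

run_with_timeout podman ps -a
```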
Latest list:
Here is the current package-version table, possibly useful to see what is different between rawhide and all the others:
I concur with @Luap99, this smells like a kernel issue. I'll look into it.
Rawhide kernel
Seen in OpenQA
OK, that is a good pointer. I downloaded the logs and ran journalctl on it and found

That certainly doesn't sound good at all. Going back through our journal logs, I do indeed see the same warning in some of them.
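A sketch of the kind of journal search being described; the exact warning text is not quoted above, so the grep patterns here (including the "stack going" string mentioned later in the thread) are placeholders, and the log directory name is made up:

```bash
# On a live machine: scan kernel messages for warnings or call traces.
journalctl -k --no-pager | grep -Ei 'warn|call trace|stack going'

# On downloaded CI logs: same idea over a directory of journal dumps.
grep -RniE 'warn|call trace|stack going' ./ci-journal-logs/
```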
Oh, nice catch. I don't fetch journal logs in my flake db but it's easy to fetch them. I will start working on it and report back.
Well, mixed news. Out of 78 flakes in my DB, the string "stack going" only appears in the journal logs for 7 of them:

Next: that string appears in all instances of #21749, the new oom flake. This does not surprise me: I had a hunch that these were connected.

Next: that string DOES NOT appear in any rawhide journals that I looked at for flakes other than the hang. (This is admittedly a poor and non-thorough data point, because I just hand-picked logs.)

Next: no other relevant matches (in these logs) for 'kernel.*warn' (I am ignoring

HTH.
Another one in OpenQA
Sigh. I am sorry to report that the flake lives on in today's new VMs, kernel
@Luap99 I realized that I do have access to the CID and test environment, and can run podman inspect. It doesn't seem very useful any more given your reproducer, but is there something else I can do to instrument and debug? Like get
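In case it becomes useful again, the pid-related fields can be pulled straight out of inspect; this is a sketch that assumes podman's usual inspect schema (.State.Pid / .State.ConmonPid) and a CID taken from the failing test:

```bash
#!/usr/bin/env bash
# Pull just the pids relevant to the kill/wait theory out of inspect:
# the container process pid and the conmon pid.
cid=$1   # the CID from the failing test

podman container inspect \
    --format '{{.State.Pid}} {{.State.ConmonPid}}' "$cid"

# Or dump the whole State block (ExitCode, OOMKilled, timestamps, ...).
podman container inspect --format '{{json .State}}' "$cid" | jq .
```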
Nothing needed for now.
Build
Are we comfortable closing this given that the new kernel has arrived? We can always re-open if the demons reappear in the shadows.
It seems pretty likely that rc6 fixes the hang.
Weird new flake, manifesting in many different tests but the common factors are:
I'm pretty sure it has to do with the new CI VMs. The table below shows the diffs between old-VMs-without-the-hang and new-VMs. I include all OSes, not just rawhide, in order to help isolate what-changed-in-rawhide-only vs what-changed-everywhere. The most suspicious changes IMO are `conmon` and `containers-common`, but it could also be `systemd` or kernel. (Sorry, I don't have an easy way to show kernel versions.)

I think I've seen more of these flakes but they're really hard to categorize, so I'm filing as-is right now and will follow up over time.
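A hedged sketch of one way such a package diff can be produced by hand on two VM images; the file names here are made up, and the real CI table is generated by the image-build tooling:

```bash
# On each VM (old image and new image), dump the installed package set:
rpm -qa --qf '%{NAME}-%{VERSION}-%{RELEASE}\n' | sort > packages.txt

# Then compare the two dumps to spot what changed between images:
diff -u packages-old-vm.txt packages-new-vm.txt

# And the kernel version, which the table admittedly doesn't show:
uname -r
```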