Machine hung on starting, and stop didn't work #18662
Comments
Hi @deboer-tim, I forgot to check the existing issues when I created my issue: I think we had/have the same issue, if I am not mistaken? But I'm on macOS rather than Linux.
I've been seeing this recently as well, the only thing that fixes it is killing the qemu process. What qemu version are you using?
Looks like I'm on 8.0.0.
FWIW this happened again yesterday, only solution was killing qemu.
I think the state transitions of podman/cmd/podman/machine/start.go Lines 70 to 75 in 9706147
There are a couple of issues:
I assume this problem applies to providers other than QEMU. @n1hility @ashley-cui WDYT?
NOTE: to reproduce a
Thanks for looking into it @vrothberg!
I believe the purpose of CheckExclusiveActiveVM() is to check if there is ANY vm running, as opposed to which vm is running. Podman machine (on Mac and Linux) only allows one machine to be running, so it is sufficient to return only the first starting/running VM found. If a
I believe a
This sounds good to me
The interesting thing is that Windows allows more than one machine to be running at a time. There were disagreements on whether this was a bug, and if this should be allowed on all platforms, but that's another discussion. I think there were actually two different bugs here that looked like one. One was the stuck in starting bug, which was fixed in podman/pkg/machine/qemu/machine.go Lines 897 to 901 in 85ab620
I can confirm that (as mentioned in #18662 (comment)).
I think it's a legitimate expectation to have more than one running. I am trying to find a solution to fix the BUT: that would be a lot of work and there's no CI to check whether the changes cause regressions. We could start with QEMU but that would increase divergence among the providers.
I just want to point out that I want to get to the point that a user could have multiple machines running at the same time. Imagine on a Mac M2, you have an ARM machine and an x86_64 machine running, and then you use the podman buildfarm command to build a multi-arch image.
I just posted a reply on the discussion with details (#18415 (comment)), the TL;DR is it was intentional to support multiple machines running in parallel.
Keep in mind that the write-lock won't fully cover all cases, since the podman machine command could crash or be killed mid-start (e.g. system shutdown), and the qemu or gvproxy process can always fail immediately after a state is written. So while I agree it makes sense to use file locking as a safeguard to serialize start/stop on the same machine name, all of the commands should be able to handle and recover from inaccurate state. For example start() - if not already recently changed to - should double check qemu is running at the expected pid even if the state file says Started.
IMO it wouldn't be too bad to add some machine-name-specific flock / acquire / wait guards amongst the providers. Let me know if you need a hand.
@vrothberg BTW there is an unlikely race in the qemu / gvproxy dance that I have been aware of but haven't gotten around to fixing. I'll try to push that up as a PR today just in case it helps (I doubt it, but just in case).
Absolutely agree on that. As mentioned above, the locking is meant to serialize and to allow certain assumptions. Currently, two simultaneously running
Thanks, @n1hility!
Lock the VM on start. If the machine is in the "starting" state we know that a previous start has failed and guide the user into resolving the issue. Concurrent starts will busy wait and return the expected "already running" error.

NOTE: this change is only looking at the start issue (containers#18662). Other commands such as stop and update should also lock and will be updated in a future change. I expect the underlying issue to apply to all machine providers, not only QEMU. It's desirable to aim for extending the machine interface to also allow to `Lock()` and `Unlock()`. After acquiring the lock, the VM should automatically be reloaded/updated.

[NO NEW TESTS NEEDED]

Fixes: containers#18662

Signed-off-by: Valentin Rothberg <[email protected]>
Opened #19396
Issue Description
I had a podman machine hang indefinitely while starting. I don't know if it was random, but I was on a very low bandwidth internet connection at the time.
The stop command reported that it had stopped the machine, but it had not. As expected, start then did nothing. In the end I had to find and kill the process.
Steps to reproduce the issue
Unknown.
Describe the results you received
Describe the results you expected
It would be nice if 'starting' always led to 'started' state, or failed after a timeout. Either way, the stop command should always stop the process (and never say it stopped it when it didn't), even if it has to kill it in the background.
podman info output
Podman in a container
No
Privileged Or Rootless
None
Upstream Latest Release
Yes
Additional environment details
Additional information
Additional information like issue happens only occasionally or issue happens with a particular architecture or on a particular setting