Machine hung on starting, and stop didn't work #18662

Closed
deboer-tim opened this issue May 23, 2023 · 16 comments · Fixed by #19396
Labels
kind/bug (Categorizes issue or PR as related to a bug), locked - please file new issue/PR (Assist humans wanting to comment on an old issue or PR with locked comments), machine, podman-desktop, remote (Problem is in podman-remote)

Comments

@deboer-tim

Issue Description

I had a podman machine hang indefinitely while starting. I don't know if it was random, but I was on a very low bandwidth internet connection at the time.

The stop command said it stopped the machine, but it didn't. As expected, start didn't do anything either. In the end I had to find and kill the process.

Steps to reproduce the issue

Unknown.

Describe the results you received

% podman machine list                     
NAME                  VM TYPE     CREATED      LAST UP             CPUS        MEMORY      DISK SIZE
podman-machine-demo*  qemu        4 days ago   Currently starting  4           12.58GB     107.4GB
podman-machine-alt    qemu        2 weeks ago  4 days ago          4           9.98GB      107.4GB
podman-45             qemu        2 weeks ago  11 days ago         4           11.22GB     107.4GB
% podman machine stop podman-machine-demo 
Machine "podman-machine-demo" stopped successfully
% podman machine list                    
NAME                  VM TYPE     CREATED      LAST UP             CPUS        MEMORY      DISK SIZE
podman-machine-demo*  qemu        4 days ago   Currently starting  4           12.58GB     107.4GB
podman-machine-alt    qemu        2 weeks ago  4 days ago          4           9.98GB      107.4GB
podman-45             qemu        2 weeks ago  11 days ago         4           11.22GB     107.4GB
% podman machine start podman-machine-alt
Error: cannot start VM podman-machine-alt. VM podman-machine-demo is currently running or starting: only one VM can be active at a time
% podman machine list                    
NAME                  VM TYPE     CREATED      LAST UP             CPUS        MEMORY      DISK SIZE
podman-machine-demo*  qemu        4 days ago   Currently starting  4           12.58GB     107.4GB
podman-machine-alt    qemu        2 weeks ago  4 days ago          4           9.98GB      107.4GB
podman-45             qemu        2 weeks ago  11 days ago         4           11.22GB     107.4GB

Describe the results you expected

It would be nice if 'starting' always led to the 'started' state, or failed after a timeout. Either way, the stop command should always stop the process (and never say it stopped it when it didn't), even if it has to kill it in the background.

podman info output

% podman info
host:
  arch: arm64
  buildahVersion: 1.30.0
  cgroupControllers:
  - cpu
  - io
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.7-2.fc38.aarch64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.7, commit: '
  cpuUtilization:
    idlePercent: 99.25
    systemPercent: 0.45
    userPercent: 0.3
  cpus: 4
  databaseBackend: boltdb
  distribution:
    distribution: fedora
    variant: coreos
    version: "38"
  eventLogger: journald
  hostname: localhost.localdomain
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 1000000
    uidmap:
    - container_id: 0
      host_id: 501
      size: 1
    - container_id: 1
      host_id: 100000
      size: 1000000
  kernel: 6.2.15-300.fc38.aarch64
  linkmode: dynamic
  logDriver: journald
  memFree: 11486474240
  memTotal: 12208103424
  networkBackend: netavark
  ociRuntime:
    name: crun
    package: crun-1.8.4-1.fc38.aarch64
    path: /usr/bin/crun
    version: |-
      crun version 1.8.4
      commit: 5a8fa99a5e41facba2eda4af12fa26313918805b
      rundir: /run/user/501/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
  os: linux
  remoteSocket:
    exists: true
    path: /run/user/501/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: true
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.0-12.fc38.aarch64
    version: |-
      slirp4netns version 1.2.0
      commit: 656041d45cfca7a4176f6b7eed9e4fe6c11e8383
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.3
  swapFree: 0
  swapTotal: 0
  uptime: 0h 7m 24.00s
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - docker.io
store:
  configFile: /var/home/core/.config/containers/storage.conf
  containerStore:
    number: 7
    paused: 0
    running: 0
    stopped: 7
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /var/home/core/.local/share/containers/storage
  graphRootAllocated: 106769133568
  graphRootUsed: 7990566912
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 10
  runRoot: /run/user/501/containers
  transientStore: false
  volumePath: /var/home/core/.local/share/containers/storage/volumes
version:
  APIVersion: 4.5.0
  Built: 1681486872
  BuiltTime: Fri Apr 14 11:41:12 2023
  GitCommit: ""
  GoVersion: go1.20.2
  Os: linux
  OsArch: linux/arm64
  Version: 4.5.0

Podman in a container

No

Privileged Or Rootless

None

Upstream Latest Release

Yes

Additional environment details


Additional information

Additional information like issue happens only occasionally or issue happens with a particular architecture or on a particular setting

@deboer-tim added the kind/bug label on May 23, 2023
@github-actions bot added the remote label on May 23, 2023
@Luap99 added the machine label on May 23, 2023
@Aenimus

Aenimus commented May 23, 2023

Hi @deboer-tim,

I forgot to check the existing issues when I created my issue:
#18666

I think we had/have the same issue, if I am not mistaken? But I'm on macOS rather than Linux.

@ashley-cui
Member

I've been seeing this recently as well; the only thing that fixes it is killing the qemu process. What qemu version are you using?

@deboer-tim
Author

Looks like I'm on 8.0.0.

@deboer-tim
Author

FWIW this happened again yesterday, only solution was killing qemu.

@vrothberg
Member

I think the state transitions of podman machine are not correct. To determine whether a machine is currently running or "starting", CheckExclusiveActiveVM() is used. This function returns the first VM that is either running or starting and matches it against the to-be-started VM, see:

if active {
	if vmName == activeName {
		return fmt.Errorf("cannot start VM %s: %w", vmName, machine.ErrVMAlreadyRunning)
	}
	return fmt.Errorf("cannot start VM %s. VM %s is currently running or starting: %w", vmName, activeName, machine.ErrMultipleActiveVM)
}

There are a couple of issues:

  • Since CheckExclusiveActiveVM() does not return all running/starting VMs but only the first one it finds, machine start may very well start an already running machine.
  • There is no way to recover from the "starting" state. It is written early during start, see
    	v.Starting = true
    	if err := v.writeConfig(); err != nil {
    		return fmt.Errorf("writing JSON file: %w", err)
    	}
    and if the machine process gets killed, we're stuck in this state.
  • The single .Starting bool is not enough for such state transitions. We need locks for concurrent processes to block. If we run podman start, get the lock, and the machine is still in .Starting == true, we know something went wrong.

I assume this problem applies to providers other than QEMU. @n1hility @ashley-cui WDYT?
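
For illustration, a minimal sketch of the kind of per-machine file lock being proposed (hypothetical helper names and lock-file location; not the actual podman implementation):

package machine

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// acquireMachineLock takes an exclusive flock on a per-VM lock file so that
// concurrent "podman machine" commands touching the same VM block each other.
// The caller must keep the returned file open until the operation finishes.
func acquireMachineLock(configDir, vmName string) (*os.File, error) {
	lockPath := filepath.Join(configDir, vmName+".lock") // hypothetical location
	f, err := os.OpenFile(lockPath, os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return nil, fmt.Errorf("opening lock file: %w", err)
	}
	// LOCK_EX blocks until no other process holds the lock.
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
		f.Close()
		return nil, fmt.Errorf("locking %s: %w", lockPath, err)
	}
	return f, nil
}

// releaseMachineLock drops the flock and closes the lock file.
func releaseMachineLock(f *os.File) {
	syscall.Flock(int(f.Fd()), syscall.LOCK_UN)
	f.Close()
}

With something like this in place, start/stop/update would wrap their work in acquireMachineLock/releaseMachineLock, so a second command can only observe the machine's state after the first one has either finished or failed.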

@vrothberg
Member

NOTE: to reproduce, a CTRL+C is not sufficient. The process must actually get killed, e.g. via killall -9 podman.

@vrothberg
Member

machine stop does work in my environment with the latest version of Podman though.

@ashley-cui
Member

ashley-cui commented Jul 26, 2023

Thanks for looking into it @vrothberg!

There are a couple of issues:

  • Since CheckExclusiveActiveVM() does not return all running/starting VMs but only the first one it finds, machine start may very well start an already running machine.

I believe the purpose of CheckExclusiveActiveVM() is to check if there is ANY VM running, as opposed to which VM is running. Podman machine (on Mac and Linux) only allows one machine to be running, so it is sufficient to return only the first starting/running VM found. If a podman machine start is issued on a different machine while another is running, it should fail. I'm not sure the bug is here.

  • There is no way to recover from the "starting" state. It is written early during start, see
    and if the machine process gets killed, we're stuck in this state.

I believe a podman machine stop will fix incorrect state if the machine gets stuck in starting.

  • The single .Starting bool is not enough for such state transitions. We need locks for concurrent processes to block. If we run podman start, get the lock, and the machine is still in .Starting == true, we know something went wrong.

This sounds good to me.

I assume this problem applies to other providers than QEMU. @n1hility @ashley-cui WDYT?

The interesting thing is that Windows allows more than one machine to be running at a time. There were disagreements on whether this was a bug, and if this should be allowed on all platforms, but that's another discussion.

I think there were actually two different bugs here that looked like one. One was the stuck-in-starting bug, which was fixed in stop. The other was that the QEMU process becomes unresponsive to machine commands. This happens when the QMP socket cannot be pinged. Podman machine assumes that if the QMP socket is not up, the machine is down, so stop doesn't do anything; but that is not the case when this unresponsiveness bug pops up.

// check if the qmp socket is there. if not, qemu instance is gone
if _, err := os.Stat(v.QMPMonitor.Address.GetPath()); errors.Is(err, fs.ErrNotExist) {
	// Right now it is NOT an error to stop a stopped machine
	logrus.Debugf("QMP monitor socket %v does not exist", v.QMPMonitor.Address)
	// Fix incorrect starting state in case of crash during start

@vrothberg
Member

I believe a podman machine stop will fix incorrect state (#18328) if the machine gets stuck in starting.

I can confirm that (as mentioned in #18662 (comment)); stop must have changed in the meantime.

The interesting thing is that Windows allows more than one machine to be running at a time (#18415). There were disagreements on whether this was a bug, and if this should be allowed on all platforms, but that's another discussion.

I think it's a legitimate expectation to have more than one running. CheckExclusiveActiveVM() should be able to check whether a given VM is running or not. With a locking mechanism the function wouldn't be needed anymore.

I am trying to find a solution to fix the start issue. IMO all commands that alter a VM, such as start, stop, update, etc., should hold a write-lock. This avoids races between the commands and also allows detecting whether a command has failed.

BUT: that would be a lot of work, and there's no CI to check whether the changes cause regressions. We could start with QEMU, but that would increase divergence among the providers.

@rhatdan
Member

rhatdan commented Jul 26, 2023

I just want to point out that I want to get to the point where a user could have multiple machines running at the same time.

Imagine that on a Mac M2 you have an ARM machine and an x86_64 machine running, and then you use the podman buildfarm command to build a multi-arch image.

@n1hility
Member

The interesting thing is that Windows allows more than one machine to be running at a time. There were disagreements on whether this was a bug, and if this should be allowed on all platforms, but that's another discussion.

I just posted a reply on the discussion with details (#18415 (comment)); the TL;DR is that it was intentional to support multiple machines running in parallel.

@n1hility
Member

CheckExclusiveActiveVM() should be able to check whether a given VM is running or not. With a locking mechanism the function wouldn't be needed anymore.

I am trying to find a solution to fix the start issue. IMO all commands that alter a VM, such as start, stop, update, etc., should hold a write-lock. This avoids races between the commands and also allows detecting whether a command has failed.

Keep in mind that the write-lock won't fully cover all cases, since the podman machine command could crash or be killed mid-start (e.g. on system shutdown), and the qemu or gvproxy process can always fail immediately after a state is written. So while I agree it makes sense to use file locking as a safeguard to serialize start/stop on the same machine name, all of the commands should be able to handle and recover from inaccurate state. For example, start() - if not already changed recently to do so - should double-check that qemu is running at the expected PID even if the state file says Started.
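
A minimal sketch of that kind of liveness check (hypothetical helper, not the podman source; on Unix, signal 0 only tests whether the process exists):

package machine

import (
	"os"
	"syscall"
)

// processAlive reports whether a process with the given PID currently exists.
// Sending signal 0 performs the existence check without delivering a signal.
func processAlive(pid int) bool {
	if pid <= 0 {
		return false
	}
	proc, err := os.FindProcess(pid) // on Unix this never fails
	if err != nil {
		return false
	}
	return proc.Signal(syscall.Signal(0)) == nil
}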

BUT: that would be a lot of work, and there's no CI to check whether the changes cause regressions. We could start with QEMU, but that would increase divergence among the providers.

IMO it wouldn't be too bad to add some machine-name-specific flock/acquire/wait guards among the providers. Let me know if you need a hand.

@n1hility
Member

@vrothberg BTW there is an unlikely race in the qemu/gvproxy dance that I have been aware of but haven't gotten around to fixing. I'll try to push that up as a PR today in case it helps (I doubt it, but just in case).

@vrothberg
Member

vrothberg commented Jul 26, 2023

Keep in mind that the write-lock won't fully cover all cases, since the podman machine command could crash or be killed mid-start (e.g. on system shutdown), and the qemu or gvproxy process can always fail immediately after a state is written. So while I agree it makes sense to use file locking as a safeguard to serialize start/stop on the same machine name, all of the commands should be able to handle and recover from inaccurate state. For example, start() - if not already changed recently to do so - should double-check that qemu is running at the expected PID even if the state file says Started.

Absolutely agree on that. As mentioned above, the locking is a means to serialize and to establish certain assumptions. Currently, two simultaneously running starts could shoot each other down. With locks, we can have the assumption/invariant that "starting" implies that a previous start went wrong. Then we can, for instance, clean up leftovers of the previous failed start and restart, etc.
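
A self-contained sketch of that invariant (all names hypothetical; the lock parameter stands in for something like the flock helper sketched earlier, not the podman API):

package machine

type machineState struct {
	Name     string
	Starting bool
}

// startWithRecovery assumes that, once the exclusive per-machine lock is held,
// a persisted Starting flag can only be leftover state from a start that
// crashed or was killed, so it is safe to repair it before starting again.
func startWithRecovery(st *machineState, lock func() (unlock func(), err error)) error {
	unlock, err := lock()
	if err != nil {
		return err
	}
	defer unlock()

	if st.Starting {
		// No other start can be in flight while we hold the lock, so clean up
		// leftovers (stale sockets, PID files, ...) and reset the flag.
		st.Starting = false
	}
	// ... proceed with the normal start sequence and persist state ...
	return nil
}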

@vrothberg
Member

@vrothberg BTW there is an unlikely race in the qemu/gvproxy dance that I have been aware of but haven't gotten around to fixing. I'll try to push that up as a PR today in case it helps (I doubt it, but just in case).

Thanks, @n1hility !

vrothberg added a commit to vrothberg/libpod that referenced this issue Jul 27, 2023
Lock the VM on start.  If the machine is in the "starting" state we know
that a previous start has failed and guide the user into resolving the
issue.

Concurrent starts will busy wait and return the expected "already
running" error.

NOTE: this change is only looking at the start issue (containers#18662).  Other
commands such as stop and update should also lock and will be updated
in a future change.  I expect the underlying issue to apply to all
machine providers, not only QEMU.  It's desirable to aim for extending
the machine interface to also allow to `Lock()` and `Unlock()`.  After
acquiring the lock, the VM should automatically be reloaded/updated.

[NO NEW TESTS NEEDED]

Fixes: containers#18662
Signed-off-by: Valentin Rothberg <[email protected]>
@vrothberg
Member

Opened #19396

@github-actions bot added the locked - please file new issue/PR label on Oct 26, 2023
@github-actions bot locked as resolved and limited conversation to collaborators on Oct 26, 2023