podman machine start stuck when running the second time using applehv on macos silicon M2 #21160

Closed
johannesmarx opened this issue Jan 4, 2024 · 27 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. machine macos MacOS (OSX) related

Comments

@johannesmarx

johannesmarx commented Jan 4, 2024

Issue Description

I'm using podman with the applehv provider (due to issues with qemu, and following the suggestions in #20776) on an M2 Mac. Initializing the machine works correctly, and starting that initial machine also works, including running the quay.io/podman/hello container.

When I stop the machine and start it again, the VM gets stuck and cannot be used anymore. Every subsequent attempt to start it fails or gets stuck.

Thanks a lot. Happy to provide any additional information you might need.

Steps to reproduce the issue

  1. brew tap cfergeau/crc
  2. brew install vfkit
  3. brew install podman
  4. export CONTAINERS_MACHINE_PROVIDER=applehv (or add the entry provider="applehv" to the [machine] section of ~/.config/containers/containers.conf; either approach gives the same result)
  5. podman machine init
  6. podman machine start
  7. podman run quay.io/podman/hello (works ✅, using VM TYPE = applehv)
  8. podman machine stop
  9. podman machine start (gets stuck ❌)
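
The steps above can be condensed into a small script. This is a minimal sketch rather than something posted in the thread: the `PODMAN` variable and the `repro_cycle` function name are illustrative, and it assumes vfkit and podman are already installed (steps 1 to 3).

```shell
# Binary to invoke; overridable so the sequence can be dry-run
# (an assumption for illustration, not part of the original report).
PODMAN="${PODMAN:-podman}"

# Runs steps 4-9 of the report in order and stops at the first failing command.
# A hang on the final start will not return; it simply never completes.
repro_cycle() {
    export CONTAINERS_MACHINE_PROVIDER=applehv
    "$PODMAN" machine init  || return 1
    "$PODMAN" machine start || return 1
    "$PODMAN" run quay.io/podman/hello || return 1
    "$PODMAN" machine stop  || return 1
    # In the report, this second start is the one that gets stuck.
    "$PODMAN" machine start
}
```

Setting `PODMAN=echo` before calling `repro_cycle` prints the command sequence instead of running it, which is a cheap way to check the order.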

Describe the results you received

I'm only able to start and stop podman machine one single time.

Describe the results you expected

I should be able to start and stop podman machine (as often as I like).

podman info output

OS: darwin/arm64
provider: applehv
version: 4.8.3

macos: sonoma 14.2.1

Podman in a container

No

Privileged Or Rootless

Rootless

Upstream Latest Release

Yes

Additional environment details

none

Additional information

What I additionally noticed and managed to capture is that on the first startup, grub contains one entry, and on the second run it contains two entries (not sure, though, whether this is relevant).

First run of podman machine start

010-1st-run-grub-menu

Grub config for the entry

011-1st-run-ostree-0-config

Second run of podman machine start

020-2nd-run-grub-menu

Grub config for the entry ostree:0

021-2nd-run-ostree-0-config

Grub config for the entry ostree:1

022-2nd-run-ostree-1-config

@johannesmarx johannesmarx added the kind/bug Categorizes issue or PR as related to a bug. label Jan 4, 2024
@d34dh0r53

d34dh0r53 commented Jan 4, 2024

I can confirm that this is happening to me as well on an M1 Max.

OS: darwin/arm64
provider: applehv
version: 4.8.2

macos: sonoma 14.2.1

@Luap99 Luap99 added macos MacOS (OSX) related machine labels Jan 4, 2024
@rhatdan
Member

rhatdan commented Jan 4, 2024

This is being looked into; we think there is an issue in fcos.

@rhatdan
Member

rhatdan commented Jan 4, 2024

@baude PTAL

@johannesmarx
Author

I added some additional information above regarding the grub menu entries.

During the first successful run it contains one entry, whereas for the second run it contains two entries.

@cgwalters
Contributor

Brent and I dug into this, and our conclusion is that the ultimate cause is filesystem corruption, which doesn't always happen immediately but is pretty reliably triggered by the remove-moby systemd unit (which happens to cause substantial I/O).

@cgwalters
Contributor

Yeah, from the kernel logs:

[    1.887746] XFS (vda4): Metadata CRC error detected at xfs_dir3_block_read_verify+0xd4/0x100 [xfs], xfs_dir3_block block 0x115370 
[    1.887867] XFS (vda4): Unmount and run xfs_repair
[    1.887885] XFS (vda4): First 128 bytes of corrupted metadata buffer:
[    1.887913] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[    1.887941] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[    1.887967] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[    1.887994] 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[    1.888019] 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[    1.888047] 00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[    1.888075] 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[    1.888101] 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[    1.888127] XFS (vda4): metadata I/O error in "xfs_da_read_buf+0x100/0x150 [xfs]" at daddr 0x115370 len 8 error 74

It's notable that the entire content is zeroed here. But, this needs more debugging.
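
To pick lines like these out of a longer kernel log, a filter along the following lines may help. It is a sketch only: the function name `xfs_corruption_lines` is made up for illustration, and the patterns are taken from the three message types in the excerpt above, so other XFS error messages would not match.

```shell
# Reads kernel log text on stdin and prints only the XFS lines that report
# a metadata CRC error, a repair hint, or a metadata I/O error.
xfs_corruption_lines() {
    grep -E 'XFS \([^)]+\): (Metadata CRC error|Unmount and run xfs_repair|metadata I/O error)'
}
```

For example, `xfs_corruption_lines < console.log` would scan a saved guest console log, where `console.log` is a hypothetical file name.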

@cgwalters
Contributor

This fixes it for me crc-org/vfkit#76

@johannesmarx
Author

> This fixes it for me crc-org/vfkit#76

Thanks a lot.
So I assume we need to be patient until your PR is approved and merged, and then install the new version of vfkit.

cgwalters added a commit to cgwalters/podman that referenced this issue Jan 9, 2024
This depends on crc-org/vfkit#78
and is an alternative to crc-org/vfkit#76
that I like better for fixing
containers#21160

It looks like at least UTM switched to NVMe for Linux guests by default
for example.

[NO NEW TESTS NEEDED]

Signed-off-by: Colin Walters <[email protected]>
@anjannath
Member

@johannesmarx the patch has been backported to vfkit on brew, see #21092 (comment). You can update vfkit and try it.

@johannesmarx
Author

@anjannath Thanks a lot, I'll do that and post an update here.

@gbraad
Member

gbraad commented Jan 17, 2024

You also might need the fix that is part of: containers/gvisor-tap-vsock#309 as that resolves a race condition with the user-mode network stack.

@johannesmarx
Author

@anjannath I did brew uninstall vfkit and brew install vfkit, but the result was still the same: podman machine start still gets stuck on the second run, as described in my issue description above.

@gbraad how would I get this fix for my setup?

@anjannath
Member

@johannesmarx did you delete your existing podman machine and start a new one? Since the previous one was already stuck, you'd need to start anew: remove the existing machine with podman machine stop and podman machine rm, then create a new one with podman machine init and podman machine start.
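
The four commands named above could be wrapped as follows. This is a hedged sketch: the `recreate_machine` name and the `PODMAN` override are illustrative, and the `--force` flag on `podman machine rm` (which skips the confirmation prompt) is an addition not mentioned in the comment.

```shell
# Binary to invoke; overridable for a dry run (illustrative assumption).
PODMAN="${PODMAN:-podman}"

# Discards the current default machine and creates a fresh one.
recreate_machine() {
    # A stuck machine may not stop cleanly; continue to removal either way.
    "$PODMAN" machine stop || true
    # --force removes the machine without asking for confirmation.
    "$PODMAN" machine rm --force || return 1
    "$PODMAN" machine init || return 1
    "$PODMAN" machine start
}
```

Note that removing the machine also discards any images and containers stored inside it.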

@johannesmarx
Author

johannesmarx commented Jan 18, 2024

@anjannath Yes, I stopped, removed, and re-created the machine as you described.

@cfergeau
Contributor

Can you check the version of vfkit that you have installed? This must be 0.5.0_2 or 0.5.1
You also need the latest release of gvisor-tap-vsock https://github.com/containers/gvisor-tap-vsock/releases/tag/v0.7.2 as this fixes a different bug related to podman machine startup
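
One way to check the installed vfkit version against the two values named above is sketched below. The helper names are hypothetical, the check accepts only the exact versions 0.5.0_2 and 0.5.1 (later releases are deliberately not guessed at), and it assumes `brew list --versions vfkit` prints a single line of the form `vfkit 0.5.1`.

```shell
# Succeeds only for the two vfkit versions named in the comment above.
vfkit_version_ok() {
    case "$1" in
        0.5.0_2|0.5.1) return 0 ;;
        *) return 1 ;;
    esac
}

# Extracts the version from "name version" output read on stdin.
brew_version_field() {
    awk '{ print $2 }'
}
```

Usage might look like `vfkit_version_ok "$(brew list --versions vfkit | brew_version_field)" && echo "vfkit has the fix"`.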

@johannesmarx
Author

Hi @cfergeau

I upgraded vfkit and podman using brew upgrade:

==> Upgrading cfergeau/crc/vfkit
  0.5.0_2 -> 0.5.1

==> Upgrading podman
  4.8.3_1 -> 4.9.0

Now my initial use-case works.
But, if I reboot, podman machine start stops working again with the following error:

podman machine start --log-level=trace

INFO[0000] podman filtering at log level trace
DEBU[0000] Using Podman machine with `applehv` virtualization provider
DEBU[0000] connection refused: http://localhost:8081/vm/state
Starting machine "podman-machine-default"
DEBU[0000] connection refused: http://localhost:8081/vm/state
DEBU[0000] gvproxy binary being used: /opt/homebrew/Cellar/podman/4.9.0/libexec/podman/gvproxy
DEBU[0000] [-debug -mtu 1500 -ssh-port 49841 -listen-vfkit unixgram:///var/folders/br/19j4cvbs3bn01r_zrkcsdhvw0000gq/T/podman/gvproxy.sock -forward-sock /Users/<USER>/.local/share/containers/podman/machine/applehv/podman.sock -forward-dest /run/user/503/podman/podman.sock -forward-user core -forward-identity /Users/<USER>/.ssh/podman-machine-default -pid-file /var/folders/br/19j4cvbs3bn01r_zrkcsdhvw0000gq/T/podman/gvproxy.pid]
DEBU[0000] gvproxy unixgram socket "/var/folders/br/19j4cvbs3bn01r_zrkcsdhvw0000gq/T/podman/gvproxy.sock" not found: stat /var/folders/br/19j4cvbs3bn01r_zrkcsdhvw0000gq/T/podman/gvproxy.sock: no such file or directory
Error: gvproxy exited unexpectedly with exit code 1
DEBU[0000] Shutting down engines

As for the gvisor-tap-vsock release you mentioned: would I also need it to fix the issue above? How would I install it?

Thanks a lot
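
The trace output above names the gvproxy binary that podman launches, which is useful to know before swapping it out. A small filter can pull that path out of the log; this is only a sketch, and the function name `gvproxy_path_from_log` is made up for illustration.

```shell
# Reads `podman machine start --log-level=trace` output on stdin and prints
# the gvproxy path from the "gvproxy binary being used:" line.
gvproxy_path_from_log() {
    sed -n 's/^.*gvproxy binary being used: //p'
}
```

For example: `podman machine start --log-level=trace 2>&1 | gvproxy_path_from_log`.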

@cfergeau
Contributor

> DEBU[0000] gvproxy binary being used: /opt/homebrew/Cellar/podman/4.9.0/libexec/podman/gvproxy

Grab https://github.com/containers/gvisor-tap-vsock/releases/download/v0.7.2/gvproxy-darwin and replace the binary at /opt/homebrew/Cellar/podman/4.9.0/libexec/podman/gvproxy with it (make sure to name it gvproxy, don't forget to set the permissions, and you may need to remove the quarantine xattr).
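
The replacement described above could be scripted roughly as follows. This is a sketch under assumptions: the `replace_gvproxy` name is hypothetical, keeping a `.orig` backup is an addition not mentioned in the comment, permission mode 755 is a guess at "set the permissions", and `com.apple.quarantine` is the attribute macOS normally attaches to downloaded files.

```shell
# Replaces the gvproxy binary at $2 with the downloaded file at $1.
replace_gvproxy() {
    src="$1"
    dest="$2"
    # Keep one copy of the original binary so the change can be undone.
    if [ -e "$dest" ] && [ ! -e "$dest.orig" ]; then
        cp -p "$dest" "$dest.orig" || return 1
    fi
    cp "$src" "$dest" || return 1
    chmod 755 "$dest" || return 1
    # Drop the quarantine attribute if present; failures are ignored because
    # the attribute may be absent or xattr may be unavailable.
    if command -v xattr >/dev/null 2>&1; then
        xattr -d com.apple.quarantine "$dest" 2>/dev/null || true
    fi
}
```

After downloading with `curl -L -o gvproxy-darwin https://github.com/containers/gvisor-tap-vsock/releases/download/v0.7.2/gvproxy-darwin`, the call would be `replace_gvproxy gvproxy-darwin /opt/homebrew/Cellar/podman/4.9.0/libexec/podman/gvproxy`. A later brew upgrade of podman would presumably install its own gvproxy under a new versioned path.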

@johannesmarx
Author

johannesmarx commented Jan 29, 2024

@cfergeau thanks a lot. When will gvproxy v0.7.2 be part of the official podman release, so that I don't need to handle the quarantine issue? (I still haven't tested v0.7.2, though.)

@rhatdan
Member

rhatdan commented Jan 29, 2024

Next release of Podman will be 5.0 some time in February.

@fabricepipart1a

I upgraded to podman desktop 1.7.0 and podman 4.9.0 and it now seems to work:

╰─ podman version
Client:       Podman Engine
Version:      4.9.0
API Version:  4.9.0
Go Version:   go1.21.6
Git Commit:   f7c7b0a7e437b6d4849a9fb48e0e779c3100e337
Built:        Tue Jan 23 02:43:59 2024
OS/Arch:      darwin/arm64

Server:       Podman Engine
Version:      4.8.3
API Version:  4.8.3
Go Version:   go1.21.5
Built:        Wed Jan  3 15:10:40 2024

╰─ podman machine info
Host:
  Arch: arm64
  CurrentMachine: podman-machine-default
  DefaultMachine: ""
  EventsDir: /var/folder...podman-run--1/podman
  MachineConfigDir: /Users/xxx/.config/containers/podman/machine/applehv
  MachineImageDir: /Users/xxx/.local/share/containers/podman/machine/applehv
  MachineState: Running
  NumberOfMachines: 1
  OS: darwin
  VMType: applehv
Version:
  APIVersion: 4.9.0
  Built: 1705974239
  BuiltTime: Tue Jan 23 02:43:59 2024
  GitCommit: f7c7b0a7e437b6d4849a9fb48e0e779c3100e337
  GoVersion: go1.21.6
  Os: darwin
  OsArch: darwin/arm64
  Version: 4.9.0

I just created a machine with applehv, stopped it, and restarted it. Was I just lucky?

@johannesmarx
Author

@fabricepipart1a for me it's still not 100% working. It stopped working for me after rebooting.

@fabricepipart1a

Might be a stupid question but ... did you edit the containers.conf file (which I did) or did you set the env var to switch to applehv?

@johannesmarx
Author

> Might be a stupid question but ... did you edit the containers.conf file (which I did) or did you set the env var to switch to applehv?

Hi @fabricepipart1a, as described in step 4 of the reproduction steps, I ended up using ~/.config/containers/containers.conf to set provider="applehv". That also ensures the setting persists across reboots.
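
Writing that setting from a script could look like the sketch below. The `set_machine_provider` name is hypothetical, and the function deliberately refuses to modify a file that already contains a `[machine]` section, since appending a second one would produce invalid TOML.

```shell
# Appends a [machine] section selecting provider $2 to the config file $1.
# Assumes an existing file, if any, ends with a newline.
set_machine_provider() {
    conf="$1"
    provider="$2"
    if [ -f "$conf" ] && grep -q '^\[machine\]' "$conf"; then
        echo "[machine] section already present in $conf; edit it by hand" >&2
        return 1
    fi
    mkdir -p "$(dirname "$conf")" || return 1
    printf '[machine]\nprovider="%s"\n' "$provider" >> "$conf"
}
```

For the setup described here, the call would be `set_machine_provider ~/.config/containers/containers.conf applehv`.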

@fabricepipart1a

fabricepipart1a commented Feb 13, 2024

Then, there must be a difference between our setups.
A few hints regarding my setup:

  • crc and vfkit are updated to latest
  • I uninstalled from brew to reinstall via the pkg in order to benefit from smooth updates (counterintuitive, I know)
  • I upgraded to 4.9.0
  • I deleted and recreated my podman machine. Mine is rootful with user network mode: podman machine init --cpus 8 --memory 12000 --rootful --user-mode-networking --now

@johannesmarx
Author

johannesmarx commented Feb 13, 2024

> Then, there must be a difference between our setups. A few hints regarding my setup:
>
>   • crc and vfkit are updated to latest
>   • I uninstalled from brew to reinstall via the pkg in order to benefit from smooth updates (counterintuitive, I know)
>   • I upgraded to 4.9.0
>   • I deleted and recreated my podman machine. Mine is rootful with user network mode: podman machine init --cpus 8 --memory 12000 --rootful --user-mode-networking --now

I'm on the latest versions available via brew for podman & vfkit (not using Podman Desktop like you did as far as I see):

vfkit version: 0.5.1
podman version 4.9.2

I tried it with the same machine settings you used (except that I used rootless instead of rootful), but it stopped working after a reboot. So nothing has changed for me; I'm waiting for the podman 5.0.0 release.

@Luap99
Member

Luap99 commented Apr 4, 2024

I assume this works with 5.0

@Luap99 Luap99 closed this as completed Apr 4, 2024
@johannesmarx
Author

Yes, it's working now 👍
Thanks a lot to all involved and sorry for my late feedback.

@stale-locking-app stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Jul 5, 2024
@stale-locking-app stale-locking-app bot locked as resolved and limited conversation to collaborators Jul 5, 2024