[0.2.3-rpm] libxenlight failed to create new domain sd-log #498

Closed
eloquence opened this issue Mar 19, 2020 · 72 comments · Fixed by #522

Comments

@eloquence
Member

In my prod install, I have now seen this error message (displayed as a sticky Qubes notification) twice towards the end of updates. sd-log does end up running just fine.

/var/log/qubes/qubes.log shows this output:

2020-03-18 18:00:23,561 Setting Qubes DB info for the VM
2020-03-18 18:00:23,563 Starting Qubes DB
2020-03-18 18:00:23,633 Activating the sys-whonix VM
2020-03-18 18:00:28,747 Starting sd-proxy
2020-03-18 18:00:28,754 Starting sd-whonix
2020-03-18 18:00:32,529 Setting Qubes DB info for the VM
2020-03-18 18:00:32,529 Starting Qubes DB
2020-03-18 18:00:32,595 Activating the sd-whonix VM
2020-03-18 18:00:46,144 Setting Qubes DB info for the VM
2020-03-18 18:00:46,145 Starting Qubes DB
2020-03-18 18:00:46,179 Starting sd-log
2020-03-18 18:00:46,216 Activating the sd-proxy VM
2020-03-18 18:00:49,778 Start failed: internal error: libxenlight failed to create new domain 'sd-log'
2020-03-18 18:00:54,378 Starting sd-log
2020-03-18 18:00:58,560 Setting Qubes DB info for the VM
2020-03-18 18:00:58,560 Starting Qubes DB
2020-03-18 18:00:58,640 Activating the sd-log VM
2020-03-18 18:01:18,881 Starting sd-app
2020-03-18 18:01:26,525 Setting Qubes DB info for the VM
2020-03-18 18:01:26,526 Starting Qubes DB
2020-03-18 18:01:26,687 Activating the sd-app VM
2020-03-18 18:01:55,294 Starting sd-gpg
2020-03-18 18:02:01,086 Setting Qubes DB info for the VM
2020-03-18 18:02:01,088 Starting Qubes DB
2020-03-18 18:02:01,378 Activating the sd-gpg VM
@eloquence
Member Author

See #495 for another sd-log error during a prod install reported by Kev last week, which may or may not be related.

@eloquence
Member Author

eloquence commented Mar 23, 2020

I'm seeing this again during an update run today. I'm blowing away this prod install momentarily and will report if I see it again on reinstall.

@eloquence
Member Author

I'm seeing this on a brand new install from scratch, but this time it is sd-proxy that failed to start; again, during the update process. @ninavizz also saw it earlier today. Note that all of those laptops only have 16 GB of RAM.

@rocodes
Contributor

rocodes commented Mar 25, 2020

Yup, repro'd. I installed prod yesterday, applied updates today, and saw the error when I attempted to shut down sd-related VMs.

Edit: see QubesOS/qubes-issues#3125
Edit 2: my logs:

libxl-driver.log.20200325.txt

@eloquence
Member Author

@rocodes How much memory does your machine have?

@emkll
Contributor

emkll commented Mar 25, 2020

@conorsch has provided a script to monitor the health of the qmemman service in https://gist.github.com/conorsch/bb8b573a6a7a98af70db2a20b4866122

I ran this script (watch ./check-qmemman.sh) as I was updating my workstation. At the reboot step, I observed that the qmemman service failed at the same moment the sd-log VM failed to start (the error described in this ticket).
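For reference, here is a minimal Python sketch of the kind of dom0 check described above. It is an assumption based on this thread (treating a trailing "EOF" journal entry for qubes-qmemman as the suspect state), not the actual contents of the gist.

#!/usr/bin/env python3
# Minimal sketch (not the actual check-qmemman.sh gist): report whether the
# most recent journal entry for qubes-qmemman is the "EOF" line discussed in
# this thread. Intended to be run in dom0, e.g. under watch(1).
import subprocess

def last_qmemman_line():
    out = subprocess.check_output(
        ["journalctl", "-u", "qubes-qmemman", "-n", "1", "--no-pager", "-o", "cat"],
        text=True,
    )
    return out.strip()

if __name__ == "__main__":
    line = last_qmemman_line()
    state = "SUSPECT (EOF)" if line.endswith("EOF") else "OK"
    print(f"qubes-qmemman last journal line: {line!r} -> {state}")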

@eloquence
Member Author

eloquence commented Mar 26, 2020

Same here, but with sd-proxy. It always tends to happen at the 95% stage for me. Here's a smoking-gun screenshot. The service does not appear to recover on its own after this happens.
[screenshot: qmemman service failure]

@eloquence
Member Author

Last qmemman.log lines before this happened: https://gist.github.com/eloquence/043db1727d4c7cca215eb587abba64c2

@eloquence
Member Author

(For the record, we have reports of this happening on systems with 16 and 32 GB of RAM, the latter of which should be more than sufficient.)

@marmarek

The "EOF" message from qmemman isn't any kind of failure. It is a normal(*) message during VM startup, whether the startup fails or not. It means qmemman resumes balancing normally after allocating memory for the new VM.

(*) "normal" as in "expected", but indeed very cryptic.

@eloquence
Member Author

@marmarek The state change Conor's script checks for (EOF being the most recent journal entry) is reached exactly when I see the VM startup error, and after that qmemman does not seem to leave that state until it is restarted. I'll test it again today to see whether that behavior is consistently reproducible.

I do not see any suspicious lines in libxl-driver.log; indeed, I don't see any events in that log at all that directly correlate with the startup failure, but I don't know whether the timestamps between the logs are 100% consistent. The failure occurred around 00:00, and these are the log lines before and after. To be sure, I don't see anything like the "Stubdom" log entry or "startup timed out" anywhere in the logs, across all of /var/log.

2020-03-25 23:02:05.140+0000: libxl: libxl_pci.c:1199:libxl__device_pci_reset: The kernel doesn't support reset from sysfs for PCI device 0000:00:14.0
2020-03-25 23:02:06.694+0000: libxl: libxl_device.c:1093:device_backend_callback: unable to remove device with path /local/domain/11/backend/vif/15/0
2020-03-25 23:02:06.698+0000: libxl: libxl.c:1669:devices_destroy_cb: libxl__devices_destroy failed for 15
2020-03-25 23:02:06.808+0000: libxl: libxl_device.c:1093:device_backend_callback: unable to remove device with path /local/domain/11/backend/vif/14/0
2020-03-25 23:02:06.862+0000: libxl: libxl.c:1669:devices_destroy_cb: libxl__devices_destroy failed for 14
2020-03-25 23:02:35.482+0000: libxl: libxl_linux.c:155:libxl__loopdev_cleanup: unable to release device /dev/loop0: No such device or address
2020-03-25 23:02:35.542+0000: libxl: libxl_linux.c:155:libxl__loopdev_cleanup: unable to release device /dev/loop0: No such device or address
2020-03-26 00:41:04.117+0000: libxl: libxl_pci.c:1199:libxl__device_pci_reset: The kernel doesn't support reset from sysfs for PCI device 0000:00:14.0
2020-03-26 03:41:35.288+0000: libxl: libxl.c:1853:libxl_console_get_tty: unable to read console tty path `/local/domain/21/console/tty': Resource temporarily unavailable
2020-03-26 03:47:06.648+0000: libxl: libxl.c:1853:libxl_console_get_tty: unable to read console tty path `/local/domain/33/console/tty': Resource temporarily unavailable

@marmarek

The "libxenlight failed to create new domain" error always has related details in libxl-driver.log. But there are also other ways the VM can fail to start, like "Cannot connect to '...' qrexec agent for 60 seconds".

@eloquence
Member Author

The screenshot above illustrates what happened -- there was no "Cannot connect to" error displayed on the screen, but I also do not see any relevant log entries in the libxl-driver.log; see excerpt above. Note that our updater performs many VM shutdown/startup operations, and we see this consistently happening towards the end of the process.

@eloquence
Member Author

(To be sure, I'll also tail libxl-driver.log next time I run the updater and would advise anyone following along to do the same.)

@marmarek

marmarek commented Mar 26, 2020

qmemman ceasing to work sounds like QubesOS/qubes-issues#4890 (I'll backport the fix to R4.0 in a moment - EDIT: or not, it seems to be a different issue in R4.0), but that doesn't result in a "libxenlight failed to create new domain" error.

@eloquence
Member Author

This screenshot shows the system state with libxl-driver.log tailed.
[screenshot: system state with libxl-driver.log tailed]

The device-related errors happen while the updater restarts system VMs. The "device model did not start" error happens around the time the libxenlight error is reported to the user.

Regarding qmemman, I do see new lines getting added to the log if I start/stop VMs after reaching this state; they just immediately transition to EOF, so Conor's script reports failure. @conorsch Is it possible qmemman is operating normally and we're just seeing the state change because this happens around the end of the updater's start/shutdown sequence - i.e. correlation, not causation?

@marmarek

Is it always the same VM?

Also, copying a comment from QubesOS/qubes-issues#3125 (comment):

2020-03-25 14:05:19.582+0000: libxl: libxl_dm.c:2098:stubdom_xswait_cb: Stubdom 57 for 56 startup: startup timed out

This seems to be the issue. Can you check the console log of the corresponding stubdom (/var/log/xen/console/guest-<VMNAME>-dm.log)?

@eloquence
Member Author

eloquence commented Mar 26, 2020

I also saw it for another VM in the past (also at that same stage of the update process), but most recently it's always been sd-proxy for me.

I don't see anything like that error in the guest-sd-proxy-dm.log. That log is also not timestamped for me. The most recent lines are:

Locked DMA mapping while invalidating mapcache! 0000000000000341 -> 0x78c7c36e4000 is present
Issued domain 12 poweroff

The "Locked DMA" line is repeated a gazillion times before that.

@eloquence
Member Author

Just did another run and did not see new lines being appended to that file when the error occurred.

But I did see a repeat of this in libxl-driver.log:

2020-03-26 20:58:38.011+0000: libxl: libxl_dm.c:2067:stubdom_pvqemu_cb: error connecting nics devices: No such file or directory
2020-03-26 20:58:38.011+0000: libxl: libxl_create.c:1555:domcreate_devmodel_started: device model did not start: -6

@eloquence
Member Author

Cross-referencing the potentially related #486, which reboots system VMs in case of fedora-30 updates to reduce the need for a full system reboot.

@marmarek

2020-03-26 20:58:38.011+0000: libxl: libxl_dm.c:2067:stubdom_pvqemu_cb: error connecting nics devices: No such file or directory

I have an idea!
sd-proxy's netvm is sd-whonix, right? I guess sd-whonix is updated just before sd-proxy. And if sd-whonix wasn't running initially, it is started for the update, and after the update it is shut down (asynchronously). It may happen that this shutdown occurs exactly when sd-proxy is starting. And the sd-proxy start fails, because its network provider is gone.
Let me enumerate the hypothetical scenario:

  1. qubesctl --all state.highstate is called.
  2. sd-whonix is started for the update.
  3. The update is performed.
  4. sd-whonix shutdown is initiated (it takes up to a minute).
  5. sd-proxy is started for the update:
  • one of the initial steps is checking whether its netvm is running - it is still there, for a few more seconds
  • prepare the VM to start (request memory from qmemman, etc.)
  • in the meantime, sd-whonix finishes its shutdown process
  • try to start the VM - it now fails because its netvm is gone

This can easily be verified by avoiding the automatic shutdown. It is performed only if the VM wasn't initially running, so if you start sd-whonix before the update, the issue should be gone (if I guessed correctly, of course).
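To illustrate the verification step, here is a minimal dom0 sketch using the qubesadmin API (VM names are the ones discussed in this thread; this is not the project's actual updater code): start the netvm explicitly before the update so the asynchronous shutdown never kicks in, then start the dependent VM.

# Sketch only: pre-start the netvm so it is not subject to the automatic
# post-update shutdown, then start the dependent VM.
from qubesadmin import Qubes

app = Qubes()
proxy = app.domains["sd-proxy"]
netvm = proxy.netvm  # expected to be sd-whonix per the hypothesis above

if netvm is not None and not netvm.is_running():
    netvm.start()  # started explicitly, so it is not shut down automatically
proxy.start()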

@conorsch
Contributor

@marmarek That's quite a lead! Thank you for chiming in; we'll investigate further and try to confirm that hypothesis.

Right now we're a bit aggressive about force-applying updates, which is a temporary measure for the purposes of our limited pilot. There are likely a few ways we could soften that logic and thereby reduce edge cases like the race condition you describe.

@eloquence
Member Author

eloquence commented Mar 27, 2020

I poked at this a bit more, and I do think some or all of this issue was introduced with the new reboot behavior in #486, which also correlates reasonably well with when we started observing it.

The following STR work for me:

  1. sudo dnf downgrade zlib in fedora-30, then poweroff in fedora-30
  2. Run the updater until the end
  3. See the sd-proxy libxenlight error, with these log lines in /var/log/libvirt/libxl/libxl-driver.log:
2020-03-27 06:32:36.483+0000: libxl: libxl_create.c:1512:domcreate_attach_devices: unable to add nic devices
2020-03-27 06:32:36.515+0000: libxl: libxl_device.c:977:libxl__initiate_device_generic_remove: backend /local/domain/138/backend/vif/140/0 already removed, cleanup frontend only

Then, if I patch this line in Updater.py (I replaced the list with []) and follow the same STR, the update completes successfully, and I see neither the error nor the above entries in the log.

Note for anyone testing this that the updater will reinstate the unpatched file via Salt as part of the update, so expect subsequent runs to fail again.

@eloquence
Member Author

eloquence commented Mar 27, 2020

^ I'll try these exact STR again now to see whether they consistently reproduce both the error and the resolution on my system. I'll first run a full update to ensure there's no variance between runs.

@rocodes
Contributor

rocodes commented Mar 27, 2020

I think I am late to the party (and it sounds like there are already hypotheses), but I've repro'd this again with sd-proxy after applying updates today, so just in case it is still helpful, I'm attaching the logs from /var/log/xen/console/guest-sd-proxy-dm.log and /var/log/xen/console/guest-sd-proxy.log.
guest-sd-proxy-dm.log
guest-sd-proxy.log

@eloquence
Member Author

And per a suggestion from @conorsch, here's a version that uses qubesadmin instead of qvm-check; it otherwise appears to behave identically:
https://gist.github.com/eloquence/3e5e02509bc96efde4ef447d1b364f87
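(For anyone comparing the two approaches, the core difference is roughly the following sketch; it is not the gist's exact code.)

# Sketch: two ways to ask dom0 whether a VM is running.
import subprocess
from qubesadmin import Qubes

# CLI variant: qvm-check exits 0 when the VM is running.
running_cli = subprocess.call(["qvm-check", "--running", "--quiet", "sd-whonix"]) == 0

# API variant: the same question via the qubesadmin Python bindings.
running_api = Qubes().domains["sd-whonix"].is_running()

print("qvm-check:", running_cli, "qubesadmin:", running_api)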

@eloquence
Member Author

True to its habit of raising its head again just when we think we may have squashed it, I just saw the sd-proxy error with the changes in #534 in place. Fortunately, we have a lot more logging now -- here's what the logs say. In short, Qubes says all the required VMs are running, both through qvm-ls and the Qubes API calls added in #534. sd-proxy requires sd-whonix, sys-net and sys-firewall, all of which claim to be up.

2020-04-16 16:24:12,585 - sdw_updater_gui.Updater:387(apply_dom0_state) INFO: Dom0 state applied
2020-04-16 16:24:12,585 - sdw_updater_gui.Updater:419(shutdown_and_start_vms) INFO: Shutting down SDW TemplateVMs for updates
2020-04-16 16:24:13,632 - sdw_updater_gui.Updater:423(shutdown_and_start_vms) INFO: Shutting down SDW AppVMs for updates
2020-04-16 16:24:35,233 - sdw_updater_gui.Updater:431(shutdown_and_start_vms) INFO: Safely shutting down system VM: sys-usb
2020-04-16 16:24:40,264 - sdw_updater_gui.Updater:431(shutdown_and_start_vms) INFO: Safely shutting down system VM: sys-whonix
2020-04-16 16:24:44,202 - sdw_updater_gui.Updater:436(shutdown_and_start_vms) INFO: Forcing shutdown of system VM: sys-firewall
2020-04-16 16:24:54,447 - sdw_updater_gui.Updater:517(_wait_for) INFO: VM 'sys-firewall' entered expected state after 0.01 seconds of polling.
2020-04-16 16:24:54,448 - sdw_updater_gui.Updater:436(shutdown_and_start_vms) INFO: Forcing shutdown of system VM: sys-net
2020-04-16 16:25:04,754 - sdw_updater_gui.Updater:517(_wait_for) INFO: VM 'sys-net' entered expected state after 0.00 seconds of polling.
2020-04-16 16:25:04,754 - sdw_updater_gui.Updater:440(shutdown_and_start_vms) INFO: Starting fedora-based system VMs after updates
2020-04-16 16:25:04,949 - sdw_updater_gui.Updater:535(_safely_start_vm) INFO: VMs running before start of sys-net: b'dom0\nsd-devices\nwork\n'
2020-04-16 16:25:17,506 - sdw_updater_gui.Updater:535(_safely_start_vm) INFO: VMs running before start of sys-firewall: b'dom0\nsd-devices\nsys-net\nwork\n'
2020-04-16 16:25:26,292 - sdw_updater_gui.Updater:535(_safely_start_vm) INFO: VMs running before start of sys-whonix: b'dom0\nsd-devices\nsys-firewall\nsys-net\nwork\n'
2020-04-16 16:25:33,968 - sdw_updater_gui.Updater:535(_safely_start_vm) INFO: VMs running before start of sys-usb: b'dom0\nsd-devices\nsys-firewall\nsys-net\nsys-whonix\nwork\n'
2020-04-16 16:25:48,409 - sdw_updater_gui.Updater:444(shutdown_and_start_vms) INFO: Starting SDW VMs after updates
2020-04-16 16:25:49,242 - sdw_updater_gui.Updater:535(_safely_start_vm) INFO: VMs running before start of sd-log: b'dom0\nsd-devices\nsys-firewall\nsys-net\nsys-usb\nsys-whonix\nwork\n'
2020-04-16 16:26:13,813 - sdw_updater_gui.Updater:535(_safely_start_vm) INFO: VMs running before start of sd-gpg: b'dom0\nsd-devices\nsd-log\nsys-firewall\nsys-net\nsys-usb\nsys-whonix\nwork\n'
2020-04-16 16:26:39,755 - sdw_updater_gui.Updater:535(_safely_start_vm) INFO: VMs running before start of sd-whonix: b'dom0\nsd-devices\nsd-gpg\nsd-log\nsys-firewall\nsys-net\nsys-usb\nsys-whonix\nwork\n'
2020-04-16 16:26:49,150 - sdw_updater_gui.Updater:535(_safely_start_vm) INFO: VMs running before start of sd-proxy: b'dom0\nsd-devices\nsd-gpg\nsd-log\nsd-whonix\nsys-firewall\nsys-net\nsys-usb\nsys-whonix\nwork\n'
2020-04-16 16:26:56,649 - sdw_updater_gui.Updater:540(_safely_start_vm) ERROR: Error while starting sd-proxy
2020-04-16 16:26:56,649 - sdw_updater_gui.Updater:541(_safely_start_vm) ERROR: Command '['qvm-start', '--skip-if-running', 'sd-proxy']' returned non-zero exit status 1
2020-04-16 16:26:56,649 - sdw_updater_gui.Updater:542(_safely_start_vm) ERROR: b"Start failed: internal error: libxenlight failed to create new domain 'sd-proxy', see /var/log/libvirt/libxl/libxl-driver.log for details\n"
2020-04-16 16:26:56,811 - sdw_updater_gui.Updater:535(_safely_start_vm) INFO: VMs running before start of sd-app: b'dom0\nsd-devices\nsd-gpg\nsd-log\nsys-firewall\nsys-net\nsys-usb\nsys-whonix\nwork\n'

libxl-driver.log

2020-04-16 23:24:37.240+0000: libxl: libxl_pci.c:1199:libxl__device_pci_reset: The kernel doesn't support reset from sysfs for PCI device 0000:00:14.0
2020-04-16 23:24:55.241+0000: libxl: libxl_device.c:977:libxl__initiate_device_generic_remove: backend /local/domain/7/backend/vif/18/0 already removed, cleanup frontend only
2020-04-16 23:25:38.657+0000: libxl: libxl_pci.c:1199:libxl__device_pci_reset: The kernel doesn't support reset from sysfs for PCI device 0000:00:14.0
2020-04-16 23:26:52.771+0000: libxl: libxl_device.c:1093:device_backend_callback: unable to add device with path /local/domain/29/backend/vif/31/0
2020-04-16 23:26:52.771+0000: libxl: libxl_dm.c:2067:stubdom_pvqemu_cb: error connecting nics devices: No such file or directory
2020-04-16 23:26:52.772+0000: libxl: libxl_create.c:1555:domcreate_devmodel_started: device model did not start: -6
2020-04-16 23:26:52.797+0000: libxl: libxl_device.c:977:libxl__initiate_device_generic_remove: backend /local/domain/29/backend/vif/31/0 already removed, cleanup frontend only

And as usual, qmemman has fallen over.
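(For context, the "VMs running before start of ..." lines above come from a qvm-ls listing captured just before each qvm-start. The following is a hypothetical reconstruction of that logging pattern, not the actual Updater.py code.)

# Hypothetical reconstruction of the logging pattern visible above; the real
# _safely_start_vm() in Updater.py may differ in its details.
import logging
import subprocess

logger = logging.getLogger("sdw_updater_gui.Updater")

def safely_start_vm(vm_name):
    running = subprocess.check_output(["qvm-ls", "--running", "--raw-list"])
    logger.info("VMs running before start of %s: %s", vm_name, running)
    try:
        subprocess.check_output(
            ["qvm-start", "--skip-if-running", vm_name], stderr=subprocess.STDOUT
        )
    except subprocess.CalledProcessError as e:
        logger.error("Error while starting %s", vm_name)
        logger.error(str(e))
        logger.error(str(e.output))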

@eloquence
Member Author

Aha! A bit more insight into what's going on. Contrary to the logs above, sd-whonix was not running at the end of the update -- I ran a qvm-ls after the update, and it showed this:

sd-whonix                      Halted   AppVM       purple  whonix-gw-15                   sys-firewall
sys-net                        Running  AppVM       red     fedora-30                      -
sys-usb                        Running  AppVM       red     fedora-30                      -
sys-whonix                     Running  AppVM       black   whonix-gw-15                   sys-firewall

It started and then shut down without an error message to the user. This is similar to what we were seeing with sys-firewall above.

I'm not seeing the same error in the guest log for sd-whonix, but I am seeing that the filesystem is mounted read-only for some reason:

Waiting for /dev/xvda* devices...
Qubes: Doing R/W setup for TemplateVM...
[    1.524273]  xvdc: xvdc1 xvdc3
Setting up swapspace version 1, size = 1024 MiB (1073737728 bytes)
no label, UUID=ccf029b0-bc31-428a-aa53-bf805c0582ab
Qubes: done.
[    1.574886] EXT4-fs (xvda3): mounted filesystem with ordered data mode. Opts: (null)
Waiting for /dev/xvdd device...
mount: /dev/xvdd is write-protected, mounting read-only
[    1.665263] EXT4-fs (xvdd): mounting ext3 file system using the ext4 subsystem
[    1.669602] EXT4-fs (xvdd): mounted filesystem with ordered data mode. Opts: (null)
[    2.542879] EXT4-fs (xvda3): re-mounted. Opts: (null)
switch_root: failed to mount moving /dev to /sysroot/dev: Invalid argument
switch_root: forcing unmount of /dev
switch_root: failed to mount moving /proc to /sysroot/proc: Invalid argument
switch_root: forcing unmount of /proc
switch_root: failed to mount moving /sys to /sysroot/sys: Invalid argument
switch_root: forcing unmount of /sys
switch_root: failed to mount moving /run to /sysroot/run: Invalid argument
switch_root: forcing unmount of /run
[    2.653172] systemd[1]: systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid)
[    2.653496] systemd[1]: Detected virtualization xen.
[    2.653523] systemd[1]: Detected architecture x86-64.

@conorsch
Contributor

Well, those more verbose logs sure are paying off! The most interesting problem is

[sd-whonix] started and then shut down without an error message to the user.

which is pretty baffling to me. Even with the verbose logs, I can't point to an explanation there. Given that all these VMs have virt_mode=hvm, which is somewhat non-standard for Qubes AppVMs, maybe we should add additional blocking around the qvm-start calls: once qvm-start returns, we sleep for ~10 s or so. (That sounds like a lot, but some off-the-cuff testing of HVM start times seemed to indicate it was a reasonable wait time, although I'm failing to call up the relevant citation in this repo's issues right now.)

Perhaps the additional wait time after qvm-start wouldn't resolve the specific problem you're seeing, but the only arguments I see against adding the sleep are:

  1. don't sleep; block on some other state instead (clearly the "running" status can't be trusted here, so the real test of is-the-VM-up would be to use qubes.WaitForSession, or just vm.run("echo hello") and wait for that to return - see the sketch after this list)
  2. it'll add more time to the reboot process (worth it if it makes the reboot process more reliable! Plus, template consolidation will bring this number back down, and that's the next big push after we initially iron out the reboot logic)
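Here is a minimal sketch of option 1 above (block on qrexec instead of sleeping), wrapping the dom0 CLI tools; the 60-second timeout is an assumed value, not something we have measured.

# Sketch of option 1: after qvm-start returns, block until the VM's qrexec
# agent answers a trivial command, rather than sleeping for a fixed time.
import subprocess

def wait_for_qrexec(vm_name, timeout=60):
    # qvm-run --pass-io only returns once the command has run inside the VM,
    # which requires a working qrexec connection.
    subprocess.run(
        ["qvm-run", "--pass-io", vm_name, "echo hello"],
        check=True,
        timeout=timeout,
        stdout=subprocess.DEVNULL,
    )

subprocess.check_call(["qvm-start", "--skip-if-running", "sd-proxy"])
wait_for_qrexec("sd-proxy")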

@marmarek

Is the above console log the end of it? That's far from a full start. If there are more messages there, do you see any reason for the shutdown? Or is the log just cut short? If the latter, check the Xen log (xl dmesg, or /var/log/xen/console/hypervisor.log) - maybe the VM has crashed.

As for the read-only filesystem - it is normal; /dev/xvdd is the partition with kernel modules.
BTW, you said you use virt_mode=hvm here, and I assumed it was to use a custom kernel from within the VM (template). But the presence of /dev/xvdd says it uses a dom0-provided kernel. Am I missing something?

@conorsch
Contributor

As for the read-only filesystem - it is normal

Thanks for clarifying, I was just sifting through logs locally to understand that better!

BTW, you said you use virt_mode=hvm here, and I assumed it was to use a custom kernel from within the VM (template). But the presence of /dev/xvdd says it uses a dom0-provided kernel. Am I missing something?

No, you are correct there as well: sd-whonix is based on whonix-gw-15, so that's virt_mode=pvh, same as the Whonix defaults. It's the other sd-* VMs that are HVMs. So, the delayed start time definitely isn't relevant for the sd-whonix problem @eloquence just reported.

Thanks for the suggestions on more places to inspect to understand the sd-whonix start-then-stop behavior that the new logging clearly didn't illuminate.

@eloquence
Member Author

eloquence commented Apr 17, 2020

Here's my entire guest-sd-whonix.log:
https://gist.github.com/eloquence/201fcf4ca278ddbe286ae4b86b511d1a

The most recently logged run is for the successful VM start that was triggered when I attempted to log in to the SecureDrop Client. It's the previous run that failed (without an error to the user, but which may then have caused sd-proxy to fail with an error to the user). I am not noticing anything suspect like the /proc/xen issue we saw with sys-firewall.

@marmarek

One more thing that could be related in some weird way is the whonix.NewStatus and whonix.SdwdateStatus services. Those are used for the time-sync status icon. The thing is, sys-whonix polls connected VMs with whonix.SdwdateStatus, which may interfere with the restart sequence (namely, it may start a VM early that was just shut down or killed). I hit this issue a lot with automated tests (where cleanup includes killing and removing many VMs).
Some more details: https://phabricator.whonix.org/T930

@eloquence
Member Author

If the latter, check the Xen log (xl dmesg, or /var/log/xen/console/hypervisor.log) - maybe the VM has crashed.

I do see a crash in that log, though it's not timestamped:

(XEN) p2m_pod_demand_populate: Dom29 out of PoD memory! (tot=128016 ents=40416 dom29)
(XEN) domain_crash called from p2m-pod.c:1218
(XEN) Domain 29 (vcpu#1) crashed on cpu#4:
(XEN) ----[ Xen-4.8.5-14.fc25  x86_64  debug=n   Tainted: M    ]----
(XEN) CPU:    4
(XEN) RIP:    0033:[<000064560fd875c5>]
(XEN) RFLAGS: 0000000000010a07   CONTEXT: hvm guest (d29v1)
(XEN) rax: 7fffffffffffffdf   rbx: 000000000000006a   rcx: 000064560fffc4a0
(XEN) rdx: 000000000000000a   rsi: aaaaaaaaaaaaaaab   rdi: 00000000000001af
(XEN) rbp: 0000000000000035   rsp: 00007ffc58e03130   r8:  000064560fff0e80
(XEN) r9:  0000000000000010   r10: 0000000000000002   r11: 000078debebb5000
(XEN) r12: 00000000000001b0   r13: 000064560fffc160   r14: 00006456101df580
(XEN) r15: 0000000000000000   cr0: 0000000080050033   cr4: 00000000003606e0
(XEN) cr3: 000000000bd1b802   cr2: 000078debebb5024
(XEN) fsb: 000078debfd9d740   gsb: 0000000000000000   gss: 0000000000000000
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 002b   cs: 0033
(XEN) p2m_pod_demand_populate: Dom29 out of PoD memory! (tot=128016 ents=40416 dom29)
(XEN) domain_crash called from p2m-pod.c:1218
(XEN) p2m_pod_demand_populate: Dom29 out of PoD memory! (tot=128016 ents=40416 dom29)
(XEN) domain_crash called from p2m-pod.c:1218

This is followed by many hundreds of lines like this:

(XEN) CPU4: Temperature/speed normal
(XEN) CPU4: Temperature above threshold

These seem to be coming in every few seconds, so it's possible that this crash log entry does refer to the sd-whonix issue from earlier today (the machine has been idle since then).

@marmarek

This looks similar to QubesOS/qubes-issues#5121
The start sequence (regarding populate-on-demand memory) is:

  1. The VM is constructed with a memory map covering maxmem (4GB) of RAM, but with only memory (400MB) actually physically allocated - this 400MB is mapped wherever the VM tries to write something
  2. Early in the boot, the kernel loads the balloon driver, which takes away memory from the kernel and returns it to Xen - this should leave the kernel with just 400MB (or whatever the actual initial allocation was)
  3. Later, qubes-meminfo-writer starts and notifies qmemman how much memory the VM needs
  4. Based on this info, qmemman allocates (or not) more memory in Xen and tells the VM's balloon driver that it can release this memory to the kernel

Some more info: https://xenproject.org/2014/02/14/ballooning-rebooting-and-the-feature-youve-never-heard-of/
Based on the above, PoD is used only early in the boot process. Later, the balloon driver takes care not to give the kernel more memory than is actually allocated to the VM. But maybe there is some bug in that part, and not all of the over-provisioned memory is returned to Xen?

Debugging this would probably require adding some debug prints to the Linux kernel and Xen. xl debug-key q also gives some info (in the hypervisor log), but I'm not sure how to interpret it.

As a workaround, you can try either increasing the initial memory or decreasing maxmem for that VM. That should make it less likely to hit the issue.
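A sketch of that workaround via the qubesadmin API is below; the 2000 MB values are placeholders, not a recommendation.

# Sketch of the workaround above: raise the initial allocation and/or lower
# maxmem so the populate-on-demand window is smaller. Changes take effect on
# the next VM start.
from qubesadmin import Qubes

vm = Qubes().domains["sd-proxy"]
vm.memory = 2000   # initial allocation, in MB (placeholder value)
vm.maxmem = 2000   # shrinking the memory/maxmem gap narrows the PoD window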

@eloquence
Member Author

@marmarek Thanks very much for that detail. How do you see this relating to the observation that qmemman falls over when this happens? Is the fix for that likely to mitigate the problem?

@marmarek

In the meantime, the qmemman fix is already in the current-testing repo for R4.0 (QubesOS/updates-status#1749).

@zenmonkeykstop
Contributor

zenmonkeykstop commented May 4, 2020

Saw this in non-workstation usage on Friday - specifically, a Fedora-30 VM using sys-whonix for networking repeatedly failed to start, because sys-whonix itself was starting and then halting immediately. It looked like sys-whonix was never being allocated any more than 400MB of memory. Restarting qubes-qmemman and then starting the sys-whonix VM seemed to clear it up.
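(The recovery described above, as a small dom0 sketch:)

# Sketch of the recovery steps described above: restart qmemman in dom0,
# then start the VM that was failing to come up.
import subprocess

subprocess.check_call(["sudo", "systemctl", "restart", "qubes-qmemman"])
subprocess.check_call(["qvm-start", "sys-whonix"])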

@marmarek

marmarek commented May 4, 2020

With the above update installed or not?

@zenmonkeykstop
Contributor

Oh sorry, no, not with the above update.

@eloquence
Member Author

I'll run with the version in https://github.com/marmarek/qubes-core-admin/blob/dd50e300c3f27e31cc26a9f503c95b11aaf9be25/qubes/tools/qmemmand.py (from QubesOS/qubes-core-admin#331) on one of my laptops for a while to see if I can still get the issue.

@eloquence
Member Author

I've not seen this recently, and the qmemmand changes landed upstream a while ago. Have other folks seen this issue on their workstation in the last few weeks, at the 95% stage of updates?

@eloquence
Member Author

We've seen no new reports of this, so I'm closing it; it may have been fixed upstream through the memory-management fixes.

@eloquence
Member Author

Cross-linking @zenmonkeykstop's report in #619 (comment); this has become far less frequent, but there still seem to be cases in which it can be triggered.
