[0.2.3-rpm] libxenlight failed to create new domain sd-log #498
See #495 for another report of this.
Again seeing this during an update run today. Am blowing away this prod install momentarily, will report if I see it again on reinstall.
I'm seeing this on a brand new install from scratch, but this time with […]
Yup, repro'd. Installed prod yesterday, applied updates today, saw the error when I attempted to shut down sd-related VMs. Edit: see QubesOS/qubes-issues#3125
@rocodes How much memory are you using?
@conorsch has provided a script to monitor the health of the qmemman service in https://gist.github.com/conorsch/bb8b573a6a7a98af70db2a20b4866122. I ran this script […]
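The gist's contents aren't reproduced in this thread, so here is a hedged sketch of what such a qmemman health check might look like, based on the behavior described below (the cryptic "EOF" line being the most recent qmemman journal entry when the daemon appears stuck). The function name and log format are illustrative assumptions, not the actual gist:

```python
# Hypothetical recreation of the health check described in this thread:
# qmemman appears "stuck" when the bare "EOF" message is the most recent
# line it has logged. Log format below is synthetic.

def qmemman_looks_stuck(journal_lines):
    """Return True if the last qmemman-related journal line ends with 'EOF'."""
    qmemman_lines = [l for l in journal_lines if "qmemman" in l]
    if not qmemman_lines:
        return False
    return qmemman_lines[-1].rstrip().endswith("EOF")

# Synthetic journal excerpt for illustration:
journal = [
    "Oct 01 10:00:01 dom0 qmemman.daemon.algo: balance_when_enough_memory()",
    "Oct 01 10:00:02 dom0 qmemman.systemstate: stat: xenfree=1024 MiB",
    "Oct 01 10:00:03 dom0 qmemman: EOF",
]
print(qmemman_looks_stuck(journal))  # True for this synthetic excerpt
```

In a real deployment this would be fed from `journalctl -u qubes-qmemman` output; as marmarek notes below, the "EOF" line itself is expected during VM startup, so only its persistence as the *last* entry is interesting.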
(For the record, we have reports of this happening on systems with 16 and 32 GB, the latter of which should be more than sufficient.)
The "EOF" message from qmemman isn't any kind of failure on its part. It is a normal(*) message during VM startup, whether the startup failed or not. It means qmemman resumes balancing normally after allocating memory for the new VM. (*) "normal" as in "expected", but indeed very cryptic.
@marmarek The state change Conor's script checks for (EOF being the most recent journal entry) is reached exactly when I see the VM startup error, and after that, qmemman does not seem to leave that state until it's bounced. I'll test it again today to see whether that behavior is consistently reproducible. I do not see any suspicious line in the […] output.
The error "libxenlight failed to create new domain" always has related details in […]
The screenshot above illustrates what happened -- there was no "Cannot connect to" error displayed on the screen, but I also do not see any relevant log entries in the […]
(To be sure, I'll also tail […])
qmemman ceasing to work sounds like QubesOS/qubes-issues#4890 (I'll backport the fix to R4.0 in a moment - EDIT: or not, it seems to be a different issue in R4.0), but that doesn't result in the "libxenlight failed to create new domain" error.
This screenshot shows the system state with libxl-driver.log tailed. The devices-related errors happen while the updater restarts system VMs. The […]. Regarding […]
Is it always the same VM? Also, copying a comment from QubesOS/qubes-issues#3125 (comment):
This seems to be the issue. Can you check the console log of the corresponding stubdom ([…])?
I also saw it for another VM in the past (also at that same stage of the update process), but most recently it's always been […]. I don't see anything like that error in the […]
The "Locked DMA" line is repeated a gazillion times before that.
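When a console log repeats one line many thousands of times, it can help to collapse runs of identical lines before reading it. A minimal sketch (the log lines below are synthetic placeholders, not the actual log contents):

```python
from itertools import groupby

def collapse_repeats(lines):
    """Collapse consecutive runs of identical log lines into (line, count) pairs."""
    return [(line, len(list(run))) for line, run in groupby(lines)]

# Synthetic example: five identical lines followed by one distinct line.
log = ["Locked DMA"] * 5 + ["some other message"]
for line, count in collapse_repeats(log):
    print(f"{line}  (x{count})")
```

This is the same idea as piping a log through `uniq -c`; the Python version is handy when the log is already being processed in a script.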
Just did another run and did not see new lines being appended to that file when the error occurred. But I did see a repeat of this in […]
Cross-referencing potentially related #486, which reboots system VMs in case of […]
I have an idea!
This can easily be verified by avoiding this automatic shutdown. It is performed only if the VM wasn't initially running, so if you start […]
@marmarek That's quite a lead! Thank you for chiming in, will investigate further and try to confirm that hypothesis. Right now we're a bit aggressive in force-applying updates, which is a temporary measure for the purposes of our limited pilot. There are likely a few ways we could soften that logic, and thereby reduce edge cases like the race condition you describe.
Poked a bit more at this, and I do think some or all of this issue was introduced with the new reboot behavior in #486, which also correlates reasonably well with when we started observing this issue. The following STR (steps to reproduce) work for me:
Then, if I patch this line in […], the problem goes away. Note for anyone testing this: the updater will reinstate the unpatched file via Salt as part of the update, so expect subsequent runs to fail again.
^ I'll try these exact STR again now to see whether this consistently reproduces both the error and the resolution on my system. I'll first run a full update to ensure no variance between runs.
I think I am late to the party (and it sounds like there are already hypotheses), but I've repro'd this again with […]
And per a suggestion from @conorsch, here's a version that uses […]
In the spirit of raising its head again just when we think we may have squashed it, I just saw the […]
And as usual, […]
Aha! A bit more insight into what's going on. Contrary to the logs above, […]
It started and then shut down without an error message to the user. This is similar to what we were seeing with […]. I'm not seeing the same error in the guest log for […]
Well, those more verbose logs sure are paying off! The most interesting problem is […]
which is pretty baffling to me. Even with the verbose logs I can't point to an explanation there. Given that all these VMs have […]. Perhaps the additional wait time after qvm-start wouldn't resolve the specific problem you're seeing, but the only arguments I see against adding the sleep are: […]
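On the sleep-vs-wait question raised above: instead of a fixed sleep after `qvm-start`, a bounded polling loop returns as soon as the condition holds, so the common case stays fast while slow starts still get the full timeout. A generic sketch (the predicate in a real script would query the VM's state, e.g. via `qvm-ls`; here it is faked for illustration):

```python
import time

def wait_until(predicate, timeout=30.0, interval=0.5,
               clock=time.monotonic, sleep=time.sleep):
    """Poll predicate() until it returns True or the timeout elapses.

    Returns True if the condition was observed, False on timeout.
    clock/sleep are injectable to make the helper easy to test.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        if predicate():
            return True
        sleep(interval)
    # One last check so a slow loop iteration can't miss a late success.
    return predicate()

# Illustration with a fake "VM state" that becomes ready on the third poll:
polls = iter([False, False, True])
print(wait_until(lambda: next(polls), timeout=5, interval=0))
```

Compared with a fixed sleep, this caps worst-case delay explicitly and doesn't slow down the happy path.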
Is the above console log the end of it? That's far from a full start. If there are more messages there - do you see any reason for the shutdown? Or is the log just cut short? If the latter, check the Xen log (`xl dmesg`, or `/var/log/xen/console/hypervisor.log`) - maybe the VM has crashed. As for the read-only filesystem - that is normal: /dev/xvdd is the partition with kernel modules.
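A quick way to scan `xl dmesg` or hypervisor log output for crash indicators is a simple marker filter. This is a hedged sketch; the marker words and the sample lines are illustrative, not taken from the actual logs in this thread:

```python
def find_crash_lines(log_text, markers=("crash", "fatal", "panic")):
    """Return lines from hypervisor/console log text that mention crash indicators."""
    return [line for line in log_text.splitlines()
            if any(marker in line.lower() for marker in markers)]

# Synthetic sample resembling Xen console output:
sample = (
    "(XEN) grant_table.c: Expanding d15 grant table\n"
    "(XEN) domain_crash called from p2m.c\n"
    "(XEN) Domain 15 (vcpu#0) crashed on cpu#2:\n"
)
for line in find_crash_lines(sample):
    print(line)
```

The same effect at the shell would be `xl dmesg | grep -iE 'crash|fatal|panic'`; the Python form is convenient when the log is already in hand.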
Thanks for clarifying, I was just sifting through logs locally to understand that better!
No, you are correct there as well: […] Thanks for the suggestions on more places to inspect to understand the sd-whonix start-then-stop behavior that the new logging clearly didn't illuminate.
Here's my entire […] log. The most recently logged run is for the successful VM start that was triggered when I attempted to log in to the SecureDrop Client. It's the previous run that failed (without an error to the user), and that may then have caused […]
One more thing that could be related in some weird way is […]
I do see a crash in that log, though it's not timestamped:
This is followed by many hundreds of lines like this:
These seem to be coming in every few seconds, so it's possible that this crash log entry does refer to the […]
This looks similar to QubesOS/qubes-issues#5121
Some more info: https://xenproject.org/2014/02/14/ballooning-rebooting-and-the-feature-youve-never-heard-of/ Debugging this would probably require adding some debug prints to the Linux kernel and Xen. As a workaround, you can try either increasing the initial […]
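The sentence above is truncated, but assuming the first suggested workaround refers to increasing the VM's initial memory allocation, that can be done with the `memory` preference in dom0. The value here is illustrative; adjust it to the system:

```shell
# Workaround sketch (assumed reading of the truncated suggestion above):
# raise sd-log's initial memory allocation so it boots with more headroom.
qvm-prefs sd-log memory 600

# Inspect the current value:
qvm-prefs sd-log memory
```

This changes only the startup allocation; qmemman can still rebalance memory afterwards.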
@marmarek Thanks much for that detail. How do you see this relating to the issue that […]?
In the meantime, the qmemman fix is already in the current-testing repo for R4.0 (QubesOS/updates-status#1749).
Saw this in non-workstation usage on Friday - specifically, a Fedora-30 VM using […]
With the above update installed or not? |
Oh sorry, no, not with the above update. |
I'll run with the version in https://github.com/marmarek/qubes-core-admin/blob/dd50e300c3f27e31cc26a9f503c95b11aaf9be25/qubes/tools/qmemmand.py (from QubesOS/qubes-core-admin#331) on one of my laptops for a while to see if I can still get the issue.
I've not seen this recently, and the […]
We've seen no new reports of this, so closing; it may have been fixed upstream through memory-management fixes.
Cross-linking @zenmonkeykstop's report in #619 (comment); this has become far less frequent, but there still seem to be cases in which it can be triggered.
In my prod install, I have now seen this error message (displayed as a sticky Qubes notification) twice towards the end of updates. sd-log does end up running just fine. /var/log/qubes/qubes.log shows this output: […]