qmemman stops working until restarted; new VMs have only minimum memory #4890
Comments
Normally qmemman produces a significant amount of logs (into journalctl and
Ok, thanks. I'll check this the next time it happens.
It happened to me once. There were no new
This happened to me today for the first time.
This happened to me today for the first time too.
Got the exception:
EAGAIN error wasn't properly handled. Details in patch description. Fixes QubesOS/qubes-issues#4890
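For context, a minimal sketch of the kind of retry-on-EAGAIN handling that patch describes; the function and socket setup here are illustrative assumptions, not the actual qmemman code:

```python
import errno
import select
import socket

def recv_retrying(sock: socket.socket, bufsize: int = 4096) -> bytes:
    """Illustrative only: when a non-blocking recv() raises EAGAIN,
    wait until the socket is readable and retry, instead of letting the
    exception propagate and kill the request-handling loop."""
    while True:
        try:
            return sock.recv(bufsize)
        except OSError as e:
            if e.errno != errno.EAGAIN:
                raise
            # No data available yet; block until readable, then retry.
            select.select([sock], [], [])
```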
Automated announcement from builder-github: The package
Automated announcement from builder-github: The package
Came here from freedomofpress/securedrop-workstation#498 (comment). I've spent some time trying to document the originally reported issue of "qmemman stops working until restarted; new VMs have only minimum memory". That's precisely the behavior I'm seeing, and a restart of the qmemman service resolves it, at least for a while. During local testing, I've been using this script in an attempt to observe a "broken" qmemman state: https://gist.github.com/conorsch/bb8b573a6a7a98af70db2a20b4866122

I realize that there's some disagreement about the significance of "EOF" occurring in the qmemman logs. The script tries to check whether a rebalance has occurred after an "EOF" event, and if not, it assumes the service is no longer working. That's a pretty weak check, I admit: it essentially assumes that if no memory balance has been logged recently, then the service has stopped working. Hardly a bullet-proof conclusion, but nonetheless useful during debugging.

For additional detail, please see the results of testing here: https://gist.github.com/conorsch/db95d5add4af4ab68862257cca655882 I've tried to make those results as reproducible as possible. As the output shows, an AppVM is clearly stuck at 399MB of RAM. After multiple interactions with the AppVM fail to trigger a memory rebalance, the script restarts qubes-qmemman, and rebalancing functionality is immediately restored. The correlation between the logged EOF-but-no-rebalance state and the actual observed failure of the service is strong.

Since capturing verbose logs was suggested above, I've done that, and can share one interesting exception:
More detail in the log gist: https://gist.github.com/conorsch/f6b1ca4502742f9a7d263c1fc479d3f3 Unfortunately these verbose logs are not from the same failure as reported in the gist above; I'll try again to see if I can collect all of the above types of info for a single failure event.

If I'm understanding that stack trace correctly, it looks like a VM was destroyed while the memory balance was being performed. A try/except for KeyError, or at least some more debug logging, might be suitable around here: https://github.com/QubesOS/qubes-core-admin/blob/8f0ec59f956927694e60fc9d0ec949866983eb9c/qubes/qmemman/__init__.py#L245-L252

Please let me know if I can provide additional info to aid in debugging, or test any patches.
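To make that suggestion concrete, here is a minimal sketch of the defensive handling it describes; the names (domains, requests, memory_target) are placeholders and do not match the actual qmemman code at the linked lines:

```python
import logging

log = logging.getLogger("qmemman.sketch")

def apply_memset_requests(domains, requests):
    """Placeholder names throughout: `domains` stands in for qmemman's
    internal domain dictionary, `requests` for (domain_id, target) pairs
    computed by the balance algorithm."""
    for dom_id, target in requests:
        try:
            domains[dom_id].memory_target = target
        except KeyError:
            # The domain vanished (e.g. it was killed) between computing the
            # balance plan and applying it; log and keep going rather than
            # letting the exception abort the whole balance pass.
            log.warning("domain %s disappeared during balance, skipping", dom_id)
            continue
```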
Thanks, this is very helpful already!
It seems killing a VM (instead of properly shutting it down) makes it far more likely to trigger this bug. Anyway, a fix is on the way.
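For anyone who wants to try to exercise that timing, a rough sketch from dom0 using the qubesadmin Python API (the VM name is a placeholder and the sleep is arbitrary):

```python
import time
import qubesadmin

app = qubesadmin.Qubes()
vm = app.domains["work"]   # placeholder VM name

vm.start()
time.sleep(30)   # give meminfo-writer time to start reporting
vm.kill()        # abrupt destroy, unlike a clean vm.shutdown()
# Then watch the service log, e.g.: journalctl -u qubes-qmemman.service -f
```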
First, the main bug: when the meminfo xenstore watch fires, in some cases (just after starting some domain) XS_Watcher refreshes its internal list of domains before processing the event. This is done specifically to include the new domain there. But the opposite can happen too: the domain could have been destroyed. In that case the refresh_meminfo() function raises an exception, which isn't handled and interrupts the whole xenstore watch loop. This issue is likely to be triggered by killing the domain, as that way it can disappear shortly after writing an updated meminfo entry. In the case of a proper shutdown, meminfo-writer is stopped earlier and does not write updates just before the domain is destroyed.

Fix this by checking whether the requested domain is still there just after refreshing the list. Then, catch exceptions in the xenstore watch handling functions, so as not to interrupt the xenstore watch loop; if it gets interrupted, qmemman basically stops memory balancing. And finally, clear the force_refresh_domain_list flag after refreshing the domain list. That missing line caused a domain refresh at every meminfo change, making it use some more CPU time.

Thanks @conorsch for capturing valuable logs.

Fixes QubesOS/qubes-issues#4890
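A minimal sketch of the shape of that fix (re-check the domain after the refresh, keep one bad event from killing the watch loop, clear the force-refresh flag); the class and attribute names are placeholders, not the actual XS_Watcher code:

```python
import logging

log = logging.getLogger("qmemman.sketch")

class WatcherSketch:
    """Placeholder class illustrating the three-part fix described above."""

    def __init__(self):
        self.domains = {}
        self.force_refresh_domain_list = False

    def refresh_domain_list(self):
        # ... re-read the list of running domains from the hypervisor ...
        self.force_refresh_domain_list = False   # the previously missing reset

    def handle_meminfo_event(self, dom_id, meminfo):
        if self.force_refresh_domain_list:
            self.refresh_domain_list()
        if dom_id not in self.domains:
            # Domain was destroyed between the watch firing and the refresh.
            log.info("ignoring meminfo for vanished domain %s", dom_id)
            return
        self.domains[dom_id].meminfo = meminfo

    def watch_loop(self, events):
        for dom_id, meminfo in events:
            try:
                self.handle_meminfo_event(dom_id, meminfo)
            except Exception:
                # One bad event must not stop memory balancing for good.
                log.exception("error handling meminfo event for %s", dom_id)
```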
Ah yes, that would explain my issues too. Thanks for noticing, conorsch. :)
First, the main bug: when the meminfo xenstore watch fires, in some cases (just after starting some domain) XS_Watcher refreshes its internal list of domains before processing the event. This is done specifically to include the new domain there. But the opposite can happen too: the domain could have been destroyed. In that case the refresh_meminfo() function raises an exception, which isn't handled and interrupts the whole xenstore watch loop. This issue is likely to be triggered by killing the domain, as that way it can disappear shortly after writing an updated meminfo entry. In the case of a proper shutdown, meminfo-writer is stopped earlier and does not write updates just before the domain is destroyed.

Fix this by checking whether the requested domain is still there just after refreshing the list. Then, catch exceptions in the xenstore watch handling functions, so as not to interrupt the xenstore watch loop; if it gets interrupted, qmemman basically stops memory balancing. And finally, clear the force_refresh_domain_list flag after refreshing the domain list. That missing line caused a domain refresh at every meminfo change, making it use some more CPU time.

While at it, change the "EOF" log message to something a bit more meaningful.

Thanks @conorsch for capturing valuable logs.

Fixes QubesOS/qubes-issues#4890 (cherry picked from commit dd50e30)
Automated announcement from builder-github: The package
Automated announcement from builder-github: The package
Automated announcement from builder-github: The package
Or update dom0 via Qubes Manager. |
Qubes OS version:
R4.0
Affected component(s) or functionality:
qubes-qmemman.service
Steps to reproduce the behavior:
Not reliably reproducible. When the problem occurs, restarting the service works around it: sudo systemctl restart qubes-qmemman.service. After some time, qubes-qmemman.service must be restarted again.
Expected or desired behavior:
qmemman continues working normally.
Actual behavior:
qmemman must be periodically restarted by the user when the user notices that VMs are being assigned only minimum memory.
General notes:
I checked the status of qubes-qmemman.service before restarting. It was active and running, not crashed.
I have consulted the following relevant documentation:
N/A
I am aware of the following related, non-duplicate issues:
None found.