
qmemman stops working until restarted; new VMs have only minimum memory #4890

Closed
andrewdavidwong opened this issue Mar 16, 2019 · 15 comments · Fixed by QubesOS/qubes-core-admin#331
Labels
C: core P: major Priority: major. Between "default" and "critical" in severity. r4.0-dom0-stable r4.1-buster-cur-test r4.1-dom0-cur-test r4.1-stretch-cur-test T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists.

Comments

@andrewdavidwong
Member

andrewdavidwong commented Mar 16, 2019

Qubes OS version:

R4.0

Affected component(s) or functionality:

qubes-qmemman.service


Steps to reproduce the behavior:

  1. Use the system normally.
  2. When you notice that a newly-started VM seems very slow, check how much memory it has.
  3. Observe that all newly-started VMs have only minimum memory (400 MB by default, displayed as 399 MB in the Qubes Domains widget).
  4. In dom0: sudo systemctl restart qubes-qmemman.service.
  5. The system works normally again (VMs are assigned more memory dynamically) for anywhere from minutes to days, until the aforementioned symptoms manifest again and qubes-qmemman.service must be restarted.

Expected or desired behavior:

qmemman continues working normally.

Actual behavior:

qmemman must be periodically restarted by the user when the user notices that VMs are being assigned only minimum memory.

General notes:

  • When I last experienced this problem, I checked the status of qubes-qmemman.service before restarting. It was active and running, not crashed.
  • This only started recently. I would guess that it was introduced by the R4.0 dom0 updates around 2019-03-06 to 2019-03-07, which included kernel, Xen, and Qubes package updates.

I have consulted the following relevant documentation:

N/A

I am aware of the following related, non-duplicate issues:

None found.

@andrewdavidwong andrewdavidwong added T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists. C: core labels Mar 16, 2019
@andrewdavidwong andrewdavidwong added this to the Release 4.0 updates milestone Mar 16, 2019
@marmarek
Member

Normally qmemman produces a significant amount of logs (to journalctl and /var/log/qubes/qmemman.log); do you see anything out of the ordinary there? There should be a section of logs starting with balance_when_low_on_memory or balance_when_enough_memory. Can you provide some of it? It would be even more helpful if you could identify which VM was affected (the logs contain only the Xen ID; you can get it with xl list, but it is new at each VM start).
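As a debugging aid, the xl list output mentioned above can be turned into an ID-to-name map so that qmemman log entries (which carry only the Xen domain ID) can be matched to VMs. A minimal Python sketch; it assumes the standard xl list column layout, and the function name is made up for illustration:

```python
def domid_to_name(xl_list_output):
    """Parse 'xl list' output into {domain id: name}, so qmemman log
    entries (which contain only the Xen domain ID) can be matched to
    the VM they refer to."""
    mapping = {}
    for line in xl_list_output.splitlines()[1:]:  # first line is the header
        parts = line.split()
        if len(parts) >= 2:
            name, domid = parts[0], parts[1]
            mapping[int(domid)] = name
    return mapping
```

Note that the IDs are only valid for the current run of each VM, since a fresh ID is assigned at every VM start.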

@andrewdavidwong
Member Author

Ok, thanks. I'll check this the next time it happens.

@marmarek
Member

It happened to me once. There were no new balance* log entries, only those related to new VM startup. I guess something was wrong with the watch loop reacting to memory requests from VMs.
I've started qmemman with increased verbosity to collect more info:

systemctl stop qubes-qmemman
/usr/bin/qmemmand --foreground -v >qmemman-verbose.log 2>&1

@andrewdavidwong andrewdavidwong added the P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. label Mar 27, 2019
@t4777sd

t4777sd commented May 2, 2019

This happened to me today for the first time.

@andrewdavidwong andrewdavidwong added P: major Priority: major. Between "default" and "critical" in severity. and removed P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. labels May 2, 2019
@nil0x42

nil0x42 commented May 6, 2019

This happened to me today for the first time too.

@marmarek
Member

Got the exception:

xen.lowlevel.xs.Error: (11, 'Resource temporarily unavailable')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/bin/qmemmand", line 5, in <module>
    sys.exit(main())
  File "/usr/lib/python3.7/site-packages/qubes/tools/qmemmand.py", line 294, in main
    XS_Watcher().watch_loop()
  File "/usr/lib/python3.7/site-packages/qubes/tools/qmemmand.py", line 158, in watch_loop
    result = self.handle.read_watch()
SystemError: <method 'read_watch' of 'xen.lowlevel.xs.xs' objects> returned a result with an error set
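For context, error 11 is EAGAIN, a transient "try again" condition, so one way to keep the watch loop alive is to retry the read instead of letting the exception escape. A rough Python sketch of that idea; the callable and names here are hypothetical, not the actual qmemmand code (the real fix landed in the Xen package patch below):

```python
import errno

def read_watch_with_retry(read_watch, max_attempts=5):
    """Retry a watch read that fails with EAGAIN (errno 11) instead of
    letting the transient error kill the whole watch loop.
    'read_watch' is any callable that may raise OSError(EAGAIN)."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return read_watch()
        except OSError as e:
            if e.errno != errno.EAGAIN:
                raise  # a real error: propagate it
            last_error = e  # transient: try again
    raise last_error
```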

marmarek added a commit to QubesOS/qubes-vmm-xen that referenced this issue Sep 6, 2019
EAGAIN error wasn't properly handled. Details in patch description.

Fixes QubesOS/qubes-issues#4890
@qubesos-bot

Automated announcement from builder-github

The package xen_4.12.1-1 has been pushed to the r4.1 testing repository for the Debian template.
To test this update, first enable the testing repository in /etc/apt/sources.list.d/qubes-*.list by uncommenting the line containing stretch-testing (or appropriate equivalent for your template version), then use the standard update command:

sudo apt-get update && sudo apt-get dist-upgrade

Changes included in this update

@qubesos-bot

Automated announcement from builder-github

The package python2-xen-4.12.1-1.fc29 has been pushed to the r4.1 testing repository for dom0.
To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing

Changes included in this update

@conorsch

conorsch commented Apr 1, 2020

Came here from freedomofpress/securedrop-workstation#498 (comment). I have spent some time trying to document the originally reported issue of "qmemman stops working until restarted; new VMs have only minimum memory". That's precisely the behavior I'm seeing, and a restart of the qmemman service resolves it, at least for a while.

During local testing, I've been using this script in an attempt to observe a "broken" qmemman state: https://gist.github.com/conorsch/bb8b573a6a7a98af70db2a20b4866122
I realize that there's some disagreement about the significance of "EOF" occurring in the qmemman logs. The script tries to check whether a rebalance has occurred after an "EOF" event, and if not, it assumes the service is no longer working. That's a pretty weak check, I admit: it essentially assumes that if no memory balance has been logged recently, the service has stopped working. Hardly a bullet-proof conclusion, but nonetheless useful during debugging.
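The heuristic described above amounts to a small log scan: an "EOF" event with no balance entry after it. A sketch of that check, assuming illustrative log markers (the real qmemman log lines differ):

```python
def looks_stuck(log_lines):
    """Weak heuristic: assume qmemman has stopped balancing when an
    'EOF' event was logged and no balance entry has appeared after it."""
    last_eof = last_balance = -1
    for i, line in enumerate(log_lines):
        if "EOF" in line:
            last_eof = i
        elif "balance" in line:
            last_balance = i
    return last_eof >= 0 and last_balance < last_eof
```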

For additional detail, please see the results of testing here: https://gist.github.com/conorsch/db95d5add4af4ab68862257cca655882
I've tried to make those results as reproducible as possible. As the output shows, an AppVM is clearly stuck at 399 MB of RAM. After multiple interactions with the AppVM fail to trigger a memory rebalance, the script restarts qubes-qmemman, and rebalancing functionality is immediately restored. The correlation between the logged EOF-but-no-rebalance condition and the actually observed failed state of the service is strong.

Since capturing verbose logs was suggested above, I've done that, and can share one interesting exception:

del_domain(id='171')
refresh_meminfo(domid=171, untrusted_meminfo_key=b'296304')
global_lock released
Traceback (most recent call last):
  File "/usr/bin/qmemmand", line 5, in <module>
    sys.exit(main())
  File "/usr/lib/python3.5/site-packages/qubes/tools/qmemmand.py", line 298, in main
    XS_Watcher().watch_loop()
  File "/usr/lib/python3.5/site-packages/qubes/tools/qmemmand.py", line 161, in watch_loop
    token.fn(self, token.param)
  File "/usr/lib/python3.5/site-packages/qubes/tools/qmemmand.py", line 149, in meminfo_changed
    system_state.refresh_meminfo(domain_id, untrusted_meminfo_key)
  File "/usr/lib/python3.5/site-packages/qubes/qmemman/__init__.py", line 251, in refresh_meminfo
    self.domdict[domid], untrusted_meminfo_key)
KeyError: '171'

More detail in the log gist: https://gist.github.com/conorsch/f6b1ca4502742f9a7d263c1fc479d3f3
Unfortunately, these verbose logs are not from the same failure as reported in the gist above; I'll try again to see if I can collect all of the above types of info for a single failure event.

If I'm understanding that stack trace correctly, it looks like a VM was destroyed while the memory balance was being performed. A try/except for KeyError, or at least some more debug logging, might be suitable around here: https://github.com/QubesOS/qubes-core-admin/blob/8f0ec59f956927694e60fc9d0ec949866983eb9c/qubes/qmemman/__init__.py#L245-L252
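The guard suggested there could look roughly like this; a sketch only, not the actual qmemman code, with domdict standing in for the real domain dictionary and the function name made up:

```python
def refresh_meminfo_guarded(domdict, domid, untrusted_meminfo_key, log=print):
    """Skip the refresh when the domain vanished (e.g. was killed)
    between the xenstore watch firing and this handler running."""
    if domid not in domdict:
        log("domain %s disappeared before meminfo refresh; ignoring" % domid)
        return False
    # ... the real code would sanitize untrusted_meminfo_key and refresh
    # domdict[domid] here ...
    return True
```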

Please let me know if I can provide additional info to aid in debugging, or test any patches.

@marmarek
Member

marmarek commented Apr 1, 2020

Thanks, this is very helpful already!

@marmarek
Member

marmarek commented Apr 1, 2020

It seems killing a VM (instead of properly shutting it down) makes it far more likely to trigger this bug. Anyway, a fix is on the way.

marmarek added a commit to marmarek/qubes-core-admin that referenced this issue Apr 1, 2020
First the main bug: when a meminfo xenstore watch fires, in some cases
(just after starting some domain) XS_Watcher refreshes its internal list of
domains before processing the event. This is done specifically to
include the new domain there. But the opposite can happen too - the
domain could be destroyed. In that case the refresh_meminfo() function raises
an exception, which isn't handled and interrupts the whole xenstore
watch loop. This issue is likely to be triggered by killing a domain,
as that way it can disappear shortly after writing an updated meminfo
entry. In case of a proper shutdown, meminfo-writer is stopped earlier and
does not write updates just before domain destroy.
Fix this by checking whether the requested domain is still there just after
refreshing the list.

Then, catch exceptions in the xenstore watch handling functions, so as not
to interrupt the xenstore watch loop. If that loop gets interrupted, qmemman
basically stops memory balancing.

And finally, clear the force_refresh_domain_list flag after refreshing the
domain list. That missing line caused a domain-list refresh on every meminfo
change, making qmemman use somewhat more CPU time.

Thanks @conorsch for capturing valuable logs.

Fixes QubesOS/qubes-issues#4890
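The middle part of that commit message (don't let one failing handler kill the watch loop) can be sketched roughly as follows; the event/handler shape here is illustrative, not the real qmemmand structure:

```python
def watch_loop(read_watch, handlers, log=print):
    """Dispatch watch events to handlers, catching handler exceptions so
    that one bad event cannot stop memory balancing for good."""
    while True:
        event = read_watch()
        if event is None:  # illustrative shutdown signal
            return
        path, token = event
        try:
            handlers[token](path)
        except Exception as e:
            log("exception in watch handler for %r: %r" % (token, e))
            # keep looping: balancing must survive a single bad event
```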
@0spinboson

0spinboson commented Apr 1, 2020

ah yes, that would explain my issues too. Thanks for noticing, conorsch. :)

marmarek added a commit to QubesOS/qubes-core-admin that referenced this issue Apr 18, 2020
First the main bug: when a meminfo xenstore watch fires, in some cases
(just after starting some domain) XS_Watcher refreshes its internal list of
domains before processing the event. This is done specifically to
include the new domain there. But the opposite can happen too - the
domain could be destroyed. In that case the refresh_meminfo() function raises
an exception, which isn't handled and interrupts the whole xenstore
watch loop. This issue is likely to be triggered by killing a domain,
as that way it can disappear shortly after writing an updated meminfo
entry. In case of a proper shutdown, meminfo-writer is stopped earlier and
does not write updates just before domain destroy.
Fix this by checking whether the requested domain is still there just after
refreshing the list.

Then, catch exceptions in the xenstore watch handling functions, so as not
to interrupt the xenstore watch loop. If that loop gets interrupted, qmemman
basically stops memory balancing.

And finally, clear the force_refresh_domain_list flag after refreshing the
domain list. That missing line caused a domain-list refresh on every meminfo
change, making qmemman use somewhat more CPU time.

While at it, change the "EOF" log message to something a bit more
meaningful.

Thanks @conorsch for capturing valuable logs.

Fixes QubesOS/qubes-issues#4890

(cherry picked from commit dd50e30)
@qubesos-bot

Automated announcement from builder-github

The package qubes-core-dom0-4.1.11-1.fc31 has been pushed to the r4.1 testing repository for dom0.
To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing

Changes included in this update

marmarek added a commit to QubesOS/qubes-core-admin that referenced this issue Apr 18, 2020
@qubesos-bot

Automated announcement from builder-github

The package qubes-core-dom0-4.0.50-1.fc25 has been pushed to the r4.0 testing repository for dom0.
To test this update, please install it with the following command:

sudo qubes-dom0-update --enablerepo=qubes-dom0-current-testing

Changes included in this update

@qubesos-bot

Automated announcement from builder-github

The package qubes-core-dom0-4.0.50-1.fc25 has been pushed to the r4.0 stable repository for dom0.
To install this update, please use the standard update command:

sudo qubes-dom0-update

Or update dom0 via Qubes Manager.

Changes included in this update
