qmemman stops working until restarted; new VMs have only minimum memory #4890
Comments
Normally qmemman produces a significant amount of logs (into journalctl and
Ok, thanks. I'll check this the next time it happens.
It happened to me once. There were no new
This happened to me today for the first time.
This happened to me today for the first time too.
Got the exception:
EAGAIN error wasn't properly handled. Details in patch description. Fixes QubesOS/qubes-issues#4890
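For context, a minimal sketch of the kind of retry-on-EAGAIN handling that patch describes; the function and socket setup here are illustrative assumptions, not the actual qmemman code:

```python
import errno
import select
import socket

def recv_retrying(sock: socket.socket, bufsize: int = 4096) -> bytes:
    """Illustrative only: when a non-blocking recv() raises EAGAIN,
    wait until the socket is readable and retry, instead of letting the
    exception propagate and kill the request-handling loop."""
    while True:
        try:
            return sock.recv(bufsize)
        except OSError as e:
            if e.errno != errno.EAGAIN:
                raise
            # No data available yet; block until readable, then retry.
            select.select([sock], [], [])
```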
Automated announcement from builder-github: The package
Automated announcement from builder-github: The package
Came here from freedomofpress/securedrop-workstation#498 (comment). I've spent some time trying to document the originally reported issue of "qmemman stops working until restarted; new VMs have only minimum memory". That's precisely the behavior I'm seeing, and a restart of the qmemman service resolves it, at least for a while. During local testing, I've been using this script in an attempt to observe a "broken" qmemman state: https://gist.github.com/conorsch/bb8b573a6a7a98af70db2a20b4866122

I realize that there's some disagreement about the significance of "EOF" occurring in the qmemman logs. The script tries to check whether a rebalance has occurred after an "EOF" event, and if not, it assumes the service is no longer working. That's a pretty weak check, I admit: it essentially assumes that if no memory balance has been logged recently, then the service has stopped working. Hardly a bullet-proof conclusion, but nonetheless useful during debugging.

For additional detail, please see the results of testing here: https://gist.github.com/conorsch/db95d5add4af4ab68862257cca655882 I've tried to make those results as reproducible as possible. As the output shows, an AppVM is clearly stuck at 399MB of RAM. After multiple interactions with the AppVM fail to trigger a memory rebalance, the script restarts qubes-qmemman, and rebalancing functionality is immediately restored. The correlation between the logged EOF-but-no-rebalance state and the actual observed failure of the service is strong.

Since capturing verbose logs was suggested above, I've done that, and can share one interesting exception:
More detail in the log gist: https://gist.github.com/conorsch/f6b1ca4502742f9a7d263c1fc479d3f3 Unfortunately these verbose logs are not from the same failure as reported in the gist above; I'll try again to see if I can collect all of the above types of info for a single failure event.

If I'm understanding that stack trace correctly, it looks like a VM was destroyed while the memory balance was being performed. A try/except for KeyError, or at least some more debug logging, might be suitable around here: https://github.com/QubesOS/qubes-core-admin/blob/8f0ec59f956927694e60fc9d0ec949866983eb9c/qubes/qmemman/__init__.py#L245-L252

Please let me know if I can provide additional info to aid in debugging, or test any patches.
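To make that suggestion concrete, here is a minimal sketch of the defensive handling it describes; the names (domains, requests, memory_target) are placeholders and do not match the actual qmemman code at the linked lines:

```python
import logging

log = logging.getLogger("qmemman.sketch")

def apply_memset_requests(domains, requests):
    """Placeholder names throughout: `domains` stands in for qmemman's
    internal domain dictionary, `requests` for (domain_id, target) pairs
    computed by the balance algorithm."""
    for dom_id, target in requests:
        try:
            domains[dom_id].memory_target = target
        except KeyError:
            # The domain vanished (e.g. it was killed) between computing the
            # balance plan and applying it; log and keep going rather than
            # letting the exception abort the whole balance pass.
            log.warning("domain %s disappeared during balance, skipping", dom_id)
            continue
```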
Thanks, this is very helpful already!
It seems killing a VM (instead of properly shutting it down) makes it far more likely to trigger this bug. Anyway, a fix is on the way.
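For anyone who wants to try to exercise that timing, a rough sketch from dom0 using the qubesadmin Python API (the VM name is a placeholder and the sleep is arbitrary):

```python
import time
import qubesadmin

app = qubesadmin.Qubes()
vm = app.domains["work"]   # placeholder VM name

vm.start()
time.sleep(30)   # give meminfo-writer time to start reporting
vm.kill()        # abrupt destroy, unlike a clean vm.shutdown()
# Then watch the service log, e.g.: journalctl -u qubes-qmemman.service -f
```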
First, the main bug: when the meminfo xenstore watch fires, in some cases (just after starting some domain) XS_Watcher refreshes its internal list of domains before processing the event. This is done specifically to include the new domain there. But the opposite can happen too: the domain could have been destroyed. In that case the refresh_meminfo() function raises an exception, which isn't handled and interrupts the whole xenstore watch loop. This issue is likely to be triggered by killing the domain, as that way it can disappear shortly after writing an updated meminfo entry. In the case of a proper shutdown, meminfo-writer is stopped earlier and does not write updates just before the domain is destroyed.

Fix this by checking whether the requested domain is still there just after refreshing the list. Then, catch exceptions in the xenstore watch handling functions, so as not to interrupt the xenstore watch loop; if it gets interrupted, qmemman basically stops memory balancing. And finally, clear the force_refresh_domain_list flag after refreshing the domain list. That missing line caused a domain refresh at every meminfo change, making it use some more CPU time.

Thanks @conorsch for capturing valuable logs.

Fixes QubesOS/qubes-issues#4890
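A minimal sketch of the shape of that fix (re-check the domain after the refresh, keep one bad event from killing the watch loop, clear the force-refresh flag); the class and attribute names are placeholders, not the actual XS_Watcher code:

```python
import logging

log = logging.getLogger("qmemman.sketch")

class WatcherSketch:
    """Placeholder class illustrating the three-part fix described above."""

    def __init__(self):
        self.domains = {}
        self.force_refresh_domain_list = False

    def refresh_domain_list(self):
        # ... re-read the list of running domains from the hypervisor ...
        self.force_refresh_domain_list = False   # the previously missing reset

    def handle_meminfo_event(self, dom_id, meminfo):
        if self.force_refresh_domain_list:
            self.refresh_domain_list()
        if dom_id not in self.domains:
            # Domain was destroyed between the watch firing and the refresh.
            log.info("ignoring meminfo for vanished domain %s", dom_id)
            return
        self.domains[dom_id].meminfo = meminfo

    def watch_loop(self, events):
        for dom_id, meminfo in events:
            try:
                self.handle_meminfo_event(dom_id, meminfo)
            except Exception:
                # One bad event must not stop memory balancing for good.
                log.exception("error handling meminfo event for %s", dom_id)
```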
Ah yes, that would explain my issues too. Thanks for noticing, conorsch. :)
First, the main bug: when the meminfo xenstore watch fires, in some cases (just after starting some domain) XS_Watcher refreshes its internal list of domains before processing the event. This is done specifically to include the new domain there. But the opposite can happen too: the domain could have been destroyed. In that case the refresh_meminfo() function raises an exception, which isn't handled and interrupts the whole xenstore watch loop. This issue is likely to be triggered by killing the domain, as that way it can disappear shortly after writing an updated meminfo entry. In the case of a proper shutdown, meminfo-writer is stopped earlier and does not write updates just before the domain is destroyed.

Fix this by checking whether the requested domain is still there just after refreshing the list. Then, catch exceptions in the xenstore watch handling functions, so as not to interrupt the xenstore watch loop; if it gets interrupted, qmemman basically stops memory balancing. And finally, clear the force_refresh_domain_list flag after refreshing the domain list. That missing line caused a domain refresh at every meminfo change, making it use some more CPU time.

While at it, change the "EOF" log message to something a bit more meaningful.

Thanks @conorsch for capturing valuable logs.

Fixes QubesOS/qubes-issues#4890 (cherry picked from commit dd50e30)
Automated announcement from builder-github: The package
Automated announcement from builder-github: The package
Automated announcement from builder-github: The package
Or update dom0 via Qubes Manager. |
Qubes OS version:
R4.0
Affected component(s) or functionality:
qubes-qmemman.service
Steps to reproduce the behavior:
Not reliably reproducible. When the problem occurs, restarting the service works around it: sudo systemctl restart qubes-qmemman.service. After some time, qubes-qmemman.service must be restarted again.
Expected or desired behavior:
qmemman continues working normally.
Actual behavior:
qmemman must be periodically restarted by the user when the user notices that VMs are being assigned only minimum memory.
General notes:
I checked the status of qubes-qmemman.service before restarting. It was active and running, not crashed.
I have consulted the following relevant documentation:
N/A
I am aware of the following related, non-duplicate issues:
None found.