Sometimes a VM fails to start due to an out-of-memory error, even though qmemman (supposedly) freed enough memory #9431
Labels
affects-4.3
This issue affects Qubes OS 4.3.
C: core
diagnosed
Technical diagnosis has been performed (see issue comments).
P: major
Priority: major. Between "default" and "critical" in severity.
pr submitted
A pull request has been submitted for this issue.
r4.2-host-stable
r4.3-host-cur-test
T: bug
Type: bug report. A problem or defect resulting in unintended behavior in something that exists.
Comments
marmarek
added
T: bug
Type: bug report. A problem or defect resulting in unintended behavior in something that exists.
C: core
P: default
Priority: default. Default priority for new issues, to be replaced given sufficient information.
affects-4.3
This issue affects Qubes OS 4.3.
P: major
Priority: major. Between "default" and "critical" in severity.
and removed
P: default
Priority: default. Default priority for new issues, to be replaced given sufficient information.
labels
Aug 24, 2024
andrewdavidwong
added
the
needs diagnosis
Requires technical diagnosis from developer. Replace with "diagnosed" or remove if otherwise closed.
label
Aug 25, 2024
With some extra logging I've collected the following:
So, it used 434MB, 12MB more than calculated. I need to collect more data for the new formula...
marmarek
added a commit
to marmarek/qubes-core-admin
that referenced
this issue
Oct 20, 2024
Experiments show that using memory hotplug or populate-on-demand makes no difference in required memory at startup. But PV vs PVH/HVM does make a difference: PV doesn't need extra per-MB overhead at all. On top of that, experimentally find the correct factor. Do it by starting a VM (paused) with different parameters and comparing `xl info free_memory` before and after.

memory / maxmem: difference
400 / 4000: 434
600 / 4000: 634
400 / 400: 405
400 / 600: 407
400 / 2000: 418
2000 / 2000: 2018
600 / 600: 607

All of the above are with 2 vcpus. Testing with other vcpu counts shows that 1.5MB per vcpu is quite accurate. As seen above, the initial memory doesn't affect the overhead; only maxmem counts. Applying linear regression to that shows it's about 8kB per MB of maxmem, so round it up to 8192 bytes. The base overhead of 4MB doesn't match exactly, but since the calculated number is smaller, leave it at 4MB as a safety margin.

Fixes QubesOS/qubes-issues#9431
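The regression mentioned in this commit message can be reproduced from the listed measurements. The following is only a sketch: the function name `fit_overhead` and the choice of an ordinary least-squares fit are assumptions for illustration, since the commit message does not say how the regression was actually run.

```python
# Measurements from the commit message: (memory MB, maxmem MB, free_memory drop MB),
# all taken with 2 vcpus.
MEASUREMENTS = [
    (400, 4000, 434),
    (600, 4000, 634),
    (400, 400, 405),
    (400, 600, 407),
    (400, 2000, 418),
    (2000, 2000, 2018),
    (600, 600, 607),
]


def fit_overhead(measurements):
    """Least-squares fit of overhead (used - memory) against maxmem.

    Hypothetical helper, not part of qubes-core-admin.
    """
    xs = [maxmem for _, maxmem, _ in measurements]
    ys = [used - memory for memory, _, used in measurements]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                     # MB of overhead per MB of maxmem
    intercept = mean_y - slope * mean_x   # fixed part, includes the 2 vcpus
    return slope, intercept


slope, intercept = fit_overhead(MEASUREMENTS)
print(f"overhead per MB of maxmem: {slope:.4f} MB")
print(f"fixed part (2 vcpus included): {intercept:.2f} MB")
```

The fitted slope comes out close to 0.008 MB per MB of maxmem, which is consistent with rounding the per-MB overhead up to 8192 bytes.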
marmarek
added a commit
to marmarek/qubes-core-admin
that referenced
this issue
Oct 20, 2024
Experiments show that using memory hotplug or populate-on-demand makes no difference in required memory at startup. But PV vs PVH/HVM does make a difference: PV doesn't need extra per-MB overhead at all. On top of that, experimentally find the correct factor. Do it by starting a VM (paused) with different parameters and comparing `xl info free_memory` before and after.

memory / maxmem: difference
400 / 4000: 434
600 / 4000: 634
400 / 400: 405
400 / 600: 407
400 / 2000: 418
2000 / 2000: 2018
600 / 600: 607

All of the above are with 2 vcpus. Testing with other vcpu counts shows that 1.5MB per vcpu is quite accurate. As seen above, the initial memory doesn't affect the overhead; only maxmem counts. Applying linear regression to that shows it's about 0.008MB per MB of maxmem, so round it up to 8192 bytes. The base overhead of 4MB doesn't match exactly, but since the calculated number is smaller, leave it at 4MB as a safety margin.

Fixes QubesOS/qubes-issues#9431
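The constants described in this commit message (4MB base, 1.5MB per vcpu, 8192 bytes per MB of maxmem for PVH/HVM only) can be combined into an estimate like the one below. This is a sketch under stated assumptions: the function name, parameter names, and the `virt_mode` string values are hypothetical, and the actual qubes-core-admin code may be structured differently.

```python
MIB = 1024 * 1024

BASE_OVERHEAD = 4 * MIB            # fixed margin, kept at 4MB
PER_VCPU_OVERHEAD = 3 * MIB // 2   # 1.5MB per vcpu
PER_MB_OVERHEAD = 8192             # bytes per MB of maxmem, PVH/HVM only


def estimate_startup_memory(memory_mb, maxmem_mb, vcpus, virt_mode="pvh"):
    """Estimate the bytes of free Xen memory needed to start a VM.

    Hypothetical helper; names and signature are illustrative only.
    """
    needed = memory_mb * MIB + BASE_OVERHEAD + vcpus * PER_VCPU_OVERHEAD
    if virt_mode != "pv":
        # PV needs no extra per-MB overhead according to the experiments.
        needed += maxmem_mb * PER_MB_OVERHEAD
    return needed


# Compare against the measured 400 / 4000 case (434MB used, 2 vcpus).
estimate = estimate_startup_memory(400, 4000, 2)
print(estimate / MIB)
```

For the 400 / 4000 case with 2 vcpus this gives 438.25MB, slightly above the measured 434MB, so the estimate errs on the safe side for that data point.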
While the formula may be inaccurate, it looks like the issue is somewhere else. One of the failed logs contains:
So, it requested 438MB from qmemman, but then only 119MB was freed, which is far too little to start a 400MB VM.
marmarek
added a commit
to marmarek/qubes-core-admin
that referenced
this issue
Oct 22, 2024
... for the next watcher loop iteration. If two VMs are started in parallel, there may be no watcher loop iteration between handling their requests. This means the memory request for the second VM will operate on an outdated list of VMs and may not account for some allocations (it assumes memory is free, while in fact it's already allocated to another VM). If that happens, the second VM may fail to start due to an out-of-memory error. This is a very similar problem to the one described in QubesOS/qubes-issues#1389, but it affects actual VM startup, not its auxiliary processes. Fixes QubesOS/qubes-issues#9431
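The stale-list problem described in this commit message can be illustrated with a toy model. Everything here is hypothetical (class, method, and parameter names), and the model is heavily simplified: it refuses requests instead of ballooning other VMs, and it only shows why deciding from an outdated list of allocations leads to an out-of-memory failure.

```python
class ToyMemoryManager:
    """Toy model of a manager that decides based on a cached VM list."""

    def __init__(self, total_mb):
        self.total_mb = total_mb
        self.actual = {}   # allocations as the hypervisor sees them
        self.cached = {}   # allocations as of the last refresh

    def refresh(self):
        # Stand-in for re-reading the domain list.
        self.cached = dict(self.actual)

    def request(self, name, size_mb, refresh_first):
        if refresh_first:
            self.refresh()
        believed_free = self.total_mb - sum(self.cached.values())
        if believed_free < size_mb:
            return "refused"
        actual_free = self.total_mb - sum(self.actual.values())
        if actual_free < size_mb:
            # The manager thought memory was free, but it was not.
            return "out of memory"
        self.actual[name] = size_mb
        return "started"


stale = ToyMemoryManager(total_mb=600)
stale_results = [stale.request("vm1", 438, False), stale.request("vm2", 438, False)]

fresh = ToyMemoryManager(total_mb=600)
fresh_results = [fresh.request("vm1", 438, True), fresh.request("vm2", 438, True)]

print(stale_results)  # ['started', 'out of memory']
print(fresh_results)  # ['started', 'refused']
```

With a stale list, the second request passes the manager's own check and only fails later at domain creation; with a refreshed list, the shortage is visible at the point where the manager could still act on it.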
marmarek
added a commit
to marmarek/qubes-core-admin
that referenced
this issue
Oct 22, 2024
marmarek
added a commit
to marmarek/qubes-core-admin
that referenced
this issue
Oct 22, 2024
marmarek
added a commit
to marmarek/qubes-core-admin
that referenced
this issue
Oct 23, 2024
marmarek
added a commit
to marmarek/qubes-core-admin
that referenced
this issue
Oct 24, 2024
Any memory adjustments must be done while holding a lock, to not interfere with client request handling. This is critical to prevent memory just freed for a new VM being re-allocated elsewhere. The domain_list_changed() function failed to do that - do_balance call was done after releasing the lock. It wasn't a problem for a long time because of Python's global interpreter lock. But Python 3.13 is finally starting to support proper parallel thread execution, and it revealed this bug. Fixes QubesOS/qubes-issues#9431
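The locking pattern described in this commit message can be sketched as follows. The names `domain_list_changed` and `do_balance` come from the commit message, but the class and its internals are a hypothetical toy, not the real qmemman code; the sketch only records whether the lock was held at the moment balancing ran.

```python
import threading


class ToyBalancer:
    """Toy stand-in for a memory balancer; not the real qmemman classes."""

    def __init__(self):
        self.lock = threading.Lock()
        self.domains = []
        self.balance_saw_lock_held = []

    def do_balance(self):
        # Record whether the lock was held while balancing ran.
        self.balance_saw_lock_held.append(self.lock.locked())

    def domain_list_changed_buggy(self, domains):
        with self.lock:
            self.domains = list(domains)
        # Lock already released: a client request could be handled here,
        # before balancing re-allocates memory.
        self.do_balance()

    def domain_list_changed_fixed(self, domains):
        with self.lock:
            self.domains = list(domains)
            # Balancing happens before the lock is released.
            self.do_balance()


balancer = ToyBalancer()
balancer.domain_list_changed_buggy(["vm1"])
balancer.domain_list_changed_fixed(["vm1", "vm2"])
print(balancer.balance_saw_lock_held)  # [False, True]
```

In the buggy variant there is a window between releasing the lock and balancing in which another thread can act on memory that was just freed; the fixed variant closes that window by keeping both steps under the same lock.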
marmarek
added a commit
to marmarek/qubes-core-admin
that referenced
this issue
Oct 27, 2024
marmarek
added a commit
to marmarek/qubes-core-admin
that referenced
this issue
Oct 27, 2024
marmarek
added a commit
to marmarek/qubes-core-admin
that referenced
this issue
Oct 27, 2024
andrewdavidwong
added
diagnosed
Technical diagnosis has been performed (see issue comments).
pr submitted
A pull request has been submitted for this issue.
and removed
needs diagnosis
Requires technical diagnosis from developer. Replace with "diagnosed" or remove if otherwise closed.
labels
Oct 30, 2024
fepitre
pushed a commit
to fepitre/qubes-core-admin
that referenced
this issue
Nov 4, 2024
fepitre
pushed a commit
to fepitre/qubes-core-admin
that referenced
this issue
Nov 4, 2024
marmarek
added a commit
to QubesOS/qubes-core-admin
that referenced
this issue
Dec 8, 2024
Any memory adjustments must be done while holding a lock, to not interfere with client request handling. This is critical to prevent memory just freed for a new VM being re-allocated elsewhere. The domain_list_changed() function failed to do that - do_balance call was done after releasing the lock. It wasn't a problem for a long time because of Python's global interpreter lock. But Python 3.13 is finally starting to support proper parallel thread execution, and it revealed this bug. Fixes QubesOS/qubes-issues#9431 (cherry picked from commit 2de9eb7) Fixes QubesOS/qubes-issues#9627
Qubes OS release
R4.3
Brief summary
Sometimes a VM fails to start with the following message:
internal error: libxenlight failed to create new domain
The libxl logs show that the cause is running out of memory.
Steps to reproduce
Not sure exactly. Happens from time to time during integration tests, I think more often when starting two VMs at once.
Expected behavior
VM starts normally
Actual behavior
VM fails to start
libxl-driver.log contains
I suspect the calculation of how much free memory is needed to start a VM needs an update.
This started happening after the update to Xen 4.19 (from Xen 4.17).