Frequent VM startup failures - R4rc2 #3221

Closed
tasket opened this issue Oct 26, 2017 · 14 comments
Labels: C: core, T: bug
Milestone: Release 4.0

@tasket

tasket commented Oct 26, 2017

Qubes OS version:

R4rc2

Affected TemplateVMs:

fedora-25
debian-8


Steps to reproduce the behavior:

Start any VM using these templates.

Expected behavior:

VM starts and responds to commands.

Actual behavior:

Desktop notification that VM is starting, but there is relatively little disk activity and the VM menu widget shows a busy indicator for that VM until a minute later when the VM disappears.

This happens close to 50% of the time.

General notes:

Trying to start the VM over and over can get the VM running.
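
The retry workaround described above can be sketched as a small dom0 script. This is only an illustrative sketch, not something posted in the thread: it assumes the standard `qvm-start` tool is on the PATH, and the helper names, attempt count, and delay are hypothetical.

```python
import subprocess
import time
from typing import Callable


def start_vm(name: str) -> bool:
    """Run qvm-start once; report True if it exited with status 0."""
    result = subprocess.run(["qvm-start", name], capture_output=True, text=True)
    return result.returncode == 0


def start_with_retries(
    name: str,
    attempts: int = 5,          # hypothetical default, not from the thread
    delay: float = 10.0,        # seconds between attempts, also hypothetical
    starter: Callable[[str], bool] = start_vm,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Return the 1-based attempt that succeeded, or 0 if every attempt failed."""
    for attempt in range(1, attempts + 1):
        if starter(name):
            return attempt
        if attempt < attempts:
            sleep(delay)  # give the failed domain time to be cleaned up
    return 0
```

Calling `start_with_retries("work")` in dom0 would retry a VM named `work` (a placeholder name) up to five times. The `starter` and `sleep` parameters exist only so the loop can be exercised without a Qubes system.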

Discussion thread here:
https://groups.google.com/d/msgid/qubes-users/g5tLp_yA2-jvKvSkZmQyCEJU50NS6aWb7m1Dmezb6d1y2loGsi-fh1pSgK5Jk2ovnwECfmVVAym11iFX7CbaAdGvX_iKZWOvKzzAF4eEcsE%3D%40protonmail.com


Related issues:

@tasket
Author

tasket commented Oct 26, 2017

Issue #3125 concerns 'libxenlight' errors, which I'm not seeing.

Also, this startup problem occurs with regular appVMs as much as network-providing or device-mapped VMs. So if sys-net and sys-firewall are running OK, an appVM that uses sys-firewall (or no netvm at all) may still fail to start.

@andrewdavidwong added the T: bug and C: core labels Oct 27, 2017
@andrewdavidwong added this to the Release 4.0 milestone Oct 27, 2017
@tasket
Author

tasket commented Oct 27, 2017

There is an emerging pattern (and workaround) to what I'm experiencing:

On boot, sys-net will usually start but sys-firewall or VPN (these both connect to sys-net) will fail, and any appVMs that use these proxyVMs will also fail. Non-connected appVMs may or may not start at this point. If I keep re-trying different VMs, I may get sys-firewall or VPN to run but downstream appVMs can't access the net.

However, if I shut down sys-net along with all the other VMs, I can then start VMs with much more reliability: I can start an appVM, and then sys-net and VPN or sys-firewall will start and run properly.

A memory management issue may be related to this... I have noticed sometimes appVMs lose the ability to acquire more RAM despite plenty available, resulting in the appVM swapping heavily when demand increases. But it may be the case that when I restart sys-net as described above, the new VM instances retain their ability to gain (and relinquish) RAM; that is how my system is behaving now.

@na--

na-- commented Oct 27, 2017

> A memory management issue may be related to this... I have noticed sometimes appVMs lose the ability to acquire more RAM despite plenty available, resulting in the appVM swapping heavily when demand increases.

I've not experienced any of the other issues you described, but that memory management issue happened to me a few times and I've not managed to reliably replicate it yet. Although, now that I think about it, sys-net usually fails to start at boot, even though it should, while sys-usb starts normally.

@pietrushnic

I also have problems with sys-net. During installation of rc2 I chose to have sys-net handle USB, so my PCI USB controller is attached there. When I detach the USB controller, sys-net starts fine, but I have no USB access. This is the error I see when trying to start it from dom0:

Start failed: internal error: Unable to reset PCI device 0000:00:14.0: internal error: libxenlight failed to create new domain 'sys-net'

If I attach the USB controller after boot and then start sys-net, I get:

Start failed: internal error: Unable to reset PCI device 0000:00:14.0: no FLR, PM reset or bus reset available

Not sure if this is related. This behavior didn't appear on R4-rc1.

@aphidfarmer

aphidfarmer commented Oct 28, 2017

Same here...

  • sys-net always starts automatically with no issues
  • sys-firewall often cannot start and eventually fails with qrexec error.

Even if sys-firewall does start successfully, I have similar issues with AppVMs (with no PCI devices) connected to sys-firewall.

@marmarek
Member

How much memory does the system have? Try adjusting initial memory (increase it), or maxmem (decrease it).
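
As a rough illustration of that suggestion, the sketch below builds the dom0 `qvm-prefs` commands that would change a VM's `memory` (initial) and `maxmem` properties. It only prints the commands rather than running them; the function name and the example values are assumptions for illustration, not settings recommended in this thread.

```python
import shlex
from typing import List, Optional


def memory_commands(
    vm: str,
    initial_mb: Optional[int] = None,
    max_mb: Optional[int] = None,
) -> List[str]:
    """Build qvm-prefs command lines for the given memory settings (in MB)."""
    if initial_mb is not None and max_mb is not None and initial_mb > max_mb:
        raise ValueError("initial memory must not exceed maxmem")
    commands = []
    if max_mb is not None:
        commands.append(shlex.join(["qvm-prefs", vm, "maxmem", str(max_mb)]))
    if initial_mb is not None:
        commands.append(shlex.join(["qvm-prefs", vm, "memory", str(initial_mb)]))
    return commands


if __name__ == "__main__":
    # Illustrative values only: raise initial memory, lower maxmem.
    for line in memory_commands("sys-firewall", initial_mb=600, max_mb=2048):
        print(line)
```

Printing the commands first makes it easy to review them before pasting them into a dom0 terminal.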

@marmarek
Member

The comment above is meant to check whether this is related to #2853.

@aphidfarmer

aphidfarmer commented Oct 29, 2017

sys-firewall: 600MB initial, 1GiB max
appvm that sometimes also fails to start: 400MB initial, 2GiB max; same thing if I change to 1GiB/2 GiB.
physical memory: 8 GiB

As tasket mentioned, the issue seems to affect VMs downstream from sys-firewall. If I change my AppVM's netvm from sys-firewall to none, it always starts. If I change it back, startup fails most of the time.

@marmarek
Member

Ok, so this is something different.

@ghost

ghost commented Oct 29, 2017

Having the same issue, but starting a VM multiple times doesn't help.

@tasket
Author

tasket commented Oct 31, 2017

@marmarek
sys-net, sys-firewall and VPN are limited to 400MB with no balancing. All the other VMs have the default 400/3940MB with balancing. I will try increasing min to 600MB on appVMs.

One trick that has worked over the last 2 days:
A small appVM (300/400 RAM) that is isolated (no netvm) has a good chance of starting and then I can subsequently start other, connected VMs.

@tasket
Author

tasket commented Oct 31, 2017

Also, overall system RAM is 8GB.

@aphidfarmer

After qubes-dom0-update, VMs now start consistently for me (whereas before, startup would fail half the time).

@tasket
Author

tasket commented Nov 18, 2017

I'm having similar good luck for the last 24 hrs. since the update, but I'm still keeping my fingers crossed. :)


6 participants