
Too many qrexec requests make the target domain hang #5343

Open
HW42 opened this issue Sep 26, 2019 · 4 comments
Labels
  • affects-4.1: This issue affects Qubes OS 4.1.
  • affects-4.2: This issue affects Qubes OS 4.2.
  • C: core
  • needs diagnosis: Requires technical diagnosis from developer. Replace with "diagnosed" or remove if otherwise closed.
  • P: default: Priority: default. Default priority for new issues, to be replaced given sufficient information.
  • T: bug: Type: bug report. A problem or defect resulting in unintended behavior in something that exists.

Comments


HW42 commented Sep 26, 2019

Qubes OS version

4.0

Affected component(s) or functionality

qrexec

Brief summary

If a domain issues a lot of qrexec requests, the target domain starts to log xenbus: xen store gave: unknown error E2BIG in dmesg. At some point this prevents any qrexec connection to the target domain (even from dom0).

To Reproduce

Create a simple qrexec service and allow it in the policy (see the sketch after the command below).

Open a lot of qrexec connections (you can ignore the local errors).

for i in {1..1000}; do qrexec-client-vm target-vm test & done
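
For illustration, a minimal service and policy could look like this (a sketch; the service name "test" matches the command above, VM names are placeholders, paths follow the R4.0 layout):

    # in the target VM: a trivial service that just echoes its input
    printf '#!/bin/sh\ncat\n' | sudo tee /etc/qubes-rpc/test
    sudo chmod +x /etc/qubes-rpc/test

    # in dom0: allow calls from the source VM
    echo 'source-vm target-vm allow' | sudo tee /etc/qubes-rpc/policy/test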

Expected behavior

There should be some rate limiting to prevent a domain from DoSing another.

@HW42 HW42 added T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists. P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. labels Sep 26, 2019
@HW42 HW42 changed the title Too many qrexec request make the target domain hang Too many qrexec requests make the target domain hang Sep 26, 2019
@andrewdavidwong andrewdavidwong added this to the Release 4.0 updates milestone Sep 26, 2019

marmarek commented Jan 23, 2020

This particular issue (xenbus: xen store gave: unknown error E2BIG) is directly caused by qrexec-fork-server processes waiting for the connection, each holding three xenstore watches. The default per-domain limit is 128, and some watches are also used by the kernel. In practice this means only about 40 processes can wait for connection setup at the same time. The issue is amplified by the VM side of qrexec (qrexec-agent, qrexec-fork-server, qrexec-client-vm) not having a timeout on the vchan connection - only the dom0 side (qrexec-client) has one. So, when multiple waiting qrexec-fork-server processes are using all the available watches, they will never release them until the other side sets up the connection. If the other side has crashed, they will wait forever, preventing further connections. Killing those waiting qrexec-fork-server processes (but not their parent!) resolves the situation.
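
One rough recovery sketch (assuming the waiting children have not exec'd yet and still show the qrexec-fork-server name): in the target VM, kill only the forked per-connection children, i.e. processes whose parent is itself a qrexec-fork-server, and leave the long-lived parent alone:

    # run as root in the affected target VM
    for pid in $(pgrep -x qrexec-fork-server); do
        ppid=$(ps -o ppid= -p "$pid" | tr -d ' ')
        # kill only children whose parent is also qrexec-fork-server
        if ps -o comm= -p "$ppid" | grep -qx qrexec-fork-server; then
            kill "$pid"
        fi
    done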

But that's only one side of the story.
The other is why those connections failed to establish, and I believe the answer is very close to #5344. Specifically, I believe this is what happens:

  1. The source VM requests multiple qrexec connections.

  2. Initial connections are established successfully.

  3. At some point, the source domain runs out of resources; I see these messages:

     gntshr: error: ioctl failed: No space left on device
     Data vchan connection failed
     xs_transaction_start: No space left on device
     Data vchan connection failed
     Data vchan connection failed
     Data vchan connection failed
     Data vchan connection failed
    

    (I've identified the "Data vchan connection failed" lines without any accompanying message as failed xs_transaction_start calls in libxenvchan.)

  4. The target domain doesn't learn about this problem and still tries to establish the data vchan connection.

  5. At some point, after enough of those vchan setups have failed (on the source side), the target domain has accumulated enough waiting processes to run out of allowed xenstore watches.

  6. Now, even when the source domain stops opening new connections and terminates existing ones to free resources, the target domain still has multiple processes waiting for the vchan connection and holding xenstore watches - this prevents further connections to that target domain. (A quick way to confirm this state is sketched below.)
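
To check whether a target domain is stuck in this state, something along these lines should work (a sketch based on the symptoms above):

    # in the target VM
    dmesg | grep 'unknown error E2BIG'   # xenstore watch quota exceeded
    pgrep -xc qrexec-fork-server         # dozens of these waiting suggests exhausted watches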

There are several factors contributing to this issue as a whole:

  1. Lack of timeouts - having a timeout on both sides would at least allow recovering from such a situation by waiting a little.

  2. Lack of vchan setup error reporting - when either side of a connection fails to set up the vchan for any reason, there is no way for the other side (or at least dom0) to learn about it.

  3. Insufficient rate-limiting of qrexec connections. Currently there are two limits:

    • how many outgoing (accepted) connections a domain can have (regardless of the target)
    • how many connections can wait for policy decision at the same time

    Both limits are set to 256, which, as we can see, is well above the available resources.

Solutions:

Timeouts

Adding timeouts should be easy, especially since the dom0 part already has them. We should do that regardless of the other options.

Error reporting

Adding error reporting will most likely require a protocol change and as such may be hard. It also doesn't help in the case of malicious domains.

Limits

I see three (non-exclusive) options:

  • lower the connection limits, to match the resource limits
  • increase resource limits (xenstore watches, xenstore transactions, grant table entries etc)
  • add a third limit: concurrent connections per target domain, and set it to some low value

I think for now the easiest solution is to increase allowed resources, within reason. This is about:

  • grant tables (gntalloc) - we need it much higher in R4.1 anyway, so we can also increase it in R4.0; there are two limits:
    • enforced by the VM kernel (/sys/module/xen_gntalloc/parameters/limit) - the default is 1024; this is mostly a mitigation against a single process consuming too many resources within the domain; we can safely increase it to some absurdly high value (in R4.1 we have it at 2^30) - for R4.0 I'd set it the same as the Xen limit (see below)
    • enforced by Xen; this is more tricky, as this limit is also (I think) about Xen resources (having it too high could cause significant resource consumption on the Xen side) - I think the default is 32 grant table pages, each with 512 entries, i.e. 16k entries; this should be enough for normal qrexec usage
  • xenstore watches - the default is 128; more watches do consume more CPU time in the xenstored process (dom0), but increasing the limit to 512 should still be reasonable
  • xenstore transactions - the default is 10; this is more tricky, as pending transactions are quite expensive; we can try increasing it to 32 and see how it works

To set the above:

  1. Add XENSTORED_ARGS="--transaction=32 --watch-nb=512" to /etc/sysconfig/xencommons in dom0.
  2. Add options xen_gntalloc limit=16384 to /etc/modprobe.d/xen_gntalloc.conf in the template (a sketch of applying both changes follows below).
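
As an illustration (paths as above; the exact commands and the reboot steps are assumptions, not tested instructions):

    # in dom0 (takes effect once xenstored restarts, in practice after a dom0 reboot):
    echo 'XENSTORED_ARGS="--transaction=32 --watch-nb=512"' | sudo tee -a /etc/sysconfig/xencommons

    # in the template (then restart the VMs based on it):
    echo 'options xen_gntalloc limit=16384' | sudo tee /etc/modprobe.d/xen_gntalloc.conf

    # verify inside a restarted VM:
    cat /sys/module/xen_gntalloc/parameters/limit   # should print 16384
    # (the Xen-enforced grant limit mentioned above is a separate, Xen-level setting, not changed here)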

With the above set, I can still trigger the issue if I try very hard (like the command from the issue description), but it should be much less likely to hit it accidentally.


mig5 commented May 12, 2020

I just reproduced this, specifically using qubes-split-ssh via Ansible. I think it was the parallelism of Ansible's SSH calls that triggered it. But conceivably, it could be triggered with something like Split GPG too.

The number of hosts Ansible was calling out to (via qrexec calls to the SSH agent) was 33.

The above config tweaks to dom0 and the template seem to have solved it for me.
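
If Ansible's parallelism is indeed the trigger, one possible additional mitigation (untested here; the value and playbook name are placeholders) is to cap Ansible's fork count so fewer qrexec connections are opened at once:

    # ansible.cfg
    [defaults]
    forks = 2

    # or per invocation
    ansible-playbook -f 2 site.yml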


Hir0-shi commented Aug 7, 2021

I'm running Qubes 4.0 with i3 4.16 and I have multiple qvm-run issues: I can't open a terminal in any AppVM or template, and I get the error xenbus: xen store gave: unknown error E2BIG.

@marmarek I added these two lines, the first in dom0 and the second in each template:

  • Add XENSTORED_ARGS="--transaction=32 --watch-nb=512" to /etc/sysconfig/xencommons in dom0.
  • Add options xen_gntalloc limit=16384 to /etc/modprobe.d/xen_gntalloc.conf in the template.

Now, even with these two options set in dom0 and the templates, I can't use qvm-run, and qubes-dom0-update freezes:

qvm-run -u root sys-net xterm
Running 'xterm' on sys-net
sys-net: command failed with code: 1
sudo qubes-dom0-update 
Using sys-firewall as UpdateVm to download udpates for Dom0, this may take some time ...
(no more output; I just get the prompt back)
sudo qubes-dom0-update rofi
Using sys-firewall as UpdateVm to download updates from Dom0, this may take some time...
Traceback (most recent call last):
  sys.exit(main())
  File "/usr/lib/python3.5/site-packages/qubesadmin/tools/qvm_run.py", line 281, in main
    proc, copy_proc, local_proc = run_command_single(args, mv)
    ...
    BrokenPipeError: [Errno 32] Broken pipe
qvm-copy-to-vm fedora03 file01
gntshr: error: ioctl failed: No space left on device
Failed to start data vchan server

@andrewdavidwong andrewdavidwong added the eol-4.0 Closed because Qubes 4.0 has reached end-of-life (EOL) label Aug 5, 2023

github-actions bot commented Aug 5, 2023

This issue is being closed because:

  • Qubes OS 4.0 has reached end-of-life (EOL).

If anyone believes that this issue should be reopened and reassigned to an active milestone, please leave a brief comment.
(For example, if a bug still affects Qubes OS 4.1, then the comment "Affects 4.1" will suffice.)

@github-actions github-actions bot closed this as not planned Aug 5, 2023
@marmarek marmarek added affects-4.1 This issue affects Qubes OS 4.1. affects-4.2 This issue affects Qubes OS 4.2. labels Sep 2, 2023
@marmarek marmarek removed this from the Release 4.0 updates milestone Sep 2, 2023
@marmarek marmarek reopened this Sep 2, 2023
@andrewdavidwong andrewdavidwong added needs diagnosis Requires technical diagnosis from developer. Replace with "diagnosed" or remove if otherwise closed. and removed eol-4.0 Closed because Qubes 4.0 has reached end-of-life (EOL) labels Sep 2, 2023