
Too many qrexec requests make the target domain hang #5343

Open
HW42 opened this issue Sep 26, 2019 · 4 comments
Labels
  • affects-4.1: This issue affects Qubes OS 4.1.
  • affects-4.2: This issue affects Qubes OS 4.2.
  • C: core
  • needs diagnosis: Requires technical diagnosis from developer. Replace with "diagnosed" or remove if otherwise closed.
  • P: default: Priority: default. Default priority for new issues, to be replaced given sufficient information.
  • T: bug: Type: bug report. A problem or defect resulting in unintended behavior in something that exists.

Comments


HW42 commented Sep 26, 2019

Qubes OS version

4.0

Affected component(s) or functionality

qrexec

Brief summary

If a domain issues a lot of qrexec requests, the target domain starts to log xenbus: xen store gave: unknown error E2BIG in dmesg. At some point this prevents any qrexec connection to the target domain (even from dom0).

To Reproduce

Create a simple qrexec service and allow it in the policy (see the sketch after the command below).

Open a lot of qrexec connections (you can ignore the local errors).

for i in {1..1000}; do qrexec-client-vm target-vm test & done
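
For illustration, a minimal service and policy could look like this (a sketch; the service name "test" matches the command above, VM names are placeholders, paths follow the R4.0 layout):

    # in the target VM: a trivial service that just echoes its input
    printf '#!/bin/sh\ncat\n' | sudo tee /etc/qubes-rpc/test
    sudo chmod +x /etc/qubes-rpc/test

    # in dom0: allow calls from the source VM
    echo 'source-vm target-vm allow' | sudo tee /etc/qubes-rpc/policy/test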

Expected behavior

There should be some rate limiting to prevent a domain from DoSing another.

@HW42 HW42 added T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists. P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. labels Sep 26, 2019
@HW42 HW42 changed the title Too many qrexec request make the target domain hang Too many qrexec requests make the target domain hang Sep 26, 2019
@andrewdavidwong andrewdavidwong added this to the Release 4.0 updates milestone Sep 26, 2019

marmarek commented Jan 23, 2020

This particular issue (xenbus: xen store gave: unknown error E2BIG) is directly caused by qrexec-fork-server processes waiting for the connection, each holding three xenstore watches. The default per-domain limit is 128, and some watches are also used by the kernel. In practice this means only about 40 processes can wait for connection setup at the same time. The issue is amplified by the VM side of qrexec (qrexec-agent, qrexec-fork-server, qrexec-client-vm) not having a timeout on the vchan connection - only the dom0 side (qrexec-client) has one. So, when multiple waiting qrexec-fork-server processes are using all the available watches, they will never release them until the other side sets up the connection. If the other side has crashed, they will wait forever, preventing further connections. Killing those waiting qrexec-fork-server processes (but not their parent!) resolves the situation.
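
One rough recovery sketch (assuming the waiting children have not exec'd yet and still show the qrexec-fork-server name): in the target VM, kill only the forked per-connection children, i.e. processes whose parent is itself a qrexec-fork-server, and leave the long-lived parent alone:

    # run as root in the affected target VM
    for pid in $(pgrep -x qrexec-fork-server); do
        ppid=$(ps -o ppid= -p "$pid" | tr -d ' ')
        # kill only children whose parent is also qrexec-fork-server
        if ps -o comm= -p "$ppid" | grep -qx qrexec-fork-server; then
            kill "$pid"
        fi
    done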

But that's only one side of the story.
The other is why those connections failed to establish, and I believe the answer is very close to #5344. Specifically, I believe this is what happens:

  1. The source VM requests multiple qrexec connections.

  2. Initial connections are established successfully.

  3. At some point, the source domain runs out of resources; I see these messages:

     gntshr: error: ioctl failed: No space left on device
     Data vchan connection failed
     xs_transaction_start: No space left on device
     Data vchan connection failed
     Data vchan connection failed
     Data vchan connection failed
     Data vchan connection failed
    

    (I've identified the "Data vchan connection failed" lines without any accompanying message as failed xs_transaction_start calls in libxenvchan.)

  4. The target domain doesn't learn about this problem and still tries to establish the data vchan connection.

  5. At some point, after enough of those vchan setups have failed (on the source side), the target domain has accumulated enough waiting processes to run out of allowed xenstore watches.

  6. Now, even when the source domain stops opening new connections and terminates existing ones to free resources, the target domain still has multiple processes waiting for the vchan connection and holding xenstore watches - this prevents further connections to that target domain. (A quick way to confirm this state is sketched below.)
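
To check whether a target domain is stuck in this state, something along these lines should work (a sketch based on the symptoms above):

    # in the target VM
    dmesg | grep 'unknown error E2BIG'   # xenstore watch quota exceeded
    pgrep -xc qrexec-fork-server         # dozens of these waiting suggests exhausted watches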

There are several factors contributing to this issue as a whole:

  1. Lack of timeouts - having a timeout on both sides would at least allow recovering from such a situation by waiting a little.

  2. Lack of vchan setup error reporting - when either side of a connection fails to set up the vchan for any reason, there is no way for the other side (or at least dom0) to learn about it.

  3. Insufficient rate-limiting of qrexec connections. Currently there are two limits:

    • how many outgoing (accepted) connections a domain can have (regardless of the target)
    • how many connections can wait for policy decision at the same time

    Both limits are set to 256, which, as we can see, is well above the available resources.

Solutions:

Timeouts

Adding timeouts should be easy, especially since the dom0 part already has them. We should do that regardless of the other options.

Error reporting

Adding error reporting will most likely require a protocol change and as such may be hard. It also doesn't help in the case of malicious domains.

Limits

I see three (non-exclusive) options:

  • lower the connection limits, to match the resource limits
  • increase resource limits (xenstore watches, xenstore transactions, grant table entries etc)
  • add a third limit: concurrent connections per target domain, and set it to some low value

I think for now the easiest solution is to increase allowed resources, within reason. This is about:

  • grant tables (gntalloc) - we need it much higher in R4.1 anyway, so we can also increase it in R4.0; there are two limits:
    • enforced by the VM kernel (/sys/module/xen_gntalloc/parameters/limit) - the default is 1024; this is mostly a mitigation against a single process consuming too many resources within the domain; we can safely increase it to some absurdly high value (in R4.1 we have it at 2^30) - for R4.0 I'd set it the same as the Xen limit (see below)
    • enforced by Xen; this is more tricky, as this limit is also (I think) about Xen resources (having it too high could cause significant resource consumption on the Xen side) - I think the default is 32 grant table pages, each with 512 entries, i.e. 16k entries; this should be enough for normal qrexec usage
  • xenstore watches - the default is 128; more watches do consume more CPU time in the xenstored process (dom0), but increasing the limit to 512 should still be reasonable
  • xenstore transactions - the default is 10; this is more tricky, as pending transactions are quite expensive; we can try increasing it to 32 and see how it works

To set the above:

  1. Add XENSTORED_ARGS="--transaction=32 --watch-nb=512" to /etc/sysconfig/xencommons in dom0.
  2. Add options xen_gntalloc limit=16384 to /etc/modprobe.d/xen_gntalloc.conf in the template (a sketch of applying both changes follows below).
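
As an illustration (paths as above; the exact commands and the reboot steps are assumptions, not tested instructions):

    # in dom0 (takes effect once xenstored restarts, in practice after a dom0 reboot):
    echo 'XENSTORED_ARGS="--transaction=32 --watch-nb=512"' | sudo tee -a /etc/sysconfig/xencommons

    # in the template (then restart the VMs based on it):
    echo 'options xen_gntalloc limit=16384' | sudo tee /etc/modprobe.d/xen_gntalloc.conf

    # verify inside a restarted VM:
    cat /sys/module/xen_gntalloc/parameters/limit   # should print 16384
    # (the Xen-enforced grant limit mentioned above is a separate, Xen-level setting, not changed here)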

With the above set, I can still trigger the issue if I try very hard (like the command from the issue description), but it should be much less likely to hit it accidentally.


mig5 commented May 12, 2020

I just reproduced this, specifically using qubes-split-ssh via Ansible. I think it was the parallelism of Ansible's SSH calls that triggered it. But conceivably, it could be triggered with something like Split GPG too.

The number of hosts Ansible was calling out to (via qrexec calls to the SSH agent) was 33.

The above config tweaks to dom0 and the template seem to have solved it for me.
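
If Ansible's parallelism is indeed the trigger, one possible additional mitigation (untested here; the value and playbook name are placeholders) is to cap Ansible's fork count so fewer qrexec connections are opened at once:

    # ansible.cfg
    [defaults]
    forks = 2

    # or per invocation
    ansible-playbook -f 2 site.yml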


Hir0-shi commented Aug 7, 2021

I'm running Qubes 4.0 with i3 4.16 and I have multiple qvm-run issues: I can't open a terminal in any AppVM or template, and I get the error xenbus: xen store gave: unknown error E2BIG.

@marmarek I added these two lines, the first in dom0 and the second in each template:

  • Add XENSTORED_ARGS="--transaction=32 --watch-nb=512" to /etc/sysconfig/xencommons in dom0.
  • Add options xen_gntalloc limit=16384 to /etc/modprobe.d/xen_gntalloc.conf in the template.

Now, even with these two options set in dom0 and the templates, I can't use qvm-run, and qubes-dom0-update freezes:

qvm-run -u root sys-net xterm
Running 'xterm' on sys-net
sys-net: command failed with code: 1
sudo qubes-dom0-update 
Using sys-firewall as UpdateVm to download udpates for Dom0, this may take some time ...
(no more output; I just get the prompt back)
sudo qubes-dom0-update rofi
Using sys-firewall as UpdateVm to download updates from Dom0, this may take some time...
Traceback (most recent call last):
  sys.exit(main())
  File "/usr/lib/python3.5/site-packages/qubesadmin/tools/qvm_run.py", line 281, in main
    proc, copy_proc, local_proc = run_command_single(args, mv)
    ...
    BrokenPipeError: [Errno 32] Broken pipe
qvm-copy-to-vm fedora03 file01
gntshr: error: ioctl failed: No space left on device
Failed to start data vchan server

@andrewdavidwong andrewdavidwong added the eol-4.0 Closed because Qubes 4.0 has reached end-of-life (EOL) label Aug 5, 2023

github-actions bot commented Aug 5, 2023

This issue is being closed because:

  • Qubes OS 4.0 has reached end-of-life (EOL).

If anyone believes that this issue should be reopened and reassigned to an active milestone, please leave a brief comment.
(For example, if a bug still affects Qubes OS 4.1, then the comment "Affects 4.1" will suffice.)

@github-actions github-actions bot closed this as not planned Aug 5, 2023
@marmarek marmarek added affects-4.1 This issue affects Qubes OS 4.1. affects-4.2 This issue affects Qubes OS 4.2. labels Sep 2, 2023
@marmarek marmarek removed this from the Release 4.0 updates milestone Sep 2, 2023
@marmarek marmarek reopened this Sep 2, 2023
@andrewdavidwong andrewdavidwong added needs diagnosis Requires technical diagnosis from developer. Replace with "diagnosed" or remove if otherwise closed. and removed eol-4.0 Closed because Qubes 4.0 has reached end-of-life (EOL) labels Sep 2, 2023