dispVM halts prematurely with custom .desktop file involving `qrexec-client-vm` #3318

joshuathayer · 2017-11-17T22:38:53Z

Qubes OS version:

4.0rc2 (dom0 and templates up to date with testing repos)

Affected TemplateVMs:

At least fedora-25

Steps to reproduce the behavior:

I have a situation where I'm trying to open a particular file format in a disposable VM, from a (Whonix-based) AppVM, eg:

$ qvm-open-in-vm '$dispvm:work' ./foo.custom-ext

In my work VM I've configured xdg to associate .custom-ext to a python script. That script processes the given file, and as a side effect there's a call to qrexec-client-vm, eg:

...
p = subprocess.Popen(["qrexec-client-vm", "target-vm", "custom.Process"], stdin=subprocess.Pipe)
p.communicate("message")
...

Expected behavior:

I'd expect:

a disposable VM based on my work VM to start
my script to start running
the RPC call in the script to successfully complete
the script to continue processing until it exits
finally, the dispVM to shut down

Actual behavior:

My script seems to run as far as the call to qrexec (which completes successfully), and then the entire VM gets shut down. While the machine is shutting down, my script is still running: based on writes to STDERR, it's able to make some inconsistent number of lines before the machine is down.

General notes:

Removing the Popen call from my script allows it to run and complete normally.

Running the command on a non-disposable VM allows the script to run and complete normally, eg:

$ qvm-open-in-vm work ./foo.custom-ext

works great.

I suspected the issue may have had something to do with my script's STDIN/OUT/ERR being passed to the spawned child, but the following didn't change behavior at all:

...
fd = os.open("/dev/null", os.O_RDWR)
p = subprocess.Popen(["qrexec-client-vm", "target-vm", "custom.Process"],
      stdin=subprocess.Pipe, stdout=fd, stderr=fd)
...

For debugging, i tried making other calls which involve a fork() in my processing script, and those calls seemed to complete without error and without causing the VM to halt. Only the call to qrexec-client-vm caused the machine to halt.

Also for debugging I tried not using a pipe for stdin (instead using the same /dev/null filehandle), and that also didn't change matters.

I'm stumped. I'm not sure this is a Qubes bug as much as my not understanding something, but I do expect the behavior I noted above. Thanks for taking a look!

Related issues:

The text was updated successfully, but these errors were encountered:

marmarek · 2017-11-18T01:09:23Z

Just to be sure - have you checked the script isn't simply crashing? For example it should be subprocess.PIPE, not subprocess.Pipe. Have you tried running this in non-DisposableVM first?

joshuathayer · 2017-11-18T03:40:55Z

@marmarek, yes, I've run this on a non-dispVM (the "base" VM the dispVM is based on) and it runs without problem.

And right, subprocess.PIPE is correct... sorry for the typo here. It's correct in my script. This code is public (part of this SecureDrop project), so I'll commit the branch I'm working on tomorrow and link to the code in question here.

joshuathayer · 2017-11-20T22:58:13Z

I've come up with a minimal example of this behavior. It seems to be triggered when making qrexec calls to a disposable target VM, where the target VM then makes its own RPC calls back to the source VM in the course of handling the initial RPC call.

In this example we're going to make an RPC call from personal to a disposable VM based on work. On personal, we'll run qrexec-client-vm work bug.PartOne, then the invoked script will call qrexec-client-vm personal bug.PartTwo. We'll see that while that process works fine on a non-disposable VM, it fails when the target VM is disposable.

On work, create /rw/config/bug.PartOne with the following, and chmod 0755 it:

#!/usr/bin/python

import subprocess
import sys
import time

sys.stderr.write("Hello from bug-part-one. Going to qrexec bug.PartTwo now\n")

subprocess.Popen(["qrexec-client-vm","personal","bug.PartTwo"])

i = 0
while i < 10:
  time.sleep(1)
  subprocess.Popen(["qrexec-client-vm","personal","bug.PartTwo"])
  i += 1

Add the following to /rw/config/rc.local on work:

ln -s /rw/config/bug.PartOne /etc/qubes-rpc/

And also run that command in your running work VM.

On personal, create /rw/config/bug.PartTwo with the following, and chmod 0755 it:

#!/usr/bin/python

import sys

sys.stderr.write("Hello from bug part two!\n")

Although it's not strictly necessary for this demo, to /rw/config/rc.local on personal add:

ln -s /rw/config/bug.PartTwo /etc/qubes-rpc/

and also run that command on your running personal VM.

On dom0, create /etc/qubes-rpc/policy/bug.PartOne and /etc/qubes-rpc/policy/bug.PartTwo with the following:

$anyvm $anyvm allow

And allow work to be a dispvm template:

$ qvm-prefs work template-for-dispvms True

Now we'll all set up. On personal, run:

$ qrexec-client-vm 'work' bug.PartOne

You should see the expected output:

$ qrexec-client-vm 'work' bug.PartOne
Hello from bug-part-one. Going to qrexec bug.PartTwo now
Hello from bug part two!
Hello from bug part two!
Hello from bug part two!
Hello from bug part two!
Hello from bug part two!
Hello from bug part two!
Hello from bug part two!
Hello from bug part two!
Hello from bug part two!
Hello from bug part two!
Hello from bug part two!
$ _

OK, that's what we expect. Now run the same thing on a disposable VM using work as a template:

[user@personal ~]$ qrexec-client-vm '$dispvm:work' bug.PartOne
Hello from bug-part-one. Going to qrexec bug.PartTwo now
Hello from bug part two!

The call never returns, and the disposable VM that was created has crashed.

I believe that if bug.PartTwo is run from the disposable VM targeting a different VM from the initial source (devel, for example), the process runs fine.

… port (unconfirmed theory) Only one side of the connection should report MSG_CONNECTION_TERMINATED - the one where vchan port was allocated. This is especially important for VM-VM connection, because when two such connections are established between the same domains, in opposite directions - both will share the same vchan port number, so qrexec-daemon will get confused which connection was terminated. QubesOS/qubes-issues#3318

marmarek · 2017-11-21T03:39:30Z

I think this is about qrexec connection cleanup. When you execute two qrexec connections between the same domains (and those are the only connections), but in opposite directions, they will be assigned the same vchan port number. This is normally ok, because connection direction differs (different VM act as a server). But connection cleanup code got confused and report wrong connection being closed. The only code using that report in practice is cleanup of DispVM.

I've pushed preliminary attempt to fix this, but it doesn't work yet.

joshuathayer · 2017-11-21T03:48:03Z

@marmarek thanks for your prompt attention! I suspected something along those lines, but had no idea where to start looking... I'll be interested to take a close look at your patch. Is there documentation anywhere about setting up a development environment for building and testing these core libraries? I'd rather be submitting patches than reporting bugs ;)

marmarek · 2017-11-21T04:04:46Z

See here: https://www.qubes-os.org/doc/development-workflow/
For testing I suggest using cloned template (to not brick to many pieces at once).
There is also (really poor :/) documentation of qrexec internals: https://www.qubes-os.org/doc/qrexec3/#qrexec-protocol-details

As for the bug - the dom0 counterpart is here: https://github.com/QubesOS/qubes-core-admin-linux/blob/master/qrexec/qrexec-daemon.c#L387-L517 and https://github.com/QubesOS/qubes-core-admin-linux/blob/master/qrexec/qrexec-client.c#L713-L805. Unfortunately the logic about vchan port allocation is quite convoluted (the second link), mostly because dom0 is special - do not have own qrexec-agent/qrexec-daemon pair.

BTW if you want to install qrexec service in one VM only, you can use /usr/local/etc/qubes-rpc (which is part of /rw), instead of ln -s in rc.local.

joshuathayer · 2017-11-21T04:14:20Z

Thanks so much, I'll spend some time with that tomorrow.

unman · 2019-08-02T13:03:27Z

@andrewdavidwong
The test scripts provided by @joshuathayer now seem to work as expected, and output is same between 'target' and 'dispVM:target'
Can be closed.

rustybird · 2019-08-02T14:31:03Z

@unman:

The test scripts provided by @joshuathayer now seem to work as expected, and output is same between 'target' and 'dispVM:target'

Are you on R4.1? I'm using R4.0 current-testing, and the one-liner test command from #4273 (which is a duplicate of this bug) still crashes the DispVM.

marmarek · 2019-08-02T14:35:58Z

For me this one-liner test works fine.
VM: qubes-core-agent-4.0.46
dom0: qubes-core-dom0-linux-4.0.18

unman · 2019-08-02T14:39:18Z

On Fri, Aug 02, 2019 at 07:31:07AM -0700, Rusty Bird wrote: @unman: > The test scripts provided by @joshuathayer now seem to work as expected, and output is same between 'target' and 'dispVM:target' Are you on R4.1? I'm using R4.0 current-testing, and the one-liner test command from #4273 (which is a duplicate of this bug) still crashes the DispVM after a few seconds.

4.0 current-testing

rustybird · 2019-08-02T15:06:57Z

Hmm, strange. I have qubes-core-agent-4.0.47 in the Fedora 30 based sourcevm (and qubes-core-dom0-linux-4.0.18 in dom0). Its default_dispvm is also Fedora 30 based.

I run qvm-run-vm --dispvm '/usr/lib/qubes/qrexec-client-vm \$default qubes.Filecopy </dev/null; echo sleeping supposedly forever; sleep inf' in sourcevm, wait a few seconds for the DispVM to start, then there's a Filecopy dom0 permission popup where I choose sourcevm as the target. The still running qvm-run-vm command quickly prints sleeping supposedly forever. After about 2 seconds, there's a Domain disp123 has halted notification. After another 5 seconds or so, qvm-run-vm exits, and /var/log/qubes/qrexec.disp123.log in dom0 contains libvchan_is_eof.

marmarek · 2019-08-02T16:44:28Z

Ok, I can confirm it's still broken. I think it depends on something during source vm startup. After some source vm reboots it works reliably, but after some, it's broken.

rustybird · 2019-08-02T17:14:43Z

Just a shot in the dark, but it seems plausible if there's indeed some VM startup thing: My computer is pretty slow, and the bug always triggers. So maybe putting your (presumably faster) system under high CPU and/or I/O load during e.g. sourcevm startup would also more consistently reproduce it for you.

marmarek · 2019-08-02T17:59:03Z

Yes, it's likely.

I guess it's related to connection identifier / vchan port number, which are assigned on first-free basis. See #3318 (comment)

pwmarcz · 2020-02-10T17:33:47Z

I think I can confirm what happens in this case.

I added a debug print to qrexec-daemon whenever it allocated a port. Then I reproduced the issue:

sourcevm $ qvm-run-vm --dispvm '/usr/lib/qubes/qrexec-client-vm \$default qubes.Filecopy </dev/null; echo sleeping supposedly forever; sleep inf'

Here is the daemon log from source VM (with identifiers replaced):

connect_port: 513, connect_domain: DISP_VM, type: MSG_EXEC_CMDLINE
releasing port 513 to DISP_VM
eintr

And here is one from the disp VM:

connect_port: 513, connect_domain: SOURCE_VM, type: MSG_EXEC_CMDLINE
connect_port: 514, connect_domain: 0, type: MSG_EXEC_CMDLINE
connect_port: 515, connect_domain: 0, type: MSG_SERVICE_CONNECT
releasing port 515 to 0
eintr
releasing port 514 to 0
connect_port: 514, connect_domain: 0, type: MSG_SERVICE_CONNECT
eintr
releasing port 513 to SOURCE_VM, terminating client
libvchan_is_eof

Looks like there are two connections over port 513:

original qvm-run, in which SOURCE_VM is the vchan server
- this has a corresponding qrexec-client -W connected to DISP_VM and waiting for end (because this is a dispvm)
nested qrexec-client-vm, in which DISP_VM is the vchan server

Then, the second qrexec-client-vm finishes and disconnects from qrexec-agent for DISP_VM, which notifies qrexec-daemon for DISP_VM about connection end, which (incorrectly) kills the qrexec-client -W connected, just because it has the same port number.

pwmarcz · 2020-02-10T17:39:22Z

Initially I thought this is a different kind of collision: the "allocated port" is sometimes allocated for a vchan client (in case of MSG_EXEC_CMDLINE), and sometimes for a vchan server (in case of MSG_SERVICE_CONNECT). I still think that can cause problems, but here we have a case of two MSG_EXEC_CMDLINE requests.

I think a good move would be to prevent any kinds of collisions between two daemons. In other words:

When asking for domX <--> domY connection, the domX daemon and the domY daemon should never issue the same port number.

A simple scheme to do that would be:

If X < Y, then domX daemon issues only odd port numbers (513, 515, 517...) for a connection to domY.
Otherwise, domX daemon issues only even port numbers (514, 516, 518...) to domY.

I ran a proof-of-concept and it seems to solve this specific bug.

With the current protocol, I think it's cleaner than tracking where a given connection came from, who was running a server, etc.

Should prevent the situations where a wrong vchan connection is terminated, which causes e.g. premature termination of a DispVM. See longer discussion in QubesOS/qubes-issues#3318.

rustybird · 2020-05-08T16:01:36Z

@marmarek: Have you considered cherry-picking this patch for R4.0? It applies cleanly in qubes-core-admin-linux (after adjusting the file path).

Should prevent the situations where a wrong vchan connection is terminated, which causes e.g. premature termination of a DispVM. See longer discussion in QubesOS/qubes-issues#3318.

marmarek · 2020-05-10T01:15:49Z

Backported, included in QubesOS/updates-status#1787

andrewdavidwong added the C: other label Nov 18, 2017

andrewdavidwong added this to the Release 4.0 milestone Nov 18, 2017

joshuathayer mentioned this issue Jan 9, 2018

Work around Qubes' qrexec bug freedomofpress/securedrop-workstation#46

Closed

andrewdavidwong added T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists. C: core and removed C: other labels Mar 31, 2018

andrewdavidwong modified the milestones: Release 4.0, Release 4.0 updates Mar 31, 2018

marmarek mentioned this issue Sep 4, 2018

Any qrexec call from DispVM to opener VM kills DispVM #4273

Closed

andrewdavidwong added the P: major Priority: major. Between "default" and "critical" in severity. label Aug 3, 2019

pierwill mentioned this issue Aug 8, 2019

[WIP] Rewrite qrexec documentation (Phase 1) QubesOS/qubes-doc#847

Merged

deeplow mentioned this issue Nov 13, 2019

Investigate per-source dispVM freedomofpress/securedrop-workstation#333

Open

pwmarcz mentioned this issue Feb 11, 2020

daemon: prevent collisions between port numbers QubesOS/qubes-core-qrexec#23

Merged

marmarek closed this as completed Feb 24, 2020

qubesos-bot mentioned this issue Mar 25, 2020

core-qrexec v4.1.4 (r4.1) QubesOS/updates-status#1712

Closed

qubesos-bot mentioned this issue May 10, 2020

core-admin-linux v4.0.25 (r4.0) QubesOS/updates-status#1787

Closed

qubesos-bot mentioned this issue Jun 7, 2020

core-admin-linux v4.0.26 (r4.0) QubesOS/updates-status#1859

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dispVM halts prematurely with custom .desktop file involving `qrexec-client-vm` #3318

dispVM halts prematurely with custom .desktop file involving `qrexec-client-vm` #3318

joshuathayer commented Nov 17, 2017

marmarek commented Nov 18, 2017

joshuathayer commented Nov 18, 2017

joshuathayer commented Nov 20, 2017

marmarek commented Nov 21, 2017

joshuathayer commented Nov 21, 2017

marmarek commented Nov 21, 2017

joshuathayer commented Nov 21, 2017

unman commented Aug 2, 2019

rustybird commented Aug 2, 2019 •

edited

Loading

marmarek commented Aug 2, 2019

unman commented Aug 2, 2019 via email

rustybird commented Aug 2, 2019

marmarek commented Aug 2, 2019

rustybird commented Aug 2, 2019

marmarek commented Aug 2, 2019

pwmarcz commented Feb 10, 2020

pwmarcz commented Feb 10, 2020

rustybird commented May 8, 2020 •

edited

Loading

marmarek commented May 10, 2020

dispVM halts prematurely with custom .desktop file involving qrexec-client-vm #3318

dispVM halts prematurely with custom .desktop file involving qrexec-client-vm #3318

Comments

joshuathayer commented Nov 17, 2017

Qubes OS version:

Affected TemplateVMs:

Steps to reproduce the behavior:

Expected behavior:

Actual behavior:

General notes:

Related issues:

marmarek commented Nov 18, 2017

joshuathayer commented Nov 18, 2017

joshuathayer commented Nov 20, 2017

marmarek commented Nov 21, 2017

joshuathayer commented Nov 21, 2017

marmarek commented Nov 21, 2017

joshuathayer commented Nov 21, 2017

unman commented Aug 2, 2019

rustybird commented Aug 2, 2019 • edited Loading

marmarek commented Aug 2, 2019

unman commented Aug 2, 2019 via email

rustybird commented Aug 2, 2019

marmarek commented Aug 2, 2019

rustybird commented Aug 2, 2019

marmarek commented Aug 2, 2019

pwmarcz commented Feb 10, 2020

pwmarcz commented Feb 10, 2020

rustybird commented May 8, 2020 • edited Loading

marmarek commented May 10, 2020

dispVM halts prematurely with custom .desktop file involving `qrexec-client-vm` #3318

dispVM halts prematurely with custom .desktop file involving `qrexec-client-vm` #3318

rustybird commented Aug 2, 2019 •

edited

Loading

rustybird commented May 8, 2020 •

edited

Loading