Close main process copy of pipe when sampling in parallel #3988
Conversation
Does this relate to any of the bugs we see in MS related to multi process?
LGTM but I will ask @lucianopaz for a second look.
It would be great if anyone could also test this on Windows. My virtual machine is refusing to work again, and all that multiprocessing stuff tends to be different on Windows.
I added a test that I ran locally on Windows - checks out ✌
I think that test has an issue: every time the test is executed, the main thread will also run the function that might segfault. In that case pytest will be quite unhappy. This should happen in 1% of test runs, which would be very annoying. Maybe we can store the pid of the main process when we define the function and only segfault when `os.getpid() != main_pid` inside the function.
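The guard suggested above can be sketched without any pymc3 machinery. This is a minimal illustration (the names `MAIN_PID` and `maybe_segfault` are mine, not from the PR): the main process pid is captured when the function is defined, so the crash can only ever fire inside a forked worker.

```python
import ctypes
import os

# Capture the pid of the main process at definition time.
MAIN_PID = os.getpid()

def maybe_segfault():
    # Only crash when we are NOT the main process, i.e. inside a worker.
    if os.getpid() != MAIN_PID:
        # Dereference a NULL pointer -> segfault, but only in workers.
        ctypes.string_at(0)
    return "safe in main process"
```

Called from the main process, the function is always safe; the same function shipped to a forked worker would crash it on purpose, which is exactly what the test needs.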
> Does this relate to any of the bugs we see in MS related to multi process?

Possibly, but I don't know.
This looks great and I would merge it after the new tests pass!
However, it doesn't look related to the broken pipe errors we used to see on Windows and, more recently, Mac. Those happen while the processes are being spawned with `self._process.start()` (so even before we close the connection). I never figured out how to catch the exceptions that happen while spawning the workers, so that the leader process could raise them to the user instead of producing useless broken-pipe errors.
The pid really shouldn't stay the same. I'm 100% sure about that on Linux at least (see e.g. https://pubs.opengroup.org/onlinepubs/9699919799/functions/fork.html).

```python
import pymc3 as pm
import theano
import theano.tensor as tt
import random
import ctypes
import numpy as np
import os

master_pid = os.getpid()

@theano.as_op([tt.dvector], [tt.dvector])
def somefunc(a):
    if random.random() < 1 and os.getpid() != master_pid:
        # Segfault
        ctypes.string_at(0)
    return 2 * np.array(a)

with pm.Model() as model:
    x = pm.Normal('x', shape=2)
    pm.Normal('y', mu=somefunc(x), shape=2)
    step = pm.Metropolis()
    trace = pm.sample(step=step)
```
One thing we could do to improve error messages when spawning fails is to first start a new process that does nothing but unpickle the args and return 0. If we then check the return code of that process, we can at least tell that something went wrong because of pickling, right?
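The pre-flight check proposed above could look roughly like this. This is only a sketch of the idea, not pymc3 code; the names `_probe` and `pickling_roundtrip_ok` are illustrative, and it assumes the default fork start method on Linux:

```python
import multiprocessing
import pickle

def _probe(payload):
    # Worker entry point: do nothing but unpickle the arguments.
    pickle.loads(payload)

def pickling_roundtrip_ok(args):
    """Spawn a throwaway process that only unpickles ``args`` and exits.

    A nonzero exit code would point at a (un)pickling problem in the
    worker, before any real sampling work has started.
    """
    payload = pickle.dumps(args)
    proc = multiprocessing.Process(target=_probe, args=(payload,))
    proc.start()
    proc.join()
    # Process.exitcode is 0 on a clean exit, negative on a fatal signal.
    return proc.exitcode == 0
```

If the probe exits nonzero, the leader process can raise a targeted "could not unpickle sampling arguments" error instead of a generic broken pipe.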
Co-authored-by: Junpeng Lao <[email protected]>
This is easier to test with the changes in #3991, so I included the fix and test there.
When starting parallel sampling, we first create a pipe for communication. We pass one end to the new worker process, but we should still close our own local copy, so that the pipe breaks when the remote process dies for some reason.
If we don't, then when a worker dies the main process will wait for new samples; since there is still an open end of the pipe, it will not exit with a
ConnectionResetError
but wait indefinitely. Closing the connection prevents this: when a worker dies, sampling stops and errors out.
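The mechanism can be demonstrated without pymc3 at all. In this sketch (names are mine; it assumes the fork start method), the parent closes its copy of the child's pipe end right after spawning, so the second `recv()` raises `EOFError` once the worker is gone instead of blocking forever:

```python
import multiprocessing

def _worker(conn):
    conn.send("ready")
    conn.close()  # simulate the worker going away after one message

def demo_broken_pipe():
    parent_conn, child_conn = multiprocessing.Pipe()
    proc = multiprocessing.Process(target=_worker, args=(child_conn,))
    proc.start()
    # The crucial step from this PR: close the main process copy of the
    # child's end. Without this, the pipe keeps an open writer in our own
    # process and recv() below would block forever instead of raising.
    child_conn.close()
    first = parent_conn.recv()
    try:
        parent_conn.recv()  # worker is gone -> the pipe "breaks"
        broke = False
    except EOFError:
        broke = True
    proc.join()
    return first, broke
```

Commenting out the `child_conn.close()` line turns the `EOFError` into an indefinite hang, which is exactly the failure mode described above.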
We can test this behaviour by producing a segfault on purpose.
Unfortunately, there is a small probability that this will also segfault the main process, so I don't really know how to turn it into a proper test.