Close main process copy of pipe when sampling in parallel #3988

Closed · wants to merge 6 commits

Conversation

aseyboldt
Member

When starting parallel sampling, we first create a pipe for communication. We pass one end to the new worker process, but we should also close our own local copy of that end, so that the pipe breaks when the remote process dies for some reason.
If we don't, then when a worker dies the main process keeps waiting for new samples: because an end of the pipe is still open in the main process, the wait does not fail with a ConnectionResetError but blocks indefinitely.
Closing the connection prevents this, so when a worker dies, sampling stops and errors out.
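The failure mode can be sketched with the stdlib multiprocessing module directly (this is an illustrative reduction, not the PyMC3 internals; it assumes a POSIX system with the fork start method, and `worker`/`run` are made-up names):

```python
import multiprocessing as mp
import os
import signal

ctx = mp.get_context("fork")  # deterministic on POSIX; spawn behaves differently

def worker(conn):
    # Send one message, then die abruptly, simulating a segfault.
    conn.send("sample")
    os.kill(os.getpid(), signal.SIGKILL)

def run():
    parent_conn, child_conn = ctx.Pipe()
    proc = ctx.Process(target=worker, args=(child_conn,))
    proc.start()
    # The fix from this PR: close the main process's copy of the child end.
    # Without this line, the second recv() below would block forever instead
    # of raising, because this process still holds an open writer on the pipe.
    child_conn.close()
    first = parent_conn.recv()  # the buffered message still arrives
    try:
        parent_conn.recv()      # worker is dead; the pipe is now broken
        outcome = "received"
    except (EOFError, ConnectionResetError):
        outcome = "broken-pipe-detected"
    proc.join()
    return first, outcome

if __name__ == "__main__":
    print(run())
```

With `child_conn.close()` commented out, the second `recv()` hangs, which is exactly the symptom described above.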
We can test this behaviour by producing a segfault on purpose:

import pymc3 as pm
import theano
import theano.tensor as tt
import random
import ctypes
import numpy as np

@theano.as_op([tt.dvector], [tt.dvector])
def somefunc(a):
    if random.random() < 0.01:
        # Segfault
        ctypes.string_at(0)
    return 2 * np.array(a)

with pm.Model() as model:
    x = pm.Normal('x', shape=2)
    pm.Normal('y', mu=somefunc(x), shape=2)
    
    step = pm.Metropolis()
    trace = pm.sample(step=step)

Unfortunately there is a small probability that this will also segfault the main process, so I don't really know how to turn this into a proper test.

@aseyboldt aseyboldt added the bug label Jul 1, 2020
@aseyboldt aseyboldt requested a review from junpenglao July 1, 2020 14:03
@junpenglao
Member

Is this related to any of the bugs we see in MS related to multiprocessing?

Member

@junpenglao junpenglao left a comment


LGTM but I will ask @lucianopaz for a second look.

@aseyboldt
Member Author

Is this related to any of the bugs we see in MS related to multiprocessing?
Possibly, but I don't know.

It would be great if anyone could also test this on Windows. My virtual machine is refusing to work again, and all that multiprocessing machinery tends to behave differently on Windows.

@michaelosthege
Member

Is this related to any of the bugs we see in MS related to multiprocessing?
Possibly, but I don't know.

It would be great if anyone could also test this on Windows. My virtual machine is refusing to work again, and all that multiprocessing machinery tends to behave differently on Windows.

I added a test that I ran locally on Windows - checks out ✌

@aseyboldt
Member Author

aseyboldt commented Jul 1, 2020 via email

@michaelosthege
Member

I think that test has an issue: every time the test is executed, the main process will also run the function that might segfault. In that case pytest will be quite unhappy. This should happen in 1% of test runs, which would be very annoying. Maybe we can store the pid of the main process when we define the function and only segfault when os.getpid() != main_pid inside the function.

With multiprocessing the PID is the same in the child processes. Instead, I changed it so that it segfaults on a < 0; since the prior mode is > 0, all testval checks are positive and it does not crash before sampling.

Contributor

@lucianopaz lucianopaz left a comment


This looks great and I would merge it after the new tests pass!

However, it doesn't look related to the broken-pipe errors we used to see on Windows and, more recently, Mac. Those happen while the processes are being spawned with self._process.start() (so even before we close the connection). I never figured out how to catch the exceptions raised while spawning the workers, so that the leader process could surface them to the user instead of useless broken-pipe errors.

@aseyboldt
Member Author

The pid really shouldn't stay the same. I'm 100% sure about that, on Linux at least (see e.g. https://pubs.opengroup.org/onlinepubs/9699919799/functions/fork.html).
This works for me on linux:

import pymc3 as pm
import theano
import theano.tensor as tt
import random
import ctypes
import numpy as np
import os


master_pid = os.getpid()

@theano.as_op([tt.dvector], [tt.dvector])
def somefunc(a):
    if random.random() < 1 and os.getpid() != master_pid:
        # Segfault
        ctypes.string_at(0)
    return 2 * np.array(a)

with pm.Model() as model:
    x = pm.Normal('x', shape=2)
    pm.Normal('y', mu=somefunc(x), shape=2)
    
    step = pm.Metropolis()
    trace = pm.sample(step=step)

@aseyboldt
Member Author

However, it doesn't look related to the broken-pipe errors we used to see on Windows and, more recently, Mac. Those happen while the processes are being spawned with self._process.start() (so even before we close the connection). I never figured out how to catch the exceptions raised while spawning the workers, so that the leader process could surface them to the user instead of useless broken-pipe errors.

One thing we could do to improve error messages when the spawn fails is to first start a new process that does nothing but unpickle the args and return 0. If we then check the return code of that process, we can at least tell that something went wrong because of pickling, right?
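A rough sketch of that probe idea might look like the following (the helper names `_probe` and `spawn_probe` are hypothetical, not PyMC3 API; for simplicity this uses the fork start method, while the real problem case is spawn on Windows/macOS):

```python
import multiprocessing as mp
import pickle

ctx = mp.get_context("fork")  # assumption: POSIX; the real target would be spawn

def _probe(payload):
    # Do nothing but unpickle the worker arguments and exit; exit code 0
    # then means the arguments survive the trip into a fresh process.
    pickle.loads(payload)

def spawn_probe(args):
    """Start a throwaway process that only unpickles `args` and exits.

    Returns True if both pickling in the parent and unpickling in the
    child succeed, so a failure can be reported as a clear pickling
    problem instead of a useless broken pipe later on.
    """
    try:
        payload = pickle.dumps(args)
    except Exception:
        return False  # can't even serialize in the parent process
    proc = ctx.Process(target=_probe, args=(payload,))
    proc.start()
    proc.join()
    return proc.exitcode == 0
```

For example, `spawn_probe({"draws": 1000})` succeeds, while passing an unpicklable object such as a `threading.Lock` fails already at the `pickle.dumps` step.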

@aseyboldt
Member Author

This is easier to test with the changes in #3991, so I included the fix and test there.
Closing this PR.

@aseyboldt aseyboldt closed this Jul 3, 2020