Close main process copy of pipe when sampling in parallel #3988

Closed · wants to merge 6 commits

Conversation

aseyboldt
Member

When starting parallel sampling, we first create a pipe for communication. We pass one end to the new worker process, but we should also close our own local copy of that end, so that the pipe breaks when the remote process dies for some reason.
If we don't, then when a worker dies the main process keeps waiting for new samples: because an end of the pipe is still open in the main process, the wait does not fail with a ConnectionResetError but blocks indefinitely.
Closing the connection prevents this, so when a worker dies, sampling stops and errors out.
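The failure mode can be sketched with the stdlib multiprocessing module directly (this is an illustrative reduction, not the PyMC3 internals; it assumes a POSIX system with the fork start method, and `worker`/`run` are made-up names):

```python
import multiprocessing as mp
import os
import signal

ctx = mp.get_context("fork")  # deterministic on POSIX; spawn behaves differently

def worker(conn):
    # Send one message, then die abruptly, simulating a segfault.
    conn.send("sample")
    os.kill(os.getpid(), signal.SIGKILL)

def run():
    parent_conn, child_conn = ctx.Pipe()
    proc = ctx.Process(target=worker, args=(child_conn,))
    proc.start()
    # The fix from this PR: close the main process's copy of the child end.
    # Without this line, the second recv() below would block forever instead
    # of raising, because this process still holds an open writer on the pipe.
    child_conn.close()
    first = parent_conn.recv()  # the buffered message still arrives
    try:
        parent_conn.recv()      # worker is dead; the pipe is now broken
        outcome = "received"
    except (EOFError, ConnectionResetError):
        outcome = "broken-pipe-detected"
    proc.join()
    return first, outcome

if __name__ == "__main__":
    print(run())
```

With `child_conn.close()` commented out, the second `recv()` hangs, which is exactly the symptom described above.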
We can test this behaviour by producing a segfault on purpose:

import pymc3 as pm
import theano
import theano.tensor as tt
import random
import ctypes
import numpy as np

@theano.as_op([tt.dvector], [tt.dvector])
def somefunc(a):
    if random.random() < 0.01:
        # Segfault
        ctypes.string_at(0)
    return 2 * np.array(a)

with pm.Model() as model:
    x = pm.Normal('x', shape=2)
    pm.Normal('y', mu=somefunc(x), shape=2)
    
    step = pm.Metropolis()
    trace = pm.sample(step=step)

Unfortunately there is a small probability that this will also segfault the main process, so I don't really know how to turn this into a proper test.

@aseyboldt aseyboldt added the bug label Jul 1, 2020
@aseyboldt aseyboldt requested a review from junpenglao July 1, 2020 14:03
@junpenglao
Member

Is this related to any of the bugs we see in MS related to multiprocessing?

Member

@junpenglao junpenglao left a comment


LGTM but I will ask @lucianopaz for a second look.

@aseyboldt
Member Author

Is this related to any of the bugs we see in MS related to multiprocessing?
Possibly, but I don't know.

It would be great if anyone could also test this on Windows. My virtual machine is refusing to work again, and all that multiprocessing machinery tends to behave differently on Windows.

@michaelosthege
Member

Is this related to any of the bugs we see in MS related to multiprocessing?
Possibly, but I don't know.

It would be great if anyone could also test this on Windows. My virtual machine is refusing to work again, and all that multiprocessing machinery tends to behave differently on Windows.

I added a test that I ran locally on Windows - checks out ✌

@aseyboldt
Member Author

aseyboldt commented Jul 1, 2020 via email

@michaelosthege
Member

I think that test has an issue: every time the test is executed, the main process will also run the function that might segfault. In that case pytest will be quite unhappy. This should happen in 1% of test runs, which would be very annoying. Maybe we can store the pid of the main process when we define the function and only segfault when os.getpid() != main_pid inside the function.

With multiprocessing the PID is the same in the child processes. Instead, I changed it so that it segfaults on a < 0; since the prior mode is > 0, all testval checks are positive and it does not crash before sampling.

Contributor

@lucianopaz lucianopaz left a comment


This looks great and I would merge it after the new tests pass!

However, it doesn't look related to the broken-pipe errors we used to see on Windows and, more recently, Mac. Those happen while the processes are being spawned with self._process.start() (so even before we close the connection). I never figured out how to catch the exceptions raised while spawning the workers, so that the leader process could surface them to the user instead of useless broken-pipe errors.

@aseyboldt
Member Author

The pid really shouldn't stay the same. I'm 100% sure about that, on Linux at least (see e.g. https://pubs.opengroup.org/onlinepubs/9699919799/functions/fork.html).
This works for me on linux:

import pymc3 as pm
import theano
import theano.tensor as tt
import random
import ctypes
import numpy as np
import os


master_pid = os.getpid()

@theano.as_op([tt.dvector], [tt.dvector])
def somefunc(a):
    if random.random() < 1 and os.getpid() != master_pid:
        # Segfault
        ctypes.string_at(0)
    return 2 * np.array(a)

with pm.Model() as model:
    x = pm.Normal('x', shape=2)
    pm.Normal('y', mu=somefunc(x), shape=2)
    
    step = pm.Metropolis()
    trace = pm.sample(step=step)

@aseyboldt
Member Author

However, it doesn't look related to the broken-pipe errors we used to see on Windows and, more recently, Mac. Those happen while the processes are being spawned with self._process.start() (so even before we close the connection). I never figured out how to catch the exceptions raised while spawning the workers, so that the leader process could surface them to the user instead of useless broken-pipe errors.

One thing we could do to improve error messages when the spawn fails is to first start a new process that does nothing but unpickle the args and return 0. If we then check the return code of that process, we can at least tell that something went wrong because of pickling, right?
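A rough sketch of that probe idea might look like the following (the helper names `_probe` and `spawn_probe` are hypothetical, not PyMC3 API; for simplicity this uses the fork start method, while the real problem case is spawn on Windows/macOS):

```python
import multiprocessing as mp
import pickle

ctx = mp.get_context("fork")  # assumption: POSIX; the real target would be spawn

def _probe(payload):
    # Do nothing but unpickle the worker arguments and exit; exit code 0
    # then means the arguments survive the trip into a fresh process.
    pickle.loads(payload)

def spawn_probe(args):
    """Start a throwaway process that only unpickles `args` and exits.

    Returns True if both pickling in the parent and unpickling in the
    child succeed, so a failure can be reported as a clear pickling
    problem instead of a useless broken pipe later on.
    """
    try:
        payload = pickle.dumps(args)
    except Exception:
        return False  # can't even serialize in the parent process
    proc = ctx.Process(target=_probe, args=(payload,))
    proc.start()
    proc.join()
    return proc.exitcode == 0
```

For example, `spawn_probe({"draws": 1000})` succeeds, while passing an unpicklable object such as a `threading.Lock` fails already at the `pickle.dumps` step.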

@aseyboldt
Member Author

This is easier to test with the changes in #3991, so I included the fix and test there.
Closing this PR.

@aseyboldt aseyboldt closed this Jul 3, 2020