Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parallel sampling hangs on macOS with Python 3.8 with pickling error #3844

Closed
dfm opened this issue Mar 20, 2020 · 12 comments · Fixed by #4053
Closed

Parallel sampling hangs on macOS with Python 3.8 with pickling error #3844

dfm opened this issue Mar 20, 2020 · 12 comments · Fixed by #4053

Comments

@dfm
Copy link
Contributor

dfm commented Mar 20, 2020

Description of your problem

Starting with Python 3.8 the default start method for multiprocessing on macOS is spawn instead of fork - more info here and here. This means that some features that used to work will not continue to work on macOS. I specifically ran into this when using a DensityDist where the function was defined in place as follows because the function is not picklable:

import pymc3 as pm

with pm.Model() as model:
    
    def val(x):
        return -0.5 * pm.math.sum(x ** 2)
    
    x = pm.Flat("x")
    pm.DensityDist("obs", val)
    
    pm.sample(chains=2, cores=2)

This fails with the following error that does not actually get caught. I was running in a Jupyter notebook and the notebook just hangs and this gets written to the command line:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/dforeman/research/projects/exoplanet-dev/case-studies/env/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/Users/dforeman/research/projects/exoplanet-dev/case-studies/env/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
AttributeError: Can't get attribute 'val' on <module '__main__' (built-in)>

A temporary workaround that I found was to execute

import multiprocessing as mp
mp.set_start_method("fork")

before importing PyMC3, but my understanding is that this won't be a long term solution. It doesn't seem like there's anything obvious to do about this, but I wanted to bring it up here in case folks had clever ideas!

Versions and main components

  • PyMC3 Version: 3.8
  • Theano Version: 1.0.4
  • Python Version: 3.8.1
  • Operating system: macOS
  • How did you install PyMC3: pip
@ColCarroll
Copy link
Member

Wow -- haven't played around with this yet, but maybe we add mp.set_start_method to either the context manager (it'll fail if you don't use it in context mode, but will exit gracefully) or to _mp_sample.

@lucianopaz
Copy link
Contributor

And thus Mac now enters the glorious realm of broken pipes, which was exclusive to windows before py38!
As you said, this can be prevented by trying to set the fork start method, but sticking to the default spawn will get us some nasty theano bugs that we never managed to properly diagnose and fix.
The only thing that we did was to catch the broken pipe error on Windows and change the message to instruct users on where to look for the true error's traceback.

@AlexAndorra
Copy link
Contributor

Hi @dfm (and thanks for black_nbconvert 😉 )
Thanks for reporting! I'm closing since I think this is fixed by #3919, but feel free to reopen if needed.

@dfm
Copy link
Contributor Author

dfm commented Aug 13, 2020

Unfortunately I can't seem to re-open the issue, but I can confirm that #3919 does not fix this issue: https://gist.github.com/dfm/413938425166b0f30fba47a5d7639da7

I can confirm that "fork" rather than "forkserver" does work - what was the argument for using "forkserver" instead of "fork"?

@AlexAndorra
Copy link
Contributor

AlexAndorra commented Aug 14, 2020

Note that the changes in #3919 have been superseded by a recent PR by @aseyboldt. Does this seem to solve the issue you're having? Also, which version of PyMC3 are you running?
For context, the argument for using forkserver was given here.

@dfm
Copy link
Contributor Author

dfm commented Aug 14, 2020

Sorry - no I'm running the master branch and the changes made in that PR don't help - the problem is fork vs. forkserver, not where it's defined!

@dfm
Copy link
Contributor Author

dfm commented Aug 14, 2020

I don't really understand enough of the underlying issues to see the relevant differences. Perhaps this doesn't matter if people don't regularly use DensityDist in notebooks on macOS (+ maybe Windows?) because that seems to be the only place where this matters, so no stress. I just wanted to make sure that it was known that this issue is not fixed! I think it makes sense to re-open, but I can't since that option seems to only be available to maintainers.

@AlexAndorra
Copy link
Contributor

AlexAndorra commented Aug 14, 2020

Yeah this issue is definitely weird. From my experience, sometimes it works with forkserver and not with fork, other times it's the other way around. So this is really puzzling -- maybe there even are several issues mixed up in here 🤷‍♂️ In the little digging I did at the time though, I always saw the recommendation to use forkserver as the default, not fork.

Does the new kwarg in sample help you with that (you can specify the method you want instead of sticking to the default)?

I'm also curious about what @aseyboldt and @lucianopaz think before reopening.

@aseyboldt
Copy link
Member

I think for the most part this issue is just inherent weirdness of multiprocessing/pickle. If the logp func can't be pickled, and the multiprocessing context is one of spawn or forkserver, then we can not expect this to work. We do expect an error message that is shown in the notebook, that mentions pickle however. If that doesn't happen, then it really is a bug.

@dfm For your example I get this:

import pymc3 as pm

with pm.Model() as model:
    
    def val(x):
        return -0.5 * pm.math.sum(x ** 2)
    
    x = pm.Normal("x")
    pm.DensityDist("obs", val)
    
    pm.sample(chains=2, cores=2, mp_ctx="forkserver")
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/adr/git/pymc3/pymc3/parallel_sampling.py", line 114, in _unpickle_step_method
    self._step_method = pickle.loads(self._step_method)
AttributeError: Can't get attribute 'val' on <module '__main__' (built-in)>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/adr/git/pymc3/pymc3/parallel_sampling.py", line 135, in run
    self._unpickle_step_method()
  File "/home/adr/git/pymc3/pymc3/parallel_sampling.py", line 116, in _unpickle_step_method
    raise ValueError(unpickle_error)
ValueError: The model could not be unpickled. This is required for sampling with more than one core and multiprocessing context spawn or forkserver.
"""

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
ValueError: The model could not be unpickled. This is required for sampling with more than one core and multiprocessing context spawn or forkserver.

The above exception was the direct cause of the following exception:
...

If you specify the multiprocessing context (please use mp_ctx, not multiprocessing.set_start_method) the behaviour should also be the same on all platforms (except for windows, which will error out on any context other than spawn).

We can avoid this problem at least in this case by using dill instead of pickle (assuming it is installed):

import pymc3 as pm

with pm.Model() as model:
    def val(x):
        return -0.5 * pm.math.sum(x ** 2)
    
    x = pm.Normal("x")
    pm.DensityDist("obs", val)
    
    pm.sample(chains=2, cores=2, mp_ctx="forkserver", pickle_backend='dill')

Just to show that this isn't specific to pymc3 in any way: In the notebook this doesn't work either.

import multiprocessing

ctx = multiprocessing.get_context("forkserver")

def foo(x):
    print(2 * x)

proc = ctx.Process(target=foo, args=(2.,))
proc.start()

The child process dies and prints the unpickling error to the terminal where the notebook was started.

@aseyboldt
Copy link
Member

I guess we could actually work around this by depending on dill, and overwrite __setstate__ in DensityDist, to always use dill to serialize that function. This would not change how it behaves with ctx=fork, but would fix densitydist in notebooks on windows and mac.
I guess there is a chance that it could break something that currently works in a script running on windows or mac though...

Somthing like this:

class DensityDist(Distribution):
    ...

    def __getstate__(self):
        logp = dill.dumps(self.logp)
        vals = self.__dict__.copy()
        vals['logp'] = logp
        return vals

    def __setstate__(self, vals):
        vals['logp'] = dill.loads(vals['logp'])
        self.__dict__ = vals'

@dfm
Copy link
Contributor Author

dfm commented Aug 14, 2020

@aseyboldt: I think that the suggestion to use dill for DensityDist is a great idea!

I guess there is a chance that it could break something that currently works in a script running on windows or mac though...

I haven't used dill much, are there cases where it can't pickle things that pickle can? Or are you worried about something else?

@AlexAndorra: thanks for the response! I was suggesting that it be reopened because the bug still exists even if it's not PyMC3 specific. If it's not re-opened there should probably at least be a wontfix label or something to let people know who might run into the same issue :D

@aseyboldt
Copy link
Member

I haven't used dill much, are there cases where it can't pickle things that pickle can?

I haven't used it much either, and I really do not know. That is why I'm a bit worried.
But then, just using dill to serialize the functions of DensityDist is a pretty confined use case, I can't think of much that this might break that isn't broken already...

I'll also reopen the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants