Isaac Overcast edited this page Nov 15, 2015 · 13 revisions

Alternative ideas for parallelization.

  • multiprocessing is not working perfectly right now
  • it would be nice to have MPI functionality

UPDATE:

OK, ipyparallel turns out to be extremely easy to use, and it can access MPI. The one drawback is that the engines have to be started outside of ipyrad, although we could probably work around that too. From further reading, it seems multiprocessing is not fully compatible with IPython, so ipyparallel really looks like the better route. I created an ipyparallel branch to rewrite all of the parallel code.

Ways of parallelizing

pyrad was somewhat naively parallelized by running each sample on a separate processor, but there are better ways to do this in ipyrad. Steps 1, 2, and 5 can be parallelized by breaking large files into chunks, which I have implemented in s1 and s2 so far. Step 3 has to be run on each sample separately, and each sample can only use about 4-6 processors efficiently before vsearch hits diminishing returns. So multiple samples should be run at once, but optimizing this can be a little tricky. Steps 6-7 are easy to parallelize.
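The chunking idea for steps 1, 2, and 5 can be sketched as follows. This is a minimal illustration, not the code in s1 or s2: it assumes the input is FASTQ-like (four lines per record), and the name iter_chunks and its parameters are hypothetical.

```python
from itertools import islice


def iter_chunks(lines, records_per_chunk, lines_per_record=4):
    """Yield lists of lines, each containing only whole records.

    Assumes every record spans exactly `lines_per_record` lines
    (four for FASTQ), so a chunk never splits a record in two.
    """
    size = records_per_chunk * lines_per_record
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

Each chunk can then be handed to a separate worker, and the per-chunk outputs concatenated in order once all workers finish.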

Problem in Multiprocessing (currently implemented)

issue discussed here
It seems the fix is to upgrade to pyzmq==14.5.0. (Only a stop-gap, not a fix for the full problem.)

I keep getting stalled processes, seemingly at random, while running in a Jupyter notebook: the exact same code stalls on some runs and not on others. I'm not sure whether the problem is that multiprocessing also interacts with the ZeroMQ methods used by IPython. My guess is that it has to do with processes not being killed properly, or rather with IPython trying to execute the next line of code in a cell before the process has had time to shut down. Is that crazy? It doesn't seem to happen when I execute only a single step function in a cell. Maybe a very short sleep call would fix this, but that seems pretty hacky. We either need to fix this or find an alternative parallel method. Fixing would probably be easiest, but other methods that allow for MPI would be an improvement.

Problems with IPython.parallel (ipyparallel)

My main problem with ipyparallel is that you have to start the cluster outside of ipyrad, so users have to run slightly more complicated commands. For example, the CLI would need to be run as:

 ipcluster start -n 4 --daemon
 ipyrad -p params.txt 

The second problem with ipyparallel is that it is under very active development (the ipyparallel package arrived with IPython 4.0; in 3.0 it is still IPython.parallel). We should certainly develop ipyrad for the future release, but it would be a pain if ipyparallel changes faster than we can easily keep up with.
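Because the cluster is started outside of ipyrad, the program has to cope with engines that are not up yet. A minimal sketch of one way to wait for them is below. The helper wait_for_engines and its parameters are hypothetical, and it assumes only that the client object exposes an ids list and a close() method, as ipyparallel.Client does.

```python
import time


def wait_for_engines(connect, min_engines=1, timeout=30.0, poll=0.5):
    """Return a connected client once `min_engines` engines are available.

    `connect` is any zero-argument callable returning a client object
    with an `ids` list and a `close()` method.
    """
    deadline = time.time() + timeout
    last_error = None
    while True:
        try:
            client = connect()
        except Exception as err:
            # Assumption: connecting fails while no controller is running.
            last_error = err
        else:
            if len(client.ids) >= min_engines:
                return client
            client.close()  # too few engines so far; drop it and retry
        if time.time() >= deadline:
            raise RuntimeError(
                "no cluster with %d engine(s) found; was 'ipcluster start' "
                "run first? (last error: %r)" % (min_engines, last_error))
        time.sleep(poll)


# Intended use, assuming ipyparallel is installed and a cluster was started:
#   import ipyparallel as ipp
#   ipyclient = wait_for_engines(ipp.Client, min_engines=4)
```

Passing the connection function in as an argument keeps the retry logic independent of the ipyparallel version, which matters if the API keeps changing.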

Gotchas

This error means your cluster isn't running:

  UnboundLocalError: local variable 'ipyclient' referenced before assignment
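One common way this kind of UnboundLocalError arises is sketched below. It is a guess at the pattern, not a copy of ipyrad's code: the functions run_unguarded and run_guarded and the connect argument are hypothetical stand-ins.

```python
def run_unguarded(connect):
    """If connect() raises, the cleanup line hides the real error."""
    try:
        ipyclient = connect()
        return ipyclient.ids
    finally:
        # ipyclient was never assigned when connect() failed, so this
        # line raises UnboundLocalError in place of the original error.
        ipyclient.close()


def run_guarded(connect):
    """Same logic, but the original connection error propagates."""
    ipyclient = None
    try:
        ipyclient = connect()
        return ipyclient.ids
    finally:
        if ipyclient is not None:
            ipyclient.close()
```

With the guarded form, a missing cluster surfaces as the underlying connection error rather than as a confusing message about a local variable.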

This means you are calling result.get() on the return value of a map_async() call that returned an empty list, which usually indicates bad arguments:

  AttributeError: 'list' object has no attribute 'get'
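A small guard can turn this into a clearer error at submission time. The sketch below assumes only that the view object has a map_async() method; the helper name submit_map is hypothetical.

```python
def submit_map(view, func, items):
    """Call view.map_async(func, items), refusing an empty argument list."""
    items = list(items)
    if not items:
        # Fail early with a message that points at the actual cause.
        raise ValueError("nothing to map: the argument list is empty")
    return view.map_async(func, items)
```

The caller can then use result.get() on the returned object without first checking whether it got a list back.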