Parallelise collapse_paralogs? #126

Open
johnlees opened this issue Jul 28, 2021 · 2 comments

@johnlees
Collaborator

johnlees commented Jul 28, 2021

I'm running this and finding the "Processing paralogs..." stage quite slow:

Processing paralogs...
  2%|██▉                                                    | 92/3815 [1:37:47<153:39:25, 148.58s/it]

It looks single-threaded in the code. Is it possible to do this in a multiprocessing loop, or are there interactions which make this difficult?

@gtonkinhill
Owner

This is definitely a bottleneck in the pipeline. It is not trivial to parallelise as the processing of one paralogous family can depend on the result of another.

The algorithm initially attempts to collapse paralogous genes by identifying the nearest neighbour of a paralogous gene within the graph and collapsing them if they are from separate samples. This is tricky to parallelise, as you need to be sure that a previous step has not already included a gene from the same sample in that cluster.
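
To illustrate the ordering problem, here is a minimal sketch, not the project's actual code. It assumes the graph is a plain adjacency dict and that each node carries the set of samples it already contains. A hop-count breadth-first search stands in for the real shortest-path calculation, and the names `nearest_compatible` and `collapse` are hypothetical.

```python
from collections import deque


def nearest_compatible(adj, samples, start):
    """Breadth-first search for the closest node sharing no sample with `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nb in sorted(adj[node]):  # sorted only to make ties deterministic
            if nb in seen:
                continue
            if samples[nb].isdisjoint(samples[start]):
                return nb
            seen.add(nb)
            queue.append(nb)
    return None


def collapse(adj, samples, keep, drop):
    """Merge node `drop` into node `keep`, pooling their samples and edges."""
    samples[keep] |= samples.pop(drop)
    for nb in adj.pop(drop):
        adj[nb].discard(drop)
        if nb != keep:
            adj[nb].add(keep)
            adj[keep].add(nb)


# Toy graph x - t - y: x and y are paralogs from sample A, t is from sample B.
adj = {"x": {"t"}, "t": {"x", "y"}, "y": {"t"}}
samples = {"x": {"A"}, "t": {"B"}, "y": {"A"}}

# Considered independently, both paralogs would pick t as their target.
target = nearest_compatible(adj, samples, "x")
collapse(adj, samples, target, "x")
# Once x has been merged, t already holds sample A, so y has no valid target.
print(nearest_compatible(adj, samples, "y"))
```

Two workers handling `x` and `y` at the same time would both choose `t`, which is the conflict described above.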

When this approach fails, the algorithm defaults to using gene context. This stage would be much easier to parallelise and is already faster, as it does not rely on calculating shortest paths. I was hoping to experiment with how much the results would change if we only used this approach, but I might not get a chance for a couple of months.

gtonkinhill self-assigned this Jul 29, 2021
gtonkinhill added the enhancement (New feature or request) label Jul 29, 2021
@johnlees
Collaborator Author

To keep using the first approach, could you use some shared memory to mark genes which have been included? The shared memory manager in Python 3.8 has made this kind of thing easier to get to work in poppunk.
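
A rough sketch of what that marking could look like, using the standard-library `multiprocessing.shared_memory` module added in Python 3.8. This is only an illustration under assumptions: genes are taken to have integer indices, one byte per gene is used as the flag, and `create_flags` and `try_claim` are hypothetical helper names, not functions from either codebase.

```python
from multiprocessing import Lock, shared_memory


def create_flags(n_genes):
    """Allocate one zero-initialised byte per gene in shared memory."""
    shm = shared_memory.SharedMemory(create=True, size=n_genes)
    shm.buf[:n_genes] = bytes(n_genes)
    return shm


def try_claim(shm_name, gene_index, lock):
    """Mark a gene as included; return False if it was already marked."""
    # Attach to the existing block by name, as a worker process would.
    shm = shared_memory.SharedMemory(name=shm_name)
    try:
        with lock:  # make the check-and-set atomic across workers
            if shm.buf[gene_index]:
                return False
            shm.buf[gene_index] = 1
            return True
    finally:
        shm.close()


flags = create_flags(4)
lock = Lock()
try:
    first = try_claim(flags.name, 2, lock)   # gene 2 is free, so it is claimed
    second = try_claim(flags.name, 2, lock)  # a second claim is refused
    print(first, second)
finally:
    flags.close()
    flags.unlink()  # release the block once all workers are done
```

In a real run the block name and the lock would be handed to each worker, and a worker would only collapse a gene after `try_claim` returns `True`.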

Just a suggestion though, sounds like this was already on your radar!
