
Long running times in population mode when using raw reads #22

Open
IdoBar opened this issue Aug 6, 2023 · 3 comments

IdoBar commented Aug 6, 2023

Hi,

I'm running unitig-caller in population mode to be used by pyseer for GWAS (following the documentation).
I don't have assemblies for all my samples (~250 samples), so I'm using the raw reads (two 1-3 GB gzipped paired-end fastq files per sample), and it takes a long time to process.
I was wondering a couple of things:

  1. Is it possible to run each sample separately (in parallel) and later combine the graphs? Would that be more computationally efficient?
  2. The input files don't contain any sample name column, and all my samples have multiple input files (paired-end), so I don't understand at what point unitig presence is calculated/counted per sample (to serve as input for pyseer)?
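For context, the setup described above looks roughly like this (hypothetical paths; the flag names are taken from the unitig-caller README, so verify them with `unitig-caller --help`):

```shell
# unitig-caller's population (call) mode takes a plain text file listing
# one input file per line -- there is no sample-name column:
ls reads/*.fastq.gz > reads_list.txt
# Call unitigs across the whole population (slow on raw reads):
# unitig-caller --call --reads reads_list.txt --out unitig_calls
```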

Thanks, Ido

@samhorsfield96 (Collaborator)

Hi Ido,

Building with reads rather than assemblies is more computationally intensive, as Bifrost has to conduct an additional filtering step to remove low-coverage k-mers. There are also more sequences to include in the graph than when using assemblies, which further increases runtime. To answer your questions:

  1. Unitig-caller is already parallelized using Bifrost. You could alternatively build the graphs incrementally using Bifrost in 'update' mode (see https://github.com/pmelsted/bifrost), and then pass the .gfa and .bfg_colors files back to unitig-caller using either call or query mode, depending on whether you want the unitigs for your current population or to check unitig presence/absence in a new population. However, this process only adds one sample at a time and does not merge graphs, so I'm unsure what speed-up you would get, if any.
  2. As Bifrost generates edge-induced graphs, any k-mers with a k-1 overlap are connected by an edge, and so contiguity information is not taken into account. Paired-end reads from the same sample should therefore be merged into the same file, as the paired-end information is not used to generate unitigs.
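The per-sample merge in point 2 needs no special tooling: gzip streams concatenate cleanly, so plain `cat` on the two `.fastq.gz` files produces a valid combined file. A minimal sketch (hypothetical file names; the commented Bifrost flags are from its README and should be checked against your installed version with `Bifrost --help`):

```shell
# Merge a sample's paired-end files into a single file before graph
# building (concatenated gzip streams are themselves valid gzip):
cat sample1_R1.fastq.gz sample1_R2.fastq.gz > sample1.fastq.gz

# Incremental alternative from point 1 via Bifrost 'update' mode:
# Bifrost build  -r sample1.fastq.gz -k 31 -c -o pop_graph
# Bifrost update -g pop_graph.gfa -f pop_graph.bfg_colors \
#                -r sample2.fastq.gz -c -o pop_graph
```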

I hope this helps. Please let me know if you have any further questions.

Sam


IdoBar commented Aug 11, 2023

Thanks for your reply @samhorsfield96 ,

At the moment, trying to run all the samples together in call mode required more than 100 GB of memory and ran for over 65 hours before reaching the memory limit and failing.
I understand that overall I won't save any time, but would splitting the work into smaller jobs reduce memory consumption so that my HPC could handle them?
What is the time and memory-intensive step? Is it the graph generation or the calling? I'll try to generate the graph first and see if it finishes successfully.
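One way to see which stage is the bottleneck is to wrap each stage with GNU time, which reports peak memory use. A sketch, assuming `/usr/bin/time` is GNU time (the `sleep 1` is a placeholder for the actual build or call command):

```shell
# GNU time's -v report includes "Maximum resident set size" (peak RAM).
# Replace `sleep 1` with the graph-generation command, then separately
# with the calling command, and compare the two logs:
/usr/bin/time -v sleep 1 2> stage_mem.log
grep 'Maximum resident set size' stage_mem.log
```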

Thanks, Ido

@samhorsfield96 (Collaborator)

Hi Ido,

We haven't extensively tested the computational efficiency of each step, although I would predict it is dataset-dependent (more variation will make graph generation take longer and use more memory, while more genomes will impact calling). I'm not 100% sure of the effect of incrementally adding genomes either. If you are happy to do so, I would be interested to hear how your approach fares with regard to memory consumption.

All the best,

Sam
