Several large genomes using "Supplemental Protocol 20—Data Parallelization" #374
Unanswered · silviaprietob asked this question in Q&A · 0 replies
Hi!
I have been following Supplemental Protocol 20 described in "Hoff, K. J., & Stanke, M. (2018). Predicting Genes in Single Genomes with AUGUSTUS. Current Protocols in Bioinformatics, e57. https://doi.org/10.1002/cpbi.57" to predict the genes in 20 "large" genomes. I am using this protocol rather than the one in "Multi-Genome Annotation with AUGUSTUS" because I want to predict genes in each genome independently. However, I have run into the following issue:
I downloaded the genomes from Ensembl. They are large chordate species (mostly vertebrates) such as human, mouse, frog, Ciona intestinalis, chicken, kakapo, Chrysemys picta bellii (turtle), and several fishes, so they vary widely in the number and length of sequences in the genome, and consequently in the number of splits and jobs produced by splitMfasta.pl and createAugustusJoblist.pl. The number of jobs per species ranges from 476 to 80,100, and for all but one species it exceeds the number of jobs SLURM allows me to submit. The example in the protocol has only 30 jobs, so I am not sure how to proceed with this many.

I am attaching the version of "jobs.sh" that I have been using, which produces the issue described above (runjobs.txt, originally runjobs.sh). A further complication is that I use a for loop to iterate over the species. I am also still unsure how job arrays really work: when I assign, for example, an array of 1-500, task IDs well above that range (such as 2000) appear when jobs are requeued (I have seen this with other software, but I cannot figure out how to make it work for this purpose). Could you give me a hand, or do you have an idea of how I could solve this?
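For reference, here is a sketch of the chunked, throttled array submission I have in mind. Everything here is an assumption on my part, not from the protocol: jobs.lst, run_one_job.sh, MAX_ARRAY and THROTTLE are placeholders that would need adapting to the cluster (the real array-index limit is MaxArraySize in `scontrol show config`).

```shell
#!/bin/bash
# Hedged sketch: split a long AUGUSTUS joblist into chunks that each fit
# inside one SLURM job array, and throttle concurrency with the %N suffix.
# jobs.lst, run_one_job.sh, MAX_ARRAY and THROTTLE are placeholders.
set -euo pipefail

# Print "start end" index pairs partitioning 1..NJOBS into chunks of at
# most MAX indices each, so every chunk fits in one SLURM array.
compute_chunks() {
    local njobs=$1 max=$2 start end
    for (( start=1; start<=njobs; start+=max )); do
        end=$(( start + max - 1 ))
        (( end > njobs )) && end=$njobs
        echo "$start $end"
    done
}

JOBLIST=jobs.lst   # one AUGUSTUS command per line (from createAugustusJoblist.pl)
MAX_ARRAY=1000     # placeholder for the cluster's MaxArraySize
THROTTLE=50        # %N suffix: at most N array tasks running at once

if command -v sbatch >/dev/null 2>&1 && [ -f "$JOBLIST" ]; then
    NJOBS=$(wc -l < "$JOBLIST")
    while read -r start end; do
        # Each array task would read line OFFSET+SLURM_ARRAY_TASK_ID of the
        # joblist; run_one_job.sh is a hypothetical one-line wrapper doing that.
        sbatch --array="1-$(( end - start + 1 ))%${THROTTLE}" \
               --export=ALL,JOBLIST="$JOBLIST",OFFSET=$(( start - 1 )) \
               run_one_job.sh
    done < <(compute_chunks "$NJOBS" "$MAX_ARRAY")
fi
```

The idea would be that a species with 80,100 jobs becomes 81 arrays of at most 1,000 tasks each, with at most 50 tasks of each array running at a time, but I am not sure this is the intended way to scale the protocol.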
Many many thanks!
runjobs.txt