Several large genomes using "Supplemental Protocol 20—Data Parallelization" #374
Unanswered · silviaprietob asked this question in Q&A · 0 replies
Hi!
I have been following Supplemental Protocol 20 described in "Hoff, K. J., & Stanke, M. (2018). Predicting Genes in Single Genomes with AUGUSTUS. Current Protocols in Bioinformatics, e57. https://doi.org/10.1002/cpbi.57" to predict the genes in 20 "large" genomes. I am using this protocol rather than the one in "Multi-Genome Annotation with AUGUSTUS" because I want to predict genes in each genome independently. However, I have run into the following issue:
I downloaded the genomes from Ensembl. They are large chordate species (mostly vertebrates) such as human, mouse, frog, Ciona intestinalis, chicken, kakapo, Chrysemys picta bellii (turtle), and several fishes, so they vary widely in the number and length of sequences in the genome, and consequently in the number of splits and jobs produced by splitMfasta.pl and createAugustusJoblist.pl. The number of jobs per species ranges from 476 to 80,100, and for all but one species it exceeds the number of jobs SLURM allows me to submit. The example in the protocol has only 30 jobs, so I am not sure how to proceed with this many.

I am attaching the version of "jobs.sh" that I have been using, which produces the issue described above (runjobs.txt, originally runjobs.sh). A further complication is that I use a for loop to iterate over the species. I am also still unsure how job arrays really work: when I assign, for example, an array of 1-500, task IDs well above that range (such as 2000) appear when jobs are requeued (I have seen this with other software, but I cannot figure out how to make it work for this purpose). Could you give me a hand, or do you have an idea of how I could solve this?
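For reference, here is a sketch of the chunked, throttled array submission I have in mind. Everything here is an assumption on my part, not from the protocol: jobs.lst, run_one_job.sh, MAX_ARRAY and THROTTLE are placeholders that would need adapting to the cluster (the real array-index limit is MaxArraySize in `scontrol show config`).

```shell
#!/bin/bash
# Hedged sketch: split a long AUGUSTUS joblist into chunks that each fit
# inside one SLURM job array, and throttle concurrency with the %N suffix.
# jobs.lst, run_one_job.sh, MAX_ARRAY and THROTTLE are placeholders.
set -euo pipefail

# Print "start end" index pairs partitioning 1..NJOBS into chunks of at
# most MAX indices each, so every chunk fits in one SLURM array.
compute_chunks() {
    local njobs=$1 max=$2 start end
    for (( start=1; start<=njobs; start+=max )); do
        end=$(( start + max - 1 ))
        (( end > njobs )) && end=$njobs
        echo "$start $end"
    done
}

JOBLIST=jobs.lst   # one AUGUSTUS command per line (from createAugustusJoblist.pl)
MAX_ARRAY=1000     # placeholder for the cluster's MaxArraySize
THROTTLE=50        # %N suffix: at most N array tasks running at once

if command -v sbatch >/dev/null 2>&1 && [ -f "$JOBLIST" ]; then
    NJOBS=$(wc -l < "$JOBLIST")
    while read -r start end; do
        # Each array task would read line OFFSET+SLURM_ARRAY_TASK_ID of the
        # joblist; run_one_job.sh is a hypothetical one-line wrapper doing that.
        sbatch --array="1-$(( end - start + 1 ))%${THROTTLE}" \
               --export=ALL,JOBLIST="$JOBLIST",OFFSET=$(( start - 1 )) \
               run_one_job.sh
    done < <(compute_chunks "$NJOBS" "$MAX_ARRAY")
fi
```

The idea would be that a species with 80,100 jobs becomes 81 arrays of at most 1,000 tasks each, with at most 50 tasks of each array running at a time, but I am not sure this is the intended way to scale the protocol.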
Many many thanks!
runjobs.txt