
advice for increasing speed of --annotate_hits_table #80

Closed

colinbrislawn opened this issue Feb 2, 2018 · 8 comments

colinbrislawn commented Feb 2, 2018

I'm trying to annotate 800k proteins on a compute cluster, and the instructions for large-scale analysis have been great!

However, I'm running into serious I/O limits with the final --annotate_hits_table step, especially when running on worker nodes. You mention:

We usually annotate at a rate of 300-400 proteins per second using 10 CPU cores and having eggnog.db under the /dev/shm disk, but you can of course run many of those instances in parallel.

How do I set up local caching? I tried copying the database to /dev/shm, but this was not recognized and I can't find an option to set a database directory.

Thank you for supporting this excellent software.

jhcepas (Member) commented Feb 3, 2018

@colinbrislawn you can either 1) move the whole emapper directory to /dev/shm, or 2) move just the data/ dir and then point to it with the --data_dir option.
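For example, option 1 might look like this (a minimal sketch; the install path and the seed orthologs file name are hypothetical placeholders):

# Option 1 (sketch): place the whole emapper install, data included,
# on the RAM disk and run it from there. Paths are placeholders.
cp -r /path/to/eggnog-mapper /dev/shm/eggnog-mapper
/dev/shm/eggnog-mapper/emapper.py --annotate_hits_table my.seed_orthologs -o my_run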

jhcepas closed this as completed Feb 3, 2018
colinbrislawn (Author) commented Feb 5, 2018

Thank you for the fast response!

Here is how I solved this problem:

# go to eggnog mapper dir
cd /people/me/bin/anaconda/envs/eggnog/lib/python2.7/site-packages/data/

# unzip the database directly onto the RAM disk. Reading the .gz version is faster than reading the uncompressed 22GB file.
pigz -dc eggnog.db.gz > /dev/shm/eggnog.db

# Point to the database on the RAM disk
emapper.py --annotate_hits_table --data_dir /dev/shm/ ... <other settings go here>

Performance was fantastic:
Processed queries:596878 total_time: 701.507770061 rate:850.85 q/s
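One general caveat, not from the thread itself: /dev/shm is shared memory, and files there persist until deleted or the node reboots, so on a shared cluster it is worth freeing the space once the job finishes:

# Free the shared-memory copy of the database when done.
rm /dev/shm/eggnog.db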

jhcepas (Member) commented Feb 6, 2018

Awesome. I didn't notice you were in a conda environment!
Your approach is just right.

evolu-tion commented

That's a fantastic approach!
Thank you so much for sharing this technique.

saras224 commented Jun 7, 2024

Hi guys!
I also have a large metagenomic dataset. I want to run eggNOG-mapper to obtain annotation files for the assembled metagenomic dataset. Can anyone (@evolu-tion?) guide me on how to do the analysis faster? I have also installed eggNOG-mapper inside a conda environment.

colinbrislawn (Author) commented Jun 7, 2024

Hello @saras224 👋

guide me on how to do the analysis faster?

Are you asking for some free advice or would you like to hire an expert?

Either way, I would write to the folks on the current team listed here:
http://eggnog-mapper.embl.de/about

saras224 commented

Hey @colinbrislawn, I thought this platform is where we could ask about the issues we face while using eggNOG.
My concern is how I can make the process faster to get the eggNOG annotation spreadsheet results. My input files are assembled contigs ranging from 14 megabase pairs to 569 megabase pairs. When I ran eggNOG-mapper on subsampled files of about 14 Mbp each, it took 20 days to produce the annotations. So my question is: how can I make the analysis faster for annotating the assembled metagenomic contigs? Which search mode should I use (MMseqs2 or DIAMOND)? And what should the name of the output directory be?

Cantalapiedra (Collaborator) commented

Dear @saras224,

Sorry for the delay in answering.
Could you please give details of how you ran eggnog-mapper?
Also, depending on the computational infrastructure available to you, some options may or may not apply.

In general, it is important to know that eggNOG-mapper has 2 main stages (plus some additional ones, like the gene prediction stage, which you may be using for your contigs). The first is the "search" stage and the second is the "annotation" stage; you can even run these stages separately (see the sketch below). If you are using large contigs as input, my advice is to use Prodigal for the gene prediction step, or to use proteins or CDS as input if you already have predictions from Prodigal or from other means.
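A minimal sketch of running the two stages separately, assuming the eggNOG-mapper v2.1.x command line (the flags --no_annot, -m no_search, --annotate_hits_table, --itype and --genepred, and all file names, are assumptions to adapt):

# Stage 1 (sketch): gene prediction plus search only, skipping annotation.
# --itype metagenome with --genepred prodigal runs Prodigal on the contigs.
emapper.py -m diamond --itype metagenome --genepred prodigal \
    --no_annot -i contigs.fasta -o stage1 --cpu 16

# Stage 2 (sketch): annotate the seed orthologs table produced by stage 1.
emapper.py -m no_search --annotate_hits_table stage1.emapper.seed_orthologs \
    -o stage1_annot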

During the search stage, DIAMOND should be fine and rather fast (at least with the default parameters). Note also that you should tune the filter thresholds to your needs. For the search step, the more CPUs you can assign to a given job the better, and the more sequences you feed into a single DIAMOND process, the faster it will be in the end (see the sketch below).
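For instance, a search-only DIAMOND run with explicit thresholds might look like this (a sketch; the flag names assume v2.1.x, and the threshold values are placeholders, not recommendations):

# Search stage (sketch): one large DIAMOND job with many CPUs and
# explicit filter thresholds. Tune the values to your own needs.
emapper.py -m diamond --no_annot -i proteins.faa -o search_run \
    --cpu 32 --evalue 1e-5 --pident 40 --query_cover 20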

During the annotation step there is one option that makes everything faster, --dbmem, but you need at least 44 GB of RAM to use it (the whole eggNOG-mapper annotation database is loaded into memory). During this stage the number of CPUs is not so important; instead, split the input into as many separate jobs as possible, if your hardware can fit the 44 GB of RAM multiple times (see the sketch below).
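A sketch of the annotation stage with --dbmem, again under v2.1.x assumptions and with hypothetical file names:

# Annotation stage (sketch): load the annotation database fully into
# RAM (needs ~44 GB) and annotate a seed orthologs table from the search.
emapper.py -m no_search --dbmem \
    --annotate_hits_table search_run.emapper.seed_orthologs -o annot_run

If the hardware can hold several 44 GB copies at once, the seed orthologs table can be split into chunks and each chunk annotated in its own --dbmem job.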

For more details, please check the wiki of the GitHub project: https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2.1.5-to-v2.1.12

I hope this is of help, but if you provide your specific details it will be easier to advise you.

Best,
Carlos
