
advice for increasing speed of --annotate_hits_table #80

Closed

colinbrislawn opened this issue Feb 2, 2018 · 8 comments

colinbrislawn commented Feb 2, 2018

I'm trying to annotate 800k proteins on a compute cluster, and the instructions for large-scale analysis have been great!

However, I'm running into serious I/O limits with the final --annotate_hits_table step, especially when running on worker nodes. You mention:

We usually annotate at a rate of 300-400 proteins per second using 10 CPU cores and having eggnog.db under the /dev/shm disk, but you can of course run many of those instances in parallel.

How do I set up local caching? I tried copying the database to /dev/shm, but this was not recognized and I can't find an option to set a database directory.

Thank you for supporting this excellent software.

jhcepas (Member) commented Feb 3, 2018

@colinbrislawn you can either 1) move the whole emapper directory to /dev/shm, or 2) move just the data/ dir and then point to it with the --data_dir option.
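For example, option 1 might look like this (a minimal sketch; the install path and the seed orthologs file name are hypothetical placeholders):

# Option 1 (sketch): place the whole emapper install, data included,
# on the RAM disk and run it from there. Paths are placeholders.
cp -r /path/to/eggnog-mapper /dev/shm/eggnog-mapper
/dev/shm/eggnog-mapper/emapper.py --annotate_hits_table my.seed_orthologs -o my_run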

jhcepas closed this as completed Feb 3, 2018
colinbrislawn (Author) commented Feb 5, 2018

Thank you for the fast response!

Here is how I solved this problem:

# go to eggnog mapper dir
cd /people/me/bin/anaconda/envs/eggnog/lib/python2.7/site-packages/data/

# unzip the database directly onto the RAM disk. Reading the .gz version is faster than reading the uncompressed 22GB file.
pigz -dc eggnog.db.gz > /dev/shm/eggnog.db

# Point to the database on the RAM disk
emapper.py --annotate_hits_table --data_dir /dev/shm/ ... <other settings go here>

Performance was fantastic:
Processed queries:596878 total_time: 701.507770061 rate:850.85 q/s
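One general caveat, not from the thread itself: /dev/shm is shared memory, and files there persist until deleted or the node reboots, so on a shared cluster it is worth freeing the space once the job finishes:

# Free the shared-memory copy of the database when done.
rm /dev/shm/eggnog.db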

jhcepas (Member) commented Feb 6, 2018

Awesome. I didn't notice you were in a conda environment!
Your approach is just right.

evolu-tion commented

That's a fantastic approach!
Thank you so much for sharing this technique.

saras224 commented Jun 7, 2024

Hi guys!
I also have a large metagenomic dataset. I want to run eggNOG-mapper to obtain annotation files for the assembled metagenomic dataset. Can anyone (@evolu-tion?) guide me on how to do the analysis faster? I have also installed eggNOG-mapper inside a conda environment.

colinbrislawn (Author) commented Jun 7, 2024

Hello @saras224 👋

guide me on how to do the analysis faster?

Are you asking for some free advice or would you like to hire an expert?

Either way, I would write to the folks on the current team listed here:
http://eggnog-mapper.embl.de/about

saras224 commented

Hey @colinbrislawn, I thought this platform is where we could ask about the issues we face while using eggNOG.
My concern is how I can make the process faster to get the eggNOG annotation spreadsheet results. My input files are assembled contigs ranging from 14 megabase pairs to 569 megabase pairs. When I ran eggNOG-mapper on subsampled files of about 14 Mbp each, it took 20 days to produce the annotations. So my question is: how can I make the analysis faster for annotating the assembled metagenomic contigs? Which search mode should I use (MMseqs2 or DIAMOND)? And what should the name of the output directory be?

Cantalapiedra (Collaborator) commented

Dear @saras224,

Sorry for the delay in answering.
Could you please give details of how you ran eggnog-mapper?
Also, depending on the computational infrastructure available to you, some options may or may not apply.

In general, it is important to know that eggNOG-mapper has 2 main stages (plus some additional ones, like the gene prediction stage, which you may be using for your contigs). The first is the "search" stage and the second is the "annotation" stage; you can even run these stages separately (see the sketch below). If you are using large contigs as input, my advice is to use Prodigal for the gene prediction step, or to use proteins or CDS as input if you already have predictions from Prodigal or from other means.
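A minimal sketch of running the two stages separately, assuming the eggNOG-mapper v2.1.x command line (the flags --no_annot, -m no_search, --annotate_hits_table, --itype and --genepred, and all file names, are assumptions to adapt):

# Stage 1 (sketch): gene prediction plus search only, skipping annotation.
# --itype metagenome with --genepred prodigal runs Prodigal on the contigs.
emapper.py -m diamond --itype metagenome --genepred prodigal \
    --no_annot -i contigs.fasta -o stage1 --cpu 16

# Stage 2 (sketch): annotate the seed orthologs table produced by stage 1.
emapper.py -m no_search --annotate_hits_table stage1.emapper.seed_orthologs \
    -o stage1_annot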

During the search stage, DIAMOND should be fine and rather fast (at least with the default parameters). Note also that you should tune the filter thresholds to your needs. For the search step, the more CPUs you can assign to a given job the better, and the more sequences you feed into a single DIAMOND process, the faster it will be in the end (see the sketch below).
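For instance, a search-only DIAMOND run with explicit thresholds might look like this (a sketch; the flag names assume v2.1.x, and the threshold values are placeholders, not recommendations):

# Search stage (sketch): one large DIAMOND job with many CPUs and
# explicit filter thresholds. Tune the values to your own needs.
emapper.py -m diamond --no_annot -i proteins.faa -o search_run \
    --cpu 32 --evalue 1e-5 --pident 40 --query_cover 20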

During the annotation step there is one option that makes everything faster, --dbmem, but you need at least 44 GB of RAM to use it (the whole eggNOG-mapper annotation database is loaded into memory). During this stage the number of CPUs is not so important; instead, split the input into as many separate jobs as possible, if your hardware can fit the 44 GB of RAM multiple times (see the sketch below).
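A sketch of the annotation stage with --dbmem, again under v2.1.x assumptions and with hypothetical file names:

# Annotation stage (sketch): load the annotation database fully into
# RAM (needs ~44 GB) and annotate a seed orthologs table from the search.
emapper.py -m no_search --dbmem \
    --annotate_hits_table search_run.emapper.seed_orthologs -o annot_run

If the hardware can hold several 44 GB copies at once, the seed orthologs table can be split into chunks and each chunk annotated in its own --dbmem job.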

For more details, please check the wiki of the GitHub project: https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2.1.5-to-v2.1.12

I hope this is of help, but if you provide your specific details it will be easier to advise you.

Best,
Carlos
