Jaime Huerta-Cepas edited this page May 17, 2019 · 1 revision

Requirements

Software Requirements:

  • Python 2.7+
  • wget
  • HMMER 3 and/or DIAMOND binaries (if they are not available on your system, the ones packaged with eggNOG-mapper are used)
  • BioPython (required only if using the --translate option)

Storage Requirements:

  • ~20GB for the eggNOG annotation database

  • ~20GB for eggNOG fasta files

  • ~130GB for the three optimized eggNOG databases (euk, bact, arch), and from 1GB to 35GB for each taxonomic-specific eggNOG HMM database. You don't have to download all of them; just pick the ones you are interested in.

(you can check the size of individual datasets at http://beta-eggnogdb.embl.de/download/eggnog_4.5/hmmdb_levels/)

Memory requirements:

eggnog-mapper can run very fast searches by loading the target HMM databases into memory (using the HMMER3 hmmpgmd program). This is enabled with the --usemem or --servermode flags, and it requires a lot of RAM (depending on the size of the target database). As a reference:

  • ~90GB to load the optimized eukaryotic databases (euk)
  • ~32GB to load the optimized bacterial database (bact)
  • ~10GB to load the optimized archaeal database (arch)

Note:

  • Searches are still possible on low-memory systems. However, disk I/O will be a bottleneck (especially when using multiple CPUs). Place the databases on the fastest disk possible.
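
As a rough illustration of the reference figures above, the following minimal sketch reports which optimized databases could plausibly be loaded with --usemem on a machine with a given amount of RAM. The function name and the idea of passing the RAM size by hand are assumptions made for this example; they are not part of eggnog-mapper, and the figures are only the approximate values quoted above.

```python
# Approximate RAM (in GB) needed to load each optimized database,
# taken from the reference figures listed above.
REFERENCE_RAM_GB = {"euk": 90, "bact": 32, "arch": 10}


def databases_fitting_in_memory(total_ram_gb):
    """Return the optimized databases whose reference RAM need is covered.

    Hypothetical helper: it only compares against the reference figures
    and ignores memory used by other processes.
    """
    return sorted(
        db for db, needed in REFERENCE_RAM_GB.items() if total_ram_gb >= needed
    )


print(databases_fitting_in_memory(64))
```

For databases that do not fit, disk-based searches (without --usemem) remain the fallback.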

Installation

Download

git clone https://github.com/jhcepas/eggnog-mapper.git

eggNOG database retrieval

  • eggNOG mapper provides 107 taxonomically restricted HMM databases (xxxNOG), three optimized databases [Eukaryota (euk), Bacteria (bact) and Archaea (arch)] and a virus-specific database (viruses).

  • The three optimized databases include all HMM models from their corresponding taxonomic levels in eggNOG (euNOG, bactNOG, arNOG) plus additional models obtained by splitting large alignments into taxonomically restricted (smaller) HMM models. In particular, HMM models with more than 500 (euk) or 50 (bact) sequences are expanded. The arch database includes all models from all archaeal taxonomic levels in eggNOG.

  • Taxonomically restricted databases are listed here. They can be referred to by their code (e.g. maNOG for Mammals).

To download a given database, execute the download script, providing the list of databases to fetch:

download_eggnog_data.py euk bact arch viruses

This will fetch and decompress all precomputed eggNOG data into the data/ directory.

Basic Usage

HMMER based searches

  • Disk based searches on the optimized bacterial database
python emapper.py -i test/polb.fa --output polb_bact -d bact
  • Disk based searches on the optimized database of viral models
python emapper.py -i test/polb.fa --output polb_viruses  -d viruses
  • Disk based searches on the mammal specific database
python emapper.py -i test/p53.fa --output p53_maNOG -d maNOG
  • Memory based searches on the mammal specific database
python emapper.py -i test/p53.fa --output p53_maNOG -d maNOG --usemem
  • Using DIAMOND as search method.

Note that no target database is required when using DIAMOND.

python emapper.py -i test/p53.fa --output p53_maNOG  -m diamond

[project_name].emapper.hmm_hits file

For each query sequence, a list of significant hits to eggNOG Orthologous Groups (OGs) is reported. Each line in the file represents a hit, and reports the e-value, bit-score, query coverage and sequence coordinates of the match. If multiple hits exist for a given query, results are sorted by e-value.

[project_name].emapper.seed_orthologs file

Each line in the file provides the best match of each query within the best Orthologous Group (OG) reported in the [project_name].emapper.hmm_hits file, obtained by running PHMMER against all sequences within the best OG. The seed ortholog is used to fetch fine-grained orthology relationships from eggNOG. If using the diamond search mode, seed orthologs are directly obtained from the best matching sequences by running DIAMOND against the whole eggNOG protein space.

[project_name].emapper.annotations file

This file provides final annotations of each query. Tab-delimited columns in the file are:

  1. query_name: query sequence name
  2. seed_eggNOG_ortholog: best protein match in eggNOG
  3. seed_ortholog_evalue: best protein match (e-value)
  4. seed_ortholog_score: best protein match (bit-score)
  5. predicted_gene_name: Predicted gene name for query sequences
  6. GO_terms: Comma delimited list of predicted Gene Ontology terms
  7. KEGG_KO: Comma delimited list of predicted KEGG KOs
  8. BiGG_Reactions: Comma delimited list of predicted BiGG metabolic reactions
  9. Annotation_tax_scope: The taxonomic scope used to annotate this query sequence
  10. Matching_OGs: Comma delimited list of matching eggNOG Orthologous Groups
  11. best_OG|evalue|score: Best matching Orthologous Groups (only in HMM mode)
  12. COG functional categories: COG functional category inferred from best matching OG
  13. eggNOG_HMM_model_annotation: eggNOG functional description inferred from best matching OG

Advanced usage

Speeding up annotation using memory-based, multi-threaded searches

If only one input file is going to be annotated, simply use the --usemem and --cpu XX options. For instance:

python emapper.py -i test/polb.fa --output polb_pfam -d pfam/pfam.hmm --usemem --cpu 10

If you are planning to use the same database for annotating multiple files, you can start eggnog-mapper in server mode (this will load the target database in memory and keep it there until stopped). Then you can use another eggnog-mapper instance to connect to the server. For instance:

In terminal 1, execute:

python emapper.py -d arch --cpu 10 --servermode

This will load the database into memory and give you the address to connect to it. Then, in a different terminal, execute:

python emapper.py -d arch:localhost:51600 -i test/polb.fa -o polb_arch
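
When several files are annotated against the same running server, the client command above is simply repeated with a different input and output name. The sketch below builds those command lines; `client_commands` is a hypothetical helper, and the host and port defaults are only the values from the example above (use the address that the server actually prints).

```python
import os


def client_commands(fasta_paths, db="arch", host="localhost", port=51600):
    """Build one emapper.py client command per input FASTA file.

    Only the -d, -i and -o options shown in the example above are used.
    The output prefix is derived from the input file name, which is an
    arbitrary naming choice for this sketch.
    """
    commands = []
    for path in fasta_paths:
        prefix = os.path.splitext(os.path.basename(path))[0] + "_" + db
        target = "%s:%s:%d" % (db, host, port)
        commands.append(
            ["python", "emapper.py", "-d", target, "-i", path, "-o", prefix]
        )
    return commands


for cmd in client_commands(["test/polb.fa"]):
    print(" ".join(cmd))
```

Each list can be passed to subprocess.call, or printed and submitted to a job queue.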

Mapping to custom databases

You can also provide a custom hmmpressed HMMER3 database. For this, just provide the path and base name of the database (without the .h3f-like extension).

python emapper.py -i test/polb.fa --output polb_pfam -d pfam/pfam.hmm

The following recommendations are based on our experience annotating huge genomic and metagenomic datasets (>100M proteins).

eggNOG mapper works in two phases: 1) finding seed orthologous sequences and 2) expanding annotations. Phase 1 is mainly CPU intensive, while phase 2 is more about disk operations. You can therefore optimize the annotation of huge files by running each phase on a different setup.

Phase 1. Homology searches

  1. Split your input FASTA file into chunks, each containing a moderate number of sequences (1M seqs per file worked well in our tests). We usually work with FASTA files where each sequence is on a single line, so splitting is very simple: every sequence takes two lines, hence 2,000,000 lines per chunk.
split -l 2000000 -a 3 -d input_file.faa input_file.chunk_
  2. Use diamond mode. Each chunk can be processed independently on a cluster node, and you should tell emapper.py not to run the annotation phase yet (--no_annot). This way you can parallelize diamond searches as much as you want, even when running from a shared file system. Assuming an example with 100M proteins, the above command will generate 100 file chunks, and each should run diamond using 16 cores. The commands to submit to the cluster queue can be generated with something like this:
# generate all the commands that should be distributed in the cluster
for f in *.chunk_*; do
echo ./emapper.py -m diamond --no_annot --no_file_comments --cpu 16 -i $f -o $f; 
done
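
The split command in step 1 relies on every sequence occupying exactly one line. If your FASTA file wraps sequences over several lines, it needs to be linearized first. The following is a minimal sketch of that conversion, assuming Python 3; `linearize_fasta` and `linearize_file` are hypothetical helpers written for this example, not part of eggnog-mapper.

```python
def linearize_fasta(lines):
    """Yield FASTA lines with each sequence joined onto a single line.

    Blank lines are dropped, and any sequence text appearing before the
    first ">" header is ignored.
    """
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: flush the previous one, if any.
            if header is not None:
                yield header
                yield "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    # Flush the last record.
    if header is not None:
        yield header
        yield "".join(chunks)


def linearize_file(src_path, dst_path):
    """Write a two-lines-per-sequence copy of src_path to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in linearize_fasta(src):
            dst.write(line + "\n")
```

After this step every record takes exactly two lines, so split -l 2000000 yields 1M sequences per chunk without cutting a sequence in half.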

Phase 2. Orthology and functional annotation

The annotation phase needs to query data/eggnog.db intensively. This file is a sqlite3 database, so it is highly recommended that the file lives under the fastest local disk possible. For instance, we store eggnog.db in SSD disks or, if possible, under /dev/shm (memory based filesystem).

  1. Concatenate all chunk_*.emapper.seed_orthologs files.
cat *.chunk_*.emapper.seed_orthologs > input_file.emapper.seed_orthologs
  2. Run the ortholog search and annotation phase on a single multi-core machine (10 cores in our example), reading from a fast disk.
emapper.py --annotate_hits_table input_file.emapper.seed_orthologs --no_file_comments -o output_file --cpu 10

We usually annotate at a rate of 300-400 proteins per second using 10 CPU cores and having eggnog.db under /dev/shm, but you can of course run many of those instances in parallel. If you are running emapper.py from a conda environment, check these tips.
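
To get a feeling for what this rate means for the 100M-protein example, the short calculation below converts it into wall-clock hours. It is a back-of-the-envelope sketch: `annotation_hours` is a hypothetical helper, and the assumption that parallel instances scale linearly is optimistic, since they all compete for the same eggnog.db disk.

```python
def annotation_hours(n_proteins, proteins_per_second, n_instances=1):
    """Estimate annotation time in hours.

    Assumes throughput scales linearly with the number of parallel
    instances, which will not hold once disk access saturates.
    """
    seconds = float(n_proteins) / (proteins_per_second * n_instances)
    return seconds / 3600.0


# 100M proteins at the lower and upper reported rates, single instance.
for rate in (300, 400):
    print(rate, "proteins/s:", round(annotation_hours(100 * 10**6, rate), 1), "hours")
```

With a single instance this comes to roughly 69 to 93 hours, which is why running several instances in parallel is worth considering for datasets of this size.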

And voilà, you have your annotations.