-
Notifications
You must be signed in to change notification settings - Fork 107
eggNOG mapper v2.1.2 to v2.1.4
EggNOG-mapper (a.k.a. emapper.py
or just emapper) is a tool for fast functional annotation of novel sequences. It uses precomputed orthologous groups (OGs) and phylogenies from the eggNOG database (http://eggnogdb.embl.de/) to transfer functional information from fine-grained orthologs only.
Common uses of eggNOG-mapper include the annotation of novel genomes, transcriptomes or even metagenomic gene catalogs.
The use of orthology predictions for functional annotation permits a higher precision than traditional homology searches (i.e. BLAST searches), as it avoids transferring annotations from close paralogs (duplicate genes with a higher chance of being involved in functional divergence).
Benchmarks comparing different eggNOG-mapper options against BLAST and InterProScan are available at https://github.com/jhcepas/emapper-benchmark/blob/master/benchmark_analysis.ipynb.
EggNOG-mapper is also available as a public online resource: http://eggnog-mapper.embl.de
- (master branch) Bug fix (issues #313 and #319), when parsing GFF files with "." in the "score" field.
-
(master branch)
--dmnd_frameshift INT
option to control Diamond's--frameshift/-F
option. -
(main_spec branch) try to address the
__spec__
bug (issue #299). Check also pre-release https://github.com/eggnogdb/eggnog-mapper/releases/tag/2.1.4-main_spec
https://github.com/eggnogdb/eggnog-mapper/releases/tag/2.1.4
-
--excel
option to output annotations in .xlsx format. - Recalibration of PFAM annotation from orthologs.
https://github.com/eggnogdb/eggnog-mapper/releases/tag/2.1.3
- Bug fix (issue #301), when running
--pfam_realign realign
and there are seeds without annotations. - Bug fix (issue #303), when running
--pfam_realign denovo
and there are 100 queries or more (so thathmmpgmd
servers are used). - Other minor changes (see issue #302).
https://github.com/eggnogdb/eggnog-mapper/releases/tag/2.1.2
- Orthologs retrieved bottom-up from narrowest OG to max annotation level OG.
- Changes in format of annotations and gff output files (check Output format).
- Python 3.7 (or greater)
- BioPython 1.76 (python package)
- psutil 5.7.0 (python package)
- xlsxwriter 1.4.3 (python package), only if using the
--excel
option (v2.1.4 only) - wget (linux command, required for downloading the eggNOG-mapper databases with
download_eggnog_data.py
, and to create new Diamond/MMseqs2 databases withcreate_dbs.py
) - sqlite (>=3.8.2)
- ~40 GB for the eggNOG annotation databases (eggnog.db and eggnog.taxa.db)
- ~9 GB for Diamond database of eggNOG sequences (required if using
-m diamond
, which is the default search mode). - ~11 GB for MMseqs2 database of eggNOG sequences (~86 GB if MMseqs2 index is created) (required if using
-m mmseqs
). - ~3 GB for PFAM database (required if using
--pfam_realign
options for realignment of queries to PFAM domains). - The size of eggNOG diamond/mmseqs databases created with
create_dbs.py
is highly variable, depending on the size of the chosen taxonomic groups.
- Using
--dbmem
loads the whole eggnog.db sqlite3 annotation database during the annotation step, and therefore requires ~44 GB of memory. - Using the
--num_servers
option when running HMMER in server mode (a.k.a.hmmgpmd
, which is used for-m hmmer --usemem
,--pfam_realign denovo
orhmm_server.py
) loads the HMM database as many times as specified in the argument (e.g.--pfam_realign denovo --num_servers 2
loads the PFAM database into memory twice, with up to roughly 2 GB per instance).
https://pypi.org/project/eggnog-mapper/
pip install eggnog-mapper
https://anaconda.org/bioconda/eggnog-mapper
conda install -c bioconda eggnog-mapper
https://github.com/eggnogdb/eggnog-mapper/releases
- Download the latest version of eggnog-mapper from the next link: https://github.com/eggnogdb/eggnog-mapper/releases/latest
- Decompress the
.tar.gz
or.zip
file - Enter the decompressed directory and install the dependencies, either with:
- setuptools:
python setup.py install
- pip:
pip install -r requirements.txt
- conda:
conda install --file requirements.txt
- setuptools:
- Download (clone) the repository:
git clone https://github.com/eggnogdb/eggnog-mapper.git
- Enter the repository directory and install the dependencies, either with:
- setuptools:
python setup.py install
- pip:
pip install -r requirements.txt
- conda:
conda install --file requirements.txt
- setuptools:
If you want to be sure that eggNOG-mapper is using the bundled binaries for external tools (prodigal, hmmer, diamond, mmseqs), it may help adding the emapper scripts and binaries to your environment or PATH variable. If, for example, your eggnog-mapper path was /home/user/eggnog-mapper
:
export PATH=/home/user/eggnog-mapper:/home/user/eggnog-mapper/eggnogmapper/bin:"$PATH"
Also, if you want to store eggNOG-mapper databases in a specific directory, you may wish to create an environment variable to avoid using --data_dir
in all your commands. For example:
export EGGNOG_DATA_DIR=/home/user/eggnog-mapper-data
Next step would be downloading the eggNOG-mapper databases, running the next script:
download_eggnog_data.py
This will download the eggNOG annotation database (along with the taxa databases), and the database of eggNOG proteins for Diamond searches.
If no EGGNOG_DATA_DIR
variable was defined and no --data_dir
option was given to download_eggnog_data.py
, the latter will try to download the files to the data/
directory within your eggnog-mapper directory.
Note that the Diamond DB is optional. You may decide to perform instead HMMER or MMseqs2 searches, or you may wish to create a Diamond or MMseqs2 DB specific of a taxonomic range. To do so, check download_eggnog_data.py --help
(for downloading HMMER databases or whole Diamond or MMseqs2 databases), and check create_dbs.py --help
(to create taxa-specific Diamond/MMseqs2 databases). Also, you may wish to download the PFAM database to be able to run PFAM realignemnts with --pfam_realign realign
or --pfam_realign denovo
.
For example, some options of download_eggnog_data.py
are:
- The
-P
flag is required to download the PFAM database. - The
-M
flag is required to download the MMseqs2 database. The whole MMseqs2 database includes eggNOG proteins which do not belong to any eggNOG Orthologous Group (OG), whereas the Diamond database only includes those which belong to an OG. Also, note that no MMseqs2 index is provided. To create it, you could use themmseqs createindex "$EGGNOG_DATA_DIR"/mmseqs tmp
command (see https://mmseqs.com/latest/userguide.pdf for more details). - The
-H -d taxID
flag is required to download a HMMER database for a given taxID (check list of databases at http://eggnog5.embl.de/#/app/downloads).
create_dbs.py
. For example, to create a diamond database for Bacteria only:
create_dbs.py -m diamond --dbname bacteria --taxa Bacteria
This will create a bacteria.dmnd
diamond database in the default data directory or the one specified in EGGNOG_DATA_DIR
environment variable. Such database can be used with emapper.py --dmnd_db bacteria.dmnd
. Note that the first time create_dbs.py
is used it will take time to download the eggNOG proteins and create the Diamond or MMseqs2 database. Next calls to create_dbs.py
(to the same data directory pointed by EGGNOG_DATA_DIR
, or --data_dir
, or data/
by default) will not need to download the eggnog5 proteins again. If no more databases are going to be created, the proteins can be removed. For further info, check create_dbs.py --help
.
Depending on the workflow being used with eggNOG-mapper you will need different external tools. Nonetheless, all of them are actually included, bundled, along with eggNOG-mapper. If you are running eggNOG-mapper fine, you may not need to install anything else.
However, the bundled tools are compiled binaries and could cause trouble in some systems, or could not be the most optimized compiled binaries for your system. In such cases, you may wish to install some or all of these tools independently. The tools are:
- Prodigal: required if using
--itype genome
or--itype metagenome
along with the option--genepred prodigal
. Current bundled version is V2.6.3: February, 2016. - Diamond: required to run the search steps with
-m diamond
. Current bundled version is 2.0.4. - MMseqs2: required to run the search steps with
-m mmseqs
. Current bundled version is 113e3212c137d026e297c7540e1fcd039f6812b1. - HMMER: required to run the search steps with
-m hmmer
, to run the HMMER based scripts (hmm_mapper.py
,hmm_server.py
,hmm_worker.py
), and to perform realignments to PFAM with--pfam_realign realign
or--pfam_realign denovo
. Current bundled version is HMMER 3.1b2 (February 2015).
To start an annotation job, provide a FASTA file containing your query sequences (-i
option), specify a project name which will be used as a prefix for all the output files (-o
option), and run emapper.py
emapper.py -i FASTA_FILE_PROTEINS -o test
diamond blastp
search, and for those queries with hits to eggNOG proteins, will carry out functional annotation.
- Run search and annotation, using Diamond in blastx mode
emapper.py -m diamond --itype CDS -i FASTA_FILE_NTS -o test
- Run search and annotation, using MMseqs after translating input CDS to proteins. Add the search and annotation results to the attributes of an existing GFF file (GFF decoration), using the GeneID field to link features from the GFF to the annotation results.
emapper.py -m mmseqs --itype CDS --translate -i FASTA_FILE_CDS -o test \
--decorate_gff MY_GFF_FILE --decorate_gff_ID_field GeneID
- Run search and annotation for assembled contigs, using MMseqs2 "blastx" hits for gene prediction
emapper.py -m mmseqs --itype metagenome -i FASTA_FILE_NTS -o test
- Run search and annotation for a genome, using Diamond search on proteins predicted by Prodigal, changing the output directory.
emapper.py -m diamond --itype genome --genepred prodigal \
-i FASTA_FILE_NTS -o test --output_dir /home/me/mydir
- Run gene prediction using a genome to train Prodigal
emapper.py -m mmseqs --itype genome --genepred prodigal -i FASTA_FILE_NTS -o test \
--training_genome FASTA_FILE --training_file OUT_TRAIN_FILE
- Perform a 2-step (search + annotation) run, using Diamond in more-sensitive mode and loading the annotation DB into memory (--dbmem; requires ~44 GB free mem)
emapper.py -m diamond --sensmode more-sensitive --no_annot -i FASTA_FILE_PROTS -o test
emapper.py -m no_search --annotate_hits_file test.emapper.seed_orthologs -o test_annot_1 --dbmem
- Repeat the annotation step, using specific taxa as target and reporting the one-to-one orthologs found
emapper.py -m no_search --annotate_hits_file test.emapper.seed_orthologs -o test_annot_2 --dbmem \
--report_orthologs --target_orthologs one2one --target_taxa 72274,1123487
- Use HMMER to search a database of bacterial proteins, using current directory for temporary files, and using a "scratch" directory to write output on a different hard drive than the one used to read. Once emapper.py finishes, output files in the scratch dir will be moved to the actual output dir, and the scratch dir will be removed.
emapper.py -m hmmer -d bact -i FASTA_FILE_PROTS -o test --scratch_dir /scratch/test --temp_dir .
- Perform a Diamond search, and annotation also realigning queries to the PFAM domains found on the Orthologous Groups
emapper.py -i FASTA_FILE_PROTS -o test --pfam_realign realign
- Perform a Diamond search, and annotation also realigning queries to the whole PFAM database
emapper.py -i FASTA_FILE_PROTS -o test --pfam_realign denovo
- Perform a Diamond search and annotation, constraining the Orthologous Groups from which to retrieve annotations to the Bacteria taxon:
emapper.py -i FASTA_FILE_PROTS -o test --tax_scope Bacteria
- Perform a Diamond search and annotation, constraining the Orthologous Groups from which to retrieve annotations to a predefined tax scope (defined in the file eggnogmapper/annotation/tax_scopes/bacteria
) for Bacteria and its descendants. The difference with the previous example is that here the bacteria file used as tax scope contains a list of taxa instead of only Bacteria. If several OGs intersect with the scope, for a given protein, the narrowest will be used to transfer annotations to the query:
emapper.py -i FASTA_FILE_PROTS -o test --tax_scope bacteria
- Perform a Diamond search and annotation, constraining the Orthologous Groups from which to retrieve annotations to the Gammaproteobacteria tax scope. Also, from those OGs, force to retrieve annotations at the Bacteria level:
emapper.py -i FASTA_FILE_PROTS -o test --tax_scope Gammaproteobacteria --tax_scope_mode Bacteria
--version
- show version and exit.
--list_taxa
- List available taxonomic names and IDs for OGs and exit. These are valid tax names and IDs for
--tax_scope/--tax_scope_mode
- List available taxonomic names and IDs for OGs and exit. These are valid tax names and IDs for
--cpu NUM_CPU
- number of CPUs to be used whenever possible (diamond, annotation tasks, etc).
--cpu 0
to run with all available CPUs. Default is 1.
- number of CPUs to be used whenever possible (diamond, annotation tasks, etc).
- --resume
- resumes a previous emapper run, skipping results in existing output files.
- --override
- overwrites output files if they exist. By default, execution is aborted if conflicting files are detected.
-i FILE
- input FASTA file containing query sequences (proteins by default; see
--itype
and--translate
). Required unless-m no_search
.
- input FASTA file containing query sequences (proteins by default; see
--itype INPUT_TYPE
- Type of sequences in the input (
-i
) file. Default is--itype proteins
. All INPUT_TYPE options are shown the next table:
- Type of sequences in the input (
INPUT_TYPE | description |
---|---|
proteins | Input sequences are used directly for the search step. |
CDS |
Input sequences are used directly for Diamond and MMseqs2, which will run in "blastx" mode and will report the best hit, unless --translate is used, in which case input sequences will be translated to proteins and searched "blastp" mode. For HMMER, the input CDS are always translated to proteins (--translate is automatically activated).
|
genome |
Input sequences are considered chromosomes/contigs, and gene prediction will be performed (see --genepred ).
|
metagenome |
The same as genome, except that when --genepred prodigal Prodigal will be run in a different mode.
|
--translate
- When
--itype CDS --translate
, input sequences will be translated to proteins before search. If-m hmmer
and--itype CDS
, input sequences will be translated to proteins, as if--translate
was automatically activated. If-m diamond
or-m mmseqs
and--itype CDS
but no--translate
option is given, searches will be performed in "blastx" mode, and the best hit will be annotated. Note that this is different than using--itype genome
or--itype metagenome
, where more than one hit per query could be annotated.
- When
--annotate_hits_table FILE
- annotate TSV formatted table with at least 4 fields: query, hit, evalue, score. Usually, FILE is a emapper.seed_orthologs file from a previous eggNOG-mapper run. Required if
-m no_search
.
- annotate TSV formatted table with at least 4 fields: query, hit, evalue, score. Usually, FILE is a emapper.seed_orthologs file from a previous eggNOG-mapper run. Required if
-
-c FILE
,--cache FILE
- Annotations file with md5 checksums of sequences. Required if
-m cache
.
- Annotations file with md5 checksums of sequences. Required if
--data_dir DIR
- Specify a path to the eggNOG-mapper databases. By default,
data/
folder or the one specified by theEGGNOG_DATA_DIR
environment variable. The annotation DBs must be always present in this directory, whereas HMMER, Diamond and MMseqs2 DBs could be in a different directory when using-d/--database
,--dmnd_db
and--mmseqs_db
, respectively.
- Specify a path to the eggNOG-mapper databases. By default,
--genepred search|prodigal
- When
--itype genome
or--itype metagenome
is used, gene prediction is carried out. There are 2 gene prediction modes:
- When
- The default is
--genepred search
, which means that either Diamond or MMseqs2 (depending on-m
argument) is run in blastx mode. As of now, we cannot recommend using Diamond for complete genomes, unless the assembly is rather fragmented and/or contigs are not very large. MMseqs2 is faster than Diamond for assembled genomes, and it is the recommended other than for CDS. - If
--genepred prodigal
is specified, Prodigal is run for gene prediction, and the proteins predicted by Prodigal are used in the subsequent search and annotation steps. Prodigal will be run in a different mode depending whether--itype genome
or--itype metagenome
is used.
- The default is
--trans_table TRANS_TABLE_CODE
- Option to change the translation table used for CDS translation and gene prediction. It corresponds with Diamond's
--query-gencode
, MMseqs2's--translation-table
and Prodigal's-g/--trans-table
). Usually the value is an integer corresponding to a specific translation table (e.g. https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi). Check each program's documentation for more info. It is also applied when--translate
is used, both for input and output sequences.
- Option to change the translation table used for CDS translation and gene prediction. It corresponds with Diamond's
--training_genome FASTA_FILE
- FASTA file of the genome to be used for Prodigal's training mode. Requires
--itype genome --genepred prodigal
and also requires--training_file FILE
. Note training will be run only if the training file does NOT exist. If the training file already exists, the latter will be used directly for gene prediction, and training will be skipped.
- FASTA file of the genome to be used for Prodigal's training mode. Requires
--training_file FILE
- Training file to be created and/or used by Prodigal. If the training file does not exist, the training genome (
--training_genome
option) will be used to create a training file, and then immediately perform gene prediction from such training file. If the training file already exists, the training is skipped, the--training_genome
option is ignored, and gene prediction is performed using the existing training file.
- Training file to be created and/or used by Prodigal. If the training file does not exist, the training genome (
--allow_overlaps none|strand|diff_frame|all
- When
--genepred search --itype genome/metagenome
, this option controls whether overlapping blastx hits are annotated or not. By default,--allow_overlaps none
best hit is annotated among overlapping hits. If--allow_overlaps strand
, best non-overlapping hit from each strand is annotated. If--allow_overlaps diff_frame
, best non-overlapping hit from each frame in each strand is annotated. If--allow_overlaps all
, all hits are annotated.
- When
--overlap_tol FLOAT
- Overlap tolerance (0.0 - 1.0). Two hits are considered to overlap if (overlap_size / hit length) > overlap_tol for at least one of the hits. By default, 0.0, any overlap is considered as such. In contrast,
--overlap_tol 1.0
means that one of the hits must overlap entirely to consider that hits do overlap to each other.
- Overlap tolerance (0.0 - 1.0). Two hits are considered to overlap if (overlap_size / hit length) > overlap_tol for at least one of the hits. By default, 0.0, any overlap is considered as such. In contrast,
-m MODE
- how input queries will be searched against eggNOG sequences. Default is
-m diamond
. All MODE options are shown in the next table:
- how input queries will be searched against eggNOG sequences. Default is
MODE | description | Notes |
---|---|---|
diamond | search queries against eggNOG sequences using Diamond. |
Requires -i FILE .
|
hmmer | search sequences/hmm against sequences/hmm using HMMER. |
Requires -i FILE and -d DB_NAME .
|
mmseqs | search queries against eggNOG sequences using MMseqs2. |
Requires -i FILE .
|
cache |
annotate queries using a previously annotated file (-c ) which includes md5 hashes of the annotated sequences.
|
Requires -i FILE and -c FILE .
|
no_search | skip search stage. Annotate an existing emapper.seed_orthologs file. |
Requires --annotate_hits_table FILE , unless --no_annot is used.
|
--pident FLOAT
- report only alignments equal or above this percentage of identity threshold (0.0 - 100.0). Default None. No effect if
-m hmmer
.
- report only alignments equal or above this percentage of identity threshold (0.0 - 100.0). Default None. No effect if
--evalue FLOAT
- report only alignments equal or above this e-value threshold. Default 0.001.
--score FLOAT
- report only alignments equal or above this bit score threshold. Default None.
--query_cover FLOAT
- report only alignments equal or above this query coverage fraction threshold (0.0 - 100.0). Default None. No effect if
-m hmmer
.
- report only alignments equal or above this query coverage fraction threshold (0.0 - 100.0). Default None. No effect if
--subject_cover FLOAT
- report only alignments equal or above this target (eggNOG sequence) coverage fraction threshold (0.0 - 100.0). Default None. No effect if
-m hmmer
.
- report only alignments equal or above this target (eggNOG sequence) coverage fraction threshold (0.0 - 100.0). Default None. No effect if
--dmnd_db FILE
- path to diamond-compatible database. Useful to specify a location different than
data/
,EGGNOG_DATA_DIR
or--data_dir
.
- path to diamond-compatible database. Useful to specify a location different than
--sensmode DIAMOND_SENS_MODE
- either fast, mid-sensitive, sensitive, more-sensitive, very-sensitive or ultra-sensitive. Default is sensitive (to be sure, check the default for your version in
emapper.py --help
).
- either fast, mid-sensitive, sensitive, more-sensitive, very-sensitive or ultra-sensitive. Default is sensitive (to be sure, check the default for your version in
--matrix MATRIX_NAME
- which substitution matrix to be used by diamond, among BLOSUM62,BLOSUM90,BLOSUM80,BLOSUM50,BLOSUM45,PAM250,PAM70,PAM30.
--dmnd_frameshift INT
- controls Diamond's
--frameshift/-F
option. Default is Diamond's default (disabled). (master branch only)
- controls Diamond's
--gapopen INT
- gap open penalty used by diamond. Default is Diamond's default.
--gapextend INT
- gap extend penalty used by diamond. Default is Diamond's default.
--block_size FLOAT
- Diamond's -b/--block-size option. Default is Diamond's default.
--index_chunks INT
- Diamond's -c/--index-chunks option. Default is Diamond's default.
--outfmt_short
- Diamond will produce only the query, subject, evalue and score fields in its output, and seed_orthologs file will have only those fields also. This option could be useful to obtain better performance when no thresholds for pident, and query and subject coverages are used (for more info, see Diamond docs about traceback).
--mmseqs_db FILE
- path to MMseqs2-compatible database. Useful to specify a location different than data/, EGGNOG_DATA_DIR, or --data_dir.
--start_sens FLOAT
- Starting sensitivity for MMseqs2 iterative searches. Default 3.
--sens_steps INT
- Number of iterative searches with different sensitivities for MMseqs2. Default 3.
--final_sens FLOAT
- Final sensitivity for MMseqs2 iterative searches. Default 7.
--mmseqs_sub_mat MMSEQS_SUB_MAT
- Matrix to be used for --sub-mat option of MMseqs2. Default: the default one used by MMseqs2.
-
-d DB_NAME
,--database DB_NAME
- specify the target database for sequence searches. DB_NAME should be the name of a database downloaded using
download_eggnog_data.py -H -d taxID
, or such a database loaded in a server (e.g.db.hmm:host:port
; seehmm_server.py --help
)
- specify the target database for sequence searches. DB_NAME should be the name of a database downloaded using
--servers_list FILE
- A FILE with a list of remote hmmpgmd servers. Each row in the file represents a server, in the format host:port. If
--servers_list
is specified, host and port from-d
option will be ignored.
- A FILE with a list of remote hmmpgmd servers. Each row in the file represents a server, in the format host:port. If
--qtype QUERY_TYPE
- hmm or seq. Type of input data (
-i
).
- hmm or seq. Type of input data (
--dbtype DB_TYPE
- hmmdb or seqdb. Type of data in DB (
-db
).
- hmmdb or seqdb. Type of data in DB (
--usemem
- Use this option to allocate the whole HMMER database in memory. If
--dbtype hmmdb
, the database must be a hmmpress-ed database. If--dbtype seqdb
, the database must be a HMMER-format database created with esl-reformat. Database will be unloaded after execution. Note that--usemem
is different than--dbmem
, which is used to load into memory the annotation DB.
- Use this option to allocate the whole HMMER database in memory. If
-
-p INT
,--port INT
- Port used to setup HMM server, when
--usemem
. Also used for--pfam_realign
modes.
- Port used to setup HMM server, when
--end_port PORT
- Last port to be used to setup HMM server, when
--usemem
. Also used for--pfam_realign
modes.
- Last port to be used to setup HMM server, when
--num_servers INT
- When using
--usemem
, specify the number of servers to fire up. By default, only 1 server is used. Note that cpus specified with--cpu
will be distributed among servers and workers. Also used for--pfam_realign
modes. It is important to consider that, for each server instance, the HMM database will be loaded into memory, and therefore memory consumption will grow as--num_servers
increases.
- When using
--num_workers INT
- When using
--usemem
, specify the number of workers per server to fire up. By default, cpus specified with--cpu
will be distributed among servers and workers. Also used for--pfam_realign
modes. In our tests--num_workers
has not the expected impact on performance, and increasing--num_servers
is required to get an actual speed boost, although the memory requirements must be met. However, this could be different in other systems, compilations, or versions of hmmpgmd.
- When using
--hmm_maxhits INT
- Max number of hits to report (0 to report all). Default=1.
--report_no_hits
- Whether queries without hits should be included in the output table.
--hmm_maxseqlen INT
- Ignore query sequences larger than this length. Default=5000"
--Z FLOAT
- Fixed database size used in phmmer/hmmscan allows comparing e-values among databases. Default=40,000,000
--cut_ga
- Adds the --cut_ga to HMMER commands (useful for Pfam mappings, for example). See HMMER documentation for more info.
--clean_overlaps CLEAN_OVERLAPS_MODE
- Removes hits which overlap, keeping only the one with best e-value. Default none. Use the all and clans options when performing a hmmscan type search (i.e. domains are in the database). Use the hmmsearch_all and hmmsearch_clans options when using a hmmsearch type search (i.e. domains are the queries). The clans and hmmsearch_clans options will only have effect for hits to/from PFAM.'
--no_annot
- perform only the search stage and skip functional annotation, reporting only seed orthologs (emapper.seed_orthologs file).
--dbmem
- Store the whole eggNOG sqlite DB into memory before retrieving the annotations. This requires ~44 GB of RAM memory available, but can increase annotation speed considerably. Database will be unloaded after execution.
--seed_ortholog_evalue FLOAT
- min e-value expected when searching for seed eggNOG ortholog. Queries not having a significant seed orthologs will not be annotated. Default=0.001
--seed_ortholog_score FLOAT
- min bit score expected when searching for seed eggNOG ortholog. Queries not having a significant seed orthologs will not be annotated. Default=60
--tax_scope FILE|PREDEFINED_FILENAME|LIST_OF_TAXA|none
- fix the taxonomic scope used for annotation, so only speciation events from a particular clade are used for functional transfer. More specifically, the
--tax_scope
list is intersected with the seed orthologs clades (i.e. the taxa from seed orthologs' OGs), and the resulting clades are used for annotation based on--tax_scope_mode
. Note that those seed orthologs without clades intersecting with--tax_scope
will be filtered out, and won't be annotated. Possible arguments for--tax_scope
are shown in the following table. Default is auto, which is a PREDEFINED_FILENAME.
- fix the taxonomic scope used for annotation, so only speciation events from a particular clade are used for functional transfer. More specifically, the
tax_scope | description |
---|---|
FILE | A path to a file which contains a list of tax IDs and/or tax names. |
PREDEFINED_FILENAME | The filename of a file stored in eggnogmapper/annotation/tax_scopes/, and which contains a list of tax IDs and/or tax names. Available ones are: auto (synonym all), auto_broad (synonym all_broad), all_narrow, archaea, bacteria, bacteria_broad, eukaryota, eukaryota_broad, and prokaryota_broad. |
LIST_OF_TAXA |
A comma-separated list of tax IDs and/or tax names. For example, --tax_scope 2759,2157,2,1 for Eukaryota, Archaea, Bacteria and root.
|
none | Do not filter out annotations based on taxonomic scope. |
--tax_scope_mode broadest|inner_broadest|inner_narrowest|narrowest|FILE|PREDEFINED_FILENAME|LIST_OF_TAXA
- A second layer to control taxonomic scope from which to retrieve annotations. If
--tax_scope
acts as a filter,--tax_scope_mode
determines which taxonomic level (i.e. which Orthologous Group) from the seed ortholog is used to retrieve annotations. Default is inner_narrowest. Possible arguments are:
- A second layer to control taxonomic scope from which to retrieve annotations. If
tax_scope_mode | description |
---|---|
broadest | Use the OG with the broadest clade. |
inner_broadest |
From the OG which intersect --tax_scope , use the one with the broadest clade.
|
inner_narrowest |
From the OG which intersect --tax_scope , use the one with the narrowest clade.
|
narrowest | Use the OG with the narrowest clade. |
FILE, PREDEFINED_FILENAME, LIST_OF_TAXA |
The same values accepted by --tax_scope , except none. If this option is used, the taxa from OGs is intersected first with --tax_scope , to filter out seed orthologs outside the taxonomic range. Secondly, taxa from OGs are intersected with --tax_scope_mode . From the remaining OGs, the one with the narrowest clade is used (as in inner_narrowest). This is useful when we want to filter out queries using a different taxonomic scope than the one to use for annotation. For example, we want to annotate only queries within the Gammaproteobacteria range, but use the Bacteria level OGs for annotation of those queries: --tax_scope Gammaproteobacteria --tax_scope_mode Bacteria should work.
|
--target_orthologs one2one|many2one|one2many|many2many|all
- defines what type of orthologs (in relation to the seed ortholog) should be used for functional transfer. Default: all
--target_taxa all|LIST_OF_TAX_IDS
- Comma-separated list of taxa to be used for annotation. Note that this option interacts with the Ortholgous Group chosen due to the --tax_scope/--tax_scope_mode options. First, speciation events are identified among the OGs based on --tax_scope/--tax_scope_mode. Then, annotation will be transferred from the orthologs found within those speciation events: from all the orthologs if --target_taxa all (default), or only from the taxa from LIST_OF_TAX_IDS, when
--target_taxa LIST_OF_TAX_IDS
.
- Comma-separated list of taxa to be used for annotation. Note that this option interacts with the Ortholgous Group chosen due to the --tax_scope/--tax_scope_mode options. First, speciation events are identified among the OGs based on --tax_scope/--tax_scope_mode. Then, annotation will be transferred from the orthologs found within those speciation events: from all the orthologs if --target_taxa all (default), or only from the taxa from LIST_OF_TAX_IDS, when
--excluded_taxa none|LIST_OF_TAX_IDS
- the opposite behaviour than
--target_taxa
. Default is none (no species are excluded for annotation).
- the opposite behaviour than
--report_orthologs
- as a first step in functional annotation, eggNOG-mapper identifies the orthologs of each query, using seed orthologs from the search stage as an anchoring or starting point. A list of these orthologs is not reported by default. Use this option get the list of orthologs found for each query ('emapper.orthologs' file).
--go_evidence experimental|non-electronic|all
- defines what type of GO terms should be used for annotation. experimental = Use only terms inferred from experimental evidence. non-electronic (default) = Use only non-electronically curated terms. all = all GO terms will be retrieved.
--pfam_realign none|realign|denovo
- Defines how PFAM annotation will be performed.
- none
- A list of PFAMs, directly transferred from orthologs, will be reported.
- realign
- PFAMs from orthologs will be realigned to the query, and a list of PFAMs and their positions on the query will be reported.
- denovo
- Each query will be realigned to the whole PFAM DB, and a list of PFAMs and their positions on the query will be reported.
- Defines how PFAM annotation will be performed.
--md5
- Adds a column with the md5 hash of the query sequences in the annotations output file. An annotations output file created this way can be used as cache file (
-c CACHE_FILE
) for the-m cache
mode.
- Adds a column with the md5 hash of the query sequences in the annotations output file. An annotations output file created this way can be used as cache file (
--output,-o FILE_PREFIX
- base name for output files
--output_dir DIR
- where output files should be written. default is current working directory.
--scratch_dir DIR
- write output files in a temporary scratch dir, move them to the final output dir when finished. Speed up large computations using network file systems.
--temp_dir DIR
- where temporary files are created. Better if this is a local disk.
--no_file_comments
- no header lines nor stats are included in the output files
--decorate_gff no|yes|FILE
- Option to create/decorate a GFF file with emapper hits and/or annotations. Default is no.
- no: no GFF decoration will be performed. If running gene prediction with Prodigal, its GFF will be among the output files anyway. If running blastx-based gene prediction, the GFF with CDS of hits will be among output files anyway.
- yes: a new GFF will be created including hits and/or annotations.
-
FILE: a new GFF will be created, adding hits and/or annotations to the attributes already existing in the specified FILE.
--decorate_gff_ID_field
can be used to specify a GFF attribute (see--decorate_gff_ID_field
).
- Option to create/decorate a GFF file with emapper hits and/or annotations. Default is no.
--decorate_gff_ID_field ID
- ID or name of the GFF attribute to be used to link GFF features and annotation results, when using
--decorate_gff FILE
. Default is ID.
- ID or name of the GFF attribute to be used to link GFF features and annotation results, when using
-
--excel
(v2.1.4 only)
- Annotations will be output also in Excel (.xlsx) format.
- Search hits (prefix.emapper.hits)
- A file with the results from the search phase, from HMMER, Diamond or MMseqs2.
- Seed orthologs (prefix.emapper.seed_orthologs)
- A file with the results from parsing the hits. Each row links a query with a seed ortholog. This file has the same format independently of which searcher was used, except that it can be in short format (4 fields), or full.
- Annotations (prefix.emapper.annotations)
- A file with the results from the annotation phase. Therefore, each row represents the annotation reported for a given query.
- Excel (prefix.emapper.annotations.xlsx)
- Annotations in .xlsx format. Created only when using the
--excel
option (v2.1.4 only).
- Annotations in .xlsx format. Created only when using the
- Orthologs (prefix.emapper.orthologs)
- A file with the list of orthologs found for each query. This file is created only if using the
--report_orthologs
option.
- A file with the list of orthologs found for each query. This file is created only if using the
- Sequences of predicted CDS (prefix.emapper.genepred.fasta)
- A FASTA file with the sequences of the predicted CDS. It is generated when gene prediction is carried out, with
--itype genome
or--itype metagenome
.
- A FASTA file with the sequences of the predicted CDS. It is generated when gene prediction is carried out, with
- GFF of predicted CDS (prefix.emapper.gff)
- A GFF (version 3) file, with the search and/or annotation results. See
--decorate_gff
.
- A GFF (version 3) file, with the search and/or annotation results. See
- Sequences without annotation (prefix.emapper.no_annotations.fasta)
- A FASTA file with the sequences of queries for which an existing annotation was not found using the -m cache mode. This file can be used as input of another eggNOG-mapper run without using the cache, trying to annotate the sequences.
- PFAM hits (prefix.emapper.pfam)
- A file with the positions of the PFAM domains identified. Only created if
--pfam_realign realign
or--pfam_realign denovo
.
- A file with the positions of the PFAM domains identified. Only created if
All files contain rows with tab-separated columns or fields.
- query
- target
- The target is what is also known, in eggnog-mapper, as 'seed ortholog'. It is the eggNOG sequence representing the best hit found for a given query during the search phase, and it will be used, during the annotation phase, to retrieve orthologs from which to transfer annotations.
- e-value
- bit-score
- The e-value and bit-score fields are the values returned by the search tool being used (diamond by default, see -m option).
- pident
- Percentage of identity between the query and the subject.
- qstart
- First position of query in the alignment.
- qend
- End position of query in the alignment.
- sstart
- Start position of subject (a.k.a. target) in the alignment.
- send
- End position of subject (a.k.a. target) in the alignment.
- qcov
- Percentage of the query length which is part of the alignment.
- scov
- Percentage of the subject (a.k.a. target) length which is part of the alignment.
--outfmt_short
option can be used to output only the first 4 fields of the seed orthologs file, when running searches with Diamond (see --outfmt_short
option above).
- query_name
- seed_eggNOG_ortholog
- seed_ortholog_evalue
- seed_ortholog_score
- eggNOG_OGs
- a comma-separated, clade depth-sorted (broadest to narrowest), list of Orthologous Groups (OGs) identified for this query. Note that each OG is represented in the following format: OG@tax_id|tax_name
- max_annot_lvl
- tax_id|tax_name of the level of widest OG used to retrieve orthologs for annotations.
- COG_cat
- COG category of the narrowest OG with a valid one.
- Description
- Description of the narrowest OG with a valid one.
- Preferred_name
- GOs
- EC
- KEGG_ko
- KEGG_Pathway
- KEGG_Module
- KEGG_Reaction
- KEGG_rclass
- BRITE
- KEGG_TC
- CAZy
- BiGG_Reaction
- PFAMs
- md5 (only when
--md5
is used)
- query
- orth_type
- Type of orthologs in this row. See
--target_orthologs
.
- Type of orthologs in this row. See
- species
- comma-separated list of orthologs
- If an ortholog shows a "*", such ortholog was used to transfer its annotations to the query.
- query
- hit
- evalue
- sum_score
- query length
- HMM position "from"
- HMM position "to"
- Sequence position "from"
- Sequence position "to"
- query coverage
Depends on --outfmt_short
. See Diamond documentation for more info.
See MMseqs2 documentation for more info.
If gene prediction is performed using search hits (diamond or mmseqs "blastx" hits), sequence identifiers include the identifier of the original sequence from which the CDS has been found, followed by an underscore and a number to differentiate among CDS from the same original sequence (e.g. A CDS found in >query_seq will be named >query_seq_1. A second one will be >query_seq_2, ...). If gene prediction is performed using prodigal, this output file is the one generated by Prodigal (check Prodigal documentation for output formats).
All eggNOG-mapper attributes will be prefixed with "em_". If gene prediction is performed using Prodigal and --decorate_gff no
is used, this output file is the one generated by Prodigal (check Prodigal documentation for more info).
Just a FASTA file with the same identifiers as the original sequences.
- query_name
- hit (i.e. PFAM domain)
- evalue
- sum_score
- query_length
- hmmfrom
- hmmto
- seqfrom
- seqto
- query_coverage
The following recommendations are based on the different experiences annotating huge genomic and metagenomic datesets (>100M proteins).
eggNOG mapper works at two phases: 1) finding seed orthologous sequences 2) expanding annotations. 1 is mainly cpu intensive, while 2 is more about disk operations. You can therefore optimize the annotation of huge files, but running each phase on different setups.
1) Split your input FASTA file into chunks, each containing a moderate number of sequences (1M seqs per file worked good in our tests). We usually work with FASTA files where sequences are in a single line, so splitting is very simple.
split -l 2000000 -a 3 -d input_file.faa input_file.chunk_
2) Use diamond mode. Each chunk can be processed independently in a cluster node, and you should tell `emapper.py` not to run the annotation phase yet. This way you can parallelize diamond searches as much as you want, even when running from a shared file system. Assuming an example with 100M proteins, the above command will generate 100 file chunks, and each should run diamond using 16 cores. The necessary commands that need to be submitted to the cluster queue can be generated with something like this:
# generate all the commands that should be distributed in the cluster
for f in *.chunk_*; do
echo ./emapper.py -m diamond --no_annot --no_file_comments --cpu 16 -i $f -o $f;
done
The annotation phase needs to query `data/eggnog.db` intensively. This file is a sqlite3 database, so it is highly recommended that the file lives under the fastest local disk possible. For instance, we store `eggnog.db` in SSD disks or, if possible, under `/dev/shm` (memory based filesystem).
3) Concatenate all chunk_*.emapper.seed_orthologs file.
cat *.chunk_*.emapper.seed_orthologs > input_file.emapper.seed_orthologs
4) Run the orthologs search and annotation phase in a single multi core machine (10 cores in our example), reading from a fast disk.
emapper.py --annotate_hits_table input.emapper.seed_orthologs --no_file_comments -o output_file --cpu 10
and _voilà_, you got your annotations.
Use MMseqs for the search step. If possible, create the MMseqs2 index first.
# generate all the commands that should be distributed in the cluster
for f in *.chunk_*; do
echo ./emapper.py -m mmseqs --no_annot --no_file_comments --cpu 16 -i $f -o $f;
done
Load the annotation database into memory for the annotation step (~44 GB mem required)
emapper.py --annotate_hits_table input.emapper.seed_orthologs --no_file_comments -o output_file --cpu 10 --dbmem
Also when running Diamond for the search step (-m diamond
) it can benefit from using large memory computers, by tuning the --block_size
and the --index_chunks
options. Also --index_chunks
could be required by diamond when running in computers with over 64GB RAM.
Please cite the following two papers if you use eggNOG-mapper v2
[1] eggNOG-mapper v2: functional annotation, orthology assignments, and domain
prediction at the metagenomic scale. Carlos P. Cantalapiedra,
Ana Hernandez-Plaza, Ivica Letunic, Peer Bork, Jaime Huerta-Cepas.
biorxiv (2021). doi: https://doi.org/10.1101/2021.06.03.446934
[2] eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated
orthology resource based on 5090 organisms and 2502 viruses. Jaime
Huerta-Cepas, Damian Szklarczyk, Davide Heller, Ana Hernández-Plaza, Sofia
K Forslund, Helen Cook, Daniel R Mende, Ivica Letunic, Thomas Rattei, Lars
J Jensen, Christian von Mering, Peer Bork Nucleic Acids Res. 2019 Jan 8;
47(Database issue): D309–D314. doi: 10.1093/nar/gky1085