eggNOG mapper v2.0.2 v2.0.8
EggNOG-mapper (a.k.a. `emapper.py` or just emapper) is a tool for fast functional annotation of novel sequences. It uses precomputed orthologous groups (OGs) and phylogenies from the eggNOG database (http://eggnogdb.embl.de/) to transfer functional information from fine-grained orthologs only.
Common uses of eggNOG-mapper include the annotation of novel genomes, transcriptomes or even metagenomic gene catalogs.
The use of orthology predictions for functional annotation permits a higher precision than traditional homology searches (i.e. BLAST searches), as it avoids transferring annotations from close paralogs (duplicate genes with a higher chance of being involved in functional divergence).
Benchmarks comparing different eggNOG-mapper options against BLAST and InterProScan are available at https://github.com/jhcepas/emapper-benchmark/blob/master/benchmark_analysis.ipynb.
EggNOG-mapper is also available as a public online resource: http://eggnog-mapper.embl.de
(no news)
- Added GFF decoration (`--decorate_gff` option), to create/modify a GFF including emapper hits and/or annotations.
- Output file comments start with "##", making it easier to filter them out without removing the header (which starts with "#").
- Fixed seed_orthologs header.
- Fixed `pident` not shown in seed_orthologs output.
- Added `--trans_table` (Diamond's `--query-gencode`, MMseqs2's `--translation-table`, Prodigal's `-g/--trans-table`) option, to specify a translation table for gene prediction and blastx searches.
- Added `--training_genome` and `--training_file` options, to run Prodigal training mode.
- Default search thresholds (pident, score, query and subject coverage) are set to None.
- Both Diamond and MMseqs2 seed_orthologs files now include percentage identity (pident), position of hits (qstart, qend, sstart, send), query coverage (qcov) and subject coverage (scov).
- Added `--outfmt_short` option for Diamond, to run it producing only query, subject, evalue and score as output. This option could be useful to obtain better performance when no thresholds for pident or for query and subject coverage are used (see Diamond docs about traceback). In that case, the seed_orthologs file will contain only those 4 fields.
- Added subject coverage (target_coverage) to gff output from blastx based gene predictions.
- Added `--block_size` (Diamond's `-b/--block-size`) and `--index_chunks` (Diamond's `-c/--index-chunks`) options.
- Bug fixes.
Added `create_dbs.py` script, to create diamond/mmseqs eggnog5 databases from a user-specified list of taxa.
https://github.com/eggnogdb/eggnog-mapper/releases/tag/2.0.5
- If the `--translate` option is used (along with `--itype CDS`), input sequences will be translated to proteins before searching with either diamond "blastp", mmseqs "blastp" or hmmer. If `--itype CDS` is used without `--translate`, it will raise an error for hmmer, but it will run diamond or mmseqs in "blastx" mode.
- Bug fix when running hmmer with numeric-only identifiers in input sequences.
- Other minor changes.
https://github.com/eggnogdb/eggnog-mapper/releases/tag/2.0.4-rf1
- Gene prediction step using Prodigal.
- Search and annotation of ORFs using diamond blastx or MMseqs2. This can be used to annotate ORFs of contigs without using Prodigal.
- MMseqs2 support for the search step of eggNOG-mapper.
- Parameters to allow users to control sensitivity of diamond/MMseqs2 searches.
- Improved report of orthologs.
- NCBITaxa support is now included in eggNOG-mapper without relying on ete3.
- `--md5` option, which can be used to add the md5 hash of the query as a new column in the annotations output file.
- `-m cache` mode and `-c FILE` option, to annotate using an annotations file with md5 hashes as cached results. A FASTA file with the unannotated sequences is also output, which can be used in a subsequent conventional emapper annotation run.
- "Bottom-top" orthology search if no proper orthologs are retrieved from the a priori chosen best OG.
- `--go_evidence all` option to report all GO terms.
- `--dbmem` option to pre-load the `eggnog.db` sqlite3 DB into memory.
https://github.com/eggnogdb/eggnog-mapper/releases/tag/2.0.3-rf1
- New eggnog DB version 5.0.1 including PFAM annotations and PFAM HMMs.
- Added expected eggNOG DB version, and warning if found version is different than expected one.
- Added PFAM annotations, which are directly transferred from orthologs.
- Added `--pfam transfer` option to `emapper.py`.
https://github.com/eggnogdb/eggnog-mapper/releases/tag/2.0.2-rf1
- New `--tax_scope` modes.
- Added eggNOG DB version to `-v/--version` option.
https://github.com/eggnogdb/eggnog-mapper/releases/tag/2.0.1-rf1
- All code migrated to Python 3. Therefore, python3 is now required to run eggnog-mapper scripts.
- HMMER search capabilities moved to new scripts: hmm_search.py, hmm_server.py, hmm_worker.py. HMMER search options are still available through the emapper.py script, but only for searching against custom databases (no annotation), which is equivalent to using hmm_search.py.
- Changes in output format.
- Changes in available parameters and behaviour of existing ones.
- Added some integration and unit tests
https://github.com/eggnogdb/eggnog-mapper/releases/tag/2.0.1b
- Bug fixes, minor changes.
https://github.com/eggnogdb/eggnog-mapper/releases/tag/2.0.0
- Expanded database of precomputed orthology assignments, now based on eggNOG v5.0. This includes 5,090 representative genomes (4445 bacteria, 168 archaea and 477 eukaryota), as well as 2502 viral proteomes.
- HMMer search mode is deprecated. Read FAQ---Frequently-Asked-Questions#why-i-cannot-choose-hmmer-search-mode-in-version-20
- Updated functional sources (e.g. KEGG, GeneOntology)
- New output columns compared with eggNOG-mapper version 1 (see https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2).
- Python 3.7 (or greater)
- BioPython 1.76 (python package)
- psutil 5.7.0 (python package, required only if using the HMMER server mode)
- `wget` (linux command, required for downloading the eggNOG-mapper databases with `download_eggnog_data.py`)
- ~40 GB for the eggNOG annotation database (+ ~0.3 GB for taxa database)
- ~10 GB for Diamond database of eggNOG sequences (required if using `-m diamond`, which is the default search mode).
- ~90 GB for MMseqs2 database of eggNOG sequences (required if using `-m mmseqs`).
- ~3 GB for PFAM database (required if using `--pfam_realign` options for realignment of queries to PFAM domains).
- The size of eggNOG diamond/mmseqs databases created with `create_dbs.py` is highly variable, depending on the size of the chosen taxonomic groups.
- The size of eggNOG HMM databases is highly variable (check list of HMMER databases at http://eggnog5.embl.de/#/app/downloads).
- Using `-m mmseqs` requires no less than 224-256GB of RAM for the default DB, and therefore it is only recommended for large datasets to be processed in large memory systems. Databases created with `create_dbs.py` will require less RAM, which will depend on the size of the chosen taxonomic groups.
- Using `--dbmem` loads the whole eggnog.db sqlite3 annotation database during the annotation step, and therefore requires no less than 44-48GB of memory.
- Using `--pfam_realign denovo` uses HMMER server mode when the number of queries is equal to or greater than 100. Therefore the whole PFAM database is loaded into memory.
- Also, using the `--num_servers` option when running HMMER in server mode (a.k.a. `hmmpgmd`, which is used for `-m hmmer --usemem`, `--pfam_realign denovo` or `hmm_server.py`) loads the HMM database as many times as specified in the argument (e.g. `--pfam_realign denovo --num_servers 2` loads the PFAM database into memory twice).
pip install eggnog-mapper
- Download the latest version of eggnog-mapper from the following link: https://github.com/eggnogdb/eggnog-mapper/releases/latest
- Decompress the `.tar.gz` or `.zip` file
- Enter the decompressed directory and install the dependencies, either with:
  - setuptools: `python setup.py install`
  - pip: `pip install -r requirements.txt`
  - conda: `conda install --file requirements.txt`
- Download (clone) the repository: `git clone https://github.com/jhcepas/eggnog-mapper.git`
- Enter the repository directory and install the dependencies, either with:
  - setuptools: `python setup.py install`
  - pip: `pip install -r requirements.txt`
  - conda: `conda install --file requirements.txt`
If you want to be sure that eggNOG-mapper is using the bundled binaries for external tools (hmmer, diamond, mmseqs), it may help to add the emapper scripts and binaries to the PATH. For example, if your eggnog-mapper path is `/home/user/eggnog-mapper`:
export PATH=/home/user/eggnog-mapper:/home/user/eggnog-mapper/eggnogmapper/bin:"$PATH"
Also, if you want to store eggNOG-mapper databases in a specific directory, you may wish to create an environment variable to avoid using `--data_dir` in all your commands. For example:
export EGGNOG_DATA_DIR=/home/user/eggnog-mapper-data
The next step is to download the eggNOG-mapper databases by running the following script:
download_eggnog_data.py
This will download the eggNOG annotation database (along with the taxa databases), and the Diamond database of eggNOG proteins.
If no `EGGNOG_DATA_DIR` variable was defined and no `--data_dir` option was given to `download_eggnog_data.py`, the latter will try to download the files to a `data` directory within your eggnog-mapper directory.
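The lookup described above can be pictured with a minimal Python sketch. It is not the code used by eggNOG-mapper: the function name `resolve_data_dir`, the `install_dir` default and the precedence of the `--data_dir` option over the environment variable are assumptions made for illustration.

```python
import os
from pathlib import Path


def resolve_data_dir(cli_data_dir=None, install_dir="/home/user/eggnog-mapper"):
    """Return the directory where the databases are expected to live.

    Assumed precedence (hypothetical): --data_dir option, then the
    EGGNOG_DATA_DIR environment variable, then data/ in the install dir.
    """
    if cli_data_dir:
        return Path(cli_data_dir)
    env_dir = os.environ.get("EGGNOG_DATA_DIR")
    if env_dir:
        return Path(env_dir)
    return Path(install_dir) / "data"
```

With neither the option nor the variable set, the sketch falls back to the `data` directory inside the eggnog-mapper directory, as described above.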
Also, check `download_eggnog_data.py --help` for a detailed list of options. For example:
- The `-P` flag is required to download the PFAM database.
- The `-M` flag is required to download the MMseqs2 database. Note that no MMseqs2 index is included, and because of this we recommend creating the index if using huge input datasets. To do it you could use `mmseqs createindex "$EGGNOG_DATA_DIR"/mmseqs tmp` (see https://mmseqs.com/latest/userguide.pdf for more details).
- The `-H -d taxID` flags are required to download a HMMER database (check list of databases at http://eggnog5.embl.de/#/app/downloads).
Databases for specific taxa can be created with `create_dbs.py`. For example, to create a diamond database for Bacteria only:
create_dbs.py -m diamond --dbname bacteria --taxa Bacteria
This will create a `bacteria.dmnd` diamond database in the default data directory or the one specified in the `EGGNOG_DATA_DIR` environment variable. Such a database can be used with `emapper.py --dmnd_db bacteria.dmnd`. The first time `create_dbs.py` is used it will take time to download the eggnog5 proteins and create the diamond or mmseqs database. Subsequent calls to `create_dbs.py` on the same data directory will not need to download the eggnog5 proteins again. For further info, check `create_dbs.py --help`.
Depending on the workflow being used with eggNOG-mapper you will need different external tools. Nonetheless, all of them are actually included, bundled, along with eggNOG-mapper code. If you are running eggNOG-mapper fine, you may not need to install anything else.
However, the bundled tools are compiled binaries and could cause trouble in some systems, or may not be the most optimized binaries for your system. In such cases, you may wish to install some or all of these tools independently. The tools are:
- Prodigal: required if using `--itype genome` or `--itype metagenome` along with the option `--genepred prodigal`. Current bundled version is V2.6.3: February, 2016.
- Diamond: required to run the search steps with `-m diamond`. Current bundled version is 2.0.4.
- MMseqs2: required to run the search steps with `-m mmseqs`. Current bundled version is 113e3212c137d026e297c7540e1fcd039f6812b1.
- HMMER: required to run the search steps with `-m hmmer`, to run the HMMER based scripts (`hmm_mapper.py`, `hmm_server.py`, `hmm_worker.py`), and to perform realignments to PFAM with `--pfam_realign realign` or `--pfam_realign denovo`. Current bundled version is HMMER 3.1b2 (February 2015).
To start an annotation job, provide a FASTA file containing your query sequences (`-i` option), specify a project name which will be used as a prefix for all the output files (`-o` option), and run `emapper.py`:
emapper.py -i FASTA_FILE_PROTEINS -o test
- Run search and annotation, using diamond in blastx mode
emapper.py -m diamond -i FASTA_FILE_NTS --itype CDS -o test
- Run search and annotation, using MMseqs after translating input CDS to proteins
emapper.py -m mmseqs -i FASTA_FILE_CDS --itype CDS --translate -o test
- Run search and annotation for assembled contigs, using diamond "blastx" hits for gene prediction
emapper.py -m diamond -i FASTA_FILE_NTS --itype metagenome -o test
- Run search and annotation for a genome, using MMseqs search on proteins predicted by Prodigal
emapper.py -m mmseqs -i FASTA_FILE_NTS --itype genome --genepred prodigal -o test
- Run gene prediction using a genome to train Prodigal (since version 2.0.7)
emapper.py -m mmseqs -i FASTA_FILE_NTS --itype genome --genepred prodigal --training_genome FASTA_FILE --training_file OUT_TRAIN_FILE -o test
- 2-step run -- search step using diamond in "sensitive" mode -- annotation step loading the eggnog.db sqlite3 into memory (--dbmem; requires around 40GB free mem)
emapper.py -i FASTA_FILE_PROTS -m diamond --sensmode sensitive --no_annot -o test
emapper.py -m no_search --annotate_hits_table test.emapper.seed_orthologs --dbmem -o test_annot_1
- Repeat the annotation step, using specific taxa as target and reporting the one-to-one orthologs found, reading the eggnog.db from disk (no --dbmem option)
emapper.py -m no_search --annotate_hits_table test.emapper.seed_orthologs --report_orthologs --target_orthologs one2one --target_taxa 72274,1123487 -o test_annot_2
- Use HMMER to search a database of bacterial proteins, using a scratch dir to write output on a different drive than the one used to read. Once emapper.py finishes, output files in the scratch dir will be moved to the actual output dir, and the scratch dir will be removed.
emapper.py -m hmmer -i FASTA_FILE_PROTS -d bact -o test --scratch_dir /scratch/test
- Realign queries to the PFAM domains found on seed orthologs
emapper.py -i FASTA_FILE_PROTS -o test --pfam_transfer seed_ortholog --pfam_realign realign
- Realign queries to the whole PFAM database
emapper.py -i FASTA_FILE_PROTS -o test --pfam_realign denovo
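Between the two steps of a 2-step run it can be handy to inspect the `.seed_orthologs` file. The following Python sketch reads such a file, relying only on two facts stated in this page: comment lines start with "##" and the header line starts with "#". The function name `read_seed_orthologs` and the fallback column names (used when no header line is present) are assumptions for illustration; check the header of your own file for the actual columns.

```python
# Fallback names for the 4 basic fields (query, hit, evalue, score).
BASIC_FIELDS = ["query", "hit", "evalue", "score"]


def read_seed_orthologs(lines):
    """Parse seed_orthologs lines into a list of dicts (values kept as strings)."""
    header = None
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and "##" comments
        fields = line.split("\t")
        if line.startswith("#"):
            # Header line: strip the leading "#" from the first column name.
            header = [name.lstrip("#").strip() for name in fields]
            continue
        names = header or BASIC_FIELDS
        rows.append(dict(zip(names, fields)))
    return rows
```

For example, `read_seed_orthologs(open("test.emapper.seed_orthologs"))` would return one dict per hit of the search step shown above.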
- `--version`
  - show version and exit.
- `--list_taxa`
  - List available taxonomic names and IDs and exit.
- `--cpu NUM_CPU`
  - number of CPUs to be used whenever possible (diamond, annotation tasks, etc). Use `--cpu 0` to run with all available CPUs.
- `-i FILE`
  - input FASTA file containing query sequences (proteins by default; see `--translate`). Required unless `-m no_search`
- `--itype INPUT_TYPE`
  - The type of sequences included in the input file. The options are:
    - `--itype proteins`, which is the default.
    - `--itype CDS`
    - `--itype genome`
    - `--itype metagenome`
  - For `--itype proteins` the input file is used directly as input for the search step. With `--itype CDS`, the input file will be used directly as input for diamond and MMseqs2, unless the `--translate` option is used (see below); for hmmer the input CDS are first translated to proteins. If `--itype genome` is used, the input sequences are considered contigs, and gene prediction will be performed (see `--genepred` option). `--itype metagenome` is the same as `--itype genome`, except that Prodigal will be run in a different mode when `--genepred prodigal` is used.
- `--translate`
  - If `--itype CDS` and the `--translate` option are used, input sequences will be translated to proteins before the search. With `-m hmmer` and `--itype CDS`, input sequences will be translated to proteins, as if `--translate` was automatically activated. With `-m diamond` or `-m mmseqs` and `--itype CDS` but no `--translate` option, searches will be performed in "blastx" mode. Note that this is different from using `--itype genome` or `--itype metagenome`, in which case the hits are used to identify one or more ORFs within the input sequences, whereas using `--itype CDS` without `--translate` will just annotate the best hit found for each input sequence.
- `--annotate_hits_table FILE`
  - annotate TSV formatted table with 4 fields: query, hit, evalue, score. Required if `-m no_search`.
- `-c FILE`, `--cache FILE`
  - Annotations file with md5 checksums of sequences. Required if `-m cache`.
- `--data_dir DIR`
  - path to eggnog-mapper databases (`data/` folder or the one specified by the `EGGNOG_DATA_DIR` environment variable, by default).
- `--genepred GENE_PRED_MODE`
  - When `--itype genome` or `--itype metagenome` is used, gene prediction is carried out. There are 2 gene prediction modes:
    - The default is `--genepred search`, which means that either Diamond or MMseqs2 (depending on the `-m` argument) is run in blastx mode. As of now, we cannot recommend using Diamond for complete genomes, unless the assembly is rather fragmented and/or contigs are not very large. MMseqs2 is faster than Diamond for assembled genomes, and it is the recommended one if the memory requirements can be met.
    - If `--genepred prodigal` is specified, Prodigal is run for gene prediction, and the proteins predicted by Prodigal are used in the subsequent search and annotation steps. Prodigal will be run in a different mode depending on whether `--itype genome` or `--itype metagenome` is used.
- `--trans_table TRANS_TABLE_CODE` (since version 2.0.7)
  - Option to change the translation table used for gene prediction. It corresponds to Diamond's `--query-gencode`, MMseqs2's `--translation-table` and Prodigal's `-g/--trans-table`. Usually the value is an integer corresponding to a specific translation table (e.g. https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi). Check each program's documentation for more info.
- `--training_genome FASTA_FILE` (since version 2.0.7)
  - FASTA file of the genome to be used for Prodigal's training mode. Requires `--itype genome --genepred prodigal` and also requires `--training_file FILE`. Note that training will be run only if the training file does NOT exist. If the training file already exists, the latter will be used directly for gene prediction, and training will be skipped.
- `--training_file FILE` (since version 2.0.7)
  - Training file to be created and/or used by Prodigal. If the training file does not exist, the training genome (`--training_genome` option) will be used to create a training file, which is then immediately used for gene prediction. If the training file already exists, the training is skipped, the `--training_genome` option is ignored, and gene prediction is performed using the existing training file.
- -m MODE
- how input queries will be searched against eggNOG sequences. Default is -m diamond. All MODE options are shown in the next table:
MODE | Description | Requirements |
---|---|---|
diamond | search queries against eggNOG sequences using diamond | requires -i FILE |
hmmer | search sequences/hmm against sequences/hmm using HMMER | requires -i FILE and -d DB_NAME |
mmseqs | search queries against eggNOG sequences using MMseqs2 | requires -i FILE |
cache | search queries against a file of previous annotations, which includes the md5 hashes of the annotated sequences | requires -i FILE and -c FILE |
no_search | skip search stage. Annotate an existing .seed_orthologs file. | requires --annotate_hits_table FILE, unless --no_annot is used |
- --pident FLOAT
- report only alignments equal to or above this percentage of identity threshold. Default None (since version 2.0.7). No effect if `-m hmmer`.
- --evalue FLOAT
- report only alignments equal to or below this e-value threshold. Default 0.001
- --score FLOAT
- report only alignments equal to or above this bit score threshold. Default None (since version 2.0.7).
- --query-cover FLOAT
- report only alignments equal to or above this query coverage fraction threshold. Default None (since version 2.0.7).
- --subject-cover FLOAT
- report only alignments equal to or above this target (eggNOG sequence) coverage fraction threshold. Default None (since version 2.0.7). No effect if `-m hmmer`.
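How these thresholds combine can be sketched in Python. This is an illustration, not eggNOG-mapper's own filtering code: the function name `passes_thresholds` and the dict keys are hypothetical, and the sketch assumes that the e-value is an upper bound (as is usual for e-values) while all other thresholds are lower bounds. A threshold left as `None` is not applied.

```python
def passes_thresholds(hit, pident=None, evalue=0.001, score=None,
                      query_cover=None, subject_cover=None):
    """Return True if a hit satisfies every threshold that is not None.

    `hit` is a dict with the (hypothetical) keys
    pident, evalue, score, qcov and scov.
    """
    # E-value: lower is better, so the threshold is a maximum.
    if evalue is not None and hit["evalue"] > evalue:
        return False
    # All remaining thresholds are minimums.
    minimums = {
        "pident": pident,
        "score": score,
        "qcov": query_cover,
        "scov": subject_cover,
    }
    for key, minimum in minimums.items():
        if minimum is not None and hit[key] < minimum:
            return False
    return True
```

With the defaults above, only the e-value filter is active, which mirrors the `None` defaults documented for the other thresholds.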
- --dmnd_db FILE
- path to diamond-compatible database. Useful to specify a location different than data/ or --data_dir.
- --sensmode DIAMOND_SENS_MODE
- either fast, mid-sensitive, sensitive, more-sensitive, very-sensitive or ultra-sensitive. Default is `sensitive` (to be sure, check the default for your version in `emapper.py --help`).
- --matrix MATRIX_NAME
- which substitution matrix to be used by diamond, among BLOSUM62,BLOSUM90,BLOSUM80,BLOSUM50,BLOSUM45,PAM250,PAM70,PAM30.
- --gapopen INT
- gap open penalty used by diamond. Default is diamond default.
- --gapextend INT
- gap extend penalty used by diamond. Default is diamond default.
- --block_size FLOAT (since version 2.0.7)
- Diamond's -b/--block-size option. Default is Diamond's default.
- --index_chunks INT (since version 2.0.7)
- Diamond's -c/--index-chunks option. Default is Diamond's default.
- --outfmt_short (since version 2.0.7)
- Diamond will produce only the query, subject, evalue and score fields in its output, and seed_orthologs file will have only those fields also. This option could be useful to obtain better performance when no thresholds for pident, and query and subject coverages are used (see Diamond docs about traceback).
- --mmseqs_db FILE
- path to MMseqs2-compatible database. Useful to specify a location different than data/ or --data_dir.
- `--start_sens FLOAT`
  - Starting sensitivity for MMseqs2 iterative searches. Default 3.
- `--sens_steps INT`
  - Number of iterative searches with different sensitivities for MMseqs2. Default 3.
- `--final_sens FLOAT`
  - Final sensitivity for MMseqs2 iterative searches. Default 7.
- `--mmseqs_sub_mat MMSEQS_SUB_MAT`
  - Matrix to be used for --sub-mat option of MMseqs2. Default: the default one used by MMseqs2.
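To get a feeling for how the three iterative-search parameters interact, the following Python sketch lists the sensitivities that would be tried. It assumes that the steps are evenly spaced between the starting and final sensitivity; this spacing is an assumption of the sketch, so check the MMseqs2 documentation for the exact schedule. The function name `sensitivity_steps` is hypothetical.

```python
def sensitivity_steps(start_sens=3.0, sens_steps=3, final_sens=7.0):
    """List the sensitivities of an iterative search, assuming even spacing."""
    if sens_steps < 1:
        raise ValueError("sens_steps must be at least 1")
    if sens_steps == 1:
        # A single search runs directly at the final sensitivity.
        return [final_sens]
    increment = (final_sens - start_sens) / (sens_steps - 1)
    return [start_sens + i * increment for i in range(sens_steps)]
```

Under that assumption, the defaults listed above (start 3, 3 steps, final 7) correspond to searches at sensitivities 3, 5 and 7.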
- '-d DB_NAME', '--database DB_NAME'
- specify the target database for sequence searches. DB_NAME should be the name of a database downloaded using `download_eggnog_data.py -H -d taxID`, or such a database loaded in a server (e.g. db.hmm:host:port; see hmm_server.py documentation)
- '--servers_list FILE'
- A FILE with a list of remote hmmpgmd servers. Each row in the file represents a server, in the format 'host:port'. If --servers_list is specified, host and port from -d option will be ignored.
- '--qtype QUERY_TYPE'
- hmm or seq. Type of input data (-i).
- '--dbtype DB_TYPE'
- hmmdb or seqdb. Type of data in DB (-d).
- '--usemem'
- Use this option to allocate the whole database (-d) in memory. If --dbtype hmmdb, the database must be a hmmpress-ed database. If --dbtype seqdb, the database must be a HMMER-format database created with esl-reformat. Database will be unloaded after execution.
- '-p INT', '--port INT'
- Port used to setup HMM server, when --usemem. Also used for --pfam_realign modes.
- '--end_port PORT'
- Last port to be used to setup HMM server, when --usemem. Also used for --pfam_realign modes.
- '--num_servers INT'
- When using --usemem, specify the number of servers to fire up. By default, only 1 server is used. Note that cpus specified with --cpu will be distributed among servers and workers. Also used for --pfam_realign modes. It is important to consider that for each server the HMM database will be loaded into memory, and therefore memory consumption will grow as --num_servers is increased.
- '--num_workers INT'
- When using --usemem, specify the number of workers per server to fire up. By default, cpus specified with --cpu will be distributed among servers and workers. Also used for --pfam_realign modes. In our tests, --num_workers did not have the expected impact on performance, and increasing --num_servers was required to obtain an actual speed boost, provided that the memory requirements are met. However, this could be different in other systems, or if the way hmmpgmd uses workers is fixed.
- '--hmm_maxhits INT'
- Max number of hits to report (0 to report all). Default=1.
- '--report_no_hits'
- Whether queries without hits should be included in the output table.
- '--hmm_maxseqlen INT'
- Ignore query sequences larger than `maxseqlen`. Default=5000
- '--Z FLOAT'
- Fixed database size used in phmmer/hmmscan allows comparing e-values among databases. Default=40,000,000
- '--cut_ga'
- Adds the --cut_ga to hmmer commands (useful for Pfam mappings, for example). See hmmer documentation.
- '--clean_overlaps CLEAN_OVERLAPS_MODE'
- Removes those hits which overlap, keeping only the one with the best e-value. Default "none". Use the "all" and "clans" options when performing a hmmscan type search (i.e. domains are in the database). Use the "hmmsearch_all" and "hmmsearch_clans" options when using a hmmsearch type search (i.e. domains are the queries from -i file). The "clans" and "hmmsearch_clans" options will only have effect for hits to/from Pfam.
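The idea behind removing overlapping hits can be sketched as a simple greedy procedure in Python. This is an illustration of the "all" behaviour only, not the implementation used by eggNOG-mapper: the function name `clean_overlaps`, the dict keys and the inclusive coordinates are assumptions, and the clan-aware modes are not covered.

```python
def clean_overlaps(hits):
    """Keep, among overlapping hits, only the one with the best (lowest) e-value.

    Each hit is a dict with the (hypothetical) keys name, start, end and
    evalue; start and end are inclusive positions on the same sequence.
    """
    kept = []
    # Visit hits from best to worst e-value, so better hits are kept first.
    for hit in sorted(hits, key=lambda h: h["evalue"]):
        overlaps_kept = any(
            hit["start"] <= other["end"] and hit["end"] >= other["start"]
            for other in kept
        )
        if not overlaps_kept:
            kept.append(hit)
    # Report the surviving hits in positional order.
    return sorted(kept, key=lambda h: h["start"])
```

A hit is discarded as soon as it overlaps any hit that was already kept, so a weak hit can never displace a stronger one.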
- --no_annot
- perform only the search stage and skip functional annotation, reporting only seed orthologs (.seed_orthologs file).
- --dbmem
- Store the whole eggnog sqlite DB into memory before retrieving the annotations. This requires ~40GB of RAM memory available, but can increase annotation speed considerably. Database will be unloaded after execution.
- --seed_ortholog_evalue FLOAT
- e-value threshold (maximum accepted) when searching for the seed eggNOG ortholog. Queries without a significant seed ortholog will not be annotated. Default=0.001
- --seed_ortholog_score FLOAT
- min bit score expected when searching for the seed eggNOG ortholog. Queries without a significant seed ortholog will not be annotated. Default=60
- --tax_scope auto|narrowest|LIST_OF_TAX_IDS
  - fix the taxonomic scope used for annotation, for each query sequence, so only speciation events from a particular clade are used for functional transfer. In more detail, each seed ortholog belongs to a list of Orthologous Groups (OGs). eggnog-mapper uses one of these OGs to analyze speciation events and retrieve orthologs from which to transfer functional annotation. This can be done from a broader or a narrower OG. The --tax_scope option controls how this choice is carried out. Default is auto.
    - auto
      - eggnog-mapper uses a predefined list of tax IDs, so that the OG chosen will be the narrowest one which belongs to that list. Therefore, --tax_scope auto is equivalent to --tax_scope 10239,5794,33090,6231,6656,40674,78,8782,33208,4751,33154,2759,2157,2,1 (viruses,apicomplexa,plants,nematodes,arthropods,mammals,fishes,avian,metazoa,fungi,opisthokonta,euk,arch,bact,root). For example, if the OGs for a given query of our eggnog-mapper run are COG0012@1,COG0012@2,1MVM4@1224, the OG chosen to retrieve orthologs in auto mode will be COG0012@2, since 1MVM4@1224 does not belong to the list of tax IDs, and COG0012@2 is narrower than COG0012@1.
    - narrowest
      - Instructs eggnog-mapper to use the narrowest (most specific) taxon among the OGs identified for each hit. This could lead to scarce annotation, especially for less well-known clades. In the same example as before, COG0012@1,COG0012@2,1MVM4@1224, the OG chosen to retrieve orthologs will be 1MVM4@1224, which is the narrowest.
    - LIST_OF_TAX_IDS
      - Use a user-defined comma-separated list of tax IDs and/or tax names (you can use a mix of tax IDs and names; use the --list_taxa option to retrieve a list of the ones which are available). The order matters: the left-most tax IDs will have preference over the right-most ones. Furthermore, the list of tax IDs can be suffixed with none, narrowest or auto, to specify the behaviour when none of the tax IDs are found among the OGs of a target seed ortholog. If only the list of tax IDs is specified, the default behaviour is none.
        - none: no OG will be used for annotation, so no annotation will be obtained for this query.
        - auto: an OG will be chosen using the predefined list of tax IDs, and therefore at least the root level will be applied if no other taxa fit the target OGs (see auto above).
        - narrowest: the narrowest OG will be used for annotation, as if --tax_scope narrowest was chosen for this query.
      - An example of a list of tax IDs would be --tax_scope 2759,2157,2,1 for euk, arch, bact and root, in that order of preference.
      - If, for example, the narrowest OG is preferred over root, the list could instead be --tax_scope 2759,2157,2,narrowest.
      - Another example: to annotate all bacteria using the Bacteria level, and auto for all other taxa, use --tax_scope 2,auto
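The selection rules for `--tax_scope` can be summarized with a short Python sketch. It is an illustration under stated assumptions, not eggNOG-mapper's implementation: the function names are hypothetical, only numeric tax IDs are handled (no tax names), and the OGs of a query are assumed to be listed from broadest to narrowest, as in the COG0012@1,COG0012@2,1MVM4@1224 example. The auto mode is modelled as a left-to-right walk over the predefined list, which gives the same result as the example discussed for that mode.

```python
# Predefined tax IDs given for --tax_scope auto, in order of preference.
AUTO_TAX_IDS = [10239, 5794, 33090, 6231, 6656, 40674, 78, 8782,
                33208, 4751, 33154, 2759, 2157, 2, 1]
FALLBACKS = ("none", "auto", "narrowest")


def first_match(og_by_tax_id, tax_ids):
    """Return the OG of the left-most tax ID found among the query OGs."""
    for tax_id in tax_ids:
        if tax_id in og_by_tax_id:
            return og_by_tax_id[tax_id]
    return None


def choose_og(ogs, tax_scope="auto"):
    """Pick the OG used for annotation, or None if no OG fits the scope."""
    # Assumption: OGs are ordered from broadest to narrowest.
    og_list = ogs.split(",")
    og_by_tax_id = {int(og.split("@")[1]): og for og in og_list}
    narrowest = og_list[-1]

    if tax_scope == "auto":
        return first_match(og_by_tax_id, AUTO_TAX_IDS)
    if tax_scope == "narrowest":
        return narrowest

    # User-defined list, optionally suffixed with a fallback behaviour.
    fields = tax_scope.split(",")
    fallback = fields.pop() if fields[-1] in FALLBACKS else "none"
    chosen = first_match(og_by_tax_id, [int(field) for field in fields])
    if chosen is None and fallback == "auto":
        chosen = first_match(og_by_tax_id, AUTO_TAX_IDS)
    elif chosen is None and fallback == "narrowest":
        chosen = narrowest
    return chosen
```

For the example OGs, `choose_og("COG0012@1,COG0012@2,1MVM4@1224")` picks `COG0012@2` in auto mode, while `tax_scope="narrowest"` picks `1MVM4@1224`.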
- --target_orthologs one2one|many2one|one2many|many2many|all
- defines what type of orthologs (in relation to the seed ortholog) should be used for functional transfer. Default: all
- --target_taxa all|TAX_ID
- broadest taxa which will be used to search for orthologs. By default ('all'), orthologs from all taxa, within a given taxonomic scope, are used. Note that this option interacts with the OG chosen due to the --tax_scope option. First, speciation events are identified among the Orthologous Groups based on --tax_scope. Then, annotation will be transferred from the orthologs found within those speciation events: from all the orthologs if --target_taxa all, or only from orthologs of a specific TAX_ID if --target_taxa TAX_ID.
- --excluded_taxa TAXID
- the opposite behaviour to --target_taxa (for debugging and benchmark purposes). Default is none.
- --report_orthologs
- as a first step in functional annotation, eggnog-mapper identifies the orthologs of each query, using seed orthologs from the search stage as an anchoring or starting point. A list of these orthologs is not reported by default. Use this option to get the list of orthologs found for each query ('.orthologs' file).
- --go_evidence experimental|non-electronic|all
- defines what type of GO terms should be used for annotation. experimental = Use only terms inferred from experimental evidence. non-electronic (default) = Use only non-electronically curated terms. all = all GO terms will be retrieved.
- --pfam_transfer best_og|narrowest_og|seed_ortholog
- PFAM domains will be retrieved from either the best OG, the narrowest OG or directly from the seed ortholog. It has no effect if `--pfam_realign denovo` is used.
- --pfam_realign none|realign|denovo
- Defines how PFAM annotation will be performed.
- none
- A list of PFAMs, directly transferred from orthologs, will be reported.
- realign
- PFAMs from orthologs will be realigned to the query, and a list of PFAMs and their positions on the query will be reported.
- denovo
- Each query will be realigned to PFAM, and a list of PFAMs and their positions on the query will be reported.
- --md5
- Adds a column with the md5 hash of the query sequences in the annotations output file. An annotations output file created this way can be used as cache file (-c CACHE_FILE) for the -m cache mode.
- --output,-o FILE_PREFIX
- base name for output files
- --output_dir DIR
- where output files should be written. default is current working directory.
- --scratch_dir DIR
- write output files in a temporary scratch dir, and move them to the final output dir when finished. This speeds up large computations when working on network file systems.
- --resume
- resumes a previous execution skipping reported hits in the output file. Note that diamond runs (-m diamond) cannot be resumed, but search stage can be skipped with -m no_search --annotate_hits_table FILE.
- --override
- overwrites output files if they exist. By default, execution is aborted if conflicting files are detected.
- --temp_dir DIR
- where temporary files are created. Better if this is a local disk.
- --no_file_comments
- no header lines nor stats are included in the output files
- --decorate_gff no|yes|FILE[:FIELD]
- Option to create/decorate a GFF file with emapper hits and/or annotations.
- no: no GFF decoration will be performed. If running gene prediction with Prodigal, its GFF will be among the output files anyway. If running blastx-based gene prediction, the GFF with CDS of hits will be among output files anyway.
- yes: a new GFF will be created including hits and/or annotations.
- FILE[:FIELD]: a new GFF will be created, adding hits and/or annotations to the attributes already existing in the specified FILE. A FIELD (a GFF attribute) can be specified, to help identify which GFF feature the hits and/or annotations should be added to. For example, --decorate_gff genome_cds.gff:geneID will add hits and/or annotations to the features in which geneID matches the query name of the hit/annotation. By default, --decorate_gff is no and FIELD is ID.
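The three accepted forms of the --decorate_gff value can be told apart with a few lines of Python. This is only an illustrative sketch of the FILE[:FIELD] convention described above, not eggNOG-mapper's own argument parser: the function name `parse_decorate_gff` and the returned tuple layout are hypothetical, and it assumes the file path itself contains no colon.

```python
def parse_decorate_gff(value):
    """Interpret a --decorate_gff value as (mode, gff_path, field).

    Hypothetical helper for illustration only.
    """
    if value in ("no", "yes"):
        return value, None, None
    # FILE[:FIELD]: whatever follows the last colon names the GFF attribute.
    path, sep, field = value.rpartition(":")
    if not sep:
        # No FIELD given, so fall back to the documented default, ID.
        return "file", value, "ID"
    return "file", path, field or "ID"


mode, gff_path, field = parse_decorate_gff("genome_cds.gff:geneID")
```

With the example from the text, `gff_path` is `genome_cds.gff` and `field` is `geneID`; passing just `genome_cds.gff` yields the default field `ID`.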
- Seed orthologs (prefix.emapper.seed_orthologs)
- A file with the results from the search phase. Therefore, each row represents a query hit against a target eggNOG sequence.
- Annotations (prefix.emapper.annotations)
- A file with the results from the annotation phase. Therefore, each row represents the annotation reported for a given query.
- Orthologs (prefix.emapper.orthologs)
- A file with the list of orthologs found for each query. This file is created only if using the --report_orthologs option.
- HMM hits (prefix.emapper.hmm_hits)
- A file with the results from the search phase, using hmm_mapper or emapper -m hmmer, which reports query-HMM target pairs, including the e-value and score of the hit, the starting and ending positions of the hit, as well as the query covered by the alignment to the HMM hit.
- Sequences of predicted CDS (prefix.emapper.genepred.fasta)
- A FASTA file with the sequences of the predicted CDS. It is generated when gene prediction is carried out, with --itype genome or --itype metagenome.
- GFF of predicted CDS (prefix.emapper.genepred.gff)
- A GFF (version 3) file with the position of the predicted CDS on the original input sequences. It is generated when gene prediction is carried out, with --itype genome or --itype metagenome.
- Sequences without annotation (prefix.emapper.no_annotations.fasta)
- A FASTA file with the sequences of queries for which an existing annotation was not found using the -m cache mode. This file can be used as input of another eggNOG-mapper run without using the cache, trying to annotate the sequences.
- PFAM hits (prefix.emapper.pfam)
- A file with the positions of the PFAM domains identified. Only created if --pfam_realign realign or --pfam_realign denovo.
All files contain rows with tab-separated columns or fields.
- query
- target
- The target is what is also known, in eggnog-mapper, as 'seed ortholog'. It is the eggNOG sequence representing the best hit found for a given query during the search phase, and it will be used, during the annotation phase, to retrieve orthologs from which to transfer annotations.
- e-value
- bit-score
- The e-value and bit-score fields are the values returned by the search tool being used (diamond by default, see -m option).
- pident
- Percentage of identity between the query and the subject (since version 2.0.7).
- qstart
- First position of query in the alignment (since version 2.0.7).
- qend
- End position of query in the alignment (since version 2.0.7).
- sstart
- Start position of subject (a.k.a. target) in the alignment (since version 2.0.7).
- send
- End position of subject (a.k.a. target) in the alignment (since version 2.0.7).
- qcov
- Percentage of the query length which is part of the alignment (since version 2.0.7).
- scov
- Percentage of the subject (a.k.a. target) length which is part of the alignment (since version 2.0.7).
The --outfmt_short option can be used to output only the first 4 fields of the seed orthologs file, when running searches with Diamond (see the --outfmt_short option above).
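As a rough sketch, the seed orthologs file can be read in Python as shown below. It assumes the fields appear in the order listed above and that comment and header lines start with "#"; the dictionary keys (`evalue`, `score`, ...) and the function name `read_seed_orthologs` are our own naming, not something emitted by eggNOG-mapper. Rows written with --outfmt_short simply yield the first 4 keys.

```python
import csv
import io

# Column order as listed above; key names are chosen for this example only.
FULL_FIELDS = ["query", "target", "evalue", "score", "pident",
               "qstart", "qend", "sstart", "send", "qcov", "scov"]


def read_seed_orthologs(handle):
    """Yield one dict per hit, skipping comment/header lines ('#')."""
    data_lines = (line for line in handle if not line.startswith("#"))
    for row in csv.reader(data_lines, delimiter="\t"):
        if not row:
            continue  # ignore blank lines
        # zip() stops at the shorter input, so 4-column rows get 4 keys.
        yield dict(zip(FULL_FIELDS, row))


sample = io.StringIO(
    "## example comment line\n"
    "q1\t316407.85674276\t0.0\t1597.8\t98.5\t1\t820\t1\t820\t100.0\t100.0\n"
)
hits = list(read_seed_orthologs(sample))
```

All values are kept as strings; convert with `float()` or `int()` where numeric comparisons (e.g. filtering by `pident`) are needed.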
- query_name
- seed_eggNOG_ortholog
- seed_ortholog_evalue
- seed_ortholog_score
- eggNOG OGs
- a comma-separated, clade depth-sorted (broadest to narrowest), list of Orthologous Groups (OGs) identified for this query. Note that each OG is represented in the following format: OG@tax_id|tax_name
- narr_og_name
- OG@tax_id|tax_name for the narrowest OG found for this query.
- narr_og_cat
- COG category corresponding to narr_og_name
- narr_og_desc
- Description corresponding to narr_og_name
- best_og_name
- OG@tax_id|tax_name for the OG chosen based on --tax_scope.
- best_og_cat
- COG category corresponding to best_og_name
- best_og_desc
- Description corresponding to best_og_name
- Preferred_name
- GOs
- EC
- KEGG_ko
- KEGG_Pathway
- KEGG_Module
- KEGG_Reaction
- KEGG_rclass
- BRITE
- KEGG_TC
- CAZy
- BiGG_Reaction
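The annotations file can be loaded with a small reader like the following sketch. It relies on the convention that comment lines start with "##" while the header line starts with a single "#"; when the file was written with --no_file_comments there is no header, so the column names must be supplied by the caller. The names `read_annotations` and `parse_og` are hypothetical helpers, and `parse_og` assumes taxon names contain neither "," nor "|".

```python
import io


def read_annotations(handle, columns=None):
    """Yield one dict per annotated query.

    Column names come from the '#'-prefixed header line, or from
    `columns` when the file has no header.
    """
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # blank line or comment/stats line
        if line.startswith("#"):
            columns = line.lstrip("#").split("\t")
            continue
        if columns is None:
            raise ValueError("no header found; pass the column names explicitly")
        yield dict(zip(columns, line.split("\t")))


def parse_og(label):
    """Split 'OG@tax_id|tax_name' into (og, tax_id, tax_name)."""
    og, _, rest = label.partition("@")
    tax_id, _, tax_name = rest.partition("|")
    return og, tax_id, tax_name


sample = io.StringIO(
    "## example comment\n"
    "#query_name\teggNOG OGs\n"
    "q1\tCOG0001@1|root,COG0001@2|Bacteria\n"
)
rows = list(read_annotations(sample))
ogs = [parse_og(item) for item in rows[0]["eggNOG OGs"].split(",")]
```

Because the OG list is sorted from broadest to narrowest, `ogs[-1]` is the narrowest OG reported for the query. The two-column sample is shortened for readability; a real file carries all the columns listed above.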
- query
- comma-separated list of orthologs
- query
- hit
- evalue
- sum_score
- query length
- HMM position "from"
- HMM position "to"
- Sequence position "from"
- Sequence position "to"
- query coverage
If gene prediction is performed using search hits (diamond or mmseqs "blastx" hits), sequence identifiers include the identifier of the original sequence from which the CDS has been found, followed by an underscore and a number to differentiate among CDS from the same original sequence (e.g. A CDS found in >query_seq will be named >query_seq_1. A second one will be >query_seq_2, ...). If gene prediction is performed using prodigal, this output file is the one generated by Prodigal (check Prodigal documentation for output formats).
If gene prediction is performed using search hits (diamond or mmseqs "blastx" hits), the source field (2nd column) shows "eggNOG-mapper" and the attributes field (9th column) shows the results of the "blastx" search (e.g. ID=0_0;score=1597.8;evalue=0.0;eggnog5_target=316407.85674276;sstart=1;send=820;searcher=diamond). Also, target_coverage is included since version 2.0.7. If gene prediction is performed using prodigal, this output file is the one generated by Prodigal (check Prodigal documentation for output formats).
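The attributes field of the blastx-based GFF is a list of key=value pairs separated by ";", so it can be unpacked with a few lines of Python. The sketch below uses the example attribute string quoted above; the function name `parse_gff_attributes` is hypothetical, and values are kept as strings without any URL-decoding.

```python
def parse_gff_attributes(column9):
    """Turn 'key=value;key=value' into a dict of strings."""
    attrs = {}
    for item in column9.strip().split(";"):
        if not item:
            continue  # tolerate a trailing ';'
        key, _, value = item.partition("=")
        attrs[key] = value
    return attrs


example = ("ID=0_0;score=1597.8;evalue=0.0;eggnog5_target=316407.85674276;"
           "sstart=1;send=820;searcher=diamond")
attrs = parse_gff_attributes(example)
target = attrs["eggnog5_target"]
score = float(attrs["score"])
```

Here `target` holds the eggNOG sequence hit by the CDS, which is the same identifier that appears as seed ortholog in the seed orthologs file.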
Just a FASTA file with the same identifiers as the original sequences.
TODO
The following recommendations are based on our experience annotating huge genomic and metagenomic datasets (>100M proteins).
eggNOG mapper works in two phases: 1) finding seed orthologous sequences 2) expanding annotations. 1 is mainly cpu intensive, while 2 is more about disk operations. You can therefore optimize the annotation of huge files by running each phase on a different setup.
1) Split your input FASTA file into chunks, each containing a moderate number of sequences (1M seqs per file worked well in our tests). We usually work with FASTA files where each sequence is on a single line, so splitting is very simple.
split -l 2000000 -a 3 -d input_file.faa input_file.chunk_
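The `split` command above relies on every record taking exactly two lines. When sequences are wrapped over several lines, a record-aware splitter is needed instead. The following Python function is a minimal sketch of one; `split_fasta` is a hypothetical helper (not part of eggNOG-mapper) that names its chunks with the same three-digit numeric suffix as `split -a 3 -d`.

```python
def split_fasta(path, prefix, seqs_per_chunk=1_000_000):
    """Write chunks of at most seqs_per_chunk records; return their paths."""
    chunk_paths = []
    out = None
    n_in_chunk = 0
    try:
        with open(path) as fasta:
            for line in fasta:
                if line.startswith(">"):
                    # Start a new chunk on the first record or when the
                    # current chunk is full.
                    if out is None or n_in_chunk == seqs_per_chunk:
                        if out is not None:
                            out.close()
                        chunk_paths.append(f"{prefix}{len(chunk_paths):03d}")
                        out = open(chunk_paths[-1], "w")
                        n_in_chunk = 0
                    n_in_chunk += 1
                # Lines before the first '>' header are dropped.
                if out is not None:
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return chunk_paths
```

For example, `split_fasta("input_file.faa", "input_file.chunk_")` would produce `input_file.chunk_000`, `input_file.chunk_001`, and so on, each holding up to 1M sequences.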
2) Use diamond mode. Each chunk can be processed independently in a cluster node, and you should tell `emapper.py` not to run the annotation phase yet. This way you can parallelize diamond searches as much as you want, even when running from a shared file system. Assuming an example with 100M proteins, the above command will generate 100 file chunks, and each should run diamond using 16 cores. The necessary commands that need to be submitted to the cluster queue can be generated with something like this:
# generate all the commands that should be distributed in the cluster
for f in *.chunk_*; do
echo ./emapper.py -m diamond --no_annot --no_file_comments --cpu 16 -i $f -o $f;
done
The annotation phase needs to query `data/eggnog.db` intensively. This file is a sqlite3 database, so it is highly recommended that the file lives under the fastest local disk possible. For instance, we store `eggnog.db` in SSD disks or, if possible, under `/dev/shm` (memory based filesystem).
3) Concatenate all the chunk_*.emapper.seed_orthologs files.
cat *.chunk_*.emapper.seed_orthologs > input_file.emapper.seed_orthologs
4) Run the orthologs search and annotation phase in a single multi core machine (10 cores in our example), reading from a fast disk.
emapper.py --annotate_hits_table input_file.emapper.seed_orthologs --no_file_comments -o output_file --cpu 10
and _voilà_, you got your annotations.
Use MMseqs for the search step (~200 GB mem required if using the whole eggnog5 DB).
# generate all the commands that should be distributed in the cluster
for f in *.chunk_*; do
echo ./emapper.py -m mmseqs --no_annot --no_file_comments --cpu 16 -i $f -o $f;
done
Load the annotation database into memory for the annotation step (~44 GB mem required)
emapper.py --annotate_hits_table input_file.emapper.seed_orthologs --no_file_comments -o output_file --cpu 10 --dbmem
Running Diamond for the search step (-m diamond) can also benefit from large-memory computers, by tuning the --block_size and --index_chunks options. Setting --index_chunks may also be required by diamond when running on computers with over 64GB RAM. (These options are available since version 2.0.7.)
Please cite the following two papers if you use eggNOG-mapper v2
[1] Fast genome-wide functional annotation through orthology assignment by
eggNOG-mapper. Jaime Huerta-Cepas, Damian Szklarczyk, Lars Juhl Jensen,
Christian von Mering and Peer Bork. Submitted (2016).
[2] eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated
orthology resource based on 5090 organisms and 2502 viruses. Jaime
Huerta-Cepas, Damian Szklarczyk, Davide Heller, Ana Hernández-Plaza, Sofia
K Forslund, Helen Cook, Daniel R Mende, Ivica Letunic, Thomas Rattei, Lars
J Jensen, Christian von Mering, Peer Bork Nucleic Acids Res. 2019 Jan 8;
47(Database issue): D309–D314. doi: 10.1093/nar/gky1085