Description of download files

Jaime Huerta-Cepas edited this page Sep 20, 2022 · 7 revisions

The download section of eggNOG v6.0 provides digested data for OG annotations, trees, and alignments.

To facilitate bulk processing, most files are in either tab-delimited (TSV) or JSON format.

e6.all_raw_trees_and_algs.tsv (136G)

Contains phylogenetic trees and multiple sequence alignments for all OGs. Each line of the file represents one OG. Data is serialised as tab-delimited columns:

ogname [TAB] taxlevel [TAB] Tree_in_newick_format [TAB] fasta_alignment_base64_encoded

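Python example to iterate over the data and unpack the base64 alignments:

from ete3 import Tree  # for phylogenetic tree processing
import base64
import gzip

with open('e6.all_raw_trees_and_algs.tsv') as fh:
    for line in fh:
        # Strip the trailing newline, then split into the four columns
        ogname, taxlevel, newick, packed_alg = line.rstrip('\n').split('\t')

        t = Tree(newick)

        # base64-decode, then gunzip (gzip.decompress; gzip has no uncompress)
        fasta = gzip.decompress(base64.b64decode(packed_alg)).decode()

        # ....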
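
Once the alignment column is unpacked, the FASTA text can be split into individual sequences. The sketch below is a minimal, standard-library-only illustration: `unpack_alignment` is a hypothetical helper name, and it assumes the column is gzip-compressed and then base64-encoded, as the example above implies. The toy alignment is synthetic and only serves to exercise the function.

```python
import base64
import gzip

def unpack_alignment(packed_alg):
    """Decode a base64 string, gunzip it, and parse FASTA into {name: sequence}."""
    fasta = gzip.decompress(base64.b64decode(packed_alg)).decode()
    seqs = {}
    name = None
    for row in fasta.splitlines():
        if row.startswith('>'):
            # Header line: start a new sequence entry
            name = row[1:].strip()
            seqs[name] = ''
        elif name is not None:
            # Sequence line: append to the current entry
            seqs[name] += row.strip()
    return seqs

# Synthetic two-sequence alignment, packed the same way for demonstration only
toy_fasta = '>1.seqA\nMK-L\n>2.seqB\nMKTL\n'
toy_packed = base64.b64encode(gzip.compress(toy_fasta.encode())).decode()
aln = unpack_alignment(toy_packed)
```

In a real run, `packed_alg` from each line of the file would be passed in place of `toy_packed`.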

e6.dup_profiles.json (1G)

e6.func_profiles.json (17G)

e6.og2level.tsv (198M)

e6.og2parents_and_children.new.tsv (375M)

  • col1: OG name
  • col2: number of parent OGs
  • col3: number of child OGs
  • col4: comma separated list of parent OGs
  • col5: comma separated list of child OGs (empty when the OG has no children)

Example:

==> e6.og2parents_and_children.new.tsv <==

H5RHW   1       1       HT0XS   5A441
BFU0R   1       0       GXBE6
6FXK3   1       0       E981F
BZG4S   1       0       E8EVH
EX7EC   1       0       DTNA5
4RBAD   1       1       8Z0IU   AI3Y2
FAUDP   1       2       699KS   ENEMU,EW6FD
FYY7D   1       2       COG0560 9376J,9E42M
EYSYI   1       0       5TNEE
CENW5   1       0       HWD8R
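
The column layout above could be parsed along these lines. This is a sketch, not an official reader: `parse_og_hierarchy_line` is a hypothetical name, and the code assumes the columns are tab-separated and that col5 may be empty or absent for OGs without children, as the example rows suggest.

```python
def parse_og_hierarchy_line(line):
    """Parse one line of e6.og2parents_and_children.new.tsv into a dict."""
    fields = line.rstrip('\n').split('\t')
    # Pad to five fields: col5 may be missing when an OG has no children
    fields += [''] * (5 - len(fields))
    og, n_parents, n_children, parents, children = fields[:5]
    return {
        'og': og,
        'n_parents': int(n_parents),
        'n_children': int(n_children),
        'parents': parents.split(',') if parents else [],
        'children': children.split(',') if children else [],
    }

# Two rows taken from the example above, rewritten with explicit tabs
rec = parse_og_hierarchy_line('FAUDP\t1\t2\t699KS\tENEMU,EW6FD\n')
leaf = parse_og_hierarchy_line('BFU0R\t1\t0\tGXBE6\n')
```

Returning empty lists for missing parents or children keeps downstream code free of special cases when walking the OG hierarchy.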

e6.og2seqs_and_species.tsv (10G)

  • col1: taxonomic level
  • col2: OG name
  • col3: number of species
  • col4: number of members
  • col5: comma separated list of species
  • col6: comma separated list of members

Each member represents a protein sequence. The ID format is always TAXID.sequence_name

Example:

==> e6.og2seqs_and_species.tsv <==

314146  4R1PH   3       3       591936,61622,336983     336983.ENSCANP00000003193,591936.ENSPTEP00000002293,61622.ENSRROP00000006056
314146  4R1PI   2       2       61622,336983    336983.ENSCANP00000035642,61622.ENSRROP00000039496
314146  4R1PJ   2       4       61622,336983    336983.ENSCANP00000009298,336983.ENSCANP00000014889,61622.ENSRROP00000001152,61622.ENSRROP00000003608
314146  4R1PK   3       3       591936,61622,61621      591936.ENSPTEP00000029268,61621.ENSRBIP00000000020,61622.ENSRROP00000043585
314146  4R1PM   2       2       591936,61621    591936.ENSPTEP00000024160,61621.ENSRBIP00000003159
314146  4R1PN   3       4       61621,61622,336983      336983.ENSCANP00000007271,336983.ENSCANP00000011724,61621.ENSRBIP00000019800,61622.ENSRROP00000015263
314146  4R1PP   2       2       61622,336983    336983.ENSCANP00000024027,61622.ENSRROP00000027054
314146  4R1PQ   2       2       61622,336983    336983.ENSCANP00000020242,61622.ENSRROP00000005057
314146  4R1PR   2       2       591936,336983   336983.ENSCANP00000031012,591936.ENSPTEP00000036989
314146  4R1PS   2       2       61621,336983    336983.ENSCANP00000038913,61621.ENSRBIP00000035373
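
As an illustration of the six-column layout and the TAXID.sequence_name convention, the following sketch parses one line and counts members per species. `parse_og_members_line` is a hypothetical helper, and the code assumes tab-separated columns; splitting on the first dot only is a precaution in case a sequence name itself contains dots.

```python
from collections import Counter

def parse_og_members_line(line):
    """Parse one line of e6.og2seqs_and_species.tsv into a dict."""
    level, og, n_species, n_members, species, members = line.rstrip('\n').split('\t')
    # Member IDs are TAXID.sequence_name; split on the first dot only
    pairs = [tuple(m.split('.', 1)) for m in members.split(',')]
    return {
        'level': level,
        'og': og,
        'n_species': int(n_species),
        'n_members': int(n_members),
        'species': species.split(','),
        'members': pairs,
    }

# Row for OG 4R1PJ from the example above, rewritten with explicit tabs
example = ('314146\t4R1PJ\t2\t4\t61622,336983\t'
           '336983.ENSCANP00000009298,336983.ENSCANP00000014889,'
           '61622.ENSRROP00000001152,61622.ENSRROP00000003608\n')
rec = parse_og_members_line(example)
per_species = Counter(taxid for taxid, _ in rec['members'])
```

Here `per_species` maps each TaxID to the number of member proteins it contributes to the OG.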

e6.seq2ogs.tsv (4G)

e6.taxa_profiles.json (7G)

Each line contains a JSON document describing the taxonomic content of one OG.

{"n": OG name,
 "tprof": Taxonomic profile
     [
      {"t": NCBI TaxID,
       "c": Number of proteins belonging to this TaxID in this particular OG
       },
     ....]}

Example:

==> e6.taxa_profiles.json <==
{"n": "4R1PH", "tprof": [{"t": "591936", "c": 1}, {"t": "336983", "c": 1}, {"t": "61622", "c": 1}]}
{"n": "4R1PI", "tprof": [{"t": "61622", "c": 1}, {"t": "336983", "c": 1}]}
{"n": "4R1PJ", "tprof": [{"t": "61622", "c": 2}, {"t": "336983", "c": 2}]}
{"n": "4R1PK", "tprof": [{"t": "61622", "c": 1}, {"t": "591936", "c": 1}, {"t": "61621", "c": 1}]}
{"n": "4R1PM", "tprof": [{"t": "591936", "c": 1}, {"t": "61621", "c": 1}]}
{"n": "4R1PN", "tprof": [{"t": "336983", "c": 2}, {"t": "61622", "c": 1}, {"t": "61621", "c": 1}]}
{"n": "4R1PP", "tprof": [{"t": "336983", "c": 1}, {"t": "61622", "c": 1}]}
{"n": "4R1PQ", "tprof": [{"t": "61622", "c": 1}, {"t": "336983", "c": 1}]}
{"n": "4R1PR", "tprof": [{"t": "336983", "c": 1}, {"t": "591936", "c": 1}]}
{"n": "4R1PS", "tprof": [{"t": "336983", "c": 1}, {"t": "61621", "c": 1}]}