Description of download files
Jaime Huerta-Cepas edited this page Sep 20, 2022 · 7 revisions
The download section of eggNOG v6.0 provides digested data for orthologous group (OG) annotations, trees, and alignments.
To facilitate bulk processing, most files are in either tab-delimited (TSV) or JSON format.
The file e6.all_raw_trees_and_algs.tsv contains phylogenetic trees and multiple sequence alignments for all OGs. Each line represents one OG. Data is serialised in tab-delimited columns:
ogname [TAB] taxlevel [TAB] Tree_in_newick_format [TAB] fasta_alignment_base64_encoded
Python example to iterate over the data and unpack the base64-encoded alignments:
from ete3 import Tree  # for phylogenetic tree processing
import base64
import gzip

for line in open('e6.all_raw_trees_and_algs.tsv'):
    # Columns: OG name, taxonomic level, Newick tree, packed alignment
    ogname, taxlevel, newick, packed_alg = line.rstrip('\n').split('\t')
    t = Tree(newick)
    # The alignment is gzip-compressed FASTA text, base64-encoded
    fasta = gzip.decompress(base64.b64decode(packed_alg)).decode()
    # ....
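Once an alignment has been unpacked, the FASTA text usually needs to be split into individual sequences. The following is a minimal sketch of that step, not part of the eggNOG tooling: the helper names `unpack_alignment` and `parse_fasta` are hypothetical, and the toy sequences are made up so that the round trip can be run without downloading any file. It assumes the alignment column is gzip-compressed FASTA encoded in base64, as in the example above.

```python
import base64
import gzip


def unpack_alignment(packed_alg):
    """Decode a base64 string, gunzip it, and return the FASTA text."""
    return gzip.decompress(base64.b64decode(packed_alg)).decode()


def parse_fasta(fasta_text):
    """Return {sequence_name: aligned_sequence} from FASTA text."""
    seqs = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: start collecting a new sequence
            name = line[1:]
            seqs[name] = []
        elif name is not None:
            # Sequence line: may be wrapped over several lines
            seqs[name].append(line)
    return {n: "".join(parts) for n, parts in seqs.items()}


# Toy round trip with made-up sequences (not real eggNOG data).
toy_fasta = ">9606.protA\nMK-LV\nAA\n>10090.protB\nMKTLV\nA-\n"
toy_packed = base64.b64encode(gzip.compress(toy_fasta.encode())).decode()
alignment = parse_fasta(unpack_alignment(toy_packed))
```

In a real loop, `unpack_alignment(packed_alg)` would take the fourth column of each line, and the keys of the resulting dictionary would be member IDs.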
- col1: OG name
- col2: number of parent OGs
- col3: number of child OGs
- col4: comma-separated list of parents
- col5: comma-separated list of children
Example:
==> e6.og2parents_and_children.new.tsv <==
H5RHW 1 1 HT0XS 5A441
BFU0R 1 0 GXBE6
6FXK3 1 0 E981F
BZG4S 1 0 E8EVH
EX7EC 1 0 DTNA5
4RBAD 1 1 8Z0IU AI3Y2
FAUDP 1 2 699KS ENEMU,EW6FD
FYY7D 1 2 COG0560 9376J,9E42M
EYSYI 1 0 5TNEE
CENW5 1 0 HWD8R
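The five columns above can be read into a small record per OG. The sketch below is an illustration only: the function name `parse_og_hierarchy_line` is hypothetical, and it assumes the columns are tab-separated and that the children column may be empty or missing when an OG has no children (as in the rows with a child count of 0).

```python
def parse_og_hierarchy_line(line):
    """Parse one line of the parents/children table into a dict."""
    fields = line.rstrip("\n").split("\t")
    # Pad with empty strings in case a trailing empty column was dropped
    fields += [""] * (5 - len(fields))
    og, n_parents, n_children, parents, children = fields[:5]
    return {
        "og": og,
        "n_parents": int(n_parents),
        "n_children": int(n_children),
        # Filter out empty strings so an empty column yields an empty list
        "parents": [p for p in parents.split(",") if p],
        "children": [c for c in children.split(",") if c],
    }


# Two rows from the example above, written with explicit tabs
rec = parse_og_hierarchy_line("FAUDP\t1\t2\t699KS\tENEMU,EW6FD\n")
leaf = parse_og_hierarchy_line("BFU0R\t1\t0\tGXBE6\t\n")
```

Comparing `len(rec["children"])` with `rec["n_children"]` gives a cheap consistency check when loading the whole file.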
- col1: taxonomic level
- col2: OG name
- col3: number of species
- col4: number of members
- col5: comma-separated list of species
- col6: comma-separated list of members
Each member represents a protein sequence. The ID format is always TAXID.sequence_name
Example:
==> e6.og2seqs_and_species.tsv <==
314146 4R1PH 3 3 591936,61622,336983 336983.ENSCANP00000003193,591936.ENSPTEP00000002293,61622.ENSRROP00000006056
314146 4R1PI 2 2 61622,336983 336983.ENSCANP00000035642,61622.ENSRROP00000039496
314146 4R1PJ 2 4 61622,336983 336983.ENSCANP00000009298,336983.ENSCANP00000014889,61622.ENSRROP00000001152,61622.ENSRROP00000003608
314146 4R1PK 3 3 591936,61622,61621 591936.ENSPTEP00000029268,61621.ENSRBIP00000000020,61622.ENSRROP00000043585
314146 4R1PM 2 2 591936,61621 591936.ENSPTEP00000024160,61621.ENSRBIP00000003159
314146 4R1PN 3 4 61621,61622,336983 336983.ENSCANP00000007271,336983.ENSCANP00000011724,61621.ENSRBIP00000019800,61622.ENSRROP00000015263
314146 4R1PP 2 2 61622,336983 336983.ENSCANP00000024027,61622.ENSRROP00000027054
314146 4R1PQ 2 2 61622,336983 336983.ENSCANP00000020242,61622.ENSRROP00000005057
314146 4R1PR 2 2 591936,336983 336983.ENSCANP00000031012,591936.ENSPTEP00000036989
314146 4R1PS 2 2 61621,336983 336983.ENSCANP00000038913,61621.ENSRBIP00000035373
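Because member IDs always follow the TAXID.sequence_name pattern, members can be grouped by species while reading the file. This is a minimal sketch assuming tab-separated columns; the function name `parse_og_members_line` is hypothetical. The ID is split on the first dot only, on the assumption that a sequence name may itself contain dots.

```python
from collections import defaultdict


def parse_og_members_line(line):
    """Parse one line of the OG members table and group members by TaxID."""
    level, og, n_species, n_members, species, members = (
        line.rstrip("\n").split("\t")
    )
    members_by_taxid = defaultdict(list)
    for member in members.split(","):
        # Split on the first dot only: TAXID.sequence_name
        taxid, seq_name = member.split(".", 1)
        members_by_taxid[taxid].append(seq_name)
    return {
        "level": level,
        "og": og,
        "n_species": int(n_species),
        "n_members": int(n_members),
        "species": species.split(","),
        "members_by_taxid": dict(members_by_taxid),
    }


# Second row of the example above, written with explicit tabs
row = parse_og_members_line(
    "314146\t4R1PI\t2\t2\t61622,336983\t"
    "336983.ENSCANP00000035642,61622.ENSRROP00000039496\n"
)
```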
Each line contains a JSON document describing the taxonomic content of one OG.
{"n": OG name,
 "tprof": Taxonomic profile
  [
   {"t": NCBI TaxID (string),
    "c": Number of proteins belonging to this TaxID in this particular OG
   },
   ....]}
Example:
==> e6.taxa_profiles.json <==
{"n": "4R1PH", "tprof": [{"t": "591936", "c": 1}, {"t": "336983", "c": 1}, {"t": "61622", "c": 1}]}
{"n": "4R1PI", "tprof": [{"t": "61622", "c": 1}, {"t": "336983", "c": 1}]}
{"n": "4R1PJ", "tprof": [{"t": "61622", "c": 2}, {"t": "336983", "c": 2}]}
{"n": "4R1PK", "tprof": [{"t": "61622", "c": 1}, {"t": "591936", "c": 1}, {"t": "61621", "c": 1}]}
{"n": "4R1PM", "tprof": [{"t": "591936", "c": 1}, {"t": "61621", "c": 1}]}
{"n": "4R1PN", "tprof": [{"t": "336983", "c": 2}, {"t": "61622", "c": 1}, {"t": "61621", "c": 1}]}
{"n": "4R1PP", "tprof": [{"t": "336983", "c": 1}, {"t": "61622", "c": 1}]}
{"n": "4R1PQ", "tprof": [{"t": "61622", "c": 1}, {"t": "336983", "c": 1}]}
{"n": "4R1PR", "tprof": [{"t": "336983", "c": 1}, {"t": "591936", "c": 1}]}
{"n": "4R1PS", "tprof": [{"t": "336983", "c": 1}, {"t": "61621", "c": 1}]}