FCS GX taxonomy report
The initial report from FCS-GX is provided in the file <basename of fasta file>.<tax-id provided>.taxonomy.rpt. For more FCS-GX details and quickstart instructions, please review the FCS-GX documentation.
The following table lists each column number (first column), its column header (second column), and an example value (third column):
1: #seq-id OU830638.1
2: seq-len 6422716
3: (xp,lc,co,n)-len 5104,9351,29940,0
4: cvg-by-all 6034920
5: sep1 |
6: tax-name-1 Neonectria ditissima
7: tax-id-1 78410
8: div-1 fung:ascomycetes
9: cvg-by-div-1 5994233
10: cvg-by-tax-1 5528541
11: score-1 10033
12: sep2 |
13: tax-id-2 1735992
14: div-2 fung:ascomycetes
15: cvg-by-div-2 5994233
16: cvg-by-tax-2 5366714
17: score-2 9852
18: sep3 |
19: tax-id-3 2940382
20: div-3 fung:budding yeasts
21: cvg-by-div-3 56273
22: cvg-by-tax-3 31377
23: score-3 420
24: sep4 |
25: tax-id-4 378046
26: div-4 fung:budding yeasts
27: cvg-by-div-4 56273
28: cvg-by-tax-4 8406
29: score-4 223
30: sep5 |
31: reserved n/a
32: result primary-div
33: div fung:ascomycetes
34: div_pct_cvg 93
- Column 1: A seq-id (sequence ID). This can be in the following formats:
  - A whole sequence with a hit to a taxonomic division.
    #seq-id OU830638.1
  - A sequence split on runs of 10+Ns. The seq-id includes the start and end for each split range of the sequence formatted as ~start..end.
    #seq-id CH476754.1~1..212539 CH476754.1~212640..216643 CH476754.1~218504..255730
  - A sequence with hits to multiple taxonomic divisions, making it a putative chimeric sequence. The seq-id includes the start and end for each chimeric range of the sequence formatted as ~~start..end.
    #seq-id CR382124.1~~1164..1687942 CR382124.1~~1694735..1696001
  - A split sequence that is also chimeric. The seq-id includes ~start..end~~substart..subend where the subranges are relative to the starting coordinate of the split sequence.
    #seq-id UYJD01000002.1~1709646..1813733~~5112..84751 UYJD01000002.1~1709646..1813733~~100474..101416
- Columns 2 and 3: The seq-len (sequence length) and masked-len (masked length) representing the length of the sequence (whole, split, or chimeric) and the masked length of the sequence, respectively. The masked length is a comma-separated tuple corresponding to regions masked on four tracks: transposons (xp), low-complexity (lc), highly-conserved regions (co), Ns (n).
- Column 4: The cvg-by-all representing the total alignment length found from all sequences in the FCS-GX database.
- Columns 6-30: The alignment information for a maximum of four sets of tax-ids along with their divisions. FCS-GX prints the taxonomic name (see column 6) for the first set. It also prints the tax-id, division (derived from the “BLAST name” divisions in taxonomy), total alignment length from the division hits or just the specified tax-id, and a score for all four sets. FCS-GX returns information for a maximum of two tax-ids from the same division.
- Column 31: reserved column
- Column 32: FCS-GX result. This result can be any one of the following (result: description):
  - primary-div: sequence belongs to division of the input tax-id
  - contaminant: sequence identified as a contaminant
  - contaminant(synthetic): one of the top four taxa belongs to the 'synthetic' division, and the score is close to nearest matching division
  - contaminant(virus): one of the top four taxa belongs to the 'virus' division, and the score is close to nearest matching division
  - contaminant(repeat): probably belongs to a contaminant division, but the sequence is highly repeat-specific
  - contaminant(prok): matches to multiple prokaryotes and suggests the sequence is prokaryote-specific
  - contaminant(close-div): strong and unambiguous hit from a closely-related division
  - bogus: inconclusive because the nearest matching taxon has high overlap with a different division
  - repeat: inconclusive because the sequence is highly repeat-specific
  - low-coverage: inconclusive due to low coverage
  - inconclusive: inconclusive for other reasons
- Column 33: The taxonomic division assigned to the sequence by FCS-GX.
- Column 34: The percentage alignment coverage for the sequence in the taxonomic division.
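As a rough illustration of the seq-id formats listed under Column 1, the sketch below splits a seq-id into its base ID, split range, and chimeric range using awk. The function name parse_seq_id and the tab-separated output layout are assumptions made for this example and are not part of FCS-GX. Chimeric subranges are printed as reported, without converting them to whole-sequence coordinates.

```shell
# Hypothetical helper (not part of FCS-GX): reads seq-ids from stdin, one per line,
# and prints: base-id, split-start, split-end, chimeric-start, chimeric-end.
# A "-" marks a range that is absent from the seq-id.
parse_seq_id() {
  awk '{
    id = $0
    split_rng = "-\t-"
    chim_rng = "-\t-"

    # Chimeric range: everything after "~~", formatted as start..end.
    p = index(id, "~~")
    if (p > 0) {
      split(substr(id, p + 2), c, "[.][.]")
      chim_rng = c[1] "\t" c[2]
      id = substr(id, 1, p - 1)
    }

    # Split range: everything after the remaining single "~".
    p = index(id, "~")
    if (p > 0) {
      split(substr(id, p + 1), s, "[.][.]")
      split_rng = s[1] "\t" s[2]
      id = substr(id, 1, p - 1)
    }

    print id "\t" split_rng "\t" chim_rng
  }'
}

# Demo on the seq-id forms described for column 1.
printf '%s\n' \
  'OU830638.1' \
  'CH476754.1~1..212539' \
  'CR382124.1~~1164..1687942' \
  'UYJD01000002.1~1709646..1813733~~5112..84751' | parse_seq_id
```

To run this on a report, feed it the first column with header lines removed, for example: grep -v '^#' GCA_000006565.2.taxonomy.rpt | cut -f 1 | parse_seq_id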
The sequences below demonstrate some example outputs from taxonomy.rpt for a butterfly. The first sequence is classified as insect and the second as bacterial. The third sequence is also classified as insect, but it has several weaker hits to bacteria.
# column numbers
1 2 3 4 5 6 7 8 9 10 11 12
13 14 15 16 17 18
19 20 21 22 23 24
25 26 27 28 29 30 31
32
33
34
#seq-id seq-len (xp,lc,co,n)-len cvg-by-all sep1 tax-name-1 tax-id-1 div-1 cvg-by-div-1 cvg-by-tax-1 score-1 sep2
tax-id-2 div-2 cvg-by-div-2 cvg-by-tax-2 score-2 sep3
tax-id-3 div-3 cvg-by-div-3 cvg-by-tax-3 score-3 sep4
tax-id-4 div-4 cvg-by-div-4 cvg-by-tax-4 score-4 sep5 reserved
result
div
div_pct_cvg
# example sequence identified as insect (expected)
FARY01017106.1 14773 0,0,0,0 10677 | Melitaea cinxia 113334 anml:insects 10376 9804 262 |
171605 anml:insects 10376 9375 250 |
2829486 fung:basidiomycetes 92 92 12 |
29144 anml:fishes 86 86 11 | n/a
primary-div
anml:insects
70
# example Heliconius melpomene (a butterfly) sequence identified as an Enterobacter contaminant
FARY01000050.1 15785 0,0,0,0 15785 | Enterobacter chengduensis 2494701 prok:g-proteobacteria 15785 15761 886 |
1812935 prok:g-proteobacteria 15785 15723 885 |
|
| n/a
contaminant
prok:g-proteobacteria
100
# conflicting results (this probably is a butterfly sequence for a chitinase, with bacteria homologs)
FARY01021243.1 2942 0,0,0,0 2297 | Vanessa cardui 171605 anml:insects 2297 2107 112 |
7111 anml:insects 2297 2062 110 |
614 prok:g-proteobacteria 1683 1614 75 |
2864872 prok:g-proteobacteria 1683 1668 74 | n/a
primary-div
anml:insects
78
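Rows like the ones above can be hard to scan at full width. The sketch below prints a narrower view with only the seq-id, length, masked lengths, top taxon name, result, division, and percent coverage. It assumes, as the awk commands on this page do, that each row in the report file is a single tab-separated line. The function name compact_view and the choice of columns are assumptions made for this example.

```shell
# Hypothetical helper (not part of FCS-GX): prints seq-id, seq-len, the four masked
# lengths (xp, lc, co, n), tax-name-1, result, div, and div_pct_cvg for each row.
# Reads the report from the files given as arguments, or from stdin if none are given.
compact_view() {
  awk -F '\t' -v OFS='\t' '
    /^#/ { next }                 # skip header lines such as "#seq-id ..."
    {
      split($3, m, ",")           # column 3 is the xp,lc,co,n tuple
      print $1, $2, m[1], m[2], m[3], m[4], $6, $32, $33, $34
    }' "$@"
}

# Demo on the first butterfly example row, rebuilt as one tab-separated line.
printf 'FARY01017106.1\t14773\t0,0,0,0\t10677\t|\tMelitaea cinxia\t113334\tanml:insects\t10376\t9804\t262\t|\t171605\tanml:insects\t10376\t9375\t250\t|\t2829486\tfung:basidiomycetes\t92\t92\t12\t|\t29144\tanml:fishes\t86\t86\t11\t|\tn/a\tprimary-div\tanml:insects\t70\n' | compact_view
```

On a real report the call would be: compact_view GCA_000006565.2.taxonomy.rpt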
The following steps will help you parse/interpret the taxonomy.rpt output:
- Retrieve a list of sequences with at least one contaminant identifier (including chimeras):
cat GCA_000006565.2.taxonomy.rpt | awk -v FS='\t' -v OFS='\t' '$32~/contaminant/{print $1}' | cut -d '~' -f 1 | uniq
- Retrieve the FCS-GX output for all sequences with a mix of contaminant and primary-div ranges (putative chimeric sequences):
cat GCA_000006565.2.taxonomy.rpt | grep primary-div | cut -f 1 | cut -d "~" -f 1 | grep -F -f - GCA_000006565.2.taxonomy.rpt | grep contaminant | cut -f 1 | cut -d "~" -f 1 | grep -F -f - GCA_000006565.2.taxonomy.rpt
- Calculate the percentage of the total genome length classified as primary-div:
cat GCA_000006565.2.taxonomy.rpt | awk -v FS='\t' -v OFS='\t' '($32 == "primary-div"){ pr_div_len += $2 }; 1{ tot_len += $2 } END{ print pr_div_len/tot_len*100 }'
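The percentage calculation above can be extended to every value in column 32. The following is a minimal sketch, assuming the same tab-separated layout, that reports the row count, summed seq-len, and share of total length for each result. The function name summarize_results is hypothetical. Split and chimeric ranges appear as separate rows in the report, so they are counted separately here.

```shell
# Hypothetical helper (not part of FCS-GX): for each value in column 32, prints
# the result, the number of rows, their summed seq-len, and the percentage of the
# total length. Reads files given as arguments, or stdin if none are given.
summarize_results() {
  awk -F '\t' '
    /^#/ || NF < 32 { next }      # skip header lines and rows without a result column
    { rows[$32]++; bp[$32] += $2; total += $2 }
    END {
      for (r in rows) {
        pct = (total > 0) ? 100 * bp[r] / total : 0
        printf "%s\t%d\t%.0f\t%.2f\n", r, rows[r], bp[r], pct
      }
    }' "$@" | sort
}
```

A typical call would be: summarize_results GCA_000006565.2.taxonomy.rpt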
Please create an Issue if you encounter any problems.
For all other questions or comments, please contact us at [email protected]