FCS GX taxonomy report
The initial report from FCS-GX is provided in the file <basename of fasta file>.<tax-id provided>.taxonomy.rpt. For more FCS-GX details and quickstart instructions, please review the FCS-GX documentation.
The following table lists each column number (first column) with its column header (second column) and an example value (third column):
1: #seq-id OU830638.1
2: seq-len 6422716
3: masked-len 793013
4: cvg-by-all 5578796
5: sep1 |
6: tax-name-1 Neonectria sp. DH2
7: tax-id-1 1735992
8: div-1 ascomycetes
9: cvg-by-div-1 5562234
10: cvg-by-tax-1 5336039
11: score-1 8946
12: sep2 |
13: tax-id-2 930093
14: div-2 ascomycetes
15: cvg-by-div-2 5562234
16: cvg-by-tax-2 3899812
17: score-2 6463
18: sep3 |
19: tax-id-3 64524
20: div-3 fungi
21: cvg-by-div-3 35905
22: cvg-by-tax-3 12317
23: score-3 523
24: sep4 |
25: tax-id-4 108931
26: div-4 insects
27: cvg-by-div-4 45081
28: cvg-by-tax-4 32628
29: score-4 491
30: sep5 |
31: weight 4
32: result primary-div
33: div ascomycetes
34: div_pct_cvg 87
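The four taxon sets in columns 6-30 can be easier to inspect one hit per line. The following is a minimal sketch, assuming a tab-separated report whose header and comment lines start with `#`; the function name `tax_hits_long` is hypothetical, and treating an empty tax-id field as an unused set is an assumption rather than documented behavior.

```shell
# Print one line per taxon set: seq-id, set number, tax-id, division,
# cvg-by-div, cvg-by-tax, score.
# Usage: tax_hits_long GCA_000006565.2.taxonomy.rpt
tax_hits_long() {
  awk -F'\t' -v OFS='\t' '
    BEGIN {
      # tax-id-1 is column 7 (column 6 holds tax-name-1);
      # tax-id-2, tax-id-3, and tax-id-4 are columns 13, 19, and 25.
      start[1] = 7; start[2] = 13; start[3] = 19; start[4] = 25
    }
    /^#/ { next }                     # skip header and comment lines
    {
      for (k = 1; k <= 4; k++) {
        c = start[k]
        if ($c == "") continue        # assumption: unused sets are left empty
        print $1, k, $c, $(c + 1), $(c + 2), $(c + 3), $(c + 4)
      }
    }
  ' "$1"
}
```

For the example row in the table above, the third output line would carry tax-id 64524, division fungi, and score 523.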
-
Column 1: A seq-id (sequence ID). This can be in the following formats:
-
A whole sequence with a hit to a taxonomic division.
#seq-id OU830638.1
-
A sequence split on runs of 10+Ns. The seq-id includes the start and end for each split range of the sequence formatted as ~start..end.
#seq-id CH476754.1~1..212539 CH476754.1~212640..216643 CH476754.1~218504..255730
-
A sequence with hits to multiple taxonomic divisions, making it a putative chimeric sequence. The seq-id includes the start and end for each chimeric range of the sequence formatted as ~~start..end.
#seq-id CR382124.1~~1164..1687942 CR382124.1~~1694735..1696001
-
A split sequence that is also chimeric. The seq-id includes ~start..end~~substart..subend, where the subranges are relative to the starting coordinate of the split sequence.
#seq-id UYJD01000002.1~1709646..1813733~~5112..84751 UYJD01000002.1~1709646..1813733~~100474..101416
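The seq-id formats listed here can be separated into their parts with a few string operations. This is a minimal sketch, assuming the accession itself never contains a tilde; `parse_seq_id` is a hypothetical helper name, and `-` marks a part that is absent.

```shell
# Read one seq-id per line on stdin and print three tab-separated fields:
# base accession, split range, chimeric sub-range.
parse_seq_id() {
  awk -v OFS='\t' '{
    id = $0; split_rng = "-"; chim_rng = "-"
    # A chimeric sub-range follows a double tilde.
    p = index(id, "~~")
    if (p > 0) { chim_rng = substr(id, p + 2); id = substr(id, 1, p - 1) }
    # A split range follows a single tilde.
    p = index(id, "~")
    if (p > 0) { split_rng = substr(id, p + 1); id = substr(id, 1, p - 1) }
    print id, split_rng, chim_rng
  }'
}
```

For example, `printf '%s\n' 'CR382124.1~~1164..1687942' | parse_seq_id` prints the accession `CR382124.1`, then `-` for the missing split range, then `1164..1687942`.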
-
-
Columns 2 and 3: The seq-len (sequence length) and masked-len (masked length) give the length of the sequence (whole, split, or chimeric) and the length of its masked portion, respectively.
-
Column 4: The cvg-by-all value is the total alignment length found from all sequences in the FCS-GX database.
-
Columns 6-30: The alignment information for a maximum of four sets of tax-ids along with their divisions. FCS-GX prints the taxonomic name (see column 6) for the first set. It also prints the tax-id, division (derived from the “BLAST name” divisions in taxonomy), total alignment length from the division hits or just the specified tax-id, and a score for all four sets. FCS-GX returns information for a maximum of two tax-ids from the same division.
-
Column 31: The sequence weight.
-
Column 32: FCS-GX result. This result can be any one of the following:
Result: Description
primary-div: sequence belongs to division of the input tax-id
contaminant: sequence identified as a contaminant
contaminant(synthetic): one of the top four taxa belongs to the 'synthetic' division, and the score is close to nearest matching division
contaminant(virus): one of the top four taxa belongs to the 'virus' division, and the score is close to nearest matching division
contaminant(repeat): probably belongs to a contaminant division, but the sequence is highly repeat-specific
contaminant(prok): matches to multiple prokaryotes and suggests the sequence is prokaryote-specific
contaminant(close-div): strong and unambiguous hit from a closely-related division
bogus: inconclusive because the nearest matching taxon has high overlap with a different division
repeat: inconclusive because the sequence is highly repeat-specific
low-coverage: inconclusive due to low coverage
inconclusive: inconclusive for other reasons
-
Column 33: The taxonomic division assigned to the sequence by FCS-GX.
-
Column 34: The percentage alignment coverage for the sequence in the taxonomic division.
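A quick way to see which of the column 32 results occur in a report is to tally them. The sketch below assumes a tab-separated report whose header and comment lines start with `#`; `result_counts` is a hypothetical helper name.

```shell
# Count report rows per FCS-GX result (column 32), most frequent first.
# Usage: result_counts GCA_000006565.2.taxonomy.rpt
result_counts() {
  awk -F'\t' '
    /^#/ { next }                          # skip header and comment lines
    { n[$32]++ }                           # tally each result value
    END { for (r in n) print n[r] "\t" r }
  ' "$1" | sort -k1,1nr -k2,2
}
```

Each output line holds a count followed by the result value, so split or chimeric ranges of one sequence are counted as separate rows.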
The sequences below demonstrate some example outputs from taxonomy.rpt for a butterfly. The first sequence is insect. The second sequence is bacteria. While the third sequence is also insect, it has several weaker hits to bacteria.
# column numbers
1 2 3 4 5 6 7 8 9 10 11 12
13 14 15 16 17 18
19 20 21 22 23 24
25 26 27 28 29 30 31 32 33 34
#seq-id seq-len masked-len cvg-by-all sep1 tax-name-1 tax-id-1 div-1 cvg-by-div-1 cvg-by-tax-1 score-1 sep2
tax-id-2 div-2 cvg-by-div-2 cvg-by-tax-2 score-2 sep3
tax-id-3 div-3 cvg-by-div-3 cvg-by-tax-3 score-3 sep4
tax-id-4 div-4 cvg-by-div-4 cvg-by-tax-4 score-4 sep5 weight result div div_pct_cvg
# example sequence identified as insect (expected)
FARY01017106.1 14773 406 10195 | Vanessa tameamea 334116 insects 10094 9207 256 |
278856 insects 10094 9238 235 |
241271 bony fishes 177 115 14 |
3662 plants 100 100 14 | 4 primary-div insects 68
# example Heliconius melpomene (a butterfly) sequence identified as an Enterobacter contaminant
FARY01000050.1 15785 1592 15785 | Enterobacter quasimori 2838947 prok|g-proteobacteria 15785 15740 561 |
550 prok|g-proteobacteria 15785 15724 561 |
32630 synthetic 15531 15531 479 |
85692 plants 2179 2169 78 | 3 contaminant prok|g-proteobacteria 100
# conflicting results (this probably is a butterfly sequence for a chitinase, with bacteria homologs)
FARY01021243.1 2942 0 2279 | Vanessa tameamea 334116 insects 2279 2068 110 |
116150 insects 2279 1945 110 |
137545 prok|g-proteobacteria 1747 1618 75 |
614 prok|g-proteobacteria 1747 1611 75 | 3 primary-div insects 77
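In the example rows on this page, div_pct_cvg agrees with the coverage of the assigned division divided by seq-len, rounded to the nearest percent (for instance, 10094 of 14773 bases gives 68). The sketch below reproduces that arithmetic. The formula is inferred from these examples only and is an assumption, not a statement of how FCS-GX computes the value; `pct_cvg` is a hypothetical helper name.

```shell
# Percentage of a sequence covered by alignments to its assigned division.
# Usage: pct_cvg CVG_BY_DIV SEQ_LEN
pct_cvg() {
  awk -v cvg="$1" -v len="$2" 'BEGIN { printf "%.0f\n", cvg / len * 100 }'
}

pct_cvg 10094 14773   # first example row (insects)
pct_cvg 15785 15785   # second example row (prok|g-proteobacteria)
pct_cvg 2279 2942     # third example row (insects)
```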
The following steps will help you parse/interpret the taxonomy.rpt output:
- Retrieve a list of sequences with at least one contaminant identifier (including chimeras):
cat GCA_000006565.2.taxonomy.rpt | awk -v FS='\t' -v OFS='\t' '$32~/contaminant/{print $1}' | cut -d '~' -f 1 | uniq
- Retrieve the FCS-GX output for all sequences with a mix of contaminant and primary-div ranges (putative chimeric sequences):
grep primary-div GCA_000006565.2.taxonomy.rpt | cut -f 1 | cut -d "~" -f 1 | grep -F -f - GCA_000006565.2.taxonomy.rpt | grep contaminant | cut -f 1 | cut -d "~" -f 1 | grep -F -f - GCA_000006565.2.taxonomy.rpt
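The pipeline above matches accessions as substrings anywhere in a line, so an accession that is a prefix of another one can pull in unrelated rows. A possible alternative is a two-pass awk sketch that compares the exact base accession and checks only column 32. It assumes a tab-separated report with `#`-prefixed header lines; `mixed_ranges` is a hypothetical helper name.

```shell
# Print all report rows for sequences that have at least one primary-div
# range and at least one contaminant range.
# Usage: mixed_ranges GCA_000006565.2.taxonomy.rpt
mixed_ranges() {
  awk -F'\t' '
    /^#/ { next }                          # skip header and comment lines
    { id = $1; sub(/~.*/, "", id) }        # base accession without ranges
    NR == FNR {                            # first pass: record result types
      if ($32 == "primary-div") prim[id] = 1
      else if ($32 ~ /^contaminant/) cont[id] = 1
      next
    }
    (id in prim) && (id in cont)           # second pass: print mixed sequences
  ' "$1" "$1"
}
```

The file is passed twice on purpose: the first read collects the result types per accession, and the second read prints the matching rows in their original order.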
- Calculate the percentage of the total genome length classified as primary-div:
cat GCA_000006565.2.taxonomy.rpt | awk -v FS='\t' -v OFS='\t' '($32 == "primary-div"){ pr_div_len += $2 }; 1{ tot_len += $2 } END{ print pr_div_len/tot_len*100 }'
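The same calculation can be extended to every result value, which shows how much of the assembly falls into each category. This is a minimal sketch, assuming a tab-separated report with `#`-prefixed header lines; `length_by_result` is a hypothetical helper name.

```shell
# Print, for each result (column 32): total length and percent of all lengths.
# Usage: length_by_result GCA_000006565.2.taxonomy.rpt
length_by_result() {
  awk -F'\t' '
    /^#/ { next }                          # skip header and comment lines
    { len[$32] += $2; tot += $2 }          # sum seq-len per result and overall
    END {
      if (tot > 0)
        for (r in len)
          printf "%s\t%.0f\t%.2f\n", r, len[r], len[r] / tot * 100
    }
  ' "$1"
}
```

The output order is not defined by awk, so pipe the result through `sort` if a stable order is needed.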
Please create an Issue if you encounter any problems.
For all other questions or comments, please contact us at [email protected]