
FCS GX taxonomy report

Strope, Pooja edited this page Jul 15, 2022 · 7 revisions

Taxonomy Report Output

The initial report from FCS-GX is provided in the file <basename of FASTA file>.<tax-id provided>.taxonomy.rpt. For more FCS-GX details and quickstart instructions, please review the FCS-GX documentation.

The following table lists each column number (first column) with its column header (second column) and an example value (third column):

1:      #seq-id         OU830638.1
2:      seq-len         6422716
3:      masked-len      793013
4:      cvg-by-all      5578796
5:      sep1            |
6:      tax-name-1      Neonectria sp. DH2
7:      tax-id-1        1735992
8:      div-1           ascomycetes
9:      cvg-by-div-1    5562234
10:     cvg-by-tax-1    5336039
11:     score-1         8946
12:     sep2            |
13:     tax-id-2        930093
14:     div-2           ascomycetes
15:     cvg-by-div-2    5562234
16:     cvg-by-tax-2    3899812
17:     score-2         6463
18:     sep3            |
19:     tax-id-3        64524
20:     div-3           fungi
21:     cvg-by-div-3    35905
22:     cvg-by-tax-3    12317
23:     score-3         523
24:     sep4            |
25:     tax-id-4        108931
26:     div-4           insects
27:     cvg-by-div-4    45081
28:     cvg-by-tax-4    32628
29:     score-4         491
30:     sep5            |
31:     weight          4
32:     result          primary-div
33:     div             ascomycetes
34:     div_pct_cvg     87
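
A listing like the one above can be produced for any report by pairing the header row with the first data row. The following is a minimal sketch, assuming the report is tab-separated (as the awk commands on this page also assume) and that the header row is the line whose first field is #seq-id. The function name gx_show_columns is hypothetical and is not part of FCS-GX.

```shell
# Hypothetical helper (not part of FCS-GX): print "number: header value"
# for the first data row of a taxonomy report.
gx_show_columns() {
  awk -F'\t' '
    $1 == "#seq-id" {                 # header row: remember each column name
      for (i = 1; i <= NF; i++) hdr[i] = $i
      n = NF
      next
    }
    n > 0 && $0 !~ /^#/ {             # first data row after the header
      for (i = 1; i <= n; i++) printf "%d:\t%s\t%s\n", i, hdr[i], $i
      exit
    }
  ' "$1"
}
```

Usage: gx_show_columns GCA_000006565.2.taxonomy.rpt
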
  • Column 1: A seq-id (sequence ID). This can be in the following formats:

    • A whole sequence with a hit to a taxonomic division.

      #seq-id
      OU830638.1
      
    • A sequence split on runs of 10+Ns. The seq-id includes the start and end for each split range of the sequence formatted as ~start..end.

      #seq-id
      CH476754.1~1..212539
      CH476754.1~212640..216643
      CH476754.1~218504..255730
      
    • A sequence with hits to multiple taxonomic divisions, making it a putative chimeric sequence. The seq-id includes the start and end for each chimeric range of the sequence formatted as ~~start..end.

      #seq-id
      CR382124.1~~1164..1687942
      CR382124.1~~1694735..1696001
      
    • A split sequence that is also chimeric. The seq-id is formatted as ~start..end~~substart..subend, where the subranges are relative to the starting coordinate of the split sequence.

      #seq-id
      UYJD01000002.1~1709646..1813733~~5112..84751
      UYJD01000002.1~1709646..1813733~~100474..101416
      
  • Columns 2 and 3: The seq-len (sequence length) and masked-len (masked length) representing the length of the sequence (whole, split, or chimeric) and the masked length of the sequence, respectively.

  • Column 4: The cvg-by-all representing the total alignment length found from all sequences in the FCS-GX database.

  • Columns 6-30: The alignment information for a maximum of four sets of tax-ids along with their divisions. FCS-GX prints the taxonomic name (see column 6) for the first set. It also prints the tax-id, division (derived from the “BLAST name” divisions in taxonomy), total alignment length from the division hits or just the specified tax-id, and a score for all four sets. FCS-GX returns information for a maximum of two tax-ids from the same division.

  • Column 31: The sequence weight.

  • Column 32: FCS-GX result. This result can be any one of the following:

    Result                    Description
    primary-div               sequence belongs to division of the input tax-id
    contaminant               sequence identified as a contaminant
    contaminant(synthetic)    one of the top four taxa belongs to the 'synthetic' division, and the score is close to nearest matching division
    contaminant(virus)        one of the top four taxa belongs to the 'virus' division, and the score is close to nearest matching division
    contaminant(repeat)       probably belongs to a contaminant division, but the sequence is highly repeat-specific
    contaminant(prok)         matches to multiple prokaryotes and suggests the sequence is prokaryote-specific
    contaminant(close-div)    strong and unambiguous hit from a closely-related division
    bogus                     inconclusive because the nearest matching taxon has high overlap with a different division
    repeat                    inconclusive because the sequence is highly repeat-specific
    low-coverage              inconclusive due to low coverage
    inconclusive              inconclusive for other reasons
  • Column 33: The taxonomic division assigned to the sequence by FCS-GX.

  • Column 34: The percentage alignment coverage for the sequence in the taxonomic division.
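
The four seq-id formats described for column 1 can be separated into a base sequence ID, a split range, and a chimeric range. The following is a minimal sketch, assuming that the base sequence ID itself never contains a ~ character; the function name gx_parse_seqid and the use of - for an absent range are illustrative choices, not FCS-GX conventions.

```shell
# Hypothetical helper (not part of FCS-GX): read seq-ids on stdin and print
# three tab-separated fields: base ID, split range, chimeric range.
# A missing range is printed as "-".
gx_parse_seqid() {
  awk -v OFS='\t' '{
    id = $0; split_rng = "-"; chim_rng = "-"
    p = index(id, "~~")               # chimeric range follows a double tilde
    if (p > 0) { chim_rng = substr(id, p + 2); id = substr(id, 1, p - 1) }
    p = index(id, "~")                # split range follows a single tilde
    if (p > 0) { split_rng = substr(id, p + 1); id = substr(id, 1, p - 1) }
    print id, split_rng, chim_rng
  }'
}
```

Usage: grep -v '^#' GCA_000006565.2.taxonomy.rpt | cut -f 1 | gx_parse_seqid

The sketch reports the chimeric subrange exactly as written in the seq-id; it does not convert it to coordinates on the full sequence.
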

Example Outputs

The examples below show taxonomy.rpt output for a butterfly; each record is wrapped across four lines for display. The first sequence is classified as insect and the second as bacterial. The third sequence is also classified as insect, but it has several weaker hits to bacteria.

# column numbers
 1               2          3               4              5     6                          7               8       9               10              11      12
                                                                                            13              14      15              16              17      18
                                                                                            19              20      21              22              23      24
                                                                                            25              26      27              28              29      30     31      32              33       34

#seq-id          seq-len    masked-len      cvg-by-all     sep1  tax-name-1                 tax-id-1        div-1   cvg-by-div-1    cvg-by-tax-1    score-1 sep2
                                                                                            tax-id-2        div-2   cvg-by-div-2    cvg-by-tax-2    score-2 sep3
                                                                                            tax-id-3        div-3   cvg-by-div-3    cvg-by-tax-3    score-3 sep4
                                                                                            tax-id-4        div-4   cvg-by-div-4    cvg-by-tax-4    score-4 sep5    weight  result        div       div_pct_cvg

# example sequence identified as insect (expected)
FARY01017106.1   14773      406             10195          |     Vanessa tameamea           334116          insects                 10094     9207    256   |
                                                                                            278856          insects                 10094     9238    235   |
                                                                                            241271          bony fishes             177       115     14    |
                                                                                            3662            plants                  100       100     14    |       4       primary-div   insects   68

# example Heliconius melpomene (a butterfly) sequence identified as an Enterobacter contaminant
FARY01000050.1   15785      1592            15785          |     Enterobacter quasimori     2838947         prok|g-proteobacteria   15785   15740   561     |
                                                                                            550             prok|g-proteobacteria   15785   15724   561     |
                                                                                            32630           synthetic               15531   15531   479     |
                                                                                            85692           plants                  2179    2169    78      |       3       contaminant    prok|g-proteobacteria   100

# conflicting results (this is probably a butterfly sequence for a chitinase, with bacterial homologs)
FARY01021243.1   2942       0               2279           |      Vanessa tameamea          334116          insects                 2279    2068    110     |
                                                                                            116150          insects                 2279    1945    110     |
                                                                                            137545          prok|g-proteobacteria   1747    1618    75      |
                                                                                            614             prok|g-proteobacteria   1747    1611    75      |       3       primary-div     insects  77
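
To compare the hits within a record like the ones above, it can help to list each of the four tax-id sets on its own line. The following is a minimal sketch, assuming a tab-separated report in which the sets start at columns 7, 13, 19, and 25 (as in the column table) and in which an unused set has an empty tax-id field; that last point and the function name gx_top_hits are assumptions, not documented FCS-GX behavior.

```shell
# Hypothetical helper (not part of FCS-GX): print one line per reported hit
# with seq-id, rank (1-4), tax-id, division, and score.
gx_top_hits() {
  awk -F'\t' -v OFS='\t' '
    /^#/ { next }                     # skip header and comment lines
    NF >= 30 {
      for (k = 0; k < 4; k++) {
        base = 7 + 6 * k              # tax-id column of set k+1
        if ($base != "") print $1, k + 1, $base, $(base + 1), $(base + 4)
      }
    }
  ' "$1"
}
```

Usage: gx_top_hits GCA_000006565.2.taxonomy.rpt
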

Interpreting Outputs

The following steps will help you parse/interpret the taxonomy.rpt output:

  1. Retrieve a list of sequences with at least one contaminant identifier (including chimeras):
     cat GCA_000006565.2.taxonomy.rpt | awk -v FS='\t' -v OFS='\t' '$32~/contaminant/{print $1}' |  cut -d '~' -f 1 | uniq
  2. Retrieve the FCS-GX output for all sequences with a mix of contaminant and primary-div ranges (putative chimeric sequences):
     cat GCA_000006565.2.taxonomy.rpt | grep primary-div | cut -f 1 | cut -d "~" -f 1 | fgrep -f - GCA_000006565.2.taxonomy.rpt | grep contaminant | cut -f 1 | cut -d "~" -f 1 | fgrep -f - GCA_000006565.2.taxonomy.rpt
  3. Calculate the percentage of the total genome length classified as primary-div:
     cat GCA_000006565.2.taxonomy.rpt | awk -v FS='\t' -v OFS='\t' '($32 == "primary-div"){ pr_div_len += $2 }; 1{ tot_len += $2 } END{ print pr_div_len/tot_len*100 }'
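
The percentage calculation for primary-div can be extended to every result category at once. The following is a minimal sketch, assuming a tab-separated report with seq-len in column 2 and result in column 32; the function name gx_len_by_result and the (no result) label for rows with an empty result field are illustrative, not part of FCS-GX.

```shell
# Hypothetical helper (not part of FCS-GX): print each result category with
# its summed seq-len and its percentage of the total length, sorted by name.
gx_len_by_result() {
  awk -F'\t' '
    /^#/ { next }                     # skip header and comment lines
    {
      r = ($32 == "" ? "(no result)" : $32)
      len[r] += $2
      tot += $2
    }
    END {
      if (tot == 0) exit              # nothing to report, avoid dividing by zero
      for (r in len) printf "%s\t%.0f\t%.1f\n", r, len[r], 100 * len[r] / tot
    }
  ' "$1" | sort
}
```

Usage: gx_len_by_result GCA_000006565.2.taxonomy.rpt

Because split and chimeric ranges are reported as separate rows, the totals are sums over reported ranges rather than over whole sequences.
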