FCS GX taxonomy report

Strope, Pooja edited this page Jul 27, 2022 · 7 revisions

Taxonomy Report Output

The initial report from FCS-GX is written to the file <basename of fasta file>.<tax-id provided>.taxonomy.rpt. For more details on FCS-GX and quickstart instructions, see the FCS-GX documentation.

The following table lists each column number (first column), its header (second column), and an example value (third column):

1:      #seq-id          OU830638.1
2:      seq-len          6422716
3:      (xp,lc,co,n)-len 5104,9351,29940,0
4:      cvg-by-all       6034920
5:      sep1             |
6:      tax-name-1       Neonectria ditissima
7:      tax-id-1         78410
8:      div-1            fung:ascomycetes
9:      cvg-by-div-1     5994233
10:     cvg-by-tax-1     5528541
11:     score-1          10033
12:     sep2             |
13:     tax-id-2         1735992
14:     div-2            fung:ascomycetes
15:     cvg-by-div-2     5994233
16:     cvg-by-tax-2     5366714
17:     score-2          9852
18:     sep3             |
19:     tax-id-3         2940382
20:     div-3            fung:budding yeasts
21:     cvg-by-div-3     56273
22:     cvg-by-tax-3     31377
23:     score-3          420
24:     sep4             |
25:     tax-id-4         378046
26:     div-4            fung:budding yeasts
27:     cvg-by-div-4     56273
28:     cvg-by-tax-4     8406
29:     score-4          223
30:     sep5             |
31:     weight           4
32:     result           primary-div
33:     div              fung:ascomycetes
34:     div_pct_cvg      93
  • Column 1: A seq-id (sequence ID). This can be in the following formats:

    • A whole sequence with a hit to a taxonomic division.

      #seq-id
      OU830638.1
      
    • A sequence split on runs of 10 or more Ns. The seq-id includes the start and end of each split range of the sequence, formatted as ~start..end.

      #seq-id
      CH476754.1~1..212539
      CH476754.1~212640..216643
      CH476754.1~218504..255730
      
    • A sequence with hits to multiple taxonomic divisions, making it a putative chimeric sequence. The seq-id includes the start and end of each chimeric range of the sequence, formatted as ~~start..end.

      #seq-id
      CR382124.1~~1164..1687942
      CR382124.1~~1694735..1696001
      
    • A split sequence that is also chimeric. The seq-id is formatted as ~start..end~~substart..subend, where the subranges are relative to the starting coordinate of the split sequence.

      #seq-id
      UYJD01000002.1~1709646..1813733~~5112..84751
      UYJD01000002.1~1709646..1813733~~100474..101416
      
  • Columns 2 and 3: The seq-len (sequence length) and (xp,lc,co,n)-len (masked length) give the length of the sequence (whole, split, or chimeric) and its masked length, respectively. The masked length is a comma-separated tuple of the lengths masked on four tracks: transposons (xp), low-complexity regions (lc), highly conserved regions (co), and Ns (n).

  • Column 4: The cvg-by-all value, which is the total alignment length found from all sequences in the FCS-GX database.

  • Columns 6-30: The alignment information for up to four tax-ids and their divisions, with the sets delimited by the | separator columns. FCS-GX prints the taxonomic name (column 6) for the first set only. For all four sets it prints the tax-id, the division (derived from the “BLAST name” divisions in taxonomy), the total alignment length from hits to the division, the total alignment length from hits to the specified tax-id alone, and a score. FCS-GX reports at most two tax-ids from the same division.

  • Column 31: The sequence weight.

  • Column 32: FCS-GX result. This result can be any one of the following:

    Result                  Description
    primary-div             sequence belongs to division of the input tax-id
    contaminant             sequence identified as a contaminant
    contaminant(synthetic)  one of the top four taxa belongs to the 'synthetic' division, and the score is close to nearest matching division
    contaminant(virus)      one of the top four taxa belongs to the 'virus' division, and the score is close to nearest matching division
    contaminant(repeat)     probably belongs to a contaminant division, but the sequence is highly repeat-specific
    contaminant(prok)       matches to multiple prokaryotes and suggests the sequence is prokaryote-specific
    contaminant(close-div)  strong and unambiguous hit from a closely-related division
    bogus                   inconclusive because the nearest matching taxon has high overlap with a different division
    repeat                  inconclusive because the sequence is highly repeat-specific
    low-coverage            inconclusive due to low coverage
    inconclusive            inconclusive for other reasons
  • Column 33: The taxonomic division assigned to the sequence by FCS-GX.

  • Column 34: The percentage alignment coverage for the sequence in the taxonomic division.
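The seq-id formats (column 1) and the masked-length tuple (column 3) described above can be unpacked with a few lines of Python. This is a minimal sketch based only on the formats shown on this page; the names `SeqId`, `MaskedLen`, `parse_seq_id`, and `parse_masked_len` are hypothetical and are not part of FCS-GX.

```python
import re
from typing import NamedTuple, Optional, Tuple

Range = Tuple[int, int]


class SeqId(NamedTuple):
    accession: str              # sequence ID without any range suffix
    split: Optional[Range]      # ~start..end (split on runs of Ns), if present
    chimeric: Optional[Range]   # ~~start..end (putative chimeric range), if present


class MaskedLen(NamedTuple):
    xp: int  # transposons
    lc: int  # low-complexity
    co: int  # highly conserved regions
    n: int   # Ns


# Assumption: accessions never contain '~', so the first '~' starts a range suffix.
_SEQ_ID = re.compile(
    r"^(?P<acc>[^~]+)"
    r"(?:~(?P<s1>\d+)\.\.(?P<e1>\d+))?"    # optional split range
    r"(?:~~(?P<s2>\d+)\.\.(?P<e2>\d+))?$"  # optional chimeric range
)


def parse_seq_id(seq_id: str) -> SeqId:
    """Split a column-1 value into accession, split range, and chimeric range."""
    m = _SEQ_ID.match(seq_id)
    if m is None:
        raise ValueError(f"unrecognized seq-id format: {seq_id!r}")
    split = (int(m["s1"]), int(m["e1"])) if m["s1"] else None
    chimeric = (int(m["s2"]), int(m["e2"])) if m["s2"] else None
    return SeqId(m["acc"], split, chimeric)


def parse_masked_len(field: str) -> MaskedLen:
    """Parse the column-3 tuple, e.g. '5104,9351,29940,0'."""
    parts = field.split(",")
    if len(parts) != 4:
        raise ValueError(f"expected four comma-separated values: {field!r}")
    return MaskedLen(*(int(p) for p in parts))
```

For example, `parse_seq_id("UYJD01000002.1~1709646..1813733~~5112..84751")` yields the accession `UYJD01000002.1`, the split range `(1709646, 1813733)`, and the chimeric subrange `(5112, 84751)`, which is relative to the start of the split range.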

Example Outputs

The rows below show example taxonomy.rpt output for a butterfly. The first sequence is classified as insect and the second as bacterial. The third is also classified as insect, but it has several weaker hits to bacteria.

# column numbers
 1               2          3                 4              5     6                        7           8                 9          10              11      12
                                                                                            13              14      15              16              17      18
                                                                                            19              20      21              22              23      24
                                                                                            25              26      27              28              29      30     31      32            33       34

 #seq-id         seq-len    (xp,lc,co,n)-len  cvg-by-all     sep1  tax-name-1               tax-id-1    div-1             cvg-by-div-1  cvg-by-tax-1    score-1 sep2
                                                                                            tax-id-2        div-2   cvg-by-div-2    cvg-by-tax-2    score-2 sep3
                                                                                            tax-id-3        div-3   cvg-by-div-3    cvg-by-tax-3    score-3 sep4
                                                                                            tax-id-4        div-4   cvg-by-div-4    cvg-by-tax-4    score-4 sep5    weight     result          div                     div_pct_cvg

# example sequence identified as insect (expected)
FARY01017106.1  14773       0,0,0,0          10677           |    Melitaea cinxia           113334     anml:insects            10376   9804    262          |  
                                                                                            171605     anml:insects            10376   9375    250          |  
                                                                                            2829486    fung:basidiomycetes     92      92      12           |  29144      anml:fishes             86      86      11           |  4          primary-div     anml:insects            70

# example Heliconius melpomene (a butterfly) sequence identified as an Enterobacter contaminant
FARY01000050.1  15785       0,0,0,0          15785           |    Enterobacter chengduensis 2494701    prok:g-proteobacteria   15785   15761   886          |  
                                                                                            1812935    prok:g-proteobacteria   15785   15723   885          |  
                                                                                                                                                            |
                                                                                                                                                            |
                                                                                            4          contaminant     prok:g-proteobacteria   100

# conflicting results (this is probably a butterfly chitinase sequence with bacterial homologs)
FARY01021243.1  2942       0,0,0,0           2297            |    Vanessa cardui            171605     anml:insects            2297    2107    112          |  
                                                                                            7111       anml:insects            2297    2062    110          |  
                                                                                            614        prok:g-proteobacteria   1683    1614    75           |  2864872    prok:g-proteobacteria   1683    1668    74           |  
                                                                                            4          primary-div     anml:insects            78
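To work with rows like these programmatically, the 34 columns can be mapped to their headers. The sketch below assumes the report is tab-separated (as the `FS='\t'` setting in the commands on this page implies) and that every data row carries all 34 fields; the helper names are hypothetical. It also checks an observation from the rows on this page: column 34 equals the cvg-by-div value of the assigned division divided by seq-len, as a whole-number percentage. That relationship is inferred from these examples only and is not a documented formula.

```python
def column_names():
    """Headers for the 34 columns, in the order given in the column table."""
    names = ["#seq-id", "seq-len", "(xp,lc,co,n)-len", "cvg-by-all", "sep1", "tax-name-1"]
    for i in range(1, 5):
        names += [f"tax-id-{i}", f"div-{i}", f"cvg-by-div-{i}",
                  f"cvg-by-tax-{i}", f"score-{i}", f"sep{i + 1}"]
    names += ["weight", "result", "div", "div_pct_cvg"]
    return names


def parse_row(line):
    """Map one tab-separated data row to a dict keyed by column header."""
    fields = line.rstrip("\n").split("\t")
    names = column_names()
    if len(fields) != len(names):
        raise ValueError(f"expected {len(names)} fields, got {len(fields)}")
    return dict(zip(names, fields))


def div_pct_from_row(row):
    """Integer percent of seq-len covered by the assigned division (observed, not documented)."""
    for i in range(1, 5):
        if row[f"div-{i}"] == row["div"]:
            return 100 * int(row[f"cvg-by-div-{i}"]) // int(row["seq-len"])
    return None  # assigned division not among the reported sets


# The example values from the column table, joined with tabs.
EXAMPLE_ROW = "\t".join([
    "OU830638.1", "6422716", "5104,9351,29940,0", "6034920", "|",
    "Neonectria ditissima", "78410", "fung:ascomycetes", "5994233", "5528541", "10033", "|",
    "1735992", "fung:ascomycetes", "5994233", "5366714", "9852", "|",
    "2940382", "fung:budding yeasts", "56273", "31377", "420", "|",
    "378046", "fung:budding yeasts", "56273", "8406", "223", "|",
    "4", "primary-div", "fung:ascomycetes", "93",
])

row = parse_row(EXAMPLE_ROW)
print(row["result"], row["div"], div_pct_from_row(row))  # primary-div fung:ascomycetes 93
```

The same arithmetic reproduces the three butterfly rows above: 10376 of 14773 gives 70, 15785 of 15785 gives 100, and 2297 of 2942 gives 78.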

Interpreting Outputs

The following commands help you parse and interpret the taxonomy.rpt output:

  1. Retrieve a list of sequences with at least one contaminant identifier (including chimeras):
cat GCA_000006565.2.taxonomy.rpt | awk -v FS='\t' -v OFS='\t' '$32~/contaminant/{print $1}' |  cut -d '~' -f 1 | uniq
  2. Retrieve the FCS-GX output for all sequences with a mix of contaminant and primary-div ranges (putative chimeric sequences):
cat GCA_000006565.2.taxonomy.rpt | grep primary-div | cut -f 1 | cut -d "~" -f 1 | fgrep -f - GCA_000006565.2.taxonomy.rpt | grep contaminant | cut -f 1 | cut -d "~" -f 1 | fgrep -f - GCA_000006565.2.taxonomy.rpt
  3. Calculate the percentage of the total genome length classified as primary-div:
cat GCA_000006565.2.taxonomy.rpt | awk -v FS='\t' -v OFS='\t' '($32 == "primary-div"){ pr_div_len += $2 }; 1{ tot_len += $2 } END{ print pr_div_len/tot_len*100 }'
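The same three queries can be sketched in Python for use inside a larger script. This is an illustrative equivalent, not part of FCS-GX: the function name `summarize_report` is hypothetical, lines starting with `#` are assumed to be headers, and the second query returns sequence IDs rather than the full report rows that the shell pipeline prints.

```python
from collections import defaultdict


def summarize_report(lines):
    """Return (contaminated_ids, mixed_ids, pct_primary_div) for a taxonomy.rpt.

    contaminated_ids: sequences with at least one contaminant range
    mixed_ids:        sequences with both contaminant and primary-div ranges
    pct_primary_div:  percent of total length classified as primary-div
    """
    results_by_seq = defaultdict(set)  # base seq-id -> set of column-32 results
    primary_len = 0
    total_len = 0

    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        base_id = fields[0].split("~", 1)[0]  # drop ~start..end / ~~start..end
        seq_len = int(fields[1])              # column 2: seq-len
        result = fields[31]                   # column 32: result

        results_by_seq[base_id].add(result)
        total_len += seq_len
        if result == "primary-div":
            primary_len += seq_len

    # Mirrors the awk test $32~/contaminant/: any result containing "contaminant".
    contaminated = [s for s, res in results_by_seq.items()
                    if any("contaminant" in r for r in res)]
    mixed = [s for s in contaminated if "primary-div" in results_by_seq[s]]
    pct = 100 * primary_len / total_len if total_len else 0.0
    return contaminated, mixed, pct


# Usage, with the file name from the commands above:
#   with open("GCA_000006565.2.taxonomy.rpt") as fh:
#       contaminated, mixed, pct = summarize_report(fh)
```

Reading the file once and grouping by base seq-id avoids the repeated passes of the shell pipeline, and it matches IDs exactly rather than as substrings.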