Taxonomy Check

george-coulouris edited this page May 14, 2024 · 9 revisions

The pipeline verifies the organism name provided on input when the pgap.py flag --taxcheck or --taxcheck-only is used.

The taxonomy check assesses whether the organism name provided in the YAML input file matches the input genome sequence. Using average nucleotide identity (ANI), it compares the input genome sequence to the genomes of the type strains in GenBank. In a first step, the set of type assemblies to which the input sequence is most closely related is determined via k-mer analysis. This set of assemblies is then aligned to the input sequence with pairwise MegaBLAST. The percent identity of the resulting filtered reciprocal best hits is declared as the overall genome-to-type-assembly ANI.

For most species, we use an ANI threshold of 96% identity and a minimum coverage threshold of 80% of both the query and the type assembly to declare that a query assembly matches a type assembly with high confidence.
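The threshold rule above can be sketched in a few lines of Python. This is only an illustration and not the pipeline's own code: the function name and parameters are hypothetical, and treating the cutoffs as inclusive (>=) is an assumption, since the text does not say how values exactly at the threshold are handled.

```python
# Default cutoffs quoted in the text for "most species".
ANI_THRESHOLD = 96.0       # percent identity
COVERAGE_THRESHOLD = 80.0  # percent, applied to both query and type assembly


def is_high_confidence_match(
    ani: float,
    query_coverage: float,
    subject_coverage: float,
    ani_threshold: float = ANI_THRESHOLD,
    coverage_threshold: float = COVERAGE_THRESHOLD,
) -> bool:
    """Return True when the ANI and both coverages reach their cutoffs.

    Hypothetical helper: inclusive comparison is an assumption.
    """
    return (
        ani >= ani_threshold
        and query_coverage >= coverage_threshold
        and subject_coverage >= coverage_threshold
    )
```

For instance, `is_high_confidence_match(99.975, 99.8, 99.8)` evaluates to True, while `is_high_confidence_match(94.484, 76.4, 59.1)` evaluates to False because all three values fall below their cutoffs.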

More information is available in this publication.

Possible ANI statuses

The status returned by the taxonomy check can be one of the following:

CONFIRMED: The submitted organism name has been confirmed by ANI. A species can be confirmed by the following methods:

  • The assembly matches a type and both are of the same species.
  • The assembly matches a type and at least one is a subspecies of the same species.
  • The assembly lacks a submitted full binomial name (i.e., submitted organism is a "sp.", or at genus level), matches a type, and both share the same genus.
  • The assembly matches a type of a species that was added to a specialized synonymy list designed to cover difficult-to-handle cases of typing.

MISASSIGNED: The submitted organism name has been found to be misassigned to the query assembly.

  • The assembly matches a type for a different species.
  • If the submitted organism name is a "sp.", there is a mismatch at the genus level.

INCONCLUSIVE: The organism cannot be identified.

  • There is no type assembly available for the submitted organism name.
  • The assembly matches a type of the same species, but the ANI is below the species ANI threshold.
  • The assembly matches a type of a different species, but the ANI is below the species ANI threshold.
  • The assembly and closest type do not share enough sequence to make a determination.

CONTAMINATED: Contamination in genome assemblies will be reported if the following conditions are met:

  • We have a reference covering at least 50% of the assembly.
  • We have a single taxon accounting for at least 10% of the coverage and at least half of the remaining sequence.
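As a rough illustration of how the CONFIRMED, MISASSIGNED and INCONCLUSIVE rules listed above fit together, here is a simplified Python sketch. It is not the pipeline's implementation. The `TypeMatch` structure, the function name, and the order in which the rules are checked are all assumptions, and the subspecies rule, the synonymy list, and the CONTAMINATED status are deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional

ANI_THRESHOLD = 96.0  # percent; the text gives this cutoff for most species


@dataclass
class TypeMatch:
    """Closest type assembly for the query (hypothetical structure)."""
    species: str                  # e.g. "Rickettsia japonica"
    genus: str                    # e.g. "Rickettsia"
    ani: float                    # percent identity to the query
    enough_shared_sequence: bool  # False when too little sequence is shared


def ani_status(
    submitted_genus: str,
    submitted_species: Optional[str],  # None for "sp." or genus-level names
    submitted_has_type: bool,
    match: Optional[TypeMatch],
) -> str:
    """Simplified status decision; rule precedence is an assumption."""
    # Nothing to compare against, or too little shared sequence.
    if match is None or not match.enough_shared_sequence:
        return "INCONCLUSIVE"
    # A full binomial was submitted but no type assembly exists for it.
    if submitted_species is not None and not submitted_has_type:
        return "INCONCLUSIVE"
    # A match below the species ANI threshold cannot settle the question.
    if match.ani < ANI_THRESHOLD:
        return "INCONCLUSIVE"
    # "sp." or genus-level submissions are compared at the genus level.
    if submitted_species is None:
        return "CONFIRMED" if match.genus == submitted_genus else "MISASSIGNED"
    return "CONFIRMED" if match.species == submitted_species else "MISASSIGNED"
```

Under these assumptions, a genome submitted as Rickettsia hoogstraalii whose closest type is Rickettsia japonica at 99.975% ANI would be reported as MISASSIGNED.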

Description of the reports

The taxonomy check will produce two reports:

ani-tax-report.txt

This file provides the results of the taxonomy check in text format. It includes

  • Submitted organism name: the organism declared by the submitter, along with NCBI taxonomy identifier, rank (e.g., species), and taxonomic lineage.
  • Predicted organism name: the organism identity determined by ANI. This may be the same as the submitted organism name.
  • Submitted organism has type: possible values are Yes and No. Indicates whether there is a public genome assembly available for the type strain of the declared species.
  • Status: possible values are CONFIRMED, MISASSIGNED, INCONCLUSIVE or CONTAMINATED (see above)
  • Confidence: possible values are HIGH or LOW. Indicates the confidence level of the stated status. HIGH: the ANI criteria meet the expected cutoff (96% for most prokaryotic taxa). LOW: the ANI criteria do not meet the expected cutoff, but the status is the best prediction possible based on currently available data.
  • ANI statistics: A table with the following columns:
  1. ANI: the average nucleotide identity (ANI) of the assembly to the assembly from type material, expressed as a percentage.
  2. (Coverages): two values. Query coverage is the coverage of the assembly by the assembly from type material, expressed as a percentage. Subject coverage is the coverage of the assembly from type material by the assembly, expressed as a percentage.
  3. NewSeq: the count of bases in the assembly best assigned to the assembly from type material.
  4. CntmSeq: the portion of NewSeq allocated for purposes of evaluating contamination.
  5. Flg: annotations for the assembly from type material; C = contaminant; E = effectively published; T = trusted species.
  6. Assembly: GenBank release id of the assembly from type material.
  7. Organism: taxonomic name of the assembly from type material.
  8. (assembly_accession, assembly_name): GenBank assembly accession and assembly name of the assembly from type material.

ani-tax-report.xml

The same data as ani-tax-report.txt, but in XML format.

Example reports

Example of a MISASSIGNED report:

ANI report for assembly: <fasta_file_name>
Submitted organism: Rickettsia hoogstraalii (taxid = 467174, rank = species, lineage = Bacteria; Proteobacteria; Alphaproteobacteria; Rickettsiales; Rickettsiaceae; Rickettsieae; Rickettsia; spotted fever group)
Predicted organism: Rickettsia japonica (taxid = 35790, rank = species, lineage = Bacteria; Proteobacteria; Alphaproteobacteria; Rickettsiales; Rickettsiaceae; Rickettsieae; Rickettsia; spotted fever group)
Submitted organism has type: Yes
Status: MISASSIGNED
Confidence: HIGH
99.975 (99.8 99.8)  406738 assembly  Rickettsia japonica YH (GCA_000283595.1, ASM28359v1)
99.985 (99.4 99.9)  864348 assembly  Rickettsia japonica YH (GCA_000302635.2, ASM30263v2)
97.722 (96.5 97.1)  320558 assembly  Rickettsia slovaca 13-B (GCA_000237845.1, ASM23784v1)
98.893 (95.3 84.4) 6004488 assembly  Rickettsia fournieri (GCA_900243065.1, PRJEB23962)
97.100 (96.3 91.7)  834068 assembly  Rickettsia gravesii BWI-1 (GCA_000485845.1, RicGra1.0)
97.246 (95.9 83.0) 1655938 assembly  Rickettsia raoultii (GCA_000940955.1, ASM94095v1)
97.114 (95.8 96.9) 3973378 assembly  Rickettsia rickettsii (GCA_001951015.1, ASM195101v1)
97.115 (95.8 96.9) 3973358 assembly  Rickettsia rickettsii (GCA_001950995.1, ASM195099v1)
97.115 (95.8 96.9) 1526588 assembly  Rickettsia rickettsii str. Iowa (GCA_000017445.3, ASM1744v3)
94.100 (84.7 74.8) 1199088 assembly  Rickettsia tamurae (GCA_000751075.1, Rickettsia tamurae AT-1)
94.312 (79.2 75.1) 1720158 assembly  Rickettsia monacensis (GCA_000499665.2, RMONA_1)
94.484 (76.4 59.1) 1086398 assembly  Rickettsia buchneri (GCA_000696365.1, REISMNv1)
99.121 (99.0 99.4)  296048 assembly  Rickettsia heilongjiangensis 054 (GCA_000221205.1, ASM22120v1)
97.446 (96.3 97.4)  380228 assembly  Rickettsia honei RB (GCA_000263055.1, Rho1.0)
[...]
96.865 (94.5 92.9)  407678 assembly  Rickettsia rhipicephali str. 3-7-female6-CWPP (GCA_000284075.1, ASM28407v1)
93.031 (83.9 72.5) 1485538 assembly  Rickettsia hoogstraalii (GCA_000825685.1, Rickettsia hoogstraalii Croatica)
[...]

In the above example, the organism was declared by the submitter to be Rickettsia hoogstraalii. The predicted organism is found to be Rickettsia japonica with high confidence, based on an ANI of 99.975% over 99.8% of the input sequence.
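For readers who want to pull the ANI and coverage values out of rows like the ones above, the following Python sketch may help. It is based only on the row layout visible in this example, not on a documented format specification: the regular expression, the `AniRow` type, and the function name are assumptions. The text between the coverages and the final parenthesised accession is returned unparsed in `middle`, because the example does not make the boundaries of those columns unambiguous.

```python
import re
from typing import NamedTuple, Optional

# Matches rows shaped like:
#   99.975 (99.8 99.8)  406738 assembly  Rickettsia japonica YH (GCA_000283595.1, ASM28359v1)
ROW_PATTERN = re.compile(
    r"^\s*(?P<ani>\d+(?:\.\d+)?)"                                     # ANI
    r"\s+\(\s*(?P<qcov>\d+(?:\.\d+)?)\s+(?P<scov>\d+(?:\.\d+)?)\s*\)"  # coverages
    r"\s+(?P<middle>.*?)"                                             # left unparsed
    r"\s*\((?P<accession>GC[AF]_\d+\.\d+),\s*(?P<name>.*)\)\s*$"      # accession, name
)


class AniRow(NamedTuple):
    ani: float
    query_coverage: float
    subject_coverage: float
    middle: str          # remaining columns, kept as raw text
    accession: str
    assembly_name: str


def parse_ani_row(line: str) -> Optional[AniRow]:
    """Parse one table row; return None for headers, "[...]" and other lines."""
    m = ROW_PATTERN.match(line)
    if m is None:
        return None
    return AniRow(
        ani=float(m["ani"]),
        query_coverage=float(m["qcov"]),
        subject_coverage=float(m["scov"]),
        middle=m["middle"],
        accession=m["accession"],
        assembly_name=m["name"],
    )
```

Applied to the first row of the example, this yields an ANI of 99.975, query and subject coverages of 99.8, and the accession GCA_000283595.1. Lines such as "Status: MISASSIGNED" or "[...]" return None and can simply be skipped.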

Example of a CONTAMINATED report:

ANI report for assembly: <fasta_file_name>
Submitted organism: Staphylococcus aureus (taxid = 1280, rank = species, lineage = Bacteria; Firmicutes; Bacilli; Bacillales; Staphylococcaceae; Staphylococcus)
Predicted organism: Staphylococcus aureus (taxid = 1280, rank = species, lineage = Bacteria; Firmicutes; Bacilli; Bacillales; Staphylococcaceae; Staphylococcus)
Submitted organism has type: Yes
Status: CONTAMINATED
Confidence: HIGH
99.045 (54.5 80.3) 4972758 assembly  Ochrobactrum quorumnocens (GCA_002278035.1, ASM227803v1)
99.450 (31.5 94.1) 11348628 assembly  Staphylococcus aureus (GCA_006364675.1, ASM636467v1)
99.450 (31.5 94.3) 10960368 assembly  Staphylococcus aureus subsp. aureus (GCA_006094915.1, ASM609491v1)
99.450 (31.5 94.3) 1806888 assembly  Staphylococcus aureus subsp. aureus DSM 20231 (GCA_001027105.1, ASM102710v1)
99.441 (31.5 94.2) 8986608 assembly  Staphylococcus aureus (GCA_900706775.1, 27323_B01)
99.450 (31.5 95.1) 2490008 assembly  Staphylococcus aureus subsp. aureus DSM 20231 (GCA_000330825.2, SASA1.0)
99.465 (31.4 96.1) 2855328 assembly  Staphylococcus aureus subsp. aureus NBRC 100910 (GCA_001544175.1, ASM154417v1)
97.857 (28.8 93.1) 5947508 assembly  Staphylococcus aureus subsp. anaerobius (GCA_002902425.1, ASM290242v1)
88.596 (38.7 58.4) 6727398 assembly  Ochrobactrum pituitosum (GCA_003049685.2, ASM304968v2)
[...]

In the above example, the organism was declared by the submitter to be Staphylococcus aureus. The predicted organism was in agreement, but there was contamination from Ochrobactrum quorumnocens, which has 99.045% identity over 54.5% of the input sequence, representing 80.3% of the contaminating organism's genome.