From 703eaa8aea16dafff1e33ea8391301633e6ca7e8 Mon Sep 17 00:00:00 2001
From: Lee Katz
Date: Fri, 30 Jul 2021 14:44:45 -0400
Subject: [PATCH] Define contributions (#23)

* validate taxonomy script
* unit testing for taxonomy
* unit testing for taxonomy
* moved XXXXXX entries to a todo file
* validating names.dmp and added new entries to make taxonomy more complete
* Contributing.md doc
* link to contributing.md
* more description under contributions

Co-authored-by: Lee Katz - Aspen
---
 .github/workflows/validateTaxonomy.yml | 29 +++++++++
 CONTRIBUTING.md                        | 42 +++++++++++++
 README.md                              | 33 +++++++----
 bin/validateTaxonomy.pl                | 81 ++++++++++++++++++++++++++
 src/chromosomes-todo.tsv               | 19 ++++++
 src/chromosomes.tsv                    | 19 ------
 src/plasmids.tsv                       |  4 +-
 src/taxonomy/names.dmp                 |  7 +++
 src/taxonomy/nodes.dmp                 |  2 +
 9 files changed, 203 insertions(+), 33 deletions(-)
 create mode 100644 .github/workflows/validateTaxonomy.yml
 create mode 100644 CONTRIBUTING.md
 create mode 100644 bin/validateTaxonomy.pl
 create mode 100644 src/chromosomes-todo.tsv

diff --git a/.github/workflows/validateTaxonomy.yml b/.github/workflows/validateTaxonomy.yml
new file mode 100644
index 0000000..febf7d9
--- /dev/null
+++ b/.github/workflows/validateTaxonomy.yml
@@ -0,0 +1,29 @@
+on: [push]
+name: Validate taxonomy
+
+jobs:
+  build:
+    runs-on: ${{ matrix.os }}
+    strategy:
+      matrix:
+        os: ['ubuntu-18.04' ]
+        perl: [ '5.32' ]
+    name: Perl ${{ matrix.perl }} on ${{ matrix.os }}
+    steps:
+      - name: Set up perl
+        uses: shogo82148/actions-setup-perl@v1
+        with:
+          perl-version: ${{ matrix.perl }}
+          multi-thread: "true"
+      - name: checkout my repo
+        uses: actions/checkout@v2
+        with:
+          path: Kalamari
+
+      - name: validate taxonomy
+        run: perl Kalamari/bin/validateTaxonomy.pl Kalamari/src/taxonomy
+      - name: matching taxids
+        run: |
+          echo "Making sure that all taxids in chromosomes.tsv and plasmids.tsv are present in nodes.dmp and names.dmp"
+          tail -n +2 Kalamari/src/chromosomes.tsv Kalamari/src/plasmids.tsv -q | perl -F'\t' -lane 'BEGIN{@node=`cat Kalamari/src/taxonomy/nodes.dmp`; for $n(@node){($taxid)=split(/\t/, $n); $taxid{$taxid}++; } } for my $t($F[2], $F[3]){ if(!$taxid{$t}){ print "Could not find $t taxid";} }'
+          tail -n +2 Kalamari/src/chromosomes.tsv Kalamari/src/plasmids.tsv -q | perl -F'\t' -lane 'BEGIN{@name=`cat Kalamari/src/taxonomy/names.dmp`; for $n(@name){($taxid)=split(/\t/, $n); $taxid{$taxid}++; } } for my $t($F[2], $F[3]){ if(!$taxid{$t}){ print "Could not find $t taxid";} }'
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
new file mode 100644
index 0000000..4ec728d
--- /dev/null
+++ b/CONTRIBUTING.md
@@ -0,0 +1,42 @@
+# Contributing
+
+There are many ways to contribute to this project; here are a couple of them.
+Contributions will almost always result in a pull request.
+Contributions must pass the automated testing.
+
+## Add a taxon
+
+To add a taxon, add it to src/taxonomy/nodes.dmp and src/taxonomy/names.dmp.
+If it is present in the NCBI taxonomy, please use that identifier.
+Please adhere to the [NCBI taxonomy format specification](https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump_readme.txt).
+For names.dmp, the scientific name field is required.
+
+The second step in adding a taxon is to add its representative chromosome(s).
+See the section below for details.
+You cannot add a taxon to this project without a representative chromosome.
+
+## Add a chromosome
+
+Add an entry to either src/chromosomes.tsv or src/plasmids.tsv.
+The format is four tab-separated columns:
+
+* scientific name or similar
+* NCBI nucleotide accession
+* taxonomy ID
+* parent taxonomy ID
+
+The taxonomy IDs in each line must be represented in names.dmp and nodes.dmp in the folder src/taxonomy.
+
+New nucleotide entries must be
+
+* Trusted - subject matter experts must agree that this is a representative genome for the taxon
+* Completed - no gaps
+* Nonredundant - with few exceptions, a taxon is not represented by multiple assemblies
+
+Note: some species such as _Vibrio cholerae_ have multiple chromosomes.
+These can be denoted with multiple lines, one per nucleotide accession.
+
+## Other contributions
+
+Please open a new issue on GitHub and describe the potential contribution.
+
diff --git a/README.md b/README.md
index 65aa141..e1c9fd9 100644
--- a/README.md
+++ b/README.md
@@ -6,18 +6,7 @@ A database of completed assemblies for metagenomics-related tasks
 ## Synopsis
 
 Kalamari is a database of completed and public assemblies, backed by trusted institutions.
-Completed assemblies means that you do not have to worry about the database itself being contaminated with "rogue" contigs.
-Additionally, most assemblies were obtained by subject matter experts (SMEs) at
-Centers for Disease Control and Prevention (CDC).
-Those not from CDC come from other trusted institutions or projects such as
-FDA-ARGOS.
-Most genomes are from species that are either studied or are common contaminants
-in the Enteric Diseases Laboratory Branch (EDLB) at CDC.
-
-Kalamari also comes with a custom taxonomy database such as defining
-_Shigella_ as a subspecies of _Escherichia coli_
-or defining the four lineages of _Listeria monocytogenes_.
-These changes have been backed by trusted SMEs in EDLB.
+These assemblies can be further used in formatted databases such as Kraken or Blast.
 
 ## Download instructions
 
@@ -40,6 +29,26 @@ where `VER` is the version of Kalamari.
 
 [How to format and query databases](docs/DATABASES.md)
 
+## Further description
+
+Kalamari is a database of completed and public assemblies, backed by trusted institutions.
+Completed assemblies means that you do not have to worry about the database itself being contaminated with "rogue" contigs.
+Additionally, most assemblies were obtained by subject matter experts (SMEs) at
+Centers for Disease Control and Prevention (CDC).
+Those not from CDC come from other trusted institutions or projects such as
+FDA-ARGOS.
+Most genomes are from species that are either studied or are common contaminants
+in the Enteric Diseases Laboratory Branch (EDLB) at CDC.
+
+Kalamari also comes with a custom taxonomy database such as defining
+_Shigella_ as a subspecies of _Escherichia coli_
+or defining the four lineages of _Listeria monocytogenes_.
+These changes have been backed by trusted SMEs in EDLB.
+
+## Contributing
+
+Please see [CONTRIBUTING.md](CONTRIBUTING.md)
+
 ## Citation
 
 Please refer to the ASM 2018 poster under docs
diff --git a/bin/validateTaxonomy.pl b/bin/validateTaxonomy.pl
new file mode 100644
index 0000000..b1ec1d3
--- /dev/null
+++ b/bin/validateTaxonomy.pl
@@ -0,0 +1,81 @@
+#!/usr/bin/env perl
+use strict;
+use warnings;
+use Getopt::Long qw/GetOptions/;
+use File::Basename qw/basename/;
+use File::Path qw/make_path/;
+use Data::Dumper qw/Dumper/;
+
+local $0 = basename $0;
+sub logmsg{ print STDERR "$0: @_\n";}
+
+exit main();
+
+sub main{
+  my $settings={};
+  GetOptions($settings,qw(help)) or die $!;
+  die usage() if($$settings{help} || !@ARGV);
+
+  for my $taxdir (@ARGV){
+    validateTaxonomy($taxdir, $settings) or die "ERROR: $taxdir is not a valid taxonomy\n";
+    logmsg "Valid: $taxdir";
+  }
+
+  return 0;
+
+}
+
+# Return 1 if the taxonomy is good and 0 if not
+sub validateTaxonomy{
+  my($dir, $settings) = @_;
+
+  my $names = readDmp("$dir/names.dmp", $settings);
+  my $nodes = readDmp("$dir/nodes.dmp", $settings);
+
+  # See if every element in nodes has a parent
+  while(my($taxid, $taxinfo) = each(%$nodes)){
+    my $parent = $$taxinfo[0];
+
+    # Log a useful message and fail if the parent node is not present
+    # and if the parent node is not 1 or 0
+    if(! $$nodes{$parent} && $parent > 1){
+      logmsg "ERROR: could not find node $parent which is the parent of $taxid";
+      return 0;
+    }
+
+    # Find matching entries in names.dmp
+    if($taxid > 1 && !$$names{$taxid}){
+      logmsg "ERROR: could not find an entry in names.dmp for $taxid";
+      return 0;
+    }
+    if($parent > 1 && !$$names{$parent} ){
+      logmsg "ERROR: could not find an entry in names.dmp for $parent";
+      return 0;
+    }
+  }
+
+  return 1;
+}
+
+sub readDmp{
+  my($dmp, $settings) = @_;
+  my %dmp;
+  open(my $fh, '<', $dmp) or die "ERROR: could not read $dmp: $!";
+  while(<$fh>){
+    chomp;
+    my @F = split /\t\|\t/;
+    $F[-1] =~s/\t\|$//;  # remove trailing chars for last field
+    my $taxid = shift(@F);
+    $dmp{$taxid} = \@F;
+  }
+
+  return \%dmp;
+}
+
+sub usage{
+  print "Validate a folder of taxonomy containing nodes.dmp and names.dmp
+  Usage: $0 taxonomy/ [taxonomy2...]
+  ";
+  exit 0;
+}
+
diff --git a/src/chromosomes-todo.tsv b/src/chromosomes-todo.tsv
new file mode 100644
index 0000000..55237f0
--- /dev/null
+++ b/src/chromosomes-todo.tsv
@@ -0,0 +1,19 @@
+Arcobacter butzleri XXXXXX 28197 28196
+Arcobacter cloacae XXXXXX 1054034 28196
+Arcobacter cryaerophilus XXXXXX 28198 28196
+Arcobacter nitrofigilis XXXXXX 28199 28196
+Arcobacter venerupis XXXXXX 1054033 28196
+Campylobacter canadensis XXXXXX 449520 194
+Campylobacter corcagiensis XXXXXX 1448857 194
+Campylobacter curvus XXXXXX 200 194
+Campylobacter iguaniorum XXXXXX 1244531 194
+Campylobacter jejuni doylei XXXXXX 32021 197
+Campylobacter mucosalis XXXXXX 202 194
+Campylobacter rectus XXXXXX 203 194
+Campylobacter showae XXXXXX 204 194
+Campylobacter upsaliensis XXXXXX 28080 194
+Helicobacter bilis XXXXXX 37372 209
+Helicobacter cinaedi XXXXXX 213 209
+Helicobacter pullorum XXXXXX 35818 209
+Helicobacter winghamensis XXXXXX 157268 209
+Helicobacter valdiviensis XXXXXX 1458358 209
diff --git a/src/chromosomes.tsv b/src/chromosomes.tsv
index 5472edb..dec1988 100644
--- a/src/chromosomes.tsv
+++ b/src/chromosomes.tsv
@@ -12,20 +12,15 @@ Amycolatopsis mediterranei NC_014318 33910 1813
 Aquifex aeolicus NC_000918 63363 2713
 Architeuthis dux NC_011581 256136 34555
 Arcobacter bivalviorum CP031217 663364 2321115
-Arcobacter butzleri XXXXXX 28197 28196
 Arcobacter cibarius CP043857 255507 28196
-Arcobacter cloacae XXXXXX 1054034 28196
-Arcobacter cryaerophilus XXXXXX 28198 28196
 Arcobacter ellisii CP032097 913109 28196
 Arcobacter halophilus CP031218 197482 28196
 Arcobacter molluscorum CP032098 1032072 28196
 Arcobacter mytili CP031219 603050 28196
-Arcobacter nitrofigilis XXXXXX 28199 28196
 Arcobacter skirrowii CP032099 28200 28196
 Arcobacter suis CP032100 1278212 28196
 Arcobacter thereius CP035926 544718 28196
 Arcobacter trophiarum CP031367 708186 28196
-Arcobacter venerupis XXXXXX 1054033 28196
 Arripis trutta AP006810 270544 163128
 Atlantibacter hermannii CP042941 565 1903434
 Bacillus cereus NC_016771 1396 86661
@@ -47,30 +42,21 @@ Buchnera aphidicola NC_002528 9 32199
 Burkholderia pseudomallei NC_006350 28450 111527
 Burkholderia pseudomallei NC_006351 28450 111527
 Campylobacter avium CP022347 522484 522485
-Campylobacter canadensis XXXXXX 449520 194
 Campylobacter coli CP028187 195 194
 Campylobacter concisus CP012541 199 194
-Campylobacter corcagiensis XXXXXX 1448857 194
 Campylobacter cuniculorum CP020867 374106 194
-Campylobacter curvus XXXXXX 200 194
 Campylobacter fetus CP006833 196 194
 Campylobacter gracilis CP012196 824 194
 Campylobacter helveticus NZ_CP020478 28898 194
 Campylobacter hyointestinalis hyointestinalis CP015575 91352 198
 Campylobacter hyointestinalis lawsonii CP015575 91353 198
-Campylobacter iguaniorum XXXXXX 1244531 194
 Campylobacter insulaenigrae CP007770 260714 194
-Campylobacter jejuni doylei XXXXXX 32021 197
 Campylobacter jejuni jejuni NC_002163 32022 197
 Campylobacter lanienae CP015578 75658 194
 Campylobacter lari CP000932 201 194
-Campylobacter mucosalis XXXXXX 202 194
 Campylobacter peloridis CP007766 488546 194
-Campylobacter rectus XXXXXX 203 194
-Campylobacter showae XXXXXX 204 194
 Campylobacter sputorum CP019683 202 194
 Campylobacter subantarcticus CP007772 497724 194
-Campylobacter upsaliensis XXXXXX 28080 194
 Campylobacter ureolyticus CP012195 827 194
 Campylobacter volucris CP007774 1031542 194
 Candidatus Desulforudis audaxviator CP000860 471827 471826
@@ -149,12 +135,7 @@ Haemophilus influenzae NC_000907 727 724
 Haemophilus somnus CP000947 228400 731
 Halobacterium salinarum AM774415 478009 2242
 Helianthus annuus MG770607 4232 4231
-Helicobacter bilis XXXXXX 37372 209
-Helicobacter cinaedi XXXXXX 213 209
 Helicobacter pylori NC_000915 210 209
-Helicobacter pullorum XXXXXX 35818 209
-Helicobacter winghamensis XXXXXX 157268 209
-Helicobacter valdiviensis XXXXXX 1458358 209
 Heliobacterium modesticaldum CP000930 35701 2697
 Homo sapiens NC_012920 9606 9605
 Ketogulonicigenium vulgare NC_017384 92945 92944
diff --git a/src/plasmids.tsv b/src/plasmids.tsv
index a8a6c42..536b358 100644
--- a/src/plasmids.tsv
+++ b/src/plasmids.tsv
@@ -34,7 +34,7 @@ Staphylococcus aureus NC_002096 1280 1279
 Staphylococcus aureus NC_002129 1280 1279
 Lysinibacillus sphaericus AY325804 1421 400634
 Gluconacetobacter diazotrophicus AM889286 33996 89583
-Frankia symbiont of Datisca glomerata CP002802 656024 1854
+Frankia symbiont of Datisca glomerata CP002802 2716812 1854
 Enterobacteriaceae CP011981 543 1854
 Enterobacteriaceae CP023916 543 1854
 Enterobacteriaceae HG969999 543 1854
@@ -4904,7 +4904,7 @@ Croceicoccus marinus CP019604 450378 1295327
 Paenibacillus larvae CP019658 1464 44249
 Paenibacillus larvae CP019657 1464 44249
 Paenibacillus larvae CP019653 1464 44249
-Frankia symbiont of Datisca glomerata CP002803 656024 44249
+Frankia symbiont of Datisca glomerata CP002803 2716812 1854
 Enterobacter cloacae KF998104 550 44249
 Edwardsiella ictaluri KC249996 67780 44249
 Enterobacter cloacae MF370188 550 44249
diff --git a/src/taxonomy/names.dmp b/src/taxonomy/names.dmp
index 0bb0507..bb2c536 100644
--- a/src/taxonomy/names.dmp
+++ b/src/taxonomy/names.dmp
@@ -22685,3 +22685,10 @@
 9000017 | Salmonella enterica subsp. X | | scientific name |
 2607663 | Yersinia canariae | | scientific name |
 1604335 | Yersinia rochesterensis | | scientific name |
+2716812 | Ca. Frankia datiscae | | equivalent name |
+2716812 | "Candidatus Frankia datiscae" Persson et al. 2011 | | authority |
+2716812 | Candidatus Frankia datiscae | | scientific name |
+2716812 | Frankia datiscae | | equivalent name |
+2716812 | Frankia symbiont of Datisca glomerata | | includes |
+91352 | Campylobacter hyointestinalis subsp. hyointestinalis Gebhart et al. 1985 | | authority |
+91352 | Campylobacter hyointestinalis subsp. hyointestinalis | | scientific name |
diff --git a/src/taxonomy/nodes.dmp b/src/taxonomy/nodes.dmp
index a5a4052..90cfb55 100644
--- a/src/taxonomy/nodes.dmp
+++ b/src/taxonomy/nodes.dmp
@@ -3507,3 +3507,5 @@
 9000017 | 28901 | no rank | | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 | |
 2607663 | 629 | species | | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 | |
 1604335 | 629 | species | | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 | |
+91352 | 198 | subspecies | CH | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 | |
+2716812 | 1854 | species | CF | 0 | 1 | 11 | 1 | 0 | 1 | 0 | 0 | |