Define contributions (#23)

* validate taxonomy script
* unit testing for taxonomy
* unit testing for taxonomy
* moved XXXXXX entries to a todo file
* validating names.dmp and added new entries to make taxonomy more complete
* Contributing.md doc
* link to contributing.md
* more description under contributions

Co-authored-by: Lee Katz - Aspen <[email protected]>
lskatz · Jul 30, 2021 · 703eaa8
1 parent 9917015

Showing 9 changed files with 203 additions and 33 deletions.
diff --git a/.github/workflows/validateTaxonomy.yml b/.github/workflows/validateTaxonomy.yml
@@ -0,0 +1,29 @@
+on: [push]
+name: Validate taxonomy
+
+jobs:
+  build:
+    runs-on: ${{ matrix.os }}
+    strategy:
+      matrix:
+        os: ['ubuntu-18.04' ]
+        perl: [ '5.32' ]
+    name: Perl ${{ matrix.perl }} on ${{ matrix.os }}
+    steps:
+      - name: Set up perl
+        uses: shogo82148/actions-setup-perl@v1
+        with:
+          perl-version: ${{ matrix.perl }}
+          multi-thread: "true"
+      - name: checkout my repo
+        uses: actions/checkout@v2
+        with:
+          path: Kalamari
+
+      - name: validate taxonomy
+        run:  perl Kalamari/bin/validateTaxonomy.pl Kalamari/src/taxonomy
+      - name: matching taxids
+        run:  |
+          echo "Making sure that all taxids in chromosomes.tsv and plasmids.tsv are present in nodes.dmp and names.dmp"
+          tail -n +2 Kalamari/src/chromosomes.tsv Kalamari/src/plasmids.tsv -q | perl -F'\t' -lane 'BEGIN{@node=`cat Kalamari/src/taxonomy/nodes.dmp`; for $n(@node){($taxid)=split(/\t/, $n); $taxid{$taxid}++; } } for my $t($F[2], $F[3]){ if(!$taxid{$t}){ print "Could not find $t taxid";} }'
+          tail -n +2 Kalamari/src/chromosomes.tsv Kalamari/src/plasmids.tsv -q | perl -F'\t' -lane 'BEGIN{@name=`cat Kalamari/src/taxonomy/names.dmp`; for $n(@name){($taxid)=split(/\t/, $n); $taxid{$taxid}++; } } for my $t($F[2], $F[3]){ if(!$taxid{$t}){ print "Could not find $t taxid";} }'
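
Review note: the two one-liners in the `matching taxids` step only print a message for each missing taxid. They never set a nonzero exit status, so the step passes either way. Below is a minimal sketch of a variant that can fail the job. It assumes a hypothetical helper `missing_taxids` (not part of this commit) and, like the step above, skips the first line of each TSV. The demo runs on throwaway files with a made-up header line rather than on the repository paths.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not part of this commit): print every taxid found in
# columns 3 and 4 of the given TSV files that has no entry in the .dmp file.
# Usage: missing_taxids <file.dmp> <file.tsv> [more.tsv...]
missing_taxids() {
  local dmp=$1
  shift
  # Skip the first line of each TSV, as the workflow step does.
  tail -q -n +2 "$@" | awk -v dmp="$dmp" '
    BEGIN {
      FS = "\t"
      # The taxid is the first tab-delimited field of each .dmp line.
      while ((getline line < dmp) > 0) {
        split(line, f, "\t")
        seen[f[1]] = 1
      }
    }
    {
      for (i = 3; i <= 4; i++)
        if (!($i in seen)) print $i
    }
  ' | sort -u
}

# Demo on throwaway files; the header line here is an assumption.
tmp=$(mktemp -d)
printf '1\t|\t1\t|\tno rank\t|\n194\t|\t1\t|\tgenus\t|\n195\t|\t194\t|\tspecies\t|\n' > "$tmp/nodes.dmp"
printf 'name\taccession\ttaxid\tparent\n' > "$tmp/chromosomes.tsv"
printf 'Campylobacter coli\tCP028187\t195\t194\n' >> "$tmp/chromosomes.tsv"
printf 'Campylobacter curvus\tXXXXXX\t200\t194\n' >> "$tmp/chromosomes.tsv"

missing=$(missing_taxids "$tmp/nodes.dmp" "$tmp/chromosomes.tsv")
if [ -n "$missing" ]; then
  echo "Could not find taxids: $missing" >&2
  # A CI step would `exit 1` here so that the job fails.
fi
```

In the workflow, the same function could be called once with `nodes.dmp` and once with `names.dmp`, exiting nonzero if either call prints anything.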
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -0,0 +1,42 @@
+# Contributing
+
+There are many ways to contribute to this project; this document describes the most common ones.
+Contributions will almost always result in a pull request.
+Contributions must pass the automated testing.
+
+## Add a taxon
+
+To add a taxon, add it to src/taxonomy/nodes.dmp and src/taxonomy/names.dmp.
+If it is present in the NCBI taxonomy, please use that identifier.
+Please adhere to the [NCBI taxonomy format specification](https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump_readme.txt).
+For names.dmp, the scientific name field is required.
+
+The second step in adding a taxon is to add its representative chromosome(s).
+See the section below for details.
+You cannot add a taxon to this project without a representative chromosome.
+
+## Add a chromosome
+
+Add an entry to either src/chromosomes.tsv or src/plasmids.tsv.
+The format is four columns, separated by tab:
+
+* scientific name or similar
+* NCBI nucleotide accession
+* taxonomy ID
+* parent taxonomy ID
+
+The taxonomy IDs in each line must be represented in names.dmp and nodes.dmp in the folder src/taxonomy.
+
+New nucleotide entries must be:
+
+* Trusted - subject matter experts must agree that this is a representative genome for the taxon
+* Completed - no gaps
+* Nonredundant - with few exceptions, a taxon is not represented by multiple assemblies
+
+Note: some species such as _Vibrio cholerae_ have multiple chromosomes.
+These can be denoted with multiple lines, one per nucleotide accession.
+
+## Other contributions
+
+Please open a new issue on GitHub and describe the proposed contribution.
+
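
Review note: the four-column format described under "Add a chromosome" can be checked mechanically before opening a pull request. The following is a minimal sketch, assuming a hypothetical helper `check_tsv` that is not part of this commit. It checks only the column count, empty fields, and the `XXXXXX` placeholder accession that this commit moves to `src/chromosomes-todo.tsv`. It does not verify that the accession exists at NCBI or that the taxids are numeric.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not part of this commit): report lines of a
# chromosomes.tsv or plasmids.tsv file that break the four-column format.
# Returns nonzero if any problem was found.
check_tsv() {
  awk -F'\t' '
    NF != 4 {
      printf "%s:%d: expected 4 columns, found %d\n", FILENAME, FNR, NF
      bad = 1
      next
    }
    $1 == "" || $2 == "" || $3 == "" || $4 == "" {
      printf "%s:%d: empty field\n", FILENAME, FNR
      bad = 1
      next
    }
    $2 == "XXXXXX" {
      # Placeholder accessions belong in src/chromosomes-todo.tsv.
      printf "%s:%d: placeholder accession\n", FILENAME, FNR
      bad = 1
    }
    END { exit bad + 0 }
  ' "$1"
}

# Demo on a throwaway file with one good line and one placeholder line.
tmp=$(mktemp -d)
printf 'Aquifex aeolicus\tNC_000918\t63363\t2713\n' > "$tmp/demo.tsv"
printf 'Arcobacter butzleri\tXXXXXX\t28197\t28196\n' >> "$tmp/demo.tsv"
check_tsv "$tmp/demo.tsv" || echo "format problems found" >&2
```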
diff --git a/README.md b/README.md
@@ -6,18 +6,7 @@ A database of completed assemblies for metagenomics-related tasks
 ## Synopsis
 
 Kalamari is a database of completed and public assemblies, backed by trusted institutions.
-Completed assemblies means that you do not have to worry about the database itself being contaminated with "rogue" contigs.
-Additionally, most assemblies were obtained by subject matter experts (SMEs) at
-Centers for Disease Control and Prevention (CDC).
-Those not from CDC come from other trusted institutions or projects such as
-FDA-ARGOS.
-Most genomes are from species that are either studied or are common contaminants
-in the Enteric Diseases Laboratory Branch (EDLB) at CDC.
-
-Kalamari also comes with a custom taxonomy database such as defining
-_Shigella_ as a subspecies of _Escherichia coli_
-or defining the four lineages of _Listeria monocytogenes_.
-These changes have been backed by trusted SMEs in EDLB.
+These assemblies can be formatted into databases for tools such as Kraken or BLAST.
 
 ## Download instructions
 
@@ -40,6 +29,26 @@ where `VER` is the version of Kalamari.
 
 [How to format and query databases](docs/DATABASES.md)
 
+## Further description
+
+Kalamari is a database of completed and public assemblies, backed by trusted institutions.
+Completed assemblies means that you do not have to worry about the database itself being contaminated with "rogue" contigs.
+Additionally, most assemblies were obtained by subject matter experts (SMEs) at
+Centers for Disease Control and Prevention (CDC).
+Those not from CDC come from other trusted institutions or projects such as
+FDA-ARGOS.
+Most genomes are from species that are either studied or are common contaminants
+in the Enteric Diseases Laboratory Branch (EDLB) at CDC.
+
+Kalamari also comes with a custom taxonomy database such as defining
+_Shigella_ as a subspecies of _Escherichia coli_
+or defining the four lineages of _Listeria monocytogenes_.
+These changes have been backed by trusted SMEs in EDLB.
+
+## Contributing
+
+Please see [CONTRIBUTING.md](CONTRIBUTING.md)
+
 ## Citation
 
 Please refer to the ASM 2018 poster under docs
diff --git a/bin/validateTaxonomy.pl b/bin/validateTaxonomy.pl
@@ -0,0 +1,81 @@
+#!/usr/bin/env perl
+use strict;
+use warnings;
+use Getopt::Long qw/GetOptions/;
+use File::Basename qw/basename/;
+use File::Path qw/make_path/;
+use Data::Dumper qw/Dumper/;
+
+local $0 = basename $0;
+sub logmsg{ print STDERR "$0: @_\n";}
+
+exit main();
+
+sub main{
+  my $settings={};
+  GetOptions($settings,qw(help)) or die $!;
+  die usage() if($$settings{help} || !@ARGV);
+
+  for my $taxdir (@ARGV){
+    validateTaxonomy($taxdir, $settings) or die "ERROR: $taxdir is not a valid taxonomy\n";
+    logmsg "Valid: $taxdir";
+  }
+
+  return 0;
+
+}
+
+# Return 1 if the taxonomy is good and 0 if not
+sub validateTaxonomy{
+  my($dir, $settings) = @_;
+
+  my $names = readDmp("$dir/names.dmp", $settings);
+  my $nodes = readDmp("$dir/nodes.dmp", $settings);
+
+  # See if every element in nodes has a parent
+  while(my($taxid, $taxinfo) = each(%$nodes)){
+    my $parent = $$taxinfo[0];
+
+    # Die with a useful message if the parent node is not present
+    # and if the parent node is not 1 or 0
+    if(! $$nodes{$parent} && $parent > 1){
+      logmsg "ERROR: could not find node $parent which is the parent of $taxid";
+      return 0;
+    }
+
+    # Find matching entries in names.dmp
+    if($taxid > 1 && !$$names{$taxid}){
+      logmsg "ERROR: could not find an entry in names.dmp for $taxid";
+      return 0;
+    }
+    if($parent > 1 && !$$names{$parent} ){
+      logmsg "ERROR: could not find an entry in names.dmp for $parent";
+      return 0;
+    }
+  }
+
+  return 1;
+}
+
+sub readDmp{
+  my($dmp, $settings) = @_;
+  my %dmp;
+  open(my $fh, $dmp) or die "ERROR: could not read $dmp: $!";
+  while(<$fh>){
+    chomp;
+    my @F = split /\t\|\t/;
+    $F[-1] =~s/\t\|$//; # remove trailing chars for last field
+    my $taxid = shift(@F);
+    $dmp{$taxid} = \@F;
+  }
+
+  return \%dmp;
+}
+
+sub usage{
+  print "Validate a folder of taxonomy containing nodes.dmp and names.dmp
+  Usage: $0 taxonomy/ [taxonomy2...]
+  ";
+  exit 0;
+}
+
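
Review note: for a quick local check without Perl, the parent rule in `validateTaxonomy` can be approximated with awk. This is a minimal sketch, assuming a hypothetical helper `orphan_parents` that is not part of this commit. It covers only the "every parent is itself a node" rule and does not cross-check `names.dmp`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not part of this commit): list parents in nodes.dmp
# that are not themselves nodes. Fields are separated by <tab>|<tab>;
# the taxid is field 1 and the parent taxid is field 2.
orphan_parents() {
  awk -F'\t[|]\t' '
    { parent[$1] = $2; node[$1] = 1 }
    END {
      for (t in parent) {
        p = parent[t]
        # Like the Perl script, do not require parents 0 and 1 to be present.
        if (p + 0 > 1 && !(p in node))
          print p " is missing (parent of " t ")"
      }
    }
  ' "$1" | sort
}

# Demo on a throwaway nodes.dmp in which node 195 points at an absent parent.
tmp=$(mktemp -d)
printf '1\t|\t1\t|\tno rank\t|\n195\t|\t194\t|\tspecies\t|\n' > "$tmp/nodes.dmp"
orphan_parents "$tmp/nodes.dmp"
```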
diff --git a/src/chromosomes-todo.tsv b/src/chromosomes-todo.tsv
@@ -0,0 +1,19 @@
+Arcobacter butzleri	XXXXXX	28197	28196
+Arcobacter cloacae	XXXXXX	1054034	28196
+Arcobacter cryaerophilus	XXXXXX	28198	28196
+Arcobacter nitrofigilis	XXXXXX	28199	28196
+Arcobacter venerupis	XXXXXX	1054033	28196
+Campylobacter canadensis	XXXXXX	449520	194
+Campylobacter corcagiensis	XXXXXX	1448857	194
+Campylobacter curvus	XXXXXX	200	194
+Campylobacter iguaniorum	XXXXXX	1244531	194
+Campylobacter jejuni doylei	XXXXXX	32021	197
+Campylobacter mucosalis	XXXXXX	202	194
+Campylobacter rectus	XXXXXX	203	194
+Campylobacter showae	XXXXXX	204	194
+Campylobacter upsaliensis	XXXXXX	28080	194
+Helicobacter bilis	XXXXXX	37372	209
+Helicobacter cinaedi	XXXXXX	213	209
+Helicobacter pullorum	XXXXXX	35818	209
+Helicobacter winghamensis	XXXXXX	157268	209
+Helicobacter valdiviensis	XXXXXX	1458358	209
diff --git a/src/chromosomes.tsv b/src/chromosomes.tsv
@@ -12,20 +12,15 @@ Amycolatopsis mediterranei	NC_014318	33910	1813
 Aquifex aeolicus	NC_000918	63363	2713
 Architeuthis dux	NC_011581	256136	34555
 Arcobacter bivalviorum	CP031217	663364	2321115
-Arcobacter butzleri	XXXXXX	28197	28196
 Arcobacter cibarius	CP043857	255507	28196
-Arcobacter cloacae	XXXXXX	1054034	28196
-Arcobacter cryaerophilus	XXXXXX	28198	28196
 Arcobacter ellisii	CP032097	913109	28196
 Arcobacter halophilus	CP031218	197482	28196
 Arcobacter molluscorum	CP032098	1032072	28196
 Arcobacter mytili	CP031219	603050	28196
-Arcobacter nitrofigilis	XXXXXX	28199	28196
 Arcobacter skirrowii	CP032099	28200	28196
 Arcobacter suis	CP032100	1278212	28196
 Arcobacter thereius	CP035926	544718	28196
 Arcobacter trophiarum	CP031367	708186	28196
-Arcobacter venerupis	XXXXXX	1054033	28196
 Arripis trutta	AP006810	270544	163128
 Atlantibacter hermannii	CP042941	565	1903434
 Bacillus cereus	NC_016771	1396	86661
@@ -47,30 +42,21 @@ Buchnera aphidicola	NC_002528	9	32199
 Burkholderia pseudomallei	NC_006350	28450	111527
 Burkholderia pseudomallei	NC_006351	28450	111527
 Campylobacter avium	CP022347	522484	522485
-Campylobacter canadensis	XXXXXX	449520	194
 Campylobacter coli	CP028187	195	194
 Campylobacter concisus	CP012541	199	194
-Campylobacter corcagiensis	XXXXXX	1448857	194
 Campylobacter cuniculorum	CP020867	374106	194
-Campylobacter curvus	XXXXXX	200	194
 Campylobacter fetus	CP006833	196	194
 Campylobacter gracilis	CP012196	824	194
 Campylobacter helveticus	NZ_CP020478	28898	194
 Campylobacter hyointestinalis hyointestinalis	CP015575	91352	198
 Campylobacter hyointestinalis lawsonii	CP015575	91353	198
-Campylobacter iguaniorum	XXXXXX	1244531	194
 Campylobacter insulaenigrae	CP007770	260714	194
-Campylobacter jejuni doylei	XXXXXX	32021	197
 Campylobacter jejuni jejuni	NC_002163	32022	197
 Campylobacter lanienae	CP015578	75658	194
 Campylobacter lari	CP000932	201	194
-Campylobacter mucosalis	XXXXXX	202	194
 Campylobacter peloridis	CP007766	488546	194
-Campylobacter rectus	XXXXXX	203	194
-Campylobacter showae	XXXXXX	204	194
 Campylobacter sputorum	CP019683	202	194
 Campylobacter subantarcticus	CP007772	497724	194
-Campylobacter upsaliensis	XXXXXX	28080	194
 Campylobacter ureolyticus	CP012195	827	194
 Campylobacter volucris	CP007774	1031542	194
 Candidatus Desulforudis audaxviator	CP000860	471827	471826
@@ -149,12 +135,7 @@ Haemophilus influenzae	NC_000907	727	724
 Haemophilus somnus	CP000947	228400	731
 Halobacterium salinarum	AM774415	478009	2242
 Helianthus annuus	MG770607	4232	4231
-Helicobacter bilis	XXXXXX	37372	209
-Helicobacter cinaedi	XXXXXX	213	209
 Helicobacter pylori	NC_000915	210	209
-Helicobacter pullorum	XXXXXX	35818	209
-Helicobacter winghamensis	XXXXXX	157268	209
-Helicobacter valdiviensis	XXXXXX	1458358	209
 Heliobacterium modesticaldum	CP000930	35701	2697
 Homo sapiens	NC_012920	9606	9605
 Ketogulonicigenium vulgare	NC_017384	92945	92944

diff --git a/src/plasmids.tsv b/src/plasmids.tsv
@@ -34,7 +34,7 @@ Staphylococcus aureus	NC_002096	1280	1279
 Staphylococcus aureus	NC_002129	1280	1279
 Lysinibacillus sphaericus	AY325804	1421	400634
 Gluconacetobacter diazotrophicus	AM889286	33996	89583
-Frankia symbiont of Datisca glomerata	CP002802	656024	1854
+Frankia symbiont of Datisca glomerata	CP002802	2716812	1854
 Enterobacteriaceae	CP011981	543	1854
 Enterobacteriaceae	CP023916	543	1854
 Enterobacteriaceae	HG969999	543	1854
@@ -4904,7 +4904,7 @@ Croceicoccus marinus	CP019604	450378	1295327
 Paenibacillus larvae	CP019658	1464	44249
 Paenibacillus larvae	CP019657	1464	44249
 Paenibacillus larvae	CP019653	1464	44249
-Frankia symbiont of Datisca glomerata	CP002803	656024	44249
+Frankia symbiont of Datisca glomerata	CP002803	2716812	1854
 Enterobacter cloacae	KF998104	550	44249
 Edwardsiella ictaluri	KC249996	67780	44249
 Enterobacter cloacae	MF370188	550	44249

diff --git a/src/taxonomy/names.dmp b/src/taxonomy/names.dmp
@@ -22685,3 +22685,10 @@
 9000017	|	Salmonella enterica subsp. X	|		|	scientific name	|
 2607663	|	Yersinia canariae	|		|	scientific name	|
 1604335	|	Yersinia rochesterensis	|		|	scientific name	|
+2716812	|	Ca. Frankia datiscae	|		|	equivalent name	|
+2716812	|	"Candidatus Frankia datiscae" Persson et al. 2011	|		|	authority	|
+2716812	|	Candidatus Frankia datiscae	|		|	scientific name	|
+2716812	|	Frankia datiscae	|		|	equivalent name	|
+2716812	|	Frankia symbiont of Datisca glomerata	|		|	includes	|
+91352	|	Campylobacter hyointestinalis subsp. hyointestinalis Gebhart et al. 1985	|		|	authority	|
+91352	|	Campylobacter hyointestinalis subsp. hyointestinalis	|		|	scientific name	|
diff --git a/src/taxonomy/nodes.dmp b/src/taxonomy/nodes.dmp
@@ -3507,3 +3507,5 @@
 9000017	|	28901	|	no rank	|		|	0	|	1	|	11	|	1	|	0	|	1	|	1	|	0	|		|
 2607663	|	629	|	species	|		|	0	|	1	|	11	|	1	|	0	|	1	|	1	|	0	|		|
 1604335	|	629	|	species	|		|	0	|	1	|	11	|	1	|	0	|	1	|	1	|	0	|		|
+91352	|	198	|	subspecies	|	CH	|	0	|	1	|	11	|	1	|	0	|	1	|	1	|	0	|		|
+2716812	|	1854	|	species	|	CF	|	0	|	1	|	11	|	1	|	0	|	1	|	0	|	0	|		|