Define contributions (#23)
* validate taxonomy script

* unit testing for taxonomy

* unit testing for taxonomy

* moved XXXXXX entries to a todo file

* validating names.dmp and added new entries to make taxonomy more complete

* Contributing.md doc

* link to contributing.md

* more description under contributions

Co-authored-by: Lee Katz - Aspen <[email protected]>
lskatz and lskatz authored Jul 30, 2021
1 parent 9917015 commit 703eaa8
Showing 9 changed files with 203 additions and 33 deletions.
29 changes: 29 additions & 0 deletions .github/workflows/validateTaxonomy.yml
@@ -0,0 +1,29 @@
on: [push]
name: Validate taxonomy

jobs:
  build:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: ['ubuntu-18.04' ]
        perl: [ '5.32' ]
    name: Perl ${{ matrix.perl }} on ${{ matrix.os }}
    steps:
      - name: Set up perl
        uses: shogo82148/actions-setup-perl@v1
        with:
          perl-version: ${{ matrix.perl }}
          multi-thread: "true"
      - name: checkout my repo
        uses: actions/checkout@v2
        with:
          path: Kalamari

      - name: validate taxonomy
        run: perl Kalamari/bin/validateTaxonomy.pl Kalamari/src/taxonomy
      - name: matching taxids
        run: |
          echo "Making sure that all taxids in chromosomes.tsv and plasmids.tsv are present in nodes.dmp and names.dmp"
          # Each check sets a nonzero exit status if any taxid is missing, so that the step fails
          tail -n +2 Kalamari/src/chromosomes.tsv Kalamari/src/plasmids.tsv -q | perl -F'\t' -lane 'BEGIN{@node=`cat Kalamari/src/taxonomy/nodes.dmp`; for $n(@node){($taxid)=split(/\t/, $n); $taxid{$taxid}++; } } for my $t($F[2], $F[3]){ if(!$taxid{$t}){ print "Could not find $t taxid"; $missing++; } } END{ $? = 1 if $missing }'
          tail -n +2 Kalamari/src/chromosomes.tsv Kalamari/src/plasmids.tsv -q | perl -F'\t' -lane 'BEGIN{@name=`cat Kalamari/src/taxonomy/names.dmp`; for $n(@name){($taxid)=split(/\t/, $n); $taxid{$taxid}++; } } for my $t($F[2], $F[3]){ if(!$taxid{$t}){ print "Could not find $t taxid"; $missing++; } } END{ $? = 1 if $missing }'
42 changes: 42 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,42 @@
# Contributing

There are many ways to contribute to this project; a few of them are described here.
Contributions will almost always result in a pull request.
Contributions must pass the automated tests.

## Add a taxon

To add a taxon, add it to src/taxonomy/nodes.dmp and src/taxonomy/names.dmp.
If it is present in the NCBI taxonomy, please use that identifier.
Please adhere to the [NCBI taxonomy format specification](https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump_readme.txt).
For names.dmp, the scientific name field is required.
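
As a rough illustration of the row format, the Python sketch below builds and parses one names.dmp row. The helper names `format_dmp_row` and `parse_dmp_row` are made up for this example and are not part of Kalamari; the `\t|\t` field separator and `\t|` row terminator follow the NCBI taxdump readme linked above.

```python
FIELD_SEP = "\t|\t"  # field terminator in the NCBI taxdump format
ROW_END = "\t|"      # row terminator (a newline follows it in the file)


def format_dmp_row(fields):
    """Join fields into one .dmp row, without the trailing newline."""
    return FIELD_SEP.join(str(f) for f in fields) + ROW_END


def parse_dmp_row(line):
    """Split one .dmp row back into its list of fields."""
    line = line.rstrip("\n")
    if line.endswith(ROW_END):
        line = line[: -len(ROW_END)]
    return line.split(FIELD_SEP)


# names.dmp columns: tax_id, name_txt, unique name, name class
row = format_dmp_row([2716812, "Candidatus Frankia datiscae", "", "scientific name"])
print(repr(row))
print(parse_dmp_row(row))
```

The empty third field is the "unique name" column, which is left blank here as in the existing names.dmp rows.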

The second step in adding a taxon is to add its representative chromosome(s).
See the section below for details.
You cannot add a taxon to this project without a representative chromosome.

## Add a chromosome

Add an entry to either src/chromosomes.tsv or src/plasmids.tsv.
The format is four columns, separated by tab:

* scientific name or similar
* NCBI nucleotide accession
* taxonomy ID
* parent taxonomy ID

The taxonomy IDs in each line must be represented in names.dmp and nodes.dmp in the folder src/taxonomy.
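
The rule above can be checked locally before opening a pull request. The following Python sketch is one possible way to do it, not the project's own test: the function names and the header line in the sample TSV are assumptions, and the sample rows are shortened for illustration.

```python
def dmp_taxids(dmp_text):
    """Collect the first field (the taxonomy ID) of every non-empty .dmp row."""
    return {
        line.split("\t|\t")[0].strip()
        for line in dmp_text.splitlines()
        if line.strip()
    }


def missing_taxids(tsv_text, known_taxids):
    """Return (accession, taxid) pairs whose taxonomy ID is not in known_taxids."""
    missing = []
    lines = tsv_text.splitlines()[1:]  # assumption: the first line is a header
    for line in lines:
        if not line.strip():
            continue
        name, accession, taxid, parent = line.split("\t")
        for t in (taxid, parent):
            if t not in known_taxids:
                missing.append((accession, t))
    return missing


# Illustrative data only: one shortened nodes.dmp row and one plasmids.tsv row
nodes_dmp = "2716812\t|\t1854\t|\tspecies\t|\n"
plasmids_tsv = (
    "name\taccession\ttaxid\tparent\n"
    "Frankia symbiont of Datisca glomerata\tCP002802\t2716812\t1854\n"
)

print(missing_taxids(plasmids_tsv, dmp_taxids(nodes_dmp)))
```

In this sample the parent ID 1854 is reported because the sample nodes.dmp text has no row for it. Run the same check against names.dmp as well, since both files must contain every ID.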

New nucleotide entries must be:

* Trusted - subject matter experts must agree that this is a representative genome for the taxon
* Completed - the assembly has no gaps
* Nonredundant - with few exceptions, a taxon is not represented by multiple assemblies

Note: some species such as _Vibrio cholerae_ have multiple chromosomes.
These can be denoted with multiple lines, one per nucleotide accession.
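
Because a taxon with several chromosomes legitimately has several lines, a reviewer may want to list such taxa and confirm that each is a true multi-chromosome genome rather than a redundant assembly. The Python sketch below shows one way to produce that list; `multi_accession_taxa` is a hypothetical helper, and the rows are copied from src/chromosomes.tsv.

```python
from collections import defaultdict


def multi_accession_taxa(rows):
    """Map each taxonomy ID that has more than one accession to those accessions."""
    by_taxid = defaultdict(list)
    for name, accession, taxid, parent in rows:
        by_taxid[taxid].append(accession)
    return {t: accs for t, accs in by_taxid.items() if len(accs) > 1}


rows = [
    ("Burkholderia pseudomallei", "NC_006350", "28450", "111527"),
    ("Burkholderia pseudomallei", "NC_006351", "28450", "111527"),
    ("Campylobacter coli", "CP028187", "195", "194"),
]
print(multi_accession_taxa(rows))
```

The output is only a list of candidates to look at. Deciding whether a repeated taxonomy ID is acceptable still needs a subject matter expert.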

## Other contributions

Please open a new issue on GitHub and describe the potential contribution.

33 changes: 21 additions & 12 deletions README.md
@@ -6,18 +6,7 @@ A database of completed assemblies for metagenomics-related tasks
 ## Synopsis

 Kalamari is a database of completed and public assemblies, backed by trusted institutions.
-Completed assemblies means that you do not have to worry about the database itself being contaminated with "rogue" contigs.
-Additionally, most assemblies were obtained by subject matter experts (SMEs) at
-Centers for Disease Control and Prevention (CDC).
-Those not from CDC come from other trusted institutions or projects such as
-FDA-ARGOS.
-Most genomes are from species that are either studied or are common contaminants
-in the Enteric Diseases Laboratory Branch (EDLB) at CDC.
-
-Kalamari also comes with a custom taxonomy database such as defining
-_Shigella_ as a subspecies of _Escherichia coli_
-or defining the four lineages of _Listeria monocytogenes_.
-These changes have been backed by trusted SMEs in EDLB.
+These assemblies can be further used in formatted databases such as Kraken or Blast.

 ## Download instructions

@@ -40,6 +29,26 @@ where `VER` is the version of Kalamari.

 [How to format and query databases](docs/DATABASES.md)

+## Further description
+
+Kalamari is a database of completed and public assemblies, backed by trusted institutions.
+Because every assembly is complete, you do not have to worry about the database itself being contaminated with "rogue" contigs.
+Additionally, most assemblies were obtained by subject matter experts (SMEs) at
+the Centers for Disease Control and Prevention (CDC).
+Those not from CDC come from other trusted institutions or projects such as
+FDA-ARGOS.
+Most genomes are from species that are either studied or are common contaminants
+in the Enteric Diseases Laboratory Branch (EDLB) at CDC.
+
+Kalamari also comes with a custom taxonomy database that, for example, defines
+_Shigella_ as a subspecies of _Escherichia coli_
+or defines the four lineages of _Listeria monocytogenes_.
+These changes have been backed by trusted SMEs in EDLB.
+
+## Contributing
+
+Please see [CONTRIBUTING.md](CONTRIBUTING.md).
+
 ## Citation

 Please refer to the ASM 2018 poster under docs
81 changes: 81 additions & 0 deletions bin/validateTaxonomy.pl
@@ -0,0 +1,81 @@
#!/usr/bin/env perl
use strict;
use warnings;
use Getopt::Long qw/GetOptions/;
use File::Basename qw/basename/;
use File::Path qw/make_path/;
use Data::Dumper qw/Dumper/;

local $0 = basename $0;
sub logmsg{ print STDERR "$0: @_\n";}

exit main();

sub main{
  my $settings={};
  GetOptions($settings,qw(help)) or die $!;
  die usage() if($$settings{help} || !@ARGV);

  # Count failures so that the exit status reflects the validation result
  my $numInvalid = 0;
  for my $taxdir (@ARGV){
    if(validateTaxonomy($taxdir, $settings)){
      logmsg "Valid: $taxdir";
    } else {
      logmsg "INVALID: $taxdir";
      $numInvalid++;
    }
  }

  # Exit nonzero if any taxonomy failed validation
  return $numInvalid ? 1 : 0;

}

# Return 1 if the taxonomy is good and 0 if not
sub validateTaxonomy{
  my($dir, $settings) = @_;

  my $names = readDmp("$dir/names.dmp", $settings);
  my $nodes = readDmp("$dir/nodes.dmp", $settings);

  # See if every element in nodes has a parent
  while(my($taxid, $taxinfo) = each(%$nodes)){
    my $parent = $$taxinfo[0];

    # Log a useful message and fail if the parent node is not present
    # and if the parent node is not 1 or 0
    if(! $$nodes{$parent} && $parent > 1){
      logmsg "ERROR: could not find node $parent which is the parent of $taxid";
      return 0;
    }

    # Find matching entries in names.dmp
    if($taxid > 1 && !$$names{$taxid}){
      logmsg "ERROR: could not find an entry in names.dmp for $taxid";
      return 0;
    }
    if($parent > 1 && !$$names{$parent} ){
      logmsg "ERROR: could not find an entry in names.dmp for $parent";
      return 0;
    }
  }

  return 1;
}

# Read a .dmp file into a hash of taxid => [remaining fields]
sub readDmp{
  my($dmp, $settings) = @_;
  my %dmp;
  open(my $fh, '<', $dmp) or die "ERROR: could not read $dmp: $!";
  while(<$fh>){
    chomp;
    next if(/^\s*$/); # skip blank lines
    my @F = split /\t\|\t/;
    $F[-1] =~s/\t\|$//; # remove trailing chars for last field
    my $taxid = shift(@F);
    # names.dmp can hold several rows per taxid; the last row wins,
    # which is enough for checking that the taxid is present
    $dmp{$taxid} = \@F;
  }
  close $fh;

  return \%dmp;
}

sub usage{
  print "Validate a folder of taxonomy containing nodes.dmp and names.dmp
  Usage: $0 taxonomy/ [taxonomy2...]
";
  exit 0;
}
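
For readers who do not use Perl, the checks that validateTaxonomy.pl performs can be sketched in Python as follows. This is not a port shipped with Kalamari: `read_dmp` and `taxonomy_errors` are hypothetical names, the sample rows are shortened, and unlike the Perl script this version collects every problem instead of stopping at the first one.

```python
def read_dmp(text):
    """Map each taxid to its remaining fields (the last row wins for repeated IDs)."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.endswith("\t|"):
            line = line[:-2]  # drop the row terminator
        taxid, *rest = line.split("\t|\t")
        table[taxid] = rest
    return table


def taxonomy_errors(nodes, names):
    """List missing parent nodes and missing names.dmp entries."""
    errors = []
    for taxid, info in nodes.items():
        parent = info[0]
        # IDs 0 and 1 are treated as the root, as in the Perl script
        if int(parent) > 1 and parent not in nodes:
            errors.append(f"missing node {parent}, parent of {taxid}")
        for t in (taxid, parent):
            if int(t) > 1 and t not in names:
                errors.append(f"no names.dmp entry for {t}")
    return errors


# Illustrative, shortened rows: the parent 198 is deliberately absent
nodes = read_dmp("91352\t|\t198\t|\tsubspecies\t|\n")
names = read_dmp(
    "91352\t|\tCampylobacter hyointestinalis subsp. hyointestinalis"
    "\t|\t\t|\tscientific name\t|\n"
)
for problem in taxonomy_errors(nodes, names):
    print(problem)
```

Collecting all errors makes it easier to fix a batch of new entries in one pass, at the cost of possibly reporting the same missing parent more than once.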

19 changes: 19 additions & 0 deletions src/chromosomes-todo.tsv
@@ -0,0 +1,19 @@
Arcobacter butzleri XXXXXX 28197 28196
Arcobacter cloacae XXXXXX 1054034 28196
Arcobacter cryaerophilus XXXXXX 28198 28196
Arcobacter nitrofigilis XXXXXX 28199 28196
Arcobacter venerupis XXXXXX 1054033 28196
Campylobacter canadensis XXXXXX 449520 194
Campylobacter corcagiensis XXXXXX 1448857 194
Campylobacter curvus XXXXXX 200 194
Campylobacter iguaniorum XXXXXX 1244531 194
Campylobacter jejuni doylei XXXXXX 32021 197
Campylobacter mucosalis XXXXXX 202 194
Campylobacter rectus XXXXXX 203 194
Campylobacter showae XXXXXX 204 194
Campylobacter upsaliensis XXXXXX 28080 194
Helicobacter bilis XXXXXX 37372 209
Helicobacter cinaedi XXXXXX 213 209
Helicobacter pullorum XXXXXX 35818 209
Helicobacter winghamensis XXXXXX 157268 209
Helicobacter valdiviensis XXXXXX 1458358 209
19 changes: 0 additions & 19 deletions src/chromosomes.tsv
@@ -12,20 +12,15 @@ Amycolatopsis mediterranei NC_014318 33910 1813
 Aquifex aeolicus NC_000918 63363 2713
 Architeuthis dux NC_011581 256136 34555
 Arcobacter bivalviorum CP031217 663364 2321115
-Arcobacter butzleri XXXXXX 28197 28196
 Arcobacter cibarius CP043857 255507 28196
-Arcobacter cloacae XXXXXX 1054034 28196
-Arcobacter cryaerophilus XXXXXX 28198 28196
 Arcobacter ellisii CP032097 913109 28196
 Arcobacter halophilus CP031218 197482 28196
 Arcobacter molluscorum CP032098 1032072 28196
 Arcobacter mytili CP031219 603050 28196
-Arcobacter nitrofigilis XXXXXX 28199 28196
 Arcobacter skirrowii CP032099 28200 28196
 Arcobacter suis CP032100 1278212 28196
 Arcobacter thereius CP035926 544718 28196
 Arcobacter trophiarum CP031367 708186 28196
-Arcobacter venerupis XXXXXX 1054033 28196
 Arripis trutta AP006810 270544 163128
 Atlantibacter hermannii CP042941 565 1903434
 Bacillus cereus NC_016771 1396 86661
@@ -47,30 +42,21 @@ Buchnera aphidicola NC_002528 9 32199
 Burkholderia pseudomallei NC_006350 28450 111527
 Burkholderia pseudomallei NC_006351 28450 111527
 Campylobacter avium CP022347 522484 522485
-Campylobacter canadensis XXXXXX 449520 194
 Campylobacter coli CP028187 195 194
 Campylobacter concisus CP012541 199 194
-Campylobacter corcagiensis XXXXXX 1448857 194
 Campylobacter cuniculorum CP020867 374106 194
-Campylobacter curvus XXXXXX 200 194
 Campylobacter fetus CP006833 196 194
 Campylobacter gracilis CP012196 824 194
 Campylobacter helveticus NZ_CP020478 28898 194
 Campylobacter hyointestinalis hyointestinalis CP015575 91352 198
 Campylobacter hyointestinalis lawsonii CP015575 91353 198
-Campylobacter iguaniorum XXXXXX 1244531 194
 Campylobacter insulaenigrae CP007770 260714 194
-Campylobacter jejuni doylei XXXXXX 32021 197
 Campylobacter jejuni jejuni NC_002163 32022 197
 Campylobacter lanienae CP015578 75658 194
 Campylobacter lari CP000932 201 194
-Campylobacter mucosalis XXXXXX 202 194
 Campylobacter peloridis CP007766 488546 194
-Campylobacter rectus XXXXXX 203 194
-Campylobacter showae XXXXXX 204 194
 Campylobacter sputorum CP019683 202 194
 Campylobacter subantarcticus CP007772 497724 194
-Campylobacter upsaliensis XXXXXX 28080 194
 Campylobacter ureolyticus CP012195 827 194
 Campylobacter volucris CP007774 1031542 194
 Candidatus Desulforudis audaxviator CP000860 471827 471826
@@ -149,12 +135,7 @@ Haemophilus influenzae NC_000907 727 724
 Haemophilus somnus CP000947 228400 731
 Halobacterium salinarum AM774415 478009 2242
 Helianthus annuus MG770607 4232 4231
-Helicobacter bilis XXXXXX 37372 209
-Helicobacter cinaedi XXXXXX 213 209
 Helicobacter pylori NC_000915 210 209
-Helicobacter pullorum XXXXXX 35818 209
-Helicobacter winghamensis XXXXXX 157268 209
-Helicobacter valdiviensis XXXXXX 1458358 209
 Heliobacterium modesticaldum CP000930 35701 2697
 Homo sapiens NC_012920 9606 9605
 Ketogulonicigenium vulgare NC_017384 92945 92944
4 changes: 2 additions & 2 deletions src/plasmids.tsv
@@ -34,7 +34,7 @@ Staphylococcus aureus NC_002096 1280 1279
 Staphylococcus aureus NC_002129 1280 1279
 Lysinibacillus sphaericus AY325804 1421 400634
 Gluconacetobacter diazotrophicus AM889286 33996 89583
-Frankia symbiont of Datisca glomerata CP002802 656024 1854
+Frankia symbiont of Datisca glomerata CP002802 2716812 1854
 Enterobacteriaceae CP011981 543 1854
 Enterobacteriaceae CP023916 543 1854
 Enterobacteriaceae HG969999 543 1854
@@ -4904,7 +4904,7 @@ Croceicoccus marinus CP019604 450378 1295327
 Paenibacillus larvae CP019658 1464 44249
 Paenibacillus larvae CP019657 1464 44249
 Paenibacillus larvae CP019653 1464 44249
-Frankia symbiont of Datisca glomerata CP002803 656024 44249
+Frankia symbiont of Datisca glomerata CP002803 2716812 1854
 Enterobacter cloacae KF998104 550 44249
 Edwardsiella ictaluri KC249996 67780 44249
 Enterobacter cloacae MF370188 550 44249
7 changes: 7 additions & 0 deletions src/taxonomy/names.dmp
@@ -22685,3 +22685,10 @@
 9000017 | Salmonella enterica subsp. X | | scientific name |
 2607663 | Yersinia canariae | | scientific name |
 1604335 | Yersinia rochesterensis | | scientific name |
+2716812 | Ca. Frankia datiscae | | equivalent name |
+2716812 | "Candidatus Frankia datiscae" Persson et al. 2011 | | authority |
+2716812 | Candidatus Frankia datiscae | | scientific name |
+2716812 | Frankia datiscae | | equivalent name |
+2716812 | Frankia symbiont of Datisca glomerata | | includes |
+91352 | Campylobacter hyointestinalis subsp. hyointestinalis Gebhart et al. 1985 | | authority |
+91352 | Campylobacter hyointestinalis subsp. hyointestinalis | | scientific name |
2 changes: 2 additions & 0 deletions src/taxonomy/nodes.dmp
@@ -3507,3 +3507,5 @@
 9000017 | 28901 | no rank | | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 | |
 2607663 | 629 | species | | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 | |
 1604335 | 629 | species | | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 | |
+91352 | 198 | subspecies | CH | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 | |
+2716812 | 1854 | species | CF | 0 | 1 | 11 | 1 | 0 | 1 | 0 | 0 | |
