* building taxonomy files but this script will be deprecated right away
* deprecated
* script to build taxonomy with src files
* m
* move old taxonomy to deprecated
* remove old 'versioned' files outside of git versioning
* filter taxonomy script
* complete the taxonomy
* updated scripts for compiling databases
* dev branch testing
* fix lmono test a bit
* .
Showing 20 changed files with 230 additions and 340 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,6 @@ | ||
on: | ||
push: | ||
- branches: [master] | ||
+ branches: [master, dev] | ||
name: Pull-down-all-accessions | ||
|
||
jobs: | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,6 @@ | ||
on: | ||
push: | ||
- branches: [fix-CI, master] | ||
+ branches: [master, dev] | ||
name: Validate taxonomy | ||
|
||
jobs: | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
edirect | ||
share |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -39,20 +39,22 @@ using your own email address instead of `[email protected]`. | |
|
||
## Download instructions | ||
|
||
For usage, run `perl bin/downloadKalamari.pl --help` | ||
First, build the taxonomy. | ||
The script `buildTaxonomy.sh` uses the diffs in Kalamari to enhance the default NCBI taxonomy. | ||
Next, `filterTaxonomy.sh` reduces the taxonomy files to just those found in Kalamari. | ||
`filterTaxonomy.sh` uses `taxonkit` and so this needs to be in your | ||
environment before starting. | ||
|
||
SRC=Kalamari | ||
perl bin/downloadKalamari.pl -o $SRC src/chromosomes.tsv | ||
bash bin/buildTaxonomy.sh | ||
bash bin/filterTaxonomy.sh | ||
|
||
### ...with plasmids | ||
To download the chromosomes and plasmids, use the `.tsv` files, respectively, with `downloadKalamari.pl`. | ||
Run `downloadKalamari.pl --help` for usage. | ||
However, to download the files to a standard location, | ||
please simply use `downloadKalamari.sh` which uses | ||
`downloadKalamari.pl` internally. | ||
|
||
SRC=Kalamari | ||
perl bin/downloadKalamari.pl -o $SRC src/chromosomes.tsv src/plasmids.tsv | ||
|
||
### taxonomy | ||
|
||
The taxonomy files `nodes.dmp` and `names.dmp` are under `src/taxonomy-VER` | ||
where `VER` is the version of Kalamari. | ||
bash bin/downloadKalamari.sh | ||
|
||
## Database formatting instructions | ||
|
||
|
@@ -80,4 +82,4 @@ Please see [CONTRIBUTING.md](CONTRIBUTING.md) | |
|
||
## Citation | ||
|
||
- Please refer to the ASM 2018 poster under docs | ||
+ Please refer to the ASM 2018 poster under docs. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
#!/bin/bash | ||
|
||
set -eu | ||
|
||
thisdir=$(dirname $0) | ||
KALAMARI_VER=$(downloadKalamari.pl --version) | ||
|
||
sharedir=$thisdir/../share/kalamari-$KALAMARI_VER | ||
SRC="$sharedir/kalamari" | ||
TAXDIR="$sharedir/taxonomy/filtered" | ||
|
||
# Test prereqs | ||
which kraken-build | ||
which jellyfish | ||
|
||
DB="$sharedir/kalamari-kraken1" | ||
mkdir -pv $DB | ||
cp -rv $TAXDIR $DB/taxonomy | ||
find $SRC -name '*.fasta' \ | ||
-exec kraken-build --db $DB --add-to-library {} \; | ||
kraken-build --db $DB --build --threads 1 | ||
kraken-build --db $DB --clean |
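
The two `which` calls above serve as the prerequisite test: under `set -e` the script stops when a tool is missing, but it does not say which one. A minimal sketch of a more explicit check follows. The helper name `require_tools` is hypothetical and not part of this commit; the example call uses `bash` and `ls` only so that it runs anywhere.

```shell
#!/bin/bash
set -eu

# Hypothetical helper (not part of the commit): report every missing
# tool by name and return non-zero if any of them is absent.
require_tools() {
  local missing=0
  local tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "ERROR: required tool not found in PATH: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call with tools that exist on any system running this sketch.
# In the build script this would be: require_tools kraken-build jellyfish
require_tools bash ls
```

Using `command -v` instead of `which` keeps the check inside the shell itself, so it does not depend on an external `which` binary being installed.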
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
#!/bin/bash | ||
|
||
set -eu | ||
|
||
thisdir=$(dirname $0) | ||
KALAMARI_VER=$(downloadKalamari.pl --version) | ||
|
||
sharedir=$thisdir/../share/kalamari-$KALAMARI_VER | ||
SRC="$sharedir/kalamari" | ||
TAXDIR="$sharedir/taxonomy/filtered" | ||
|
||
# Test prereqs | ||
which kraken2-build | ||
which jellyfish | ||
|
||
DB="$sharedir/kalamari-kraken2" | ||
mkdir -pv $DB | ||
cp -rv $TAXDIR $DB/taxonomy | ||
find $SRC -name '*.fasta' \ | ||
-exec kraken2-build --db $DB --add-to-library {} \; | ||
kraken2-build --db $DB --build --threads 1 | ||
kraken2-build --db $DB --clean |
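
The `find ... -exec` step above adds every `*.fasta` file under `$SRC` to the library, one `kraken2-build` call per file. To preview which files would be added without running Kraken2, a dry run along these lines may help. The function name `plan_library_adds` and the toy directory layout are assumptions made for illustration only.

```shell
#!/bin/bash
set -eu

# Hypothetical dry-run helper (not part of the commit): print the
# kraken2-build command that would run for each FASTA file under $1
# instead of executing it. $2 is the database directory.
plan_library_adds() {
  local src=$1 db=$2
  find "$src" -name '*.fasta' | sort | while read -r fasta; do
    echo "kraken2-build --db $db --add-to-library $fasta"
  done
}

# Toy input: two FASTA files and one file that should be ignored.
demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT
mkdir -p "$demo/src/Escherichia"
touch "$demo/src/Escherichia/a.fasta" "$demo/src/b.fasta" "$demo/src/notes.txt"

plan_library_adds "$demo/src" "$demo/db"
```

The sketch prints one line per FASTA file and skips `notes.txt`, mirroring the `-name '*.fasta'` filter in the script.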
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,59 @@ | ||
#!/bin/bash | ||
|
||
set -eu | ||
|
||
thisdir=$(dirname $0) | ||
thisfile=$(basename $0) | ||
KALAMARI_VER=$(downloadKalamari.pl --version) | ||
|
||
# Set up some directories | ||
tempdir=$(mktemp -d $thisfile.XXXXXX) | ||
trap "rm -rf $tempdir" EXIT | ||
outdir="$thisdir/../share/kalamari-$KALAMARI_VER/taxonomy" | ||
mkdir -pv $outdir | ||
|
||
# output files | ||
outnodes="$outdir/nodes.dmp" | ||
outnames="$outdir/names.dmp" | ||
|
||
# Build files | ||
delnodes="$thisdir/../src/taxonomy/build/delnodes.txt" | ||
addnodes="$thisdir/../src/taxonomy/build/nodes.dmp" | ||
addnames="$thisdir/../src/taxonomy/build/names.dmp" | ||
|
||
# Source files | ||
srcnodes="$tempdir/nodes.dmp" | ||
srcnames="$tempdir/names.dmp" | ||
|
||
# First, download the standard taxonomy dump tar.gz file | ||
curl ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz > $tempdir/taxonomy.tar.gz | ||
tar -C $tempdir -xzf $tempdir/taxonomy.tar.gz | ||
|
||
# Next, build the taxonomy database. | ||
# Remove taxids in $delnodes from the source nodes file | ||
while read -r line; do | ||
# If we see a comment line, skip it | ||
if [[ "$line" =~ ^# ]]; then | ||
continue | ||
fi | ||
|
||
# Read each 'word' as a taxid and remove it from | ||
# $srcnodes using sed /d | ||
for taxid in $line; do | ||
echo "Removing taxid $taxid from $srcnodes" | ||
sed -i -e "/^$taxid\t/d" $srcnodes | ||
done | ||
done < $delnodes | ||
|
||
# Add in new nodes and names | ||
echo "Combining NCBI taxonomy with new additions from Kalamari" | ||
cat $srcnodes $addnodes > $outnodes | ||
cat $srcnames $addnames > $outnames | ||
|
||
# Copy in the rest of the source files | ||
echo "Copying any remaining taxonomy files to the target" | ||
for i in $tempdir/*.dmp; do | ||
cp -nv $i $outdir/ | ||
done | ||
|
||
echo "Output can be found in $outdir" |
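
The deletion loop above anchors each taxid at the start of the line and requires a tab right after it, so that removing taxid 10 does not also remove taxid 100. The behaviour can be checked on toy data without downloading the NCBI dump. This is a sketch under two assumptions: GNU `sed` (for `-i` without a suffix and for `\t` in the pattern, as in the script itself), and made-up file contents that only imitate the `nodes.dmp` layout.

```shell
#!/bin/bash
set -eu

demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT

# Toy nodes.dmp: taxid <TAB> | <TAB> parent taxid <TAB> |
printf '1\t|\t1\t|\n10\t|\t1\t|\n100\t|\t10\t|\n' > "$demo/nodes.dmp"
# Toy delnodes file: one comment line, then the taxid to delete.
printf '# taxids to remove\n10\n' > "$demo/delnodes.txt"

while read -r line; do
  # Skip comment lines
  if [[ "$line" =~ ^# ]]; then
    continue
  fi
  for taxid in $line; do
    # The trailing tab stops taxid 10 from also matching taxid 100.
    sed -i -e "/^$taxid\t/d" "$demo/nodes.dmp"
  done
done < "$demo/delnodes.txt"

# Show the taxids that remain
cut -f 1 "$demo/nodes.dmp"
```

Only the row for taxid 10 is dropped; the rows for 1 and 100 remain, even though 100 names 10 as its parent. The script has the same property, so any children of a deleted node need their own entries in the delete or add files.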
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,45 @@ | ||
#!/bin/bash | ||
|
||
set -eu | ||
|
||
thisdir=$(dirname $0) | ||
thisfile=$(basename $0) | ||
KALAMARI_VER=$(downloadKalamari.pl --version) | ||
|
||
# Set up some directories | ||
tempdir=$(mktemp -d $thisfile.XXXXXX) | ||
trap "rm -rf $tempdir" EXIT | ||
outdir="$thisdir/../share/kalamari-$KALAMARI_VER/taxonomy/filtered" | ||
srcdir="$thisdir/../share/kalamari-$KALAMARI_VER/taxonomy" | ||
mkdir -pv $outdir | ||
|
||
# output files | ||
outnodes="$outdir/nodes.dmp" | ||
outnames="$outdir/names.dmp" | ||
|
||
# source taxonomy | ||
srcnodes="$srcdir/nodes.dmp" | ||
srcnames="$srcdir/names.dmp" | ||
|
||
# source leaf taxids | ||
taxid=$(cut -f 3,4 $thisdir/../src/chromosomes.tsv $thisdir/../src/plasmids.tsv | grep -v taxid | tr '\t' '\n' | sort -n | uniq) | ||
|
||
# Getting all necessary taxids | ||
alltaxids=$(echo "$taxid" | taxonkit --data-dir=$srcdir lineage -t | cut -f 3 | tr ';' '\n' | grep . | sort -n | uniq) | ||
numtaxids=$(wc -l <<< "$alltaxids") | ||
echo "found $numtaxids taxids after calculating each taxon's lineage" | ||
|
||
# Filter nodes.dmp and names.dmp for $alltaxids | ||
echo "Finding all filtered taxids in $srcnodes" | ||
num=0 | ||
# Replace the for loop with regex for grep | ||
regex=$(echo "$alltaxids" | perl -plane 's/(\d+)/^$1\t/' | tr '\n' '|' | sed 's/|$//'); | ||
|
||
grep -E "$regex" $srcnodes > $outnodes | ||
grep -E "$regex" $srcnames > $outnames | ||
|
||
# Copy in the rest of the source files | ||
echo "Copying any remaining taxonomy files to the target" | ||
for i in $srcdir/*.dmp; do | ||
cp -nv $i $outdir/ | ||
done |
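
The filter above turns the taxid list into one large alternation, `^taxid<TAB>|^taxid<TAB>|...`, and lets a single `grep -E` pass select the matching rows. The sketch below reproduces that idea on toy data. It swaps the `perl` one-liner for `sed` and replaces the `taxonkit lineage` output with a hard-coded list, both purely for illustration. It also counts the taxids with `wc -l`, which counts lines, whereas `wc -c` counts bytes.

```shell
#!/bin/bash
set -eu

demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT

# Toy nodes.dmp with taxids 1, 10, 100, and 1000
printf '1\t|\t1\t|\n10\t|\t1\t|\n100\t|\t10\t|\n1000\t|\t100\t|\n' > "$demo/nodes.dmp"

# Taxids to keep, one per line (stand-in for the taxonkit lineage output)
alltaxids=$(printf '1\n100\n')

# wc -l on the here-string counts lines, i.e. taxids.
numtaxids=$(wc -l <<< "$alltaxids")
echo "keeping $numtaxids taxids"

# One anchored alternative per taxid, each ending in a literal tab
tab=$(printf '\t')
regex=$(echo "$alltaxids" | sed -e "s/^/^/" -e "s/\$/$tab/" | tr '\n' '|' | sed 's/|$//')

grep -E "$regex" "$demo/nodes.dmp" > "$demo/filtered.dmp"
cut -f 1 "$demo/filtered.dmp"
```

The start anchor and trailing tab keep taxid 1 from matching 10, 100, or 1000. For very large taxid lists, writing the patterns to a file and using `grep -f` may be easier on the command-line length limit, though that is a design variation rather than what the commit does.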