More proper build #42

Merged
merged 5 commits into from
May 9, 2024
19 changes: 11 additions & 8 deletions .github/workflows/unit-testing.Listeria.Kraken1.yml
@@ -1,7 +1,7 @@
# This is a subsampling unit test to get early results
on:
push:
branches: [master]
branches: [master, dev, build-taxonomy]
name: Listeria-with-Kraken1

env:
@@ -23,7 +23,7 @@ jobs:
perl-version: ${{ matrix.perl }}
multi-thread: "true"
- name: checkout my repo
uses: actions/checkout@v2
uses: actions/checkout@v4
with:
path: Kalamari

@@ -48,7 +48,9 @@ jobs:
perl -MNet::FTP -e '$ftp = new Net::FTP("ftp.ncbi.nlm.nih.gov", Passive => 1); $ftp->login; $ftp->binary; $ftp->get("/entrez/entrezdirect/edirect.tar.gz");'
gunzip -cv edirect.tar.gz | tar xf -
rm -v edirect.tar.gz
export PATH=${PATH}:$HOME/edirect >& /dev/null || setenv PATH "${PATH}:$HOME/edirect"
echo $GITHUB_WORKSPACE/edirect >> $GITHUB_PATH
echo $GITHUB_WORKSPACE/Kalamari/bin >> $GITHUB_PATH
#export PATH=${PATH}:$HOME/edirect >& /dev/null || setenv PATH "${PATH}:$HOME/edirect"
yes Y | ./edirect/setup.sh
tree edirect
- name: check-env
@@ -64,10 +66,6 @@ jobs:
run: perl Kalamari/bin/downloadKalamari.pl --outdir ${{ env.OUTDIR }} ${{ env.TSV }}
- name: check-results
run: tree ${{ env.OUTDIR }}
#- name: download-more
# run: perl Kalamari/bin/downloadKalamari.pl --outdir ${{ env.OUTDIR }} ${{ env.TSV }} --and protein --and nucleotide
#- name: check-results
# run: tree ${{ env.OUTDIR }}
- name: install kraken
run: |
wget https://github.com/DerrickWood/kraken/archive/refs/tags/v1.1.1.tar.gz -O kraken-v1.1.1.tar.gz
@@ -76,12 +74,17 @@ jobs:
chmod -v +x kraken-1.1.1/kraken-src/*
echo $(realpath kraken-1.1.1/kraken-src) >> $GITHUB_PATH
tree $(realpath kraken-1.1.1)
- name: build taxonomy
run: |
export PATH=$PATH:Kalamari/bin
buildTaxonomy.sh
ls -lh Kalamari/share
- name: Kraken1 database
run: |
echo $PATH
which kraken-build
mkdir -pv kraken
cp -rv Kalamari/src/taxonomy kraken/taxonomy
cp -rv Kalamari/share/kalamari-*/taxonomy kraken/taxonomy
find ${{ env.OUTDIR }} -name '*.fasta' -exec kraken-build --db kraken --add-to-library {} \;
tree kraken
# Some super debugging here with -x
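The edirect step in this workflow now appends directories to the file named by `$GITHUB_PATH` instead of exporting `PATH`. On GitHub Actions, entries written to that file are prepended to `PATH` only for later steps in the job; the step that writes them still sees its old `PATH`. A minimal sketch of a helper that covers both cases follows. The name `add_to_path` is hypothetical and not part of this PR.

```shell
#!/bin/bash
# Hypothetical helper (not part of this PR): make a directory visible on PATH
# both in the current step and in later steps of the same job.
add_to_path() {
  local dir=$1
  # Later steps: the runner reads this file once the current step has ended.
  if [ -n "${GITHUB_PATH:-}" ]; then
    echo "$dir" >> "$GITHUB_PATH"
  fi
  # Current step: writing to the GITHUB_PATH file does not alter this shell's PATH.
  export PATH="$dir:$PATH"
}
```

With something like this, a step could call `add_to_path "$GITHUB_WORKSPACE/edirect"` and run `./edirect/setup.sh` or any edirect tool immediately afterwards without a separate `export`.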
79 changes: 0 additions & 79 deletions .github/workflows/unit-testing.Listeria.Kraken2.yml

This file was deleted.

38 changes: 22 additions & 16 deletions .github/workflows/unit-testing.Yersinia.Kraken2.yml
@@ -1,14 +1,12 @@
# This is a subsampling unit test to get early results
on:
push:
branches: [master]
name: Yersinia-with-Kraken2
branches: [master, dev, build-taxonomy]
name: Genera-with-Kraken2

env:
TSV: "Kalamari/src/genus.tsv"
OUTDIR: "Yersinia.out"
DB: "kraken2"
SRC_TAX: "Kalamari/src/taxonomy"
SRC_CHR: "Kalamari/src/chromosomes.tsv"
SRC_PLD: "Kalamari/src/plasmids.tsv"
GENUS: Yersinia
@@ -20,15 +18,16 @@ jobs:
matrix:
os: ['ubuntu-20.04' ]
perl: [ '5.32' ]
name: Perl ${{ matrix.perl }} on ${{ matrix.os }}
GENUS: [ 'Yersinia', 'Listeria']
name: ${{ matrix.GENUS }} Perl ${{ matrix.perl }} on ${{ matrix.os }}
steps:
- name: Set up perl
uses: shogo82148/actions-setup-perl@v1
with:
perl-version: ${{ matrix.perl }}
multi-thread: "true"
- name: checkout my repo
uses: actions/checkout@v2
uses: actions/checkout@v4
with:
path: Kalamari

@@ -40,29 +39,37 @@ jobs:
- name: select only this genus
run: |
head -n 1 ${{ env.SRC_CHR }} > ${{ env.TSV }}
grep -m 2 ${{ env.GENUS }} ${{ env.SRC_CHR }} >> ${{ env.TSV }}
grep -m 2 ${{ env.GENUS }} ${{ env.SRC_PLD }} >> ${{ env.TSV }}
echo "These are the ${{ env.GENUS }} genomes for downstream tests"
grep -m 2 ${{ matrix.GENUS }} ${{ env.SRC_CHR }} >> ${{ env.TSV }}
grep -m 2 ${{ matrix.GENUS }} ${{ env.SRC_PLD }} >> ${{ env.TSV }}
echo "These are the ${{ matrix.GENUS }} genomes for downstream tests"
column -ts $'\t' ${{ env.TSV }}
hexdump -c ${{ env.TSV }}
- name: download
run: perl Kalamari/bin/downloadKalamari.pl --outdir ${{ env.OUTDIR }} ${{ env.TSV }}
run: perl Kalamari/bin/downloadKalamari.pl --outdir ${{ matrix.GENUS }} ${{ env.TSV }}
- name: check-results
run: tree ${{ env.OUTDIR }}
run: |
tree ${{ matrix.GENUS }}
echo "First two lines of each fasta file:"
find ${{ matrix.GENUS }} -name '*.fasta' | xargs head -n 2 | cut -c 1-60
- name: install kraken
run: |
wget https://github.com/DerrickWood/kraken2/archive/refs/tags/v2.1.2.tar.gz -O kraken-v2.1.2.tar.gz
tar zxvf kraken-v2.1.2.tar.gz
cd kraken2-2.1.2 && bash install_kraken2.sh target && cd -
ls -lhS kraken2-2.1.2/target
chmod +x kraken2-2.1.2/target/*
- name: build taxonomy
run: |
export PATH=$PATH:Kalamari/bin
buildTaxonomy.sh
ls -lh Kalamari/share
- name: Kraken2 database
run: |
export PATH=$PATH:kraken2-2.1.2/target
which kraken2-build
mkdir -pv ${{ env.DB }}
cp -rv ${{ env.SRC_TAX }} ${{ env.DB }}/taxonomy
find ${{ env.OUTDIR }} -name '*.fasta' -exec kraken2-build --db ${{ env.DB }} --add-to-library {} \;
cp -rv Kalamari/share/kalamari-*/taxonomy ${{ env.DB }}/taxonomy
find ${{ matrix.GENUS }} -name '*.fasta' -exec kraken2-build --db ${{ env.DB }} --add-to-library {} \;
tree ${{ env.DB }}
echo ".....Building the database....."
kraken2-build --build --db ${{ env.DB }} --threads 2
@@ -71,10 +78,9 @@ jobs:
export PATH=$PATH:kraken2-2.1.2/target
tree ${{ env.DB }}
ls -lhSR ${{ env.DB }}
QUERY=$(find ${{ env.OUTDIR }} -name '*.fasta' | head -n 1)
QUERY=$(find ${{ matrix.GENUS }} -name '*.fasta' | head -n 1)
echo "QUERY is $QUERY"
head -n 2 $QUERY
kraken2 --db ${{ env.DB }} --report kraken2.report --use-mpa-style --output kraken2.raw $QUERY
set -x; kraken2 --db ${{ env.DB }} --report kraken2.report --use-mpa-style --output kraken2.raw $QUERY; set +x;
head kraken2.report kraken2.raw
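The genus selection step in this workflow builds a small test TSV from the header row plus at most two matching rows from each source table (`grep -m 2`). The same logic can be sketched as a standalone function, which makes it easy to try locally for any genus in the matrix. The function name `subsample_genus` and its argument order are hypothetical, not part of this PR.

```shell
#!/bin/bash
# Hypothetical helper mirroring the workflow's genus selection step.
# Arguments: genus, chromosomes TSV, plasmids TSV, output TSV.
subsample_genus() {
  local genus=$1 chr=$2 pld=$3 out=$4
  # Keep the header row of the chromosomes table.
  head -n 1 "$chr" > "$out"
  # Take at most two matching rows from each table.
  # grep returns nonzero when nothing matches, so a genus that is
  # absent from either table makes the function fail, as in the workflow.
  grep -m 2 -- "$genus" "$chr" >> "$out" || return 1
  grep -m 2 -- "$genus" "$pld" >> "$out" || return 1
}
```

For example, `subsample_genus Listeria Kalamari/src/chromosomes.tsv Kalamari/src/plasmids.tsv Kalamari/src/genus.tsv` would reproduce what the step does for the `Listeria` matrix entry.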


2 changes: 1 addition & 1 deletion .github/workflows/unit-testing.yml
@@ -1,6 +1,6 @@
on:
push:
branches: [master]
branches: [master, dev]
name: Pull-down-all-accessions

jobs:
25 changes: 18 additions & 7 deletions .github/workflows/validateTaxonomy.yml
@@ -1,6 +1,6 @@
on:
push:
branches: [fix-CI, master]
branches: [master, dev, build-taxonomy]
name: Validate taxonomy

jobs:
@@ -18,15 +18,26 @@ jobs:
perl-version: ${{ matrix.perl }}
multi-thread: "true"
- name: checkout my repo
uses: actions/checkout@v2
uses: actions/checkout@v4
with:
path: Kalamari

- name: validate taxonomy
- name: update PATH
run: |
echo $GITHUB_WORKSPACE/Kalamari/bin >> $GITHUB_PATH
echo $PATH
echo ""
cat $GITHUB_PATH
- name: build taxonomy
run: |
perl Kalamari/bin/validateTaxonomy.pl Kalamari/src
echo $PATH
bash Kalamari/bin/buildTaxonomy.sh
ls -lhR Kalamari/share/kalamari-*/taxonomy
#- name: validate taxonomy
# run: |
# perl Kalamari/bin/validateTaxonomy.pl Kalamari/share/kalamari-*/taxonomy/nodes.dmp Kalamari/share/kalamari-*/taxonomy/names.dmp
- name: matching taxids
run: |
export taxdir=$(\ls -d Kalamari/share/kalamari-*/taxonomy)
echo "Making sure that all taxids in chromosomes.tsv and plasmids.tsv are present in nodes.dmp and names.dmp"
tail -n +2 Kalamari/src/chromosomes.tsv Kalamari/src/plasmids.tsv -q | perl -F'\t' -lane 'BEGIN{@node=`cat Kalamari/src/taxonomy/nodes.dmp`; for $n(@node){($taxid)=split(/\t/, $n); $taxid{$taxid}++; } } for my $t($F[2], $F[3]){ if(!$taxid{$t}){ print "Could not find $t taxid";} }'
tail -n +2 Kalamari/src/chromosomes.tsv Kalamari/src/plasmids.tsv -q | perl -F'\t' -lane 'BEGIN{@name=`cat Kalamari/src/taxonomy/names.dmp`; for $n(@name){($taxid)=split(/\t/, $n); $taxid{$taxid}++; } } for my $t($F[2], $F[3]){ if(!$taxid{$t}){ print "Could not find $t taxid";} }'
tail -n +2 Kalamari/src/chromosomes.tsv Kalamari/src/plasmids.tsv -q | perl -F'\t' -lane 'BEGIN{@node=`cat $ENV{taxdir}/nodes.dmp`; for $n(@node){($taxid)=split(/\t/, $n); $taxid{$taxid}++; } } for my $t($F[2], $F[3]){ if(!$taxid{$t}){ print "Could not find $t taxid";} }'
tail -n +2 Kalamari/src/chromosomes.tsv Kalamari/src/plasmids.tsv -q | perl -F'\t' -lane 'BEGIN{@name=`cat $ENV{taxdir}/names.dmp`; for $n(@name){($taxid)=split(/\t/, $n); $taxid{$taxid}++; } } for my $t($F[2], $F[3]){ if(!$taxid{$t}){ print "Could not find $t taxid";} }'
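The `matching taxids` step runs two near-identical Perl one-liners, one per `.dmp` file, and only prints missing taxids without failing the job. As a sketch under assumptions, the same check could be a single shell function that also returns a nonzero status when something is missing. The name `missing_taxids` is hypothetical; like the one-liners, it assumes taxids sit in the third and fourth tab-separated columns of each TSV and in the first column of the `.dmp` file.

```shell
#!/bin/bash
# Hypothetical helper (not part of this PR).
# Arguments: a taxonomy .dmp file, then one or more TSV files with a header row.
missing_taxids() {
  local dmp=$1
  shift
  awk -F'\t' '
    NR == FNR { seen[$1] = 1; next }   # first file: remember every taxid in the .dmp
    FNR == 1  { next }                 # skip the header row of each TSV
    {
      for (i = 3; i <= 4; i++) {
        if (!($i in seen)) {
          print "Could not find " $i " taxid"
          bad = 1
        }
      }
    }
    END { exit bad + 0 }               # nonzero when any taxid was missing
  ' "$dmp" "$@"
}
```

A step could then run `missing_taxids "$taxdir/nodes.dmp" Kalamari/src/chromosomes.tsv Kalamari/src/plasmids.tsv` and repeat the call for `names.dmp`, letting the job fail when a taxid is absent.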
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
edirect
share
26 changes: 14 additions & 12 deletions README.md
@@ -39,20 +39,22 @@ using your own email address instead of `[email protected]`.

## Download instructions

For usage, run `perl bin/downloadKalamari.pl --help`
First, build the taxonomy.
The script `buildTaxonomy.sh` uses the diffs in Kalamari to enhance the default NCBI taxonomy.
Next, `filterTaxonomy.sh` reduces the taxonomy files to just the taxa found in Kalamari.
`filterTaxonomy.sh` requires `taxonkit`, so make sure it is available in your
environment before starting.

SRC=Kalamari
perl bin/downloadKalamari.pl -o $SRC src/chromosomes.tsv
bash bin/buildTaxonomy.sh
bash bin/filterTaxonomy.sh
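Because `filterTaxonomy.sh` depends on `taxonkit`, it can help to check for prerequisites before starting a long build. The following is a minimal sketch; the helper name `require_tools` is hypothetical and not shipped with Kalamari.

```shell
#!/bin/bash
# Hypothetical helper: report every missing prerequisite in one pass.
require_tools() {
  local missing=0
  local tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "ERROR: required tool not found in PATH: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example use (the script names come from this README; the guard is illustrative):
# require_tools taxonkit && bash bin/buildTaxonomy.sh && bash bin/filterTaxonomy.sh
```

Checking all tools before running anything reports every missing prerequisite at once, rather than stopping at the first failure partway through the build.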

### ...with plasmids
To download the chromosomes and plasmids, pass the corresponding `.tsv` files to `downloadKalamari.pl`.
Run `downloadKalamari.pl --help` for usage.
To download the files to a standard location instead,
use `downloadKalamari.sh`, which calls
`downloadKalamari.pl` internally.

SRC=Kalamari
perl bin/downloadKalamari.pl -o $SRC src/chromosomes.tsv src/plasmids.tsv

### taxonomy

The taxonomy files `nodes.dmp` and `names.dmp` are under `src/taxonomy-VER`
where `VER` is the version of Kalamari.
bash bin/downloadKalamari.sh

## Database formatting instructions

@@ -80,4 +82,4 @@ Please see [CONTRIBUTING.md](CONTRIBUTING.md)

## Citation

Please refer to the ASM 2018 poster under docs
Please refer to the ASM 2018 poster under docs.
22 changes: 22 additions & 0 deletions bin/buildKraken1.sh
@@ -0,0 +1,22 @@
#!/bin/bash

set -eu

thisdir=$(dirname $0)
KALAMARI_VER=$(downloadKalamari.pl --version)

sharedir=$thisdir/../share/kalamari-$KALAMARI_VER
SRC="$sharedir/kalamari"
TAXDIR="$sharedir/taxonomy/filtered"

# Test prereqs
which kraken-build
which jellyfish

DB="$sharedir/kalamari-kraken1"
mkdir -pv $DB
cp -rv $TAXDIR $DB/taxonomy
find $SRC -name '*.fasta' \
-exec kraken-build --db $DB --add-to-library {} \;
kraken-build --db $DB --build --threads 1
kraken-build --db $DB --clean
22 changes: 22 additions & 0 deletions bin/buildKraken2.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
#!/bin/bash

set -eu

thisdir=$(dirname $0)
KALAMARI_VER=$(downloadKalamari.pl --version)

sharedir=$thisdir/../share/kalamari-$KALAMARI_VER
SRC="$sharedir/kalamari"
TAXDIR="$sharedir/taxonomy/filtered"

# Test prereqs
which kraken2-build
which jellyfish

DB="$sharedir/kalamari-kraken2"
mkdir -pv $DB
cp -rv $TAXDIR $DB/taxonomy
find $SRC -name '*.fasta' \
-exec kraken2-build --db $DB --add-to-library {} \;
kraken2-build --db $DB --build --threads 1
kraken2-build --db $DB --clean
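`bin/buildKraken1.sh` and `bin/buildKraken2.sh` differ only in the name of the build executable and the database directory. One possible way to share the logic, offered as a sketch and not as what this PR does, is to select the executable from a version argument. The function names `kraken_build_cmd` and `kraken_build_plan` are hypothetical; the flags are the ones used in the two scripts.

```shell
#!/bin/bash
# Hypothetical refactoring sketch: pick the build executable by Kraken major version.
kraken_build_cmd() {
  case "$1" in
    1) echo "kraken-build" ;;
    2) echo "kraken2-build" ;;
    *) echo "unsupported Kraken version: $1" >&2; return 1 ;;
  esac
}

# Print the final build and clean commands a combined script would run,
# without executing them, so the plan can be inspected first.
kraken_build_plan() {
  local version=$1 db=$2 build
  build=$(kraken_build_cmd "$version") || return 1
  echo "$build --db $db --build --threads 1"
  echo "$build --db $db --clean"
}
```

Printing the plan instead of running it keeps the sketch safe to try on a machine where neither Kraken version is installed.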