update genome list; stable efetching (#49)
* Add genomes (#45)

* Corynebacterium diphtheriae

* added Bifidobacterium adolenscentis

* replaced S. enterica IIIa; Added hops (Humulus lupulus)

* added a Citrobacter species

* m

* replaced repressed genome accession for B. faecium

* Esearch input (#47)

* Add genomes (#45) (#46)

* Corynebacterium diphtheriae

* added Bifidobacterium adolenscentis

* replaced S. enterica IIIa; Added hops (Humulus lupulus)

* added a Citrobacter species

* m

* replaced repressed genome accession for B. faecium

* remove random single quotes

* bump version

* helpful log messages

* v5.6.3

* make symlink to avoid naming mistakes

* check whether taxonkit is loaded

* use efetch -input

* fix tr bug

* Esearch input flag (#48)

* Add genomes (#45) (#46)

* Corynebacterium diphtheriae

* added Bifidobacterium adolenscentis

* replaced S. enterica IIIa; Added hops (Humulus lupulus)

* added a Citrobacter species

* m

* replaced repressed genome accession for B. faecium

* remove random single quotes

* bump version

* helpful log messages

* v5.6.3

* make symlink to avoid naming mistakes

* check whether taxonkit is loaded

* use efetch -input

* fix tr bug

* get latest edirect

* update installation instructions

* update installation instructions: fix PATH

* bring in other tests

* update installation method for search with unit-testing

* update installation method for search with kraken2

* debug the ls statement

* debug the ls statement

* debug the ls statement

* debug building taxonomy

* exclusive unit testing for taxonomy for right now

* install taxonkit
lskatz authored Jun 5, 2024
1 parent 18033f7 commit 2104a5d
Showing 7 changed files with 53 additions and 42 deletions.
16 changes: 4 additions & 12 deletions .github/workflows/unit-testing.Listeria.Kraken1.yml
@@ -1,7 +1,7 @@
 # This is a subsampling unit test to get early results
 on:
 push:
-branches: [master, dev, validate-taxonomy]
+branches: [master, dev]
 name: Listeria-with-Kraken1

 env:
@@ -41,18 +41,10 @@ jobs:
 tree $(realpath .)
 - name: install-edirect
 run: |
-sudo apt-get install ncbi-entrez-direct
-echo "installed edirect the apt way"
-exit
-cd $HOME
-perl -MNet::FTP -e '$ftp = new Net::FTP("ftp.ncbi.nlm.nih.gov", Passive => 1); $ftp->login; $ftp->binary; $ftp->get("/entrez/entrezdirect/edirect.tar.gz");'
-gunzip -cv edirect.tar.gz | tar xf -
-rm -v edirect.tar.gz
-echo $GITHUB_WORKSPACE/edirect >> $GITHUB_PATH
+sh -c "$(curl -fsSL https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh)"
+echo $HOME/edirect >> $GITHUB_PATH
 echo $GITHUB_WORKSPACE/Kalamari/bin >> $GITHUB_PATH
-#export PATH=${PATH}:$HOME/edirect >& /dev/null || setenv PATH "${PATH}:$HOME/edirect"
-yes Y | ./edirect/setup.sh
-tree edirect
+tree $HOME/edirect
 - name: check-env
 run: echo "$PATH"
 - name: select for only Listeria
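The reworked install-edirect step writes to `$GITHUB_PATH` instead of relying on `export PATH`, which only lasts for the step that runs it. Directories appended to `$GITHUB_PATH` become visible to later steps in the job, not to the step that appends them. A rough sketch of that pattern follows; the `add_to_job_path` helper and the temp-file fallback for local dry runs are assumptions made for illustration and are not part of the workflow.

```shell
#!/usr/bin/env bash
# Sketch only: add_to_job_path and the mktemp fallback are hypothetical.

# On a GitHub Actions runner GITHUB_PATH points at a file managed by the runner.
# The fallback lets the sketch run outside of Actions.
path_file="${GITHUB_PATH:-$(mktemp)}"

add_to_job_path() {
  # One directory per line; the runner prepends each one to PATH
  # for the steps that come after the current one.
  printf '%s\n' "$1" >> "$path_file"
}

add_to_job_path "$HOME/edirect"

# The current step does not pick up the change, so export as well
# if the tools are needed before the step ends.
export PATH="$HOME/edirect:$PATH"
```

This is why a `check-env` step placed after the install step can be used to confirm that the directory was registered.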
10 changes: 9 additions & 1 deletion .github/workflows/unit-testing.Yersinia.Kraken2.yml
@@ -1,7 +1,7 @@
 # This is a subsampling unit test to get early results
 on:
 push:
-branches: [master, dev, validate-taxonomy]
+branches: [master, dev]
 name: Genera-with-Kraken2

 env:
@@ -34,6 +34,14 @@ jobs:
 - name: env check
 run: |
 echo $PATH | tr ':' '\n' | sort
+- name: install-edirect
+run: |
+sh -c "$(curl -fsSL https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh)"
+echo $HOME/edirect >> $GITHUB_PATH
+echo $GITHUB_WORKSPACE/Kalamari/bin >> $GITHUB_PATH
+tree $HOME/edirect
 - name: apt-get install
 run: sudo apt-get install ca-certificates tree jellyfish ncbi-entrez-direct
 - name: select for only for this genus
17 changes: 6 additions & 11 deletions .github/workflows/unit-testing.yml
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
on:
push:
branches: [master, dev, validate-taxonomy]
branches: [master, dev]
name: Pull-down-all-accessions

jobs:
Expand All @@ -25,18 +25,13 @@ jobs:

- name: apt-get install
run: sudo apt-get install ca-certificates tree

- name: install-edirect
run: |
sudo apt-get install ncbi-entrez-direct
echo "installed edirect the apt way"
exit
cd $HOME
perl -MNet::FTP -e '$ftp = new Net::FTP("ftp.ncbi.nlm.nih.gov", Passive => 1); $ftp->login; $ftp->binary; $ftp->get("/entrez/entrezdirect/edirect.tar.gz");'
gunzip -cv edirect.tar.gz | tar xf -
rm -v edirect.tar.gz
export PATH=${PATH}:$HOME/edirect >& /dev/null || setenv PATH "${PATH}:$HOME/edirect"
yes Y | ./edirect/setup.sh
tree edirect
sh -c "$(curl -fsSL https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh)"
echo $HOME/edirect >> $GITHUB_PATH
echo $GITHUB_WORKSPACE/Kalamari/bin >> $GITHUB_PATH
tree $HOME/edirect
- name: check-env
run: echo "$PATH"
- name: download
13 changes: 10 additions & 3 deletions .github/workflows/validateTaxonomy.yml
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
on:
push:
branches: [master, dev, validate-taxonomy]
branches: [master, dev, esearch-input]
name: Validate taxonomy

jobs:
Expand All @@ -27,11 +27,18 @@ jobs:
echo $PATH
echo ""
cat $GITHUB_PATH
- name: install taxonkit
run: |
wget https://github.com/shenwei356/taxonkit/releases/download/v0.16.0/taxonkit_linux_amd64.tar.gz
tar -xvf taxonkit_linux_amd64.tar.gz
rm -v taxonkit_linux_amd64.tar.gz
chmod +x taxonkit
echo $(realpath .) >> $GITHUB_PATH
- name: build taxonomy
run: |
echo $PATH
bash Kalamari/bin/buildTaxonomy.sh
bash Kalamari/bin/filterTaxonomy.sh
bash -x Kalamari/bin/buildTaxonomy.sh
bash -x Kalamari/bin/filterTaxonomy.sh
ls -lhR Kalamari/share/kalamari-*/taxonomy
- name: validate taxonomy
run: |
8 changes: 7 additions & 1 deletion bin/buildKraken1.sh
@@ -22,13 +22,15 @@ cp -rv $TAXDIR $DB/taxonomy

 # Make --add-to-library more efficient with
 # concatenated fasta files
+export nl=$'\n'
 find $SRC -name '*.fasta.gz' | \
 xargs -n 100 -P 1 bash -c '
 for i in "$@"; do
 gzip -cd $i
 done > $tmpfile
 echo -ne "ADDING to library:\n "
-zgrep "^>" $tmpfile | sed "s/^>//" | tr '\n' ' '
+zgrep "^>" $tmpfile | sed "s/^>//" | tr "$nl" " "
 echo
+echo "^^ contents of $tmpfile ^^"
 kraken-build --db $DB --add-to-library $tmpfile
 '
@@ -38,3 +40,7 @@ kraken-build --db $DB --build --threads 1
 # Reduce the size of the database
 kraken-build --db $DB --clean

+
+if [ ! -e "$sharedir/kalamari-kraken1" ]; then
+ln -sv kalamari-kraken "$sharedir/kalamari-kraken1"
+fi
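A note on the `tr` change in the first hunk: the script handed to `bash -c` is wrapped in single quotes, so an inner `'\n'` ends that quoted string early and the lines after it are no longer part of the script. The commit avoids this by exporting `nl` and referencing `"$nl"` inside the script. Below is a minimal sketch of the same idea; the `join_headers` function and the input lines (`>seq1`, `>seq2`) are invented stand-ins for FASTA headers, not code from the repository.

```shell
#!/usr/bin/env bash
# Hypothetical demo of the quoting fix; names and input are illustrative only.

# Export a literal newline so the child bash started by xargs can expand it.
export nl=$'\n'

join_headers() {
  # The child script sits inside single quotes, so it must not contain
  # single quotes of its own; "$nl" is expanded by the child shell.
  # The trailing _ fills $0 so that every input item lands in "$@".
  xargs -n 100 -P 1 bash -c '
    printf "%s\n" "$@" | sed "s/^>//" | tr "$nl" " "
    echo
  ' _
}

# Strips the leading ">" and joins the headers on one line.
printf '>seq1\n>seq2\n' | join_headers
```

Using double quotes everywhere inside the child script keeps the outer single-quoted string intact, at the cost of needing an exported variable for characters such as the newline.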
26 changes: 12 additions & 14 deletions bin/downloadKalamari.pl
@@ -11,7 +11,7 @@
 use IO::Compress::Gzip;
 use version 0.77;

-our $VERSION = version->parse("5.6.3");
+our $VERSION = version->parse("5.7.0");

 use threads;

@@ -167,27 +167,25 @@ sub downloadEntries{
 my $numEntries = scalar(@$entries);
 my @acc = map{$$_{nuccoreAcc}} @$entries;
 logmsg "Downloading ".scalar(@acc)." accessions";
-my $queryArg = join("[accession] OR ", sort(@acc))."[accession]";
 my $dir = tempdir("download.XXXXXX", DIR=>$$settings{tempdir});

+# Make the input file for efetch
+my $inputAcc = "$dir/input.acc";
+open(my $fh, ">", $inputAcc) or die "ERROR: could not write to $inputAcc: $!";
+print $fh join("\n", @acc)."\n";
+close $fh;
+
 # Accessions that had errors
 my @err;

-# Get the esearch xml in place for at least one downstream query
-my $esearchXml = "$dir/esearch.xml";
-my $esearchCmd = "esearch -db nuccore -query '$queryArg' > $esearchXml";
-command($esearchCmd);
+# Get started on the comprehensive assembly file
+my $outfile = "$dir/all.fasta";
+logmsg "Downloading all accessions to $outfile using input accessions in $inputAcc";
+command("efetch -db nuccore -input $inputAcc -format fasta > $dir/all.fasta");
 if($?){
-die "ERROR running: $esearchCmd: $!";
+die "ERROR: could not download all accessions";
 }

-# Get started on the assembly file
-my $outfile = "$dir/all.fasta";
-
-# Main query: efetch
-my $efetchCmd = "cat $esearchXml | efetch -format fasta > $outfile";
-system($efetchCmd);
-
 my $seqsWithVersion = readSeqs($outfile);
 my $seqs = {};
 while(my($acc, $seq) = each(%$seqsWithVersion)){
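The new download path can be mirrored on the command line: write one accession per line to a file, then hand that file to `efetch -input`, the flag the Perl code now uses in place of one long `[accession] OR ...` query string. The sketch below is a shell rendering under stated assumptions, not the author's code: the function names are hypothetical, the two accessions are arbitrary examples, and `fetch_fasta` needs EDirect on `PATH` plus network access, so its call is left commented out.

```shell
#!/usr/bin/env bash
# Sketch only: function names and example accessions are placeholders.

# Write one accession per line, as downloadEntries() does for input.acc.
write_accessions() {
  local outfile=$1
  shift
  printf '%s\n' "$@" > "$outfile"
}

# Fetch every listed record as FASTA in a single call.
# Requires EDirect (efetch) and network access.
fetch_fasta() {
  local accfile=$1 fasta=$2
  if ! efetch -db nuccore -input "$accfile" -format fasta > "$fasta"; then
    echo "ERROR: could not download all accessions" >&2
    return 1
  fi
}

workdir=$(mktemp -d)
write_accessions "$workdir/input.acc" NC_000913.3 NC_003197.2

# Uncomment to run the download:
# fetch_fasta "$workdir/input.acc" "$workdir/all.fasta"
```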
5 changes: 5 additions & 0 deletions bin/filterTaxonomy.sh
@@ -2,6 +2,11 @@

 set -eu

+# Check for dependencies
+echo "Check for dependencies"
+which taxonkit
+echo
+
 thisdir=$(dirname $0)
 thisfile=$(basename $0)
 KALAMARI_VER=$(downloadKalamari.pl --version)
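Because the script runs under `set -eu`, the bare `which taxonkit` already stops execution when the tool is missing, but it does not say why. One possible variant, sketched here with a hypothetical `require_cmd` helper that is not part of the repository, names the missing dependency in the error message:

```shell
#!/usr/bin/env bash
# Hypothetical helper; not part of Kalamari.

require_cmd() {
  local cmd=$1
  # command -v is the POSIX way to ask whether a command is available.
  if ! command -v "$cmd" > /dev/null 2>&1; then
    echo "ERROR: required dependency '$cmd' was not found in PATH" >&2
    return 1
  fi
}

# Demo with a command that is certainly present.
require_cmd bash && echo "bash found"
```

In a script like this one, the call would presumably read `require_cmd taxonkit`, and the non-zero return would end the run under `set -e`.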