Update rosella to v0.5.0 #168

Merged
merged 17 commits into from
Nov 20, 2023
14 changes: 14 additions & 0 deletions .githooks/commit-msg
@@ -0,0 +1,14 @@
#!/usr/bin/env bash

commit_re="^(build|chore|ci|docs|feat|fix|perf|refactor|revert|style|test)(\([a-zA-Z 0-9 \-_]+\))?!?: .+$"
commit_message=$(cat "$1")

if [[ "$commit_message" =~ $commit_re ]]; then
    exit 0
fi

echo "The commit message does not meet the Conventional Commit standard."
echo "An example of a valid message is: "
echo " feat(login): add the 'remember me' button"
echo "Details: https://www.conventionalcommits.org/en/v1.0.0/#summary"
exit 1
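Reviewer note: the hook's pattern can be exercised outside git with a small script. The sketch below is a Python transcription of the same pattern, not part of this PR; the function name `is_conventional` and the sample messages are made up for illustration. It is only an approximation of the hook: Python's `re` reads `\-` inside a character class as a literal hyphen, whereas bash's `=~` uses POSIX extended regular expressions, in which a backslash inside a bracket expression is not an escape character, so scopes containing a hyphen are worth checking against the hook itself.

```python
import re

# Python transcription of the hook's pattern (an approximation of the bash
# behaviour, see the note above).
COMMIT_RE = re.compile(
    r"^(build|chore|ci|docs|feat|fix|perf|refactor|revert|style|test)"
    r"(\([a-zA-Z 0-9 \-_]+\))?!?: .+$",
    re.DOTALL,  # let ".+" span the body lines of a multi-line message
)


def is_conventional(message: str) -> bool:
    """Return True if the message starts with a Conventional Commit header."""
    return COMMIT_RE.match(message) is not None


if __name__ == "__main__":
    # Illustrative messages only.
    for msg in (
        "feat(login): add the 'remember me' button",
        "fix!: drop support for old configs",
        "Update rosella to v0.5.0",
    ):
        print(f"{is_conventional(msg)!s:5}  {msg}")
```

To enable the hook itself locally, git needs to be pointed at the directory, for example with `git config core.hooksPath .githooks`.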
24 changes: 24 additions & 0 deletions CITATION.cff
@@ -0,0 +1,24 @@
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Newell
    given-names: Rhys J. P.
    orcid: https://orcid.org/0000-0002-1300-6116
  - family-names: Aroney
    given-names: Samuel T. N.
    orcid: https://orcid.org/0000-0001-9806-5846
  - family-names: Zaugg
    given-names: Julian
    orcid: https://orcid.org/0000-0002-4919-1448
  - family-names: Sternes
    given-names: Peter
    orcid: https://orcid.org/0000-0002-4456-150X
  - family-names: Tyson
    given-names: Gene W.
    orcid: https://orcid.org/0000-0001-8559-9427
  - family-names: Woodcroft
    given-names: Ben J.
    orcid: https://orcid.org/0000-0003-0670-7480
title: "Aviary: Hybrid assembly and genome recovery from metagenomes with Aviary"
version: 0.8.2
date-released: 2023-11-05
6 changes: 3 additions & 3 deletions README.md
@@ -25,7 +25,7 @@ channels:
- defaults
```

#### Option 1) Install from Bioconda
#### Option 1: Install from Bioconda

Conda can handle the creation of the environment for you directly:

@@ -38,7 +38,7 @@ Or install into existing environment:
conda install -c bioconda aviary
```

#### Option 2) Install from pip
#### Option 2: Install from pip

Create the environment using the `aviary.yml` file then install from pip:
```
@@ -47,7 +47,7 @@ conda activate aviary
pip install aviary-genome
```

#### Option 3) Install from source
#### Option 3: Install from source

Initial requirements for aviary can be downloaded using the `aviary.yml`:
```
Expand Down
10 changes: 5 additions & 5 deletions aviary.yml
@@ -3,14 +3,14 @@ channels:
- bioconda
- anaconda
dependencies:
- python>=3.8
- snakemake>=7.0.0,<=7.32.3
- ruamel.yaml>=0.15.99 # needs to be explicit
- python >=3.8,<=3.11
- snakemake >=7.0.0,<=7.32.3
- ruamel.yaml >=0.15.99 # needs to be explicit
- numpy
- pandas
- biopython
- mamba>=0.8.2
- pigz = 2.6
- mamba >=0.8.2
- pigz =2.6
- parallel
- bbmap
- extern # for tests
10 changes: 9 additions & 1 deletion aviary/aviary.py
@@ -580,12 +580,20 @@ def main():

    binning_group.add_argument(
        '--refinery-max-iterations', '--refinery_max_iterations',
        help='Maximum number of iterations for Rosella refinery. Set to 0 to skip refinery.',
        help='Maximum number of iterations for Rosella refinery. Set to 0 to skip refinery. Lower values will run faster but may result in lower quality MAGs.',
        dest='refinery_max_iterations',
        type=int,
        default=5
    )

    binning_group.add_argument(
        '--refinery-max-retries', '--refinery_max_retries',
        help='Maximum number of retries rosella uses to generate valid reclustering within a refinery iteration. Lower values will run faster but may result in lower quality MAGs.',
        dest='refinery_max_retries',
        type=int,
        default=3
    )

    binning_group.add_argument(
        '--skip-binners', '--skip_binners', '--skip_binner', '--skip-binner',
        help='Optional list of binning algorithms to skip. Can be any combination of: \n'
Expand Down
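Reviewer note: a minimal, stand-alone sketch of how the new option parses, assuming only the `argparse` arguments visible in this hunk. The `build_parser` helper and the `aviary-sketch` program name are hypothetical; the real parser in `aviary.py` defines many more options and help strings.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Stand-in parser holding only the two refinery options from this hunk.
    parser = argparse.ArgumentParser(prog="aviary-sketch")
    binning_group = parser.add_argument_group("binning")
    binning_group.add_argument(
        "--refinery-max-iterations", "--refinery_max_iterations",
        dest="refinery_max_iterations", type=int, default=5,
    )
    binning_group.add_argument(
        "--refinery-max-retries", "--refinery_max_retries",
        dest="refinery_max_retries", type=int, default=3,
    )
    return parser


if __name__ == "__main__":
    parser = build_parser()
    # No flags given: defaults from the hunk apply.
    print(parser.parse_args([]))
    # Either spelling (hyphen or underscore) writes to the same attribute.
    print(parser.parse_args(["--refinery_max_retries", "1"]))
```

Both spellings of the flag store into `refinery_max_retries`, which is the key the Snakemake rules later read from `config`.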
10 changes: 5 additions & 5 deletions aviary/envs/checkm2.yaml
@@ -6,13 +6,13 @@ dependencies:
- python>=3.6, <3.9
- scikit-learn=0.23.2
- h5py=2.10.0
- numpy=1.19.2
- numpy=1.21.6
- diamond=2.0.4
- tensorflow >= 2.1.0, <2.6.0
- lightgbm = 3.2.1
- pandas <= 1.4.0
- tensorflow >= 2.1.0, <=2.6
- lightgbm =3.2.1
- pandas >=1.4.0, <2.0
- scipy
- prodigal>=2.6.3
- prodigal >=2.6.3
- setuptools
- requests
- packaging
Expand Down
65 changes: 14 additions & 51 deletions aviary/modules/benchmarking/benchmarking.smk
@@ -15,8 +15,8 @@ rule rerun_rosella:
"benchmarks/rosella_rerun.benchmark.txt"
shell:
"rm -f data/rosella_bins/*.fna; rm -f data/rosella_bins/checkm.out; rm -rf data/rosella_bins/checkm/; "
"rosella bin -r {input.fasta} -i {input.coverage} -t {threads} -o data/rosella_bins "
"--min-contig-size {params.min_contig_size} --min-bin-size {params.min_bin_size} --n-neighbors 200 && "
"rosella recover -r {input.fasta} -C {input.coverage} -t {threads} -o data/rosella_bins "
"--min-contig-size {params.min_contig_size} --min-bin-size {params.min_bin_size} --n-neighbors 100 && "
"touch data/rosella_bins/rerun"


@@ -188,9 +188,11 @@ rule rosella_refine_benchmark_1:
output_folder = "data/rosella_refine_rosella/",
min_bin_size = config["min_bin_size"],
max_iterations = 1,
max_retries = config["refinery_max_retries"],
pplacer_threads = config["pplacer_threads"],
max_contamination = 10,
final_refining = False
final_refining = False,
bin_prefix = "rosella"
threads:
config["max_threads"]
conda:
@@ -213,9 +215,11 @@ rule rosella_refine_benchmark_2:
output_folder = "data/rosella_refine_metabat2/",
min_bin_size = config["min_bin_size"],
max_iterations = 1,
max_retries = config["refinery_max_retries"],
pplacer_threads = config["pplacer_threads"],
max_contamination = 10,
final_refining = False
final_refining = False,
bin_prefix = "metabat2"
threads:
config["max_threads"]
conda:
@@ -238,9 +242,11 @@ rule rosella_refine_benchmark_3:
output_folder = "data/rosella_refine_semibin/",
min_bin_size = config["min_bin_size"],
max_iterations = 1,
max_retries = config["refinery_max_retries"],
pplacer_threads = config["pplacer_threads"],
max_contamination = 10,
final_refining = False
final_refining = False,
bin_prefix = "semibin2"
threads:
config["max_threads"]
conda:
@@ -264,9 +270,11 @@ rule rosella_refine_benchmark_4:
output_folder = "data/rosella_refine_das_tool/",
min_bin_size = config["min_bin_size"],
max_iterations = 1,
max_retries = config["refinery_max_retries"],
pplacer_threads = config["pplacer_threads"],
max_contamination = 10,
final_refining = True
final_refining = True,
bin_prefix = "dastool"
threads:
config["max_threads"]
conda:
@@ -645,51 +653,6 @@ rule das_tool_no_refine:
touch data/das_tool_bins_no_refine/done
"""

# rule das_tool_no_refine_with_semibin:
# """
# Runs dasTool on the output of all binning algorithms. If a binner failed to produce bins then their output is ignored
# """
# input:
# fasta = config["fasta"],
# metabat2_done = "data/metabat_bins_2/done",
# semibin_done = "data/semibin_bins/done",
# rosella_done = "data/rosella_bins/done",
# concoct_done = "data/concoct_bins/done",
# maxbin_done = "data/maxbin2_bins/done",
# metabat_sspec = "data/metabat_bins_sspec/done",
# metabat_spec = "data/metabat_bins_spec/done",
# metabat_ssens = "data/metabat_bins_ssens/done",
# metabat_sense = "data/metabat_bins_sens/done",
# # rosella_done = "data/rosella_refined/done",
# vamb_done = "data/vamb_bins/done",
# output:
# das_tool_done = "data/das_tool_bins_wr_and_sb/done"
# threads:
# config["max_threads"]
# conda:
# "../binning/envs/das_tool.yaml"
# benchmark:
# "benchmarks/das_tool.benchmark.txt"
# shell:
# """
# Fasta_to_Scaffolds2Bin.sh -i data/metabat_bins_sspec -e fa > data/metabat_bins_sspec.tsv;
# Fasta_to_Scaffolds2Bin.sh -i data/metabat_bins_ssens -e fa > data/metabat_bins_ssens.tsv;
# Fasta_to_Scaffolds2Bin.sh -i data/metabat_bins_sens -e fa > data/metabat_bins_sens.tsv;
# Fasta_to_Scaffolds2Bin.sh -i data/metabat_bins_spec -e fa > data/metabat_bins_spec.tsv;
# Fasta_to_Scaffolds2Bin.sh -i data/concoct_bins -e fa > data/concoct_bins.tsv;
# Fasta_to_Scaffolds2Bin.sh -i data/maxbin2_bins -e fasta > data/maxbin_bins.tsv;
# Fasta_to_Scaffolds2Bin.sh -i data/vamb_bins/bins -e fna > data/vamb_bins.tsv;
# Fasta_to_Scaffolds2Bin.sh -i data/rosella_bins/ -e fna > data/rosella_bins.tsv;
# Fasta_to_Scaffolds2Bin.sh -i data/metabat_bins_2/ -e fa > data/metabat2_bins.tsv;
# Fasta_to_Scaffolds2Bin.sh -i data/semibin_bins/output_recluster_bins/ -e fa > data/semibin_bins.tsv;
# scaffold2bin_files=$(find data/*bins*.tsv -not -empty -exec ls {{}} \; | tr "\n" ',' | sed "s/,$//g");
# DAS_Tool --search_engine diamond --write_bin_evals 1 --write_bins 1 -t {threads} --score_threshold -42 \
# -i $scaffold2bin_files \
# -c {input.fasta} \
# -o data/das_tool_bins_wr_and_sb/das_tool && \
# touch data/das_tool_bins_wr_and_sb/done
# """

rule checkm_das_tool_no_refine:
input:
done = "data/das_tool_bins_no_refine/done"
Expand Down
30 changes: 19 additions & 11 deletions aviary/modules/binning/binning.smk
@@ -345,7 +345,7 @@ rule rosella:
min_contig_size = config["min_contig_size"],
min_bin_size = config["min_bin_size"]
output:
# kmers = "data/rosella_bins/rosella_kmer_table.tsv",
# kmers = "data/rosella_bins/kmer_frequencies.tsv",
done = "data/rosella_bins/done"
conda:
"envs/rosella.yaml"
@@ -360,8 +360,8 @@ rule rosella:
"benchmarks/rosella.benchmark.txt"
shell:
"rm -rf data/rosella_bins/; "
"rosella recover -r {input.fasta} -i {input.coverage} -t {threads} -o data/rosella_bins "
"--min-contig-size {params.min_contig_size} --min-bin-size {params.min_bin_size} --n-neighbors 200 > {log} 2>&1 && "
"rosella recover -r {input.fasta} -C {input.coverage} -t {threads} -o data/rosella_bins "
"--min-contig-size {params.min_contig_size} --min-bin-size {params.min_bin_size} --n-neighbors 100 > {log} 2>&1 && "
"touch {output.done} || touch {output.done}"


@@ -474,7 +474,7 @@ rule refine_rosella:
rosella = ancient('data/rosella_bins/done'),
coverage = ancient("data/coverm.cov"),
fasta = ancient(config["fasta"]),
# kmers = "data/rosella_bins/rosella_kmer_table.tsv"
# kmers = "data/rosella_bins/kmer_frequencies.tsv"
output:
'data/rosella_refined/done'
benchmark:
@@ -485,9 +485,11 @@ rule refine_rosella:
output_folder = "data/rosella_refined/",
min_bin_size = config["min_bin_size"],
max_iterations = config["refinery_max_iterations"],
max_retries = config["refinery_max_retries"],
pplacer_threads = lambda wildcards, threads: min(threads, config["pplacer_threads"]),
max_contamination = 15,
final_refining = False
final_refining = False,
bin_prefix = "rosella"
threads:
min(config["max_threads"], 16)
resources:
@@ -506,7 +508,7 @@ rule refine_metabat2:
rosella = ancient('data/metabat_bins_2/done'),
coverage = ancient("data/coverm.cov"),
fasta = ancient(config["fasta"]),
# kmers = "data/rosella_bins/rosella_kmer_table.tsv"
# kmers = "data/rosella_bins/kmer_frequencies.tsv"
output:
'data/metabat2_refined/done'
threads:
@@ -522,9 +524,11 @@ rule refine_metabat2:
output_folder = "data/metabat2_refined/",
min_bin_size = config["min_bin_size"],
max_iterations = config["refinery_max_iterations"],
max_retries = config["refinery_max_retries"],
pplacer_threads = lambda wildcards, threads: min(threads, config["pplacer_threads"]),
max_contamination = 15,
final_refining = False
final_refining = False,
bin_prefix = "metabat2"
log:
"logs/refine_metabat2.log"
conda:
@@ -538,7 +542,7 @@ rule refine_semibin:
rosella = ancient('data/semibin_bins/done'),
coverage = ancient("data/coverm.cov"),
fasta = ancient(config["fasta"]),
# kmers = "data/rosella_bins/rosella_kmer_table.tsv"
# kmers = "data/rosella_bins/kmer_frequencies.tsv"
threads:
min(config["max_threads"], 16)
resources:
@@ -554,9 +558,11 @@ rule refine_semibin:
output_folder = "data/semibin_refined/",
min_bin_size = config["min_bin_size"],
max_iterations = config["refinery_max_iterations"],
max_retries = config["refinery_max_retries"],
pplacer_threads = lambda wildcards, threads: min(threads, config["pplacer_threads"]),
max_contamination = 15,
final_refining = False
final_refining = False,
bin_prefix = "semibin2"
log:
"logs/refine_semibin.log"
conda:
@@ -656,7 +662,7 @@ rule refine_dastool:
das_tool = 'data/das_tool_bins_pre_refine/done',
coverage = ancient("data/coverm.cov"),
fasta = ancient(config["fasta"]),
# kmers = "data/rosella_bins/rosella_kmer_table.tsv"
# kmers = "data/rosella_bins/kmer_frequencies.tsv"
threads:
min(config["max_threads"], 16)
resources:
@@ -673,9 +679,11 @@ rule refine_dastool:
output_folder = "data/refined_bins/",
min_bin_size = config["min_bin_size"],
max_iterations = config["refinery_max_iterations"],
max_retries = config["refinery_max_retries"],
pplacer_threads = lambda wildcards, threads: min(threads, config["pplacer_threads"]),
max_contamination = 15,
final_refining = True
final_refining = True,
bin_prefix = "dastool"
log:
"logs/refine_dastool.log"
conda:
Expand Down
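Reviewer note: the `threads` and `params` expressions in the `refine_*` rules are plain Python, so how they resolve can be checked outside Snakemake. The sketch below copies those expressions from the hunks above; the `config` values are hypothetical stand-ins for what the workflow config would supply, and the direct call to the lambda only mimics how Snakemake passes `wildcards` and `threads` to callable params.

```python
# Hypothetical config values; in Aviary these come from the workflow config.
config = {"max_threads": 32, "pplacer_threads": 8, "refinery_max_retries": 3}

# Same expressions as in the refine_* rules.
rule_threads = min(config["max_threads"], 16)
pplacer_threads = lambda wildcards, threads: min(threads, config["pplacer_threads"])
max_retries = config["refinery_max_retries"]

if __name__ == "__main__":
    # The rule is capped at 16 threads, and pplacer never gets more threads
    # than the rule itself was given.
    print(rule_threads, pplacer_threads(None, rule_threads), max_retries)
```

A run started without `refinery_max_retries` in its config would raise a `KeyError` at this lookup, so the key needs to be written into the config by the same change that adds the command-line option.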
25 changes: 13 additions & 12 deletions aviary/modules/binning/envs/rosella.yaml
@@ -4,28 +4,29 @@ channels:
- bioconda
- defaults
dependencies:
- pip
- python >= 3.8, <= 3.10
- gcc
- cxx-compiler
- numba
- numpy<=1.21
- rosella==0.4.2
- checkm-genome==1.1.3
- flight-genome>=1.5.0
- joblib==1.1.0 # For https://github.com/scikit-learn-contrib/hdbscan/pull/563 which is used by flight. Can remove when hdbscan releases past 0.8.28
- scikit-bio>=0.5.7
- seaborn
- rosella >= 0.5.1
- numba >= 0.53, <= 0.57
- numpy <= 1.24
- joblib >= 1.1.0, <= 1.3
- scikit-bio >= 0.5.7
- umap-learn >= 0.5.3
- scipy <= 1.8.1
- scipy <= 1.11
- pandas >= 1.3
- pynndescent >= 0.5.7
- hdbscan >= 0.8.28
- scikit-learn >= 1.0.2, <= 1.1
- flight-genome >= 1.6.1
- coverm >= 0.6.1
- seaborn
- imageio
- matplotlib
- tqdm
- tbb
- joblib
- pebble
- scikit-learn==1.0.2
- threadpoolctl
- biopython
- biopython
- checkm-genome==1.1.3