Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: Ont downsampling normalization #489

Merged
merged 62 commits into from
Mar 15, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
62 commits
Select commit Hold shift + click to select a range
a951fcb
wrote new script map_bin_amp_sampler.py for 'normalizing' reducing/co…
simakro Feb 15, 2022
b9bafe8
added write out and convert of picked reads from source fastq to out-…
simakro Feb 15, 2022
420e3a7
change in- and output file-extensions for all rules downstream of cap…
simakro Feb 15, 2022
d853977
fixed several snakemake related bugs
simakro Feb 15, 2022
c303aa8
multiple restorations to repair broken workflow
simakro Feb 16, 2022
e77580c
established custom script for trimming of primers work-title: map-trim
simakro Feb 17, 2022
35ca8ad
improved coverage-capping amp_covcap_sampler.py and amplicon primer-t…
simakro Feb 18, 2022
6412c6f
added improved logic for clipping of '-' strand aligning reads
simakro Feb 18, 2022
85f5a58
use only one map-trim to remove all, reinstated porechop; multiple ad…
simakro Feb 19, 2022
4372ab6
remove map-trim in order to reorganize
simakro Feb 25, 2022
b62d62e
perform all clipping and downsampling with notramp
simakro Feb 27, 2022
931f4d6
multiple renamings and adaptions
simakro Feb 28, 2022
e17b32f
cleanup
simakro Feb 28, 2022
1b2940b
adjusted output canu_correct
simakro Feb 28, 2022
e849ab6
Merge branch 'master' into ont_downsampling_normalization
simakro Feb 28, 2022
75bf4fb
linting
simakro Feb 28, 2022
80f4a62
Merge branch 'ont_downsampling_normalization' of https://github.com/I…
simakro Feb 28, 2022
b29b3b3
get_artic_primer
simakro Feb 28, 2022
6d92eae
Merge branch 'master' into ont_downsampling_normalization
thomasbtf Mar 1, 2022
630906b
Merge branch 'master' into ont_downsampling_normalization
alethomas Mar 1, 2022
c8d18c1
increased notramp version; reroute stdout+stderr to log in canu_correct
simakro Mar 2, 2022
c5493df
Merge branch 'ont_downsampling_normalization' of https://github.com/I…
simakro Mar 2, 2022
8c9165d
restored get_artic_primers to get the bed file
simakro Mar 2, 2022
4651b2e
get_output_dir comments
simakro Mar 2, 2022
82ef679
restore sample-table
simakro Mar 2, 2022
c7dc646
envs version nums
simakro Mar 2, 2022
3b1f372
envs conda-forge
simakro Mar 2, 2022
447a396
resolving comments
simakro Mar 2, 2022
c236336
updated primer paths
thomasbtf Mar 2, 2022
d91ac22
rmv amp_covcap_sampler.py
thomasbtf Mar 2, 2022
a7f12cd
Merge branch 'master' into ont_downsampling_normalization
thomasbtf Mar 3, 2022
1bd3e1d
canu_correct: moved corr_gz from output to parameters
simakro Mar 7, 2022
4554f0d
Merge branch 'ont_downsampling_normalization' of https://github.com/I…
simakro Mar 7, 2022
21aa92c
Merge branch 'master' into ont_downsampling_normalization
simakro Mar 7, 2022
2ea26d7
nanofilt accept de/compressed
simakro Mar 7, 2022
490826a
Merge branch 'ont_downsampling_normalization' of https://github.com/I…
simakro Mar 7, 2022
b0ae175
change gzip to compressed
simakro Mar 7, 2022
45a13de
with corr_gz in output
simakro Mar 7, 2022
78a5c4f
take out bcf rules
simakro Mar 7, 2022
5bae64f
orf
simakro Mar 7, 2022
6b86225
echo test
simakro Mar 7, 2022
1cd7d0e
remove parenthesis
simakro Mar 7, 2022
388befb
added gzip to nanofilt env and replace gunzip with gzip in rule nanofilt
simakro Mar 7, 2022
b5c4443
add file to nanofilt env
simakro Mar 7, 2022
f9b1463
set corMemory for testing to 6GB
simakro Mar 7, 2022
f76aae4
corona-spades in long_read.smk
simakro Mar 8, 2022
2118d24
solved all issues with wrong input-files for ont de-novo assembly
simakro Mar 9, 2022
18263ff
cleanup and temp-notes
simakro Mar 9, 2022
505c9f1
restored sample-table, removed echos, removed hard-coded path
simakro Mar 9, 2022
2010f28
resolve log issue
simakro Mar 11, 2022
a441053
notramp env
simakro Mar 11, 2022
b958a7a
undo changes in test config
thomasbtf Mar 14, 2022
41ae1ee
undo changes in test primer
thomasbtf Mar 14, 2022
50f76c6
add missing version in nanofilt.yaml
thomasbtf Mar 14, 2022
e33bca1
add remvoed temp flags back in
thomasbtf Mar 14, 2022
cdfe626
rmv whitespace
thomasbtf Mar 14, 2022
b1712f0
update env file
thomasbtf Mar 14, 2022
2c885c1
add canu logging
thomasbtf Mar 14, 2022
173747a
update config
thomasbtf Mar 14, 2022
da021a2
add temp
thomasbtf Mar 14, 2022
0a2b5fc
rmv hardcoded date path
thomasbtf Mar 14, 2022
d4965fd
add logs
thomasbtf Mar 15, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1,238 changes: 0 additions & 1,238 deletions resources/ARTIC_v3_adapters.py

This file was deleted.

3 changes: 2 additions & 1 deletion workflow/envs/canu.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -2,5 +2,6 @@ channels:
- bioconda
- conda-forge
dependencies:
- canu =2.2
# do not use canu 2.2 !
simakro marked this conversation as resolved.
Show resolved Hide resolved
- canu =2.1.1
- minimap2 =2.22
2 changes: 2 additions & 0 deletions workflow/envs/nanofilt.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -3,3 +3,5 @@ channels:
- conda-forge
dependencies:
- nanofilt =2.8
- gzip =1.11
- file =5.39
5 changes: 3 additions & 2 deletions workflow/envs/primechop.yaml → workflow/envs/notramp.yaml
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
channels:
- simakro
simakro marked this conversation as resolved.
Show resolved Hide resolved
- bioconda
simakro marked this conversation as resolved.
Show resolved Hide resolved
- conda-forge
dependencies:
- porechop =0.2
- python =3.9
- notramp =1.0.5
- minimap2 =2.22
2 changes: 1 addition & 1 deletion workflow/envs/porechop.yaml → workflow/envs/seqtk.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -2,4 +2,4 @@ channels:
- bioconda
- conda-forge
dependencies:
- porechop =0.2
- seqtk =1.3
8 changes: 3 additions & 5 deletions workflow/rules/assembly.smk
Original file line number Diff line number Diff line change
Expand Up @@ -179,11 +179,11 @@ rule assembly_polishing_illumina:
# polish ont de novo assembly
rule assembly_polishing_ont:
input:
fasta="results/{date}/corrected/{sample}/{sample}.correctedReads.fasta.gz",
fasta="results/{date}/norm_trim_raw_reads/{sample}/{sample}.cap.clip.fasta",
reference="results/{date}/contigs/ordered/{sample}.fasta",
output:
report(
"results/{date}/polishing/medaka/{sample}/{sample}.fasta",
"results/{date}/polishing/medaka/{sample}/consensus.fasta",
category="4. Sequences",
subcategory="1. De Novo Assembled Sequences",
caption="../report/assembly_ont.rst",
Expand All @@ -197,9 +197,7 @@ rule assembly_polishing_ont:
"../envs/medaka.yaml"
threads: 4
shell:
"(medaka_consensus -v -f -i {input.fasta} -o {params.outdir} -d {input.reference} -t {threads} &&"
" mv {params.outdir}/consensus.fasta {output}) "
" > {log} 2>&1"
"medaka_consensus -f -i {input.fasta} -o {params.outdir} -d {input.reference} -t {threads} > {log} 2>&1"


rule aggregate_polished_de_novo_sequences:
Expand Down
23 changes: 7 additions & 16 deletions workflow/rules/common.smk
Original file line number Diff line number Diff line change
Expand Up @@ -407,7 +407,7 @@ def get_reads(wildcards):
)

ont_pattern = expand(
"results/{date}/corrected/{sample}/{sample}.correctedReads.fasta.gz",
"results/{date}/corrected/{sample}/{sample}.correctedReads.clip.fasta",
**wildcards,
)

Expand Down Expand Up @@ -458,7 +458,7 @@ def get_reads_after_qc(wildcards, read="both"):
**wildcards,
)
ont_pattern = expand(
"results/{date}/nonhuman-reads/se/{sample}.fastq.gz", **wildcards
"results/{date}/nonhuman-reads/se/{sample}.fastq", **wildcards
)
ion_torrent_pattern = expand(
"results/{date}/read-clipping/fastq/se/{sample}.fastq", **wildcards
Expand Down Expand Up @@ -1297,15 +1297,6 @@ def get_include_flag_for_date(wildcards):
return [get_include_flag(sample) for sample in allsamplelist]


def get_artic_primer(wildcards):
# TODO add more _adapters.py (not preferred) or
# add a script to generate them from a link to a bed file.
# The bed file can be found in the artic repo. Related to #356
return "resources/ARTIC_v{}_adapters.py".format(
config["preprocessing"]["artic-primer-version"]
)


def get_trimmed_reads(wildcards):
"""Returns paths of files of the trimmed reads for parsing by kraken."""
return get_list_of_expanded_patters_by_technology(
Expand All @@ -1314,7 +1305,7 @@ def get_trimmed_reads(wildcards):
"results/{{{{date}}}}/trimmed/fastp-pe/{{sample}}.{read}.fastq.gz",
read=[1, 2],
),
ont_pattern="results/{{date}}/trimmed/porechop/adapter_barcode_trimming/{sample}.fastq.gz",
ont_pattern="results/{{date}}/corrected/{sample}/{sample}.correctedReads.fasta",
ion_torrent_pattern="results/{{date}}/trimmed/fastp-se/{sample}.fastq.gz",
)

Expand Down Expand Up @@ -1407,19 +1398,19 @@ def get_reads_by_stage(wildcards):
if wildcards.stage == "raw":
return get_fastqs(wildcards)
elif wildcards.stage == "trimmed":
return "results/{date}/trimmed/porechop/adapter_barcode_trimming/{sample}.fastq"
return "results/{date}/corrected/{sample}/{sample}.correctedReads.clip.fasta"
elif wildcards.stage == "clipped":
return "results/{date}/trimmed/porechop/primer_clipped/{sample}.fastq"
return "results/{date}/norm_trim_raw_reads/{sample}/{sample}.cap.clip.fasta"
elif wildcards.stage == "filtered":
return "results/{date}/trimmed/nanofilt/{sample}.fastq"
return "results/{date}/trimmed/nanofilt/{sample}.fasta"


def get_polished_sequence(wildcards):
"""Returns path to polished sequences, depend on sequencing technology."""
return get_pattern_by_technology(
wildcards,
illumina_pattern="results/{date}/polishing/bcftools-illumina/{sample}.fasta",
ont_pattern="results/{date}/polishing/medaka/{sample}/{sample}.fasta",
ont_pattern="results/{date}/consensus/medaka/{sample}/consensus.fasta",
ion_torrent_pattern="results/{date}/polishing/bcftools-illumina/{sample}.fasta",
)

Expand Down
140 changes: 65 additions & 75 deletions workflow/rules/long_read.smk
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Copyright 2022 Thomas Battenfeld, Alexander Thomas, Johannes Köster.
# Copyright 2022 Thomas Battenfeld, Alexander Thomas, Johannes Köster, Simon Magin.
# Licensed under the BSD 2-Clause License (https://opensource.org/licenses/BSD-2-Clause)
# This file may not be copied, modified, or distributed
# except according to those terms.
Expand Down Expand Up @@ -32,123 +32,113 @@ rule count_fastq_reads:
"echo $(( $(cat {input} | wc -l ) / 4)) > {output} 2> {log}"


# Intermediate number of threads (4-8) achieve best speedup of a+btrimming.
# For large files 8 threads help accelerate some, small files are processed faster with 4 threads.
rule porechop_adapter_barcode_trimming:
rule nanofilt:
input:
get_fastqs,
output:
temp("results/{date}/trimmed/porechop/adapter_barcode_trimming/{sample}.fastq"),
conda:
"../envs/porechop.yaml"
temp("results/{date}/filtered/nanofilt/{sample}.fastq"),
log:
"logs/{date}/trimmed/porechop/adapter_barcode_trimming/{sample}.log",
threads: 8
shell:
"porechop -i {input} -o {output} -t {threads} -v 1 > {log} 2>&1"


rule customize_primer_porechop:
input:
get_artic_primer,
output:
temp("results/.indicators/replacement_notice.txt"),
"logs/{date}/nanofilt/{sample}.log",
params:
min_length=config["quality-criteria"]["ont"]["min-length-reads"],
min_PHRED=config["quality-criteria"]["ont"]["min-PHRED"],
conda:
"../envs/primechop.yaml"
log:
"logs/customize_primer_porechop.log",
"../envs/nanofilt.yaml"
shell:
"(cp {input} $CONDA_PREFIX/lib/python3.9/site-packages/porechop/adapters.py && "
'echo "replaced adpaters in adapter.py file in $CONDA_PREFIX/lib/python3.9/site-packages/porechop/adapters.py with ARTICv3 primers" > {output}) '
"2> {log}"
"""
if file {input} | grep -q compressed ; then
gzip -d -c {input} | NanoFilt --length {params.min_length} --quality {params.min_PHRED} --maxlength 700 > {output} 2> {log}
else
NanoFilt --length {params.min_length} --quality {params.min_PHRED} --maxlength 700 {input} > {output} 2> {log}
fi
"""


# Using a low number of threads (2-4) speed up primer-trimming significantly (>2x), even for large files,
# presumably due to the much higher number of target-sequences for trimming as compared
# to barcode+adapter-trimming. However, using only one thread is again very slow.
rule porechop_primer_trimming:
rule downsample_and_trim_raw:
input:
fastq_in=(
"results/{date}/trimmed/porechop/adapter_barcode_trimming/{sample}.fastq"
),
repl_flag="results/.indicators/replacement_notice.txt",
primer=config["preprocessing"]["amplicon-primers"],
reads="results/{date}/filtered/nanofilt/{sample}.fastq",
ref_genome="resources/genomes/main.fasta",
output:
temp("results/{date}/trimmed/porechop/primer_clipped/{sample}.fastq"),
conda:
"../envs/primechop.yaml"
"results/{date}/norm_trim_raw_reads/{sample}/{sample}.cap.fasta",
"results/{date}/norm_trim_raw_reads/{sample}/{sample}.cap.clip.fasta",
thomasbtf marked this conversation as resolved.
Show resolved Hide resolved
params:
outdir=get_output_dir,
log:
"logs/{date}/trimmed/porechop/primer_clipped/{sample}.log",
threads: 2
"results/{date}/norm_trim_raw_reads/{sample}/notramp.log",
conda:
"../envs/notramp.yaml"
shell:
"(porechop -i {input.fastq_in} -o {output} --no_split --end_size 35 --extra_end_trim 0 -t {threads} -v 1) > {log} 2>&1"
"notramp -a -r {input.reads} -p {input.primer} -g {input.ref_genome} -o {params.outdir} 2> {log}"


rule nanofilt:
use rule assembly_polishing_ont as medaka_consensus_reference with:
input:
"results/{date}/trimmed/porechop/primer_clipped/{sample}.fastq",
# Don´t ever use corrected reads as input for medaka, it is supposed to polish with raw-reads!
simakro marked this conversation as resolved.
Show resolved Hide resolved
fasta="results/{date}/norm_trim_raw_reads/{sample}/{sample}.cap.clip.fasta",
reference="resources/genomes/main.fasta",
output:
temp("results/{date}/trimmed/nanofilt/{sample}.fastq"),
log:
"logs/{date}/nanofilt/{sample}.log",
params:
min_length=config["quality-criteria"]["ont"]["min-length-reads"],
min_PHRED=config["quality-criteria"]["ont"]["min-PHRED"],
conda:
"../envs/nanofilt.yaml"
shell:
"NanoFilt --length {params.min_length} --quality {params.min_PHRED} --maxlength 500 {input} > {output} 2> {log}"
"results/{date}/consensus/medaka/{sample}/consensus.fasta",


rule canu_correct:
input:
"results/{date}/trimmed/nanofilt/{sample}.fastq",
"results/{date}/norm_trim_raw_reads/{sample}/{sample}.cap.fasta",
simakro marked this conversation as resolved.
Show resolved Hide resolved
output:
"results/{date}/corrected/{sample}/{sample}.correctedReads.fasta.gz",
temp(directory("results/{date}/corrected/{sample}/correction")),
temp(directory("results/{date}/corrected/{sample}/{sample}.seqStore")),
corr_gz="results/{date}/corrected/{sample}/{sample}.correctedReads.fasta.gz",
corr_fa="results/{date}/corrected/{sample}/{sample}.correctedReads.fasta",
log:
"logs/{date}/canu/assemble/{sample}.log",
"logs/{date}/canu/correct/{sample}.log",
params:
outdir=get_output_dir,
corr_gz="results/{date}/corrected/{sample}/{sample}.correctedReads.fasta.gz",
concurrency=lambda w, threads: get_canu_concurrency(threads),
min_length=config["quality-criteria"]["ont"]["min-length-reads"],
for_testing=lambda w, threads: get_if_testing(
f"corThreads={threads} redMemory=6 oeaMemory=6"
f"corThreads={threads} corMemory=6 redMemory=6 oeaMemory=6"
),
conda:
"../envs/canu.yaml"
threads: 16
shell:
"( if [ -d {params.outdir} ]; then rm -Rf {params.outdir}; fi &&"
" canu -correct -nanopore {input} -p {wildcards.sample} -d {params.outdir} genomeSize=30k minOverlapLength=10 minReadLength=200"
" useGrid=false {params.for_testing}"
" corMMapMerSize=10 corOutCoverage=50000 corMinCoverage=0 maxInputCoverage=20000"
" corOverlapper=minimap utgOverlapper=minimap obtOverlapper=minimap"
" corConcurrency={params.concurrency}"
" cormhapConcurrency={params.concurrency} cormhapThreads={params.concurrency}"
" cormmapConcurrency={params.concurrency} cormmapThreads={params.concurrency}"
" obtmmapConcurrency={params.concurrency} obtmmapThreads={params.concurrency}"
" utgmmapConcurrency={params.concurrency} utgmmapThreads={params.concurrency}"
" redConcurrency={params.concurrency} redThreads={params.concurrency}"
" ovbConcurrency={params.concurrency}"
" ovsConcurrency={params.concurrency}"
" oeaConcurrency={params.concurrency})"
"( if [ -d {params.outdir} ]; then rm -Rf {params.outdir}; fi && "
"canu -correct -nanopore {input} -p {wildcards.sample} -d {params.outdir} genomeSize=30k "
"minOverlapLength=10 minReadLength=200 useGrid=false {params.for_testing} "
"corMMapMerSize=10 corOutCoverage=50000 corMinCoverage=0 maxInputCoverage=20000 "
"corOverlapper=minimap utgOverlapper=minimap obtOverlapper=minimap "
"corConcurrency={params.concurrency} ovbConcurrency={params.concurrency} "
"cormhapConcurrency={params.concurrency} cormhapThreads={params.concurrency} "
"cormmapConcurrency={params.concurrency} cormmapThreads={params.concurrency} "
"obtmmapConcurrency={params.concurrency} obtmmapThreads={params.concurrency} "
"utgmmapConcurrency={params.concurrency} utgmmapThreads={params.concurrency} "
"redConcurrency={params.concurrency} redThreads={params.concurrency} "
"ovsConcurrency={params.concurrency} oeaConcurrency={params.concurrency} && "
"gzip -d {params.corr_gz} --keep)"
"> {log} 2>&1"


# rule medaka_consensus_reference:
use rule assembly_polishing_ont as medaka_consensus_reference with:
rule clip_adbc_corrected:
simakro marked this conversation as resolved.
Show resolved Hide resolved
input:
fasta="results/{date}/corrected/{sample}/{sample}.correctedReads.fasta.gz",
reference="resources/genomes/main.fasta",
primer=config["preprocessing"]["amplicon-primers"],
reads="results/{date}/corrected/{sample}/{sample}.correctedReads.fasta",
ref_genome="resources/genomes/main.fasta",
output:
temp("results/{date}/consensus/medaka/{sample}/{sample}.fasta"),
"results/{date}/corrected/{sample}/{sample}.correctedReads.clip.fasta",
params:
outdir=get_output_dir,
log:
"results/{date}/corrected/{sample}/notramp.log",
conda:
"../envs/notramp.yaml"
shell:
"notramp -t --incl_prim -r {input.reads} -p {input.primer} -g {input.ref_genome} -o {params.outdir} 2> {log}"


# polish consensus
rule bcftools_consensus_ont:
input:
fasta="results/{date}/consensus/medaka/{sample}/{sample}.fasta",
fasta="results/{date}/consensus/medaka/{sample}/consensus.fasta",
simakro marked this conversation as resolved.
Show resolved Hide resolved
bcf="results/{date}/filtered-calls/ref~{sample}/{sample}.subclonal.high+moderate-impact.orf.bcf", # clonal vs. subclonal?
bcfidx="results/{date}/filtered-calls/ref~{sample}/{sample}.subclonal.high+moderate-impact.orf.bcf.csi",
output:
Expand Down
8 changes: 3 additions & 5 deletions workflow/rules/qc.smk
Original file line number Diff line number Diff line change
Expand Up @@ -153,9 +153,7 @@ rule species_diversity_before_se:
kraken_output=temp(
"results/{date}/species-diversity/se/{sample}/{sample}.kraken"
),
report=(
"results/{date}/species-diversity/se/{sample}/{sample}.uncleaned.kreport2"
),
report="results/{date}/species-diversity/se/{sample}/{sample}.uncleaned.kreport2",
log:
"logs/{date}/kraken/se/{sample}.log",
threads: 8
Expand Down Expand Up @@ -236,7 +234,7 @@ rule order_nonhuman_reads_se:
input:
"results/{date}/mapped/ref~main+human/nonhuman/{sample}.bam",
output:
fq=temp("results/{date}/nonhuman-reads/se/{sample}.fastq.gz"),
fq=temp("results/{date}/nonhuman-reads/se/{sample}.fastq"),
bam_sorted=temp("results/{date}/nonhuman-reads/{sample}.sorted.bam"),
log:
"logs/{date}/order_nonhuman_reads/se/{sample}.log",
Expand All @@ -246,7 +244,7 @@ rule order_nonhuman_reads_se:
shell:
"(samtools sort -@ {threads} -n {input} -o {output.bam_sorted}; "
" samtools fastq -@ {threads} -0 {output.fq} {output.bam_sorted})"
" > {log} 2>&1"
"> {log} 2>&1"


# analysis of species diversity present AFTER removing human contamination
Expand Down