feat(technology): add ion torrent processing (#383)
* add ont toolchain

* move medaka

* add primer clipping

* add env for primer clipping

* integrate reads in common.smk

* fmt

* add medaka_model to test config

* update sample sheet

* add check for benchmark data

* fmt

* add ci for ont

* change schema for new sample sheet

* linting

* add ARTIC_v3_adapters

* fmt

* add ARTIC primers to resources

* integrate ont pipeline

* fmt

* add missing wildcards

* agg is_sample_technology in is_technology

* change ont assembly from canu to spades se

* fmt

* make artic primers version adjustable

* adjust fastp, kraken and samtools depth

* rename rules

* adjust kallisto

* fmt

* fmt

* fix: kallisto_metrics log

* fmt

* fix: get_kraken_output date input

* ci: change to -o

* ci: add github ont reads

* config update

* adjust kallisto input

* fmt

* adjust assembler comparison for illumina only

* fmt

* fix: wildcard -> wildcards

* refactor: get_technology

* fmt

* fix: call function

* ci: rmv amplicon file

* add threads for canu

* add corThreads

* add maxMemory

* rmv restrictions

* add corThreads

* add threads

* add redMemory

* change to maxThreads and maxMemory

* Revert "change to maxThreads and maxMemory"

This reverts commit f053dcb.

* add oeaMemory

* add debug statement

* print kallisto metrics

* add missing space

* add print for debug

* add if

* add testing return

* fmt

* adjust test log date and names

* make canu params for testing

* fmt

* add lambda expression for testing

* fix typo

* fill empty rows in qc data sheet with "0"

* add nano qc

* fmt

* update spades env

* add vcf to medaka output

* add polishing with medaka on de novo assembly

* rmv print debug statements

* refactor masking script

* fmt

* fix masked sequence writer

* remove "manual" fasta parser

* fmt

* deal with empty rki filter

* update sample sheet generation

* extract ont read numbers for qc table

* fmt

* fix read counting

* fix spades assembler path due to version update

* fmt

* change to contigs.fasta

* use raw_contigs with pe spades

* add if statement

* change to wildcard

* change *

* add missing wildcards

* rmv canu correct folder

* remove too long string from "Other Variants" column

* add medaka variant calling

* fmt

* fix kraken

* change to trimmed and not corrected reads for polishing

* Revert "change to trimmed and not corrected reads for polishing"

This reverts commit 81b3030.

* add missing gz

* make rki-filter less error-prone regarding sample names

* fmt

* add human removal

* fix samtools in bamclipper

* add consensus

* add identity overview

* fmt

* fixes

* more fixes

* change path of indicators

* update report descriptions

* update more descs

* update scripts

* Change Pangolin Call to Lineage

* comments

* fix quast call

* fixes, comments

* fmt

* fix in masking script

* add longshot

* fmt

* change logging of porechop_primer_trimming

* add porechop debug

* update report generation

* cleanup rule all

* fmt

* remove debug

* add gzip keep flag

* add polishing of consensus

* fmt

* fmt

* fmt

* fmt

* touch ups

* fmt

* fmt

* rmv unnecessary code

* add ion torrent

* fmt

* refactor todos

* improvements

* fmt

* update selection functions

* add docstrings

* update samplesheet

* fmt

* renaming

* rmv patterns from get_reads_after_qc

* update testing

* fix technology matrix

* split all and benchmarks

* fmt

* fmt

* add data for compare_assemblers

* fix indent

* indent

* another indent

* rmv parentheses

* Always download test data

* update artifacts

* add amplicon tests

* add missing patterns

* fmt

* update assembler config

* update test config, main.yml

* add quotes

* add missing gz

* fix path

* add pe flags for assembler comparison

* rmv contigs output flag

* temp fix adapters

* add conda caching

Co-authored-by: simakro <[email protected]>
thomasbtf and simakro authored Dec 9, 2021
1 parent 855061b commit 288777c
Showing 21 changed files with 800 additions and 395 deletions.
162 changes: 143 additions & 19 deletions .github/workflows/main.yml
@@ -49,9 +49,124 @@ jobs:
snakefile: workflow/Snakefile
stagein: mamba install -n snakemake -c conda-forge peppy
args: "--lint"


Technology-Tests:
runs-on: ubuntu-latest
env:
GISAID_API_TOKEN: ${{ secrets.GISAID_API_TOKEN }}
needs:
- Formatting
- Linting
strategy:
matrix:
rule: [all, all -npr]
technology: [all, illumina, ont, ion]
seq_method: [shotgun, amplicon]
steps:
- uses: actions/checkout@v2

- name: Cache conda dependencies
uses: actions/cache@v2
with:
path: |
.tests/.snakemake/conda
key: technology-${{ runner.os }}-${{ matrix.rule }}-${{ matrix.technology }}-${{ matrix.seq_method }}-${{ hashFiles('*.tests/.snakemake/conda/*.yaml') }}

- name: Prepare test data for all technologies
if: steps.test-data.outputs.cache-hit != true && (startsWith(matrix.rule, 'all') && matrix.technology == 'all' || matrix.rule == 'compare_assemblers')
run: |
if [[ "${{ matrix.seq_method }}" = "shotgun" ]] ; then export AMPLICON=0; else export AMPLICON=1; fi
mkdir -p .tests/data
curl -L https://github.com/thomasbtf/small-kraken-db/raw/master/B.1.1.7.reads.1.fastq.gz > .tests/data/B117.1.fastq.gz
curl -L https://github.com/thomasbtf/small-kraken-db/raw/master/B.1.1.7.reads.1.fastq.gz > .tests/data/B117.2.fastq.gz
curl -L https://github.com/thomasbtf/small-kraken-db/raw/master/ont_reads.fastq.gz > .tests/data/ont_reads.fastq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR574/003/ERR5745913/ERR5745913.fastq.gz > .tests/data/ion_reads.fastq.gz
echo sample_name,fq1,fq2,date,is_amplicon_data,technology > .tests/config/pep/samples.csv
echo illumina-test,data/B117.1.fastq.gz,data/B117.2.fastq.gz,2022-01-01,$AMPLICON,illumina >> .tests/config/pep/samples.csv
echo ont-test,data/ont_reads.fastq.gz,,2022-01-01,$AMPLICON,ont >> .tests/config/pep/samples.csv
echo ion-test,data/ion_reads.fastq.gz,,2022-01-01,$AMPLICON,ion >> .tests/config/pep/samples.csv
- name: Prepare test data for Illumina
if: steps.test-data.outputs.cache-hit != true && (startsWith(matrix.rule, 'all') && matrix.technology == 'illumina' || matrix.rule == 'compare_assemblers')
run: |
if [[ "${{ matrix.seq_method }}" = "shotgun" ]] ; then export AMPLICON=0; else export AMPLICON=1; fi
mkdir -p .tests/data
curl -L https://github.com/thomasbtf/small-kraken-db/raw/master/B.1.1.7.reads.1.fastq.gz > .tests/data/B117.1.fastq.gz
curl -L https://github.com/thomasbtf/small-kraken-db/raw/master/B.1.1.7.reads.1.fastq.gz > .tests/data/B117.2.fastq.gz
echo sample_name,fq1,fq2,date,is_amplicon_data,technology > .tests/config/pep/samples.csv
echo illumina-test,data/B117.1.fastq.gz,data/B117.2.fastq.gz,2022-01-01,$AMPLICON,illumina >> .tests/config/pep/samples.csv
- name: Prepare test data for Oxford Nanopore
if: steps.test-data.outputs.cache-hit != true && (startsWith(matrix.rule, 'all') && matrix.technology == 'ont' || matrix.rule == 'compare_assemblers')
run: |
if [[ "${{ matrix.seq_method }}" = "shotgun" ]] ; then export AMPLICON=0; else export AMPLICON=1; fi
mkdir -p .tests/data
curl -L https://github.com/thomasbtf/small-kraken-db/raw/master/ont_reads.fastq.gz > .tests/data/ont_reads.fastq.gz
echo sample_name,fq1,date,is_amplicon_data,technology > .tests/config/pep/samples.csv
echo ont-test,data/ont_reads.fastq.gz,2022-01-01,$AMPLICON,ont >> .tests/config/pep/samples.csv
- name: Prepare test data for Ion Torrent
if: steps.test-data.outputs.cache-hit != true && (startsWith(matrix.rule, 'all') && matrix.technology == 'ion' || matrix.rule == 'compare_assemblers')
run: |
if [[ "${{ matrix.seq_method }}" = "shotgun" ]] ; then export AMPLICON=0; else export AMPLICON=1; fi
mkdir -p .tests/data
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR574/003/ERR5745913/ERR5745913.fastq.gz > .tests/data/ion_reads.fastq.gz
echo sample_name,fq1,date,is_amplicon_data,technology > .tests/config/pep/samples.csv
echo ion-test,data/ion_reads.fastq.gz,2022-01-01,$AMPLICON,ion >> .tests/config/pep/samples.csv
- name: Use smaller reference files for testing
if: steps.test-resources.outputs.cache-hit != true
run: |
# mkdir -p .tests/resources/minikraken-8GB
# curl -SL https://github.com/thomasbtf/small-kraken-db/raw/master/human_k2db.tar.gz | tar zxvf - -C .tests/resources/minikraken-8GB --strip 1
mkdir -p .tests/resources/genomes
curl -SL "https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=NC_000021.9&db=nuccore&report=fasta" | gzip -c > .tests/resources/genomes/human-genome.fna.gz
- name: Simulate GISAID download
run: |
mkdir -p .tests/results/benchmarking/tables
echo -e "resources/genomes/B.1.1.7.fasta\nresources/genomes/B.1.351.fasta" > .tests/results/benchmarking/tables/strain-genomes.txt
mkdir -p .tests/resources/genomes
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=MZ314997.1&rettype=fasta" | sed '$ d' > .tests/resources/genomes/B.1.1.7.fasta
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=MZ314998.1&rettype=fasta" | sed '$ d' > .tests/resources/genomes/B.1.351.fasta
- name: Test rule ${{ matrix.rule }} on ${{ matrix.technology }} ${{ matrix.seq_method }} data
uses: snakemake/[email protected]
with:
directory: .tests
snakefile: workflow/Snakefile
args: "--use-conda --show-failed-logs --cores 2 --resources ncbi_api_requests=1 --conda-cleanup-pkgs cache --conda-frontend mamba ${{ matrix.rule }}"

- name: Test report
uses: snakemake/[email protected]
if: startsWith(matrix.rule, 'all -npr') != true
with:
directory: .tests
snakefile: workflow/Snakefile
args: "${{ matrix.rule }} --report report.zip"

- name: Upload report
uses: actions/upload-artifact@v2
if: matrix.technology == 'all' && matrix.rule != 'all -npr'
with:
name: report-rule-${{ matrix.rule }}-${{ matrix.technology }}-${{ matrix.seq_method }}
path: .tests/report.zip

- name: Upload logs
uses: actions/upload-artifact@v2
if: matrix.technology == 'all' && matrix.rule != 'all -npr'
with:
name: log-rule-${{ matrix.rule }}-technology-${{ matrix.technology }}
path: .tests/logs/

- name: Change permissions for caching
run: sudo chmod -R 755 .tests/.snakemake/conda

- name: Print disk space
run: sudo df -h
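
The Technology-Tests matrix above crosses two rule invocations with four technology settings and two sequencing methods, and each data-preparation step writes a sample sheet for its technology. A minimal Python sketch of that expansion is given here for orientation. The helper `sample_sheet` and the `TEST_READS` table are illustrative names, not part of the workflow, and unlike the single-technology steps above the sketch always keeps the `fq2` column.

```python
from itertools import product

# Matrix values copied from the Technology-Tests job.
RULES = ["all", "all -npr"]
TECHNOLOGIES = ["all", "illumina", "ont", "ion"]
SEQ_METHODS = ["shotgun", "amplicon"]

# (sample_name, fq1, fq2) per technology, taken from the echo lines above.
TEST_READS = {
    "illumina": ("illumina-test", "data/B117.1.fastq.gz", "data/B117.2.fastq.gz"),
    "ont": ("ont-test", "data/ont_reads.fastq.gz", ""),
    "ion": ("ion-test", "data/ion_reads.fastq.gz", ""),
}


def sample_sheet(technology, seq_method, date="2022-01-01"):
    """Return the CSV lines for one matrix combination (hypothetical helper)."""
    # Mirrors the AMPLICON export: 0 for shotgun, 1 for anything else.
    amplicon = 0 if seq_method == "shotgun" else 1
    selected = list(TEST_READS) if technology == "all" else [technology]
    lines = ["sample_name,fq1,fq2,date,is_amplicon_data,technology"]
    for tech in selected:
        name, fq1, fq2 = TEST_READS[tech]
        lines.append(f"{name},{fq1},{fq2},{date},{amplicon},{tech}")
    return lines


# Every combination of rule, technology and sequencing method becomes one job.
jobs = list(product(RULES, TECHNOLOGIES, SEQ_METHODS))

if __name__ == "__main__":
    print(len(jobs), "jobs")
    print("\n".join(sample_sheet("all", "amplicon")))
```

Since `rule` only takes the values `all` and `all -npr` in this job, the `matrix.rule == 'compare_assemblers'` branch of the `if:` conditions cannot become true here.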

Testing:
Benchmarks-Tests:
runs-on: ubuntu-latest
env:
GISAID_API_TOKEN: ${{ secrets.GISAID_API_TOKEN }}
@@ -60,10 +175,18 @@ jobs:
- Linting
strategy:
matrix:
rule: [all, all -npr, benchmark_strain_calling, benchmark_assembly, benchmark_mixtures, benchmark_non_sars_cov_2, compare_assemblers, benchmark_reads]
rule: [benchmark_strain_calling, benchmark_assembly, benchmark_mixtures, benchmark_non_sars_cov_2, benchmark_reads, compare_assemblers]
steps:
- uses: actions/checkout@v2

- name: Cache conda dependencies
uses: actions/cache@v2
with:
path: |
.tests/.snakemake/conda
key: benchmarks-${{ runner.os }}-${{ matrix.rule }}-${{ matrix.technology }}-${{ matrix.seq_method }}-${{ hashFiles('*.tests/.snakemake/conda/*.yaml') }}


# TODO caches are currently completely misleading, as they lead to certain files becoming present on disk which might
# then hide failures that would otherwise be seen.
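
The cache keys above combine a fixed prefix, the runner OS, the matrix values and a hash over the conda environment files. The following Python sketch imitates that composition to show when a key changes. It is an approximation under stated assumptions: `files_digest` only resembles GitHub's `hashFiles` (the exact algorithm may differ), both function names are hypothetical, and matrix entries that a job does not define are assumed to render as empty strings, which would apply to `matrix.technology` and `matrix.seq_method` in the Benchmarks-Tests key.

```python
import hashlib


def files_digest(file_contents):
    """Hash each file, then hash the per-file digests in path order.

    file_contents maps a path to the file's bytes. An empty mapping yields
    an empty string, as hashFiles does when no file matches the pattern.
    """
    if not file_contents:
        return ""
    outer = hashlib.sha256()
    for path in sorted(file_contents):
        outer.update(hashlib.sha256(file_contents[path]).digest())
    return outer.hexdigest()


def cache_key(prefix, os_name, matrix, file_contents):
    """Join the key parts with hyphens; missing matrix entries become ''."""
    parts = [
        prefix,
        os_name,
        matrix.get("rule", ""),
        matrix.get("technology", ""),
        matrix.get("seq_method", ""),
        files_digest(file_contents),
    ]
    return "-".join(parts)


if __name__ == "__main__":
    envs = {"envs/samtools.yaml": b"dependencies:\n  - samtools =1.14\n"}
    print(cache_key("technology", "Linux",
                    {"rule": "all", "technology": "ont", "seq_method": "amplicon"}, envs))
    print(cache_key("benchmarks", "Linux", {"rule": "benchmark_reads"}, {}))
```

Under these assumptions, any change to an environment file produces a new key, while a pattern that matches no files leaves the hash part empty.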

@@ -145,14 +268,16 @@ jobs:
# ${{ runner.os }}-sars-cov-benchmark-dependencies-${{ steps.get-date.outputs.date }}-
# ${{ runner.os }}-sars-cov-benchmark-dependencies-

- name: Download test data
if: steps.test-data.outputs.cache-hit != true && (startsWith(matrix.rule, 'all') || matrix.rule == 'compare_assemblers')

- name: Prepare test data
if: steps.test-data.outputs.cache-hit != true
run: |
mkdir -p .tests/data
curl -L https://github.com/thomasbtf/small-kraken-db/raw/master/B.1.1.7.reads.1.fastq.gz > .tests/data/B117.1.fastq.gz
curl -L https://github.com/thomasbtf/small-kraken-db/raw/master/B.1.1.7.reads.1.fastq.gz > .tests/data/B117.2.fastq.gz
curl -L https://github.com/thomasbtf/small-kraken-db/raw/master/ont_reads.fastq.gz > .tests/data/ont_reads.fastq.gz
echo sample_name,fq1,fq2,date,is_amplicon_data,technology > .tests/config/pep/samples.csv
echo illumina-test,data/B117.1.fastq.gz,data/B117.2.fastq.gz,2022-01-01,0,illumina >> .tests/config/pep/samples.csv
- name: Use smaller reference files for testing
if: steps.test-resources.outputs.cache-hit != true
run: |
@@ -169,8 +294,7 @@ jobs:
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=MZ314997.1&rettype=fasta" | sed '$ d' > .tests/resources/genomes/B.1.1.7.fasta
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=MZ314998.1&rettype=fasta" | sed '$ d' > .tests/resources/genomes/B.1.351.fasta
- name: Test rule ${{ matrix.rule }}
- name: Test rule ${{ matrix.rule }}
uses: snakemake/[email protected]
with:
directory: .tests
@@ -185,16 +309,16 @@ jobs:
snakefile: workflow/Snakefile
args: "${{ matrix.rule }} --report report.zip"

- name: Upload report
uses: actions/upload-artifact@v2
with:
name: report-${{ matrix.rule }}
path: .tests/report.zip
# - name: Upload report
# uses: actions/upload-artifact@v2
# with:
# name: report-rule-${{ matrix.rule }}
# path: .tests/report.zip

- name: Upload logs
uses: actions/upload-artifact@v2
with:
name: log-${{ matrix.rule }}
name: log-rule-${{ matrix.rule }}
path: .tests/logs/

# - name: Unit test
@@ -226,7 +350,7 @@ jobs:
cat .tests/results/benchmarking/assembly/pseudoassembly.csv
if [[ $(tail -1 .tests/results/benchmarking/assembly/pseudoassembly.csv) < 0.95 ]]
then
echo "Pseudoassembly bechmarking failed. There is at least one assembly where the contigs do not cover 95% of the original sequence (see above)."
echo "Pseudoassembly benchmarking failed. There is at least one assembly where the contigs do not cover 95% of the original sequence (see above)."
exit 1
else
echo "Pseudoassembly was successful."
@@ -238,7 +362,7 @@ jobs:
cat .tests/results/benchmarking/assembly/assembly.csv
if [[ $(tail -1 .tests/results/benchmarking/assembly/assembly.csv) < 0.8 ]]
then
echo "Assembly bechmarking failed. There is at least one assembly where the contigs do not cover 80% of the original sequence (see above)."
echo "Assembly benchmarking failed. There is at least one assembly where the contigs do not cover 80% of the original sequence (see above)."
exit 1
else
echo "Assembly was successful."
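
In Bash, `<` inside `[[ ]]` compares strings lexicographically rather than numerically, so the two threshold checks above depend on the values being written in a uniform decimal format. A numeric version of the same check can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes, as the `tail -1` calls suggest, that the last line of each CSV holds a single number.

```python
def last_value_below(csv_text, threshold):
    """Return True if the last non-empty line, read as a float, is below threshold."""
    lines = [line.strip() for line in csv_text.splitlines() if line.strip()]
    return float(lines[-1]) < threshold


if __name__ == "__main__":
    # A value in scientific notation: numerically below 0.95 ...
    print(last_value_below("0.99\n1e-3\n", 0.95))
    # ... but not "below" when the two are compared as plain strings.
    print("1e-3" < "0.95")
```

The second print shows the kind of value for which a string comparison and a numeric comparison disagree.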
@@ -261,8 +385,8 @@ jobs:
echo "Workflow successfully identified samples as non-sars-cov-2 in all cases."
fi
- name: Print disk space
run: sudo df -h

- name: Change permissions for caching
run: sudo chmod -R 755 .tests/.snakemake/conda

- name: Print disk space
run: sudo df -h
29 changes: 20 additions & 9 deletions .tests/config/config.yaml
@@ -77,16 +77,27 @@ variant-calling:
high+moderate-impact: 'ANN["IMPACT"] in ["HIGH", "MODERATE"]'

assembly:
# minimum posterior probability for a clonal variant to be included in the generated pseudoassembly
illumina:
# assemblers used for shotgun sequencing on Illumina technology
shotgun: "megahit-std"
# assemblers used for amplicon sequencing on Illumina technology
amplicon: "metaspades"
oxford nanopore:
# assemblers used for shotgun sequencing on Oxford Nanopore technology
shotgun: "megahit-std"
# assemblers used for amplicon sequencing on Oxford Nanopore technology
amplicon: "spades"
# Medaka models are named to indicate i) the pore type, ii) the sequencing device (MinION or PromethION), iii) the basecaller variant, and iv) the basecaller version
# with the format: {pore}_{device}_{caller variant}_{caller version}
# See https://github.com/nanoporetech/medaka#models for more information.
medaka_model: r941_min_fast_g303
ion torrent:
# assemblers used for shotgun sequencing on Ion Torrent technology
shotgun: "megahit-std"
# assemblers used for amplicon sequencing on Ion Torrent technology
amplicon: "spades"
# minimum posterior probability for a clonal variant to be included in the generated pseudo-assembly
min-variant-prob: 0.95
# assemblers used for shotgun sequencing for Illumina data
shotgun: "megahit-std"
# assemblers used for amplicon sequencing for Illumina data
amplicon: "metaspades"
# Medaka models are named to indicate i) the pore type, ii) the sequencing device (MinION or PromethION), iii) the basecaller variant, and iv) the basecaller version
# with the format: {pore}_{device}_{caller variant}_{caller version}
# See https://github.com/nanoporetech/medaka#models for more information.
medaka_model: r941_min_fast_g303


strain-calling:
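
The config comment above describes Medaka model names as `{pore}_{device}_{caller variant}_{caller version}`. As a small illustration, the following Python sketch splits such a name into its four fields. `MedakaModel` and `parse_medaka_model` are hypothetical names, not part of Medaka or this workflow, and the sketch only handles names with exactly four underscore-separated fields.

```python
from typing import NamedTuple


class MedakaModel(NamedTuple):
    pore: str
    device: str
    caller_variant: str
    caller_version: str


def parse_medaka_model(name):
    """Split a four-field model name; raise ValueError for any other shape."""
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected 4 underscore-separated fields, got {len(parts)}: {name!r}")
    return MedakaModel(*parts)


if __name__ == "__main__":
    model = parse_medaka_model("r941_min_fast_g303")
    print(model.pore, model.device, model.caller_variant, model.caller_version)
```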
3 changes: 0 additions & 3 deletions .tests/config/pep/samples.csv

This file was deleted.

29 changes: 20 additions & 9 deletions config/config.yaml
@@ -56,16 +56,27 @@ adapters:
artic-primer-version: 3

assembly:
# minimum posterior probability for a clonal variant to be included in the generated pseudoassembly
illumina:
# assemblers used for shotgun sequencing on Illumina technology
shotgun: "megahit-std"
# assemblers used for amplicon sequencing on Illumina technology
amplicon: "metaspades"
oxford nanopore:
# assemblers used for shotgun sequencing on Oxford Nanopore technology
shotgun: "megahit-std"
# assemblers used for amplicon sequencing on Oxford Nanopore technology
amplicon: "spades"
# Medaka models are named to indicate i) the pore type, ii) the sequencing device (MinION or PromethION), iii) the basecaller variant, and iv) the basecaller version
# with the format: {pore}_{device}_{caller variant}_{caller version}
# See https://github.com/nanoporetech/medaka#models for more information.
medaka_model: r941_min_fast_g303
ion torrent:
# assemblers used for shotgun sequencing on Ion Torrent technology
shotgun: "megahit-std"
# assemblers used for amplicon sequencing on Ion Torrent technology
amplicon: "spades"
# minimum posterior probability for a clonal variant to be included in the generated pseudo-assembly
min-variant-prob: 0.95
# assemblers used for shotgun sequencing for Illumina data
shotgun: "megahit-std"
# assemblers used for amplicon sequencing for Illumina data
amplicon: "metaspades"
# Medaka models are named to indicate i) the pore type, ii) the sequencing device (MinION or PromethION), iii) the basecaller variant, and iv) the basecaller version
# with the format: {pore}_{device}_{caller variant}_{caller version}
# See https://github.com/nanoporetech/medaka#models for more information.
medaka_model: r941_min_fast_g303

variant-calling:
# false discovery rate to control for
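
The restructured `assembly` section above keys the assembler choice by technology and by sequencing method. A minimal Python sketch of such a lookup follows. It rests on assumptions: the nesting in `CONFIG` is inferred, since the diff as shown has lost its indentation; `select_assembler` and `TECHNOLOGY_SECTIONS` are hypothetical names; and the mapping from the sample sheet values (`illumina`, `ont`, `ion`) to the config section names is a guess at how the two relate, not the workflow's actual code.

```python
# Assumed nesting of the assembly section (indentation inferred).
CONFIG = {
    "assembly": {
        "illumina": {"shotgun": "megahit-std", "amplicon": "metaspades"},
        "oxford nanopore": {
            "shotgun": "megahit-std",
            "amplicon": "spades",
            "medaka_model": "r941_min_fast_g303",
        },
        "ion torrent": {"shotgun": "megahit-std", "amplicon": "spades"},
        "min-variant-prob": 0.95,
    }
}

# Hypothetical mapping from the sample sheet's technology column to config sections.
TECHNOLOGY_SECTIONS = {
    "illumina": "illumina",
    "ont": "oxford nanopore",
    "ion": "ion torrent",
}


def select_assembler(config, technology, is_amplicon):
    """Return the configured assembler for a sample's technology and method."""
    section = config["assembly"][TECHNOLOGY_SECTIONS[technology]]
    return section["amplicon" if is_amplicon else "shotgun"]


if __name__ == "__main__":
    print(select_assembler(CONFIG, "illumina", is_amplicon=True))
    print(select_assembler(CONFIG, "ont", is_amplicon=False))
```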
5 changes: 3 additions & 2 deletions config/pep/samples.csv
@@ -1,3 +1,4 @@
sample_name,fq1,fq2,date,is_amplicon_data,technology
NAME,PATH/TO/fq1,PATH/TO/fq2,ID,0,illumina
NAME,PATH/TO/fq1,,ID,1,ont
SAMPLE_NAME_1,PATH/TO/fq1,PATH/TO/fq2,SEQUENCING_DATE,0,illumina # Required information for a sample sequencing on the Illumina platform
SAMPLE_NAME_2,PATH/TO/fq,,SEQUENCING_DATE,1,ont # Required information for a sample sequencing on the Oxford Nanopore platform
SAMPLE_NAME_3,PATH/TO/fq,,SEQUENCING_DATE,1,ion # Required information for a sample sequencing on the Ion Torrent platform
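
The template rows above suggest that Illumina samples carry two FASTQ paths while Oxford Nanopore and Ion Torrent samples leave `fq2` empty. The following Python sketch checks a filled-in sheet against that reading. The rules are inferred from the template only and may not match the workflow's schema, and `validate_sample_sheet` is a hypothetical helper. The trailing `# ...` notes in the template are treated as part of the `technology` field here, so a row that still carries one is reported.

```python
import csv
import io

ALLOWED_TECHNOLOGIES = {"illumina", "ont", "ion"}


def validate_sample_sheet(csv_text):
    """Return a list of human-readable problems found in the sample sheet."""
    errors = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = (row.get("sample_name") or "").strip()
        technology = (row.get("technology") or "").strip()
        fq2 = (row.get("fq2") or "").strip()
        amplicon = (row.get("is_amplicon_data") or "").strip()

        if technology not in ALLOWED_TECHNOLOGIES:
            errors.append(f"{name}: unknown technology {technology!r}")
        elif technology == "illumina" and not fq2:
            errors.append(f"{name}: illumina sample without fq2")
        elif technology != "illumina" and fq2:
            errors.append(f"{name}: {technology} sample with unexpected fq2")

        if amplicon not in {"0", "1"}:
            errors.append(f"{name}: is_amplicon_data must be 0 or 1")
    return errors


if __name__ == "__main__":
    sheet = (
        "sample_name,fq1,fq2,date,is_amplicon_data,technology\n"
        "S1,a_1.fq.gz,a_2.fq.gz,2022-01-01,0,illumina\n"
        "S2,b.fq.gz,,2022-01-01,1,ont\n"
        "S3,c.fq.gz,,2022-01-01,1,ion # leftover template note\n"
    )
    for problem in validate_sample_sheet(sheet):
        print(problem)
```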
2 changes: 0 additions & 2 deletions workflow/envs/bamclipper.yaml
@@ -3,5 +3,3 @@ channels:
- conda-forge
dependencies:
- bamclipper =1.0
- fgbio = 1.3
- samtools = 1.9
5 changes: 5 additions & 0 deletions workflow/envs/fgbio.yaml
@@ -0,0 +1,5 @@
channels:
- bioconda
- conda-forge
dependencies:
- fgbio = 1.3
2 changes: 1 addition & 1 deletion workflow/envs/samtools.yaml
@@ -2,4 +2,4 @@ channels:
- bioconda
- conda-forge
dependencies:
- samtools =1.10
- samtools =1.14