feat(technology): add ion torrent processing (#383)

* add ont toolchain
* move medka
* add primer clipping
* add env for primer clipping
* integrate reads in common.smk
* fmt
* add medaka_model to test config
* update sample sheet
* add check for benchmark data
* fmt
* add ci for ont
* change schema for new sample sheet
* linting
* add ARTIC_v3_adapters
* fmt
* add ARTIC primers to resources
* integrate ont pipieline
* fmt
* add missing wildcards
* agg is_sample_technongoly in is_technology
* change ont assembly from canu to spades se
* fmt
* make artic primers version adjustable
* adjust fastp, kranken and samtools depth
* rename rules
* adjust kallisto
* fmt
* fmt
* fix: kallisto_metrics log
* fmt
* fix: get_kraken_output date input
* ci: change to -o
* ci: add github ont reads
* config update
* adjust kallisto input
* fmt
* adjust assembler comparison for illumina only
* fmt
* fix: wildcard -> wildcards
* refactor: get_technology
* fmt
* fix: call function
* ci: rmv amplicon file
* add threads for canu
* add corThreads
* add maxMemory
* rmv restrictions
* add corThreads
* add threads
* add redMemory
* change to maxThreads and maxMemory
* Revert "change to maxThreads and maxMemory"
  This reverts commit f053dcb.
* add oeaMemory
* add debug statement
* print kallisto metrics
* add missing space
* add print for debug
* add if
* add testing return
* fmt
* adjust test log date and names
* make canu params for testing
* fmt
* add lambda expression for testing
* fix typo
* fill empty rows in qc data sheet with "0"
* add nano qc
* fmt
* update spades env
* add vcf to medaka output
* add polishing with medeka on de novo assembly
* rmv print debug statements
* refactor masking script
* fmt
* fix masked sequence writer
* remove "manual" fasta parser
* fmt
* deal with empty rki filter
* update sample sheet generation
* extract ont read numbers for qc table
* fmt
* fix read counting
* fix spades assembeler path due to version update
* fmt
* change to contigs.fasta
* use raw_contigs with pe spades
* add if statement
* change to wildcard
* change *
* add missing wildcards
* rmv canu correct folder
* remove to long string from "Other Variants" column
* add medaka variant calling
* fmt
* fix kraken
* change to trimmed and not corrected reads for polishing
* Revert "change to trimmed and not corrected reads for polishing"
  This reverts commit 81b3030.
* add missing gz
* make rki-filter less errorprone regarding samplenames
* fmt
* add human removal
* fix samtools in bamclipper
* add consesus
* add identity overview
* fmt
* fixes
* more fixes
* change path of indicators
* update report descriptions
* update more descs
* updats scripts
* Change Pangolin Call to Lineage
* comments
* fix quast call
* fixes, comments
* fmt
* fix in masking script
* add longshot
* fmt
* change logging of porechop_primer_trimming
* add porechop debug
* update report generation
* cleanup rule all
* fmt
* remove debug
* add gzip keep flag
* add polishing of consenus
* fmt
* fmt
* fmt
* fmt
* touch ups
* fmt
* fmt
* rmv unnecessary code
* add ion torrent
* fmt
* refactor todos
* improvements
* fmt
* update selection functions
* add docstrings
* updat samplesheet
* fmt
* renaming
* rmv patterns from get_reads_after_qc
* update testing
* fix technology matrix
* split all and benchmarks
* fmt
* fmt
* add data for compare_assemblers
* fix indent
* indent
* another indent
* rmv paranthesis
* Always download test data
* update artifacts
* add amplicon tests
* add missing patterns
* fmt
* update assembler config
* update test config, main.yml
* add qoutes
* add missing gz
* fix path
* add pe flags for assembler comparison
* rmv contigs output flag
* temp fix adapters
* add conda caching

Co-authored-by: simakro <[email protected]>
IKIM-Essen · Dec 9, 2021 · 288777c
1 parent 855061b
commit 288777c
Showing 21 changed files with 800 additions and 395 deletions.
diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
@@ -49,9 +49,124 @@ jobs:
           snakefile: workflow/Snakefile
           stagein: mamba install -n snakemake -c conda-forge peppy
           args: "--lint"
+
 
+  Technology-Tests:
+    runs-on: ubuntu-latest
+    env:
+      GISAID_API_TOKEN: ${{ secrets.GISAID_API_TOKEN }}
+    needs:
+      - Formatting
+      - Linting
+    strategy:
+      matrix:
+        rule: [all, all -npr]
+        technology: [all, illumina, ont, ion]
+        seq_method: [shotgun, amplicon]
+    steps:
+      - uses: actions/checkout@v2
+
+      - name: Cache conda dependencies
+        uses: actions/cache@v2
+        with:
+          path: |
+            .tests/.snakemake/conda
+          key: technology-${{ runner.os }}-${{ matrix.rule }}-${{ matrix.technology }}-${{ matrix.seq_method }}-${{ hashFiles('*.tests/.snakemake/conda/*.yaml') }}
+
+      - name: Prepare test data for all technologies
+        if: steps.test-data.outputs.cache-hit != true && (startsWith(matrix.rule, 'all') && matrix.technology == 'all' || matrix.rule == 'compare_assemblers')
+        run: |
+          if [[ "${{ matrix.seq_method }}"  = "shotgun" ]] ; then export AMPLICON=0; else export AMPLICON=1; fi
+          mkdir -p .tests/data
+          curl -L https://github.com/thomasbtf/small-kraken-db/raw/master/B.1.1.7.reads.1.fastq.gz > .tests/data/B117.1.fastq.gz
+          curl -L https://github.com/thomasbtf/small-kraken-db/raw/master/B.1.1.7.reads.1.fastq.gz > .tests/data/B117.2.fastq.gz
+          curl -L https://github.com/thomasbtf/small-kraken-db/raw/master/ont_reads.fastq.gz > .tests/data/ont_reads.fastq.gz 
+          curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR574/003/ERR5745913/ERR5745913.fastq.gz > .tests/data/ion_reads.fastq.gz 
+          echo sample_name,fq1,fq2,date,is_amplicon_data,technology > .tests/config/pep/samples.csv
+          echo illumina-test,data/B117.1.fastq.gz,data/B117.2.fastq.gz,2022-01-01,$AMPLICON,illumina >> .tests/config/pep/samples.csv
+          echo ont-test,data/ont_reads.fastq.gz,,2022-01-01,$AMPLICON,ont >> .tests/config/pep/samples.csv
+          echo ion-test,data/ion_reads.fastq.gz,,2022-01-01,$AMPLICON,ion >> .tests/config/pep/samples.csv
+
+      - name: Prepare test data for Illumina
+        if: steps.test-data.outputs.cache-hit != true && (startsWith(matrix.rule, 'all') && matrix.technology == 'illumina' || matrix.rule == 'compare_assemblers')
+        run: |
+          if [[ "${{ matrix.seq_method }}"  = "shotgun" ]] ; then export AMPLICON=0; else export AMPLICON=1; fi
+          mkdir -p .tests/data
+          curl -L https://github.com/thomasbtf/small-kraken-db/raw/master/B.1.1.7.reads.1.fastq.gz > .tests/data/B117.1.fastq.gz
+          curl -L https://github.com/thomasbtf/small-kraken-db/raw/master/B.1.1.7.reads.1.fastq.gz > .tests/data/B117.2.fastq.gz
+          echo sample_name,fq1,fq2,date,is_amplicon_data,technology > .tests/config/pep/samples.csv
+          echo illumina-test,data/B117.1.fastq.gz,data/B117.2.fastq.gz,2022-01-01,$AMPLICON,illumina >> .tests/config/pep/samples.csv
+
+      - name: Prepare test data for Oxford Nanopore
+        if: steps.test-data.outputs.cache-hit != true && (startsWith(matrix.rule, 'all') && matrix.technology == 'ont' || matrix.rule == 'compare_assemblers')
+        run: |
+          if [[ "${{ matrix.seq_method }}"  = "shotgun" ]] ; then export AMPLICON=0; else export AMPLICON=1; fi
+          mkdir -p .tests/data
+          curl -L https://github.com/thomasbtf/small-kraken-db/raw/master/ont_reads.fastq.gz > .tests/data/ont_reads.fastq.gz 
+          echo sample_name,fq1,date,is_amplicon_data,technology > .tests/config/pep/samples.csv
+          echo ont-test,data/ont_reads.fastq.gz,2022-01-01,$AMPLICON,ont >> .tests/config/pep/samples.csv
+
+      - name: Prepare test data for Ion Torrent
+        if: steps.test-data.outputs.cache-hit != true && (startsWith(matrix.rule, 'all') && matrix.technology == 'ion' || matrix.rule == 'compare_assemblers')
+        run: |
+          if [[ "${{ matrix.seq_method }}"  = "shotgun" ]] ; then export AMPLICON=0; else export AMPLICON=1; fi
+          mkdir -p .tests/data
+          curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR574/003/ERR5745913/ERR5745913.fastq.gz > .tests/data/ion_reads.fastq.gz 
+          echo sample_name,fq1,date,is_amplicon_data,technology > .tests/config/pep/samples.csv
+          echo ion-test,data/ion_reads.fastq.gz,2022-01-01,$AMPLICON,ion >> .tests/config/pep/samples.csv
+
+      - name: Use smaller reference files for testing
+        if: steps.test-resources.outputs.cache-hit != true
+        run: |
+          # mkdir -p .tests/resources/minikraken-8GB
+          # curl -SL https://github.com/thomasbtf/small-kraken-db/raw/master/human_k2db.tar.gz | tar zxvf - -C .tests/resources/minikraken-8GB --strip 1
+          mkdir -p .tests/resources/genomes
+          curl -SL "https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=NC_000021.9&db=nuccore&report=fasta" | gzip -c > .tests/resources/genomes/human-genome.fna.gz
+
+      - name: Simulate GISAID download
+        run: |
+          mkdir -p .tests/results/benchmarking/tables
+          echo -e "resources/genomes/B.1.1.7.fasta\nresources/genomes/B.1.351.fasta" > .tests/results/benchmarking/tables/strain-genomes.txt
+          mkdir -p .tests/resources/genomes
+          curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=MZ314997.1&rettype=fasta" | sed '$ d' > .tests/resources/genomes/B.1.1.7.fasta
+          curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=MZ314998.1&rettype=fasta" | sed '$ d' > .tests/resources/genomes/B.1.351.fasta
+
+      - name: Test rule ${{ matrix.rule }} on ${{ matrix.technology }} ${{ matrix.seq_method }} data
+        uses: snakemake/[email protected]
+        with:
+          directory: .tests
+          snakefile: workflow/Snakefile
+          args: "--use-conda --show-failed-logs --cores 2 --resources ncbi_api_requests=1 --conda-cleanup-pkgs cache --conda-frontend mamba ${{ matrix.rule }}"
+
+      - name: Test report
+        uses: snakemake/[email protected]
+        if: startsWith(matrix.rule, 'all -npr') != true
+        with:
+          directory: .tests
+          snakefile: workflow/Snakefile
+          args: "${{ matrix.rule }} --report report.zip"
+
+      - name: Upload report
+        uses: actions/upload-artifact@v2
+        if: matrix.technology == 'all' && matrix.rule != 'all -npr'
+        with:
+          name: report-rule-${{ matrix.rule }}-${{ matrix.technology }}-${{ matrix.seq_method }}
+          path: .tests/report.zip
+
+      - name: Upload logs
+        uses: actions/upload-artifact@v2
+        if: matrix.technology == 'all' && matrix.rule != 'all -npr'
+        with:
+          name: log-rule-${{ matrix.rule }}-technology-${{ matrix.technology }}
+          path: .tests/logs/
+
+      - name: Change permissions for caching
+        run: sudo chmod -R 755 .tests/.snakemake/conda
+
+      - name: Print disk space
+        run: sudo df -h
 
-  Testing:
+  Benchmarks-Tests:
     runs-on: ubuntu-latest
     env:
       GISAID_API_TOKEN: ${{ secrets.GISAID_API_TOKEN }}
@@ -60,10 +175,18 @@ jobs:
       - Linting
     strategy:
       matrix:
-        rule: [all, all -npr, benchmark_strain_calling, benchmark_assembly, benchmark_mixtures, benchmark_non_sars_cov_2, compare_assemblers, benchmark_reads]
+        rule: [benchmark_strain_calling, benchmark_assembly, benchmark_mixtures, benchmark_non_sars_cov_2, benchmark_reads, compare_assemblers]
     steps:
       - uses: actions/checkout@v2
 
+      - name: Cache conda dependencies
+        uses: actions/cache@v2
+        with:
+          path: |
+            .tests/.snakemake/conda
+          key: benchmarks-${{ runner.os }}-${{ matrix.rule }}-${{ matrix.technology }}-${{ matrix.seq_method }}-${{ hashFiles('*.tests/.snakemake/conda/*.yaml') }}
+
+
       # TODO caches are currently completely misleading, as they lead to certain files becoming present on disk which might
       # then hide failures that would otherwise be seen.
 
@@ -145,14 +268,16 @@ jobs:
       #       ${{ runner.os }}-sars-cov-benchmark-dependencies-${{ steps.get-date.outputs.date }}-
       #       ${{ runner.os }}-sars-cov-benchmark-dependencies-
 
-      - name: Download test data
-        if: steps.test-data.outputs.cache-hit != true && (startsWith(matrix.rule, 'all') || matrix.rule == 'compare_assemblers')
+
+      - name: Prepare test data
+        if: steps.test-data.outputs.cache-hit != true
         run: |
           mkdir -p .tests/data
           curl -L https://github.com/thomasbtf/small-kraken-db/raw/master/B.1.1.7.reads.1.fastq.gz > .tests/data/B117.1.fastq.gz
           curl -L https://github.com/thomasbtf/small-kraken-db/raw/master/B.1.1.7.reads.1.fastq.gz > .tests/data/B117.2.fastq.gz
-          curl -L https://github.com/thomasbtf/small-kraken-db/raw/master/ont_reads.fastq.gz > .tests/data/ont_reads.fastq.gz
-      
+          echo sample_name,fq1,fq2,date,is_amplicon_data,technology > .tests/config/pep/samples.csv
+          echo illumina-test,data/B117.1.fastq.gz,data/B117.2.fastq.gz,2022-01-01,0,illumina >> .tests/config/pep/samples.csv
+
       - name: Use smaller reference files for testing
         if: steps.test-resources.outputs.cache-hit != true
         run: |
@@ -169,8 +294,7 @@ jobs:
           curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=MZ314997.1&rettype=fasta" | sed '$ d' > .tests/resources/genomes/B.1.1.7.fasta
           curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=MZ314998.1&rettype=fasta" | sed '$ d' > .tests/resources/genomes/B.1.351.fasta
 
-
-      - name: Test rule ${{ matrix.rule }}
+      - name: Test rule ${{ matrix.rule }} 
         uses: snakemake/[email protected]
         with:
           directory: .tests
@@ -185,16 +309,16 @@ jobs:
           snakefile: workflow/Snakefile
           args: "${{ matrix.rule }} --report report.zip"
 
-      - name: Upload report
-        uses: actions/upload-artifact@v2
-        with:
-          name: report-${{ matrix.rule }}
-          path: .tests/report.zip
+      # - name: Upload report
+      #   uses: actions/upload-artifact@v2
+      #   with:
+      #     name: report-rule-${{ matrix.rule }}
+      #     path: .tests/report.zip
 
       - name: Upload logs
         uses: actions/upload-artifact@v2
         with:
-          name: log-${{ matrix.rule }}
+          name: log-rule-${{ matrix.rule }}
           path: .tests/logs/
 
       # - name: Unit test
@@ -226,7 +350,7 @@ jobs:
           cat .tests/results/benchmarking/assembly/pseudoassembly.csv
           if [[ $(tail -1 .tests/results/benchmarking/assembly/pseudoassembly.csv) < 0.95 ]]
           then
-            echo "Pseudoassembly bechmarking failed. There is at least one assembly where the contigs do not cover 95% of the original sequence (see above)."
+            echo "Pseudoassembly benchmarking failed. There is at least one assembly where the contigs do not cover 95% of the original sequence (see above)."
             exit 1
           else
             echo "Pseudoassembly was successful."
@@ -238,7 +362,7 @@ jobs:
           cat .tests/results/benchmarking/assembly/assembly.csv
           if [[ $(tail -1 .tests/results/benchmarking/assembly/assembly.csv) < 0.8 ]]
           then
-            echo "Assembly bechmarking failed. There is at least one assembly where the contigs do not cover 80% of the original sequence (see above)."
+            echo "Assembly benchmarking failed. There is at least one assembly where the contigs do not cover 80% of the original sequence (see above)."
             exit 1
           else
             echo "Assembly was successful."
@@ -261,8 +385,8 @@ jobs:
               echo "Workflow sucessfully identified samples as non-sars-cov-2 in all cases."
           fi
 
-      - name: Print disk space
-        run: sudo df -h
-
       - name: Change permissions for caching
         run: sudo chmod -R 755 .tests/.snakemake/conda
+
+      - name: Print disk space
+        run: sudo df -h
diff --git a/.tests/config/config.yaml b/.tests/config/config.yaml
@@ -77,16 +77,27 @@ variant-calling:
     high+moderate-impact: 'ANN["IMPACT"] in ["HIGH", "MODERATE"]'
 
 assembly:
-  # minimum posterior probability for a clonal variant to be included in the generated pseudoassembly
+  illumina:
+    # assemblers used for shotgun sequencing on Illumina technology
+    shotgun: "megahit-std"
+    # assemblers used for amplicon sequencing on Illumina technology
+    amplicon: "metaspades"
+  oxford nanopore:
+    # assemblers used for shotgun sequencing on Oxford Nanopore technology
+    shotgun: "megahit-std"
+    # assemblers used for amplicon sequencing on Oxford Nanopore technology
+    amplicon: "spades"
+    # Medaka models are named to indicate i) the pore type, ii) the sequencing device (MinION or PromethION), iii) the basecaller variant, and iv) the basecaller version
+    # with the format: {pore}_{device}_{caller variant}_{caller version}
+    # See https://github.com/nanoporetech/medaka#models for more information.
+    medaka_model: r941_min_fast_g303
+  ion torrent:
+    # assemblers used for shotgun sequencing on Ion Torrent technology
+    shotgun: "megahit-std"
+    # assemblers used for amplicon sequencing on Ion Torrent technology
+    amplicon: "spades"
+  # minimum posterior probability for a clonal variant to be included in the generated pseudo-assembly
   min-variant-prob: 0.95
-  # assemblers used for shotgun sequencing for Illumina data
-  shotgun: "megahit-std"
-  # assemblers used for amplicon sequencing for Illumina data
-  amplicon: "metaspades"
-  # Medaka models are named to indicate i) the pore type, ii) the sequencing device (MinION or PromethION), iii) the basecaller variant, and iv) the basecaller version
-  # with the format: {pore}_{device}_{caller variant}_{caller version}
-  # See https://github.com/nanoporetech/medaka#models for more information.
-  medaka_model: r941_min_fast_g303
 
 
 strain-calling:

diff --git a/.tests/config/pep/samples.csv b/.tests/config/pep/samples.csv
diff --git a/config/config.yaml b/config/config.yaml
@@ -56,16 +56,27 @@ adapters:
   artic-primer-version: 3
 
 assembly:
-  # minimum posterior probability for a clonal variant to be included in the generated pseudoassembly
+  illumina:
+    # assemblers used for shotgun sequencing on Illumina technology
+    shotgun: "megahit-std"
+    # assemblers used for amplicon sequencing on Illumina technology
+    amplicon: "metaspades"
+  oxford nanopore:
+    # assemblers used for shotgun sequencing on Oxford Nanopore technology
+    shotgun: "megahit-std"
+    # assemblers used for amplicon sequencing on Oxford Nanopore technology
+    amplicon: "spades"
+    # Medaka models are named to indicate i) the pore type, ii) the sequencing device (MinION or PromethION), iii) the basecaller variant, and iv) the basecaller version
+    # with the format: {pore}_{device}_{caller variant}_{caller version}
+    # See https://github.com/nanoporetech/medaka#models for more information.
+    medaka_model: r941_min_fast_g303
+  ion torrent:
+    # assemblers used for shotgun sequencing on Ion Torrent technology
+    shotgun: "megahit-std"
+    # assemblers used for amplicon sequencing on Ion Torrent technology
+    amplicon: "spades"
+  # minimum posterior probability for a clonal variant to be included in the generated pseudo-assembly
   min-variant-prob: 0.95
-  # assemblers used for shotgun sequencing for Illumina data
-  shotgun: "megahit-std"
-  # assemblers used for amplicon sequencing for Illumina data
-  amplicon: "metaspades"
-  # Medaka models are named to indicate i) the pore type, ii) the sequencing device (MinION or PromethION), iii) the basecaller variant, and iv) the basecaller version
-  # with the format: {pore}_{device}_{caller variant}_{caller version}
-  # See https://github.com/nanoporetech/medaka#models for more information.
-  medaka_model: r941_min_fast_g303
 
 variant-calling:
   # false discovery rate to control for

diff --git a/config/pep/samples.csv b/config/pep/samples.csv
@@ -1,3 +1,4 @@
 sample_name,fq1,fq2,date,is_amplicon_data,technology
-NAME,PATH/TO/fq1,PATH/TO/fq2,ID,0,illumina
-NAME,PATH/TO/fq1,,ID,1,ont
+SAMPLE_NAME_1,PATH/TO/fq1,PATH/TO/fq2,SEQUENCING_DATE,0,illumina # Required information for a sample sequencing on the Illumina platform
+SAMPLE_NAME_2,PATH/TO/fq,,SEQUENCING_DATE,1,ont # Required information for a sample sequencing on the Oxford Nanopore platform
+SAMPLE_NAME_3,PATH/TO/fq,,SEQUENCING_DATE,1,ion # Required information for a sample sequencing on the Ion Torrent platform
diff --git a/workflow/envs/bamclipper.yaml b/workflow/envs/bamclipper.yaml
@@ -3,5 +3,3 @@ channels:
   - conda-forge
 dependencies:
   - bamclipper =1.0
-  - fgbio = 1.3
-  - samtools = 1.9
diff --git a/workflow/envs/fgbio.yaml b/workflow/envs/fgbio.yaml
@@ -0,0 +1,5 @@
+channels:
+  - bioconda
+  - conda-forge
+dependencies:
+  - fgbio = 1.3
diff --git a/workflow/envs/samtools.yaml b/workflow/envs/samtools.yaml
@@ -2,4 +2,4 @@ channels:
   - bioconda
   - conda-forge
 dependencies:
-  - samtools =1.10
+  - samtools =1.14
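
Note on the new `assembly` config layout: the sections in `config/config.yaml` are keyed by full technology names (`illumina`, `oxford nanopore`, `ion torrent`), while the `technology` column in `samples.csv` uses short codes (`illumina`, `ont`, `ion`). The sketch below shows one way such a lookup could work. It is an illustration under stated assumptions, not the workflow's actual code: `TECHNOLOGY_TO_CONFIG_KEY`, `ASSEMBLY_CONFIG`, and `select_assembler` are hypothetical names, and the dictionary simply mirrors the values visible in this diff.

```python
# Hypothetical mapping from the `technology` column of samples.csv to the
# section names used under `assembly:` in config.yaml (an assumption made
# for illustration; the workflow may resolve this differently).
TECHNOLOGY_TO_CONFIG_KEY = {
    "illumina": "illumina",
    "ont": "oxford nanopore",
    "ion": "ion torrent",
}

# Values copied from the `assembly` section shown in this diff.
ASSEMBLY_CONFIG = {
    "illumina": {"shotgun": "megahit-std", "amplicon": "metaspades"},
    "oxford nanopore": {
        "shotgun": "megahit-std",
        "amplicon": "spades",
        "medaka_model": "r941_min_fast_g303",
    },
    "ion torrent": {"shotgun": "megahit-std", "amplicon": "spades"},
    "min-variant-prob": 0.95,
}


def select_assembler(assembly_config, technology, is_amplicon_data):
    """Return the assembler configured for one sample-sheet row.

    `is_amplicon_data` may be an int or the string read from the CSV
    ("0" or "1").
    """
    if technology not in TECHNOLOGY_TO_CONFIG_KEY:
        raise ValueError(f"unsupported technology: {technology!r}")
    section = assembly_config[TECHNOLOGY_TO_CONFIG_KEY[technology]]
    seq_method = "amplicon" if int(is_amplicon_data) else "shotgun"
    return section[seq_method]


if __name__ == "__main__":
    # One row per technology, as in the CI-generated sample sheets.
    for technology, is_amplicon in [("illumina", 0), ("ont", 1), ("ion", "1")]:
        print(technology, select_assembler(ASSEMBLY_CONFIG, technology, is_amplicon))
```

Keeping the short codes in the sample sheet and the long names in the config means any lookup needs an explicit translation step like the one above; an unknown code fails loudly instead of silently falling back to a default assembler.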