sanger-tol · muffato · Sep 13, 2024 · Sep 16, 2024 · Sep 16, 2024 · Sep 19, 2024
diff --git a/.github/workflows/linting.yml b/.github/workflows/linting.yml
@@ -19,10 +19,10 @@ jobs:
       - uses: actions/setup-node@v4
 
       - name: Install editorconfig-checker
-        run: npm install -g editorconfig-checker
+        run: npm install -g editorconfig-checker@3.0.2
 
       - name: Run ECLint check
-        run: editorconfig-checker -exclude README.md $(find .* -type f | grep -v '.git\|.py\|.md\|json\|yml\|yaml\|html\|css\|work\|.nextflow\|build\|nf_core.egg-info\|log.txt\|Makefile')
+        run: editorconfig-checker -exclude README.md $(find .* -type f | grep -v '.git\|.py\|.md\|json\|yml\|yaml\|html\|css\|work\|.nextflow\|build\|nf_core.egg-info\|log.txt\|Makefile\|.sqlite3')
 
   Prettier:
     runs-on: ubuntu-latest

diff --git a/.github/workflows/sanger_test_full.yml b/.github/workflows/sanger_test_full.yml
@@ -26,7 +26,7 @@ jobs:
         with:
           workspace_id: ${{ secrets.TOWER_WORKSPACE_ID }}
           access_token: ${{ secrets.TOWER_ACCESS_TOKEN }}
-          compute_env: ${{ secrets.TOWER_COMPUTE_ENV_LARGE }}
+          compute_env: ${{ secrets.TOWER_COMPUTE_ENV }}
           revision: ${{ env.REVISION }}
           workdir: ${{ secrets.TOWER_WORKDIR_PARENT }}/work/${{ github.repository }}/work-${{ env.REVISION }}
           parameters: |

diff --git a/.nf-core.yml b/.nf-core.yml
@@ -12,13 +12,18 @@ lint:
     - LICENCE
     - lib/NfcoreTemplate.groovy
     - CODE_OF_CONDUCT.md
+    - assets/sendmail_template.txt
+    - assets/email_template.html
+    - assets/email_template.txt
     - assets/nf-core-blobtoolkit_logo_light.png
     - docs/images/nf-core-blobtoolkit_logo_light.png
     - docs/images/nf-core-blobtoolkit_logo_dark.png
     - .github/ISSUE_TEMPLATE/bug_report.yml
     - .github/PULL_REQUEST_TEMPLATE.md
     - .github/workflows/linting.yml
     - .github/workflows/branch.yml
+    - .github/CONTRIBUTING.md
+    - .github/workflows/linting_comment.yml
   multiqc_config:
     - report_comment
   nextflow_config:

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,6 +3,25 @@
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [[0.7.0](https://github.com/sanger-tol/blobtoolkit/releases/tag/0.7.0)] – Psyduck – [2024-10-02]
+
+The pipeline is now considered to be a complete and suitable replacement for the Snakemake version.
+
+- Fetch information about the chromosomes of the assemblies. Used to power
+  "grid plots".
+- Fill in accurate read information in the blobDir. Users are now required
+  to indicate in the samplesheet whether the reads are paired or single.
+- Updated the Blastn settings to allow 7 days runtime at most, since that
+  covers 99.7% of the jobs.
+
+### Software dependencies
+
+Note, since the pipeline is using Nextflow DSL2, each process will be run with its own [Biocontainer](https://biocontainers.pro/#/registry). This means that on occasion it is entirely possible for the pipeline to be using different versions of the same tool. However, the overall software dependency changes compared to the last release have been listed below for reference. Only `Docker` or `Singularity` containers are supported, `conda` is not supported.
+
+| Dependency  | Old version | New version |
+| ----------- | ----------- | ----------- |
+| blobtoolkit | 4.3.9       | 4.3.13      |
+
 ## [[0.6.0](https://github.com/sanger-tol/blobtoolkit/releases/tag/0.6.0)] – Bellsprout – [2024-09-13]
 
 The pipeline has now been validated for draft (unpublished) assemblies.
@@ -87,13 +106,13 @@ The pipeline has now been validated on dozens of genomes, up to 11 Gbp.
 
 Note, since the pipeline is using Nextflow DSL2, each process will be run with its own [Biocontainer](https://biocontainers.pro/#/registry). This means that on occasion it is entirely possible for the pipeline to be using different versions of the same tool. However, the overall software dependency changes compared to the last release have been listed below for reference. Only `Docker` or `Singularity` containers are supported, `conda` is not supported.
 
-| Dependency  | Old version   | New version   |
-| ----------- | ------------- | ------------- |
-| blobtoolkit | 4.3.3         | 4.3.9         |
-| blast       | 2.14.0        | 2.15.0        |
-| multiqc     | 1.17 and 1.18 | 1.20 and 1.21 |
-| samtools    | 1.18          | 1.19.2        |
-| seqtk       | 1.3           | 1.4           |
+| Dependency  | Old version   | New version       |
+| ----------- | ------------- | ----------------- |
+| blobtoolkit | 4.3.3         | 4.3.9             |
+| blast       | 2.14.0        | 2.15.0 and 2.14.1 |
+| multiqc     | 1.17 and 1.18 | 1.20 and 1.21     |
+| samtools    | 1.18          | 1.19.2            |
+| seqtk       | 1.3           | 1.4               |
 
 > **NB:** Dependency has been **updated** if both old and new version information is present. </br> **NB:** Dependency has been **added** if just the new version information is present. </br> **NB:** Dependency has been **removed** if version information isn't present.
 

diff --git a/CITATIONS.md b/CITATIONS.md
@@ -12,6 +12,10 @@
 
 ## Pipeline tools
 
+- [BLAST+](https://blast.ncbi.nlm.nih.gov/doc/blast-help/downloadblastdata.html)
+
+  > Camacho, Christiam, et al. “BLAST+: architecture and applications.” BMC Bioinformatics, vol. 10, no. 421, Dec. 2009, https://doi.org/10.1186/1471-2105-10-421
+
 - [BlobToolKit](https://github.com/blobtoolkit/blobtoolkit)
 
   > Challis, Richard, et al. “BlobToolKit – Interactive Quality Assessment of Genome Assemblies.” G3 Genes|Genomes|Genetics, vol. 10, no. 4, Apr. 2020, pp. 1361–74, https://doi.org/10.1534/g3.119.400908.
@@ -26,9 +30,7 @@
 
 - [Fasta_windows](https://github.com/tolkit/fasta_windows)
 
-- [GoaT](https://goat.genomehubs.org)
-
-  > Challis, Richard, et al. “Genomes on a Tree (GoaT): A versatile, scalable search engine for genomic and sequencing project metadata across the eukaryotic tree of life.” Wellcome Open Research, vol. 8, no. 24, 2023, https://doi.org/10.12688/wellcomeopenres.18658.1.
+  > Brown, Max, et al. "Fasta_windows v0.2.3". GitHub, 2021. https://github.com/tolkit/fasta_windows
 
 - [Minimap2](https://github.com/lh3/minimap2)
 
@@ -42,6 +44,10 @@
 
   > Danecek, Petr, et al. “Twelve Years of SAMtools and BCFtools.” GigaScience, vol. 10, no. 2, Jan. 2021, https://doi.org/10.1093/gigascience/giab008.
 
+- [SeqTK](https://github.com/lh3/seqtk)
+
+  > Li, Heng. "SeqTK v1.4". GitHub, 2023. https://github.com/lh3/seqtk
+
 ## Software packaging/containerisation tools
 
 - [Anaconda](https://anaconda.com)

diff --git a/README.md b/README.md
@@ -20,8 +20,8 @@ It takes a samplesheet of BAM/CRAM/FASTQ/FASTA files as input, calculates genome
 4. Run BUSCO ([`busco`](https://busco.ezlab.org/))
 5. Extract BUSCO genes ([`blobtoolkit/extractbuscos`](https://github.com/blobtoolkit/blobtoolkit))
 6. Run Diamond BLASTp against extracted BUSCO genes ([`diamond/blastp`](https://github.com/bbuchfink/diamond))
-7. Run BLASTx against sequences with no hit ([`blast/blastn`](https://www.ncbi.nlm.nih.gov/books/NBK131777/))
-8. Run BLASTn against sequences still with not hit ([`blast/blastx`](https://www.ncbi.nlm.nih.gov/books/NBK131777/))
+7. Run BLASTx against sequences with no hit ([`diamond/blastx`](https://github.com/bbuchfink/diamond))
+8. Run BLASTn against sequences still with no hit ([`blast/blastn`](https://www.ncbi.nlm.nih.gov/books/NBK131777/))
 9. Count BUSCO genes ([`blobtoolkit/countbuscos`](https://github.com/blobtoolkit/blobtoolkit))
 10. Generate combined sequence stats across various window sizes ([`blobtoolkit/windowstats`](https://github.com/blobtoolkit/blobtoolkit))
 11. Imports analysis results into a BlobDir dataset ([`blobtoolkit/blobdir`](https://github.com/blobtoolkit/blobtoolkit))
@@ -37,13 +37,17 @@ First, prepare a samplesheet with your input data that looks as follows:
 `samplesheet.csv`:
 
 ```csv
-sample,datatype,datafile
-mMelMel3,hic,GCA_922984935.2.hic.mMelMel3.cram
-mMelMel1,illumina,GCA_922984935.2.illumina.mMelMel1.cram
-mMelMel3,ont,GCA_922984935.2.ont.mMelMel3.cram
+sample,datatype,datafile,library_layout
+mMelMel3,hic,GCA_922984935.2.hic.mMelMel3.cram,PAIRED
+mMelMel1,illumina,GCA_922984935.2.illumina.mMelMel1.cram,PAIRED
+mMelMel3,ont,GCA_922984935.2.ont.mMelMel3.cram,SINGLE
 ```
 
-Each row represents an aligned file. Rows with the same sample identifier are considered technical replicates. The datatype refers to the sequencing technology used to generate the underlying raw data and follows a controlled vocabulary (`ont`, `hic`, `pacbio`, `pacbio_clr`, `illumina`). The aligned read files can be generated using the [sanger-tol/readmapping](https://github.com/sanger-tol/readmapping) pipeline.
+Each row represents an aligned file.
+Rows with the same sample identifier are considered technical replicates.
+The datatype refers to the sequencing technology used to generate the underlying raw data and follows a controlled vocabulary (`ont`, `hic`, `pacbio`, `pacbio_clr`, `illumina`).
+The library layout indicates whether the reads are paired or single.
+The aligned read files can be generated using the [sanger-tol/readmapping](https://github.com/sanger-tol/readmapping) pipeline.
 
 Now, you can run the pipeline using:
 
@@ -77,9 +81,8 @@ sanger-tol/blobtoolkit was written in Nextflow by [Alexander Ramos Diaz](https:/
 
 We thank the following people for their assistance in the development of this pipeline:
 
-<!-- If applicable, make list of people who have also contributed -->
-
 - [Guoying Qi](https://github.com/gq1)
+- [Bethan Yates](https://github.com/BethYates)
 
 ## Contributions and Support
 
@@ -89,8 +92,6 @@ If you would like to contribute to this pipeline, please see the [contributing g
 
 If you use sanger-tol/blobtoolkit for your analysis, please cite it using the following doi: [10.5281/zenodo.7949058](https://doi.org/10.5281/zenodo.7949058)
 
-<!-- Add bibliography of tools and data used in your pipeline -->
-
 An extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file.
 
 This pipeline uses code and infrastructure developed and maintained by the [nf-core](https://nf-co.re) community, reused here under the [MIT license](https://github.com/nf-core/tools/blob/master/LICENSE).

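Since the samplesheet now needs a fourth `library_layout` column, existing three-column samplesheets have to be updated. The following is a minimal sketch of one way to do that, not part of the pipeline. The helper name `add_library_layout` and the `ASSUMED_LAYOUTS` mapping are hypothetical: the mapping is inferred only from the example rows in this change (`hic` and `illumina` as `PAIRED`, `ont` and `pacbio` as `SINGLE`), and the `pacbio_clr` entry is a pure assumption. Check each library before relying on it.

```python
import csv
import io

# Assumed defaults, inferred from the example rows in this change only.
# They are NOT a rule stated by the pipeline; "pacbio_clr" is a guess.
ASSUMED_LAYOUTS = {
    "hic": "PAIRED",
    "illumina": "PAIRED",
    "ont": "SINGLE",
    "pacbio": "SINGLE",
    "pacbio_clr": "SINGLE",
}


def add_library_layout(old_csv_text):
    """Return the samplesheet text with a library_layout column appended."""
    reader = csv.DictReader(io.StringIO(old_csv_text))
    out = io.StringIO()
    fieldnames = list(reader.fieldnames) + ["library_layout"]
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # A KeyError here means the datatype is outside the assumed mapping.
        row["library_layout"] = ASSUMED_LAYOUTS[row["datatype"]]
        writer.writerow(row)
    return out.getvalue()


old = (
    "sample,datatype,datafile\n"
    "mMelMel3,hic,GCA_922984935.2.hic.mMelMel3.cram\n"
    "mMelMel3,ont,GCA_922984935.2.ont.mMelMel3.cram\n"
)
new = add_library_layout(old)
print(new)
```

The sketch fails loudly on an unknown datatype rather than guessing a layout, so that a wrong value does not silently reach the blobDir.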
diff --git a/assets/schema_input.json b/assets/schema_input.json
@@ -23,6 +23,11 @@
                 "type": "string",
                 "pattern": "^\\S+\\.(bam|cram|fa|fa.gz|fasta|fasta.gz|fq|fq.gz|fastq|fastq.gz)$",
                 "errorMessage": "Data file for reads cannot contain spaces and must be BAM/CRAM/FASTQ/FASTA"
+            },
+            "library_layout": {
+                "type": "string",
+                "pattern": "^(SINGLE|PAIRED)$",
+                "errorMessage": "The only valid layouts are SINGLE and PAIRED"
             }
         },
         "required": ["datafile", "datatype", "sample"]

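The `library_layout` pattern added above accepts exactly two upper-case values. As a rough illustration of what the schema will and will not accept, here is a small sketch that reproduces the same check in Python. The function name `is_valid_layout` is hypothetical, and `re.fullmatch` is used in place of the `^...$` anchors because Python's `$` also matches just before a trailing newline.

```python
import re

# Same alternatives as the schema pattern "^(SINGLE|PAIRED)$".
LAYOUT_PATTERN = re.compile(r"(SINGLE|PAIRED)")


def is_valid_layout(value):
    """Return True only if the whole string is SINGLE or PAIRED."""
    return LAYOUT_PATTERN.fullmatch(value) is not None


for candidate in ("PAIRED", "SINGLE", "paired", "PAIRED ", ""):
    print(repr(candidate), is_valid_layout(candidate))
```

Note that `library_layout` is not added to the schema's `"required"` list in this hunk, whereas `bin/check_samplesheet.py` does treat the column as required.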
diff --git a/assets/test/samplesheet.csv b/assets/test/samplesheet.csv
@@ -1,5 +1,5 @@
-sample,datatype,datafile
-mMelMel3,hic,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/hic/GCA_922984935.2.subset.unmasked.hic.mMelMel3.cram
-mMelMel1,illumina,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/illumina/GCA_922984935.2.subset.unmasked.illumina.mMelMel1.cram
-mMelMel2,illumina,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/illumina/GCA_922984935.2.subset.unmasked.illumina.mMelMel2.cram
-mMelMel3,ont,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/ont/GCA_922984935.2.subset.unmasked.ont.mMelMel3.cram
+sample,datatype,datafile,library_layout
+mMelMel3,hic,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/hic/GCA_922984935.2.subset.unmasked.hic.mMelMel3.cram,PAIRED
+mMelMel1,illumina,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/illumina/GCA_922984935.2.subset.unmasked.illumina.mMelMel1.cram,PAIRED
+mMelMel2,illumina,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/illumina/GCA_922984935.2.subset.unmasked.illumina.mMelMel2.cram,PAIRED
+mMelMel3,ont,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/ont/GCA_922984935.2.subset.unmasked.ont.mMelMel3.cram,SINGLE
diff --git a/assets/test/samplesheet_raw.csv b/assets/test/samplesheet_raw.csv
@@ -1,4 +1,4 @@
-sample,datatype,datafile
-mMelMel1,illumina,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/genomic_data/mMelMel1/illumina/31231_3#1_subset.cram
-mMelMel2,illumina,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/genomic_data/mMelMel2/illumina/31231_4#1_subset.cram
-mMelMel3,hic,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/genomic_data/mMelMel3/hic-arima2/35528_2#1_subset.cram
+sample,datatype,datafile,library_layout
+mMelMel1,illumina,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/genomic_data/mMelMel1/illumina/31231_3#1_subset.cram,PAIRED
+mMelMel2,illumina,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/genomic_data/mMelMel2/illumina/31231_4#1_subset.cram,PAIRED
+mMelMel3,hic,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/genomic_data/mMelMel3/hic-arima2/35528_2#1_subset.cram,PAIRED
diff --git a/assets/test/samplesheet_s3.csv b/assets/test/samplesheet_s3.csv
@@ -1,5 +1,5 @@
-sample,datatype,datafile
-mMelMel3,hic,https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/hic/GCA_922984935.2.subset.unmasked.hic.mMelMel3.cram
-mMelMel1,illumina,https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/illumina/GCA_922984935.2.subset.unmasked.illumina.mMelMel1.cram
-mMelMel2,illumina,https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/illumina/GCA_922984935.2.subset.unmasked.illumina.mMelMel2.cram
-mMelMel3,ont,https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/ont/GCA_922984935.2.subset.unmasked.ont.mMelMel3.cram
+sample,datatype,datafile,library_layout
+mMelMel3,hic,https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/hic/GCA_922984935.2.subset.unmasked.hic.mMelMel3.cram,PAIRED
+mMelMel1,illumina,https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/illumina/GCA_922984935.2.subset.unmasked.illumina.mMelMel1.cram,PAIRED
+mMelMel2,illumina,https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/illumina/GCA_922984935.2.subset.unmasked.illumina.mMelMel2.cram,PAIRED
+mMelMel3,ont,https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/ont/GCA_922984935.2.subset.unmasked.ont.mMelMel3.cram,SINGLE
diff --git a/assets/test_full/full_samplesheet.csv b/assets/test_full/full_samplesheet.csv
@@ -1,3 +1,3 @@
-sample,datatype,datafile
-gfLaeSulp1,hic,/lustre/scratch123/tol/resources/nextflow/test-data/Laetiporus_sulphureus/analysis/gfLaeSulp1.1/read_mapping/hic/GCA_927399515.1.unmasked.hic.gfLaeSulp1.cram
-gfLaeSulp1,pacbio,/lustre/scratch123/tol/resources/nextflow/test-data/Laetiporus_sulphureus/analysis/gfLaeSulp1.1/read_mapping/pacbio/GCA_927399515.1.unmasked.pacbio.gfLaeSulp1.cram
+sample,datatype,datafile,library_layout
+gfLaeSulp1,hic,/lustre/scratch123/tol/resources/nextflow/test-data/Laetiporus_sulphureus/analysis/gfLaeSulp1.1/read_mapping/hic/GCA_927399515.1.unmasked.hic.gfLaeSulp1.cram,PAIRED
+gfLaeSulp1,pacbio,/lustre/scratch123/tol/resources/nextflow/test-data/Laetiporus_sulphureus/analysis/gfLaeSulp1.1/read_mapping/pacbio/GCA_927399515.1.unmasked.pacbio.gfLaeSulp1.cram,SINGLE
diff --git a/bin/check_samplesheet.py b/bin/check_samplesheet.py
@@ -45,11 +45,17 @@ class RowChecker:
         "ont",
     )
 
+    VALID_LAYOUTS = (
+        "SINGLE",
+        "PAIRED",
+    )
+
     def __init__(
         self,
         sample_col="sample",
         type_col="datatype",
         file_col="datafile",
+        layout_col="library_layout",
         **kwargs,
     ):
         """
@@ -62,11 +68,14 @@ def __init__(
                 the read data (default "datatype").
             file_col (str): The name of the column that contains the file path for
                 the read data (default "datafile").
+            layout_col (str): The name of the column that contains the layout of the
+                library, i.e. "PAIRED" or "SINGLE" (default "library_layout").
         """
         super().__init__(**kwargs)
         self._sample_col = sample_col
         self._type_col = type_col
         self._file_col = file_col
+        self._layout_col = layout_col
         self._seen = set()
         self.modified = []
 
@@ -82,6 +91,7 @@ def validate_and_transform(self, row):
         self._validate_sample(row)
         self._validate_type(row)
         self._validate_file(row)
+        self._validate_layout(row)
         self._seen.add((row[self._sample_col], row[self._file_col]))
         self.modified.append(row)
 
@@ -94,7 +104,7 @@ def _validate_sample(self, row):
 
     def _validate_type(self, row):
         """Assert that the data type matches expected values."""
-        if not any(row[self._type_col] for datatype in self.VALID_DATATYPES):
+        if row[self._type_col] not in self.VALID_DATATYPES:
             raise AssertionError(
                 f"The datatype is unrecognized: {row[self._type_col]}\n"
                 f"It should be one of: {', '.join(self.VALID_DATATYPES)}"
@@ -114,6 +124,14 @@ def _validate_data_format(self, filename):
                 f"It should be one of: {', '.join(self.VALID_FORMATS)}"
             )
 
+    def _validate_layout(self, row):
+        """Assert that the library layout matches expected values."""
+        if row[self._layout_col] not in self.VALID_LAYOUTS:
+            raise AssertionError(
+                f"The library layout is unrecognized: {row[self._layout_col]}\n"
+                f"It should be one of: {', '.join(self.VALID_LAYOUTS)}"
+            )
+
     def validate_unique_samples(self):
         """
         Assert that the combination of sample name and aligned filename is unique.
@@ -178,7 +196,7 @@ def check_samplesheet(file_in, file_out):
         This function checks that the samplesheet follows the following structure,
         see also the `blobtoolkit samplesheet`_::
 
-        sample,datatype,datafile
+        sample,datatype,datafile,library_layout
         sample1,hic,/path/to/file1.cram
         sample1,pacbio,/path/to/file2.cram
         sample1,ont,/path/to/file3.cram
@@ -187,7 +205,7 @@ def check_samplesheet(file_in, file_out):
         https://raw.githubusercontent.com/sanger-tol/blobtoolkit/main/assets/test/samplesheet.csv
 
     """
-    required_columns = {"sample", "datatype", "datafile"}
+    required_columns = {"sample", "datatype", "datafile", "library_layout"}
     # See https://docs.python.org/3.9/library/csv.html#id3 to read up on `newline=""`.
     with file_in.open(newline="") as in_handle:
         reader = csv.DictReader(in_handle, dialect=sniff_format(in_handle))
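
The `_validate_type` fix in this diff is easy to overlook, so here is a small stand-alone sketch of why the old condition could not reject anything but an empty value. The two function names are hypothetical, and the `VALID_DATATYPES` tuple is reconstructed from the controlled vocabulary listed in the README rather than copied from the script.

```python
# Reconstructed from the README's controlled vocabulary (an assumption;
# the full tuple is not visible in this diff).
VALID_DATATYPES = ("hic", "illumina", "ont", "pacbio", "pacbio_clr")


def old_check_passes(value):
    # The generator yields `value` itself once per datatype and never
    # compares it with `datatype`, so any non-empty string is accepted.
    return any(value for datatype in VALID_DATATYPES)


def new_check_passes(value):
    # Plain membership test, as in the corrected code.
    return value in VALID_DATATYPES


for candidate in ("ont", "nanopore", ""):
    print(repr(candidate), old_check_passes(candidate), new_check_passes(candidate))
```

With the old condition a typo such as `nanopore` passes validation. The membership test rejects it, and the new `_validate_layout` method uses the same form.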