From 094ca56c8ba17e9c1e707f45cea683d1eb3289b5 Mon Sep 17 00:00:00 2001
From: zxBIB Schcolnicov
Date: Thu, 8 Aug 2024 17:38:27 +0200
Subject: [PATCH 01/19] Started adding samplesheet validator

---
 CHANGELOG.md | 1 +
 .../local/samplesheet_validator/Dockerfile | 20 ++++
 modules/local/samplesheet_validator/README.md | 84 +++++++++++++++
 modules/local/samplesheet_validator/main.nf | 100 ++++++++++++++++++
 modules/local/samplesheet_validator/meta.yml | 37 +++++++
 .../samplesheet_validator/tests/main.nf.test | 51 +++++++++
 .../tests/nextflow.config | 1 +
 .../validate_samplesheet.py | 37 +++++++
 nextflow.config | 5 +-
 workflows/demultiplex.nf | 14 +++
 10 files changed, 349 insertions(+), 1 deletion(-)
 create mode 100644 modules/local/samplesheet_validator/Dockerfile
 create mode 100644 modules/local/samplesheet_validator/README.md
 create mode 100644 modules/local/samplesheet_validator/main.nf
 create mode 100644 modules/local/samplesheet_validator/meta.yml
 create mode 100644 modules/local/samplesheet_validator/tests/main.nf.test
 create mode 100644 modules/local/samplesheet_validator/tests/nextflow.config
 create mode 100644 modules/local/samplesheet_validator/validate_samplesheet.py

diff --git a/CHANGELOG.md b/CHANGELOG.md
index c87fb4e2..d0a5f9ba 100755
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -18,6 +18,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - [#220](https://github.com/nf-core/demultiplex/pull/220) Added kraken2.
 - [#221](https://github.com/nf-core/demultiplex/pull/221) Added checkqc_config to pipeline schema.
 - [#225](https://github.com/nf-core/demultiplex/pull/225) Added test profile for multi-lane samples, updated handling of such samples and adapter trimming.
+- [#TBD](https://github.com/nf-core/demultiplex/pull/TBD) Added module for samplesheet validation.
### `Changed` diff --git a/modules/local/samplesheet_validator/Dockerfile b/modules/local/samplesheet_validator/Dockerfile new file mode 100644 index 00000000..97a503bf --- /dev/null +++ b/modules/local/samplesheet_validator/Dockerfile @@ -0,0 +1,20 @@ +# Use an official Python runtime as a parent image +FROM python:3.10-slim + +# Set the working directory in the container +WORKDIR /usr/src/app + +# Install Samshee and any additional Python packages +RUN pip install --no-cache-dir samshee + +# Copy the validation script into the container's PATH +COPY validate_samplesheet.py /usr/local/bin/ + +# Make sure the script is executable +RUN chmod +x /usr/local/bin/validate_samplesheet.py + +# Make sure bash is available +RUN apt-get update && apt-get install -y bash + +# Set the entry point to bash, so you can run commands interactively +ENTRYPOINT ["/bin/bash"] \ No newline at end of file diff --git a/modules/local/samplesheet_validator/README.md b/modules/local/samplesheet_validator/README.md new file mode 100644 index 00000000..16190bad --- /dev/null +++ b/modules/local/samplesheet_validator/README.md @@ -0,0 +1,84 @@ +# Guide to Writing a `validation.json` Schema File + +## Introduction + +A JSON schema defines the structure and constraints of JSON data. This guide will help you create a `validation.json` schema file for use with Samshee to perform additional checks on Illumina® Sample Sheet v2 files. + +## JSON Schema Basics + +JSON Schema is a powerful tool for validating the structure of JSON data. It allows you to specify required fields, data types, and constraints. Here are some common components: + +- **`$schema`**: Declares the JSON Schema version being used. +- **`type`**: Specifies the data type (e.g., `object`, `array`, `string`, `number`). +- **`properties`**: Defines the properties of an object and their constraints. +- **`required`**: Lists properties that must be present in the object. +- **`items`**: Specifies the schema for items in an array. 
+ +## Example Schema + +Here’s an example of a `validation.json` schema file for an Illumina® Sample Sheet: + +```json +{ + "$schema": "http://json-schema.org/draft-07/schema#", + "type": "object", + "properties": { + "Header": { + "type": "object", + "properties": { + "InvestigatorName": { + "type": "string" + }, + "ExperimentName": { + "type": "string" + } + }, + "required": ["InvestigatorName", "ExperimentName"] + }, + "Reads": { + "type": "object", + "properties": { + "Read1": { + "type": "integer", + "minimum": 1 + }, + "Read2": { + "type": "integer", + "minimum": 1 + } + }, + "required": ["Read1", "Read2"] + }, + "BCLConvert": { + "type": "object", + "properties": { + "Index": { + "type": "string", + "pattern": "^[ACGT]{8}$" // Example pattern for 8-base indices + } + } + } + }, + "required": ["Header", "Reads"] +} +``` + +### Explanation of the Example + +- **`$schema`**: Specifies the JSON Schema version (draft-07). +- **`type`**: Defines the main type as `object`. +- **`properties`**: Lists the properties of the object: +- **`Header`**: An object with required `InvestigatorName` and `ExperimentName` fields. +- **`Reads`**: An object with required `Read1` and `Read2` fields that must be integers greater than or equal to 1. +- **`BCLConvert`**: An object with an optional `Index` field that must be a string matching a pattern for 8-base indices. +- **`required`**: Lists required properties at the top level. + +### Tips for Writing JSON Schemas + +1. **Start Simple**: Begin with basic constraints and gradually add complexity. +2. **Use Online Validators**: Validate your schema using online tools to ensure it adheres to the JSON Schema specification. +3. **Refer to Schema Documentation**: Consult the [JSON Schema documentation](https://json-schema.org/) for detailed guidance. + +### Conclusion + +By defining a JSON schema, you can enforce specific rules and ensure that your Illumina® Sample Sheet v2 files meet your required structure and constraints. 
Use this guide to create and validate your `validation.json` schema files effectively. diff --git a/modules/local/samplesheet_validator/main.nf b/modules/local/samplesheet_validator/main.nf new file mode 100644 index 00000000..8bd9fb31 --- /dev/null +++ b/modules/local/samplesheet_validator/main.nf @@ -0,0 +1,100 @@ +process SAMPLESHEET_VALIDATOR { + tag {"$meta.id"} + label 'process_low' + + container "nschcolnicov/samshee:latest" //TODO replace with nf-core container + + input: + tuple val(meta), path(samplesheet) + path (validator_schema) + + // output: //Module is meant to crash pipeline if validation fails, output is not needed + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + def args2 = task.ext.args2 ?: '' + def args3 = task.ext.args3 ?: '' + def arg_validator_schema = validator_schema ? "${validator_schema}" : "" + """ + validate_samplesheet.py "${samplesheet}" "${arg_validator_schema}" + """ + + stub: + """ + #Generate minimal samplesheet + cat <<-END_SAMPLE_SHEET > minimal_samplesheet.csv + [Header] + FileFormatVersion,2 + RunName,Run_001 + Instrument Type,NextSeq 1000 + InstrumentPlatform,NextSeq 1000 + + [Reads] + Read1Cycles,150 + Read2Cycles,150 + Index1Cycles,8 + Index2Cycles,8 + + [Settings] + + [Data] + Sample_ID,Sample_Name,Description,Sample_Project + Sample1,Sample1,, + END_SAMPLE_SHEET + + + + #Generate minimal schema validator file + cat <<-END_SCHEMA > minimal_schema.json + { + "type": "object", + "properties": { + "Header": { + "type": "object", + "properties": { + "FileFormatVersion": { "type": "integer" }, + "RunName": { "type": "string" }, + "Instrument Type": { "type": "string" }, + "InstrumentPlatform": { "type": "string" } + }, + "required": ["FileFormatVersion", "RunName", "Instrument Type", "InstrumentPlatform"] + }, + "Reads": { + "type": "object", + "properties": { + "Read1Cycles": { "type": "integer" }, + "Read2Cycles": { "type": "integer" }, + "Index1Cycles": { "type": "integer" }, 
+ "Index2Cycles": { "type": "integer" } + }, + "required": ["Read1Cycles", "Read2Cycles", "Index1Cycles", "Index2Cycles"] + }, + "Settings": { + "type": "object" + }, + "Data": { + "type": "array", + "items": { + "type": "object", + "properties": { + "Sample_ID": { "type": "string" }, + "Sample_Name": { "type": "string" }, + "Description": { "type": "string" }, + "Sample_Project": { "type": "string" } + }, + "required": ["Sample_ID", "Sample_Name", "Description", "Sample_Project"] + } + } + }, + "required": ["Header", "Reads", "Settings", "Data"] + } + END_SCHEMA + + #Run command + validate_samplesheet.py minimal_samplesheet.csv minimal_schema.json + + """ +} diff --git a/modules/local/samplesheet_validator/meta.yml b/modules/local/samplesheet_validator/meta.yml new file mode 100644 index 00000000..5c1cb27a --- /dev/null +++ b/modules/local/samplesheet_validator/meta.yml @@ -0,0 +1,37 @@ +name: samplesheet_validator +description: Module to validate illumina® Sample Sheet v2 files. +keywords: + - samplesheet + - illumina + - bclconvert + - bcl2fastq +tools: + - samshee: + description: A schema-agnostic parser and writer for illumina® sample sheets v2 and similar documents. + homepage: https://github.com/lit-regensburg/samshee + documentation: https://github.com/lit-regensburg/samshee/blob/main/README.md + tool_dev_url: https://github.com/lit-regensburg/samshee + licence: [MIT license] +input: + - meta: + type: map + description: | + Groovy Map containing sample information + e.g. 
[ id:'test', lane:1 ] + - samplesheet: + type: file + description: "illumina v2 samplesheet" + pattern: "*.{csv}" +output: + - fastq: + type: file + description: Unaligned FastQ files + pattern: "*.fastq.gz" + - versions: + type: file + description: File containing software version + pattern: "versions.yml" +authors: + - "@nschcolnicov" +maintainers: + - "@nschcolnicov" diff --git a/modules/local/samplesheet_validator/tests/main.nf.test b/modules/local/samplesheet_validator/tests/main.nf.test new file mode 100644 index 00000000..b546b897 --- /dev/null +++ b/modules/local/samplesheet_validator/tests/main.nf.test @@ -0,0 +1,51 @@ +// nf-core modules test cellranger/mkfastq +nextflow_process { + + name "Test Process SAMPLESHEET_VALIDATOR" + script "../main.nf" + config "./nextflow.config" + process "SAMPLESHEET_VALIDATOR" + + tag "modules" + + test("test samplesheet") { + + when { + process { + """ + input[0] = [ [ id: 'test', lane:1 ], file("https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/testdata/NextSeq2000/SampleSheet.csv", checkIfExists: true) ] + input[1] = [] + """ + } + } + + then { + assertAll( + { assert process.success } + ) + } + + } + + test("stub") { + + options "-stub" + + when { + process { + """ + input[0] = [ [ id: 'test', lane:1 ], file("https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/testdata/NextSeq2000/SampleSheet.csv", checkIfExists: true) ] + input[1] = [] + """ + } + } + + then { + assertAll( + { assert process.success }, + ) + } + + } + +} diff --git a/modules/local/samplesheet_validator/tests/nextflow.config b/modules/local/samplesheet_validator/tests/nextflow.config new file mode 100644 index 00000000..48c63506 --- /dev/null +++ b/modules/local/samplesheet_validator/tests/nextflow.config @@ -0,0 +1 @@ +singularity.registry = '' //TODO for testing in BI, remove \ No newline at end of file diff --git a/modules/local/samplesheet_validator/validate_samplesheet.py 
b/modules/local/samplesheet_validator/validate_samplesheet.py new file mode 100644 index 00000000..5df1ef6d --- /dev/null +++ b/modules/local/samplesheet_validator/validate_samplesheet.py @@ -0,0 +1,37 @@ +#!/usr/bin/env python3 + +from samshee.samplesheetv2 import read_samplesheetv2 +from samshee.validation import illuminasamplesheetv2schema, illuminasamplesheetv2logic, validate +import json +import sys + +def validate_samplesheet(filename, custom_schema_file=None): + # Load the custom schema if provided + if custom_schema_file: + with open(custom_schema_file, 'r') as f: + custom_schema = json.load(f) + custom_validator = lambda doc: validate(doc, custom_schema) + else: + custom_validator = None + + # Prepare the list of validators + validators = [illuminasamplesheetv2schema, illuminasamplesheetv2logic] + if custom_validator: + validators.append(custom_validator) + + # Read and validate the sample sheet + try: + sheet = read_samplesheetv2(filename, validation=validators) + print(f"Validation successful for {filename}") + except Exception as e: + print(f"Validation failed: {e}") + +if __name__ == "__main__": + if len(sys.argv) < 2 or len(sys.argv) > 3: + print("Usage: validate_samplesheet.py [custom_schema.json]") + sys.exit(1) + + samplesheet_file = sys.argv[1] + schema_file = sys.argv[2] if len(sys.argv) == 3 else None + + validate_samplesheet(samplesheet_file, schema_file) diff --git a/nextflow.config b/nextflow.config index 1e1ab763..78047a1b 100755 --- a/nextflow.config +++ b/nextflow.config @@ -18,7 +18,7 @@ params { remove_adapter = true // [true, false] // Options: tooling - skip_tools = [] // list [fastp, fastqc, kraken, multiqc, checkqc, falco, md5sum] + skip_tools = [] // list [fastp, fastqc, kraken, multiqc, checkqc, falco, md5sum, samplesheet_validator] // seqtk sample options sample_size = 100000 @@ -28,6 +28,9 @@ params { // Options: CheckQC checkqc_config = [] // file .yaml + // Options: Illumina samplesheet validator + validator_schema = null // 
file .json + // MultiQC options multiqc_config = null multiqc_title = null diff --git a/workflows/demultiplex.nf b/workflows/demultiplex.nf index 35622991..aed2a52a 100644 --- a/workflows/demultiplex.nf +++ b/workflows/demultiplex.nf @@ -27,6 +27,11 @@ include { UNTAR as UNTAR_FLOWCELL } from '../modules/nf-core/untar/main' include { UNTAR as UNTAR_KRAKEN_DB } from '../modules/nf-core/untar/main' include { MD5SUM } from '../modules/nf-core/md5sum/main' +// +// MODULE: Local modules +// +include { SAMPLESHEET_VALIDATOR } from '../modules/local/samplesheet_validator/main' + // // FUNCTION // @@ -60,6 +65,7 @@ workflow DEMULTIPLEX { ch_multiqc_files = Channel.empty() ch_multiqc_reports = Channel.empty() checkqc_config = params.checkqc_config ? Channel.fromPath(params.checkqc_config, checkIfExists: true) : [] // file checkqc_config.yaml + ch_validator_schema = params.validator_schema ? Channel.fromPath(params.validator_schema, checkIfExists: true) : [] // file validator_schema.json // Remove adapter from Illumina samplesheet to avoid adapter trimming in demultiplexer tools if (params.remove_adapter && (params.demultiplexer in ["bcl2fastq", "bclconvert", "mkfastq"])) { @@ -84,6 +90,14 @@ workflow DEMULTIPLEX { } } + // RUN samplesheet_validator + if (!("samplesheet_validator" in skip_tools)){ + SAMPLESHEET_VALIDATOR ( + ch_samplesheet.map{ meta, samplesheet, flowcell, lane -> [meta,samplesheet] }, + ch_validator_schema + ) + } + // Convenience ch_samplesheet.dump(tag: 'DEMULTIPLEX::inputs', {FormattingService.prettyFormat(it)}) From 5030413785212968c3fd3f10f3a91aec7f47aeba Mon Sep 17 00:00:00 2001 From: zxBIB Schcolnicov Date: Thu, 8 Aug 2024 18:51:15 +0200 Subject: [PATCH 02/19] Fixed module execution --- conf/test.config | 1 + modules/local/samplesheet_validator/main.nf | 29 +++++++++++++++---- .../samplesheet_validator/tests/main.nf.test | 6 ++-- workflows/demultiplex.nf | 3 +- 4 files changed, 28 insertions(+), 11 deletions(-) diff --git a/conf/test.config 
b/conf/test.config index bfea6a82..45467be6 100755 --- a/conf/test.config +++ b/conf/test.config @@ -22,6 +22,7 @@ params { // Input data input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/flowcell_input.csv' demultiplexer = 'bclconvert' + skip_tools = 'samplesheet_validator' } diff --git a/modules/local/samplesheet_validator/main.nf b/modules/local/samplesheet_validator/main.nf index 8bd9fb31..3546e50a 100644 --- a/modules/local/samplesheet_validator/main.nf +++ b/modules/local/samplesheet_validator/main.nf @@ -5,8 +5,7 @@ process SAMPLESHEET_VALIDATOR { container "nschcolnicov/samshee:latest" //TODO replace with nf-core container input: - tuple val(meta), path(samplesheet) - path (validator_schema) + tuple val(meta), path(samplesheet), path (validator_schema) // output: //Module is meant to crash pipeline if validation fails, output is not needed @@ -19,7 +18,18 @@ process SAMPLESHEET_VALIDATOR { def args3 = task.ext.args3 ?: '' def arg_validator_schema = validator_schema ? "${validator_schema}" : "" """ - validate_samplesheet.py "${samplesheet}" "${arg_validator_schema}" + # Run validation command and capture output + output=\$(validate_samplesheet.py "${samplesheet}" "${arg_validator_schema}" 2>&1) + status=\$? + + # Check if validation failed + if echo "\$output" | grep -q "Validation failed:"; then + echo "\$output" # Print output for debugging + exit 1 # Fail the process if validation failed + fi + + # If no validation errors, process exits with status 0 + exit \$status """ stub: @@ -93,8 +103,17 @@ process SAMPLESHEET_VALIDATOR { } END_SCHEMA - #Run command - validate_samplesheet.py minimal_samplesheet.csv minimal_schema.json + # Run validation command and capture output + output=\$(validate_samplesheet.py minimal_samplesheet.csv minimal_schema.json 2>&1) + status=\$? 
+ + # Check if validation failed + if echo "\$output" | grep -q "Validation failed:"; then + echo "\$output" # Print output for debugging + exit 1 # Fail the process if validation failed + fi + # If no validation errors, process exits with status 0 + exit \$status """ } diff --git a/modules/local/samplesheet_validator/tests/main.nf.test b/modules/local/samplesheet_validator/tests/main.nf.test index b546b897..7e95c72e 100644 --- a/modules/local/samplesheet_validator/tests/main.nf.test +++ b/modules/local/samplesheet_validator/tests/main.nf.test @@ -13,8 +13,7 @@ nextflow_process { when { process { """ - input[0] = [ [ id: 'test', lane:1 ], file("https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/testdata/NextSeq2000/SampleSheet.csv", checkIfExists: true) ] - input[1] = [] + input[0] = [ [ id: 'test', lane:1 ], file("https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/testdata/NextSeq2000/SampleSheet.csv", checkIfExists: true), [] ] """ } } @@ -34,8 +33,7 @@ nextflow_process { when { process { """ - input[0] = [ [ id: 'test', lane:1 ], file("https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/testdata/NextSeq2000/SampleSheet.csv", checkIfExists: true) ] - input[1] = [] + input[0] = [ [ id: 'test', lane:1 ], file("https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/testdata/NextSeq2000/SampleSheet.csv", checkIfExists: true), [] ] """ } } diff --git a/workflows/demultiplex.nf b/workflows/demultiplex.nf index aed2a52a..a49384e3 100644 --- a/workflows/demultiplex.nf +++ b/workflows/demultiplex.nf @@ -93,8 +93,7 @@ workflow DEMULTIPLEX { // RUN samplesheet_validator if (!("samplesheet_validator" in skip_tools)){ SAMPLESHEET_VALIDATOR ( - ch_samplesheet.map{ meta, samplesheet, flowcell, lane -> [meta,samplesheet] }, - ch_validator_schema + ch_samplesheet.map{ meta, samplesheet, flowcell, lane -> [meta,samplesheet] }.combine( ch_validator_schema ) ) } From 0f38381559d2c05c2c0bc89c26de3194327ba030 Mon Sep 17 
00:00:00 2001
From: zxBIB Schcolnicov
Date: Thu, 8 Aug 2024 20:29:39 +0200
Subject: [PATCH 03/19] Updated docs, schema and module

---
 docs/usage.md | 6 +++
 modules/local/samplesheet_validator/main.nf | 3 +-
 .../samplesheet_validator/tests/main.nf.test | 4 +-
 nextflow.config | 7 ++--
 nextflow_schema.json | 41 +++++++++++++++----
 workflows/demultiplex.nf | 13 +++---
 6 files changed, 55 insertions(+), 19 deletions(-)

diff --git a/docs/usage.md b/docs/usage.md
index 8af3d823..78acc275 100755
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -204,6 +204,12 @@ To learn how to provide additional arguments to a particular tool of the pipelin
 
 The trimming process in our demultiplexing pipeline has been updated to ensure compatibility with 10x Genomics recommendations. By default, trimming in the pipeline is performed using fastp, which reliably auto-detects and removes adapter sequences without the need for storing adapter sequences. As users can also supply adapter sequences in a samplesheet and thereby triggering trimming in any `bcl2fastq` or `bclconvert` subworkflows, we have added a new parameter, `remove_adapter`, which is set to true by default. When `remove_adapter` is true, the pipeline automatically removes any adapter sequences listed in the `[Settings]` section of the Illumina sample sheet, replacing them with an empty string in order to not provoke this behaviour. This approach aligns with 10x Genomics' guidelines, as they advise against pre-processing FASTQ reads before inputting them into their software pipelines. If the `remove_adapter` setting is true but no adapter is removed, a warning will be displayed; however, this does not necessarily indicate an error, as some sample sheets may already lack these adapter sequences. Users can disable this behavior by setting `--remove_adapter false` in the command line, though this is not recommended.
 
+## Samplesheet validator (samshee)
+
+The Samplesheet validator (samshee) module ensures the integrity of Illumina v2 Sample Sheets by allowing users to apply custom validation rules. The module can be used together with the parameter `--validator_schema`, which accepts a JSON schema validator file. Users can specify this file to enforce additional validation rules beyond the default ones provided by the tool. To use this feature, simply provide the path to the JSON schema validator file via the `--validator_schema` parameter in the pipeline configuration. This enables tailored validation of Sample Sheets to meet specific requirements or standards relevant to your sequencing workflow. For more information about the tool or how to write the schema JSON file, please refer to [Samshee on GitHub](https://github.com/lit-regensburg/samshee).
diff --git a/modules/local/samplesheet_validator/main.nf b/modules/local/samplesheet_validator/main.nf index 3546e50a..6efe67f7 100644 --- a/modules/local/samplesheet_validator/main.nf +++ b/modules/local/samplesheet_validator/main.nf @@ -5,7 +5,8 @@ process SAMPLESHEET_VALIDATOR { container "nschcolnicov/samshee:latest" //TODO replace with nf-core container input: - tuple val(meta), path(samplesheet), path (validator_schema) + tuple val(meta), path(samplesheet) + path(validator_schema) //optional // output: //Module is meant to crash pipeline if validation fails, output is not needed diff --git a/modules/local/samplesheet_validator/tests/main.nf.test b/modules/local/samplesheet_validator/tests/main.nf.test index 7e95c72e..6e9592b4 100644 --- a/modules/local/samplesheet_validator/tests/main.nf.test +++ b/modules/local/samplesheet_validator/tests/main.nf.test @@ -13,7 +13,8 @@ nextflow_process { when { process { """ - input[0] = [ [ id: 'test', lane:1 ], file("https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/testdata/NextSeq2000/SampleSheet.csv", checkIfExists: true), [] ] + input[0] = [ [ id: 'test', lane:1 ], file("https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/testdata/NextSeq2000/SampleSheet.csv", checkIfExists: true) ] + input[1] = [] """ } } @@ -34,6 +35,7 @@ nextflow_process { process { """ input[0] = [ [ id: 'test', lane:1 ], file("https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/testdata/NextSeq2000/SampleSheet.csv", checkIfExists: true), [] ] + input[1] = [] """ } } diff --git a/nextflow.config b/nextflow.config index 78047a1b..48ddf192 100755 --- a/nextflow.config +++ b/nextflow.config @@ -24,12 +24,13 @@ params { sample_size = 100000 // Kraken2 options - kraken_db = null // file .tar.gz + kraken_db = [] // file .tar.gz + // Options: CheckQC - checkqc_config = [] // file .yaml + checkqc_config = [] // file .yaml // Options: Illumina samplesheet validator - validator_schema = null // file .json + 
validator_schema = null // file .json // MultiQC options multiqc_config = null diff --git a/nextflow_schema.json b/nextflow_schema.json index 9db9ac30..0d21c7e0 100644 --- a/nextflow_schema.json +++ b/nextflow_schema.json @@ -19,7 +19,7 @@ "skip_tools": { "type": "string", "default": "[]", - "description": "Comma-separated list of tools to skip (fastp,falco,multiqc)" + "description": "Comma-separated list of tools to skip (fastp,fastqc,kraken,multiqc,checkqc,falco,md5sum,samplesheet_validator)" }, "sample_size": { "type": "integer", @@ -29,8 +29,14 @@ "kraken_db": { "type": "string", "format": "path", - "default": null, - "description": "path to Kraken2 DB to use for screening" + "default": "None", + "description": "Path to Kraken2 DB to use for screening" + }, + "validator_schema": { + "type": "string", + "format": "file-path", + "default": "None", + "description": "Path to Illumina v2 samplesheet validator .json file" } } }, @@ -39,7 +45,10 @@ "type": "object", "fa_icon": "fas fa-terminal", "description": "Define where the pipeline should find input data and save output data.", - "required": ["input", "outdir"], + "required": [ + "input", + "outdir" + ], "properties": { "input": { "type": "string", @@ -77,11 +86,20 @@ "type": "object", "fa_icon": "fas fa-microscope", "description": "Options for demultiplexing.", - "required": ["demultiplexer"], + "required": [ + "demultiplexer" + ], "properties": { "demultiplexer": { "type": "string", - "enum": ["bases2fastq", "bcl2fastq", "bclconvert", "fqtk", "sgdemux", "mkfastq"], + "enum": [ + "bases2fastq", + "bcl2fastq", + "bclconvert", + "fqtk", + "sgdemux", + "mkfastq" + ], "description": "Demultiplexer to use.", "fa_icon": "fas fa-microscope", "default": "bclconvert" @@ -210,7 +228,14 @@ "description": "Method used to save pipeline results to output directory.", "help_text": "The Nextflow `publishDir` option specifies which intermediate files should be saved to the output directory. 
This option tells the pipeline what method should be used to move these files. See [Nextflow docs](https://www.nextflow.io/docs/latest/process.html#publishdir) for details.", "fa_icon": "fas fa-copy", - "enum": ["symlink", "rellink", "link", "copy", "copyNoFollow", "move"], + "enum": [ + "symlink", + "rellink", + "link", + "copy", + "copyNoFollow", + "move" + ], "hidden": true }, "email_on_fail": { @@ -332,4 +357,4 @@ "$ref": "#/definitions/generic_options" } ] -} +} \ No newline at end of file diff --git a/workflows/demultiplex.nf b/workflows/demultiplex.nf index a49384e3..65a8605c 100644 --- a/workflows/demultiplex.nf +++ b/workflows/demultiplex.nf @@ -61,10 +61,10 @@ workflow DEMULTIPLEX { // Channel inputs - ch_versions = Channel.empty() - ch_multiqc_files = Channel.empty() - ch_multiqc_reports = Channel.empty() - checkqc_config = params.checkqc_config ? Channel.fromPath(params.checkqc_config, checkIfExists: true) : [] // file checkqc_config.yaml + ch_versions = Channel.empty() + ch_multiqc_files = Channel.empty() + ch_multiqc_reports = Channel.empty() + checkqc_config = params.checkqc_config ? Channel.fromPath(params.checkqc_config, checkIfExists: true) : [] // file checkqc_config.yaml ch_validator_schema = params.validator_schema ? 
Channel.fromPath(params.validator_schema, checkIfExists: true) : [] // file validator_schema.json // Remove adapter from Illumina samplesheet to avoid adapter trimming in demultiplexer tools @@ -91,9 +91,10 @@ workflow DEMULTIPLEX { } // RUN samplesheet_validator - if (!("samplesheet_validator" in skip_tools)){ + if (!("samplesheet_validator" in skip_tools) && (params.demultiplexer in ["bcl2fastq", "bclconvert", "mkfastq"])){ SAMPLESHEET_VALIDATOR ( - ch_samplesheet.map{ meta, samplesheet, flowcell, lane -> [meta,samplesheet] }.combine( ch_validator_schema ) + ch_samplesheet.map{ meta, samplesheet, flowcell, lane -> [meta,samplesheet] }, + ch_validator_schema ) } From be62c8faaafa2c6036d9617238952d6ae59c1cac Mon Sep 17 00:00:00 2001 From: zxBIB Schcolnicov Date: Thu, 8 Aug 2024 20:38:37 +0200 Subject: [PATCH 04/19] Remove testing line --- modules/local/samplesheet_validator/tests/nextflow.config | 1 - 1 file changed, 1 deletion(-) diff --git a/modules/local/samplesheet_validator/tests/nextflow.config b/modules/local/samplesheet_validator/tests/nextflow.config index 48c63506..e69de29b 100644 --- a/modules/local/samplesheet_validator/tests/nextflow.config +++ b/modules/local/samplesheet_validator/tests/nextflow.config @@ -1 +0,0 @@ -singularity.registry = '' //TODO for testing in BI, remove \ No newline at end of file From e5021943837d92f152a489d5989138d83dfd2f0f Mon Sep 17 00:00:00 2001 From: nschcolnicov Date: Thu, 8 Aug 2024 18:46:42 +0000 Subject: [PATCH 05/19] Ran precommit, updated changelog --- CHANGELOG.md | 2 +- docs/usage.md | 2 -- .../local/samplesheet_validator/Dockerfile | 2 +- modules/local/samplesheet_validator/README.md | 2 +- modules/local/samplesheet_validator/main.nf | 4 +-- .../validate_samplesheet.py | 2 -- nextflow_schema.json | 29 ++++--------------- workflows/demultiplex.nf | 4 +-- 8 files changed, 11 insertions(+), 36 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index d0a5f9ba..aac98a17 100755 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ 
-18,7 +18,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - [#220](https://github.com/nf-core/demultiplex/pull/220) Added kraken2.
 - [#221](https://github.com/nf-core/demultiplex/pull/221) Added checkqc_config to pipeline schema.
 - [#225](https://github.com/nf-core/demultiplex/pull/225) Added test profile for multi-lane samples, updated handling of such samples and adapter trimming.
-- [#TBD](https://github.com/nf-core/demultiplex/pull/TBD) Added module for samplesheet validation.
+- [#234](https://github.com/nf-core/demultiplex/pull/234) Added module for samplesheet validation.
 
 ### `Changed`
 
diff --git a/docs/usage.md b/docs/usage.md
index 78acc275..d7905ec3 100755
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -208,8 +208,6 @@ The trimming process in our demultiplexing pipeline has been updated to ensure c
 
 The Samplesheet validator (samshee) module ensures the integrity of Illumina v2 Sample Sheets by allowing users to apply custom validation rules. The module can be used together with the parameter `--validator_schema`, which accepts a JSON schema validator file. Users can specify this file to enforce additional validation rules beyond the default ones provided by the tool. To use this feature, simply provide the path to the JSON schema validator file via the `--validator_schema` parameter in the pipeline configuration. This enables tailored validation of Sample Sheets to meet specific requirements or standards relevant to your sequencing workflow. For more information about the tool or how to write the schema JSON file, please refer to [Samshee on GitHub](https://github.com/lit-regensburg/samshee).
 
- - ### nf-core/configs In most cases, you will only need to create a custom config as a one-off but if you and others within your organisation are likely to be running nf-core pipelines regularly and need to use the same settings regularly it may be a good idea to request that your custom config file is uploaded to the `nf-core/configs` git repository. Before you do this please can you test that the config file works with your pipeline of choice using the `-c` parameter. You can then create a pull request to the `nf-core/configs` repository with the addition of your config file, associated documentation file (see examples in [`nf-core/configs/docs`](https://github.com/nf-core/configs/tree/master/docs)), and amending [`nfcore_custom.config`](https://github.com/nf-core/configs/blob/master/nfcore_custom.config) to include your custom profile. diff --git a/modules/local/samplesheet_validator/Dockerfile b/modules/local/samplesheet_validator/Dockerfile index 97a503bf..b63a6454 100644 --- a/modules/local/samplesheet_validator/Dockerfile +++ b/modules/local/samplesheet_validator/Dockerfile @@ -17,4 +17,4 @@ RUN chmod +x /usr/local/bin/validate_samplesheet.py RUN apt-get update && apt-get install -y bash # Set the entry point to bash, so you can run commands interactively -ENTRYPOINT ["/bin/bash"] \ No newline at end of file +ENTRYPOINT ["/bin/bash"] diff --git a/modules/local/samplesheet_validator/README.md b/modules/local/samplesheet_validator/README.md index 16190bad..3e8745bc 100644 --- a/modules/local/samplesheet_validator/README.md +++ b/modules/local/samplesheet_validator/README.md @@ -54,7 +54,7 @@ Here’s an example of a `validation.json` schema file for an Illumina® Sample "properties": { "Index": { "type": "string", - "pattern": "^[ACGT]{8}$" // Example pattern for 8-base indices + "pattern": "^[ACGT]{8}$" // Example pattern for 8-base indices } } } diff --git a/modules/local/samplesheet_validator/main.nf b/modules/local/samplesheet_validator/main.nf index 
6efe67f7..e527ab35 100644 --- a/modules/local/samplesheet_validator/main.nf +++ b/modules/local/samplesheet_validator/main.nf @@ -13,7 +13,7 @@ process SAMPLESHEET_VALIDATOR { when: task.ext.when == null || task.ext.when - script: + script: def args = task.ext.args ?: '' def args2 = task.ext.args2 ?: '' def args3 = task.ext.args3 ?: '' @@ -22,7 +22,6 @@ process SAMPLESHEET_VALIDATOR { # Run validation command and capture output output=\$(validate_samplesheet.py "${samplesheet}" "${arg_validator_schema}" 2>&1) status=\$? - # Check if validation failed if echo "\$output" | grep -q "Validation failed:"; then echo "\$output" # Print output for debugging @@ -107,7 +106,6 @@ process SAMPLESHEET_VALIDATOR { # Run validation command and capture output output=\$(validate_samplesheet.py minimal_samplesheet.csv minimal_schema.json 2>&1) status=\$? - # Check if validation failed if echo "\$output" | grep -q "Validation failed:"; then echo "\$output" # Print output for debugging diff --git a/modules/local/samplesheet_validator/validate_samplesheet.py b/modules/local/samplesheet_validator/validate_samplesheet.py index 5df1ef6d..987e3441 100644 --- a/modules/local/samplesheet_validator/validate_samplesheet.py +++ b/modules/local/samplesheet_validator/validate_samplesheet.py @@ -18,7 +18,6 @@ def validate_samplesheet(filename, custom_schema_file=None): validators = [illuminasamplesheetv2schema, illuminasamplesheetv2logic] if custom_validator: validators.append(custom_validator) - # Read and validate the sample sheet try: sheet = read_samplesheetv2(filename, validation=validators) @@ -30,7 +29,6 @@ def validate_samplesheet(filename, custom_schema_file=None): if len(sys.argv) < 2 or len(sys.argv) > 3: print("Usage: validate_samplesheet.py [custom_schema.json]") sys.exit(1) - samplesheet_file = sys.argv[1] schema_file = sys.argv[2] if len(sys.argv) == 3 else None diff --git a/nextflow_schema.json b/nextflow_schema.json index 0d21c7e0..cdc45f10 100644 --- a/nextflow_schema.json +++ 
b/nextflow_schema.json @@ -45,10 +45,7 @@ "type": "object", "fa_icon": "fas fa-terminal", "description": "Define where the pipeline should find input data and save output data.", - "required": [ - "input", - "outdir" - ], + "required": ["input", "outdir"], "properties": { "input": { "type": "string", @@ -86,20 +83,11 @@ "type": "object", "fa_icon": "fas fa-microscope", "description": "Options for demultiplexing.", - "required": [ - "demultiplexer" - ], + "required": ["demultiplexer"], "properties": { "demultiplexer": { "type": "string", - "enum": [ - "bases2fastq", - "bcl2fastq", - "bclconvert", - "fqtk", - "sgdemux", - "mkfastq" - ], + "enum": ["bases2fastq", "bcl2fastq", "bclconvert", "fqtk", "sgdemux", "mkfastq"], "description": "Demultiplexer to use.", "fa_icon": "fas fa-microscope", "default": "bclconvert" @@ -228,14 +216,7 @@ "description": "Method used to save pipeline results to output directory.", "help_text": "The Nextflow `publishDir` option specifies which intermediate files should be saved to the output directory. This option tells the pipeline what method should be used to move these files. 
See [Nextflow docs](https://www.nextflow.io/docs/latest/process.html#publishdir) for details.", "fa_icon": "fas fa-copy", - "enum": [ - "symlink", - "rellink", - "link", - "copy", - "copyNoFollow", - "move" - ], + "enum": ["symlink", "rellink", "link", "copy", "copyNoFollow", "move"], "hidden": true }, "email_on_fail": { @@ -357,4 +338,4 @@ "$ref": "#/definitions/generic_options" } ] -} \ No newline at end of file +} diff --git a/workflows/demultiplex.nf b/workflows/demultiplex.nf index 65a8605c..9506b7a5 100644 --- a/workflows/demultiplex.nf +++ b/workflows/demultiplex.nf @@ -91,8 +91,8 @@ workflow DEMULTIPLEX { } // RUN samplesheet_validator - if (!("samplesheet_validator" in skip_tools) && (params.demultiplexer in ["bcl2fastq", "bclconvert", "mkfastq"])){ - SAMPLESHEET_VALIDATOR ( + if (!("samplesheet_validator" in skip_tools) && (params.demultiplexer in ["bcl2fastq", "bclconvert", "mkfastq"])){ + SAMPLESHEET_VALIDATOR ( ch_samplesheet.map{ meta, samplesheet, flowcell, lane -> [meta,samplesheet] }, ch_validator_schema ) From f7769e908e7ecd65c71a1a6b23401b5bbeedab22 Mon Sep 17 00:00:00 2001 From: zxBIB Schcolnicov Date: Thu, 8 Aug 2024 20:55:02 +0200 Subject: [PATCH 06/19] Reverted change to default kraken_db: --- nextflow.config | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/nextflow.config b/nextflow.config index 48ddf192..fa77a58f 100755 --- a/nextflow.config +++ b/nextflow.config @@ -24,7 +24,7 @@ params { sample_size = 100000 // Kraken2 options - kraken_db = [] // file .tar.gz + kraken_db = null // file .tar.gz // Options: CheckQC checkqc_config = [] // file .yaml From a3153bfaca8ba634b809657e701c2169866a529b Mon Sep 17 00:00:00 2001 From: zxBIB Schcolnicov Date: Thu, 8 Aug 2024 22:09:33 +0200 Subject: [PATCH 07/19] Updated container to wave --- .../validate_samplesheet.py | 0 .../local/samplesheet_validator/Dockerfile | 20 ------------------- modules/local/samplesheet_validator/main.nf | 2 +- 3 files changed, 1 insertion(+), 21 
deletions(-) rename {modules/local/samplesheet_validator => bin}/validate_samplesheet.py (100%) mode change 100644 => 100755 delete mode 100644 modules/local/samplesheet_validator/Dockerfile diff --git a/modules/local/samplesheet_validator/validate_samplesheet.py b/bin/validate_samplesheet.py old mode 100644 new mode 100755 similarity index 100% rename from modules/local/samplesheet_validator/validate_samplesheet.py rename to bin/validate_samplesheet.py diff --git a/modules/local/samplesheet_validator/Dockerfile b/modules/local/samplesheet_validator/Dockerfile deleted file mode 100644 index b63a6454..00000000 --- a/modules/local/samplesheet_validator/Dockerfile +++ /dev/null @@ -1,20 +0,0 @@ -# Use an official Python runtime as a parent image -FROM python:3.10-slim - -# Set the working directory in the container -WORKDIR /usr/src/app - -# Install Samshee and any additional Python packages -RUN pip install --no-cache-dir samshee - -# Copy the validation script into the container's PATH -COPY validate_samplesheet.py /usr/local/bin/ - -# Make sure the script is executable -RUN chmod +x /usr/local/bin/validate_samplesheet.py - -# Make sure bash is available -RUN apt-get update && apt-get install -y bash - -# Set the entry point to bash, so you can run commands interactively -ENTRYPOINT ["/bin/bash"] diff --git a/modules/local/samplesheet_validator/main.nf b/modules/local/samplesheet_validator/main.nf index e527ab35..1176bd01 100644 --- a/modules/local/samplesheet_validator/main.nf +++ b/modules/local/samplesheet_validator/main.nf @@ -2,7 +2,7 @@ process SAMPLESHEET_VALIDATOR { tag {"$meta.id"} label 'process_low' - container "nschcolnicov/samshee:latest" //TODO replace with nf-core container + container "community.wave.seqera.io/library/pip_samshee:9f3c0736b7c44dc8" input: tuple val(meta), path(samplesheet) From 3b5f83e7e7eeb80936728ecf6abecf2ea4ae83eb Mon Sep 17 00:00:00 2001 From: zxBIB Schcolnicov Date: Thu, 8 Aug 2024 22:26:19 +0200 Subject: [PATCH 08/19] Fixed 
tests --- conf/test_bases2fastq.config | 2 +- conf/test_bcl2fastq.config | 4 ++-- conf/test_checkqc.config | 4 ++-- conf/test_fqtk.config | 2 +- conf/test_full.config | 5 ++++- conf/test_kraken.config | 5 ++++- conf/test_mkfastq.config | 3 ++- conf/test_pe.config | 6 ++++-- conf/test_sgdemux.config | 2 +- conf/test_two_lanes.config | 2 +- conf/test_uncompressed.config | 3 ++- tests/pipeline/bcl2fastq.nf.test | 2 +- tests/pipeline/kraken.nf.test | 2 +- tests/pipeline/skip_tools.nf.test | 8 ++++---- tests/pipeline/test_pe.nf.test | 2 +- 15 files changed, 31 insertions(+), 21 deletions(-) diff --git a/conf/test_bases2fastq.config b/conf/test_bases2fastq.config index 3d9c79f3..f87261ea 100644 --- a/conf/test_bases2fastq.config +++ b/conf/test_bases2fastq.config @@ -20,6 +20,6 @@ params { max_time = '6.h' // Input data - input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/b2fq-samplesheet.csv' + input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/b2fq-samplesheet.csv' demultiplexer = 'bases2fastq' } diff --git a/conf/test_bcl2fastq.config b/conf/test_bcl2fastq.config index ecb2adff..69960f6c 100755 --- a/conf/test_bcl2fastq.config +++ b/conf/test_bcl2fastq.config @@ -20,9 +20,9 @@ params { max_time = '6.h' // Input data - input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/flowcell_input.csv' + input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/flowcell_input.csv' demultiplexer = 'bcl2fastq' - skip_tools = "checkqc" + skip_tools = "checkqc,samplesheet_validator" } diff --git a/conf/test_checkqc.config b/conf/test_checkqc.config index a9a88846..32d10a04 100644 --- a/conf/test_checkqc.config +++ b/conf/test_checkqc.config @@ -16,9 +16,9 @@ params { // Input data - input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/samplesheet_full.csv' + input = 
'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/samplesheet_full.csv' demultiplexer = 'bcl2fastq' - skip_tools = "fastp,falco,md5sum,multiqc" + skip_tools = "fastp,falco,md5sum,multiqc,samplesheet_validator" checkqc_config = "${projectDir}/assets/checkqc_config.yaml" } diff --git a/conf/test_fqtk.config b/conf/test_fqtk.config index 40ce7665..f097b8b2 100644 --- a/conf/test_fqtk.config +++ b/conf/test_fqtk.config @@ -20,6 +20,6 @@ params { max_time = '1.h' // Input data - input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/fqtk-samplesheet.csv' + input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/fqtk-samplesheet.csv' demultiplexer = 'fqtk' } diff --git a/conf/test_full.config b/conf/test_full.config index c45a689c..b3cf6eb1 100644 --- a/conf/test_full.config +++ b/conf/test_full.config @@ -13,6 +13,9 @@ params { config_profile_name = 'Full test profile' config_profile_description = 'Full test dataset to check pipeline function' - input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/samplesheet_full.csv' + + // Input data + input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/samplesheet_full.csv' demultiplexer = 'bcl2fastq' + skip_tools = 'samplesheet_validator' } diff --git a/conf/test_kraken.config b/conf/test_kraken.config index 4adfef71..01d858ae 100644 --- a/conf/test_kraken.config +++ b/conf/test_kraken.config @@ -13,8 +13,11 @@ params { config_profile_name = 'Test full kraken profile' config_profile_description = 'Full test dataset to check pipeline function with kraken' - input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/samplesheet_full.csv' + + // Input data + input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/samplesheet_full.csv' demultiplexer = 'bcl2fastq' 
kraken_db = 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/db/kraken2.tar.gz' + skip_tools = 'samplesheet_validator' } diff --git a/conf/test_mkfastq.config b/conf/test_mkfastq.config index 1cc9914c..c981f41d 100644 --- a/conf/test_mkfastq.config +++ b/conf/test_mkfastq.config @@ -20,6 +20,7 @@ params { max_time = '1.h' // Input data - input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/mkfastq-samplesheet.csv' + input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/mkfastq-samplesheet.csv' demultiplexer = 'mkfastq' + skip_tools = 'samplesheet_validator' } diff --git a/conf/test_pe.config b/conf/test_pe.config index e3fef5c8..b6c38e33 100644 --- a/conf/test_pe.config +++ b/conf/test_pe.config @@ -13,7 +13,9 @@ params { config_profile_name = 'Paired end test profile' config_profile_description = 'Paired end test dataset to check pipeline function' - input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/pe_samplesheet.csv' + + // Input data + input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/pe_samplesheet.csv' demultiplexer = 'bcl2fastq' - skip_tools = "checkqc" + skip_tools = "checkqc,samplesheet_validator" } diff --git a/conf/test_sgdemux.config b/conf/test_sgdemux.config index 1ee949b5..00ed472e 100644 --- a/conf/test_sgdemux.config +++ b/conf/test_sgdemux.config @@ -20,6 +20,6 @@ params { max_time = '1.h' // Input data - input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/sgdemux-samplesheet.csv' + input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/sgdemux-samplesheet.csv' demultiplexer = 'sgdemux' } diff --git a/conf/test_two_lanes.config b/conf/test_two_lanes.config index d3385173..6dff2efc 100644 --- a/conf/test_two_lanes.config +++ 
b/conf/test_two_lanes.config @@ -15,7 +15,7 @@ params { config_profile_description = 'Minimal test dataset to check pipeline function with multiple lanes' // Input data - input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/two_lane_samplesheet.csv' + input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/two_lane_samplesheet.csv' demultiplexer = 'bclconvert' skip_tools = "checkqc" } diff --git a/conf/test_uncompressed.config b/conf/test_uncompressed.config index 1072387b..2db017d1 100644 --- a/conf/test_uncompressed.config +++ b/conf/test_uncompressed.config @@ -20,8 +20,9 @@ params { max_time = '6.h' // Input data - input = 'https://github.com/nf-core/test-datasets/raw/demultiplex/samplesheet/1.3.0/uncompressed-samplesheet.csv' + input = 'https://github.com/nf-core/test-datasets/raw/demultiplex/samplesheet/1.3.0/uncompressed-samplesheet.csv' demultiplexer = 'bclconvert' + skip_tools = 'samplesheet_validator' } diff --git a/tests/pipeline/bcl2fastq.nf.test b/tests/pipeline/bcl2fastq.nf.test index 2bada9be..abca8b56 100644 --- a/tests/pipeline/bcl2fastq.nf.test +++ b/tests/pipeline/bcl2fastq.nf.test @@ -12,7 +12,7 @@ nextflow_pipeline { input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/flowcell_input.csv' demultiplexer = 'bcl2fastq' outdir = "$outputDir" - skip_tools = "checkqc" + skip_tools = "checkqc,samplesheet_validator" } } diff --git a/tests/pipeline/kraken.nf.test b/tests/pipeline/kraken.nf.test index 20afde6c..078cba44 100644 --- a/tests/pipeline/kraken.nf.test +++ b/tests/pipeline/kraken.nf.test @@ -12,7 +12,7 @@ nextflow_pipeline { input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/flowcell_input.csv' demultiplexer = 'bcl2fastq' outdir = "$outputDir" - skip_tools = "checkqc" + skip_tools = "checkqc,samplesheet_validator" kraken_db = 
'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/db/kraken2.tar.gz' } } diff --git a/tests/pipeline/skip_tools.nf.test b/tests/pipeline/skip_tools.nf.test index 1de176d0..a479f370 100644 --- a/tests/pipeline/skip_tools.nf.test +++ b/tests/pipeline/skip_tools.nf.test @@ -41,7 +41,7 @@ nextflow_pipeline { input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/flowcell_input.csv' demultiplexer = 'bclconvert' outdir = "$outputDir" - skip_tools = "fastp" + skip_tools = "fastp,samplesheet_validator" } } @@ -69,7 +69,7 @@ nextflow_pipeline { input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/flowcell_input.csv' demultiplexer = 'bclconvert' outdir = "$outputDir" - skip_tools = "fastqc" + skip_tools = "fastqc,samplesheet_validator" } } @@ -97,7 +97,7 @@ nextflow_pipeline { input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/flowcell_input.csv' demultiplexer = 'bclconvert' outdir = "$outputDir" - skip_tools = "fastp,fastqc" + skip_tools = "fastp,fastqc,samplesheet_validator" } } @@ -125,7 +125,7 @@ nextflow_pipeline { input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/flowcell_input.csv' demultiplexer = 'bclconvert' outdir = "$outputDir" - skip_tools = "multiqc" + skip_tools = "multiqc,samplesheet_validator" } } diff --git a/tests/pipeline/test_pe.nf.test b/tests/pipeline/test_pe.nf.test index bba8f754..ec1f654d 100644 --- a/tests/pipeline/test_pe.nf.test +++ b/tests/pipeline/test_pe.nf.test @@ -11,7 +11,7 @@ nextflow_pipeline { params { input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/pe_samplesheet.csv' demultiplexer = 'bcl2fastq' - skip_tools = "checkqc" + skip_tools = "checkqc,samplesheet_validator" outdir = "$outputDir" } } From de3549836a170fa0843122d72d3d9fe1df2b6d88 Mon Sep 17 00:00:00 2001 From: nschcolnicov 
Date: Thu, 8 Aug 2024 20:44:09 +0000 Subject: [PATCH 09/19] fixing bcl2fastq test --- tests/pipeline/bcl2fastq.nf.test.snap | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/tests/pipeline/bcl2fastq.nf.test.snap b/tests/pipeline/bcl2fastq.nf.test.snap index 58ce544d..2e77ad49 100644 --- a/tests/pipeline/bcl2fastq.nf.test.snap +++ b/tests/pipeline/bcl2fastq.nf.test.snap @@ -5,9 +5,9 @@ ], "meta": { "nf-test": "0.9.0", - "nextflow": "23.10.0" + "nextflow": "24.04.4" }, - "timestamp": "2024-08-07T17:06:30.73934962" + "timestamp": "2024-08-08T20:43:41.988053048" }, "bcl2fastq": { "content": [ @@ -66,9 +66,9 @@ ], "meta": { "nf-test": "0.9.0", - "nextflow": "23.10.0" + "nextflow": "24.04.4" }, - "timestamp": "2024-08-07T17:06:30.762919351" + "timestamp": "2024-08-08T20:43:42.035230933" }, "multiqc": { "content": [ @@ -78,8 +78,8 @@ ], "meta": { "nf-test": "0.9.0", - "nextflow": "23.10.0" + "nextflow": "24.04.4" }, - "timestamp": "2024-08-07T17:06:30.754480121" + "timestamp": "2024-08-08T20:43:42.016587995" } } \ No newline at end of file From f77f1bdb54e09b520bb51f973c4f6533f1938939 Mon Sep 17 00:00:00 2001 From: nschcolnicov Date: Thu, 8 Aug 2024 21:54:24 +0000 Subject: [PATCH 10/19] Fixed linting --- nextflow_schema.json | 2 -- 1 file changed, 2 deletions(-) diff --git a/nextflow_schema.json b/nextflow_schema.json index cdc45f10..cf93b766 100644 --- a/nextflow_schema.json +++ b/nextflow_schema.json @@ -29,13 +29,11 @@ "kraken_db": { "type": "string", "format": "path", - "default": "None", "description": "Path to Kraken2 DB to use for screening" }, "validator_schema": { "type": "string", "format": "file-path", - "default": "None", "description": "Path to Illumina v2 samplesheet validator .json file" } } From a534f20148f798cc17b6ed1319c2e4359a45d59e Mon Sep 17 00:00:00 2001 From: Edmund Miller Date: Thu, 8 Aug 2024 18:11:12 -0500 Subject: [PATCH 11/19] build: Add environment.yml --- modules/local/samplesheet_validator/environment.yml | 8 
++++++++ modules/local/samplesheet_validator/main.nf | 5 ++++- 2 files changed, 12 insertions(+), 1 deletion(-) create mode 100644 modules/local/samplesheet_validator/environment.yml diff --git a/modules/local/samplesheet_validator/environment.yml b/modules/local/samplesheet_validator/environment.yml new file mode 100644 index 00000000..f92e0eee --- /dev/null +++ b/modules/local/samplesheet_validator/environment.yml @@ -0,0 +1,8 @@ +channels: + - conda-forge + - bioconda +dependencies: + - python>=3.9 + - pip + - pip: # FIXME https://github.com/nf-core/modules/issues/5814 + - samshee==0.1.12 diff --git a/modules/local/samplesheet_validator/main.nf b/modules/local/samplesheet_validator/main.nf index 1176bd01..7dabe68d 100644 --- a/modules/local/samplesheet_validator/main.nf +++ b/modules/local/samplesheet_validator/main.nf @@ -2,7 +2,10 @@ process SAMPLESHEET_VALIDATOR { tag {"$meta.id"} label 'process_low' - container "community.wave.seqera.io/library/pip_samshee:9f3c0736b7c44dc8" + conda "${moduleDir}/environment.yml" + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? 
+ 'oras://community.wave.seqera.io/library/python_pip_samshee:84a770c9853c725d' : + 'community.wave.seqera.io/library/python_pip_samshee:e8a5c47ec32efa42' }" input: tuple val(meta), path(samplesheet) From 90bdd58891de7baf1f1fd5568134f5d82f97747d Mon Sep 17 00:00:00 2001 From: Alexander Peltzer Date: Fri, 9 Aug 2024 09:30:28 +0200 Subject: [PATCH 12/19] Apply suggestions from code review Co-authored-by: Edmund Miller <20095261+edmundmiller@users.noreply.github.com> --- modules/local/samplesheet_validator/main.nf | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/modules/local/samplesheet_validator/main.nf b/modules/local/samplesheet_validator/main.nf index 7dabe68d..44a25517 100644 --- a/modules/local/samplesheet_validator/main.nf +++ b/modules/local/samplesheet_validator/main.nf @@ -1,6 +1,6 @@ process SAMPLESHEET_VALIDATOR { - tag {"$meta.id"} - label 'process_low' + tag "$meta.id" + label 'process_single' conda "${moduleDir}/environment.yml" container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? 
@@ -11,7 +11,10 @@ process SAMPLESHEET_VALIDATOR { tuple val(meta), path(samplesheet) path(validator_schema) //optional - // output: //Module is meant to crash pipeline if validation fails, output is not needed + output: + // Module is meant to stop the pipeline if validation fails + stdout + path "versions.yml", emit: versions when: task.ext.when == null || task.ext.when From 8700822724dd67272b25d07f243832b04c83944b Mon Sep 17 00:00:00 2001 From: zxBIB Schcolnicov Date: Fri, 9 Aug 2024 16:46:15 +0200 Subject: [PATCH 13/19] Added versions.yml --- modules/local/samplesheet_validator/main.nf | 16 +++++++++++++--- workflows/demultiplex.nf | 1 + 2 files changed, 14 insertions(+), 3 deletions(-) diff --git a/modules/local/samplesheet_validator/main.nf b/modules/local/samplesheet_validator/main.nf index 44a25517..fc85bd65 100644 --- a/modules/local/samplesheet_validator/main.nf +++ b/modules/local/samplesheet_validator/main.nf @@ -13,8 +13,7 @@ process SAMPLESHEET_VALIDATOR { output: // Module is meant to stop the pipeline if validation fails - stdout - path "versions.yml", emit: versions + path "versions.yml", emit: versions when: task.ext.when == null || task.ext.when @@ -34,6 +33,12 @@ process SAMPLESHEET_VALIDATOR { exit 1 # Fail the process if validation failed fi + cat <<-END_VERSIONS > versions.yml + "${task.process}": + samshee: \$( python -m pip show --version samshee | grep "Version" | sed -e "s/Version: //g" ) + python: \$( python --version | sed -e "s/Python //g" ) + END_VERSIONS + # If no validation errors, process exits with status 0 exit \$status """ @@ -62,7 +67,6 @@ process SAMPLESHEET_VALIDATOR { END_SAMPLE_SHEET - #Generate minimal schema validator file cat <<-END_SCHEMA > minimal_schema.json { @@ -118,6 +122,12 @@ process SAMPLESHEET_VALIDATOR { exit 1 # Fail the process if validation failed fi + cat <<-END_VERSIONS > versions.yml + "${task.process}": + samshee: \$( python -m pip show --version samshee | grep "Version" | sed -e "s/Version: //g" ) + 
python: \$( python --version | sed -e "s/Python //g" ) + END_VERSIONS + # If no validation errors, process exits with status 0 exit \$status """ diff --git a/workflows/demultiplex.nf b/workflows/demultiplex.nf index 34c5f4fd..0f235621 100644 --- a/workflows/demultiplex.nf +++ b/workflows/demultiplex.nf @@ -96,6 +96,7 @@ workflow DEMULTIPLEX { ch_samplesheet.map{ meta, samplesheet, flowcell, lane -> [meta,samplesheet] }, ch_validator_schema ) + ch_versions = ch_versions.mix(SAMPLESHEET_VALIDATOR.out.versions) } // Convenience From 896d02baa72030f806e40b31ea5ab24b087fb306 Mon Sep 17 00:00:00 2001 From: nschcolnicov Date: Fri, 9 Aug 2024 15:03:04 +0000 Subject: [PATCH 14/19] Updated kraken snap to work with CI --- tests/pipeline/kraken.nf.test.snap | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/tests/pipeline/kraken.nf.test.snap b/tests/pipeline/kraken.nf.test.snap index 74d978b9..e73f3516 100644 --- a/tests/pipeline/kraken.nf.test.snap +++ b/tests/pipeline/kraken.nf.test.snap @@ -57,9 +57,9 @@ ], "meta": { "nf-test": "0.8.4", - "nextflow": "23.10.0" + "nextflow": "24.04.4" }, - "timestamp": "2024-08-05T22:49:12.12938394" + "timestamp": "2024-08-09T15:02:30.860785585" }, "software_versions": { "content": [ @@ -67,9 +67,9 @@ ], "meta": { "nf-test": "0.8.4", - "nextflow": "23.10.0" + "nextflow": "24.04.4" }, - "timestamp": "2024-08-01T22:34:15.140488001" + "timestamp": "2024-08-09T15:02:30.822491178" }, "multiqc": { "content": [ @@ -80,8 +80,8 @@ ], "meta": { "nf-test": "0.8.4", - "nextflow": "23.10.0" + "nextflow": "24.04.4" }, - "timestamp": "2024-08-05T22:49:08.601265877" + "timestamp": "2024-08-09T15:02:30.848025877" } } \ No newline at end of file From 8587df1e8eae8f92e4dcdffd4d237ee149d0cfe3 Mon Sep 17 00:00:00 2001 From: zxBIB Schcolnicov Date: Fri, 9 Aug 2024 21:25:05 +0200 Subject: [PATCH 15/19] Updated citations and readme, renamed module --- CITATIONS.md | 2 ++ README.md | 1 + conf/test.config | 2 +- conf/test_bcl2fastq.config 
| 2 +- conf/test_checkqc.config | 2 +- conf/test_full.config | 2 +- conf/test_kraken.config | 2 +- conf/test_mkfastq.config | 2 +- conf/test_pe.config | 2 +- conf/test_uncompressed.config | 2 +- docs/usage.md | 4 ++-- .../local/{samplesheet_validator => samshee}/README.md | 0 .../{samplesheet_validator => samshee}/environment.yml | 0 .../local/{samplesheet_validator => samshee}/main.nf | 2 +- .../local/{samplesheet_validator => samshee}/meta.yml | 2 +- .../tests/main.nf.test | 4 ++-- .../tests/nextflow.config | 0 nextflow.config | 2 +- nextflow_schema.json | 2 +- tests/pipeline/bcl2fastq.nf.test | 2 +- tests/pipeline/kraken.nf.test | 2 +- tests/pipeline/skip_tools.nf.test | 8 ++++---- tests/pipeline/test_pe.nf.test | 2 +- workflows/demultiplex.nf | 10 +++++----- 24 files changed, 31 insertions(+), 28 deletions(-) rename modules/local/{samplesheet_validator => samshee}/README.md (100%) rename modules/local/{samplesheet_validator => samshee}/environment.yml (100%) rename modules/local/{samplesheet_validator => samshee}/main.nf (99%) rename modules/local/{samplesheet_validator => samshee}/meta.yml (97%) rename modules/local/{samplesheet_validator => samshee}/tests/main.nf.test (93%) rename modules/local/{samplesheet_validator => samshee}/tests/nextflow.config (100%) diff --git a/CITATIONS.md b/CITATIONS.md index 425cedfb..b3497f60 100644 --- a/CITATIONS.md +++ b/CITATIONS.md @@ -22,6 +22,8 @@ - [CheckQC](https://github.com/Molmed/checkQC) +- [samshee](https://github.com/lit-regensburg/samshee) + ## Software packaging/containerisation tools - [Anaconda](https://anaconda.com) diff --git a/README.md b/README.md index 85601c83..d9932471 100755 --- a/README.md +++ b/README.md @@ -47,6 +47,7 @@ On release, automated continuous integration tests run the pipeline on a full-si 4. [Falco](#falco) - Raw read QC 5. [md5sum](#md5sum) - Creates an MD5 (128-bit) checksum of every fastq. 6. [MultiQC](#multiqc) - aggregate report, describing results of the whole pipeline +7. 
[samshee](#samshee) - Validates illumina v2 samplesheets. ![subway map](docs/demultiplex.png) diff --git a/conf/test.config b/conf/test.config index 45467be6..2a1366b5 100755 --- a/conf/test.config +++ b/conf/test.config @@ -22,7 +22,7 @@ params { // Input data input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/flowcell_input.csv' demultiplexer = 'bclconvert' - skip_tools = 'samplesheet_validator' + skip_tools = 'samshee' } diff --git a/conf/test_bcl2fastq.config b/conf/test_bcl2fastq.config index 69960f6c..ce880444 100755 --- a/conf/test_bcl2fastq.config +++ b/conf/test_bcl2fastq.config @@ -22,7 +22,7 @@ params { // Input data input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/flowcell_input.csv' demultiplexer = 'bcl2fastq' - skip_tools = "checkqc,samplesheet_validator" + skip_tools = "checkqc,samshee" } diff --git a/conf/test_checkqc.config b/conf/test_checkqc.config index 32d10a04..7dc7fbb5 100644 --- a/conf/test_checkqc.config +++ b/conf/test_checkqc.config @@ -18,7 +18,7 @@ params { // Input data input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/samplesheet_full.csv' demultiplexer = 'bcl2fastq' - skip_tools = "fastp,falco,md5sum,multiqc,samplesheet_validator" + skip_tools = "fastp,falco,md5sum,multiqc,samshee" checkqc_config = "${projectDir}/assets/checkqc_config.yaml" } diff --git a/conf/test_full.config b/conf/test_full.config index b3cf6eb1..40209e9a 100644 --- a/conf/test_full.config +++ b/conf/test_full.config @@ -17,5 +17,5 @@ params { // Input data input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/samplesheet_full.csv' demultiplexer = 'bcl2fastq' - skip_tools = 'samplesheet_validator' + skip_tools = 'samshee' } diff --git a/conf/test_kraken.config b/conf/test_kraken.config index 01d858ae..f13f7fb8 100644 --- a/conf/test_kraken.config +++ b/conf/test_kraken.config @@ -18,6 +18,6 
@@ params { input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/samplesheet_full.csv' demultiplexer = 'bcl2fastq' kraken_db = 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/db/kraken2.tar.gz' - skip_tools = 'samplesheet_validator' + skip_tools = 'samshee' } diff --git a/conf/test_mkfastq.config b/conf/test_mkfastq.config index c981f41d..7990e1c8 100644 --- a/conf/test_mkfastq.config +++ b/conf/test_mkfastq.config @@ -22,5 +22,5 @@ params { // Input data input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/mkfastq-samplesheet.csv' demultiplexer = 'mkfastq' - skip_tools = 'samplesheet_validator' + skip_tools = 'samshee' } diff --git a/conf/test_pe.config b/conf/test_pe.config index b6c38e33..84ca95a0 100644 --- a/conf/test_pe.config +++ b/conf/test_pe.config @@ -17,5 +17,5 @@ params { // Input data input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/pe_samplesheet.csv' demultiplexer = 'bcl2fastq' - skip_tools = "checkqc,samplesheet_validator" + skip_tools = "checkqc,samshee" } diff --git a/conf/test_uncompressed.config b/conf/test_uncompressed.config index 2db017d1..3d0c2ab7 100644 --- a/conf/test_uncompressed.config +++ b/conf/test_uncompressed.config @@ -22,7 +22,7 @@ params { // Input data input = 'https://github.com/nf-core/test-datasets/raw/demultiplex/samplesheet/1.3.0/uncompressed-samplesheet.csv' demultiplexer = 'bclconvert' - skip_tools = 'samplesheet_validator' + skip_tools = 'samshee' } diff --git a/docs/usage.md b/docs/usage.md index d7905ec3..58152c24 100755 --- a/docs/usage.md +++ b/docs/usage.md @@ -204,9 +204,9 @@ To learn how to provide additional arguments to a particular tool of the pipelin The trimming process in our demultiplexing pipeline has been updated to ensure compatibility with 10x Genomics recommendations. 
By default, trimming in the pipeline is performed using fastp, which reliably auto-detects and removes adapter sequences without the need for storing adapter sequences. As users can also supply adapter sequences in a samplesheet and thereby triggering trimming in any `bcl2fastq` or `bclconvert` subworkflows, we have added a new parameter, `remove_adapter`, which is set to true by default. When `remove_adapter` is true, the pipeline automatically removes any adapter sequences listed in the `[Settings]` section of the Illumina sample sheet, replacing them with an empty string in order to not provoke this behaviour. This approach aligns with 10x Genomics' guidelines, as they advise against pre-processing FASTQ reads before inputting them into their software pipelines. If the `remove_adapter` setting is true but no adapter is removed, a warning will be displayed; however, this does not necessarily indicate an error, as some sample sheets may already lack these adapter sequences. Users can disable this behavior by setting `--remove_adapter false` in the command line, though this is not recommended. -## Samplesheet validator (samshee) +## samshee (Samplesheet validator) -The Samplesheet validator (samshee) module ensures the integrity of Illumina v2 Sample Sheets by allowing users to apply custom validation rules. The module can be used together with the parameter `--validator_schema`, which accepts a JSON schema validator file. Users can specify this file to enforce additional validation rules beyond the default ones provided by the tool. To use this feature, simply provide the path to the JSON schema validator file via the `--validator_schema` parameter in the pipeline configuration. This enables tailored validation of Sample Sheets to meet specific requirements or standards relevant to your sequencing workflow. For more information about the tool or how to write the schema JSON file, please refer to [Samshee on GitHub](https://github.com/lit-regensburg/samshee). 
+samshee ensures the integrity of Illumina v2 Sample Sheets by allowing users to apply custom validation rules. The module can be used together with the parameter `--validator_schema`, which accepts a JSON schema validator file. Users can specify this file to enforce additional validation rules beyond the default ones provided by the tool. To use this feature, simply provide the path to the JSON schema validator file via the `--validator_schema` parameter in the pipeline configuration. This enables tailored validation of Sample Sheets to meet specific requirements or standards relevant to your sequencing workflow. For more information about the tool or how to write the schema JSON file, please refer to [Samshee on GitHub](https://github.com/lit-regensburg/samshee). ### nf-core/configs diff --git a/modules/local/samplesheet_validator/README.md b/modules/local/samshee/README.md similarity index 100% rename from modules/local/samplesheet_validator/README.md rename to modules/local/samshee/README.md diff --git a/modules/local/samplesheet_validator/environment.yml b/modules/local/samshee/environment.yml similarity index 100% rename from modules/local/samplesheet_validator/environment.yml rename to modules/local/samshee/environment.yml diff --git a/modules/local/samplesheet_validator/main.nf b/modules/local/samshee/main.nf similarity index 99% rename from modules/local/samplesheet_validator/main.nf rename to modules/local/samshee/main.nf index fc85bd65..58146d6e 100644 --- a/modules/local/samplesheet_validator/main.nf +++ b/modules/local/samshee/main.nf @@ -1,4 +1,4 @@ -process SAMPLESHEET_VALIDATOR { +process SAMSHEE { tag "$meta.id" label 'process_single' diff --git a/modules/local/samplesheet_validator/meta.yml b/modules/local/samshee/meta.yml similarity index 97% rename from modules/local/samplesheet_validator/meta.yml rename to modules/local/samshee/meta.yml index 5c1cb27a..0c6388ee 100644 --- a/modules/local/samplesheet_validator/meta.yml +++ 
b/modules/local/samshee/meta.yml @@ -1,4 +1,4 @@ -name: samplesheet_validator +name: samshee description: Module to validate illumina® Sample Sheet v2 files. keywords: - samplesheet diff --git a/modules/local/samplesheet_validator/tests/main.nf.test b/modules/local/samshee/tests/main.nf.test similarity index 93% rename from modules/local/samplesheet_validator/tests/main.nf.test rename to modules/local/samshee/tests/main.nf.test index 6e9592b4..d76c98f4 100644 --- a/modules/local/samplesheet_validator/tests/main.nf.test +++ b/modules/local/samshee/tests/main.nf.test @@ -1,10 +1,10 @@ // nf-core modules test cellranger/mkfastq nextflow_process { - name "Test Process SAMPLESHEET_VALIDATOR" + name "Test Process samshee" script "../main.nf" config "./nextflow.config" - process "SAMPLESHEET_VALIDATOR" + process "SAMSHEE" tag "modules" diff --git a/modules/local/samplesheet_validator/tests/nextflow.config b/modules/local/samshee/tests/nextflow.config similarity index 100% rename from modules/local/samplesheet_validator/tests/nextflow.config rename to modules/local/samshee/tests/nextflow.config diff --git a/nextflow.config b/nextflow.config index fa77a58f..040dcb5a 100755 --- a/nextflow.config +++ b/nextflow.config @@ -18,7 +18,7 @@ params { remove_adapter = true // [true, false] // Options: tooling - skip_tools = [] // list [fastp, fastqc, kraken, multiqc, checkqc, falco, md5sum, samplesheet_validator] + skip_tools = [] // list [fastp, fastqc, kraken, multiqc, checkqc, falco, md5sum, samshee] // seqtk sample options sample_size = 100000 diff --git a/nextflow_schema.json b/nextflow_schema.json index cf93b766..c88d8f96 100644 --- a/nextflow_schema.json +++ b/nextflow_schema.json @@ -19,7 +19,7 @@ "skip_tools": { "type": "string", "default": "[]", - "description": "Comma-separated list of tools to skip (fastp,fastqc,kraken,multiqc,checkqc,falco,md5sum,samplesheet_validator)" + "description": "Comma-separated list of tools to skip 
(fastp,fastqc,kraken,multiqc,checkqc,falco,md5sum,samshee)" }, "sample_size": { "type": "integer", diff --git a/tests/pipeline/bcl2fastq.nf.test b/tests/pipeline/bcl2fastq.nf.test index abca8b56..05e7c792 100644 --- a/tests/pipeline/bcl2fastq.nf.test +++ b/tests/pipeline/bcl2fastq.nf.test @@ -12,7 +12,7 @@ nextflow_pipeline { input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/flowcell_input.csv' demultiplexer = 'bcl2fastq' outdir = "$outputDir" - skip_tools = "checkqc,samplesheet_validator" + skip_tools = "checkqc,samshee" } } diff --git a/tests/pipeline/kraken.nf.test b/tests/pipeline/kraken.nf.test index 078cba44..5d01f3fa 100644 --- a/tests/pipeline/kraken.nf.test +++ b/tests/pipeline/kraken.nf.test @@ -12,7 +12,7 @@ nextflow_pipeline { input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/flowcell_input.csv' demultiplexer = 'bcl2fastq' outdir = "$outputDir" - skip_tools = "checkqc,samplesheet_validator" + skip_tools = "checkqc,samshee" kraken_db = 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/sarscov2/genome/db/kraken2.tar.gz' } } diff --git a/tests/pipeline/skip_tools.nf.test b/tests/pipeline/skip_tools.nf.test index a479f370..55a9da2b 100644 --- a/tests/pipeline/skip_tools.nf.test +++ b/tests/pipeline/skip_tools.nf.test @@ -41,7 +41,7 @@ nextflow_pipeline { input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/flowcell_input.csv' demultiplexer = 'bclconvert' outdir = "$outputDir" - skip_tools = "fastp,samplesheet_validator" + skip_tools = "fastp,samshee" } } @@ -69,7 +69,7 @@ nextflow_pipeline { input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/flowcell_input.csv' demultiplexer = 'bclconvert' outdir = "$outputDir" - skip_tools = "fastqc,samplesheet_validator" + skip_tools = "fastqc,samshee" } } @@ -97,7 +97,7 @@ nextflow_pipeline { input = 
'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/flowcell_input.csv' demultiplexer = 'bclconvert' outdir = "$outputDir" - skip_tools = "fastp,fastqc,samplesheet_validator" + skip_tools = "fastp,fastqc,samshee" } } @@ -125,7 +125,7 @@ nextflow_pipeline { input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/flowcell_input.csv' demultiplexer = 'bclconvert' outdir = "$outputDir" - skip_tools = "multiqc,samplesheet_validator" + skip_tools = "multiqc,samshee" } } diff --git a/tests/pipeline/test_pe.nf.test b/tests/pipeline/test_pe.nf.test index ec1f654d..d7222f82 100644 --- a/tests/pipeline/test_pe.nf.test +++ b/tests/pipeline/test_pe.nf.test @@ -11,7 +11,7 @@ nextflow_pipeline { params { input = 'https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/pe_samplesheet.csv' demultiplexer = 'bcl2fastq' - skip_tools = "checkqc,samplesheet_validator" + skip_tools = "checkqc,samshee" outdir = "$outputDir" } } diff --git a/workflows/demultiplex.nf b/workflows/demultiplex.nf index 0f235621..b9b1d398 100644 --- a/workflows/demultiplex.nf +++ b/workflows/demultiplex.nf @@ -30,7 +30,7 @@ include { MD5SUM } from '../modules/nf-core/md5sum/main' // // MODULE: Local modules // -include { SAMPLESHEET_VALIDATOR } from '../modules/local/samplesheet_validator/main' +include { SAMSHEE } from '../modules/local/samshee/main' // // FUNCTION @@ -90,13 +90,13 @@ workflow DEMULTIPLEX { } } - // RUN samplesheet_validator - if (!("samplesheet_validator" in skip_tools) && (params.demultiplexer in ["bcl2fastq", "bclconvert", "mkfastq"])){ - SAMPLESHEET_VALIDATOR ( + // RUN samplesheet_validator samshee + if (!("samshee" in skip_tools) && (params.demultiplexer in ["bcl2fastq", "bclconvert", "mkfastq"])){ + SAMSHEE ( ch_samplesheet.map{ meta, samplesheet, flowcell, lane -> [meta,samplesheet] }, ch_validator_schema ) - ch_versions = ch_versions.mix(SAMPLESHEET_VALIDATOR.out.versions) + 
ch_versions = ch_versions.mix(SAMSHEE.out.versions) } // Convenience From b50e71ddb25ffa17916d1508943072e3b9bbb501 Mon Sep 17 00:00:00 2001 From: Alexander Peltzer Date: Mon, 12 Aug 2024 07:08:01 +0000 Subject: [PATCH 16/19] Fix lint --- nextflow_schema.json | 1 + 1 file changed, 1 insertion(+) diff --git a/nextflow_schema.json b/nextflow_schema.json index 87fe3092..7ab8d009 100644 --- a/nextflow_schema.json +++ b/nextflow_schema.json @@ -35,6 +35,7 @@ "type": "string", "format": "file-path", "description": "Path to Illumina v2 samplesheet validator .json file" + }, "downstream_pipeline": { "type": "string", "description": "Name of downstream nf-core pipeline (one of: rnaseq, atacseq, taxprofiler or default). Used to produce the input samplesheet for that pipeline.", From 08460b21dd63e7b9c3dcd573e537af3ad52e0dd6 Mon Sep 17 00:00:00 2001 From: Alexander Peltzer Date: Mon, 12 Aug 2024 07:18:27 +0000 Subject: [PATCH 17/19] Add versions to csv2tsv module --- modules/local/csv2tsv.nf | 6 ++++++ subworkflows/local/fqtk_demultiplex/main.nf | 5 ++++- 2 files changed, 10 insertions(+), 1 deletion(-) diff --git a/modules/local/csv2tsv.nf b/modules/local/csv2tsv.nf index 0dd5a720..a35681d3 100644 --- a/modules/local/csv2tsv.nf +++ b/modules/local/csv2tsv.nf @@ -9,6 +9,7 @@ process CSV2TSV { output: tuple val(meta), path('samplesheet.tsv'), val(fastq_readstructure_pairs), emit: ch_output + path "versions.yml", emit: versions when: task.ext.when == null || task.ext.when @@ -16,5 +17,10 @@ process CSV2TSV { script: """ sed 's/,/\t/g' ${sample_sheet} > samplesheet.tsv + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + sed: \$( sed --version | grep "sed (GNU sed) " | sed -e "s/sed (GNU sed) //g" ) + END_VERSIONS """ } diff --git a/subworkflows/local/fqtk_demultiplex/main.nf b/subworkflows/local/fqtk_demultiplex/main.nf index df4f5478..9066bcd5 100644 --- a/subworkflows/local/fqtk_demultiplex/main.nf +++ b/subworkflows/local/fqtk_demultiplex/main.nf @@ -21,11 +21,14 @@ 
workflow FQTK_DEMULTIPLEX { // Generate meta for each fastq ch_fastq_with_meta = generate_fastq_meta(FQTK.out.sample_fastq) + // Add versions to versions channel + ch_versions = FQTK.out.versions.mix(CSV2TSV.out.versions) + emit: fastq = ch_fastq_with_meta metrics = FQTK.out.metrics unassigned = FQTK.out.most_frequent_unmatched - versions = FQTK.out.versions + versions = ch_versions } /* From ee95c5d7f2d9764c57a5435f8f36017439b2a9de Mon Sep 17 00:00:00 2001 From: Alexander Peltzer Date: Mon, 12 Aug 2024 07:32:47 +0000 Subject: [PATCH 18/19] Update test to include sed version --- tests/pipeline/fqtk.nf.test.snap | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tests/pipeline/fqtk.nf.test.snap b/tests/pipeline/fqtk.nf.test.snap index 32800060..acb4a9b3 100644 --- a/tests/pipeline/fqtk.nf.test.snap +++ b/tests/pipeline/fqtk.nf.test.snap @@ -11,7 +11,7 @@ }, "software_versions": { "content": [ - "{FALCO={falco=1.2.1}, FASTP={fastp=0.23.4}, FQTK={fqtk=0.2.1}, MD5SUM={md5sum=8.3}, UNTAR_FLOWCELL={untar=1.34}, Workflow={nf-core/demultiplex=v1.5.0dev}}" + "{CSV2TSV={sed=4.8}, FALCO={falco=1.2.1}, FASTP={fastp=0.23.4}, FQTK={fqtk=0.2.1}, MD5SUM={md5sum=8.3}, UNTAR_FLOWCELL={untar=1.34}, Workflow={nf-core/demultiplex=v1.5.0dev}}" ], "meta": { "nf-test": "0.8.4", @@ -19,4 +19,4 @@ }, "timestamp": "2024-08-02T19:57:17.122084549" } -} \ No newline at end of file +} From ec132a3bc0488d2184502d17b9a0ab4f3d3f7576 Mon Sep 17 00:00:00 2001 From: nschcolnicov Date: Mon, 12 Aug 2024 12:47:02 +0000 Subject: [PATCH 19/19] PR comments --- docs/usage.md | 20 +++++++++++--------- modules/local/samshee/meta.yml | 4 ---- 2 files changed, 11 insertions(+), 13 deletions(-) diff --git a/docs/usage.md b/docs/usage.md index 74466de8..c3efd5b8 100755 --- a/docs/usage.md +++ b/docs/usage.md @@ -118,10 +118,20 @@ genome: 'GRCh37' You can also generate such `YAML`/`JSON` files via [nf-core/launch](https://nf-co.re/launch). 
-## Optional parameters +### Optional parameters + +## checkQC If you are running this pipeline with the bcl2fastq demultiplexer, the checkqc module is run. In this case, the default run will include the default config file for checkqc, but you can additionally provide your own checkqc config file using the parameter `--checkqc_config` and a path to a `yml`. See an example of a config file in the [checkqc repository](https://github.com/Molmed/checkQC/blob/dfba84ec63e1df60c0f84ccc96a154a330b28ce4/checkQC/default_config/config.yaml). +### Trimming + +The trimming process in our demultiplexing pipeline has been updated to ensure compatibility with 10x Genomics recommendations. By default, trimming in the pipeline is performed using fastp, which reliably auto-detects and removes adapter sequences without the need for storing adapter sequences. As users can also supply adapter sequences in a samplesheet and thereby triggering trimming in any `bcl2fastq` or `bclconvert` subworkflows, we have added a new parameter, `remove_adapter`, which is set to true by default. When `remove_adapter` is true, the pipeline automatically removes any adapter sequences listed in the `[Settings]` section of the Illumina sample sheet, replacing them with an empty string in order to not provoke this behaviour. This approach aligns with 10x Genomics' guidelines, as they advise against pre-processing FASTQ reads before inputting them into their software pipelines. If the `remove_adapter` setting is true but no adapter is removed, a warning will be displayed; however, this does not necessarily indicate an error, as some sample sheets may already lack these adapter sequences. Users can disable this behavior by setting `--remove_adapter false` in the command line, though this is not recommended. + +## samshee (Samplesheet validator) + +samshee ensures the integrity of Illumina v2 Sample Sheets by allowing users to apply custom validation rules. 
The module can be used together with the parameter `--validator_schema`, which accepts a JSON schema validator file. Users can specify this file to enforce additional validation rules beyond the default ones provided by the tool. To use this feature, simply provide the path to the JSON schema validator file via the `--validator_schema` parameter in the pipeline configuration. This enables tailored validation of Sample Sheets to meet specific requirements or standards relevant to your sequencing workflow. For more information about the tool or how to write the schema JSON file, please refer to [Samshee on GitHub](https://github.com/lit-regensburg/samshee). + ### Updating the pipeline When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, make sure that you regularly update the cached version of the pipeline: @@ -217,14 +227,6 @@ A pipeline might not always support every possible argument or option of a parti To learn how to provide additional arguments to a particular tool of the pipeline, please see the [customising tool arguments](https://nf-co.re/docs/usage/configuration#customising-tool-arguments) section of the nf-core website. -### Trimming - -The trimming process in our demultiplexing pipeline has been updated to ensure compatibility with 10x Genomics recommendations. By default, trimming in the pipeline is performed using fastp, which reliably auto-detects and removes adapter sequences without the need for storing adapter sequences. As users can also supply adapter sequences in a samplesheet and thereby triggering trimming in any `bcl2fastq` or `bclconvert` subworkflows, we have added a new parameter, `remove_adapter`, which is set to true by default. 
When `remove_adapter` is true, the pipeline automatically removes any adapter sequences listed in the `[Settings]` section of the Illumina sample sheet, replacing them with an empty string in order to not provoke this behaviour. This approach aligns with 10x Genomics' guidelines, as they advise against pre-processing FASTQ reads before inputting them into their software pipelines. If the `remove_adapter` setting is true but no adapter is removed, a warning will be displayed; however, this does not necessarily indicate an error, as some sample sheets may already lack these adapter sequences. Users can disable this behavior by setting `--remove_adapter false` in the command line, though this is not recommended. - -## samshee (Samplesheet validator) - -samshee ensures the integrity of Illumina v2 Sample Sheets by allowing users to apply custom validation rules. The module can be used together with the parameter `--validator_schema`, which accepts a JSON schema validator file. Users can specify this file to enforce additional validation rules beyond the default ones provided by the tool. To use this feature, simply provide the path to the JSON schema validator file via the `--validator_schema` parameter in the pipeline configuration. This enables tailored validation of Sample Sheets to meet specific requirements or standards relevant to your sequencing workflow. For more information about the tool or how to write the schema JSON file, please refer to [Samshee on GitHub](https://github.com/lit-regensburg/samshee). - ### nf-core/configs In most cases, you will only need to create a custom config as a one-off but if you and others within your organisation are likely to be running nf-core pipelines regularly and need to use the same settings regularly it may be a good idea to request that your custom config file is uploaded to the `nf-core/configs` git repository. 
Before you do this please can you test that the config file works with your pipeline of choice using the `-c` parameter. You can then create a pull request to the `nf-core/configs` repository with the addition of your config file, associated documentation file (see examples in [`nf-core/configs/docs`](https://github.com/nf-core/configs/tree/master/docs)), and amending [`nfcore_custom.config`](https://github.com/nf-core/configs/blob/master/nfcore_custom.config) to include your custom profile. diff --git a/modules/local/samshee/meta.yml b/modules/local/samshee/meta.yml index 0c6388ee..145ddd24 100644 --- a/modules/local/samshee/meta.yml +++ b/modules/local/samshee/meta.yml @@ -23,10 +23,6 @@ input: description: "illumina v2 samplesheet" pattern: "*.{csv}" output: - - fastq: - type: file - description: Unaligned FastQ files - pattern: "*.fastq.gz" - versions: type: file description: File containing software version