InPreD · tinavisnovska · Mar 6, 2024 · Feb 7, 2024 · Feb 8, 2024 · Feb 14, 2024
diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
@@ -3,10 +3,25 @@ on:
   push:
     branches:
       - main
     tags:
       - '*.*.*'
+  pull_request:
+    branches:
+      - main
 
 jobs:
+  test:
+    name: Run unit tests
+    runs-on: ubuntu-latest
+    steps:
+      - 
+        name: Check out the repo
+        uses: actions/checkout@v4
+      - 
+        name: Unit testing 
+        uses: fylein/python-pytest-github-action@v2
+        with:
+          args: pip3 install -r requirements.txt && pytest
   build:
     name: Build Image
     runs-on: ubuntu-latest
@@ -27,6 +42,9 @@ jobs:
           tags: |
             latest
             type=semver,pattern={{raw}}
+            type=semver,pattern={{version}}
+            type=semver,pattern={{major}}.{{minor}}
+            type=semver,pattern={{major}}
       - 
         name: Login to Dockerhub
         uses: docker/login-action@v3

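The three added `type=semver` lines appear to use the pattern syntax of `docker/metadata-action`; the hunk does not show the `uses:` line, so that is an assumption. Under that assumption, the sketch below illustrates which image tags a plain `MAJOR.MINOR.PATCH` git tag would produce, in addition to the static `latest` tag. The helper `expand_semver_patterns` is hypothetical and only emulates the patterns for plain release tags; it is not the action's implementation and does not cover pre-release versions.

```python
import re


def expand_semver_patterns(git_tag: str) -> list[str]:
    """Emulate the four semver patterns for a plain MAJOR.MINOR.PATCH tag.

    Hypothetical helper for illustration only; an optional leading "v" is accepted.
    """
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", git_tag)
    if match is None:
        raise ValueError(f"not a plain MAJOR.MINOR.PATCH tag: {git_tag!r}")
    major, minor, patch = match.groups()
    candidates = [
        git_tag,                     # {{raw}}: the git tag as pushed
        f"{major}.{minor}.{patch}",  # {{version}}
        f"{major}.{minor}",          # {{major}}.{{minor}}
        major,                       # {{major}}
    ]
    # Remove duplicates while keeping the order of first appearance.
    return list(dict.fromkeys(candidates))


print(expand_semver_patterns("1.2.3"))
print(expand_semver_patterns("v1.2.3"))
```

For a tag without a `v` prefix, `{{raw}}` and `{{version}}` coincide, so the added patterns effectively contribute the floating `MAJOR.MINOR` and `MAJOR` tags.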
diff --git a/.gitignore b/.gitignore
@@ -1 +1,3 @@
 README.md.backup
+generator/__pycache__
+generator/tests/__pycache__
diff --git a/Dockerfile b/Dockerfile
@@ -1,12 +1,7 @@
 FROM python:3.11.4-slim
 ENV PATH=$PATH:/opt
-RUN apt-get update \
-    && apt-get install -y --no-install-recommends \
-    && rm -rf /var/lib/apt/lists/* \
-    && apt-get clean
 COPY requirements.txt /
 RUN pip install --no-cache-dir -r requirements.txt \
     && rm requirements.txt
 COPY samplesheet_generator.py /opt/
-COPY test /opt/test
 COPY indexes /opt/indexes
diff --git a/README.md b/README.md
@@ -3,47 +3,49 @@ generates samplesheet compatible with TSO500 LocalApp analysis.
 
 ## Contents
 
-1. [Dependencies](#dependencies)
-2. [Description of Input Parameters](#description-of-input-parameters)
-3. [Usage](#usage)
+1. [Introduction](#introduction)
+2. [Dependencies](#dependencies)
+3. [Description of Input Parameters](#description-of-input-parameters)
+4. [Usage](#usage)
 
 ## Introduction
 
-Running LocalApp analysis requires a samplesheet in a specific format consistent with the performed sequencing to guide the analysis. Sometimes the samplesheet generated by a sequencing machine is transferred so that LocalApp has access to the file. However, sometimes it is more efficient to generate such a samplesheet from scratch. This script automatizes the second option.
+Running LocalApp analysis requires a samplesheet in a specific format, consistent with the performed sequencing, to guide the analysis. Such a samplesheet can be created manually by taking the template samplesheet file provided by Illumina and filling in the run-specific information for every sequencing run in Excel or a similar tool. This script attempts to automate samplesheet generation as much as possible. It reads the sequencing run information from a simple `tab` separated table and generates a samplesheet that fulfills the LocalApp requirements.
+
 
 ## Dependencies
 
-python3>=3.11.4, 
-pandas>=2.2.0
+- python3>=3.11.4
+- pandas>=2.2.0
 
 
 ## Description of Input Parameters
 
-Names of the input parameters defined for the script and their possible values are listed in the table below. Except for the `--input_info_file` all the other parameters are strings or integers that will be incorporated into the output of the script. 
-
-
-| parameter | description |
-|:---|:---|
-|`--run-id`| Id of the sequencing run for which the samplesheet is generated. (**str**)|
-|`--index-type`| `dual` or `simple` Type of the used index. (**str**)|
-|`--index-length`| `8` or `10`. Number of nucleotides in the used index. (**int**) |
-|`--investigator-name`| Value of the `Investigator Name` field in the samplesheet. Using `name (inpred_node)` could be a good convention. (**str**)|
-|`--experiment-name`| Value of the `Experiment Name` field in the samplesheet. A list of study names from which the samples are could be a good convention. (**str**)|
-|`--input-info-file`| Full path to an input file. (**str**)|
-|`--read-length-1`| This value will be filled in the `Reads` section of the samplesheet. (**int**) [default = 101]|
-|`--read-length-2`| This value will be filled in the `Reads` section of the samplesheet. (**int**) [default = 101]|
-|`--adapter-read-1`| This value will be filled in the `Settings` section of the samplesheet, it is a nucleotide adapter sequence of read1. (**str**)|
-|`--adapter-read-2`| This value will be filled in the `Settings` section of the samplesheet, it is a nucleotide adapter sequence of read2. (**str**)|
-|`--adapter-behavior`| This value will be filled in the `Settings` section of the samplesheet. (**str**) [default = 'trim']|
-|`--minimum-trimmed-read-length`| This value will be filled in the `Settings` section of the samplesheet. (**int**) [default = 35]|
-|`--mask-short-reads`| This value will be filled in the `Settings` section of the samplesheet. (**int**) [default = 22]|
-|`--override-cycles`| This value will be filled in the `Settings` section of the samplesheet. (**str**)|
-|`--samplesheet-version`| Version in which the samplesheet should be generated. Only `v1` is implemented now.  (**str**) [default = 'v1']|
-|`--help`|Print the help message and exit.|
+Names of the input parameters defined for the script and their possible values are listed in the table below.
+
+
+| parameter | description | type | default |
+|:---|:---|:---:|:---:|
+|`--run-id`| ID of the sequencing run for which the samplesheet is generated. | str | |
+|`--index-type`| Type of the used index. Supported values are `dual` and `simple`. | str | `dual` |
+|`--index-length`| Number of nucleotides in the used index. Supported values are `8` and `10`. | int | |
+|`--investigator-name`| Value of the `Investigator Name` field in the samplesheet. The preferred form is `name (inpred_node)`. The string must not contain a comma. | str | `''` |
+|`--experiment-name`| Value of the `Experiment Name` field in the samplesheet. The preferred form is a space-separated list of all studies that the run samples come from. The string must not contain a comma. | str | `''` |
+|`--input-info-file`| Absolute path to an input info file. | str | |
+|`--read-length-1`| Length of the sequenced forward reads. This value will be filled in the `Reads` section of the samplesheet. | int | `101` |
+|`--read-length-2`| Length of the sequenced reverse reads. This value will be filled in the `Reads` section of the samplesheet. | int | `101` |
+|`--adapter-read-1`| Nucleotide adapter sequence of read1. This value will be filled in the `Settings` section of the samplesheet. | str | |
+|`--adapter-read-2`| Nucleotide adapter sequence of read2. This value will be filled in the `Settings` section of the samplesheet. | str | |
+|`--adapter-behavior`| This value will be filled in the `Settings` section of the samplesheet and passed as an input parameter to BCL Convert. Supported values are `trim` and `mask`. For more info about BCL Convert, see the [BCL Convert user guide](https://support.illumina.com/sequencing/sequencing_software/bcl-convert.html). | str | `trim` |
+|`--minimum-trimmed-read-length`| This value will be filled in the `Settings` section of the samplesheet and passed as an input parameter to BCL Convert. | int | `35` |
+|`--mask-short-reads`| This value will be filled in the `Settings` section of the samplesheet and passed as an input parameter to BCL Convert. | int | `22` |
+|`--override-cycles`| This value will be filled in the `Settings` section of the samplesheet and passed as an input parameter to BCL Convert. | str | |
+|`--samplesheet-version`| Version in which the samplesheet is generated. Only `v1` is implemented now. | str | `v1` |
+|`--help`|Print the help message and exit.| | |
 
 File format of the `input_info_file` follows.
 
-### Input File Format
+### Input Info File Format
 
 The file is expected to contain `tab` separated values (`.tsv` file). The rows starting with character `#` are ignored (considered to be comments). The first non-commented row is considered to be a header containing column names.
 
@@ -131,47 +133,17 @@ apptainer run \
 
 ### Run Test Data Example
 
-The script is tested with data of a specific sequencing run. The run consists of artificial samples, including AcroMetrix samples. The sequencing was performed on nextseq, with the legacy parameter setting and file formats.
-
-#### Test Data Input File
-
-The input info file `infoFile.tsv` is located in `test`folder of this repository and in `/opt/test` of the created Docekr image. The content of the file follows:
-
-
-
-```
-sample_id	molecule	run_id	barcode	index
-CLAcroMetrix-D01-X01-X00	DNA	191206_NB501498_0174_AHWCNMBGXC	NA	TCCGGAGA
-```
+Test data are located in the `test` subfolder of the repository. The input info file is named `infoFile.tsv`, and the expected output is stored in `samplesheet.tsv`.
 
-#### Test Data Output 
+The script is tested with data from a specific sequencing run. The run consists of artificial samples, including AcroMetrix samples. The sequencing was performed on a NextSeq instrument, with legacy parameter settings and file formats.
 
-```
-[Header]
-Investigator Name,Name (InPreD node)
-Experiment Name,OUS pathology test run
-Date,07/02/2024
-
-[Reads]
-101
-101
-
-[Settings]
-AdapterRead1,AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
-AdapterRead2,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
-AdapterBehavior,trim
-MinimumTrimmedReadLength,35
-MaskShortReads,22
-OverrideCycles,U7N1Y93;I8;I8;U7N1Y93
-
-[Data]
-Sample_ID,Sample_Type,Pair_ID,index,I7_Index_ID,index2,I5_Index_ID
-CLAcroMetrix-D01-X01-X00,DNA,CLAcroMetrix-D01-X01-X00,TCCGGAGA,D702,AGGATAGG,D503
-```
 
 #### Locally
 
 ```
+# ${GITHUB_REPOSITORY_LOCAL_PATH} is an absolute path
+# to the samplesheet_generator repository on the local computer.
+
 # define the testRunID value
 testRunID="191206_NB501498_0174_AHWCNMBGXC"
 
@@ -197,7 +169,14 @@ python3 samplesheet_generator.py \
 #### Docker
 
 ```
-docker run docker://inpred/samplesheet_generator:latest bash
+# ${GITHUB_REPOSITORY_LOCAL_PATH} is an absolute path
+# to the samplesheet_generator repository on the local computer.
+# ${INFO_INPUT_FILE_CONTAINER} is an absolute path to the input info file
+# in the container.
+
+docker run --rm -it -e INFO_INPUT_FILE_CONTAINER=${INFO_INPUT_FILE_CONTAINER} \
+	-v ${GITHUB_REPOSITORY_LOCAL_PATH}/test/infoFile.tsv:${INFO_INPUT_FILE_CONTAINER} \
+	inpred/samplesheet_generator:latest bash
 
 Docker> testRunID="191206_NB501498_0174_AHWCNMBGXC"
 Docker> python3 samplesheet_generator.py \
@@ -206,7 +185,7 @@ Docker> python3 samplesheet_generator.py \
 		--index-length 8 \
 		--investigator-name "Name (InPreD node)" \
 		--experiment-name "OUS pathology test run" \
-		--input-info-file /opt/test/infoFile.tsv \
+		--input-info-file ${INFO_INPUT_FILE_CONTAINER} \
 		--read-length-1 101 \
 		--read-length-2 101 \
 		--adapter-read-1 "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" \
@@ -221,7 +200,9 @@ Docker> python3 samplesheet_generator.py \
 #### Singularity/Apptainer
 
 ```
-singularity run docker://inpred/samplesheet_generator:latest bash
+singularity run \
+	-B ${GITHUB_REPOSITORY_LOCAL_PATH}/test/infoFile.tsv:${INFO_INPUT_FILE_CONTAINER} \
+	docker://inpred/samplesheet_generator:latest bash
 
 Singularity> testRunID="191206_NB501498_0174_AHWCNMBGXC"
 Singularity> python3 samplesheet_generator.py \
@@ -230,7 +211,7 @@ Singularity> python3 samplesheet_generator.py \
 		--index-length 8 \
 		--investigator-name "Name (InPreD node)" \
 		--experiment-name "OUS pathology test run" \
-		--input-info-file /opt/test/infoFile.tsv \
+		--input-info-file ${INFO_INPUT_FILE_CONTAINER} \
 		--read-length-1 101 \
 		--read-length-2 101 \
 		--adapter-read-1 "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" \
@@ -242,7 +223,9 @@ Singularity> python3 samplesheet_generator.py \
 		--samplesheet-version "v1"
 
 
-apptainer run docker://inpred/samplesheet_generator:latest bash
+apptainer run \
+	-B ${GITHUB_REPOSITORY_LOCAL_PATH}/test/infoFile.tsv:${INFO_INPUT_FILE_CONTAINER} \
+	docker://inpred/samplesheet_generator:latest bash
 
 Apptainer> testRunID="191206_NB501498_0174_AHWCNMBGXC"
 Apptainer> python3 samplesheet_generator.py \
@@ -251,7 +234,7 @@ Apptainer> python3 samplesheet_generator.py \
 		--index-length 8 \
 		--investigator-name "Name (InPreD node)" \
 		--experiment-name "OUS pathology test run" \
-		--input-info-file /opt/test/infoFile.tsv \
+		--input-info-file ${INFO_INPUT_FILE_CONTAINER} \
 		--read-length-1 101 \
 		--read-length-2 101 \
 		--adapter-read-1 "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" \
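The README section on the input info file format states that the file is tab separated, that rows starting with `#` are comments, and that the first non-commented row is the header. A minimal sketch of a reader following those three rules is shown below. The function `read_info_table` is hypothetical and is not taken from `samplesheet_generator.py`, whose actual parsing may differ; skipping blank lines is an added assumption. The sample row reuses the test data shown in the removed README lines.

```python
import csv
import io


def read_info_table(handle):
    """Return the data rows of a tab-separated info file as dictionaries.

    Hypothetical helper: rows starting with '#' are treated as comments,
    blank lines are skipped (an assumption), and the first remaining row
    is used as the header.
    """
    data_lines = (
        line for line in handle
        if line.strip() and not line.startswith("#")
    )
    return list(csv.DictReader(data_lines, delimiter="\t"))


example = io.StringIO(
    "# artificial test run\n"
    "sample_id\tmolecule\trun_id\tbarcode\tindex\n"
    "CLAcroMetrix-D01-X01-X00\tDNA\t191206_NB501498_0174_AHWCNMBGXC\tNA\tTCCGGAGA\n"
)
rows = read_info_table(example)
print(rows[0]["sample_id"], rows[0]["index"])
```

The sketch uses only the standard library so that it runs without the `pandas` dependency listed in the README.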