Opening discussion about samplesheet_generator #1

Merged
merged 19 commits into from
Mar 6, 2024
Changes from 17 commits
Commits
ebf43d1
Fixed some README.md typos.
tinavisnovska Feb 7, 2024
2cb383d
Added the Introduction section into the Contents in README.md
tinavisnovska Feb 8, 2024
26bd6c0
Removed unnecessary comments from indexes/TSO500_NextSeq_simple_index…
tinavisnovska Feb 14, 2024
f0d201f
Removed unnecessary comments from indexes/TSO500_NovaSeq_dual_indexes…
tinavisnovska Feb 14, 2024
420ebef
Simplified and cleaned Dockerfile.
tinavisnovska Feb 14, 2024
0099dbb
In samplesheet_generator.py: added choices for some parameters, remov…
tinavisnovska Feb 14, 2024
77f2bba
Moved all the test related data to the test folder.
tinavisnovska Feb 14, 2024
3f1b418
Expanded parameter describing table, improved Inrto.
tinavisnovska Feb 15, 2024
e0e9568
Fixed typo.
tinavisnovska Feb 15, 2024
57be03c
Set up pytest testing with a simple test.
tinavisnovska Feb 20, 2024
b97f576
Added testing container for unit tests into workflows/main.yml
tinavisnovska Feb 21, 2024
5290dbf
Added test input vars in generator/tests/cli_test.py
tinavisnovska Feb 21, 2024
431828e
Made the basic unit testing work.
tinavisnovska Feb 21, 2024
00ce914
Fixed dual_indexes condition.
tinavisnovska Feb 21, 2024
438c22b
removing requirements-test.txt for now, can be added when used in dev…
tinavisnovska Feb 21, 2024
6f22ecf
Attempted to make github actions work on a fork, .github/workflows/m…
tinavisnovska Feb 26, 2024
87b4ba2
Attempted to make the actions work, fixed dependencies.
tinavisnovska Feb 26, 2024
0c22771
Attempt to fix actions and module discovery
tinavisnovska Feb 26, 2024
c731f16
Upgraded pandas back to 2.2.0, relaxed python requirements to >= 3.9 …
tinavisnovska Feb 27, 2024
18 changes: 18 additions & 0 deletions .github/workflows/main.yml
@@ -3,10 +3,25 @@ on:
push:
branches:
- main
pull_request:
branches:
- main
tags:
- '*.*.*'

jobs:
test:
name: Run unit tests
runs-on: ubuntu-latest
steps:
-
name: Check out the repo
uses: actions/checkout@v4
-
name: Unit testing
uses: fylein/python-pytest-github-action@v2
with:
args: pip3 install -r requirements.txt && pytest
build:
name: Build Image
runs-on: ubuntu-latest
@@ -27,6 +42,9 @@ jobs:
tags: |
latest
type=semver,pattern={{raw}}
type=semver,pattern={{version}}
type=semver,pattern={{major}}.{{minor}}
type=semver,pattern={{major}}
-
name: Login to Dockerhub
uses: docker/login-action@v3
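The four `type=semver` patterns added above each derive one image tag from a pushed git tag. The sketch below illustrates the expected expansion, assuming the patterns behave like the `semver` tag type of `docker/metadata-action` (the action itself is not visible in this hunk, so that is an assumption). The function name `expand_semver_tags` is hypothetical, and only plain `MAJOR.MINOR.PATCH` tags are handled.

```python
import re


def expand_semver_tags(git_tag: str) -> list[str]:
    """Return the image tags the four semver patterns would yield for git_tag.

    Assumption: pre-release and build suffixes are out of scope for this sketch.
    """
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", git_tag)
    if match is None:
        raise ValueError(f"not a plain semver tag: {git_tag!r}")
    major, minor, patch = match.groups()
    return [
        git_tag,                     # {{raw}}: the tag exactly as pushed
        f"{major}.{minor}.{patch}",  # {{version}}: without a leading "v"
        f"{major}.{minor}",          # {{major}}.{{minor}}
        major,                       # {{major}}
    ]


print(expand_semver_tags("v1.2.3"))
```

Publishing the shorter tags lets users pin to a major or minor line and still receive patch releases.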
2 changes: 2 additions & 0 deletions .gitignore
@@ -1 +1,3 @@
README.md.backup
generator/__pycache__
generator/tests/__pycache__
5 changes: 0 additions & 5 deletions Dockerfile
@@ -1,12 +1,7 @@
FROM python:3.11.4-slim
ENV PATH=$PATH:/opt
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
&& rm -rf /var/lib/apt/lists/* \
&& apt-get clean
COPY requirements.txt /
RUN pip install --no-cache-dir -r requirements.txt \
&& rm requirements.txt
COPY samplesheet_generator.py /opt/
COPY test /opt/test
COPY indexes /opt/indexes
121 changes: 52 additions & 69 deletions README.md
@@ -3,47 +3,49 @@ generates samplesheet compatible with TSO500 LocalApp analysis.

## Contents

1. [Dependencies](#dependencies)
2. [Description of Input Parameters](#description-of-input-parameters)
3. [Usage](#usage)
1. [Introduction](#introduction)
2. [Dependencies](#dependencies)
3. [Description of Input Parameters](#description-of-input-parameters)
4. [Usage](#usage)

## Introduction

Running LocalApp analysis requires a samplesheet in a specific format consistent with the performed sequencing to guide the analysis. Sometimes the samplesheet generated by a sequencing machine is transferred so that LocalApp has access to the file. However, sometimes it is more efficient to generate such a samplesheet from scratch. This script automates the second option.
Running LocalApp analysis requires a samplesheet in a specific format consistent with the performed sequencing to guide the analysis. Such a samplesheet can be created manually by taking a template samplesheet file provided by Illumina and filling in run-specific information for every sequencing run in Excel or a similar tool. This script attempts to automate samplesheet generation as much as possible. It takes the sequencing run information as a simple `tab` separated table and generates a samplesheet that fulfils the LocalApp requirements.


## Dependencies

python3>=3.11.4,
pandas>=2.2.0
- python3>=3.11.4,
- pandas>=2.2.0


## Description of Input Parameters

Names of the input parameters defined for the script and their possible values are listed in the table below. Except for the `--input_info_file` all the other parameters are strings or integers that will be incorporated into the output of the script.


| parameter | description |
|:---|:---|
|`--run-id`| Id of the sequencing run for which the samplesheet is generated. (**str**)|
|`--index-type`| `dual` or `simple` Type of the used index. (**str**)|
|`--index-length`| `8` or `10`. Number of nucleotides in the used index. (**int**) |
|`--investigator-name`| Value of the `Investigator Name` field in the samplesheet. Using `name (inpred_node)` could be a good convention. (**str**)|
|`--experiment-name`| Value of the `Experiment Name` field in the samplesheet. A list of study names from which the samples are could be a good convention. (**str**)|
|`--input-info-file`| Full path to an input file. (**str**)|
|`--read-length-1`| This value will be filled in the `Reads` section of the samplesheet. (**int**) [default = 101]|
|`--read-length-2`| This value will be filled in the `Reads` section of the samplesheet. (**int**) [default = 101]|
|`--adapter-read-1`| This value will be filled in the `Settings` section of the samplesheet, it is a nucleotide adapter sequence of read1. (**str**)|
|`--adapter-read-2`| This value will be filled in the `Settings` section of the samplesheet, it is a nucleotide adapter sequence of read2. (**str**)|
|`--adapter-behavior`| This value will be filled in the `Settings` section of the samplesheet. (**str**) [default = 'trim']|
|`--minimum-trimmed-read-length`| This value will be filled in the `Settings` section of the samplesheet. (**int**) [default = 35]|
|`--mask-short-reads`| This value will be filled in the `Settings` section of the samplesheet. (**int**) [default = 22]|
|`--override-cycles`| This value will be filled in the `Settings` section of the samplesheet. (**str**)|
|`--samplesheet-version`| Version in which the samplesheet should be generated. Only `v1` is implemented now. (**str**) [default = 'v1']|
|`--help`|Print the help message and exit.|
Names of the input parameters defined for the script and their possible values are listed in the table below.


| parameter | description | type | default |
|:---|:---|:---:|:---:|
|`--run-id`| ID of the sequencing run for which the samplesheet is generated. | str | |
|`--index-type`| Type of the used index. Supported values are `dual` and `simple`. | str | `dual` |
|`--index-length`| Number of nucleotides in the used index. Supported values are `8` and `10`. | int | |
|`--investigator-name`| Value of the `Investigator Name` field in the samplesheet. It is preferred for the string to be in the form `name (inpred_node)`. The string cannot contain a comma. | str | `''` |
|`--experiment-name`| Value of the `Experiment Name` field in the samplesheet. It is preferred for the string to be a space separated list of all the studies from which the run samples are. The string cannot contain a comma. | str | `''` |
|`--input-info-file`| Absolute path to an input info file. | str | |
|`--read-length-1`| Length of the sequenced forward reads. This value will be filled in the `Reads` section of the samplesheet. | int | `101` |
|`--read-length-2`| Length of the sequenced reverse reads. This value will be filled in the `Reads` section of the samplesheet. | int | `101` |
|`--adapter-read-1`| Nucleotide adapter sequence of read1. This value will be filled in the `Settings` section of the samplesheet. | str | |
|`--adapter-read-2`| Nucleotide adapter sequence of read2. This value will be filled in the `Settings` section of the samplesheet. | str | |
|`--adapter-behavior`| This value will be filled in the `Settings` section of the samplesheet and passed as an input parameter to BCL convert. Supported values are `trim` and `mask`. For more info about BCL convert, see the [BCL convert user guide](https://support.illumina.com/sequencing/sequencing_software/bcl-convert.html). | str | `trim` |
|`--minimum-trimmed-read-length`| This value will be filled in the `Settings` section of the samplesheet and passed as an input parameter to BCL convert. | int | `35` |
|`--mask-short-reads`| This value will be filled in the `Settings` section of the samplesheet and passed as an input parameter to BCL convert. | int | `22` |
|`--override-cycles`| This value will be filled in the `Settings` section of the samplesheet and passed as an input parameter to BCL convert. | str | |
|`--samplesheet-version`| Version in which the samplesheet is generated. Only `v1` is implemented now. | str | `v1` |
|`--help`|Print the help message and exit.| | |
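The parameter table can be read as a command-line specification. The following is a minimal `argparse` sketch of part of it, not the actual parser in `samplesheet_generator.py`. The helper `no_comma`, the function `build_parser`, and the choice of which parameters are required are assumptions made for illustration. The adapter and override-cycles options are left out for brevity.

```python
import argparse


def no_comma(value: str) -> str:
    """Reject values with a comma, which would break the comma separated samplesheet."""
    if "," in value:
        raise argparse.ArgumentTypeError("value must not contain a comma")
    return value


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of the documented CLI")
    # Assumption: parameters without a documented default are treated as required.
    parser.add_argument("--run-id", required=True)
    parser.add_argument("--index-type", choices=["dual", "simple"], default="dual")
    parser.add_argument("--index-length", type=int, choices=[8, 10], required=True)
    parser.add_argument("--investigator-name", type=no_comma, default="")
    parser.add_argument("--experiment-name", type=no_comma, default="")
    parser.add_argument("--input-info-file", required=True)
    parser.add_argument("--read-length-1", type=int, default=101)
    parser.add_argument("--read-length-2", type=int, default=101)
    parser.add_argument("--adapter-behavior", choices=["trim", "mask"], default="trim")
    parser.add_argument("--minimum-trimmed-read-length", type=int, default=35)
    parser.add_argument("--mask-short-reads", type=int, default=22)
    parser.add_argument("--samplesheet-version", choices=["v1"], default="v1")
    return parser


args = build_parser().parse_args(
    [
        "--run-id", "191206_NB501498_0174_AHWCNMBGXC",
        "--index-length", "8",
        "--input-info-file", "/tmp/infoFile.tsv",
    ]
)
print(args.index_type, args.index_length, args.read_length_1)
```

Using `choices` lets `argparse` reject unsupported values before any samplesheet is written, and the `no_comma` type enforces the comma restriction at the same stage.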

File format of the `input_info_file` follows.

### Input File Format
### Input Info File Format

The file is expected to contain `tab` separated values (`.tsv` file). The rows starting with character `#` are ignored (considered to be comments). The first non-commented row is considered to be a header containing column names.
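The parsing rules above (tab separated values, `#` rows ignored, first remaining row is the header) can be sketched with the standard library. This is an illustration only: the script itself depends on pandas and may read the file differently, and the function name `read_info_rows` is hypothetical. The example row reuses the column names from the test `infoFile.tsv`.

```python
import csv
import io


def read_info_rows(handle):
    """Yield one dict per data row, skipping rows that start with '#'."""
    data_lines = (line for line in handle if not line.startswith("#"))
    # The first non-comment line becomes the header (the dict keys).
    yield from csv.DictReader(data_lines, delimiter="\t")


example = (
    "# this comment row is ignored\n"
    "sample_id\tmolecule\trun_id\tbarcode\tindex\n"
    "CLAcroMetrix-D01-X01-X00\tDNA\t191206_NB501498_0174_AHWCNMBGXC\tNA\tTCCGGAGA\n"
)

for row in read_info_rows(io.StringIO(example)):
    print(row["sample_id"], row["molecule"], row["index"])
```

Reading every field as a plain string keeps values such as `NA` in the `barcode` column intact instead of converting them to missing values.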

@@ -131,47 +133,17 @@ apptainer run \

### Run Test Data Example

The script is tested with data of a specific sequencing run. The run consists of artificial samples, including AcroMetrix samples. The sequencing was performed on nextseq, with the legacy parameter setting and file formats.

#### Test Data Input File

The input info file `infoFile.tsv` is located in the `test` folder of this repository and in `/opt/test` of the created Docker image. The content of the file follows:



```
sample_id molecule run_id barcode index
CLAcroMetrix-D01-X01-X00 DNA 191206_NB501498_0174_AHWCNMBGXC NA TCCGGAGA
```
Test data are located in the `test` subfolder of the repository. Input info file is named `infoFile.tsv` and expected output is stored in `samplesheet.tsv`.

#### Test Data Output
The script is tested with data of a specific sequencing run. The run consists of artificial samples, including AcroMetrix samples. The sequencing was performed on a NextSeq instrument, with the legacy parameter setting and file formats.

```
[Header]
Investigator Name,Name (InPreD node)
Experiment Name,OUS pathology test run
Date,07/02/2024

[Reads]
101
101

[Settings]
AdapterRead1,AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
AdapterRead2,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
AdapterBehavior,trim
MinimumTrimmedReadLength,35
MaskShortReads,22
OverrideCycles,U7N1Y93;I8;I8;U7N1Y93

[Data]
Sample_ID,Sample_Type,Pair_ID,index,I7_Index_ID,index2,I5_Index_ID
CLAcroMetrix-D01-X01-X00,DNA,CLAcroMetrix-D01-X01-X00,TCCGGAGA,D702,AGGATAGG,D503
```
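The expected output above consists of four sections: `[Header]`, `[Reads]`, `[Settings]` and `[Data]`. The sketch below shows one way such text could be assembled. It is not the code in `samplesheet_generator.py`, and the function name `render_samplesheet` and its arguments are hypothetical.

```python
def render_samplesheet(header, reads, settings, data_columns, data_rows):
    """Assemble a v1-style samplesheet as text from its four sections."""
    lines = ["[Header]"]
    lines += [f"{key},{value}" for key, value in header.items()]
    lines += ["", "[Reads]"]
    lines += [str(length) for length in reads]
    lines += ["", "[Settings]"]
    lines += [f"{key},{value}" for key, value in settings.items()]
    lines += ["", "[Data]", ",".join(data_columns)]
    lines += [",".join(row) for row in data_rows]
    return "\n".join(lines) + "\n"


sheet = render_samplesheet(
    header={
        "Investigator Name": "Name (InPreD node)",
        "Experiment Name": "OUS pathology test run",
    },
    reads=[101, 101],
    settings={"AdapterBehavior": "trim", "MinimumTrimmedReadLength": 35},
    data_columns=["Sample_ID", "Sample_Type", "Pair_ID", "index"],
    data_rows=[["CLAcroMetrix-D01-X01-X00", "DNA", "CLAcroMetrix-D01-X01-X00", "TCCGGAGA"]],
)
print(sheet)
```

Because every line is joined with plain commas, a comma inside a header value would shift the fields, which is why the investigator and experiment names must not contain one.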

#### Locally

```
# ${GITHUB_REPOSITORY_LOCAL_PATH} is an absolute path
# to the samplesheet_generator repository on the local compute.

# define the testRunID value
testRunID="191206_NB501498_0174_AHWCNMBGXC"

@@ -197,7 +169,14 @@ python3 samplesheet_generator.py \
#### Docker

```
docker run docker://inpred/samplesheet_generator:latest bash
# ${GITHUB_REPOSITORY_LOCAL_PATH} is an absolute path
# to the samplesheet_generator repository on the local compute.
# ${INFO_INPUT_FILE_CONTAINER} is an absolute path to the input info file
# in the container.

docker run --rm -it \
-v ${GITHUB_REPOSITORY_LOCAL_PATH}/test/infoFile.tsv:${INFO_INPUT_FILE_CONTAINER} \
inpred/samplesheet_generator:latest bash

Docker> testRunID="191206_NB501498_0174_AHWCNMBGXC"
Docker> python3 samplesheet_generator.py \
@@ -206,7 +185,7 @@ Docker> python3 samplesheet_generator.py \
--index-length 8 \
--investigator-name "Name (InPreD node)" \
--experiment-name "OUS pathology test run" \
--input-info-file /opt/test/infoFile.tsv \
--input-info-file ${INFO_INPUT_FILE_CONTAINER} \
--read-length-1 101 \
--read-length-2 101 \
--adapter-read-1 "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" \
@@ -221,7 +200,9 @@ Docker> python3 samplesheet_generator.py \
#### Singularity/Apptainer

```
singularity run docker://inpred/samplesheet_generator:latest bash
singularity run \
-B ${GITHUB_REPOSITORY_LOCAL_PATH}/test/infoFile.tsv:${INFO_INPUT_FILE_CONTAINER} \
docker://inpred/samplesheet_generator:latest bash

Singularity> testRunID="191206_NB501498_0174_AHWCNMBGXC"
Singularity> python3 samplesheet_generator.py \
@@ -230,7 +211,7 @@ Singularity> python3 samplesheet_generator.py \
--index-length 8 \
--investigator-name "Name (InPreD node)" \
--experiment-name "OUS pathology test run" \
--input-info-file /opt/test/infoFile.tsv \
--input-info-file ${INFO_INPUT_FILE_CONTAINER} \
--read-length-1 101 \
--read-length-2 101 \
--adapter-read-1 "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" \
@@ -242,7 +223,9 @@ Singularity> python3 samplesheet_generator.py \
--samplesheet-version "v1"


apptainer run docker://inpred/samplesheet_generator:latest bash
apptainer run \
-B ${GITHUB_REPOSITORY_LOCAL_PATH}/test/infoFile.tsv:${INFO_INPUT_FILE_CONTAINER} \
docker://inpred/samplesheet_generator:latest bash

Apptainer> testRunID="191206_NB501498_0174_AHWCNMBGXC"
Apptainer> python3 samplesheet_generator.py \
@@ -251,7 +234,7 @@ Apptainer> python3 samplesheet_generator.py \
--index-length 8 \
--investigator-name "Name (InPreD node)" \
--experiment-name "OUS pathology test run" \
--input-info-file /opt/test/infoFile.tsv \
--input-info-file ${INFO_INPUT_FILE_CONTAINER} \
--read-length-1 101 \
--read-length-2 101 \
--adapter-read-1 "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" \