Generate reports per run, per project and per lane #13
Changes from all commits
a31040e
0da5870
d233d8f
307e43c
627cf94
c9ba028
2cfc91d
fbfb02d
8a19929
eba628e
23f69d1
1ebf3f1
b0bf471
9e0eca3
98e60bf
d95c660
e0527cc
42159a1
ef61f9f
51b01e9
7c7f31f
1c4f6e0
211bfaf
acc82d0
df4e9cb
2303db9
1ea8ac0
048765f
4329bb9
3ff8503
8ac4d76
84c9b3d
aaf17b6
e6dfea9
02affeb
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -6,3 +6,5 @@ results/ | |
testing/ | ||
testing* | ||
*.pyc | ||
.nf-test | ||
.nf-test.log |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,3 @@ | ||
sample,lane,project,fastq_1,fastq_2,rundir | ||
sample,lane,group,fastq_1,fastq_2,rundir | ||
SAMPLE_PAIRED_END,1,P001,/path/to/fastq/files/AEG588A1_S1_L002_R1_001.fastq.gz,/path/to/fastq/files/AEG588A1_S1_L002_R2_001.fastq.gz,/path/to/rundir | ||
SAMPLE_SINGLE_END,2,P002,/path/to/fastq/files/AEG588A4_S4_L003_R1_001.fastq.gz,,/path/to/rundir |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -10,47 +10,45 @@ | |
|
||
## Samplesheet input | ||
|
||
You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location. It has to be a comma-separated file with 3 columns, and a header row as shown in the examples below. | ||
You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location. | ||
|
||
```bash | ||
--input '[path to samplesheet file]' | ||
``` | ||
|
||
### Multiple runs of the same sample | ||
### Full samplesheet | ||
|
||
The `sample` identifiers have to be the same when you have re-sequenced the same sample more than once e.g. to increase sequencing depth. The pipeline will concatenate the raw reads before performing any downstream analysis. Below is an example for the same sample sequenced across 3 lanes: | ||
The following simple run dir structure... | ||
|
||
```csv title="samplesheet.csv" | ||
sample,fastq_1,fastq_2 | ||
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz | ||
CONTROL_REP1,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz | ||
CONTROL_REP1,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz | ||
``` | ||
run_dir | ||
Why did you replace the exemplary filenames with this synthetic example? I think this may lead to confusion, because it may prompt people to tediously rename their files prior to a run. We should make clear that Also, judgemental adjectives like "simple" (or "difficult" etc.) should ideally be avoided in a README.
Since @mahesh-panchal requested we visualize the directory structure, I thought it would be easier to connect the dots to the example samplesheet if all the file names in the dir also contained the information shown in the samplesheet. I don't necessarily get the impression we are suggesting the files need to follow a particular naming convention by showing an example that is as informative as possible, but I don't feel too strongly about it. |
||
├── sample1_lane1_group1_r1.fq.gz | ||
├── sample2_lane1_group1_r1.fq.gz | ||
├── sample3_lane2_group2_r1.fq.gz | ||
└── sample4_lane2_group3_r1.fq.gz | ||
``` | ||
|
||
### Full samplesheet | ||
|
||
The pipeline will auto-detect whether a sample is single- or paired-end using the information provided in the samplesheet. The samplesheet can have as many columns as you desire, however, there is a strict requirement for the first 3 columns to match those defined in the table below. | ||
|
||
A final samplesheet file consisting of both single- and paired-end data may look something like the one below. This is for 6 samples, where `TREATMENT_REP3` has been sequenced twice. | ||
...would be represented in the following samplesheet (shown as .tsv for readability) | ||
|
||
```csv title="samplesheet.csv" | ||
sample,fastq_1,fastq_2 | ||
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz | ||
CONTROL_REP2,AEG588A2_S2_L002_R1_001.fastq.gz,AEG588A2_S2_L002_R2_001.fastq.gz | ||
CONTROL_REP3,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz | ||
TREATMENT_REP1,AEG588A4_S4_L003_R1_001.fastq.gz, | ||
TREATMENT_REP2,AEG588A5_S5_L003_R1_001.fastq.gz, | ||
TREATMENT_REP3,AEG588A6_S6_L003_R1_001.fastq.gz, | ||
TREATMENT_REP3,AEG588A6_S6_L004_R1_001.fastq.gz, | ||
sample lane group fastq_1 fastq_2 rundir | ||
Do you think that the order of the columns is advisable like this? Intuitively, I would have put all categorical variables together at the end, so that additional columns can be added easily later, if required e.g. by other sequencing technologies.
I don't have a strong opinion on this, and I feel this deserves to be discussed in a new issue/PR. This PR was not meant to change the input format. This commit to |
||
sample1 1 group1 path/to/run_dir/sample1_lane1_group1_r1.fq.gz path/to/run_dir | ||
sample2 1 group1 path/to/run_dir/sample2_lane1_group1_r1.fq.gz path/to/run_dir | ||
sample3 2 group2 path/to/run_dir/sample3_lane2_group2_r1.fq.gz path/to/run_dir | ||
sample4 2 group3 path/to/run_dir/sample4_lane2_group3_r1.fq.gz path/to/run_dir | ||
|
||
``` | ||
|
||
| Column | Description | | ||
| --------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ||
| `sample` | Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. Spaces in sample names are automatically converted to underscores (`_`). | | ||
| `lane` | Lane where the sample was processed on an Illumina instrument (optional). | | ||
kedhammar marked this conversation as resolved.
|
||
| `group` | Group the sample belongs to, useful when several groups are pooled together (optional). | | ||
| `fastq_1` | Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". | | ||
| `fastq_2` | Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". | | ||
| `fastq_2` | Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz" (optional). | | ||
| `rundir` | Path to the runfolder containing extra information about the sequencing run (optional). | | ||
|
||
An [example samplesheet](../assets/samplesheet.csv) has been provided with the pipeline. | ||
Another [example samplesheet](../assets/samplesheet.csv) has been provided with the pipeline. | ||
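The column rules in the table above can be sketched in a few lines of Python. This is only an illustration of the documented rules (required `sample` and `fastq_1`, optional `lane`, `group`, `fastq_2` and `rundir`, gzipped FastQ extensions, spaces converted to underscores). The pipeline itself validates the samplesheet against `assets/schema_input.json`, and the function names `normalise_row` and `read_samplesheet` are hypothetical helpers, not part of the pipeline.

```python
import csv
import io

# Extensions accepted for FastQ files, as stated in the column table.
FASTQ_SUFFIXES = (".fastq.gz", ".fq.gz")


def normalise_row(row):
    """Apply the documented column rules to one samplesheet row (a dict)."""
    sample = (row.get("sample") or "").strip()
    if not sample:
        raise ValueError("'sample' is required")

    fastq_1 = (row.get("fastq_1") or "").strip()
    if not fastq_1.endswith(FASTQ_SUFFIXES):
        raise ValueError(f"fastq_1 must end in .fastq.gz or .fq.gz: {fastq_1!r}")

    # fastq_2 is optional, but must be a gzipped FastQ when present.
    fastq_2 = (row.get("fastq_2") or "").strip()
    if fastq_2 and not fastq_2.endswith(FASTQ_SUFFIXES):
        raise ValueError(f"fastq_2 must end in .fastq.gz or .fq.gz: {fastq_2!r}")

    def optional(key):
        # Empty optional cells are represented as None.
        return (row.get(key) or "").strip() or None

    return {
        "sample": sample.replace(" ", "_"),  # spaces become underscores
        "lane": optional("lane"),
        "group": optional("group"),
        "fastq_1": fastq_1,
        "fastq_2": fastq_2 or None,
        "rundir": optional("rundir"),
    }


def read_samplesheet(text):
    """Parse comma-separated samplesheet text into normalised rows."""
    return [normalise_row(r) for r in csv.DictReader(io.StringIO(text))]
```

A row without `fastq_2` comes back with `fastq_2` set to `None`, which corresponds to the single-end case in the samplesheet examples.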
|
||
## Running the pipeline | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
config { | ||
|
||
testsDir "tests" | ||
workDir ".nf-test" | ||
configFile "tests/nextflow.config" | ||
profile "test,docker" | ||
|
||
} |
kedhammar marked this conversation as resolved.
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -84,7 +84,7 @@ workflow PIPELINE_INITIALISATION { | |
.fromSamplesheet("input") // Validates samplesheet against $projectDir/assets/schema_input.json. Path to validation schema is defined by $projectDir/nextflow_schema.json | ||
.map { | ||
meta, fastq_1, fastq_2 -> | ||
def id_string = "${meta.sample}_${meta.project ?: "ungrouped"}_${meta.lane}" | ||
def id_string = "${meta.sample}_${meta.group ?: "ungrouped"}_${meta.lane}" | ||
Doesn't
This comment was marked as resolved.
It's been removed from required.
😳 Have I then been reviewing the wrong/outdated version of this PR all the time? Because I ran
I removed it so we would be able to run on sequencing platforms without lanes, e.g. ONT.
Sure, but what about the other paths, e.g. channel where
Could you point out where this would be an issue? I'll note that I don't mind re-working this code into something more explicit, I simply lack the know-how as of now 😆
I guess the question is, is
and then the id string should be:
Thanks for clarifying! From what I remember of the initial meeting, the Intuitively I think having a consistent way to generate the
Wouldn't it be easier to use the user-provided sample column? Could be potentially combined with a short |
||
def updated_meta = meta + [ id: id_string ] | ||
if (!fastq_2) { | ||
return [ updated_meta.id, updated_meta + [ single_end:true ], [ fastq_1 ] ] | ||
|
@@ -101,7 +101,6 @@ workflow PIPELINE_INITIALISATION { | |
// meta, fastqs -> | ||
// return [ meta, fastqs.flatten() ] | ||
// } | ||
.view() | ||
.set { ch_samplesheet } | ||
|
||
emit: | ||
|
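For the `id_string` discussion in this file, here is a minimal Python sketch of the construction, written outside Nextflow purely for illustration. The `"ungrouped"` fallback mirrors `meta.group ?: "ungrouped"` from the diff. Dropping the lane segment when `lane` is absent is an assumption made for this sketch, not something the thread settled on, and `build_id` is a hypothetical name.

```python
def build_id(meta):
    """Build a sample id of the form <sample>_<group>[_<lane>]."""
    # Mirrors `meta.group ?: "ungrouped"` in the Groovy closure.
    group = meta.get("group") or "ungrouped"
    parts = [str(meta["sample"]), str(group)]

    # Assumption: omit the lane segment entirely when no lane is given,
    # e.g. for platforms without lanes.
    lane = meta.get("lane")
    if lane not in (None, ""):
        parts.append(str(lane))

    return "_".join(parts)
```

With this sketch a row with sample `S1`, group `P001` and lane `1` yields `S1_P001_1`, while a row with only a sample name yields `S1_ungrouped`. A plain string interpolation of a missing lane would instead embed the text of the null value in the id, which is the concern raised in the comments.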
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,40 @@ | ||
nextflow_pipeline { | ||
|
||
name "Test Workflow main.nf on MiSeq data" | ||
script "../main.nf" | ||
tag "seqinspector" | ||
tag "PIPELINE" | ||
|
||
test("MiSeq data test") { | ||
|
||
when { | ||
config "./MiSeq.main.nf.test.config" | ||
params { | ||
outdir = "$outputDir" | ||
} | ||
} | ||
|
||
then { | ||
assertAll( | ||
{ assert workflow.success }, | ||
{ assert snapshot( | ||
path("$outputDir/multiqc/lanes/L1/multiqc_data/multiqc_citations.txt"), | ||
path("$outputDir/multiqc/lanes/L1/multiqc_data/multiqc_fastqc.txt"), | ||
path("$outputDir/multiqc/lanes/L1/multiqc_data/multiqc_general_stats.txt"), | ||
path("$outputDir/multiqc/lanes/L1/multiqc_data/multiqc_software_versions.txt"), | ||
|
||
path("$outputDir/multiqc/groups/P001/multiqc_data/multiqc_citations.txt"), | ||
path("$outputDir/multiqc/groups/P001/multiqc_data/multiqc_fastqc.txt"), | ||
path("$outputDir/multiqc/groups/P001/multiqc_data/multiqc_general_stats.txt"), | ||
path("$outputDir/multiqc/groups/P001/multiqc_data/multiqc_software_versions.txt"), | ||
|
||
path("$outputDir/multiqc/multiqc_data/multiqc_citations.txt"), | ||
path("$outputDir/multiqc/multiqc_data/multiqc_fastqc.txt"), | ||
path("$outputDir/multiqc/multiqc_data/multiqc_general_stats.txt"), | ||
path("$outputDir/multiqc/multiqc_data/multiqc_software_versions.txt"), | ||
).match() | ||
} | ||
) | ||
} | ||
} | ||
} |
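The snapshot paths in this test imply three report locations: one global report under `multiqc/`, one per lane under `multiqc/lanes/`, and one per group under `multiqc/groups/`. The sketch below shows one way samplesheet rows could be partitioned to match that layout. The `L` prefix for lane directories is inferred from `lanes/L1` in the test, and `report_dirs` is a hypothetical helper, not the pipeline's implementation.

```python
from collections import defaultdict


def report_dirs(rows):
    """Map each report directory to the samples that would feed it."""
    reports = defaultdict(list)
    for row in rows:
        sample = row["sample"]
        # Every sample contributes to the global report.
        reports["multiqc"].append(sample)
        # Assumption: lane N is reported under lanes/L<N>.
        if row.get("lane"):
            reports[f"multiqc/lanes/L{row['lane']}"].append(sample)
        # Samples without a group only appear in the global and lane reports.
        if row.get("group"):
            reports[f"multiqc/groups/{row['group']}"].append(sample)
    return dict(reports)
```

Applied to the two rows of `assets/samplesheet.csv`, this gives a global report with both samples plus `lanes/L1`, `lanes/L2`, `groups/P001` and `groups/P002` with one sample each.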
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
// Load the basic test config | ||
includeConfig 'nextflow.config' | ||
|
||
// Load the correct samplesheet for that test | ||
params { | ||
input = params.pipelines_testdata_base_path + 'seqinspector/testdata/MiSeq/samplesheet.csv' | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
{ | ||
"MiSeq data test": { | ||
"content": [ | ||
"multiqc_citations.txt:md5,4c806e63a283ec1b7e78cdae3a923d4f", | ||
"multiqc_fastqc.txt:md5,692b8aed0614ed1655f2c1cbea1ba312", | ||
"multiqc_general_stats.txt:md5,630167d67d3f92408cd1a04422c7196f", | ||
"multiqc_software_versions.txt:md5,7452f1f7aae2a8a4066c2ef6cd5ceb95", | ||
"multiqc_citations.txt:md5,4c806e63a283ec1b7e78cdae3a923d4f", | ||
"multiqc_fastqc.txt:md5,692b8aed0614ed1655f2c1cbea1ba312", | ||
"multiqc_general_stats.txt:md5,630167d67d3f92408cd1a04422c7196f", | ||
"multiqc_software_versions.txt:md5,7452f1f7aae2a8a4066c2ef6cd5ceb95", | ||
"multiqc_citations.txt:md5,4c806e63a283ec1b7e78cdae3a923d4f", | ||
"multiqc_fastqc.txt:md5,692b8aed0614ed1655f2c1cbea1ba312", | ||
"multiqc_general_stats.txt:md5,630167d67d3f92408cd1a04422c7196f", | ||
"multiqc_software_versions.txt:md5,7452f1f7aae2a8a4066c2ef6cd5ceb95" | ||
], | ||
"timestamp": "2024-05-30T13:14:20.263485" | ||
} | ||
} |
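Each snapshot entry above has the form `<file name>:md5,<digest>`. To compare a locally produced report against the snapshot by hand, an entry of the same shape can be generated with the Python standard library. This sketch assumes the digest is taken over the raw file bytes, and `snapshot_entry` is a hypothetical helper name.

```python
import hashlib
from pathlib import Path


def snapshot_entry(path):
    """Return '<file name>:md5,<hex digest>' for the file at `path`."""
    p = Path(path)
    # Assumption: the digest covers the file's raw bytes.
    digest = hashlib.md5(p.read_bytes()).hexdigest()
    return f"{p.name}:md5,{digest}"
```

Note that the same four digests repeat three times in the snapshot, so for this test data the lane, group and global reports have identical contents for these files.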
what about an extra "individual" field for when you have multiple samples from the same patient (thinking cancer sample sarek style)?
or is this what you mean by group?