-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generate reports per run, per project and per lane #13
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
Issue with the previous implementation was that sometimes MULTIQC_PER_LANE would execute before the extra files were collected into `ch_multiqc_extra_files`, causing `null` to be added to the list of files passed to multiqc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like what I was suggesting 👍🏽
@@ -84,7 +84,7 @@ workflow PIPELINE_INITIALISATION { | |||
.fromSamplesheet("input") // Validates samplesheet against $projectDir/assets/schema_input.json. Path to validation schema is defined by $projectDir/nextflow_schema.json | |||
.map { | |||
meta, fastq_1, fastq_2 -> | |||
def id_string = "${meta.sample}_${meta.project ?: "ungrouped"}_${meta.lane}" | |||
def id_string = "${meta.sample}_${meta.group ?: "ungrouped"}_${meta.lane}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't lane
need a default value too if it's not required?
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's been removed from required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😳 Have I then been reviewing the wrong/outdated version of this PR all the time? Because I ran gh pr 13 checkout
and it is still in there for me locally ?!?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed it so we would be able to run on sequencing platforms without lanes, e.g. ONT.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, but what about the other paths, e.g. channel where meta.group
has a setting, but meta.lane
has nothing (and the filter is on meta.group
)?
The name will include null
in it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you point out where this would be an issue?
I'll note that I don't mind re-working this code into something more explicit, I simply lack the know-how as of now 😆
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the question is, is id
used for anything important: For example with the Promethion test, the csv looks like:
sample,lane,group,fastq_1,fastq_2,rundir
hg001,,r10p41_e8p2_human_runs_jkw,https://github.com/nf-core/test-datasets/raw/seqinspector/testdata/PromethION/20230505_1857_1B_PAO99309_94e07fab/fastq_pass/PAO99309_pass__94e07fab_c3641428_1.fastq.gz,,
and then the id string should be: hg001_r10p41_e8p2_human_runs_jkw_null
. Does it matter that this is the case?
When you're grouping the files by group r10p41_e8p2_human_runs_jkw
I guess lane information is not needed at all downstream of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarifying! From what I remember of the initial meeting, the id
was intended as simply a concatenation of fields to ensure uniqueness within the pipeline run. This is still ensured even if some of the fields are null, right?
Intuitively I think having a consistent way to generate the id
that sometimes contains null is preferable to setting up different conventions for generating it across different sequencing platforms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be easier to use the user-provided sample column? Could be potentially combined with a short uuid
or hash
to be unique in case we have samples that extend over multiple input files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work and interesting proposal for the channel architecture. I have a different pattern in mind, but will need to try out first, if it works.
@@ -84,7 +84,7 @@ workflow PIPELINE_INITIALISATION { | |||
.fromSamplesheet("input") // Validates samplesheet against $projectDir/assets/schema_input.json. Path to validation schema is defined by $projectDir/nextflow_schema.json | |||
.map { | |||
meta, fastq_1, fastq_2 -> | |||
def id_string = "${meta.sample}_${meta.project ?: "ungrouped"}_${meta.lane}" | |||
def id_string = "${meta.sample}_${meta.group ?: "ungrouped"}_${meta.lane}" |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an impressive work. Conceptually, I wonder if another channel architecture could simplify usage. But I will need to experiment first to see if that idea would work in the first place. Hence, only the minor remarks here first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because those .filter()
and .multiMap()
chains seemed unnecessarily complex, I experimented with an own solution (including the .cross()
operator, which I then dropped again), but ultimately the differences aren't large. Like you, I ended up moving the grouping variables out of the meta map and grouped over them.
Most of what makes your solution seemingly more complicated is the juggling with MultiQC config files, which I have not included in my minimal example, so it is an unfair competition. On the plus side, I have only now understood your approach, I believe :-)
#!/usr/bin/env nextflow
workflow {
ch_samplesheet = Channel.of(
[['sample':'SampleA', 'group':'S1', 'lane':'1' ], ['/nf-core/test-datasets/raw/seqinspector/testdata/NovaSeq6000/200624_A00834_0183_BHMTFYDRXX/Sample1_S1_L001_R1_001.fastq.gz']],
[['sample':'SampleB', 'group':'S1', 'lane':'2'], ['/nf-core/test-datasets/raw/seqinspector/testdata/NovaSeq6000/200624_A00834_0183_BHMTFYDRXX/SampleA_S2_L001_R1_001.fastq.gz']],
[['sample':'SampleC', 'group':'S2', 'lane':'1'], ['/nf-core/test-datasets/raw/seqinspector/testdata/NovaSeq6000/200624_A00834_0183_BHMTFYDRXX/Sample23_S3_L001_R1_001.fastq.gz']],
[['sample':'SampleD', 'group':'S2', 'lane':'1'], ['/nf-core/test-datasets/raw/seqinspector/testdata/NovaSeq6000/200624_A00834_0183_BHMTFYDRXX/sampletest_S4_L001_R1_001.fastq.gz']],
[['sample':'Undetermined', 'group':null, 'lane':'1'], ['/nf-core/test-datasets/raw/seqinspector/testdata/NovaSeq6000/200624_A00834_0183_BHMTFYDRXX/Undetermined_S0_L001_R1_001.fastq.gz']]
)
// ------------------------------------------------------
// Apply the various QC Tools to that same input channel
// ------------------------------------------------------
ch_qc_outputs = some_qc_tool(ch_samplesheet)
// Mix everything together: No problem as long as the meta remains intact
// I tried join here, but it does not tolerate null values apparently.
ch_qc_outputs = ch_qc_outputs.mix(some_other_qc_tool(ch_samplesheet))
// ----------------------------------------------------------------------------------------------------------------------------------
// At the very end, we move the grouping variables to the front and groupTuples based on combination of all possible grouping levels
// ----------------------------------------------------------------------------------------------------------------------------------
// we also simplify the meta to the sample name, the only thing still needed.
ch_qc_outputs_final = ch_qc_outputs.map{ meta, sample -> [ "${meta.group}", "${meta.lane}", ["${meta.sample}", sample]]}.groupTuple(by: [0,1])
// -----------------------------------------------------------------------------------------------
// Group again to the desired level (e.g. lanes)
// -----------------------------------------------------------------------------------------------
ch_qc_outputs_lane_subsets = ch_qc_outputs_final.groupTuple(by: [1])
ch_qc_outputs_group_subsets = ch_qc_outputs_final.groupTuple(by: [0])
}
process some_qc_tool {
input:
tuple val(meta), path(fastq)
output:
tuple val(meta), path("*.log"), emit: qc
script:
"""
echo "QC of $fastq" > qc_stats.log
"""
}
process some_other_qc_tool {
input:
tuple val(meta), path(fastq)
output:
tuple val(meta), path("*.log"), emit: qc
script:
"""
echo "QC of $fastq" > qc_stats.log
"""
}
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
Co-authored-by: Matthias Zepper <[email protected]>
Co-authored-by: Matthias Zepper <[email protected]>
Co-authored-by: Matthias Zepper <[email protected]>
This comment was marked as resolved.
This comment was marked as resolved.
|
I've tried to clean up the discussion thread and have pushed some additional commits to address simple issues. Requesting re-reviews. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still have a few open questions, but I also think that could be changed in a subsequent refactor if desired. Thus, I suggest merging and tackle that later?
TREATMENT_REP2,AEG588A5_S5_L003_R1_001.fastq.gz, | ||
TREATMENT_REP3,AEG588A6_S6_L003_R1_001.fastq.gz, | ||
TREATMENT_REP3,AEG588A6_S6_L004_R1_001.fastq.gz, | ||
sample lane group fastq_1 fastq_2 rundir |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think that the order of the columns is advisable like this? Intuitively, I would have put all categorical variables together at the end, so that additional columns can be added easily later, if required e.g. by other sequencing technologies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a strong opinion on this, and I feel this deserves to be discussed in a new issue/pr. This pr was not meant to change the input format. This commit to usage.md
just fixes the documentation so that it's up to date with what the format actually is.
CONTROL_REP1,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz | ||
CONTROL_REP1,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz | ||
``` | ||
run_dir |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you replace the exemplary filenames with this synthetic example? I think this may lead to confusion, because it may prompt people to tediously rename their files prior to a run. We should make clear that seqinspector
takes the relevant information from the columns in the sample sheet and not suggest that the file names matter.
Also, judgemental adjectives like "simple" (or "difficult" etc.) should ideally be avoided in a README.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since @mahesh-panchal requested we visualize the directory structure, I thought it would be easier to connect the dots to the example samplesheet if all the file names in the dir also contained the information shown in the samplesheet.
I don't necessarily get the impression we are suggesting the files need to follow a particular naming convention by showing an example that is as informative as possible, but I don't feel too strongly about it.
@@ -84,7 +84,7 @@ workflow PIPELINE_INITIALISATION { | |||
.fromSamplesheet("input") // Validates samplesheet against $projectDir/assets/schema_input.json. Path to validation schema is defined by $projectDir/nextflow_schema.json | |||
.map { | |||
meta, fastq_1, fastq_2 -> | |||
def id_string = "${meta.sample}_${meta.project ?: "ungrouped"}_${meta.lane}" | |||
def id_string = "${meta.sample}_${meta.group ?: "ungrouped"}_${meta.lane}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be easier to use the user-provided sample column? Could be potentially combined with a short uuid
or hash
to be unique in case we have samples that extend over multiple input files?
) | ||
|
||
// Generate reports by group | ||
multiqc_extra_files_per_group = ch_multiqc_files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The purpose and construction of the extra_files channel(s) still remains elusive to me.
Firstly, it is constructed from the ch_multiqc_files
channel, that into which all the output files from the QC tools are mixed. This means it is a rather large channel that then needs to be filtered and mapped. If the only purpose is to get all group levels, I would prefer to start from the ch_samplesheet
, which already should comprise all relevant information.
Secondly, all the files in ch_multiqc_extra_files
(as of now) are not specific to the generated MultiQC report. ch_workflow_summary
, ch_multiqc_custom_methods_description
and ch_collated_versions
are all global. Thus, I fail to see why this channel needs to be constructed for every grouping level instead of being reused for all MultiQC processes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct me if I'm wrong, but when a process or workflow (in this case MULTIQC) is run on two or more queue channels, it will try to zip them and run on each pair of values.
If you don't duplicate the multiqc extra file to follow the same grouping as in the first channel, nextflow will run the lane 1 files against the first multiqc extra file, then the lane 2 files against the second multiqc extra file and so on.
This was very hard to get right, I'm all ears if you see a better way to do it :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realized I confused myself with my explanation (which for sure is sign that this should be simplified). Let me try again 😅
When the files are provided to MULTIQC_BY_LANE
, lane_mqc_files.samples_per_lane
looks like this: [list of samples for lane 1], [list of samples for lane 2], ...
. The extra multiqc files are needed for each report and need to be included in each of these lists. I thought the best solution would be to use a map over these lists and append the extra files each time, but I could never got that to work.
If you find a better way to perform this operation, I'm all for it :)
.map { meta, sample -> [ "[GROUP:${meta.group}]", meta, sample ] } | ||
.groupTuple() | ||
.tap { mqc_by_group } | ||
.collectFile{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it really required to construct a custom MultiQC config just to set the output file paths? I somehow think that it should be possible to handle that in the publishDir
of the module.config
? Or am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue here is that if you run the MultiQC module once for each lane without specifying different configs each time, it will create files with the same name regardless the lane number. Since the filename is all you have to play with when setting publishDir
, it becomes very hard to sort them out into different folders.
First, prepare a samplesheet with your input data that looks as follows: | ||
|
||
`samplesheet.csv`: | ||
|
||
```csv | ||
sample,fastq_1,fastq_2 | ||
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz | ||
sample,lane,group,fastq_1,fastq_2,rundir |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about an extra "individual" field for when you have multiple samples from the same patient (thinking cancer sample sarek style)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or is this what you mean by group?
I'll merge this now, thank you all for your reviews and comments. There are still some open discussions that I think are worth addressing but are not critical to this feature, we can keep discussing them here and in subsequent PRs. |
This PR introduces MultiQC report generation by lane, by rundir and by sample group.
Closes #3
PR checklist
nf-core lint
).nf-test test main.nf.test -profile test,docker
).nextflow run . -profile debug,test,docker --outdir <OUTDIR>
).docs/usage.md
is updated.docs/output.md
is updated.CHANGELOG.md
is updated.README.md
is updated (including new tool citations and authors/contributors).