Generate reports per run, per project and per lane #13

Merged: 35 commits merged into nf-core:dev on May 30, 2024

@Aratz Aratz commented Mar 28, 2024

This PR introduces MultiQC report generation by lane, by rundir and by sample group.

Closes #3

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If necessary, also make a PR on the nf-core/seqinspector branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nf-test test main.nf.test -profile test,docker).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@Aratz Aratz self-assigned this Mar 28, 2024

@Aratz Aratz changed the base branch from master to dev April 8, 2024 13:30
@Aratz Aratz marked this pull request as ready for review May 14, 2024 10:51
@mahesh-panchal mahesh-panchal left a comment

This looks like what I was suggesting 👍🏽

@@ -84,7 +84,7 @@ workflow PIPELINE_INITIALISATION {
         .fromSamplesheet("input") // Validates samplesheet against $projectDir/assets/schema_input.json. Path to validation schema is defined by $projectDir/nextflow_schema.json
         .map {
             meta, fastq_1, fastq_2 ->
-                def id_string = "${meta.sample}_${meta.project ?: "ungrouped"}_${meta.lane}"
+                def id_string = "${meta.sample}_${meta.group ?: "ungrouped"}_${meta.lane}"
Member

Doesn't lane need a default value too if it's not required?

Member

It's been removed from required.

Member

😳 Have I been reviewing the wrong/outdated version of this PR all along? I ran gh pr checkout 13 and it is still in there for me locally ?!?

I removed it so we would be able to run on sequencing platforms without lanes, e.g. ONT.

Member

Sure, but what about the other paths, e.g. a channel where meta.group is set but meta.lane is empty (and the filter is on meta.group)?
The name will then include null.

Could you point out where this would be an issue?

I'll note that I don't mind re-working this code into something more explicit, I simply lack the know-how as of now 😆

Member

I guess the question is whether id is used for anything important. For example, with the PromethION test, the CSV looks like:

sample,lane,group,fastq_1,fastq_2,rundir
hg001,,r10p41_e8p2_human_runs_jkw,https://github.com/nf-core/test-datasets/raw/seqinspector/testdata/PromethION/20230505_1857_1B_PAO99309_94e07fab/fastq_pass/PAO99309_pass__94e07fab_c3641428_1.fastq.gz,,

and then the id string would be: hg001_r10p41_e8p2_human_runs_jkw_null. Does it matter that this is the case?
When you're grouping the files by group r10p41_e8p2_human_runs_jkw, I guess lane information is not needed at all downstream of this?
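
To make the question concrete, here is a minimal Python sketch that models the string interpolation from the diff. It assumes, as described above, that the empty lane cell reaches the `.map` closure as null, and it relies on Groovy printing a null value as the text `null` inside an interpolated string. The names `render` and `make_id` and the `default_lane` parameter are hypothetical and only illustrate what a lane default would change; none of them are part of the PR.

```python
def render(value):
    """Mimic Groovy string interpolation, which prints null as 'null'."""
    return "null" if value is None else str(value)


def make_id(meta, default_lane=None):
    """Build the id the way the diff does: sample_group_lane.

    Only `group` has a fallback ('ungrouped') in the diff. `default_lane`
    is a hypothetical extension and not part of the PR.
    """
    group = meta.get("group") or "ungrouped"  # models `meta.group ?: "ungrouped"`
    lane = meta.get("lane") or default_lane   # empty samplesheet cell -> None
    return f"{render(meta.get('sample'))}_{render(group)}_{render(lane)}"


promethion_row = {
    "sample": "hg001",
    "group": "r10p41_e8p2_human_runs_jkw",
    "lane": None,
}

print(make_id(promethion_row))
# hg001_r10p41_e8p2_human_runs_jkw_null
print(make_id(promethion_row, default_lane="nolane"))
# hg001_r10p41_e8p2_human_runs_jkw_nolane
```

Either way the id stays unique as long as the sample and group combination is unique; the default only changes how the id reads.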

Thanks for clarifying! From what I remember of the initial meeting, the id was intended as simply a concatenation of fields to ensure uniqueness within the pipeline run. This is still ensured even if some of the fields are null, right?

Intuitively I think having a consistent way to generate the id that sometimes contains null is preferable to setting up different conventions for generating it across different sequencing platforms.

Member

Wouldn't it be easier to use the user-provided sample column? It could be combined with a short UUID or hash to stay unique in case a sample extends over multiple input files.
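
A rough Python sketch of this idea, under the assumption that the input file paths are what distinguishes rows sharing a sample name. The helper names `short_hash` and `sample_based_id`, the 8-character length and the example file names are all hypothetical choices for illustration, not something the PR defines.

```python
import hashlib


def short_hash(text, length=8):
    """Return the first `length` hex characters of the SHA-256 digest of `text`."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:length]


def sample_based_id(sample, fastq_1, fastq_2=None):
    """Combine the user-provided sample name with a hash of its input paths."""
    key = "|".join(path for path in (fastq_1, fastq_2) if path)
    return f"{sample}_{short_hash(key)}"


# The same sample sequenced on two lanes: two input files, two distinct ids.
id_lane3 = sample_based_id("CONTROL_REP1", "AEG588A1_S1_L003_R1_001.fastq.gz")
id_lane4 = sample_based_id("CONTROL_REP1", "AEG588A1_S1_L004_R1_001.fastq.gz")

print(id_lane3 != id_lane4)                 # True
print(id_lane3.startswith("CONTROL_REP1_"))  # True
```

The hash is deterministic, so re-running with the same samplesheet yields the same ids, which a random UUID would not.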

@MatthiasZepper MatthiasZepper left a comment

Great work and an interesting proposal for the channel architecture. I have a different pattern in mind, but will first need to try out whether it works.

Member

This is impressive work. Conceptually, I wonder if another channel architecture could simplify usage, but I will need to experiment first to see whether that idea would work at all. Hence, only minor remarks for now.

Member

Because those .filter() and .multiMap() chains seemed unnecessarily complex, I experimented with my own solution (including the .cross() operator, which I then dropped again), but ultimately the differences aren't large. Like you, I ended up moving the grouping variables out of the meta map and grouping over them.

Most of what makes your solution seemingly more complicated is the juggling with MultiQC config files, which I have not included in my minimal example, so it is an unfair competition. On the plus side, I have only now understood your approach, I believe :-)

#!/usr/bin/env nextflow

workflow {
    ch_samplesheet = Channel.of(
        [['sample':'SampleA', 'group':'S1', 'lane':'1'], ['/nf-core/test-datasets/raw/seqinspector/testdata/NovaSeq6000/200624_A00834_0183_BHMTFYDRXX/Sample1_S1_L001_R1_001.fastq.gz']],
        [['sample':'SampleB', 'group':'S1', 'lane':'2'], ['/nf-core/test-datasets/raw/seqinspector/testdata/NovaSeq6000/200624_A00834_0183_BHMTFYDRXX/SampleA_S2_L001_R1_001.fastq.gz']],
        [['sample':'SampleC', 'group':'S2', 'lane':'1'], ['/nf-core/test-datasets/raw/seqinspector/testdata/NovaSeq6000/200624_A00834_0183_BHMTFYDRXX/Sample23_S3_L001_R1_001.fastq.gz']],
        [['sample':'SampleD', 'group':'S2', 'lane':'1'], ['/nf-core/test-datasets/raw/seqinspector/testdata/NovaSeq6000/200624_A00834_0183_BHMTFYDRXX/sampletest_S4_L001_R1_001.fastq.gz']],
        [['sample':'Undetermined', 'group':null, 'lane':'1'], ['/nf-core/test-datasets/raw/seqinspector/testdata/NovaSeq6000/200624_A00834_0183_BHMTFYDRXX/Undetermined_S0_L001_R1_001.fastq.gz']]
    )

    // ------------------------------------------------------
    // Apply the various QC Tools to that same input channel
    // ------------------------------------------------------

    ch_qc_outputs = some_qc_tool(ch_samplesheet)
    // Mix everything together: No problem as long as the meta remains intact
    // I tried join here, but it does not tolerate null values apparently.
    ch_qc_outputs = ch_qc_outputs.mix(some_other_qc_tool(ch_samplesheet))

    // ----------------------------------------------------------------------------------------------------------------------------------
    // At the very end, we move the grouping variables to the front and groupTuples based on combination of all possible grouping levels
    // ----------------------------------------------------------------------------------------------------------------------------------

    // we also simplify the meta to the sample name, the only thing still needed.
    ch_qc_outputs_final = ch_qc_outputs
        .map { meta, sample -> [ "${meta.group}", "${meta.lane}", ["${meta.sample}", sample] ] }
        .groupTuple(by: [0,1])

    // -----------------------------------------------------------------------------------------------
    // Group again to the desired level (e.g. lanes)
    // -----------------------------------------------------------------------------------------------

    ch_qc_outputs_lane_subsets = ch_qc_outputs_final.groupTuple(by: [1])

    ch_qc_outputs_group_subsets = ch_qc_outputs_final.groupTuple(by: [0])
}

process some_qc_tool {
    input:
        tuple val(meta), path(fastq)
    output:
        tuple val(meta), path("*.log"), emit: qc
    script:
    """
    echo "QC of $fastq" > qc_stats.log
    """
}

process some_other_qc_tool {
    input:
        tuple val(meta), path(fastq)
    output:
        tuple val(meta), path("*.log"), emit: qc
    script:
    """
    echo "QC of $fastq" > qc_stats.log
    """
}

github-actions bot commented May 23, 2024

nf-core lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 02affeb

✅ 177 tests passed
❗ 21 tests had warnings

❗ Test warnings:

  • readme - README contains the placeholder zenodo.XXXXXXX. This should be replaced with the zenodo doi (after the first release).
  • pipeline_todos - TODO string in README.md: TODO nf-core:
  • pipeline_todos - TODO string in README.md: Include a figure that guides the user through the major workflow steps. Many nf-core
  • pipeline_todos - TODO string in README.md: Fill in short bullet-pointed list of the default steps in the pipeline
  • pipeline_todos - TODO string in README.md: Add citation for pipeline after first release. Uncomment lines below and update Zenodo doi and badge at the top of this file.
  • pipeline_todos - TODO string in README.md: Add bibliography of tools and data used in your pipeline
  • pipeline_todos - TODO string in nextflow.config: Specify your pipeline's command line flags
  • pipeline_todos - TODO string in main.nf: Remove this line if you don't need a FASTA file
  • pipeline_todos - TODO string in usage.md: Add documentation about anything specific to running your pipeline. For general topics, please point to (and add to) the main nf-core website.
  • pipeline_todos - TODO string in main.nf: Optionally add in-text citation tools to this list.
  • pipeline_todos - TODO string in main.nf: Optionally add bibliographic entries to this list.
  • pipeline_todos - TODO string in main.nf: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!
  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline
  • pipeline_todos - TODO string in ci.yml: You can customise CI pipeline run tests as required
  • pipeline_todos - TODO string in awsfulltest.yml: You can customise AWS full pipeline tests as required
  • pipeline_todos - TODO string in test.config: Specify the paths to your test data on nf-core/test-datasets
  • pipeline_todos - TODO string in test.config: Give any required params for the test so that command line flags are not needed
  • pipeline_todos - TODO string in test_full.config: Specify the paths to your full test data ( on nf-core/test-datasets or directly in repositories, e.g. SRA)
  • pipeline_todos - TODO string in test_full.config: Give any required params for the test so that command line flags are not needed
  • pipeline_todos - TODO string in base.config: Check the defaults for all processes
  • pipeline_todos - TODO string in base.config: Customise requirements for specific processes.

✅ Tests passed:

Run details

  • nf-core/tools version 2.14.1
  • Run at 2024-05-30 11:22:09

@kedhammar

I've tried to clean up the discussion thread and have pushed some additional commits to address simple issues. Requesting re-reviews.

@MatthiasZepper MatthiasZepper left a comment

I still have a few open questions, but I think those could be addressed in a subsequent refactor if desired. Thus, I suggest merging now and tackling them later.

TREATMENT_REP2,AEG588A5_S5_L003_R1_001.fastq.gz,
TREATMENT_REP3,AEG588A6_S6_L003_R1_001.fastq.gz,
TREATMENT_REP3,AEG588A6_S6_L004_R1_001.fastq.gz,
sample lane group fastq_1 fastq_2 rundir
Member

Do you think that the order of the columns is advisable like this? Intuitively, I would have put all categorical variables together at the end, so that additional columns can be added easily later, if required e.g. by other sequencing technologies.

Collaborator Author

I don't have a strong opinion on this, and I feel it deserves to be discussed in a new issue/PR. This PR was not meant to change the input format. This commit to usage.md just fixes the documentation so that it's up to date with what the format actually is.

CONTROL_REP1,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz
CONTROL_REP1,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz
```
run_dir
Member

Why did you replace the example filenames with this synthetic example? I think this may lead to confusion, because it may prompt people to tediously rename their files prior to a run. We should make clear that seqinspector takes the relevant information from the columns in the sample sheet and not suggest that the file names matter.

Also, judgemental adjectives like "simple" (or "difficult" etc.) should ideally be avoided in a README.

Since @mahesh-panchal requested we visualize the directory structure, I thought it would be easier to connect the dots to the example samplesheet if all the file names in the dir also contained the information shown in the samplesheet.

I don't necessarily get the impression we are suggesting the files need to follow a particular naming convention by showing an example that is as informative as possible, but I don't feel too strongly about it.

)

// Generate reports by group
multiqc_extra_files_per_group = ch_multiqc_files
Member

The purpose and construction of the extra_files channel(s) still remains elusive to me.

Firstly, it is constructed from the ch_multiqc_files channel, into which all the output files from the QC tools are mixed. This means it is a rather large channel that then needs to be filtered and mapped. If the only purpose is to get all group levels, I would prefer to start from ch_samplesheet, which should already comprise all relevant information.

Secondly, all the files in ch_multiqc_extra_files (as of now) are not specific to the generated MultiQC report. ch_workflow_summary, ch_multiqc_custom_methods_description and ch_collated_versions are all global. Thus, I fail to see why this channel needs to be constructed for every grouping level instead of being reused for all MultiQC processes.

Collaborator Author

Correct me if I'm wrong, but when a process or workflow (in this case MULTIQC) is run on two or more queue channels, it will try to zip them and run on each pair of values.

If you don't duplicate the MultiQC extra files to follow the same grouping as in the first channel, Nextflow will run the lane 1 files against the first MultiQC extra file, then the lane 2 files against the second MultiQC extra file, and so on.

This was very hard to get right, I'm all ears if you see a better way to do it :)

Collaborator Author

I realized I confused myself with my explanation (which is surely a sign that this should be simplified). Let me try again 😅

When the files are provided to MULTIQC_BY_LANE, lane_mqc_files.samples_per_lane looks like this: [list of samples for lane 1], [list of samples for lane 2], .... The extra MultiQC files are needed for each report and need to be included in each of these lists. I thought the best solution would be to map over these lists and append the extra files each time, but I could never get that to work.

If you find a better way to perform this operation, I'm all for it :)
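
For readers following along, the difference between the two behaviours can be modelled in a few lines of Python. This is only an illustration of the data shapes, not pipeline code: when a process receives two queue channels, Nextflow pairs their items in emission order, which `pair_in_order` imitates with `zip`, while `append_extras` shows the intended result where every per-lane list carries all extra files. The file names and both function names are hypothetical.

```python
# Hypothetical file names, for illustration only.
samples_per_lane = [
    ["sampleA_L001_fastqc.zip", "sampleB_L001_fastqc.zip"],  # lane 1
    ["sampleC_L002_fastqc.zip"],                             # lane 2
]
extra_files = [
    "workflow_summary_mqc.yaml",
    "methods_description_mqc.yaml",
    "versions.yml",
]


def pair_in_order(per_lane, extras):
    """Model two queue channels consumed together: items are paired one by one."""
    return [(files, extra) for files, extra in zip(per_lane, extras)]


def append_extras(per_lane, extras):
    """Model the intended behaviour: every per-lane list gets all extra files."""
    return [files + list(extras) for files in per_lane]


paired = pair_in_order(samples_per_lane, extra_files)
complete = append_extras(samples_per_lane, extra_files)

print(paired[0][1])      # workflow_summary_mqc.yaml
print(len(complete[1]))  # 4
```

In Nextflow terms, the second shape is roughly what one would expect from collecting the extra files into a single value and reusing it for every lane (for example via `collect()` or `combine`), though whether that fits this pipeline's channel layout is untested here.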

.map { meta, sample -> [ "[GROUP:${meta.group}]", meta, sample ] }
.groupTuple()
.tap { mqc_by_group }
.collectFile{
Member

Is it really required to construct a custom MultiQC config just to set the output file paths? I somehow think it should be possible to handle that via publishDir in modules.config. Or am I missing something?

Collaborator Author

The issue here is that if you run the MultiQC module once for each lane without specifying a different config each time, it will create files with the same name regardless of the lane number. Since the filename is all you have to work with when setting publishDir, it becomes very hard to sort them into different folders.
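
The name clash described here can be illustrated with a small Python model. It treats the publish directory as a dictionary keyed by file name, so a second report with the same name replaces the first; that overwrite behaviour is a simplifying assumption of the sketch, and `publish` and the `lane_<n>_` prefix are hypothetical. `multiqc_report.html` is MultiQC's default report name.

```python
DEFAULT_REPORT_NAME = "multiqc_report.html"


def publish(reports, name_for):
    """Place reports into one output folder keyed by file name.

    `reports` is a list of (lane, content) pairs and `name_for` maps a lane
    to a file name. A later report with the same name replaces the earlier one.
    """
    outdir = {}
    for lane, content in reports:
        outdir[name_for(lane)] = content
    return outdir


reports = [("1", "report for lane 1"), ("2", "report for lane 2")]

same_name = publish(reports, lambda lane: DEFAULT_REPORT_NAME)
per_lane = publish(reports, lambda lane: f"lane_{lane}_{DEFAULT_REPORT_NAME}")

print(sorted(same_name))  # ['multiqc_report.html']
print(sorted(per_lane))   # ['lane_1_multiqc_report.html', 'lane_2_multiqc_report.html']
```

With identical names there is nothing left in the file name to route each report to its own folder, which is the motivation given above for a per-lane config.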

First, prepare a samplesheet with your input data that looks as follows:

`samplesheet.csv`:

```csv
sample,fastq_1,fastq_2
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
sample,lane,group,fastq_1,fastq_2,rundir
Member

What about an extra "individual" field for when you have multiple samples from the same patient (thinking cancer samples, sarek style)?

Member

or is this what you mean by group?

Aratz commented May 30, 2024

I'll merge this now. Thank you all for your reviews and comments. There are still some open discussions that I think are worth addressing but are not critical to this feature; we can keep discussing them here and in subsequent PRs.

@Aratz Aratz merged commit 9cb1d68 into nf-core:dev May 30, 2024
4 checks passed