Add FastQ-Screen database multiplexing #53

Merged: kedhammar merged 46 commits into dev from fastqscreen on Dec 9, 2024

@edmundmiller edmundmiller commented Oct 29, 2024

PR checklist

  • This comment contains a description of changes (with reason).
  • Add fastqscreen module
  • Limit scope of nf-test CI
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • [ ] If necessary, also make a PR on the nf-core/seqinspector branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nf-test test main.nf.test -profile test,docker).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@edmundmiller edmundmiller added this to the Essential functionality milestone Oct 29, 2024
@edmundmiller edmundmiller changed the base branch from master to dev October 29, 2024 13:43

github-actions bot commented Oct 29, 2024

nf-core pipelines lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 7b6c18b

+| ✅ 192 tests passed       |+
#| ❔   1 tests were ignored |#
!| ❗  21 tests had warnings |!

❗ Test warnings:

  • readme - README contains the placeholder zenodo.XXXXXXX. This should be replaced with the zenodo doi (after the first release).
  • pipeline_todos - TODO string in main.nf: Remove this line if you don't need a FASTA file
  • pipeline_todos - TODO string in nextflow.config: Specify your pipeline's command line flags
  • pipeline_todos - TODO string in nextflow.config: Optionally, you can add a pipeline-specific nf-core config at https://github.com/nf-core/configs
  • pipeline_todos - TODO string in README.md: TODO nf-core:
  • pipeline_todos - TODO string in README.md: Include a figure that guides the user through the major workflow steps. Many nf-core
  • pipeline_todos - TODO string in README.md: Fill in short bullet-pointed list of the default steps in the pipeline
  • pipeline_todos - TODO string in README.md: Add citation for pipeline after first release. Uncomment lines below and update Zenodo doi and badge at the top of this file.
  • pipeline_todos - TODO string in README.md: Add bibliography of tools and data used in your pipeline
  • pipeline_todos - TODO string in usage.md: Add documentation about anything specific to running your pipeline. For general topics, please point to (and add to) the main nf-core website.
  • pipeline_todos - TODO string in main.nf: Optionally add in-text citation tools to this list.
  • pipeline_todos - TODO string in main.nf: Optionally add bibliographic entries to this list.
  • pipeline_todos - TODO string in main.nf: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!
  • pipeline_todos - TODO string in test_full.config: Specify the paths to your full test data ( on nf-core/test-datasets or directly in repositories, e.g. SRA)
  • pipeline_todos - TODO string in test_full.config: Give any required params for the test so that command line flags are not needed
  • pipeline_todos - TODO string in test.config: Specify the paths to your test data on nf-core/test-datasets
  • pipeline_todos - TODO string in test.config: Give any required params for the test so that command line flags are not needed
  • pipeline_todos - TODO string in base.config: Check the defaults for all processes
  • pipeline_todos - TODO string in base.config: Customise requirements for specific processes.
  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline
  • pipeline_todos - TODO string in awsfulltest.yml: You can customise AWS full pipeline tests as required

❔ Tests ignored:

  • files_unchanged - File ignored due to lint config: .github/CONTRIBUTING.md

✅ Tests passed:

Run details

  • nf-core/tools version 3.0.2
  • Run at 2024-12-05 15:51:50

@FranBonath
Member

I am currently working on this to update the docs and get the missing param in.

@nf-core nf-core deleted a comment from github-actions bot Oct 30, 2024
@kedhammar kedhammar added the help wanted Extra attention is needed label Nov 7, 2024
@kedhammar

Rough summary of current status after some investigating by me and @FranBonath

  • Running the pipeline with FastQ Screen for multiple samples and references in the test profile causes all sample-reference combinations to be run in the work directory, but only a one-per-sample subset of sample-reference combinations to be added to the publishdir (because the processes send their identically named output files to the same publishdir).
  • Consequently, MultiQC does not pull in all the relevant information.

We think we want the output files of the process to contain the names of both the sample and reference used to generate them, and make sure they all end up in the publishdir.
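
To illustrate the collision described above, here is a minimal Python sketch. It is not the pipeline's code: the sample names, reference names and file-name patterns are made up, and the dictionary only stands in for a publishdir where a later file with the same name replaces an earlier one.

```python
from itertools import product

samples = ["sampleA", "sampleB"]          # hypothetical sample names
references = ["Ecoli", "Human", "PhiX"]   # hypothetical reference names


def publish(name_for):
    """Simulate a publishdir: a file with an existing name overwrites the old one."""
    publishdir = {}
    for sample, ref in product(samples, references):
        publishdir[name_for(sample, ref)] = (sample, ref)
    return publishdir


# Named after the sample only: every reference writes to the same file name.
by_sample = publish(lambda s, r: f"{s}_screen.txt")

# Named after sample and reference: every combination keeps its own file.
by_pair = publish(lambda s, r: f"{s}_{r}_screen.txt")

print(len(by_sample), "files survive with sample-only names")
print(len(by_pair), "files survive with sample+reference names")
```

With sample-only names, only one file per sample is left, which matches the one-per-sample subset seen in the publishdir.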

NOTE
Personally, I think having one job for each sample-reference combination for hundreds of samples and dozens of references is gonna make us end up with thousands of super tiny SLURM jobs, work dirs, outdirs, etc. which might be excessive? My 2 cents is to consider decreasing the parallelization, maybe parallelize by sample or reference but not both.
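
As a rough back-of-the-envelope check of the concern above, the following Python sketch counts jobs under the three parallelization strategies mentioned. The figures 300 and 24 are illustrative assumptions standing in for "hundreds of samples and dozens of references", not numbers from this PR.

```python
def job_counts(n_samples, n_references):
    """Number of jobs submitted under each parallelization strategy."""
    return {
        "per sample-reference pair": n_samples * n_references,
        "per sample": n_samples,
        "per reference": n_references,
    }


# Illustrative sizes only.
counts = job_counts(n_samples=300, n_references=24)
for strategy, n in counts.items():
    print(f"{strategy}: {n} jobs")
```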

@edmundmiller
Author

  • Running the pipeline with FastQ Screen for multiple samples and references in the test profile causes all sample-reference combinations to be run in the work directory, but only a one-per-sample subset of sample-reference combinations to be added to the publishdir (because the processes send their identically named output files to the same publishdir).

Okay I figured out a way around this. Works pretty well with MultiQC. Probably going to want to use https://seqera.io/blog/multiqc-grouped-samples/

We think we want the output files of the process to contain the names of both the sample and reference used to generate them, and make sure they all end up in the publishdir.

Also got this for free, but IMO I think publishing them should just be skipped, if you're just going to use the results inside of MultiQC.

NOTE Personally, I think having one job for each sample-reference combination for hundreds of samples and dozens of references is gonna make us end up with thousands of super tiny SLURM jobs, work dirs, outdirs, etc. which might be excessive? My 2 cents is to consider decreasing the parallelization, maybe parallelize by sample or reference but not both.

May I suggest an array job? I think that would make your HPC admins even happier. https://www.nextflow.io/docs/latest/reference/process.html

@edmundmiller
Author

Ah okay looking at the expected fastqscreen data now https://github.com/MultiQC/test-data/blob/main/data/modules/fastq_screen/v0.14.0/scRNAseq_HISAT_example1_screen.txt

It's probably easier to handle all of the databases in one run per sample.

So two options:

  1. Combine the TSVs after seqinspector runs
  2. Have a separate "create seqinspector config" process or some Nextflow and magically pull in all the databases (sounding more complicated actually)
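
Option 1 could look roughly like the Python sketch below, which merges per-reference reports for one sample into a single table. It assumes a simplified report layout (lines starting with `#` are comments, one tab-separated column header starting with `Genome`, one row per genome, and summary footer lines starting with `%`); the function name and the example data are hypothetical, and the real files have more columns.

```python
def merge_screen_reports(reports):
    """Merge per-reference report texts for one sample into a single table.

    Assumed (simplified) layout: '#' comment lines, one tab-separated column
    header starting with 'Genome', one row per genome, '%' summary footers.
    """
    header = None
    rows = []
    for text in reports:
        for line in text.splitlines():
            if not line.strip() or line.startswith(("#", "%")):
                continue  # skip blanks, version comments and summary footers
            if line.startswith("Genome\t"):
                header = header or line  # keep the first column header only
                continue
            rows.append(line)
    if header is None:
        raise ValueError("no column header found")
    return "\n".join([header, *rows]) + "\n"


# Made-up reports for one sample screened against two references separately.
ecoli = (
    "#Fastq_screen version: 0.14.0\n"
    "Genome\t#Reads_processed\t#Unmapped\n"
    "Ecoli\t1000\t900\n"
    "\n"
    "%Hit_no_genomes: 90.00\n"
)
human = (
    "#Fastq_screen version: 0.14.0\n"
    "Genome\t#Reads_processed\t#Unmapped\n"
    "Human\t1000\t400\n"
    "\n"
    "%Hit_no_genomes: 40.00\n"
)

merged = merge_screen_reports([ecoli, human])
print(merged, end="")
```

The sketch drops the summary footer because a value such as the share of reads hitting no genome cannot be recovered from separate single-reference runs. That is one reason screening all databases in one run per sample may be simpler.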

@edmundmiller
Author

Okay I'm stumped on both, see the commits for my attempts if anyone has time for this: 46d1bfd, d021ed0

@kedhammar

kedhammar commented Nov 22, 2024

@edmundmiller thanks for taking the time to wrestle with this!

I've also spent a fair bit of time on it and am unfortunately equally stumped.

I prefer the solution in which we put the tool to its intended use-case of mapping a single sample to multiple references simultaneously since we then get the appropriate outputs for MultiQC for free. And an appropriate degree of parallelization imo.

I have a functional example that runs and illustrates the kind of solution I'd like:

process TEST {
    input:
    tuple val(db_name), path(db_path, name: "db_path*"), val(aligner)

    script:
    """
    echo "DATABASE ${db_name} ./${db_path}/genome ${aligner}" >> fastq_screen.conf
    """
}

workflow {
    ch_db = Channel
        .fromList([
            ["Ecoli", "s3://ngi-igenomes/igenomes/Escherichia_coli_K_12_MG1655/NCBI/2001-10-15/Sequence/Bowtie2Index/", "bowtie2"],
        ])
        .collect()
        .view()

    TEST(ch_db)
}

but I can't make it work for more than one reference.

I was advised by Phil to make a post on the Seqera community
https://community.seqera.io/t/resolving-variable-number-of-s3-paths-in-a-list-of-lists-fed-to-a-process/1481

@kedhammar

kedhammar commented Dec 5, 2024

As of commit 4b278bf, the pipeline can be run in test profile using FastQ Screen as intended, at least for me on GitPod 👀

We use a .csv listing the names, paths and aligners of our references and feed it into the process to build the FastQ Screen config within the context of the work directory, using the mounted input files.
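
As a rough illustration of that step, the Python sketch below turns a references CSV into `DATABASE` lines for a FastQ Screen config. It is not the module's actual implementation: the column names (`name`, `dir`, `basename`, `aligner`), the function name and the example rows are assumptions, and the `./<dir>/<basename>` path form simply mirrors the `echo` line in the earlier example.

```python
import csv
import io


def build_fastq_screen_conf(csv_text):
    """Turn a references CSV into FastQ Screen 'DATABASE' config lines.

    Column names (name, dir, basename, aligner) are hypothetical.
    """
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Path to the index basename, relative to the work directory.
        index = f"./{row['dir'].rstrip('/')}/{row['basename']}"
        lines.append("\t".join(["DATABASE", row["name"], index, row["aligner"]]))
    return "\n".join(lines) + "\n"


# Made-up references; 'dir' stands for the staged (mounted) index directory.
references_csv = """name,dir,basename,aligner
Ecoli,Ecoli_Bowtie2Index,genome,bowtie2
Human,Human_Bowtie2Index,genome,bowtie2
"""

conf = build_fastq_screen_conf(references_csv)
print(conf, end="")
```

Because every reference becomes one line in a single config, each sample needs only one FastQ Screen run regardless of how many references are listed.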

Still need to

  • remove hardcoded basename of mapping refs/indexes, add as .csv field along with mounted parentdir -> 63ebd67
  • update the documentation -> 7396d4c
  • figure out whether to keep the validation of the references .csv, and if so, how it needs to be tweaked -> 2608088
  • figure out whether we want to use a diff to patch the module or, if no one else is using the module, whether we should submit it as an update -> 586ac9b
  • update test snapshots? -> e3a8862

I had to change the way the versions.yaml was written; I got weird errors from the heredoc approach that I couldn't get to the bottom of.

@kedhammar

Looks like the nf-test CI is failing, but I'm fairly confident it's unrelated to this PR. I've asked in the nf-test channel on Slack now.

In the meantime, this branch may finally be ready for review 😎

@edmundmiller @Aratz

@kedhammar

CI now patched 🚀

Member

@maxulysse maxulysse left a comment

LGTM.

That's a nice collaborative PR.

I'd really recommend using the nft-utils plugin for the pipeline-level tests.

@kedhammar kedhammar merged commit 83539bd into dev Dec 9, 2024
11 checks passed
@kedhammar kedhammar deleted the fastqscreen branch December 9, 2024 12:29
Labels
enhancement New feature or request help wanted Extra attention is needed
5 participants