Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add skip tools parameter for tool selection #68

Open
wants to merge 5 commits into
base: dev
Choose a base branch
from

Conversation

Aratz
Copy link
Collaborator

@Aratz Aratz commented Nov 22, 2024

Hi!

I've been working on tool selection lately and tried to implement the profile-based proposal mentioned in #32 but I've found it hard to:

  1. define a smooth UI. I find it a bit redundant to have to specify both a file describing all the available profiles and to have to specify which profile is used. "profile" is probably a little confusing too because it is to close to Nextflow config profiles. Even though a simple "skip_tools" argument is quite basic, it has the advantage of being immediately understandable and usable by anyone, and it case it gets too lengthy it can be specified in a params-file or even a nextflow config file.
  2. motivate why we need an alternative way to customize tools while this is already possible with nextflow configs. If we were to have this option in tool profiles too, we would need to make it clear what's the precedence between nextflow config and tool profile, i.e. in case both are defined, do we ignore one of them? or do we concatenate all ext.args together? Either way this will make debugging harder in many cases.

Hence I would really advocate for a simpler solution that makes better use of functionalities already provided by Nextflow. Tool selection and customization is something many other pipelines do and I think it would be much better to try to stick to the nf-core standard. The precedence between config files is well documented and this will make the user experience better for users who are already familiar with Nextflow and other nf-core pipelines.

This PR provides a very simple solution where the user provide a list of tools to be excluded. This can be provided either through command line arguments or through a config file. As specified in Nexflow docs, cli arguments will override values defined in config files. This is similar to what's implemented in nf-core/demultiplex/.

Essentially this still makes it possible to define custom profiles. For example, one can write a nanopore.config, specify which tools should be skipped, and even add extra arguments to some tools with withName:TOOL statements (as specified in the nf-core documentation: https://nf-co.re/docs/usage/configuration#customising-tool-arguments).

This doesn't mean we should completely abandon the original idea, we can always improve the current solution if we find it too limiting in the future.

Closes #32

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • Make sure your code lints (nf-core pipelines lint).
  • Ensure the test suite passes (nf-test test).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • CHANGELOG.md is updated.

@Aratz Aratz self-assigned this Nov 22, 2024
Copy link

github-actions bot commented Nov 22, 2024

nf-core pipelines lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 438c715

+| ✅ 191 tests passed       |+
#| ❔   1 tests were ignored |#
!| ❗  21 tests had warnings |!

❗ Test warnings:

  • readme - README contains the placeholder zenodo.XXXXXXX. This should be replaced with the zenodo doi (after the first release).
  • pipeline_todos - TODO string in main.nf: Remove this line if you don't need a FASTA file
  • pipeline_todos - TODO string in nextflow.config: Specify your pipeline's command line flags
  • pipeline_todos - TODO string in nextflow.config: Optionally, you can add a pipeline-specific nf-core config at https://github.com/nf-core/configs
  • pipeline_todos - TODO string in README.md: TODO nf-core:
  • pipeline_todos - TODO string in README.md: Include a figure that guides the user through the major workflow steps. Many nf-core
  • pipeline_todos - TODO string in README.md: Fill in short bullet-pointed list of the default steps in the pipeline
  • pipeline_todos - TODO string in README.md: Add citation for pipeline after first release. Uncomment lines below and update Zenodo doi and badge at the top of this file.
  • pipeline_todos - TODO string in README.md: Add bibliography of tools and data used in your pipeline
  • pipeline_todos - TODO string in usage.md: Add documentation about anything specific to running your pipeline. For general topics, please point to (and add to) the main nf-core website.
  • pipeline_todos - TODO string in main.nf: Optionally add in-text citation tools to this list.
  • pipeline_todos - TODO string in main.nf: Optionally add bibliographic entries to this list.
  • pipeline_todos - TODO string in main.nf: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!
  • pipeline_todos - TODO string in test_full.config: Specify the paths to your full test data ( on nf-core/test-datasets or directly in repositories, e.g. SRA)
  • pipeline_todos - TODO string in test_full.config: Give any required params for the test so that command line flags are not needed
  • pipeline_todos - TODO string in test.config: Specify the paths to your test data on nf-core/test-datasets
  • pipeline_todos - TODO string in test.config: Give any required params for the test so that command line flags are not needed
  • pipeline_todos - TODO string in base.config: Check the defaults for all processes
  • pipeline_todos - TODO string in base.config: Customise requirements for specific processes.
  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline
  • pipeline_todos - TODO string in awsfulltest.yml: You can customise AWS full pipeline tests as required

❔ Tests ignored:

  • files_unchanged - File ignored due to lint config: .github/CONTRIBUTING.md

✅ Tests passed:

Run details

  • nf-core/tools version 3.0.2
  • Run at 2024-11-22 13:22:57

@MatthiasZepper
Copy link
Member

motivate why we need an alternative way to customize tools while this is already possible with nextflow configs.

So you propose to include extensive configs defining multiple pipeline specific profiles with this pipeline? Could you please include an example of this in the PR or here as comment, then? I still struggle to wrap my head around that.

Tool selection and customization is something many other pipelines do and I think it would be much better to try to stick to the nf-core standard.

I am not aware that there is a standard. In contrast, I was baffled by the individual differences.

The precedence between config files is well documented and this will make the user experience better for users who are already familiar with Nextflow and other nf-core pipelines.

It is? I am aware about the docs for process selectors and for various config files, but find these priorities very hard to comprehend.

Ultimately, such top-down hierarchies do not help us anyway. We would need to patch additional labels to the modules, and then somehow handle withLabel for processes with multiple labels. What is the precedence in case of -profile singularity,uppmax,nanopore,extensive vs. -profile uppmax,singularity,extensive,nanopore?

@Aratz
Copy link
Collaborator Author

Aratz commented Nov 25, 2024

motivate why we need an alternative way to customize tools while this is already possible with nextflow configs.

So you propose to include extensive configs defining multiple pipeline specific profiles with this pipeline? Could you please include an example of this in the PR or here as comment, then? I still struggle to wrap my head around that.

If by profile you mean tool profile (instead of nextflow profile) then yes more or less. For each use case, one would need to write either a params-file or a nextflow config file (in case some tools need some specific parameters). This would be left to the user though. I think for v1 it is acceptable to just run everything that doesn't need extra db arguments for instance. These files would look like this:

params.json:

{
    "skip_tools": "seqfu_stats"
}

or, eg. nanopore.config:

params {
    skip_tools = 'seqfu_stats'
}

process {
    withName: FASTQC {
        ext.args = "--nano"
    }
}

Tool selection and customization is something many other pipelines do and I think it would be much better to try to stick to the nf-core standard.

I am not aware that there is a standard. In contrast, I was baffled by the individual differences.

Well there is a standard in how configuration settings are defined in a nextflow config file and in how tool parameters are defined there too.

The precedence between config files is well documented and this will make the user experience better for users who are already familiar with Nextflow and other nf-core pipelines.

It is? I am aware about the docs for process selectors and for various config files, but find these priorities very hard to comprehend.

Ultimately, such top-down hierarchies do not help us anyway. We would need to patch additional labels to the modules, and then somehow handle withLabel for processes with multiple labels. What is the precedence in case of -profile singularity,uppmax,nanopore,extensive vs. -profile uppmax,singularity,extensive,nanopore?

Well actually the configs would be provided with for instance -c extensive.config -c nanopore.config (again I think we should choose another term than profile to avoid the confusion with nextflow profiles), and in this case, as the docs states, each config file would be applied in turn, that is: extensive.config would be applied first, and then nanopore.config, i.e. the value of skip_tools in nanopore.config would override the one in extensive.config.

But at this point I don't think there is much need for being able to combine different configuration files anyways. Maybe this will change in the future, but until then and as long as we only have a dozen tools or so I find it perfectly acceptable to stick to nextflow configs.

Again, this PR is quite basic, but it does fulfill all our requirements given the number of tools we currently plan to have in v1.0.

@MatthiasZepper
Copy link
Member

If by profile you mean tool profile (instead of nextflow profile) then yes more or less. For each use case, one would need to write either a params-file or a nextflow config file (in case some tools need some specific parameters).

Well, that is exactly the complexity, I wanted to avoid by unifying all settings in a YAML, which is also a lot easier to customize.

Particularly when using Nextflow profiles as tool profiles, we should then also stick with the standard API -profile and specify them accordingly in the main nextflow.config and not via -c. But if you wish, I have no objections against calling them differently - with an inspector theme in mind, dossier or case could work?

My main concern with the Nextflow config profiles is the Danger box here:

When using the profiles feature in your config file, do NOT set attributes in the same scope both inside and outside a profiles context.
In the above example, the process.cpus attribute is not correctly applied because the process scope is also used in the foo and bar profiles.

I fear this is easy to get wrong in combination with a poorly written institutional config. On an upside, it is indeed straightforward to test for the correct config, with nextflow config --flat, so we could define nf-test to test automatically the compatibility with all institutional configs in the nf-core/configs.

This would be left to the user though.

On the contrary, that is one of the main selling points of this pipeline - the in-depth knowledge of a sequencing facility. My experience from rnaseq is, that the peculiarities of tool arguments must be handled by us developers, because most people expect that a pipeline has reasonable defaults to just run.

Take for example, the extra_trimgalore_args parameter of that pipeline. While people frequently post their parameter files, I have not seen a single one that specified the --2colour argument to TrimGalore in extra_trimgalore_args. However, that is required for correct quality trimming when data from NovaSeq and NextSeq instruments is processed, which is arguably the default by now. I reckon very few people sequence on a MiSeq anymore for RNA-seq - maybe in the future on Aviti again.

@Aratz
Copy link
Collaborator Author

Aratz commented Dec 2, 2024

I'm not sure I understand why would anyone need to define tool profiles within Nextflow profiles? As you said, this would add unnecessary complexity, while it's perfectly fine and much safer to specify this parameter within a config file and pass it to the pipeline with -c.

And if you think we should provide some default configs for some selected applications, sure, this feature doesn't prevent that either.

@MatthiasZepper
Copy link
Member

I'm not sure if I understand why would anyone need to define tool profiles within Nextflow profiles?

I do not know about anyone, but I know about us. We as pipeline developers need to ship the pipeline with a few default routes that can be used by changing a simple global parameter (my suggestion) or switching the config profiles (your suggestion as far as I understand)

Since you suggested to harness the default Nextflow configuration capabilities, I presume you want to use the ext.when directive to control if a particular tool is run, and the ext.args to adapt the behavior of the tool?

Foremost, I am not sure if both of these are future-proof, since I believe Seqera ponders about dropping support for them in future Nextflow versions, but if we use them nonetheless, we would end up with something like this to implement our Seqinspector cases/dossiers, correct?

profiles {
    extensive {
        params {
            some_contamination_screening_reference_default   = 's3://ngi-igenomes/seqinspector/default'
        }
        process {
            withLabel:contamination {
                    ext.when   = { TRUE }
                }
            withLabel:fastqfilechecks {
                    ext.when   = { TRUE }
            }
           withLabel:nanopore {
                    ext.when   = { FALSE }
            }
            withName: 'SOME_TOOL' {
                    ext.args   = { '-b -d' }
            }
    }}
    basic {
        process {
            withLabel:contamination {
                    ext.when   = { FALSE }
                }
            withLabel:fastqfilechecks {
                    ext.when   = { TRUE }
                }
            withLabel:nanopore {
                    ext.when   = { FALSE }
                }
            withName: 'SOME_TOOL' {
                    ext.args   = { '-a' }
            }
    }}
   nanopore {
        process {
            withLabel:nanopore {
                    ext.when   = { TRUE }
                }
            withName: 'SOME_NANOPOREQC_TOOL' {
                    ext.args   = { '--ont' }
            }
    }}
}

And that would ultimately enable a UX like:
nextflow run nf-core/seqinspector -profiles "uppmax,singularity,extensive".

That is nice. But how do we now, e.g. run the Nanopore specific tools for a Nanopore QC?

Is there a difference between
nextflow run nf-core/seqinspector -profiles "basic,nanopore"
nextflow run nf-core/seqinspector -profiles "nanopore,basic"

How is the withLabel / ext.when priority resolved in that case? I do not know, because afaik Nextflow's config documentation does not clarify this, and it matters, because either the tools run or not.

@Aratz
Copy link
Collaborator Author

Aratz commented Dec 6, 2024

I believe there are some misunderstandings both about the content and the scope of this PR:

First about the content, this PR only adds a new parameter, skip_tools, to disable tools when running the pipeline. This is similar both in the way it is used and in the way it is implemented to what is found in e.g. nf-core/demultiplex or nf-core/sarek. As with all nextflow parameters, it can be set in different ways, such as a cli argument, a params file, or a config file. Specific config files can thus be set in order to process specific kinds of data by disabling some tools and customizing some others (see the example I provided above). Most importantly: it does not offer to combine two configs, is not intended to be used with Nextflow profiles, and does not make use of ext.when in any way.

Second, about the scope, this PR implements what I believe would be the minimal requirements for a first release when it comes to tool selection. It is not supposed to provide the perfect user experience, just to provide a basic way to select tools. In other words, it is only supposed to be the first step towards tool selection and to serve as a baseline to compare it to future solutions.

Because the original idea is slightly more complicated (in particular when it comes to the possibility to combine profiles and how to achieve this), I believe it needs more thoughts before it is implemented, which is what I tried to motivate in the PR's description. That being said, I think being able to define and combine tool profiles is a great feature to have and that we should absolutely do it. I just think it will take some time before we implement something viable and that we should not condition v1 to having this fully implemented.

@MatthiasZepper
Copy link
Member

MatthiasZepper commented Dec 8, 2024

Having worked myself into the ground with a previous profile implementation attempt due to my inability and rather diffuse Nextflow requirements, I am very sympathetic towards breaking it down to more manageable tasks.

Had you been framing your PR in this way, I would have no objections and happily focused on code review only. But even after rereading this whole discussion multiple times, the misunderstanding on my behalf cannot be remedied. Perhaps I truly overestimate the scope of this PR, but why is almost your whole PR description a plea to drop the profile/case idea entirely, then? I read redundant to have, hard to motivate why, advocate etc. and my impression persists: Your code contributions are tied to a fundamental diversion from the plan that we have agreed on in the development meeting.

I can admittedly not follow, if you first claim that we can achieve all requirements purely with Nextflow configs, but later refuse to comment on my mock-up with anything more specific than "it is not intended to work like that". Well, but how else?

The mock-up was essentially my attempt to redirect this high-level discussion, that would probably be much more suitable for a developmental meeting, back to specific code questions. But you seem to be hell-bent on incidentally ascertaining in this PR discussion what the minimal requirements for a first release are, and that nothing more than this is needed in terms of tool selection? (edited: This summary is indeed sharper than the original statement and therefore falsely exaggerated.)

I cannot recall any discussion in the developmental meeting regarding zeroing in onto a release. Personally, I perceive this pipeline still ramshackle and not anywhere near the state of a worthy nf-core pipeline. There are so many more mature pipelines than this still in dev; I would not mind a 0.1 release, but a 1.0.0? I suggest bringing this on the agenda of a developmental meeting, rather than casually deciding this in a PR discussion without involving the others?

In either way, I have come to the conclusion that I can't ask you to implement a feature that you seemingly consider dispensable.

@matrulda matrulda self-requested a review December 9, 2024 07:29
Copy link

@matrulda matrulda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! This looks great! @Aratz has left some minor comments for you.

This implementation is definitely simpler than what we previously discussed. I agree that skip_tools is more understandable and easier to grasp than adding another layer of configuration.

I understand the desire for the pipeline to come with default routes that are ready to use. However, as @Aratz mentioned, we could add configurations that users can utilize if they so desire.

I see the challenge in trying to combine different "modes" (avoiding the word profile here), like lean and illumina. In my opinion, we don't necessarily need modes like lean/extensive; the most important thing is that the tools are compatible with the data. Starting with the assumption that one config will suffice, such as the nanopore.config mentioned earlier, makes total sense to me.

I believe this implementation meets our needs, but of course, we can discuss it further at a development meeting before merging.

Additionally, I would really appreciate it if we could all maintain a friendly tone. :)

docs/usage.md Show resolved Hide resolved
nextflow_schema.json Show resolved Hide resolved
@MatthiasZepper
Copy link
Member

Thanks, @matrulda, for chiming in. I agree, that a fresh perspective is helpful to resolve this conflict.

I believe this implementation meets our needs, but of course, we can discuss it further at a development meeting before merging.

From my side, I am fine with merging it prior to further discussions. My objections are not rooted in the code itself, but in the dismissal of the whole concept of modes / routes, which I believe should happen in a developmental meeting and not in a PR discussion.

Personally, I still think we need to have them, but am the first to admit that it makes things very complicated. I wasted dozens of hours on that, and ultimately failed.

Additionally, I would really appreciate it if we could all maintain a friendly tone. :)

Do you have the impression that this was not the case?

As far as I am concerned, I do indeed feel strongly about this topic, given my unfruitful previous work on that subject and my desire for perfectionism, but also carefully worded my replies to emphasize assumptions and to ensure the subjectivity is clear. I also took note that Adrien did the same, so from my site, I never felt personally attacked, just confused and estranged by our very different perceptions.

But if I have misunderstood or exaggerated some statements, I would like to apologize. It was not my intention to discuss anything outside the question of how we deal with profiles/modes/routes, or to imply ad hominem traits.

@matrulda
Copy link

matrulda commented Dec 10, 2024

Thanks, @matrulda, for chiming in. I agree, that a fresh perspective is helpful to resolve this conflict.

Glad to hear that :)

From my side, I am fine with merging it prior to further discussions. My objections are not rooted in the code itself, but in the dismissal of the whole concept of modes / routes, which I believe should happen in a developmental meeting and not in a PR discussion.

I understand that, but I think @Aratz was very clear that he did not dismiss that concept, but that it for now could be put aside.
Even though we might not need a meeting for this, it would be nice to get a Yay or Nay for some other people in the group. What say you @alneberg @FranBonath @kedhammar ?

Do you have the impression that this was not the case?

Yeah, I interpreted your answers as a bit hostile. I'm happy to hear that was not your intention. We all communicate in different ways and nuances don't always get through in text.

@alneberg
Copy link
Member

I agree with @matrulda that this is good enough to have something. I was wondering if we would like to have the subsampling involved in the skip_tools parameter as well? Currently, the subsampling is regulated using the --sample_size parameter. Are we happy with using only that? I'm not sure what my opinion is, but maybe it's cleaner if sample_size is used more as a standard ext.args and whether to subsample or not is inflicted using the skip_tools parameter?

@kedhammar
Copy link

On a scale from yay to nay, I'm also leaning more towards yay. Establishing a simple and intuitive baseline functionality, as a first step. Not disregarding any possible future implementations.

If we arrive at a point where we have a dozen platform-specific tools that are mutually exclusive, it might be helpful to add some example tool configs to the assets dir, or similar. I guess we'd have to create them anyway, to be used for functional testing of the different kinds of data in our test profile.

I would like this branch to pull from dev and address the newly added tools, prior to a final review and merge.

@Aratz
Copy link
Collaborator Author

Aratz commented Dec 13, 2024

@matrulda @alneberg @kedhammar Thanks for the feedback, I'll first fix the template update, then rebase this PR and address your comments

@alneberg unfortunately, the SEQTK_SAMPLE module expects the samplesize to be given as a parameter (see https://github.com/nf-core/modules/blob/08108058ea36a63f141c25c4e75f9f872a5b2296/modules/nf-core/seqtk/sample/main.nf#L11), but I agree it would be clearer if it could only be skipped using skip_tools rather than setting the samplesize to 0. I'll fix that 👍

@alneberg
Copy link
Member

@alneberg unfortunately, the SEQTK_SAMPLE module expects the samplesize to be given as a parameter (see https://github.com/nf-core/modules/blob/08108058ea36a63f141c25c4e75f9f872a5b2296/modules/nf-core/seqtk/sample/main.nf#L11), but I agree it would be clearer if it could only be skipped using skip_tools rather than setting the samplesize to 0. I'll fix that 👍

Ah, I see. Yeah then we'll keep that as a parameter, and it's not really in the scope of this PR anyway. 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants