Enable tool selection for seqinspector #23
Conversation
Hi Matthias, Phil linked me here because I'm interested to see how classes are being used in the lib directory. I am struggling to see why all of this code is necessary. To begin with, I'm not sure why a single param with a list of tools is not sufficient, maybe you can fill me in there. But even if I go with the three params you added, couldn't you boil all of this down to a single helper function that does a few set operations?
That's probably a 10-line function that you could put next to the entry workflow, no need for classes or the lib directory.
Thanks, your feedback is much appreciated. Having no background in Java or Groovy and only having tweaked existing pipelines so far, it is indeed a daunting task for me to conceptualize such a central piece of code in a way that I hope will be future-proof for greatly expanding the pipeline later.
My main consideration here was that Seqinspector is not so much a classical pipeline (in the sense of running a well-defined sequence of processing steps) as it is a tray of tools to select from. Most tools will run in parallel and independently of each other. However, not all tools that we will offer are suitable for every supported sequencing platform, or they need platform-specific configuration. A QC for Nanopore data is entirely different from one for Illumina, etc. The same goes for the assay types: you will not want to use the same QC criteria for an RNA-seq and a HiC, for example. As a pipeline developer, one could of course put all the responsibility on the user and say that they need to have the expertise to know which tools they have to run and how to configure them for their assay with a custom config. But from a convenience perspective, it seemed much nicer to be able to say: "I have sequencing data from a given platform and assay type, run a suitable QC." For that reason, I wanted the pipeline to support profiles as an entity that can hold pre-selections made by us developers.
Since the number of possible combinations of platforms and assay types is large, profiles need to be combinable, most obviously with OR. But to selectively turn tools off, you will also need AND (and the inclusive AND). The XOR was an operation that I implemented mostly for technical reasons, after noticing that non-common keys got eliminated from the maps by my initial AND/OR implementations. To prevent dropping entries from the maps, I felt that first doing an XOR and then an AND/OR of two profiles and concatenating the results should become a common pattern in the pipeline code. However, I later modified my AND/OR to retain non-common entries directly, since I did not see a use case for eliminating keys. (Whether we should expose any of this functionality to the end user, like I did in this draft, is something we need to discuss in our next developer meeting.) With regard to the implementation, my preference for a key-value map with binary values over Sets is weak. The hope was that this would nudge developers to restrict the evaluation of profiles to a single part of the pipeline and only access the results of the profile combinations downstream, instead of littering Set operations throughout the pipeline code. I hope this prevents some bugs and makes testing easier.
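To make the intended semantics concrete, here is a minimal sketch (the helper name and tool names are invented here, this is not the PR's implementation) of an OR over two binary maps that retains non-common keys instead of dropping them:

```groovy
// Hypothetical sketch: OR-combine two binary tool maps while retaining
// non-common keys, so that no entries get dropped from either map
def orCombine(Map<String, Boolean> a, Map<String, Boolean> b) {
    // start from the union of all keys, defaulting missing entries to false
    (a.keySet() + b.keySet()).collectEntries { key ->
        [(key): a.getOrDefault(key, false) || b.getOrDefault(key, false)]
    }
}

assert orCombine([FASTQC: true, MULTIQC: false], [MULTIQC: true, NANOPLOT: false]) ==
       [FASTQC: true, MULTIQC: true, NANOPLOT: false]
```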
Which is exactly what I implemented, no?

```groovy
// Convert the specified custom tools into a profile
def custom_tools = Utilities.buildToolTracker(tool_include_list, tool_exclude_list)

// Apply the custom tools to the existing tool profile: orOperation to activate the include_tools
seqinspector_tools = seqinspector_tools.orOperation(custom_tools)

// Apply the custom tools to the existing tool profile: iAndOperation to deactivate the exclude_tools
seqinspector_tools = seqinspector_tools.iAndOperation(custom_tools)
```

While I still think that most users will want to stick to using profiles, I already foresaw that there would nonetheless be a need to support a manual selection of tools.
Maybe with some advanced Nextflow/Groovy wizardry, but nothing that I am capable of.
I do not really see where your aversion to custom classes comes from. In Python, they are a good way to keep code clean and well-structured and to avoid code duplication, so I assumed that would also apply to Java/Groovy and thus Nextflow. To see the reason for and potential of introducing the custom classes, one should probably rather look at Oncoanalyser, since it is an already mature pipeline. I have extensively looked at other nf-core pipelines to see how they approached the tool selection problem, and was in awe of the elegance of the Oncoanalyser approach because of its custom classes. From Oncoanalyser, I drew a lot of inspiration and actually copied much code verbatim. A huge perk is the extensive validation that it performs beyond what is possible with the parameter schema. Contrast that with the number of bugs in RNAseq that arose due to incomplete updates, because somebody just forgot to update something in one part of the pipeline, and you know why I would like to keep it tidy from the start. Upon reading the Oncoanalyser code, you will also notice that it checks e.g. that the matching reference genome is provided for the tool selection, etc. That is functionality we will need for Seqinspector as well, and that we can flexibly add to "our" custom classes later. Lastly, there is one more thing that I see as a necessity for Seqinspector and that has no precedent in Oncoanalyser: the configuration of individual tools will have to depend on the selected profiles. In summary: I think custom classes are nice to keep the main pipeline code concise and tidy, and they will be ideal for the functionality we currently need or will soon have to implement. RNAseq for example doesn't use any, but has several (formerly local and now central) modules for input validation and reference preparation that collectively also comprise hundreds of lines of code. It is just less obvious, because it is more of an organically grown codebase distributed over many files, rather than one centralized class definition that defines much of the functionality upfront.
Thanks for the extra context. I'd like to take some time to understand the tool selection logic in these two pipelines as a case study, if you'll entertain me. Especially since I was the one pushing against the lib directory. First some notes about classes, and then we'll get to your code. I agree that classes are quite useful, but classes in their most general form are overkill for Nextflow, which is not trying to be a general-purpose object-oriented language like Java or Groovy or even Python. This hasn't mattered in the past because Nextflow delegates everything to Groovy, so all Groovy syntax is supported. But now that we are formalizing the language and deciding which parts of Groovy to keep or discard, here is where I have landed regarding classes.
As for your code, the main thing I'm wondering about is this
I have to admit, I would do this via a YAML file, to make a dictionary to query. It would be explicit to the user which tools are in a given profile (key), and it would allow the user to add keys to the YAML to customise it. You can then validate it against the JSON schema too, to make sure the values are valid for any given profile.
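As a minimal sketch of that idea (profile and tool names invented for the example; the schema validation step is not shown), such a YAML could be loaded into a dictionary with Groovy's YamlSlurper:

```groovy
import groovy.yaml.YamlSlurper

// Hypothetical profile definitions, as they might look in a profiles YAML file
def profilesYaml = '''
illumina_rnaseq:
  - FASTQC
  - MULTIQC
nanopore:
  - NANOPLOT
  - MULTIQC
'''

// load into a dictionary keyed by profile name
def toolProfiles = new YamlSlurper().parseText(profilesYaml)
assert toolProfiles['nanopore'].contains('NANOPLOT')
```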
Sure. I appreciate your input, since you are the Nextflow luminary.
I was already avoiding any Groovy code in the config files. I can't quite follow you regarding the new config parser. But yes, we will later have to modify arguments of tools depending on the selected profile.
Thanks a lot for your comprehensive input! It helps a lot to understand the background of your decisions and what should be avoided. I just bundled fields and methods into a class, since I am used to that coming from Python, where exactly that is commonplace (look at tools or MultiQC...).
Perfectly fine with me. Looking at the pipeline template, it seems to me that I need to

```groovy
include { completionEmail       } from '../../nf-core/utils_nfcore_pipeline'
include { completionSummary     } from '../../nf-core/utils_nfcore_pipeline'
include { dashedLine            } from '../../nf-core/utils_nfcore_pipeline'
include { nfCoreLogo            } from '../../nf-core/utils_nfcore_pipeline'
include { imNotification        } from '../../nf-core/utils_nfcore_pipeline'
include { UTILS_NFCORE_PIPELINE } from '../../nf-core/utils_nfcore_pipeline'
include { workflowCitation      } from '../../nf-core/utils_nfcore_pipeline'
```
(Also see below). Fair, I am for now redefining the
Sure! I indeed want to have the option to record a But some of the default tools and two of the contamination screening tools are unsuited for Nanopore data and that is what our user wishes to QC check in that case. So either I have a dedicated
Maybe "guarding" is too strong a term, or immutability is too strong. I just meant that I suggest evaluating the profiles once, e.g. in the initialisation subworkflow. I would mainly like to avoid redoing complex evaluations for each and every module that we run, like Sarek does it. I see a benefit of doing:

```groovy
if (seqinspector_tools['FASTQC']) {
    // run module
}
```

over

```groovy
if ((params.save_mapped && !params.save_output_as_bam) ||
    ((params.skip_tools && params.skip_tools.split(',').contains('markduplicates')) &&
    !(params.tools && params.tools.split(',').contains('sentieon_dedup')))) {
    // run module (in Sarek)
}
```

I see how such a complex evaluation may be necessary, but it is easy to forget a bracket or a negation somewhere, and suddenly a module runs differently than intended. If I centralize the evaluation, I can for example just check for a single key.
If you have time on Friday, it would be great if you could join the meeting to elaborate! But in brief: I had not considered YAML yet, but it sounds good to me. How we specify the profiles is something we can still decide; in particular for user-provided profiles it seems like a good idea. But internally, you will probably still have a map to hold that information later? Or do you want to repeatedly read in the YAML files during a pipeline run and re-validate them against the schema as well? I also think something like

```groovy
static Map<String, ToolTracker> ToolProfiles = [
    ALL : Utilities.buildToolTracker(Constants.Tool.values().toList(), []), // generate a tool tracker with all tools set to true
    NONE: Utilities.buildToolTracker([], Constants.Tool.values().toList()), // generate a tool tracker with all tools set to false
]
```

could prove helpful, because those profiles do not need to be updated once new modules are added.
The new config parser still allows all of the existing config syntax.
I'm with you there. It should be possible to replace all of this conditional logic with a single upfront calculation, as you are trying to do. Regarding the includes, you would have to include each function that you intend to use. But I thought you would only need these functions once at the beginning of the pipeline? In any case, you can also do a compound include:

```groovy
include {
    completionEmail ; completionSummary ; dashedLine
} from '../../nf-core/utils_nfcore_pipeline'
```

But I wouldn't worry about it for now anyway, since either approach will work. A module would just be better for sharing in the future.
I think I understand. So a profile can contain both true values and false values to denote (soft) requirements and (hard) incompatibilities. And the iAND means "enable the tools explicitly enabled by the other profile, and disable any tools explicitly disabled by it", where explicit disables override explicit enables. Does that sound right? This is pretty interesting! I knew I would find some treasure if I dug deep enough. Here's how I think it could be modeled with sets:
Here's that 10-line function I promised:

```groovy
class ToolProfile {
    Set<Tool> enable
    Set<Tool> disable
}

enum Tool {
    FASTQC,
    // ...
}

// TODO: pre-defined tool profiles

def getEnabledTools(profileNames, includes, excludes, toolProfiles) {
    // get the list of profiles for each profile name
    def profiles = profileNames.collect { toolProfiles[it] }
    // get the union of all enable sets
    def allEnabled = profiles.collectMany { it.enable } as Set
    // get the union of all disable sets
    def allDisabled = profiles.collectMany { it.disable } as Set
    // note: the order of operations here communicates the precedence of each setting
    return allEnabled - allDisabled + includes - excludes
}
```

I'm still essentially using a record type to model the profile; you could also use a tuple or map, but I agree the custom class is clearer. But now the tool selection is just set operations, no need for the custom boolean operations, and hopefully the intent is now clearer. What do you think? Am I missing anything?
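For illustration, a hypothetical invocation of that sketch could look as follows; the profile names and their contents are invented for the example and are not part of the PR:

```groovy
// hypothetical built-in profiles (invented for this example)
def toolProfiles = [
    'default' : new ToolProfile(enable: [Tool.FASTQC] as Set, disable: [] as Set),
    'nanopore': new ToolProfile(enable: [] as Set, disable: [Tool.FASTQC] as Set),
]

// the nanopore profile disables FASTQC, but the manual include takes precedence
def enabled = getEnabledTools(['default', 'nanopore'], [Tool.FASTQC] as Set, [] as Set, toolProfiles)
assert enabled == [Tool.FASTQC] as Set
```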
To Mahesh's point, you could define the built-in profiles in a YAML file and load them into a map as you have now. It is just a matter of whether you'd rather have the definitions sit in code or in YAML. If this tool selection logic ever becomes a module, the enums and built-in profiles would clearly have to be external and passed in as arguments to something like `getEnabledTools`.
That is nifty and good to know. Yes, most functions will only be needed for the initialization of the pipeline. Perhaps there are some in the completion handler.
Almost.
Incompatibilities are indeed a prime reason for disabling a tool, but of course not the only one. As of now, I do not make any assumptions about the softness / hardness of an operation. But, as I already wrote above: "Whether we should expose any of this functionality to the end user, like I did in this draft, is something we need to discuss in our next developer meeting." Probably, it would indeed suffice to perform one final iAND.
Thanks a lot! I like the simplicity of your approach! Working with a single Set seemed cumbersome to me, but a class that holds two of them should indeed cover 95% of the use cases, and probably 200% of what we will ever actually implement in the pipeline. I will think about it and discuss it with the others, but in either case get rid of the mixed classes that bundle fields and methods. I was not aware of how much additional complexity that imposes on Nextflow.
Well, hopefully this sketch gives you an idea of how you might use existing data structures like sets to model whatever constraints you need. Even the priority resolution is clearly communicated by the order of these operations: `allEnabled - allDisabled + includes - excludes`. So the disables override the enables, the manual selection overrides the profile selection, etc. I'm curious to see where you might take this idea, especially with the other constraints you mentioned like sequence type, etc. I would argue that the bundled code + data approach is not only complicated to implement, but it also makes your job harder, because it encourages you down a path where it's much easier to get tangled up. I think this is partly why many languages are moving towards the data-driven approach of compound data types + standalone functions (e.g. records in Java, dataclasses in Python). On the Nextflow side, aside from the better support for enum and record types, I think we need to better educate users on how to use the standard library to full effect, rather than simply delegating to the Java/Groovy docs. This has been a good use case for something that could become a tutorial on domain modeling with basic data structures.
Closed in favor of a different approach that uses profiles in YAML format and nf-schema for validation.
This PR proposes a tool selection mechanism for Seqinspector. There are several nf-core pipelines which implement parameter-based tool selection with varying degrees of complexity.

Sarek for example uses Groovy code directly in the config and applies `.split()` and `.contains()` repeatedly to check whether a particular tool should run. Unfortunately, this approach is not future-proof, because putting Groovy code in configuration files is deprecated. In addition, the `ext.when` directive is also being phased out. On top of that, this notation only allows selecting or skipping single tools at a time. For a pipeline like Seqinspector, support for profiles seemed desirable. With profiles, several related tools can be quickly turned on or off, e.g. depending on the sequencing platform or assay type.
Using Sets would have been an option, as they can be intersected or combined, but ultimately a central reference with binary values seemed more straightforward.
Therefore, I have copied the approach of (and some utility functions from) Oncoanalyser to select the tools that should run from parameters. On the technical level, I have opted for a slightly different data structure, though. While Oncoanalyser has a Processes class that returns the result of a Set comparison as run stages, I have implemented a binary map that can be queried with the tool name and returns a binary value:
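For illustration, such a binary map might look like this (the tool names are placeholders chosen for the example, not the exact structure from this PR):

```groovy
// illustrative binary tool map: query by tool name, get a boolean back
def seqinspector_tools = [
    FASTQC  : true,   // tool is enabled and will run
    NANOPLOT: false,  // tool is disabled
]

if (seqinspector_tools['FASTQC']) {
    // run the FastQC module
}
```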
To generate the final `seqinspector_tools` map, profiles can be applied and combined according to a flexible parameter. For example, (hypothetical) run specifications could look like `default OR contamination_screening` to apply the default setting plus the tools from contamination_screening, or `default AND disable_read_quality` to apply a negative profile.

For now, all profiles that can be used with Seqinspector are hard-coded constants, but we could introduce a parameter to load e.g. a CSV file as a profile relatively easily.
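A rough sketch of how such a CSV profile could be loaded (the file format and the parameter name are invented here for illustration):

```groovy
// Hypothetical: read a user-provided CSV with lines like "FASTQC,true" into a binary map
def csvProfile = file(params.custom_profile).readLines().collectEntries { line ->
    def (tool, state) = line.split(',')
    [(tool): state.toBoolean()]
}
```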
This is a draft PR to gather first feedback on whether you like the overall direction.
PR checklist
- Make sure your code lints (`nf-core lint`).
- Ensure the test suite passes (`nf-test test main.nf.test -profile test,docker`).
- Check for unexpected warnings in debug mode (`nextflow run . -profile debug,test,docker --outdir <OUTDIR>`).
- `docs/usage.md` is updated.
- `docs/output.md` is updated.
- `CHANGELOG.md` is updated.
- `README.md` is updated (including new tool citations and authors/contributors).