-
Notifications
You must be signed in to change notification settings - Fork 735
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Subworkflow Infrastructure #662
Conversation
We should also think about the |
148491d
to
4cd3649
Compare
After a brief look (overall looks good to me, keeping as similar structure as modules is good as then developers are familiar with the structure already), we should also consider naming guidelines here...
|
- bowtie2: | ||
description: | | ||
Bowtie 2 is an ultrafast and memory-efficient tool for aligning | ||
sequencing reads to long reference sequences. | ||
homepage: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml | ||
documentation: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml | ||
doi: 10.1038/nmeth.1923 | ||
- samtools: | ||
description: | | ||
SAMtools is a set of utilities for interacting with and post-processing | ||
short DNA sequence read alignments in the SAM, BAM and CRAM formats, written by Heng Li. | ||
These files are generated as output by short read aligners like BWA. | ||
homepage: http://www.htslib.org/ | ||
documentation: hhttp://www.htslib.org/doc/samtools.html | ||
doi: 10.1093/bioinformatics/btp352 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we eventually have a website rendering this information, this part could be pulled in from the corresponding modules, avoiding redundant information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love it! Not sure how to make that magic happen, though. I really wish yaml had native include statements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ewels is planning to do this (for modules at least): https://nfcore.slack.com/archives/CE7DN1U7M/p1628502439008800
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed - just need the module identifier. Maybe more metadata for this though: for example repo and git sha? Some discussion on #tools
about support for non nf-core repos.
I guess it should basically mimic the info saved about modules used in pipelines? And maybe pipelines then only need to record the subworkflows that they use and not the modules used by those subworkflows..?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think a modules
section instead of a tools
section to not repeat ourselves.
I think we're going to run into issues with testing using specific git sha's for the modules. Sure, end user and the installation can work, but how do we develop and maintain them?
I think like @grst using the specific git sha would be a cool feature. I think we should just make an issue of it and cross that bridge when we've figured out more pain points around this and iterate on it. It's not like this repo hasn't been completely restructured a few times already ;).
Breaking out of the comments - it's really interesting that you guys have a different take to me on how modules will interact with subworkflows 😅 Definitely worth hashing out more. Whilst we are making the opposite points, our arguments seem to be the same. I think that pinning the module hash in a subworkflow will be easier to maintain and automate than not doing so. For me a subworkflow is basically a pipeline. I want it to be fully reproducible and not break when one of its dependencies is updated. If I'm using a subworkflow in a pipeline then I don't really care what modules it uses. I want the subworkflow as a single unit. I don't want to have to update and modules for subworkflows, I just want to update the subworkflow itself (if it needs it). The only time I would then use modules in a pipeline is for the custom stuff that I can't pull in from a subworkflow. Does this make sense? Again automated CI I would expect to be simpler with pinned git shas. If we're not storing the module code in git (which I don't think we should) then the test will need to "install" the modules needed for the subworkflow, which is added complexity. However, the tests should be much more stable and it should be trivial (I think) to lint for every subworkflow using the latest version of the modules it uses. So no module PRs should be merged without also updating the pinned versions within the relevant subworkflows (and this update could also be automated pretty easily I think). Could you guys elaborate on which specifics would be more complex with pinned modules? Phil |
Good points raised indeed! The way I see it we should make sub-workflows as isolated as an entity as we have done with modules i.e. you can update one or more sub-workflows if required without impacting others. In order for this to work the only real solution is to have multiple installs of (potentially the same) module within nested sub-workflow directories e.g. using this sub-workflow as an example:
I think we need to remember that a sub-workflow can pull in other sub-workflows too like the example above, and hence each sub-workflow will have it's own module dependencies that will need to resolved recursively somehow when installing the parent sub-workflow. It would be nice to have pinned SHA hashes in I don't like the fact that you could potentially install exactly the same module multiple times in the same repo but I am struggling to think of a better way that would allow us to create self-enclosed sub-workflows. We definitely can't assume that we can update a module and not have to update any sub-workflows it is used in because e.g. adding an extra As @grst mentioned we will need to bash out some really good tooling for this sort of thing to work but I personally think it's the most flexible approach. Comments welcome :) |
On sub-dependencies, this seems related to a CS problem, so I'll try to explain how some options that I'm aware of in that field. Much of this will probably be familiar to many readers.
Coming from a computer-science perspective, this seems related to the diamond dependency problem. You have two major choices - allow multiple versions of a dependency to exist in a project, or force all to use the same version. It sounds like for nf-core, it's definitely desirable to allow multiple versions, because that means you don't need nearly as much synchronization between different codebases. NPM, a javascript package manager, historically stores sub-dependencies as nested-directories. Most of the time, this overhead is fine in practice. But because this can suck (duplication of identical code costs time, space, etc), it ships with an
becomes
, with the exception that different versions of packages would still be nested. (e.g: if libB and libD required incompatible versions of libC, libC would stay as two separate subdirectories). Another approach is to put each version of each dependency in a unique, predictable location, e.g: using the hash as a directory component.
However, users probably don't want to type However, both of these flattening strategies can come with additional cognitive overhead for the user, and might not be right for nf-core. Editing a dependency during development is usually easier with the more "obvious" nested structure, so that you're free to experiment without affecting other pieces that also use that dependency. If anyone is interested in a longer read, I'd recommend the article So you want to write a package manager. The core nf-core (hehe) team likely doesn't need to revisit this, but for anyone else interested, it's a good read. Specifically on this issue, it contains the recommendation:
|
@fwip Thanks for the detailed post! I was trying to avoid suggesting we write a package manager. I'm afraid the other solutions both sound like they're going to need a package manager.
@grst said it well. I think first and foremost, getting the subworkflows into a central repo, and getting them tested, and allowing a centralized collaboration on them is the first step. We can pin an issue, stating to use them at your own risk and to please upstream any fixes and report bugs but they may not be supported. Or keep them on a separate branch from master. I think once we've collected them all, and just using the modules that are in the repo, then we revisit all the tooling discussions. So I propose:
|
The subworkflow would just import the modules from their location in the modules directory, there is nothing to install and no added complexity.
As @hpatel pointed out already, we would basically update the same file at several locations within the same repository (it can be many locations, especially in the case of nested subworkflows). This will make lead to a lot more changed files per PR making them harder to review.
Once the subworkflow gets installed into a pipeline, I would indeed pin the modules. Always the latest version of modules (with which the subworkflow has been tested 🚀) would get pulled in. In any case, I like @emiller88's suggestion: |
Yeah I don't like the idea of keeping the source for the subworkflow module dependencies in the modules repo. As the subworkflows are never run directly from this repo I think that it's fine to exclude them though. Then they can be installed whenever the pipeline is run - either included in a pipeline or for testing. It's added complexity to fetch those files, sure. But not much - it's a predictable URL so only a few lines of code. |
And yeah I'm not saying that everything has to be done in this PR :) Fine to get a proof of concept up and running and go from there. Was just quite an interesting discussion about future plans - good to agree a strategy! |
That may be the solution. In the pipeline, there's a directory in |
0392ffa
to
4417019
Compare
Note that I added nextflow.configs to the subworkflows. I think this is a good way forward, we can specify things there such as params, and maybe get a little more verbose with them rather than trying to pass a set of a set for example we could just list those out. They can then be included in the module.config or in the main nextflow.config to avoid duplicating params for subworkflows or creating a middleman. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM for now!
Anything left to do that we should keep track of in a separate issue for the future?
The thing that would come to my mind is installation of subworkflows into a pipeline, but that would be for #tools.
Yep, I agree that should be under tools. Maybe make an issue there and link it back here? I think we should make a project to keep track of the subworkflows from the various workflows to keep track of what needs to be added. |
Okay created an issue in tools. Also created https://github.com/nf-core/modules/milestone/4 to track the subworkflows Closes #707 We might also want an issue template? |
539e715
to
1b4ee12
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty straightforward to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I think is a nice idea to incorporate a nextflow.config
for subworkflows and at some point instead of including it in the main nextflow.config
or in the module.config
, as discussed in your previous comment, it might be automatically included. At least this is what I tried to push on this nextflow issue.
pattern: '*.ebwt' | ||
# TODO | ||
output: | ||
- bam: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are many more outputs than this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, this is why you put the TODO
statement above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly! We hadn't come up with a standard for the meta.yml for the subworkflows yet, and I'm trying to not have this get caught in limbo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough 👍🏽
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks amazing @emiller88 ! Some minor comments but mainly if we are able to split up the tests for the sub-workflows?
Co-authored-by: Harshil Patel <[email protected]>
f437db5
to
d3611fc
Compare
aa1691b
to
05898f5
Compare
Beautiful 😍 |
@nf-core/modules-team
Night has fallen in Texas! A couple of trade-offs we can make here to avoid a TON of maintenance.
1. How to handle testing when underlying modules make changes
What we're trying to avoid:
For every module used by workflows, and allow the modules' definitions to be the source of truth.
I've gone with the second, because it's YAML all the way down anyway. And then when testing locally, you'll be pushed into testing the subworkflows if you change the module, automagically.
2. Dir for each subworkflow
Currently, the subworkflows just sit in
subworkflows/nf-core/align_bowtie2.nf
I think we should move these tosubworkflows/nf-core/align_bowtie2/main.nf
I've gone ahead and implemented this. I think they might need a similar structure to the modules, rather than all in one directory. This could also allow for custom
functions.nf
like ininput_check
3.
subworkflows/nf-core
vssubworkflows
This one I haven't made a choice on. Currently, all the subworkflows don't work because the file paths in them are, for example,
../../modules/nf-core/software/samtools/stats/main
but thatnf-core
isn't in the path. The double nested directories aren't my favorite, but I think they keep it consistent between this repo and the pipelines, so we don't have tosed
all theinclude
paths when installing. We'd have to move all the modules, which we should wait to do a second PR for, so we can verify the testing works in this one because moving everything is going to invoke #391.Things Left
meta.yml
for BAM_SORT_SAMTOOLSmeta.yml
for BAM_STATS_SAMTOOLS