Subworkflow Infrastructure #662

edmundmiller · 2021-08-10T01:14:32Z

@nf-core/modules-team

Night has fallen in Texas! A couple of trade-offs we can make here to avoid a TON of maintenance.

1. How to handle testing when underlying modules make changes

Use YAML anchors in the pytest-modules.yaml(Haven't tested this, and I'm not sure whether it would work)
Add the other tags under tags:

What we're trying to avoid:

subworkflows/align_bowtie2:
  - modules/nf-core/software/bowtie2/align/main
  - modules/nf-core/software/samtools/sort/main
  - modules/nf-core/software/samtools/index/main
  - modules/nf-core/software/samtools/stats/main
  - modules/nf-core/software/samtools/idxstats/main
  - modules/nf-core/software/samtools/flagstat/main
  - tests/subworkflows/nf-core/align_bowtie2/**
  - subworkflows/nf-core/align_bowtie2.nf
  - subworkflows/nf-core/bam_sort_samtools.nf
  - subworkflows/nf-core/bam_stats_samtools.nf

For every module used by workflows, and allow the modules' definitions to be the source of truth.

I've gone with the second, because it's YAML all the way down anyway. And then when testing locally, you'll be pushed into testing the subworkflows if you change the module, automagically.

2. Dir for each subworkflow

Currently, the subworkflows just sit in subworkflows/nf-core/align_bowtie2.nf I think we should move these to subworkflows/nf-core/align_bowtie2/main.nf

I've gone ahead and implemented this. I think they might need a similar structure to the modules, rather than all in one directory. This could also allow for custom functions.nf like in input_check

3. `subworkflows/nf-core` vs `subworkflows`

This one I haven't made a choice on. Currently, all the subworkflows don't work because the file paths in them are, for example, ../../modules/nf-core/software/samtools/stats/main but that nf-core isn't in the path. The double nested directories aren't my favorite, but I think they keep it consistent between this repo and the pipelines, so we don't have to sed all the include paths when installing. We'd have to move all the modules, which we should wait to do a second PR for, so we can verify the testing works in this one because moving everything is going to invoke #391.

Things Left

edmundmiller · 2021-08-10T01:55:26Z

We should also think about the nf-core subworkflow install workflow. How will we handle subworkflows that depend on other subworkflows?

jfy133 · 2021-08-10T06:48:37Z

After a brief look (overall looks good to me, keeping as similar structure as modules is good as then developers are familiar with the structure already), we should also consider naming guidelines here...

bam_sort_samtools includes the stats subworkflow, but that isn't clear from the name. It might be overkill, particularly if we get to complex subworkflows, but should we recommend summarising each stage of the subworkflow in the name somehow?

grst · 2021-08-10T07:36:48Z

subworkflows/nf-core/align_bowtie2/meta.yml

+    - bowtie2:
+          description: |
+              Bowtie 2 is an ultrafast and memory-efficient tool for aligning
+              sequencing reads to long reference sequences.
+          homepage: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
+          documentation: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml
+          doi: 10.1038/nmeth.1923
+    - samtools:
+        description: |
+            SAMtools is a set of utilities for interacting with and post-processing
+            short DNA sequence read alignments in the SAM, BAM and CRAM formats, written by Heng Li.
+            These files are generated as output by short read aligners like BWA.
+        homepage: http://www.htslib.org/
+        documentation: hhttp://www.htslib.org/doc/samtools.html
+        doi: 10.1093/bioinformatics/btp352


If we eventually have a website rendering this information, this part could be pulled in from the corresponding modules, avoiding redundant information.

I love it! Not sure how to make that magic happen, though. I really wish yaml had native include statements.

@ewels is planning to do this (for modules at least): https://nfcore.slack.com/archives/CE7DN1U7M/p1628502439008800

Agreed - just need the module identifier. Maybe more metadata for this though: for example repo and git sha? Some discussion on #tools about support for non nf-core repos.

I guess it should basically mimic the info saved about modules used in pipelines? And maybe pipelines then only need to record the subworkflows that they use and not the modules used by those subworkflows..?

Yeah, I think a modules section instead of a tools section to not repeat ourselves.

I think we're going to run into issues with testing using specific git sha's for the modules. Sure, end user and the installation can work, but how do we develop and maintain them?

I think like @grst using the specific git sha would be a cool feature. I think we should just make an issue of it and cross that bridge when we've figured out more pain points around this and iterate on it. It's not like this repo hasn't been completely restructured a few times already ;).

subworkflows/nf-core/align_bowtie2/main.nf

tests/subworkflows/nf-core/align_bowtie2/test.yml

ewels · 2021-08-12T21:19:34Z

Breaking out of the comments - it's really interesting that you guys have a different take to me on how modules will interact with subworkflows 😅 Definitely worth hashing out more. Whilst we are making the opposite points, our arguments seem to be the same.

I think that pinning the module hash in a subworkflow will be easier to maintain and automate than not doing so. For me a subworkflow is basically a pipeline. I want it to be fully reproducible and not break when one of its dependencies is updated. If I'm using a subworkflow in a pipeline then I don't really care what modules it uses. I want the subworkflow as a single unit. I don't want to have to update and modules for subworkflows, I just want to update the subworkflow itself (if it needs it). The only time I would then use modules in a pipeline is for the custom stuff that I can't pull in from a subworkflow. Does this make sense?

Again automated CI I would expect to be simpler with pinned git shas. If we're not storing the module code in git (which I don't think we should) then the test will need to "install" the modules needed for the subworkflow, which is added complexity. However, the tests should be much more stable and it should be trivial (I think) to lint for every subworkflow using the latest version of the modules it uses. So no module PRs should be merged without also updating the pinned versions within the relevant subworkflows (and this update could also be automated pretty easily I think).

Could you guys elaborate on which specifics would be more complex with pinned modules?

Phil

drpatelh · 2021-08-12T21:57:36Z

Good points raised indeed! The way I see it we should make sub-workflows as isolated as an entity as we have done with modules i.e. you can update one or more sub-workflows if required without impacting others. In order for this to work the only real solution is to have multiple installs of (potentially the same) module within nested sub-workflow directories e.g. using this sub-workflow as an example:

subworkflows/
    nf-core/
        bam_sort_samtools/              
            bam_stats_samtools.nf                             ## imported sub-workflow by main.nf sub-workflow
            nf-core/modules/samtools/flagstat/main.nf         ## imported module by bam_stats_samtools.nf
            nf-core/modules/samtools/idxstats/main.nf         ## imported module by bam_stats_samtools.nf
            nf-core/modules/samtools/stats/main.nf            ## imported module by bam_stats_samtools.nf

            nf-core/modules/samtools/sort/main.nf             ## imported module by main.nf sub-workflow
            nf-core/modules/samtools/index/main.nf            ## imported module by main.nf sub-workflow
            main.nf                                           ## main.nf sub-workflow
            subworkflow.json                                  ## SHA JSON like modules.json

I think we need to remember that a sub-workflow can pull in other sub-workflows too like the example above, and hence each sub-workflow will have it's own module dependencies that will need to resolved recursively somehow when installing the parent sub-workflow. It would be nice to have pinned SHA hashes in subworkflow.json for both the modules as well as the sub-workflows for complete transparency and reproducibility.

I don't like the fact that you could potentially install exactly the same module multiple times in the same repo but I am struggling to think of a better way that would allow us to create self-enclosed sub-workflows. We definitely can't assume that we can update a module and not have to update any sub-workflows it is used in because e.g. adding an extra input channel will break said sub-workflow - everything needs to be tested and updated together.

As @grst mentioned we will need to bash out some really good tooling for this sort of thing to work but I personally think it's the most flexible approach. Comments welcome :)

fwip · 2021-08-12T22:48:31Z

On sub-dependencies, this seems related to a CS problem, so I'll try to explain how some options that I'm aware of in that field. Much of this will probably be familiar to many readers.

I don't like the fact that you could potentially install exactly the same module multiple times in the same repo but I am struggling to think of a better way that would allow us to create self-enclosed sub-workflows. We definitely can't assume that we can update a module and not have to update any sub-workflows it is used in because e.g. adding an extra input channel will break said sub-workflow - everything needs to be tested and updated together.

Coming from a computer-science perspective, this seems related to the diamond dependency problem. You have two major choices - allow multiple versions of a dependency to exist in a project, or force all to use the same version. It sounds like for nf-core, it's definitely desirable to allow multiple versions, because that means you don't need nearly as much synchronization between different codebases.

NPM, a javascript package manager, historically stores sub-dependencies as nested-directories. Most of the time, this overhead is fine in practice. But because this can suck (duplication of identical code costs time, space, etc), it ships with an npm dedupe command that examines the dependencies and "flattens" the tree where possible. e.g:

workflows/
  libA/
    libB/
      libC/
  libD/
    libC/

becomes

workflows/
  libA/
  libB/
  libC/
  libD/

, with the exception that different versions of packages would still be nested. (e.g: if libB and libD required incompatible versions of libC, libC would stay as two separate subdirectories).

Another approach is to put each version of each dependency in a unique, predictable location, e.g: using the hash as a directory component.

workflows/libA@ab8302dlf/
workflows/libB@c3302da94/
workflows/libC@00f9a9382/
workflows/libC@4938282fa/

However, users probably don't want to type import { libB } from "../workflows/libB@c3302da94/", so this approach usually comes with a smarter "linker" or other system to help map these to human-meaningful names. The package manager pnpm uses an approach like this. (It creates hard-links on the filesystem in the "normal" nested structure to a 'global' store).

However, both of these flattening strategies can come with additional cognitive overhead for the user, and might not be right for nf-core. Editing a dependency during development is usually easier with the more "obvious" nested structure, so that you're free to experiment without affecting other pieces that also use that dependency.

If anyone is interested in a longer read, I'd recommend the article So you want to write a package manager. The core nf-core (hehe) team likely doesn't need to revisit this, but for anyone else interested, it's a good read. Specifically on this issue, it contains the recommendation:

If your language permits it, and the type system won’t choke on it, and the global state risks are negligible, and you’re cool with some binary/process footprint bloating and (probably negligible) runtime performance costs, AND bogging down static analysis and transpilation tooling is OK, then duplication is the shared deps solution you’re looking for. That’s a lot of conditions, but it may still be preferable to reconciliation strategies, as most require user intervention — colloquially known as DEPENDENCY HELL — and all involve potentially uncomfortable compromises in the logic itself.

edmundmiller · 2021-08-13T01:28:04Z

@fwip Thanks for the detailed post! I was trying to avoid suggesting we write a package manager. I'm afraid the other solutions both sound like they're going to need a package manager.

There could be rare cases where a subworkflow is not compatible with the latest version of a module. But I'd risk it until we have such a case and then figure out a solution.

@grst said it well. I think first and foremost, getting the subworkflows into a central repo, and getting them tested, and allowing a centralized collaboration on them is the first step. We can pin an issue, stating to use them at your own risk and to please upstream any fixes and report bugs but they may not be supported. Or keep them on a separate branch from master.

I think once we've collected them all, and just using the modules that are in the repo, then we revisit all the tooling discussions.

So I propose:

Wrap up this PR
Collect all the subworkflows, and get them working.
Come up with tooling to install them, and maintain them.

grst · 2021-08-13T08:06:46Z

then the test will need to "install" the modules needed for the subworkflow, which is added complexity.

The subworkflow would just import the modules from their location in the modules directory, there is nothing to install and no added complexity.

So no module PRs should be merged without also updating the pinned versions within the relevant subworkflows

As @hpatel pointed out already, we would basically update the same file at several locations within the same repository (it can be many locations, especially in the case of nested subworkflows). This will make lead to a lot more changed files per PR making them harder to review.

I don't want to have to update and modules for subworkflows, I just want to update the subworkflow itself (if it needs it).

Once the subworkflow gets installed into a pipeline, I would indeed pin the modules. Always the latest version of modules (with which the subworkflow has been tested 🚀) would get pulled in.

In any case, I like @emiller88's suggestion:
Build whatever prototype to test subworkflows he thinks is easiest (he's implementing the tests, after all) and once we got a couple of them in (maybe 10-15ish) we can revisit this discussion based on everything we learned on the way.

ewels · 2021-08-14T13:33:20Z

This will make lead to a lot more changed files per PR making them harder to review.

Yeah I don't like the idea of keeping the source for the subworkflow module dependencies in the modules repo. As the subworkflows are never run directly from this repo I think that it's fine to exclude them though. Then they can be installed whenever the pipeline is run - either included in a pipeline or for testing.

It's added complexity to fetch those files, sure. But not much - it's a predictable URL so only a few lines of code.

ewels · 2021-08-14T13:44:00Z

And yeah I'm not saying that everything has to be done in this PR :) Fine to get a proof of concept up and running and go from there. Was just quite an interesting discussion about future plans - good to agree a strategy!

edmundmiller · 2021-08-14T18:53:16Z

Yeah I don't like the idea of keeping the source for the subworkflow module dependencies in the modules repo. As the subworkflows are never run directly from this repo I think that it's fine to exclude them though. Then they can be installed whenever the pipeline is run - either included in a pipeline or for testing.

That may be the solution. In the pipeline, there's a directory in subworkflows/modules that keeps the versions of the modules needed by the subworkflows.

edmundmiller · 2021-09-10T19:24:52Z

Note that I added nextflow.configs to the subworkflows. I think this is a good way forward, we can specify things there such as params, and maybe get a little more verbose with them rather than trying to pass a set of a set for example we could just list those out. They can then be included in the module.config or in the main nextflow.config to avoid duplicating params for subworkflows or creating a middleman.

grst

LGTM for now!

Anything left to do that we should keep track of in a separate issue for the future?
The thing that would come to my mind is installation of subworkflows into a pipeline, but that would be for #tools.

edmundmiller · 2021-09-15T19:40:49Z

Yep, I agree that should be under tools. Maybe make an issue there and link it back here?

I think we should make a project to keep track of the subworkflows from the various workflows to keep track of what needs to be added.

edmundmiller · 2021-09-15T19:54:02Z

Okay created an issue in tools.

Also created https://github.com/nf-core/modules/milestone/4 to track the subworkflows

Closes #707
Closes #708
Closes #709

We might also want an issue template?

grst

Lgtm

jfy133

Looks pretty straightforward to me

JoseEspinosa

LGTM! I think is a nice idea to incorporate a nextflow.config for subworkflows and at some point instead of including it in the main nextflow.config or in the module.config, as discussed in your previous comment, it might be automatically included. At least this is what I tried to push on this nextflow issue.

subworkflows/nf-core/align_bowtie2/main.nf

subworkflows/nf-core/align_bowtie2/meta.yml

drpatelh · 2021-10-08T08:34:17Z

subworkflows/nf-core/align_bowtie2/meta.yml

+      pattern: '*.ebwt'
+# TODO
+output:
+  - bam:


There are many more outputs than this?

Ah, this is why you put the TODO statement above?

Exactly! We hadn't come up with a standard for the meta.yml for the subworkflows yet, and I'm trying to not have this get caught in limbo.

Fair enough 👍🏽

subworkflows/nf-core/bam_sort_samtools/meta.yml

subworkflows/nf-core/bam_stats_samtools/meta.yml

tests/config/pytest_modules.yml

drpatelh

Looks amazing @emiller88 ! Some minor comments but mainly if we are able to split up the tests for the sub-workflows?

Co-authored-by: Harshil Patel <[email protected]>

This reverts commit d3611fc.

drpatelh · 2021-10-08T15:02:31Z

Beautiful 😍

edmundmiller added help wanted Extra attention is needed tests Related to automated tests labels Aug 10, 2021

edmundmiller requested review from ewels, chris-cheshire, grst, jfy133 and drpatelh August 10, 2021 01:14

edmundmiller self-assigned this Aug 10, 2021

edmundmiller force-pushed the subworkflow-ci branch from 148491d to 4cd3649 Compare August 10, 2021 01:58

grst reviewed Aug 10, 2021

View reviewed changes

edmundmiller force-pushed the subworkflow-ci branch 4 times, most recently from 0392ffa to 4417019 Compare September 10, 2021 19:14

drpatelh added the Ready for Review label Sep 10, 2021

grst approved these changes Sep 15, 2021

View reviewed changes

edmundmiller mentioned this pull request Sep 15, 2021

Subworkflow tools nf-core/tools#1272

Closed

edmundmiller added 5 commits October 7, 2021 14:53

chore(align_bowtie2): Remove duplicate tag

53fc924

style: Format yamls

d02097a

test(subworkflows): Only check versions for modules

e7c9c63

chore: Update subworkflows to match rnaseq dev

9e84fb6

fix(subworkflows): Update paths

1b4ee12

edmundmiller force-pushed the subworkflow-ci branch from 539e715 to 1b4ee12 Compare October 8, 2021 02:08

fix(bam_sort_samtools): Fix sort parameters for testing

dd43643

edmundmiller requested a review from grst October 8, 2021 02:28

grst approved these changes Oct 8, 2021

View reviewed changes

jfy133 approved these changes Oct 8, 2021

View reviewed changes

JoseEspinosa approved these changes Oct 8, 2021

View reviewed changes