Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Subworkflow Infrastructure #662

Merged
merged 45 commits into from
Oct 8, 2021
Merged

Subworkflow Infrastructure #662

merged 45 commits into from
Oct 8, 2021

Conversation

edmundmiller
Copy link
Contributor

@edmundmiller edmundmiller commented Aug 10, 2021

@nf-core/modules-team

Night has fallen in Texas! A couple of trade-offs we can make here to avoid a TON of maintenance.

1. How to handle testing when underlying modules make changes

  1. Use YAML anchors in the pytest-modules.yaml(Haven't tested this, and I'm not sure whether it would work)
  2. Add the other tags under tags:

What we're trying to avoid:

subworkflows/align_bowtie2:
  - modules/nf-core/software/bowtie2/align/main
  - modules/nf-core/software/samtools/sort/main
  - modules/nf-core/software/samtools/index/main
  - modules/nf-core/software/samtools/stats/main
  - modules/nf-core/software/samtools/idxstats/main
  - modules/nf-core/software/samtools/flagstat/main
  - tests/subworkflows/nf-core/align_bowtie2/**
  - subworkflows/nf-core/align_bowtie2.nf
  - subworkflows/nf-core/bam_sort_samtools.nf
  - subworkflows/nf-core/bam_stats_samtools.nf

For every module used by workflows, and allow the modules' definitions to be the source of truth.

I've gone with the second, because it's YAML all the way down anyway. And then when testing locally, you'll be pushed into testing the subworkflows if you change the module, automagically.

2. Dir for each subworkflow

Currently, the subworkflows just sit in subworkflows/nf-core/align_bowtie2.nf I think we should move these to subworkflows/nf-core/align_bowtie2/main.nf

I've gone ahead and implemented this. I think they might need a similar structure to the modules, rather than all in one directory. This could also allow for custom functions.nf like in input_check

3. subworkflows/nf-core vs subworkflows

This one I haven't made a choice on. Currently, all the subworkflows don't work because the file paths in them are, for example, ../../modules/nf-core/software/samtools/stats/main but that nf-core isn't in the path. The double nested directories aren't my favorite, but I think they keep it consistent between this repo and the pipelines, so we don't have to sed all the include paths when installing. We'd have to move all the modules, which we should wait to do a second PR for, so we can verify the testing works in this one because moving everything is going to invoke #391.

Things Left

  • meta.yml for BAM_SORT_SAMTOOLS
  • meta.yml for BAM_STATS_SAMTOOLS
  • Decide on 3
  • Fix test file output
  • Test BAM_SORT_SAMTOOLS
  • Test BAM_STATS_SAMTOOLS

@edmundmiller edmundmiller added help wanted Extra attention is needed tests Related to automated tests labels Aug 10, 2021
@edmundmiller edmundmiller self-assigned this Aug 10, 2021
@edmundmiller
Copy link
Contributor Author

We should also think about the nf-core subworkflow install workflow. How will we handle subworkflows that depend on other subworkflows?

@jfy133
Copy link
Member

jfy133 commented Aug 10, 2021

After a brief look (overall looks good to me, keeping as similar structure as modules is good as then developers are familiar with the structure already), we should also consider naming guidelines here...

bam_sort_samtools includes the stats subworkflow, but that isn't clear from the name. It might be overkill, particularly if we get to complex subworkflows, but should we recommend summarising each stage of the subworkflow in the name somehow?

Comment on lines 9 to 23
- bowtie2:
description: |
Bowtie 2 is an ultrafast and memory-efficient tool for aligning
sequencing reads to long reference sequences.
homepage: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
documentation: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml
doi: 10.1038/nmeth.1923
- samtools:
description: |
SAMtools is a set of utilities for interacting with and post-processing
short DNA sequence read alignments in the SAM, BAM and CRAM formats, written by Heng Li.
These files are generated as output by short read aligners like BWA.
homepage: http://www.htslib.org/
documentation: hhttp://www.htslib.org/doc/samtools.html
doi: 10.1093/bioinformatics/btp352
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we eventually have a website rendering this information, this part could be pulled in from the corresponding modules, avoiding redundant information.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love it! Not sure how to make that magic happen, though. I really wish yaml had native include statements.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ewels is planning to do this (for modules at least): https://nfcore.slack.com/archives/CE7DN1U7M/p1628502439008800

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed - just need the module identifier. Maybe more metadata for this though: for example repo and git sha? Some discussion on #tools about support for non nf-core repos.

I guess it should basically mimic the info saved about modules used in pipelines? And maybe pipelines then only need to record the subworkflows that they use and not the modules used by those subworkflows..?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think a modules section instead of a tools section to not repeat ourselves.

I think we're going to run into issues with testing using specific git sha's for the modules. Sure, end user and the installation can work, but how do we develop and maintain them?

I think like @grst using the specific git sha would be a cool feature. I think we should just make an issue of it and cross that bridge when we've figured out more pain points around this and iterate on it. It's not like this repo hasn't been completely restructured a few times already ;).

subworkflows/nf-core/align_bowtie2/main.nf Show resolved Hide resolved
@ewels
Copy link
Member

ewels commented Aug 12, 2021

Breaking out of the comments - it's really interesting that you guys have a different take to me on how modules will interact with subworkflows 😅 Definitely worth hashing out more. Whilst we are making the opposite points, our arguments seem to be the same.

I think that pinning the module hash in a subworkflow will be easier to maintain and automate than not doing so. For me a subworkflow is basically a pipeline. I want it to be fully reproducible and not break when one of its dependencies is updated. If I'm using a subworkflow in a pipeline then I don't really care what modules it uses. I want the subworkflow as a single unit. I don't want to have to update and modules for subworkflows, I just want to update the subworkflow itself (if it needs it). The only time I would then use modules in a pipeline is for the custom stuff that I can't pull in from a subworkflow. Does this make sense?

Again automated CI I would expect to be simpler with pinned git shas. If we're not storing the module code in git (which I don't think we should) then the test will need to "install" the modules needed for the subworkflow, which is added complexity. However, the tests should be much more stable and it should be trivial (I think) to lint for every subworkflow using the latest version of the modules it uses. So no module PRs should be merged without also updating the pinned versions within the relevant subworkflows (and this update could also be automated pretty easily I think).

Could you guys elaborate on which specifics would be more complex with pinned modules?

Phil

@drpatelh
Copy link
Member

drpatelh commented Aug 12, 2021

Good points raised indeed! The way I see it we should make sub-workflows as isolated as an entity as we have done with modules i.e. you can update one or more sub-workflows if required without impacting others. In order for this to work the only real solution is to have multiple installs of (potentially the same) module within nested sub-workflow directories e.g. using this sub-workflow as an example:

subworkflows/
    nf-core/
        bam_sort_samtools/              
            bam_stats_samtools.nf                             ## imported sub-workflow by main.nf sub-workflow
            nf-core/modules/samtools/flagstat/main.nf         ## imported module by bam_stats_samtools.nf
            nf-core/modules/samtools/idxstats/main.nf         ## imported module by bam_stats_samtools.nf
            nf-core/modules/samtools/stats/main.nf            ## imported module by bam_stats_samtools.nf

            nf-core/modules/samtools/sort/main.nf             ## imported module by main.nf sub-workflow
            nf-core/modules/samtools/index/main.nf            ## imported module by main.nf sub-workflow
            main.nf                                           ## main.nf sub-workflow
            subworkflow.json                                  ## SHA JSON like modules.json

I think we need to remember that a sub-workflow can pull in other sub-workflows too like the example above, and hence each sub-workflow will have it's own module dependencies that will need to resolved recursively somehow when installing the parent sub-workflow. It would be nice to have pinned SHA hashes in subworkflow.json for both the modules as well as the sub-workflows for complete transparency and reproducibility.

I don't like the fact that you could potentially install exactly the same module multiple times in the same repo but I am struggling to think of a better way that would allow us to create self-enclosed sub-workflows. We definitely can't assume that we can update a module and not have to update any sub-workflows it is used in because e.g. adding an extra input channel will break said sub-workflow - everything needs to be tested and updated together.

As @grst mentioned we will need to bash out some really good tooling for this sort of thing to work but I personally think it's the most flexible approach. Comments welcome :)

@fwip
Copy link

fwip commented Aug 12, 2021

On sub-dependencies, this seems related to a CS problem, so I'll try to explain how some options that I'm aware of in that field. Much of this will probably be familiar to many readers.

I don't like the fact that you could potentially install exactly the same module multiple times in the same repo but I am struggling to think of a better way that would allow us to create self-enclosed sub-workflows. We definitely can't assume that we can update a module and not have to update any sub-workflows it is used in because e.g. adding an extra input channel will break said sub-workflow - everything needs to be tested and updated together.

Coming from a computer-science perspective, this seems related to the diamond dependency problem. You have two major choices - allow multiple versions of a dependency to exist in a project, or force all to use the same version. It sounds like for nf-core, it's definitely desirable to allow multiple versions, because that means you don't need nearly as much synchronization between different codebases.

NPM, a javascript package manager, historically stores sub-dependencies as nested-directories. Most of the time, this overhead is fine in practice. But because this can suck (duplication of identical code costs time, space, etc), it ships with an npm dedupe command that examines the dependencies and "flattens" the tree where possible. e.g:

workflows/
  libA/
    libB/
      libC/
  libD/
    libC/

becomes

workflows/
  libA/
  libB/
  libC/
  libD/

, with the exception that different versions of packages would still be nested. (e.g: if libB and libD required incompatible versions of libC, libC would stay as two separate subdirectories).

Another approach is to put each version of each dependency in a unique, predictable location, e.g: using the hash as a directory component.

workflows/libA@ab8302dlf/
workflows/libB@c3302da94/
workflows/libC@00f9a9382/
workflows/libC@4938282fa/

However, users probably don't want to type import { libB } from "../workflows/libB@c3302da94/", so this approach usually comes with a smarter "linker" or other system to help map these to human-meaningful names. The package manager pnpm uses an approach like this. (It creates hard-links on the filesystem in the "normal" nested structure to a 'global' store).

However, both of these flattening strategies can come with additional cognitive overhead for the user, and might not be right for nf-core. Editing a dependency during development is usually easier with the more "obvious" nested structure, so that you're free to experiment without affecting other pieces that also use that dependency.

If anyone is interested in a longer read, I'd recommend the article So you want to write a package manager. The core nf-core (hehe) team likely doesn't need to revisit this, but for anyone else interested, it's a good read. Specifically on this issue, it contains the recommendation:

If your language permits it, and the type system won’t choke on it, and the global state risks are negligible, and you’re cool with some binary/process footprint bloating and (probably negligible) runtime performance costs, AND bogging down static analysis and transpilation tooling is OK, then duplication is the shared deps solution you’re looking for. That’s a lot of conditions, but it may still be preferable to reconciliation strategies, as most require user intervention — colloquially known as DEPENDENCY HELL — and all involve potentially uncomfortable compromises in the logic itself.

@edmundmiller
Copy link
Contributor Author

@fwip Thanks for the detailed post! I was trying to avoid suggesting we write a package manager. I'm afraid the other solutions both sound like they're going to need a package manager.

There could be rare cases where a subworkflow is not compatible with the latest version of a module. But I'd risk it until we have such a case and then figure out a solution.

@grst said it well. I think first and foremost, getting the subworkflows into a central repo, and getting them tested, and allowing a centralized collaboration on them is the first step. We can pin an issue, stating to use them at your own risk and to please upstream any fixes and report bugs but they may not be supported. Or keep them on a separate branch from master.

I think once we've collected them all, and just using the modules that are in the repo, then we revisit all the tooling discussions.

So I propose:

  1. Wrap up this PR
  2. Collect all the subworkflows, and get them working.
  3. Come up with tooling to install them, and maintain them.

@grst
Copy link
Member

grst commented Aug 13, 2021

then the test will need to "install" the modules needed for the subworkflow, which is added complexity.

The subworkflow would just import the modules from their location in the modules directory, there is nothing to install and no added complexity.

So no module PRs should be merged without also updating the pinned versions within the relevant subworkflows

As @hpatel pointed out already, we would basically update the same file at several locations within the same repository (it can be many locations, especially in the case of nested subworkflows). This will make lead to a lot more changed files per PR making them harder to review.

I don't want to have to update and modules for subworkflows, I just want to update the subworkflow itself (if it needs it).

Once the subworkflow gets installed into a pipeline, I would indeed pin the modules. Always the latest version of modules (with which the subworkflow has been tested 🚀) would get pulled in.

In any case, I like @emiller88's suggestion:
Build whatever prototype to test subworkflows he thinks is easiest (he's implementing the tests, after all) and once we got a couple of them in (maybe 10-15ish) we can revisit this discussion based on everything we learned on the way.

@ewels
Copy link
Member

ewels commented Aug 14, 2021

This will make lead to a lot more changed files per PR making them harder to review.

Yeah I don't like the idea of keeping the source for the subworkflow module dependencies in the modules repo. As the subworkflows are never run directly from this repo I think that it's fine to exclude them though. Then they can be installed whenever the pipeline is run - either included in a pipeline or for testing.

It's added complexity to fetch those files, sure. But not much - it's a predictable URL so only a few lines of code.

@ewels
Copy link
Member

ewels commented Aug 14, 2021

And yeah I'm not saying that everything has to be done in this PR :) Fine to get a proof of concept up and running and go from there. Was just quite an interesting discussion about future plans - good to agree a strategy!

@edmundmiller
Copy link
Contributor Author

Yeah I don't like the idea of keeping the source for the subworkflow module dependencies in the modules repo. As the subworkflows are never run directly from this repo I think that it's fine to exclude them though. Then they can be installed whenever the pipeline is run - either included in a pipeline or for testing.

That may be the solution. In the pipeline, there's a directory in subworkflows/modules that keeps the versions of the modules needed by the subworkflows.

@edmundmiller edmundmiller force-pushed the subworkflow-ci branch 4 times, most recently from 0392ffa to 4417019 Compare September 10, 2021 19:14
@edmundmiller
Copy link
Contributor Author

Note that I added nextflow.configs to the subworkflows. I think this is a good way forward, we can specify things there such as params, and maybe get a little more verbose with them rather than trying to pass a set of a set for example we could just list those out. They can then be included in the module.config or in the main nextflow.config to avoid duplicating params for subworkflows or creating a middleman.

Copy link
Member

@grst grst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for now!

Anything left to do that we should keep track of in a separate issue for the future?
The thing that would come to my mind is installation of subworkflows into a pipeline, but that would be for #tools.

@edmundmiller
Copy link
Contributor Author

Yep, I agree that should be under tools. Maybe make an issue there and link it back here?

I think we should make a project to keep track of the subworkflows from the various workflows to keep track of what needs to be added.

@edmundmiller
Copy link
Contributor Author

Okay created an issue in tools.

Also created https://github.com/nf-core/modules/milestone/4 to track the subworkflows

Closes #707
Closes #708
Closes #709

We might also want an issue template?

@edmundmiller edmundmiller requested a review from grst October 8, 2021 02:28
Copy link
Member

@grst grst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm

Copy link
Member

@jfy133 jfy133 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty straightforward to me

Copy link
Member

@JoseEspinosa JoseEspinosa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I think is a nice idea to incorporate a nextflow.config for subworkflows and at some point instead of including it in the main nextflow.config or in the module.config, as discussed in your previous comment, it might be automatically included. At least this is what I tried to push on this nextflow issue.

pattern: '*.ebwt'
# TODO
output:
- bam:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are many more outputs than this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, this is why you put the TODO statement above?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly! We hadn't come up with a standard for the meta.yml for the subworkflows yet, and I'm trying to not have this get caught in limbo.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough 👍🏽

Copy link
Member

@drpatelh drpatelh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks amazing @emiller88 ! Some minor comments but mainly if we are able to split up the tests for the sub-workflows?

@drpatelh
Copy link
Member

drpatelh commented Oct 8, 2021

Beautiful 😍

@drpatelh drpatelh merged commit c19671d into master Oct 8, 2021
@edmundmiller edmundmiller deleted the subworkflow-ci branch October 8, 2021 15:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed Ready for Review tests Related to automated tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants