Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle module / process imports #8

Closed
ewels opened this issue Dec 5, 2019 · 78 comments
Closed

Handle module / process imports #8

ewels opened this issue Dec 5, 2019 · 78 comments

Comments

@ewels
Copy link
Member

ewels commented Dec 5, 2019

Lots of people use nf-core pipelines offline. We want to make the process of using modules from a different repository as simple as possible.

One solution would be to use git submodule to add nf-core/modules as a git submodule to every pipeline. By default, doing git clone will not pull the submodules. Doing git clone --recursive or git submodule update --init --recursive will pull the module repository.

Loading logic could then be:

  • Try to load the files locally - works if submodule is initialised. Fails otherwise.
  • If fails, try to load from the web
  • If fails, exit with an error

Then by default most people running online will pull the online files dynamically. But pulling a pipeline to use offline is super easy and does not require any changes to files or config.

Currently nf-core download manually pulls institutional config files and edits nextflow.config so that the pipeline loads these files. This could also be done with submodules as above, without any need to edit any files.

Limitations would be that we have to manage the git hash of the modules repository in two places - the git submodule file and the nextflow.config file. We can lint to check that these two are the same. Also, this forces pipelines to use a single hash for all modules in the pipeline. I think this is probably ok for reasons of maintaining sanity though.

Thoughts?

@drpatelh
Copy link
Member

drpatelh commented Dec 5, 2019

A naive implementation is mentioned here #3, and follows the current procedure we use for nf-core/configs.

Excuse my simple brain but would the git submodule approach allow us to also dynamically obtain the latest hash for the modules repo so we dont have to update it manually in nextflow.config (or point out that it needs to be updated as part of the linting)?. For example, when we release a pipeline the hash will need to remain static to use those particular versions of the module files.

We will definitely need to have a more fine grained control over versioning for modules compared to the configs repo.

@ewels
Copy link
Member Author

ewels commented Dec 5, 2019

No, we will still have to manually update it in nextflow.config for the remote loading to be done (step 2 in my workflow above, if not cloned recursively). This is as your example in #3.

Prepping an example now, will make more sense when you see it hopefully 👍

@ewels
Copy link
Member Author

ewels commented Dec 6, 2019

Example pipeline repo: https://github.com/nf-core/dsl2pipelinetest

To be combined with code in #9

@ewels
Copy link
Member Author

ewels commented Dec 6, 2019

Ok, so discussion with @pditommaso on gitter refers to this [link]

it's not a good idea
put the modules as a subtree in the project
but in a nutshell, it's like having the complete copy in your working tree
but still you have the ability to sync with the remote one

So the suggestion is that we never load remote files here - we just always include the entire nf-core/modules repo in every pipeline.

Pros:

  • No change to current behaviour - git clone and nextflow pull is super simple
  • Will work offline without any issues

Downsides:

  • Repos could get pretty big, if nf-core/modules gets big
  • ..?

I personally like this 😉

@pditommaso
Copy link
Contributor

A possible problem including nf-core/modules is that you will need to update all modules altogether, that still could be a strategy.

if you want to control the version of each module independently you should include each of the as a separate substree.

@ewels
Copy link
Member Author

ewels commented Dec 6, 2019

So one major downside - with git submodules you have a nice .gitmodules file that explicitly shows what commit hash you currently have:

[submodule "modules"]
	path = modules
	url = https://github.com/nf-core/modules.git
Subproject commit a88b867498d783a84ec659017ed94ee2acaaa22b

With git subtree everything is in one repo, so it's much more difficult to inspect at what commit the modules repo is at. I think the only way to do it is by inspecting the commit messages in git log...

@ewels
Copy link
Member Author

ewels commented Dec 6, 2019

A possible problem including nf-core/modules is that you will need to update all modules altogether, that still could be a strategy.

Yes - I think that having everything at one commit is a sacrifice worth making for simplicity though. In fact I'd prefer to enforce this as otherwise stuff could get very confusing quickly.. 😝

@ewels
Copy link
Member Author

ewels commented Dec 6, 2019

Downsides of submodules is that people using git clone will hit problems as they have to use --recursive. Nextflow can handle this with nextflow pull, so that's fine. The GitHub zip file download probably also won't work (used by nf-core download as well as being a visible button on the repo web pages).

@apeltzer
Copy link
Member

apeltzer commented Dec 6, 2019

Can we find a way to resolve the latter issue somehow else with nf-core download? E.g. by locally doing a git pull recursive / using submodules, packaging that up in nf-core download and providing it to the user for copying it over?

@ewels
Copy link
Member Author

ewels commented Dec 16, 2019

Yes, it's pretty easy to fix with nf-core download by refactoring that code 👍 - but the download button on GitHub won't work still. Fairly minor problem though. I think I'm most keen on the submodules now with the more explicit and traceable lock on the modules remote repo. I think it will just be too easy to mess stuff up with the subtree 😟

@apeltzer
Copy link
Member

I agree - don't care too much about the Github download button either as we provide a proper alternative and can document that as well 👍

@ewels
Copy link
Member Author

ewels commented Dec 17, 2019

@aunderwo - I'd be curious to hear your thoughts on this one! Just reading your blog post where you mention git subrepo..

@junjun-zhang
Copy link

junjun-zhang commented Jan 14, 2020

A possible problem including nf-core/modules is that you will need to update all modules altogether, that still could be a strategy.

if you want to control the version of each module independently you should include each of the as a separate substree.

We are taking a different approach to import remote modules that addresses the above concern and does allow us to version control each module independently. Here are the modules, and here is how these modules are imported into the pipeline repository, basically they are materialized locally (needs to run something similar npm install to import/sync module files).

Since the module files are local, same as other normal files in the pipeline's git repo, once sync'd and committed to git, there is nothing additional needed to make the pipeline work.

@ewels
Copy link
Member Author

ewels commented Jan 15, 2020

Hi @junjun-zhang,

This is definitely an interesting idea.. So this would involve creating a new subcommand for the nf-core helper tool I guess, which would manage the copying / updating of process files. I guess we could also add lint tests to ensure that the code for the processes is not modified away from the upstream repository at all. It would certainly make the pipelines easier to read / understand too..

Phil

@apeltzer
Copy link
Member

Yes, a very interesting approach!
The only drawback I see with it is, that we need to use a separate tool/method to handle this. A user that only uses nextflow on a cluster that e.g. cannot be configured by the user might have some issues as such an installation involves talking to IT / HPC admins first... which the plain usage of Nextflow does not require. Other than that I can only follow the comment from Phil - would make a lot of standard things easier ...

@ewels
Copy link
Member Author

ewels commented Jan 15, 2020

No, I'm not sure that this is correct - here the nf-core helper command would only be needed by pipeline developers when editing the pipeline. The process code would be flat files within the pipeline repository, so nothing special for the end user (in fact, even less than using git submodules).

@apeltzer
Copy link
Member

Ok I guess I didn't understand it then yet entirely - after reading again, I think I understood it now as well. I think nf-core tools extension is the way to go then. Fully flexible & we can expect developers to be able to do this when doing the dev work - for users it doesn't interfere at all 👍

@ewels
Copy link
Member Author

ewels commented Jan 15, 2020

Following the logic of the npm way of doing things, I guess we could then have a meta information files with the details of where each process comes from.. eg. processes.json that has the name, version, repo, hash & path for each package that has been pulled in (maybe we don’t need all of these fields? Maybe I’m missing some?).

@junjun-zhang how do you handle the management of these imported files?

@junjun-zhang
Copy link

@ewels what you mentioned is possible, one way or the other those information is useful to keep for dependency management. I am fairly new to Nextflow, still trying to learn more, so just to share my own thoughts here. What we are experimenting is something quick and simple but well-supports one of the most important features - explicit declaration of module dependencies down to specific versions. This is to fulfill the ultimate goal of reproducible / version controlled pipeline builds. At this point, our plan is to write a simple script (likely in Python) to detect dependencies of remote modules by searching for lines starting with include "./modules/raw.githubusercontent.com/xxxxx" in Nextflow pipeline code, then fetch the contents and store them under the local modules folder. Of course, this is very preliminary and basic, locking module content down with git commit hash etc would be great future improvement.

Dependency management is an important feature for any programming language, GO Language initially did not have good support for it. There have been numerous solutions developed by the GO community until the introduction of GO Modules. Some blogs might be interesting to read: here, here and here. I am not suggesting to take the same approach as GO Modules, but it's certainly a great source of inspiration. Ultimately, I think it's up to Nextflow language to choose it's own official approach for dependency management. For that, I'd like to hear how others think, particularly @pditommaso

@pditommaso
Copy link
Contributor

Since there are many package managers out, there's nothing that could be used to manage NF module assets? I was even thinking to use npm. Would that be such crazy?

@antunderwood
Copy link
Contributor

@aunderwo - I'd be curious to hear your thoughts on this one! Just reading your blog post where you mention git subrepo..

I have found subrepo (wrapper around git subtree) a more transparent way of dealing with modules particularly since the files pulled in via subrepo are not links

@junjun-zhang
Copy link

Since there are many package managers out, there's nothing that could be used to manage NF module assets? I was even thinking to use npm. Would that be such crazy?

That seems a bold idea, don't know npm enough to make further comment. Leveraging existing solutions is definitely plausible. Might conda be another possible option? Here is how Conda describes herself:

Package, dependency and environment management for any language—Python, R, Ruby, Lua, Scala, Java, JavaScript, C/ C++, FORTRAN, and more.

Nextflow could possibly be added to the above list?

@pditommaso
Copy link
Contributor

Not an expert either, but my understanding is that npm allow managing any asset irrespective of the prog lang.

Conda would be even better since is very well know too for the bioinfo community, however, I'm sure it allows the copying of the module files in the project directory as we need. I think it's designed to keep them in a conda central directory which would not work for our use case. But, I repeat, I not 100% about this point.

@ewels
Copy link
Member Author

ewels commented Jan 17, 2020

I feel that conda might be a little bit confusing, as all of these tools are possible to install via other conda channels. I can imagine doing conda install bwa and it copying a few nextflow files somewhere random. Also as you say, I think conda always keeps stuff in its own environment directories. npm install works in the current working / project directory though, which is probably more what we want.

@ewels
Copy link
Member Author

ewels commented Jan 17, 2020

One down side of npm is that each package needs to be within its own git repository. This isn't necessarily all bad (we've discussed doing this before anyway). On the plus side, we can publish it within a nextflow / nf-core scope which would make the installation names pretty clear.

@ewels
Copy link
Member Author

ewels commented Jan 21, 2020

The more I think about this, the more I think that we should copy the approach of npm / bioconda but build this functionality in to the nf-core tools package. This is already a dependency, so doesn't add any complexity for developers, and means that we have complete control of how we want the system to work.

This is of course less good as a general nextflow (not nf-core) option, but I think that maybe that is ok for now.

@ewels ewels changed the title Use git submodules Handle module / process imports Jan 21, 2020
@pditommaso
Copy link
Contributor

Thought having an ad-hoc nf-core package manager tool surely streamline the experience for the final user, I would suggest resisting the temptation to create yet another package manager and related specification (metafiles? how to manage releases? version numbers, etc).

Maybe a compromise could be to implement a wrapper over an existing package manager to simplify/hide the interaction for the final user and at the same rely on a well-established package managing foundation.

The external dependency with conda/npm/etc I don't think it's so critical because the module files would be in any case included in the GH project repository. Therefore the pipeline user would not need to use any third-party package manager. It would only be required by the pipeline curator when update/sync the deps.

@ewels
Copy link
Member Author

ewels commented Jan 25, 2020

Yes this was my initial thought as well. But I still see two main drawbacks:

  • We will have to have a separate repository for every tool / process we want to import
  • We have a lot of pipeline curators! So the extract dependencies are still a little irritating.. (though not too bad and if commands are wrapped then I agree it's not a big deal)

@junjun-zhang
Copy link

junjun-zhang commented Feb 21, 2020

@ewels, good thinking! If I understand you correctly, you are talking about something conceptually like the following structure for workflow with sub-workflows:

my_wf/
├── main.nf
└── modules/
    ├── mod_3.nf
    ├── mod_4.nf
    ├── modules/
    │   ├── mod_1.nf
    │   └── mod_2.nf
    ├── sub-wf_1.nf
    └── sub-wf_2.nf

my_wf uses module mod_3 and mod_4, and sub-workflow sub-wf_1 and sub-wf_2. sub-wf_1 uses mod_1, sub-wf_2 uses mod_2. This way, the importing NF script can always write the include statement as this pattern: include MOD_1 from './modules/mod_1'.

That could work, but I was thinking all of the modules were supposed to be under the same level and there is no hierarchical structure (obviously it does not have to be this way).

One thing clear to me is that if we go with this, workflows / modules that are intended to be sharable should have unique names, not like currently main.nf is pretty much for everything.

@ewels
Copy link
Member Author

ewels commented Feb 23, 2020

Yup, exactly. Though I would err towards using a directory for each tool instead of a file, then having each file called main.nf. Then the whole directory can be grabbed and we have a bit more flexibility for the code organisation.

What I had in mind was something like this:

my_wf/
├── main.nf
└── modules
    ├── tools
    │   ├── tool_1
    │   │   └── main.nf
    │   └── tool_2
    │       └── main.nf
    └── workflows
        └── sub_wf_1
            ├── main.nf
            └── modules
                └── tools
                    ├── tool_3
                    │   └── main.nf
                    └── tool_4
                        └── main.nf

Note that in the above, I envision sub-workflows using being fully-fledged pipelines in themselves, so they would mirror the structure of the pipeline importing them.

@pditommaso
Copy link
Contributor

This is an interesting point. In the current form included paths are resolved against the including script path, but this could result in duplicating the same module when imported by two different sub-workflows, that's is not good.

I'm starting to think the included path should be resolved against the main project directory.

@junjun-zhang
Copy link

junjun-zhang commented Feb 23, 2020

I'm starting to think the included path should be resolved against the main project directory.

+1 for that. I am actually more leaning towards flat structure, it not only avoids double importing, but also much simpler.

Maybe Nextflow could amend its module include mechanism for a bit. Include path starts with . continue to be interpreted as relative path. Include path starts with other characters than . or / (/ is actually not allowed) would have two paths to search sequentially: 1) same path as the current including script path which is equivalent to ./; 2) a user configured path under main project directory, such as: "${workflow.projectDir}/modules", or could introduce a new runtime metadata variable: moduleDir. Pinging @pditommaso for his comments on this.

my_wf
├── main.nf
└── modules
    ├── sub_wf_1
    │   ├── helper_process.nf  // I imagine some sort of helper maybe necessary in some situation
    │   └── main.nf
    ├── sub_wf_2
    │   └── main.nf
    ├── tool_1
    │   └── main.nf
    ├── tool_2
    │   └── main.nf
    └── tool_3
        └── main.nf

For the above example, the include statements may look like:

  • In my_wf/main.nf: include "sub_wf_1/main"; include "sub_wf_2/main"; include "tool_1/main"; include "tool_2/main"
  • In my_wf/modules/sub_wf_1/main.nf: include "helper_process"; include "tool_1/main"
  • In my_wf/modules/sub_wf_2/main.nf: include "tool_3/main"

Note that tool_1 is included in both my_wf and sub_wf_1

@ewels
Copy link
Member Author

ewels commented Feb 24, 2020

I disagree - I think that double importing the same module is a necessary evil. If we avoid double-importing, we lose control over the versioning. Then sub-workflows could end up using different versions of imported processes depending on which workflow imports them. This is bad for reproducibility and in the worst case scenario will break stuff.

The flip side is that a pipeline that uses multiple sub-workflows could run different versions of the same tool. This could be confusing, but I think that this is safer: it shouldn't break anything.

Does it really matter to double import processes?

@junjun-zhang
Copy link

junjun-zhang commented Feb 24, 2020

I thought a lot about versioning and being able to include specific versions. How about all installed modules are versioned using their corresponding folder names?

my_wf
├── main.nf
└── modules
    ├── sub_wf_1.v3
    │   ├── helper_process.nf
    │   └── main.nf
    ├── sub_wf_2.v2
    │   └── main.nf
    ├── tool_1 -> tool_1.v3  // can even use symlink pointing to latest version if one really wants always latest in some legit cases, I don't personally like this though
    ├── tool_1.v3
    │   └── main.nf
    ├── tool_1.v2
    │   └── main.nf
    ├── tool_2.v4
    │   └── main.nf
    └── tool_3.v2
        └── main.nf

Then include statements would be:

  • In my_wf/main.nf: include "sub_wf_1.v3/main"; include "sub_wf_2.v2/main"; include "tool_1/main"; include "tool_2.v4/main"
  • In my_wf/modules/sub_wf_1.v3/main.nf: include "./helper_process"; include "tool_1.v2/main"
  • In my_wf/modules/sub_wf_2.v2/main.nf: include "tool_3.v2/main"

Note that my_wf includes latest version (v3) of tool_1, sub_wf_1.v3 includes v2 of it. Now they all live in harmony.

I much agree that reproducibility is one of the top priorities in bioinformatics pipelines, all include statements should be explicit about which version is being imported. Like Conda environment file, if one wants the environment to be reproducible, all dependencies should specify specific versions.

One the other point: there is no difference between importing tools and importing sub-workflows, which I believe is a good thing. From importing script's (my_wf/main.nf) point of view, it's all the same including and making use of the imported modules, regardless it's a tool (DSL2 process) or sub-workflow (DSL2 workflow).

@pditommaso
Copy link
Contributor

Once I've made the comment I've realised as well that resolving the module path against a common directory, would open the door to possible module version conflicts.

Also, this structure can already be achieved in the current implementation just using the following idiom:

include foo from "$baseDir/X/Y"
include bar from "$baseDir/P/Q"

Say that, I agree with Phil that each sub-workflow should bring their own modules and preventing in this way any potential conflict.

At the same time, I share the view of JunJun that in the long run, complex pipelines, could become a mess and a flat structure could be easier to maintain.

Now, regarding the problem of version conflicts. This exactly what these tools (conda, npm, etc) are designed for. Therefore I think how conflicts should be managed (and the resulting directory structure) should be delegated to the tool that is chosen to handle the packaging of the modules.

@junjun-zhang
Copy link

junjun-zhang commented Feb 25, 2020

That’s great, didn’t know include foo from "$baseDir/X/Y" works in 20.01, it didn’t work in previous versions.

Regards the possibility sub-workflows may depend on different versions of the same tool, I think this is a feature needs to be supported, meaning that they both need to be brought in to the main workflow. I am not aware whether Conda or other package manager supports installation of different versions of the same package at the same time.

@ewels
Copy link
Member Author

ewels commented Feb 25, 2020

I'm still struggling to understand why all tools need to be on the same level - it seems like a huge amount of complexity to add and I can't see any advantages. If we just copy them in with a subworkflow then we don't have to do any management at all and the entire situation remains very simple..

At the same time, I share the view of JunJun that in the long run, complex pipelines, could become a mess and a flat structure could be easier to maintain.

I don't see how this would be the case though, as developers will never edit the imported code. So sure the final workflow could in theory end up with a complex file tree, but it doesn't really matter as the developer will never need to look in to those files. If they want to edit anything in the subworkflow, they edit it in that repository where the file tree is simple and consistent with all other pipelines..

@junjun-zhang
Copy link

@ewels you got a point. There is one question I am not clear how others think. Given the following example, sub_wf_1 uses tool_1 and tool_2. It's clear tool_1 and tool_2 can be versioned, packaged and registered in, say, nf-core Conda channel. The question is when we package sub_wf_1, will tool_1 and tool_2 be packaged together?

sub_wf_1
    ├── main.nf
    └── modules
        ├── tool_1
        │   └── main.nf
        └── tool_2
            └── main.nf

If it's a yes, then it's something like a Uber JAR in JAVA world. When sub_wf_1 is imported, it will have tool_1 and tool_2 included as well, the importing script has no choice to exclude them. In such case, only the nested structure you proposed will work.

If it's a no, only sub_wf_1/main.nf will be packaged. It's still possible to create nested structure with some extra work. What I am not sure is whether Conda supports this nested packaging, or whether it's a good practice in Conda. Or, maybe here is the custom tools like nf-core tools come into play?

@ewels
Copy link
Member Author

ewels commented Feb 25, 2020

The question is when we package sub_wf_1, will tool_1 and tool_2 be packaged together?

I think the answer is yes, going way back up to this comment at the top of this thread, where we established that all workflows should include hard copies of their imported modules in the same repository.

@ewels ewels mentioned this issue Feb 25, 2020
6 tasks
@junjun-zhang
Copy link

all workflows should include hard copies of their imported modules in the same repository

For the workflow repo, yes, it should include code of itself and all of its dependent modules. I think we are on the same page for that.

However, what goes to the workflow package and gets uploaded to the registry server (like Anaconda) is a separate question. It could be all included, or could be just the workflow code only. For the latter, when installing the workflow, it first pulls down the workflow code then fetches its dependencies from individual module packages. Both should be possible, just not sure which is a more sensible choice for us.

@drpatelh
Copy link
Member

drpatelh commented Feb 25, 2020

I'm slightly confused about the practicality of all of this 🤔 Given the sub-workflow scenario we could potentially be using different versions of the same tool in a given workflow? How would you even begin to conventionally write that up in a paper? Yes the pipeline in its entirety is reproducible and I understand that version control is important and shouldnt be compromised but surely there is a way in which we can instruct individual modules to use particular versions of software? I also understand that module files may become redundant with individual tool updates but this seems to be a more practical aspect to put under version control. I'm not suggesting I have the answers but that maybe we need to be thinking about this differently?

Is it plausible to have full version control between the main script > sub-workflow > individual module file whilst maintaining a level of abstraction as to which software container is going to be used? Or maybe Ive misunderstood and need to :homer_disappear:

@grst
Copy link
Member

grst commented Feb 26, 2020

I have the feeling that this is going a bit in circles and maybe you guys have to figure that out at the Hackathon while having a few beers ;)
Maybe we also need to move forward both (i.e. simplistic vs. package manager) approaches a bit to see how it goes. I put together a proof-of-concept for conda, maybe someone else can create sth similar for the other approach?

Some more thoughts:

  • Like @drpatelh, I don't really like the idea of using different versions of the same tool within a single pipeline. Maybe it's better if people are forced to update a module if they hit a version conflict instead of using different, potentially outdated versions in their pipeline?
  • What would be your strategy for versioning modules with the simplistic approach? I mentioned this earlier, but it has not been picked up so far:

I don't think a git repository is good for keeping track of "released versions". Yes, there are tags and releases, but we want individual releases for each module. For this to work, we would at least require some external system that links a certain version number of a module to the corresponding commit hash.

  • Ho do the approaches compare if we think big, i.e. 1000s of modules, 100s of versions, 100s of PRs per week, sub-sub-sub-sub workflows etc.? Bioconda/conda-forge proof that it's manageable using their build system. What are the challenges we would face creating our own?

  • Using the simplistic build system, how would sub-sub-sub-... workflows resolve? If all sub-workflows are self-contained it would certainly become a mess. (Think of a pipeline predicting targets for personalized cancer vaccinations that depends on a neoantigen prediction pipeline that depends on fusion neoantigen prediction pipeline that depends on a variant calling workflow that depends on a DNA-seq workflow that depends on a quality control workflow that depends on the fastqc module.) ... But maybe it really doesn't matter that much.

Finally, answering @junjun-zhang's question:

What I am not sure is whether Conda supports this nested packaging, or whether it's a good practice in Conda.

I believe it would be possible by tweaking the recipe file, but it's not considered good practice, or at least, i've never seen this before. After all, the whole point of conda is to resolve all dependencies s.t. there are no conflicts.

@junjun-zhang
Copy link

@grst your points are well taken! I like the idea of building quick PoCs and think big. Not that we have to do the big things at the beginning, but definitely beneficial planning for it.

What would be your strategy for versioning modules with the simplistic approach? I mentioned this earlier, but it has not been picked up so far

how about this: #2 (comment) ? To be honest, I like it a lot, it's super simple and get-job-done! I was afraid it seemed such a hack to others, but as soon as I see it's also being used by GO Lang, big relief.

@ewels
Copy link
Member Author

ewels commented Feb 26, 2020

I've just started working on a simple proof of concept PR for a simplistic copy-from-github method. I'll link it here when there is something to show 👍

Maybe it's better if people are forced to update a module if they hit a version conflict instead of using different, potentially outdated versions in their pipeline?

I think this is probably the only way to manage this problem and we came up with the exact same idea earlier today. Basically in the linting check that there are no duplicate tools with different versions in a pipeline. If there are, the linting fails and the author has to change imports / upstream pipelines until this is resolved. Except it was pointed out that there may be some edge cases where different versions are required, so it would probably be a warning instead of a hard failure.

For versioning I think we can use the metadata yaml files that come with the import, plus some kind of extension of the current version calls that we currently have..?

Thinking big is good, but only if it doesn't come at the expense of adding lots of complexity that may slow down growth - it's a balance! 😄 I'm a little encouraged at the idea that we hope to wrap whatever solution we go for in nf-core tools subcommands, so we can switch the back end at a later date without affecting the developer flow much / at all.

@ewels
Copy link
Member Author

ewels commented Mar 4, 2020

WIP nf-core tools command for importing module files into a pipeline as a draft PR here: nf-core/tools#579

@piotr-faba-ardigen
Copy link
Member

To add something from myself. I just tested both implemented approaches:

  1. conda by @grst implemented in conda build system #14
  2. nf-core modules by @ewels implement in nf-core/tools#579

It seems that both quickly allow to achieve the goal. That is, modules end up in the right directory structure.

I like nf-core modules more as it does not introduce additional dependency.

On the other hand, for now conda has this appealing feature that the channel where the modules are hosted is customizable, making it useful for more general use case. However, that said, it does not mean that nf-core modules could not be expanded to take repository, it pulls from, as an argument. However, going in that direction could easily grow the code as various repository provides (github, gitlab etc) have different APIs. Nextflow itself in this manner is supporting only the most popular ones: https://www.nextflow.io/docs/latest/sharing.html .

I may be missing other important points. But at this moment it seems to me, that the choice between the two falls into the category of personal preference.

@apeltzer
Copy link
Member

apeltzer commented Mar 5, 2020

Pretty much in line with what @piotr-faba-ardigen just wrote, tested both locally here (okay, in a VM to be fair), but that does seem to work in both cases. The second part that Piotr just argued about is also strikingly important to me I think, as e.g. we cannot share all the modules we have but would like to be able to e.g. have multiple "channels" of modules - if we can add some flexibility to nf-core modules that allows this, that would be super cool - that way users can rely on the hopefully big repository of nf-core/modules in the future, BUT also use and rely on their own modules too.

We could even add some linting in the future to check where the modules exactly came from, but as long as we follow kind of a hierarchical model similar to bioconda, conda-forge, anaconda etc, it should be fine to adopt the concept for modules here too.

I'd also like to keep things as simple as possible, although benefitting from experiences at bioconda is a good idea :-)

@ewels
Copy link
Member Author

ewels commented Jul 1, 2020

Ok, after a little further discussion on the nf-core slack in various channels and again just now with @grst, I think we should wrap this up.

Let's go for the basic home made nf-core modules approach for now - in the future we can always revise this and just update that command to use the conda backend instead if we want to. This way we keep things simple and lightweight for now so that we can get moving and start to build stuff 🚀

I've moved my initial proof-of-concept code onto a module-import branch on nf-core/tools with this draft PR to monitor its status: nf-core/tools#615 - feel free to dig in and make some PRs to that branch with missing functionality. When we have a basic set in place we can merge it / release.

Thanks all for an excellent discussion!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.