Handle module / process imports #8
A naive implementation is mentioned here #3, and follows the current procedure we use for
Excuse my simple brain, but would the
We will definitely need to have more fine-grained control over versioning for |
No, we will still have to manually update it in
Prepping an example now - it will make more sense when you see it, hopefully 👍 |
Example pipeline repo: https://github.com/nf-core/dsl2pipelinetest. To be combined with the code in #9 |
Ok, so a discussion with @pditommaso on Gitter refers to this [link].
So the suggestion is that we never load remote files here - we just always include the entire nf-core/modules repo in every pipeline. Pros:
Downsides:
I personally like this 😉 |
A possible problem: if you want to control the version of each module independently, you would need to include each of them as a separate subtree. |
So one major downside - with git submodules you have a nice
With |
Yes - I think that having everything at one commit is a sacrifice worth making for simplicity though. In fact I'd prefer to enforce this as otherwise stuff could get very confusing quickly.. 😝 |
A downside of submodules is that people using |
Can we find some other way to resolve the latter issue with |
Yes, it's pretty easy to fix with |
I agree - I don't care too much about the GitHub download button either, as we provide a proper alternative and can document that as well 👍 |
@aunderwo - I'd be curious to hear your thoughts on this one! Just reading your blog post where you mention git subrepo.. |
We are taking a different approach to importing remote modules that addresses the above concern and does allow us to version-control each module independently. Here are the modules, and here is how these modules are imported into the pipeline repository: basically, they are materialized locally (this needs running something similar to
Since the module files are local, the same as other normal files in the pipeline's git repo, once synced and committed to git there is nothing additional needed to make the pipeline work. |
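As a rough illustration of that idea (the module name and the vendored path below are invented, not the actual repositories linked above): once a remote module has been materialized into the pipeline repo, it is included like any other local script, using current DSL2 include syntax.

```nextflow
// Hypothetical sketch - names and paths are illustrative only.
nextflow.enable.dsl = 2

// The module file was materialized (copied) into the pipeline repo by a
// sync step, so this is a plain local include with no remote fetching.
include { BWA_MEM } from './modules/external/bwa-mem/main'

workflow {
    reads_ch = Channel.fromFilePairs(params.reads)   // assumes params.reads is set
    BWA_MEM(reads_ch)
}
```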
Hi @junjun-zhang, this is definitely an interesting idea. So this would involve creating a new subcommand for the
Phil |
Yes, a very interesting approach! |
No, I'm not sure that this is correct - here the |
Ok, I guess I didn't entirely understand it at first - after reading it again, I think I understand it now. I think an nf-core tools extension is the way to go then. Fully flexible, and we can expect developers to be able to do this when doing the dev work - for users it doesn't interfere at all 👍 |
Following the logic of the
@junjun-zhang, how do you handle the management of these imported files? |
@ewels what you mentioned is possible; one way or the other, that information is useful to keep for dependency management. I am fairly new to Nextflow and still trying to learn more, so I'm just sharing my own thoughts here. What we are experimenting with is something quick and simple, but it supports well one of the most important features - explicit declaration of module dependencies down to specific versions. This is to fulfil the ultimate goal of reproducible / version-controlled pipeline builds. At this point, our plan is to write a simple script (likely in Python) to detect dependencies of remote modules by searching for lines starting with
Dependency management is an important feature for any programming language. The Go language initially did not have good support for it; there have been numerous solutions developed by the Go community until the introduction of Go Modules. Some blogs might be interesting to read: here, here and here. I am not suggesting we take the same approach as Go Modules, but it's certainly a great source of inspiration. Ultimately, I think it's up to the Nextflow language to choose its own official approach to dependency management. For that, I'd like to hear what others think, particularly @pditommaso |
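As a tiny sketch of that dependency-detection idea (written in Groovy here purely for illustration - the plan above mentions Python, and the exact line prefix being searched for is an assumption):

```groovy
// Walk the repo and collect include lines from .nf files, so that the remote
// source / version of each module dependency could be recorded later.
new File('.').eachFileRecurse { f ->
    if (f.name.endsWith('.nf')) {
        f.eachLine { line ->
            if (line.trim().startsWith('include')) {
                println "${f.path}: ${line.trim()}"
            }
        }
    }
}
```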
Since there are so many package managers out there, is there nothing that could be used to manage NF module assets? I was even thinking of using |
I have found subrepo (a wrapper around git subtree) a more transparent way of dealing with modules, particularly since the files pulled in via subrepo are not links |
That seems a bold idea; I don't know
Nextflow could possibly be added to the above list? |
Not an expert either, but my understanding is that Conda would be even better since it is very well known in the bioinfo community too. However, I'm not sure it allows copying the module files into the project directory as we need; I think it's designed to keep them in a central conda directory, which would not work for our use case. But, I repeat, I'm not 100% sure about this point. |
I feel that conda might be a little bit confusing, as all of these tools can be installed via other conda channels. I can imagine doing |
One downside of npm is that each package needs to be within its own git repository. This isn't necessarily all bad (we've discussed doing this before anyway). On the plus side, we can publish it within a nextflow / nf-core scope, which would make the installation names pretty clear. |
The more I think about this, the more I think that we should copy the approach of
This is of course less good as a general Nextflow (not nf-core) option, but I think that maybe that is ok for now. |
Though having an ad-hoc nf-core package manager tool would surely streamline the experience for the final user, I would suggest resisting the temptation to create yet another package manager and related specification (metafiles? how to manage releases? version numbers? etc.). Maybe a compromise could be to implement a wrapper over an existing package manager to simplify/hide the interaction for the final user, and at the same time rely on a well-established package-managing foundation. I don't think the external dependency on conda/npm/etc. is so critical, because the module files would in any case be included in the GH project repository. Therefore the pipeline user would not need to use any third-party package manager; it would only be required by the pipeline curator when updating/syncing the deps. |
Yes this was my initial thought as well. But I still see two main drawbacks:
|
@ewels, good thinking! If I understand you correctly, you are talking about something conceptually like the following structure for a workflow with sub-workflows:
That could work, but I was thinking all of the modules were supposed to be at the same level, with no hierarchical structure (obviously it does not have to be this way). One thing that is clear to me is that if we go with this, workflows / modules that are intended to be sharable should have unique names, not like the current |
Yup, exactly. Though I would err towards using a directory for each tool instead of a file, then having each file called
What I had in mind was something like this:
Note that in the above, I envision sub-workflows being fully-fledged pipelines in themselves, so they would mirror the structure of the pipeline importing them. |
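As a purely illustrative layout (the tool names, and the use of `main.nf` as the per-tool filename, are my assumptions rather than the original example), a directory-per-tool structure with a sub-workflow mirroring it might look like:

```
my-pipeline/
├── main.nf
└── modules/
    ├── fastqc/
    │   └── main.nf              # process definition for this tool
    ├── multiqc/
    │   └── main.nf
    └── rnaseq_align/            # an imported sub-workflow, itself a mini pipeline
        ├── main.nf
        └── modules/
            └── star/
                └── main.nf      # the sub-workflow carries its own module copies
```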
This is an interesting point. In the current form, include paths are resolved against the including script path, but this could result in duplicating the same module when it is imported by two different sub-workflows, which is not good. I'm starting to think the include path should be resolved against the main project directory. |
+1 for that. I am actually leaning more towards a flat structure; it not only avoids double importing, but is also much simpler. Maybe Nextflow could amend its module include mechanism a bit, so that an include path starting with
For the above example, the include statements may look like:
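Purely as an illustration of how flat, project-root-resolved includes could read (the module names are made up, and resolving against the project directory is the proposal above, not current Nextflow behaviour):

```nextflow
// Hypothetical flat layout: every module sits directly under <project>/modules/.
// Under the proposal, the same include line would resolve to the same copy
// whether it appears in the main workflow or inside a sub-workflow script.
include { FASTQC      } from './modules/fastqc/main'
include { STAR_ALIGN  } from './modules/star_align/main'
include { RNASEQ_CORE } from './modules/rnaseq_core/main'   // a shared sub-workflow
```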
Note that |
I disagree - I think that double importing the same module is a necessary evil. If we avoid double-importing, we lose control over the versioning. Then sub-workflows could end up using different versions of imported processes depending on which workflow imports them. This is bad for reproducibility and in the worst case scenario will break stuff. The flip side is that a pipeline that uses multiple sub-workflows could run different versions of the same tool. This could be confusing, but I think that this is safer: it shouldn't break anything. Does it really matter to double import processes? |
I thought a lot about versioning and being able to include specific versions. How about versioning all installed modules using their corresponding folder names?
Then include statements would be:
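A sketch of what that could look like (the version-in-folder-name convention and the tool names are illustrative assumptions):

```nextflow
// Hypothetical: each installed module lives in a folder whose name carries
// its version, so the include statement pins the exact version in use.
include { FASTQC     } from './modules/fastqc-0.11.9/main'
include { STAR_ALIGN } from './modules/star_align-2.7.3a/main'

// If two sub-workflows really needed different versions of the same tool,
// both could be installed side by side, e.g. ./modules/samtools-1.9/ and
// ./modules/samtools-1.10/.
```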
Note that
I very much agree that reproducibility is one of the top priorities in bioinformatics pipelines, all
On the other point: there is no difference between importing tools and importing sub-workflows, which I believe is a good thing. From the importing script's ( |
Once I made the comment, I realised as well that resolving the module path against a common directory would open the door to possible module version conflicts. Also, this structure can already be achieved in the current implementation just by using the following idiom:
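One possible reading of that idiom (an assumption on my part; the paths are illustrative): keep a single shared modules/ directory at the project root and point every script at it with relative paths, so sub-workflows reuse the same copy.

```nextflow
// In <project>/main.nf:
include { FASTQC } from './modules/fastqc/main'

// In <project>/subworkflows/align/main.nf the same shared copy would be
// reached by walking back up the tree:
// include { FASTQC } from '../../modules/fastqc/main'
```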
That said, I agree with Phil that each sub-workflow should bring its own modules, preventing in this way any potential conflict. At the same time, I share JunJun's view that in the long run complex pipelines could become a mess and a flat structure could be easier to maintain. Now, regarding the problem of version conflicts: this is exactly what these tools (conda, npm, etc.) are designed for. Therefore I think that how conflicts should be managed (and the resulting directory structure) should be delegated to the tool that is chosen to handle the packaging of the modules. |
That's great, I didn't know
Regarding the possibility that sub-workflows may depend on different versions of the same tool, I think this is a feature that needs to be supported, meaning that both versions need to be brought in to the main workflow. I am not aware of whether Conda or any other package manager supports installing different versions of the same package at the same time. |
I'm still struggling to understand why all tools need to be on the same level - it seems like a huge amount of complexity to add and I can't see any advantages. If we just copy them in with a subworkflow then we don't have to do any management at all and the entire situation remains very simple..
I don't see how this would be the case though, as developers will never edit the imported code. So sure the final workflow could in theory end up with a complex file tree, but it doesn't really matter as the developer will never need to look in to those files. If they want to edit anything in the subworkflow, they edit it in that repository where the file tree is simple and consistent with all other pipelines.. |
@ewels you've got a point. There is one question where I'm not clear on what others think. Given the following example,
If it's a yes, then it's something like an Uber JAR in the Java world. When
If it's a no, only |
I think the answer is yes, going way back up to this comment at the top of this thread, where we established that all workflows should include hard copies of their imported modules in the same repository. |
For the workflow repo, yes, it should include its own code and all of its dependent modules. I think we are on the same page about that. However, what goes into the workflow package and gets uploaded to the registry server (like Anaconda) is a separate question. It could all be included, or it could be just the workflow code only. For the latter, when installing the workflow, it first pulls down the workflow code and then fetches its dependencies from individual module packages. Both should be possible, I'm just not sure which is the more sensible choice for us. |
I'm slightly confused about the practicality of all of this 🤔 Given the sub-workflow scenario, we could potentially be using different versions of the same tool in a given workflow? How would you even begin to conventionally write that up in a paper? Yes, the pipeline in its entirety is reproducible, and I understand that version control is important and shouldn't be compromised, but surely there is a way in which we can instruct individual modules to use particular versions of software? I also understand that module files may become redundant with individual tool updates, but this seems to be a more practical aspect to put under version control. I'm not suggesting I have the answers, but maybe we need to be thinking about this differently? Is it plausible to have full version control between the main script > sub-workflow > individual module file whilst maintaining a level of abstraction as to which software container is going to be used? Or maybe I've misunderstood and need to :homer_disappear: |
I have the feeling that this is going a bit in circles and maybe you guys have to figure that out at the Hackathon while having a few beers ;) Some more thoughts:
Finally, answering @junjun-zhang's question:
I believe it would be possible by tweaking the recipe file, but it's not considered good practice - or at least, I've never seen this before. After all, the whole point of conda is to resolve all dependencies such that there are no conflicts. |
@grst your points are well taken! I like the idea of building quick PoCs and thinking big. Not that we have to do the big things at the beginning, but it's definitely beneficial to plan for them.
How about this: #2 (comment)? To be honest, I like it a lot - it's super simple and gets the job done! I was afraid it would seem such a hack to others, but as soon as I saw it's also being used by Go, big relief. |
I've just started working on a simple proof of concept PR for a simplistic copy-from-github method. I'll link it here when there is something to show 👍
I think this is probably the only way to manage this problem and we came up with the exact same idea earlier today. Basically in the linting check that there are no duplicate tools with different versions in a pipeline. If there are, the linting fails and the author has to change imports / upstream pipelines until this is resolved. Except it was pointed out that there may be some edge cases where different versions are required, so it would probably be a warning instead of a hard failure. For versioning I think we can use the metadata yaml files that come with the import, plus some kind of extension of the current version calls that we currently have..? Thinking big is good, but only if it doesn't come at the expense of adding lots of complexity that may slow down growth - it's a balance! 😄 I'm a little encouraged at the idea that we hope to wrap whatever solution we go for in |
WIP |
To add something from my side: I just tested both of the implemented approaches:
It seems that both quickly allow the goal to be achieved - that is, modules end up in the right directory structure. I like
On the other hand, for now conda has the appealing feature that the channel where the modules are hosted is customizable, making it useful for a more general use case. That said, I may still be missing other important points. But at this moment it seems to me that the choice between the two falls into the category of personal preference. |
Pretty much in line with what @piotr-faba-ardigen just wrote: I tested both locally here (okay, in a VM to be fair), and it does seem to work in both cases. The second point that Piotr just argued is also strikingly important to me, I think: e.g. we cannot share all the modules we have, but we would like to be able to have multiple "channels" of modules. If we can add some flexibility to nf-core modules that allows this, that would be super cool - that way users can rely on the hopefully big repository of nf-core/modules in the future, BUT also use and rely on their own modules too. We could even add some linting in the future to check exactly where the modules came from, but as long as we follow a hierarchical model similar to bioconda, conda-forge, anaconda etc., it should be fine to adopt the concept for modules here too. I'd also like to keep things as simple as possible, although benefitting from experiences at bioconda is a good idea :-) |
Ok, after a little further discussion on the nf-core Slack in various channels and again just now with @grst, I think we should wrap this up. Let's go for the basic home-made
I've moved my initial proof-of-concept code onto a
Thanks all for an excellent discussion! |
Lots of people use nf-core pipelines offline. We want to make the process of using modules from a different repository as simple as possible.
One solution would be to use `git submodule` to add `nf-core/modules` as a git submodule to every pipeline. By default, doing `git clone` will not pull the submodules. Doing `git clone --recursive` or `git submodule update --init --recursive` will pull the module repository. Loading logic could then be:
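A minimal sketch of what that loading logic might look like (an assumption of the intent rather than implemented code; the submodule path and the messages are invented):

```nextflow
// main.nf - check whether the nf-core/modules submodule has been pulled.
def modules_dir = file("${projectDir}/modules/nf-core")

if (modules_dir.exists()) {
    log.info "Using bundled nf-core/modules from ${modules_dir}"
} else {
    log.warn "Submodule not found - run 'git submodule update --init --recursive', " +
             "or the pipeline will fall back to pulling module files remotely"
}
```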
Then by default most people running online will pull the online files dynamically. But pulling a pipeline to use offline is super easy and does not require any changes to files or config.
Currently `nf-core download` manually pulls institutional config files and edits `nextflow.config` so that the pipeline loads these files. This could also be done with submodules as above, without any need to edit any files.

Limitations would be that we have to manage the git hash of the modules repository in two places - the git submodule file and the `nextflow.config` file. We can lint to check that these two are the same. Also, this forces pipelines to use a single hash for all modules in the pipeline. I think this is probably ok for reasons of maintaining sanity though.

Thoughts?