-
Notifications
You must be signed in to change notification settings - Fork 648
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Evolution of Nextflow configuration file #2723
Comments
I think the problem comes from the current implementation because we are overwriting the current ConfigSlurper trying to do more things at the same time instead of first parsing the configuration and after applying them For example, we implement the About malicious code we can execute the evaluation of the configuration into a secured context with minimal imports and no access to the rest of the application |
Likely, another side of the problem is that to parse the file you need to compile the corresponding groovy code. Powerful but in some context cannot be done. |
What do you think about wrapping some of the functionality from the @nf-core JSON Schema into the new config file format? I'm mostly thinking variable type, but Some of the longer-form things like description and help text may be better suited to staying in the JSON Schema.. Keeping that file also allows us to customise and play with future extensions easily, so I'm not suggesting we kill it completely (though I'm not completely against that either). |
Also, an early heads-up: we make heavy use of |
Static typing and JSON schema start to smell overkilling. IMO it should be the good parts of the current system i.e. variables interpolation, |
Yeah I thought you might say that 😅 But if it's optional then it needn't be much? Hypothetical example based on the current syntax: params {
foo = 'bar', required: true
baz = 'qux', type: 'int'
} For many people this would be sufficient to completely remove |
The current model has similarities to Spring boot configuration properties. Hierarchical config, profiles, variable interpretation are must haves. https://docs.spring.io/spring-boot/docs/current/reference/html/features.html#features.external-config That model is great for Java apps. The validation logic though is separate and could be complex to recreate in a pipeline. We want to enforce our pipelines to use nf-cores models, but it has to be customized for our internal needs. If some of those validation concepts or other items could be brought in natively or via plugins that would be awesome. |
Another interesting format can be I'm playing with a java parser (https://github.com/mwanji/toml4j) and it's very interesting because it can map configuration files against Java classes instead simple maps an (invented) configuration could be: includeConfig = [ 'b.toml', 'http://localhost/remote.toml' ]
[aws]
accessKey = '123'
[docker]
enabled = false
envWhitelist = ['a', 'b', "c"]
memory = 210
[profiles]
[profiles.aws]
[profiles.aws.process]
executor = 'sge'
queue = 'long'
memory = '10GB'
[profiles.cloud.process]
executor = 'sge'
queue = 'long'
memory = '1GB'
data = [ ["delta", "phi"], [3.14] ] |
Maybe I missed this in the thread already, but is there a reason that we can't use YAML? I feel like that's the format that most people are used to.. Equivalent YAML of the TOML above could be: !include b.yml
!include "http://localhost/remote.toml"
aws:
accessKey: 123
docker:
enabled: false
envWhitelist: ['a', 'b', "c"]
memory: 210
profiles:
aws:
process:
executor: sge
queue: long
memory: 10GB
cloud:
process:
executor: sge
queue: long
memory: 1GB
data:
- ["delta", "phi"]
- [3.14] The YAML has the advantage that many Nextflow users are already using it, with the |
From my side, I'm only investigating other formats that can be interesting to consider of course, YAML is probably the most famous format (but also has a lot of problems) and we have libraries that allow us to parse yml files |
Biggest problem yaml does not have variables interpolation |
HIML - hierarchical yaml config with variable interpolation: https://github.com/adobe/himl#interpolation |
Yeah, I was thinking YAML anchors and aliases but that's full values and not partial strings. I was just looking at a few def rseqc_modules = params.rseqc_modules ? params.rseqc_modules.split(',').collect{ it.trim().toLowerCase() } : []
process {
publishDir = [
path: { "${params.outdir}/${task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()}" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
// ...
withName: 'SALMON_INDEX' {
ext.args = params.gencode ? '--gencode' : ''
publishDir = [
path: { "${params.outdir}/genome/index" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
enabled: params.save_reference
]
}
}
if (!(params.skip_fastqc || params.skip_qc)) {
process {
withName: '.*:FASTQC_UMITOOLS_TRIMGALORE:FASTQC' {
ext.args = '--quiet'
}
}
} I can't imagine all of this surviving a new format, but the less that we have to rewrite the better.. |
Yep, some of that syntax can be improved due to recent additions to NF (e.g. #2700) but it is really a side-effect of truly modularising within and between pipelines. I think it goes without saying that it would be nice to test any prototype config format with an nf-core pipeline because I'm sure we will be able to unravel some edge cases! |
yaml, gets hairy when when it comes to anything non-trivial -- check out issues in gitlab CI support. toml is a nicer version of the same sort of thing. |
This comment was marked as outdated.
This comment was marked as outdated.
Am interested in this line you have in the original post...
By this do you mean to have a formal grammar for the NextFlow DSL e.g. using BNF notation ? |
Some of the problems described here (like with the nf-core configs) can be addressed in other ways like module config (#4510) and resourceLimits directive (#2911). So I think in this issue we can focus on parsing and variable evaluation. I have developed a custom parser for Nextflow config files which defines a formal grammar that is much more strict:
It would allow us to have much better control over the parsing and evaluation compared to the current system. Could also be used to develop IDE tooling for config files like linting and code completion. The key question around whether to move to a new format or improve the current one is how to handle variable interpolation:
Would it be unreasonable to support both Nextflow config and YAML/TOML? Then we could say "if you need to use dynamic expressions, use Nextflow config, otherwise YAML is fine". We could still support config inclusion and dynamic strings in YAML, but we wouldn't have to come up with some kludgy solution to support Groovy closures in YAML. And with the custom parser, we could bring the developer experience for Nextflow configs up to par with YAML. |
You're dangerously close to making GroovyMake with Groovy code embedded in YAML 🤣 . Your choice is a conundrum, on the one hand it represents a loss of functionality to ditch the current system in favour of something like TOML that could be read by external systems. (An aside, if you allow groovy expressions as strings in those the external tools would need to know how to interpret groovy). On the other, its very powerful for external systems to be able to read the configuration files and be able to perform tasks such as parameter validation. The central issue here is that the current config files are overloaded in their duty. The can contain both data and code; what's needed I feel is a better separation of those. The nf-core idea of parameter schemas was a step in this direction, however they are not tightly integrated into Nextflow and don't replace the native config files; which leads to all manner of tedious issues. So perhaps you want a system whereby best practice is to reserve the current nexflow config files for dynamic configuration, and data resides in a document freely parseable by any external tool. You could likely get a long way to this goal by writing a joint parser for nextflow config files and nf-core schemas that populates the former with data from the later. |
I don't think the parameter schema should be part of the config file. It should be in the top-level workflow definition as inputs, or in a separate file like the nf-core parameter schema. Config files should be able to reference params and even set default values, but they should not be the source of truth for params. So now I'm wondering, if you take away that example, what other use case is there for loading a Nextflow config file in an external tool? If there aren't any others, then maybe external tooling isn't important and we can safely stick to the current config format. |
The problem with keeping the parameter schema separate from the config pipeline defaults is that the two need to be kept in sync. It's irritating to need to add the same thing in two places as a pipeline developer. There are two things that need to be loaded by external tools - the schema (to build launch tools / render documentation) and the config (to pre-populate forms and validate inputs before running Nextflow). |
I can't tell who's violently agreeing with who. I think I'm violently agreeing with @ewels: use something generic like the nf-core schema for loading and understanding static data that may ultimately flow from the user. Use something like the currently nextflow config to provide flexibility to developers and power users for augmenting the runtime environment. This would mean narrowing the scope of what the current nextflow config can do (e.g. it won't contain a |
@ewels when you say that external tools need to load the schema and the config, by "config" do you mean just the params block? I'm thinking that config files should still be allowed to set param defaults for backwards compatibility (and because it is useful to do in profiles), but users should be encouraged to specify default values in the schema file so that there is one source of truth as you say. As Chris said, users should avoid having a I think we are all in violent agreement |
Yes, I think so 👍🏻 We already encourage people to use |
...and |
This is getting away from the config file, but what do you think about putting the params schema in the pipeline code? The top-level anonymous workflow could have an "input" section that defines inputs with type, default value, etc. This way, the language server could ensure that the params are used correctly in the pipeline code. Nextflow could export this definition to YAML/JSON for use with external tools |
This is naturally not a single source of truth: its just a way of generating a secondary file which could become out of sync with the primary document, just like the current situation. |
Let's continue the discussion of the params schema here: #4669 In any case, I think we agree that the params definition should be separate from the config file. That tells me that it would be fine to keep the current config format and implement a custom parser to enforce a simpler syntax without the Groovy quirks. |
Rationale
The configuration file is a key component of the Nextflow framework since it allows workflow developers to decouple the pipeline logic from the execution parameters and infrastructure deployment settings.
The current configuration mechanism is extremely flexible and powerful. It uses a customised GroovySlurper class with extended to handle multiple profiles and config file inclusions.
A Nextflow configuration file is ultimately a compiled Java class on steroids.
However, this has serious drawbacks:
Goals
Non-goals
Action
This issue should explore possible alternative implementation and define a migration strategy.
Currently the best alternatives are:
The text was updated successfully, but these errors were encountered: