Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Evolution of Nextflow configuration file #2723

Open
pditommaso opened this issue Mar 14, 2022 · 28 comments · May be fixed by #4744
Open

Evolution of Nextflow configuration file #2723

pditommaso opened this issue Mar 14, 2022 · 28 comments · May be fixed by #4744

Comments

@pditommaso
Copy link
Member

pditommaso commented Mar 14, 2022

Rationale

The configuration file is a key component of the Nextflow framework since it allows workflow developers to decouple the pipeline logic from the execution parameters and infrastructure deployment settings.

The current configuration mechanism is extremely flexible and powerful. It uses a customised GroovySlurper class with extended to handle multiple profiles and config file inclusions.

A Nextflow configuration file is ultimately a compiled Java class on steroids.

However, this has serious drawbacks:

  1. The evaluation of config file, requires Java class compilation and interpretation, which can be time-consuming compared to text file parsing.
  2. The evaluation of the config requires ultimately the execution of JVM code, which could be explored by malicious users and pose a non-trivial security burden to be handled in a managed environment.
  3. The current configuration allows the use of arbitrary imperative programming statements within a config file, which is a bad practice, making the config logic poorly readable.
  4. The parser implementation is becoming including hard to be maintained, and patches easily introduce unexpected side-effects, resulting in increased stability issues.

Goals

  • Explore a new configuration mechanism that can be used in alternative to the current nextflow config
  • Have a well-defined structure and grammar
  • Human-readable
  • Allow grouping correlated setting
  • Allow the definition of named groups of settings (i.e. profiles)
  • Allow inheritance
  • Allow config files inclusion

Non-goals

  • Compatible with current nextlow.config syntax
  • Allowing custom functions

Action

This issue should explore possible alternative implementation and define a migration strategy.

Currently the best alternatives are:

  • Hashicorp HCL: this is essentially the configuration language used by Terraform and other Hashicorp tools. It's quite popular, however there isn't a complete parsed for JVM world, other than this project.
  • Ligthbed HCON: this is essentially, very curated configuration library. The syntax has a lot of similarities with the current nextflow.config syntax.
@jorgeaguileraseqera
Copy link
Contributor

I think the problem comes from the current implementation because we are overwriting the current ConfigSlurper trying to do more things at the same time instead of first parsing the configuration and after applying them

For example, we implement the profile feature trying to merge the different configurations on the fly but maybe is to "easy" as use the environment feature implemented in ConfigSlurper (https://wjw465150.github.io/blog/Groovy/my_data/%E5%AE%98%E6%96%B9%E4%BE%8B%E5%AD%90/Groovy%20-%20ConfigSlurper.htm) as many profiles the user wants and after merge all of them

About malicious code we can execute the evaluation of the configuration into a secured context with minimal imports and no access to the rest of the application

@pditommaso
Copy link
Member Author

Likely, another side of the problem is that to parse the file you need to compile the corresponding groovy code. Powerful but in some context cannot be done.

@ewels
Copy link
Member

ewels commented Apr 27, 2022

What do you think about wrapping some of the functionality from the @nf-core JSON Schema into the new config file format? I'm mostly thinking variable type, but enum, pattern and required would also be nice. It would be nice to move as much of the custom nf-core validation code into core-Nextflow as possible 😊 (we've previously been aiming at a plugin, but this would be even better)

Some of the longer-form things like description and help text may be better suited to staying in the JSON Schema.. Keeping that file also allows us to customise and play with future extensions easily, so I'm not suggesting we kill it completely (though I'm not completely against that either).

@ewels
Copy link
Member

ewels commented Apr 27, 2022

Also, an early heads-up: we make heavy use of includeConfig within @nf-core configs to pull remote config files (eg. shared nf-core/configs) and it works brilliantly. So if it's possible to keep this functionality of including arbitrary config URLs then that would be ideal 👍🏻

@pditommaso
Copy link
Member Author

Static typing and JSON schema start to smell overkilling. IMO it should be the good parts of the current system i.e. variables interpolation, profiles, includeConfig without the bad parts ie closures, custom functions, conditional statements, etc

@ewels
Copy link
Member

ewels commented Apr 27, 2022

Yeah I thought you might say that 😅 But if it's optional then it needn't be much? Hypothetical example based on the current syntax:

params {
    foo = 'bar', required: true
    baz = 'qux', type: 'int'
}

For many people this would be sufficient to completely remove nextflow_schema.json and all dependencies on the nf-core toolchain, whilst giving much more power for catching user error.

@mbwhelan
Copy link

The current model has similarities to Spring boot configuration properties. Hierarchical config, profiles, variable interpretation are must haves. https://docs.spring.io/spring-boot/docs/current/reference/html/features.html#features.external-config

That model is great for Java apps. The validation logic though is separate and could be complex to recreate in a pipeline.

We want to enforce our pipelines to use nf-cores models, but it has to be customized for our internal needs. If some of those validation concepts or other items could be brought in natively or via plugins that would be awesome.

@jorgeaguileraseqera
Copy link
Contributor

jorgeaguileraseqera commented May 10, 2022

Another interesting format can be toml https://toml.io/en/

I'm playing with a java parser (https://github.com/mwanji/toml4j) and it's very interesting because it can map configuration files against Java classes instead simple maps

an (invented) configuration could be:

includeConfig = [  'b.toml', 'http://localhost/remote.toml' ]
[aws]
        accessKey = '123'
[docker]
    enabled = false
    envWhitelist = ['a', 'b', "c"]
    memory = 210
            
[profiles]
    [profiles.aws]
        [profiles.aws.process]
            executor = 'sge'
            queue = 'long'
            memory = '10GB'
        [profiles.cloud.process]
            executor = 'sge'
            queue = 'long'
            memory = '1GB'
            data = [ ["delta", "phi"], [3.14] ]

@ewels
Copy link
Member

ewels commented May 10, 2022

Maybe I missed this in the thread already, but is there a reason that we can't use YAML? I feel like that's the format that most people are used to..

Equivalent YAML of the TOML above could be:

!include b.yml
!include "http://localhost/remote.toml"
aws:
    accessKey: 123
docker:
    enabled: false
    envWhitelist: ['a', 'b', "c"]
    memory: 210
            
profiles:
    aws:
        process:
            executor: sge
            queue: long
            memory: 10GB
    cloud:
        process:
            executor: sge
            queue: long
            memory: 1GB
            data:
                - ["delta", "phi"]
                - [3.14]

The !include bit is a bit non-standard, but similar things are used by a lot of other projects (eg. pyyaml-include).

YAML has the advantage that many Nextflow users are already using it, with the -params-file option. The only difference here would be to move it up a level, so that the YAML doesn't just go to params but also across other config scopes.

@jorgeaguileraseqera
Copy link
Contributor

From my side, I'm only investigating other formats that can be interesting to consider

of course, YAML is probably the most famous format (but also has a lot of problems) and we have libraries that allow us to parse yml files

@pditommaso
Copy link
Member Author

Biggest problem yaml does not have variables interpolation

@MrMarkW
Copy link

MrMarkW commented May 10, 2022

HIML - hierarchical yaml config with variable interpolation: https://github.com/adobe/himl#interpolation
Unfortunately, it is in python and I couldn't find a java flavor.

@ewels
Copy link
Member

ewels commented May 10, 2022

Yeah, I was thinking YAML anchors and aliases but that's full values and not partial strings.

I was just looking at a few nextflow.config files from nf-core pipelines and thinking it wasn't so bad, but the really scary ones are in conf/modules.config. For example, in nf-core/rnaseq conf/modules.config we have stuff like this:

def rseqc_modules = params.rseqc_modules ? params.rseqc_modules.split(',').collect{ it.trim().toLowerCase() } : []

process {
    publishDir = [
        path: { "${params.outdir}/${task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()}" },
        mode: params.publish_dir_mode,
        saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
    ]

   // ...

    withName: 'SALMON_INDEX' {
        ext.args   = params.gencode ? '--gencode' : ''
        publishDir = [
            path: { "${params.outdir}/genome/index" },
            mode: params.publish_dir_mode,
            saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
            enabled: params.save_reference
        ]
    }
}

if (!(params.skip_fastqc || params.skip_qc)) {
    process {
        withName: '.*:FASTQC_UMITOOLS_TRIMGALORE:FASTQC' {
            ext.args   = '--quiet'
        }
    }
}

I can't imagine all of this surviving a new format, but the less that we have to rewrite the better..

@drpatelh
Copy link
Contributor

Yep, some of that syntax can be improved due to recent additions to NF (e.g. #2700) but it is really a side-effect of truly modularising within and between pipelines.

I think it goes without saying that it would be nice to test any prototype config format with an nf-core pipeline because I'm sure we will be able to unravel some edge cases!

@cjw85
Copy link
Contributor

cjw85 commented Jul 12, 2022

yaml, gets hairy when when it comes to anything non-trivial -- check out issues in gitlab CI support. toml is a nicer version of the same sort of thing.

@stale

This comment was marked as outdated.

@emyr666
Copy link

emyr666 commented Nov 23, 2023

Am interested in this line you have in the original post...

  • Have a well-defined structure and grammar

By this do you mean to have a formal grammar for the NextFlow DSL e.g. using BNF notation ?
This would be really useful.

@bentsherman
Copy link
Member

Some of the problems described here (like with the nf-core configs) can be addressed in other ways like module config (#4510) and resourceLimits directive (#2911). So I think in this issue we can focus on parsing and variable evaluation.

I have developed a custom parser for Nextflow config files which defines a formal grammar that is much more strict:

  • basic assignments
  • config blocks
  • config includes
  • values can be any Groovy expression including closures
  • no function definitions, if statements, arbitrary Groovy code, etc

It would allow us to have much better control over the parsing and evaluation compared to the current system. Could also be used to develop IDE tooling for config files like linting and code completion.

The key question around whether to move to a new format or improve the current one is how to handle variable interpolation:

  • declarative formats like YAML can be easily parsed by external tools. however, they will need to wrap closures and other dynamic expressions in a string, which is a poor developer experience
  • Nextflow config supports Groovy syntax and the developer experience can be improved greatly with a custom parser. however, it cannot be parsed easily by external tools

Would it be unreasonable to support both Nextflow config and YAML/TOML? Then we could say "if you need to use dynamic expressions, use Nextflow config, otherwise YAML is fine". We could still support config inclusion and dynamic strings in YAML, but we wouldn't have to come up with some kludgy solution to support Groovy closures in YAML. And with the custom parser, we could bring the developer experience for Nextflow configs up to par with YAML.

@cjw85
Copy link
Contributor

cjw85 commented Jan 16, 2024

You're dangerously close to making GroovyMake with Groovy code embedded in YAML 🤣 .

Your choice is a conundrum, on the one hand it represents a loss of functionality to ditch the current system in favour of something like TOML that could be read by external systems. (An aside, if you allow groovy expressions as strings in those the external tools would need to know how to interpret groovy). On the other, its very powerful for external systems to be able to read the configuration files and be able to perform tasks such as parameter validation.

The central issue here is that the current config files are overloaded in their duty. The can contain both data and code; what's needed I feel is a better separation of those. The nf-core idea of parameter schemas was a step in this direction, however they are not tightly integrated into Nextflow and don't replace the native config files; which leads to all manner of tedious issues.

So perhaps you want a system whereby best practice is to reserve the current nexflow config files for dynamic configuration, and data resides in a document freely parseable by any external tool. You could likely get a long way to this goal by writing a joint parser for nextflow config files and nf-core schemas that populates the former with data from the later.

@bentsherman
Copy link
Member

I don't think the parameter schema should be part of the config file. It should be in the top-level workflow definition as inputs, or in a separate file like the nf-core parameter schema. Config files should be able to reference params and even set default values, but they should not be the source of truth for params.

So now I'm wondering, if you take away that example, what other use case is there for loading a Nextflow config file in an external tool? If there aren't any others, then maybe external tooling isn't important and we can safely stick to the current config format.

@ewels
Copy link
Member

ewels commented Jan 17, 2024

The problem with keeping the parameter schema separate from the config pipeline defaults is that the two need to be kept in sync. It's irritating to need to add the same thing in two places as a pipeline developer.

There are two things that need to be loaded by external tools - the schema (to build launch tools / render documentation) and the config (to pre-populate forms and validate inputs before running Nextflow).

@cjw85
Copy link
Contributor

cjw85 commented Jan 17, 2024

I can't tell who's violently agreeing with who.

I think I'm violently agreeing with @ewels: use something generic like the nf-core schema for loading and understanding static data that may ultimately flow from the user. Use something like the currently nextflow config to provide flexibility to developers and power users for augmenting the runtime environment. This would mean narrowing the scope of what the current nextflow config can do (e.g. it won't contain a params { } stanza`).

@bentsherman
Copy link
Member

@ewels when you say that external tools need to load the schema and the config, by "config" do you mean just the params block?

I'm thinking that config files should still be allowed to set param defaults for backwards compatibility (and because it is useful to do in profiles), but users should be encouraged to specify default values in the schema file so that there is one source of truth as you say. As Chris said, users should avoid having a params block in their config file.

I think we are all in violent agreement

@ewels
Copy link
Member

ewels commented Jan 17, 2024

Yes, I think so 👍🏻 We already encourage people to use -params-yaml, so I guess the missing piece would be to get Nextflow to parse and use nextflow_schema.json? (or equivalent). Then the params could be removed from nextflow.config and the pipeline developer would only need to care about the schema file.

@cjw85
Copy link
Contributor

cjw85 commented Jan 17, 2024

...and remove override nextflow's dodgy type casting in favour of whats declared in the schema

@bentsherman
Copy link
Member

This is getting away from the config file, but what do you think about putting the params schema in the pipeline code? The top-level anonymous workflow could have an "input" section that defines inputs with type, default value, etc. This way, the language server could ensure that the params are used correctly in the pipeline code. Nextflow could export this definition to YAML/JSON for use with external tools

@cjw85
Copy link
Contributor

cjw85 commented Jan 17, 2024

This is naturally not a single source of truth: its just a way of generating a secondary file which could become out of sync with the primary document, just like the current situation.

@bentsherman
Copy link
Member

Let's continue the discussion of the params schema here: #4669

In any case, I think we agree that the params definition should be separate from the config file. That tells me that it would be fine to keep the current config format and implement a custom parser to enforce a simpler syntax without the Groovy quirks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants