
Enable tool selection for seqinspector #23

Closed
wants to merge 9 commits into from

Conversation

MatthiasZepper
Member

@MatthiasZepper MatthiasZepper commented Sep 27, 2024

This PR proposes a tool selection mechanism for Seqinspector. Several nf-core pipelines implement parameter-based tool selection with varying degrees of complexity.

Sarek, for example, uses Groovy code directly in the config and applies .split() and .contains() repeatedly to check whether a particular tool should run:

ext.when = (params.save_mapped && !params.save_output_as_bam) ||
    (
        (params.skip_tools && params.skip_tools.split(',').contains('markduplicates')) &&
        !(params.tools && params.tools.split(',').contains('sentieon_dedup'))
    )

Unfortunately, this approach is not future-proof, because putting Groovy code in configuration files is deprecated. In addition, the ext.when directive is being phased out.

Moreover, this notation only allows selecting or skipping a single tool at a time. For a pipeline like Seqinspector, support for profiles seemed desirable. With profiles, several related tools can be quickly turned on or off, e.g. depending on the sequencing platform or assay type.

Using Sets would have been an option, as they can be intersected or combined, but ultimately a central reference with binary values seemed more straightforward:

def tools = ['a', 'b', 'c'] as Set
assert 'a' in tools
assert tools + ['c', 'd'] == ['a', 'b', 'c', 'd'] as Set
assert tools.intersect(['c', 'd']) == ['c'] as Set

Therefore, I have copied the approach of (and some utility functions from) Oncoanalyser to select the tools that should run from parameters. On the technical level, however, I have opted for a slightly different data structure.

While Oncoanalyser has a Processes class that returns the result of a Set comparison as run stages, I have implemented a binary map that can be queried with the tool name and returns a boolean value:

// Oncoanalyser
if (run_config.stages.isofox) { /* Run the Isofox module */ }

// Proposal for Seqinspector
if (seqinspector_tools['FASTQC']) { /* Run the FastQC module */ }

To generate the final seqinspector_tools map, profiles can be applied and combined according to a flexible parameter. For example, (hypothetical) run specifications could look like default OR contamination_screening to apply the default setting plus the tools from contamination_screening, or default AND disable_read_quality to apply a negative profile, as sketched below.
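To illustrate, here is a minimal, self-contained sketch of such a left-to-right evaluation; the profile names, map contents, and the restriction to the OR operation are made up for illustration, not the PR's actual implementation:

def profiles = [
    'default'                : [FASTQC: true, MULTIQC: true],
    'contamination_screening': [KRAKEN2: true],
]

// union of two binary maps: a tool is enabled if either profile enables it
def or = { Map a, Map b ->
    (a.keySet() + b.keySet()).collectEntries { k -> [k, (a[k] ?: false) || (b[k] ?: false)] }
}

// evaluate a specification like 'default OR contamination_screening' left to right
def tokens = 'default OR contamination_screening'.tokenize(' ')
def seqinspector_tools = profiles[tokens[0]]
for (int i = 1; i < tokens.size(); i += 2) {
    assert tokens[i] == 'OR' // only OR is implemented in this sketch
    seqinspector_tools = or(seqinspector_tools, profiles[tokens[i + 1]])
}
assert seqinspector_tools == [FASTQC: true, MULTIQC: true, KRAKEN2: true]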

For now, all profiles that can be used with Seqinspector are hard-coded constants, but we could introduce a parameter to load e.g. a CSV file as profile relatively easily.

This is a draft PR to gather initial feedback on whether you like the overall direction.

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/seqinspector branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nf-test test main.nf.test -profile test,docker).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@bentsherman

Hi Matthias, Phil linked me here because I'm interested to see how classes are being used in the lib directory.

I am struggling to see why all of this code is necessary. To begin with, I'm not sure why a single param with a list of tools is not sufficient, maybe you can fill me in there.

But even if I go with the three params you added, couldn't you boil all of this down to a single helper function that does a few set operations?

  1. define each profile as a set of tools
  2. take the union of the profiles specified by the user (why are you doing AND/OR/XOR here? that's another thing I'm unclear about)
  3. add the tools in the include list
  4. remove the tools in the exclude list

That's probably a 10-line function that you could put next to the entry workflow, no need for classes or the lib directory

@MatthiasZepper
Member Author

MatthiasZepper commented Sep 29, 2024

Hi Matthias, Phil linked me here because I'm interested to see how classes are being used in the lib directory.

Thanks, your feedback is much appreciated. Having no background in Java or Groovy and having only tweaked existing pipelines so far, I find it a daunting task to conceptualize such a central piece of code in a way that will hopefully remain future-proof when the pipeline is greatly expanded later.

I am struggling to see why all of this code is necessary. To begin with, I'm not sure why a single param with a list of tools is not sufficient, maybe you can fill me in there.

My main consideration here was that Seqinspector is not so much a classical pipeline (in the sense of running a well-defined sequence of processing steps) as it is a tray of tools to select from. Most tools will run in parallel and independently of each other.

However, not all tools that we will offer are suitable for every supported sequencing platform, or they need platform-specific configuration. QC for Nanopore data is entirely different from QC for Illumina, etc. The same goes for assay types: you will not want to use the same QC criteria for RNA-seq and Hi-C, for example.

As a pipeline developer, one could of course put all the responsibility on the user and say that they need to have the expertise to know which tools they have to run and how to configure them for their assay with a custom config.

But from a convenience perspective, it seemed much nicer to be able to say: I have sequencing data from an Illumina NovaSeq 6000, 2x 150 bp run, it is human WGS, and please run not a minimal but an extensive QC on it. That beats maintaining my own params file, feeding a string of 20 tools into the pipeline (which may differ depending on the version), and providing the correct custom config via -c on top.

For that reason, I wanted the pipeline to support profiles as an entity that can hold pre-selections made by us developers.

define each profile as a set of tools
take the union of the profiles specified by the user (why are you doing AND/OR/XOR here? that's another thing I'm unclear about)

Since the number of possible combinations of platform x assay x qc_profile could easily reach into the hundreds, these profiles must support some combinatorial logic, so we do not have to hard-code all possible combinations. The union (OR) is indeed probably the most common operation that will be needed.

But to selectively turn tools off, you will also need AND (and the inclusive AND).

The XOR was an operation that I implemented mostly for technical reasons, after noticing that non-common keys were eliminated from the maps by my initial AND/OR implementations. To prevent dropping entries from the maps, I initially felt that doing an XOR first, then an AND/OR of two profiles, and concatenating the results should become a common pattern in the pipeline code. However, I later modified my AND/OR to retain non-common entries directly, since I did not see a use case for eliminating keys. (Whether we should expose any of this functionality to the end user, as I did in this draft, is something we need to discuss in our next developer meeting.)
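To make that pitfall concrete, a tiny sketch with made-up tool names, assuming plain binary maps:

def a = [FASTQC: true, KRAKEN2: true]
def b = [FASTQC: true, NANOPLOT: false]

// combining only the common keys silently drops all other entries
def naiveAnd = a.keySet().intersect(b.keySet()).collectEntries { k -> [k, a[k] && b[k]] }
assert naiveAnd == [FASTQC: true] // KRAKEN2 and NANOPLOT have vanished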

With regard to the implementation, my preference for a key-value map with binary values over Sets is weak. The hope was that this would nudge developers to restrict the evaluation of profiles to a single part of the pipeline and only access the results of the profile combinations downstream, instead of littering Set operations throughout the pipeline code. I hope this prevents some bugs and makes testing easier.

add the tools in the include list
remove the tools in the exclude list

Which is exactly what I implemented, no?

        // Convert the specified custom tools into a profile
        def custom_tools = Utilities.buildToolTracker(tool_include_list, tool_exclude_list)

        // Apply the custom tools to the existing tool profile: orOperation to activate the include_tools
        seqinspector_tools = seqinspector_tools.orOperation(custom_tools)

        // Apply the custom tools to the existing tool profile: iAndOperation to deactivate the exclude_tools
        seqinspector_tools = seqinspector_tools.iAndOperation(custom_tools)

While I still think that most users will want to stick to using profiles, I already foresaw that there would be the need to support a manual selection of tools nonetheless.

But even if I go with the three params you added, couldn't you boil all of this down to a single helper function that does a few set operations?

Maybe with some advanced Nextflow/Groovy wizardry, but nothing that I am capable of.

That's probably a 10-line function that you could put next to the entry workflow, no need for classes or the lib directory

I do not really see where your aversion to custom classes comes from. In Python, they are a good way to keep code clean and well-structured and to avoid code duplication, so I assumed the same would apply to Java/Groovy and thus Nextflow.

To see the reason for and potential of introducing the custom classes, one should probably look at Oncoanalyser instead, since it is already a mature pipeline. I have looked extensively at other nf-core pipelines to see how they approached the tool selection problem, and was in awe of the elegance of the Oncoanalyser approach because of its custom classes.

From Oncoanalyser, I drew a lot of inspiration and actually copied a fair amount of code verbatim. A huge perk is the extensive validation that it performs beyond what is possible with the nf-schema / nf-validation plugins. What I have already copied is the strategy of using enums to define valid tools or operations. It allows centralizing the available tools in one enum as a constant and reusing it throughout the whole pipeline. This makes for a really nice developer experience: add a new module, update one single enum declaration, and all functionality now supports it.
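As a sketch of that pattern (the tool names here are placeholders, not the pipeline's actual enum):

// single source of truth: adding a module only requires extending this one declaration
enum Tool {
    FASTQC, MULTIQC, KRAKEN2
}

// everything else derives from the enum and updates automatically
def allTools = Tool.values()*.name()
assert allTools == ['FASTQC', 'MULTIQC', 'KRAKEN2']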

Contrast that with the number of bugs in RNAseq that arose from incomplete updates, because somebody simply forgot to update something in one part of the pipeline, and you know why I would like to keep it tidy from the start.

But upon reading the Oncoanalyser code, you will notice that it also checks, e.g., that the matching reference genome is provided for the tool selection. That is functionality we will need for Seqinspector too, and which we can flexibly add to "our" custom classes later.

Lastly, there is one more thing that I see as a necessity for Seqinspector and that has no precedent in Oncoanalyser: configuring ext.args and cutoffs according to profiles. For example, a quality cutoff for Illumina will be different from one for the Element Aviti sequencing platform. A QC tool will need different ext.args depending on the assay, etc. Either one extends the ToolTracker class to hold more than the tool_selection binary map, or one introduces another class that allows managing them cleanly.

In summary: I think custom classes are a nice way to keep the main pipeline code concise and tidy, and they will be ideal for the functionality we currently need or will soon have to implement. RNAseq, for example, doesn't use any, but it has several (formerly local and now central) modules for input validation and reference preparation that collectively also comprise hundreds of lines of code. It is just less obvious, because it is more of an organically grown codebase distributed over many files, rather than one centralized class definition that defines much of the functionality upfront.

@bentsherman

bentsherman commented Sep 29, 2024

Thanks for the extra context. I'd like to take some time to understand this tool selection logic in these two pipelines as a case study, if you'll entertain me. Especially since I was the one pushing against the ext.args / ext.when approach, I want to make sure there is a better path. Hopefully my notes here can serve as more general guidance.

First some notes about classes and then we'll get to your code. I agree that classes are quite useful, but classes in their most general form are overkill for Nextflow, which is not trying to be a general-purpose object oriented language like Java or Groovy or even Python.

This hasn't mattered in the past because Nextflow delegates everything to Groovy, so all Groovy syntax is supported. But now that we are formalizing the language and deciding which parts of Groovy to keep or discard, here's where I have landed regarding classes:

  • As a way to define composite data types: absolutely. Nextflow already supports enums and will support record types in the future. We just need to make them include-able across modules so that it is practical to declare them in Nextflow modules instead of the lib directory.

    However, bundling fields and methods in a class opens up a whole space of capabilities that must be supported (encapsulation, inheritance, polymorphism, overloading, etc), which are complicated and not really needed in Nextflow anyway. Instead, you can define functions which operate on those data types and bundle them in a module, rather than bundling them in the class definition.

  • As a namespace for helper functions: this use case is already covered by modules. Just define all of the functions in a Nextflow module and then include the ones you need. In fact, all of the lib directory code in Oncoanalyser falls under this case.

    For constants like those defined in lib/Constants.groovy, you can write a helper function that returns them all in a map:

    def getConstants() {
      return [
        GENOMES_VERSION_37: ['GRCh37_hmf', 'GRCh37'],
        GENOMES_VERSION_38: ['GRCh38_hmf', 'GRCh38', 'hg38'],
        GENOMES_ALT: ['GRCh38', 'hg38']
      ]
    }
    
    workflow {
      def constants = getConstants()
      println constants.GENOMES_VERSION_37
    }

    Or split them into multiple functions as you like. I'm still considering whether to support global constants as top-level declarations in Nextflow, but a helper function as shown above will also suffice.

As for your code, the main thing I'm wondering about is this ToolTracker, which provides a wrapper over a binary map with some extra methods. And to be clear, my main interest here is not about the use of a class -- the lib directory will always be fair game for making custom classes either way -- but whether the ToolTracker itself is needed over something like a set.

  1. Why do you need to retain keys even if they are false? That is the only reason I can see for your choice of a binary map. With a set you can just say 'tool' in tools or tools.contains('tool'), vs tools[tool] for a map, and with a set you already have functions for union and intersection. I don't see the difference here.

    You mentioned guarding against downstream changes, but the ToolTracker can be modified just as easily as a set. You could create an immutable set to make modification harder, although there are ways around that as well (see the sketch after this list).

  2. Can you give an example of why someone would specify something like profile1 AND/XOR/IAND profile2? I can sort of understand the OR but the other ones don't make sense. Maybe a real-world example would help here.
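As a side note on the immutable set mentioned in point 1, a minimal Groovy sketch:

// Groovy's asImmutable() wraps the set so that later modification attempts throw
def tools = (['FASTQC', 'MULTIQC'] as Set).asImmutable()
assert 'FASTQC' in tools
try {
    tools << 'KRAKEN2'
    assert false // never reached
} catch (UnsupportedOperationException ignored) {
    // modification is blocked, though a mutable copy can still be made and altered
}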

@mahesh-panchal
Member

I have to admit, I would do this via a YAML file to make a dictionary to query. It would be explicit to the user which tools are in a given profile (key), and allow the user to add keys to the YAML to customise it. You can then validate it against the JSON schema too to make sure key values are valid for any given profile.
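A hypothetical sketch of what that could look like — the file layout, profile names, and validation are assumptions; SnakeYAML ships with Nextflow and is what nf-core pipelines typically use to parse YAML:

import org.yaml.snakeyaml.Yaml

// profiles.yaml might look like:
//   default:
//     enable: [FASTQC, MULTIQC]
//   nanopore:
//     disable: [FASTQ_SCREEN]
def profiles = new Yaml().load(new File('profiles.yaml').text)

// check that every tool listed in a profile is actually known to the pipeline
def validTools = ['FASTQC', 'MULTIQC', 'FASTQ_SCREEN'] as Set
profiles.each { name, spec ->
    def listed = (spec.enable ?: []) + (spec.disable ?: [])
    assert listed.every { it in validTools }, "Unknown tool in profile ${name}"
}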

@MatthiasZepper
Member Author

MatthiasZepper commented Sep 30, 2024

Thanks for the extra context. I'd like to take some time to understand this tool selection logic in these two pipelines as a case study, if you'll entertain me.

Sure. I appreciate your input, since you are the Nextflow luminary.

Especially since I was the one pushing against the ext.args / ext.when approach, I want to make sure there is a better path. Hopefully my notes here can serve as more general guidance.

I was already avoiding any ext.when and used if (seqinspector_tools['FASTQC']){} instead.

I can't follow you regarding ext.args, because I am not aware that it is deprecated; furthermore, this proposal does not yet deal with that problem, since I am confused about the permissible config syntax with regard to Nextflow's new syntax parser and thus have not touched the subject yet.

But yes, we will later have to modify arguments of tools depending on the selected profile.

First some notes about classes and then we'll get to your code. I agree that classes are quite useful, but classes in their most general form are overkill for Nextflow, which is not trying to be a general-purpose object oriented language like Java or Groovy or even Python.

This hasn't mattered in the past because Nextflow delegates everything to Groovy, so all Groovy syntax is supported. But now that we are formalizing the language and deciding which parts of Groovy to keep or discard, here's where I have landed regarding classes

However, bundling fields and methods in a class opens up a whole space of capabilities that must be supported (encapsulation, inheritance, polymorphism, overloading, etc), which are complicated and not really needed in Nextflow anyway.

Thanks a lot for your comprehensive input! It is much appreciated to understand the background of your decisions and what should be avoided. I just bundled fields and methods into a class since I am used to that coming from Python, where exactly that is commonplace (look at tools or MultiQC...).

Instead, you can define functions which operate on those data types and bundle them in a module, rather than bundling them in the class definition.
For constants like those defined in lib/Constants.groovy, you can write a helper function that returns them all in a map:

Perfectly fine with me. Looking at the pipeline template, it seems to me that I would then need to include each function explicitly in every file, or is there an easier way?

include { completionEmail           } from '../../nf-core/utils_nfcore_pipeline'
include { completionSummary         } from '../../nf-core/utils_nfcore_pipeline'
include { dashedLine                } from '../../nf-core/utils_nfcore_pipeline'
include { nfCoreLogo                } from '../../nf-core/utils_nfcore_pipeline'
include { imNotification            } from '../../nf-core/utils_nfcore_pipeline'
include { UTILS_NFCORE_PIPELINE     } from '../../nf-core/utils_nfcore_pipeline'
include { workflowCitation          } from '../../nf-core/utils_nfcore_pipeline'

As for your code, the main thing I'm wondering about is this ToolTracker, which provides a wrapper over a binary map with some extra methods. And to be clear, my main interest here is not about the use of a class (the lib directory will always be fair game for making custom classes either way), but whether the ToolTracker itself is needed over something like a set.

(Also see below.) Fair, I am for now redefining the putAt and getAt methods of my ToolTracker class, which makes it seem as if tool_selection were the only field this class will ever have. That is not necessarily true: we will clearly also need a data structure to hold technology- and assay-specific ext.args. I have not yet decided how to solve that issue, but introducing another field in the ToolTracker class is evidently an option. In that case, a ToolTracker instance could bundle the tools and all deviating configuration.

Why do you need to retain keys even if they are false? That is the only reason I can see for your choice of a binary map. With a set you can just say 'tool' in tools or tools.contains('tool'), vs tools[tool] for a map, and with a set you already have functions for union and intersection. I don't see the difference here.

Can you give an example of why someone would specify something like profile1 AND/XOR/IAND profile2? I can sort of understand the OR but the other ones don't make sense. Maybe a real-world example would help here.

Sure! I indeed want to have the option to record a false value. Let's say somebody is running the pipeline with the default profile, but then wants to do an extra extensive contamination screening. So the pipeline will be run with default OR contamination_screening. So far, sets and a union operation would do just as well.

But some of the default tools and two of the contamination screening tools are unsuited for Nanopore data, which is what our user wishes to QC in this case. So either I have dedicated default_nanopore and contamination_screening_nanopore profiles, or I apply a general nanopore profile that has false values for all incompatible tools. In the latter case, I can simply do default OR contamination_screening IAND nanopore and that's it. With far fewer predefined profiles, I can achieve much finer-grained control.
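Spelled out as a self-contained sketch (all tool names and profile contents are made up for illustration):

def defaults      = [FASTQC: true, MULTIQC: true]
def contamination = [KRAKEN2: true, FASTQ_SCREEN: true]
def nanopore      = [FASTQ_SCREEN: false] // negative profile: disable incompatible tools

// default OR contamination_screening: union, enabling a tool if either profile enables it
def selected = (defaults.keySet() + contamination.keySet()).collectEntries { k ->
    [k, (defaults[k] ?: false) || (contamination[k] ?: false)]
}

// ... IAND nanopore: common keys must agree (logical AND), keys present in
// only one of the two maps simply keep their value
nanopore.each { k, v -> selected[k] = selected.containsKey(k) ? (selected[k] && v) : v }

assert selected == [FASTQC: true, MULTIQC: true, KRAKEN2: true, FASTQ_SCREEN: false]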

You mentioned guarding against downstream changes, but the ToolTracker can be modified just as easily as a set. You could create an immutable set to make modification harder, although there are ways around that as well.

"Guarding" is perhaps too strong a word, and so is immutability. I just meant that I suggest evaluating the profiles once, e.g. in the PIPELINE_INITIALISATION workflow, and then using the results throughout the pipeline.

I would mainly like to avoid redoing complex evaluations for each and every module that we run, as Sarek does. I see a benefit of doing:

if (seqinspector_tools['FASTQC']) {
    // run module
}

over

if ((params.save_mapped && !params.save_output_as_bam) ||
    ((params.skip_tools && params.skip_tools.split(',').contains('markduplicates')) &&
    !(params.tools && params.tools.split(',').contains('sentieon_dedup')))) {
    // run module (in Sarek)
}

I see how such a complex evaluation may be necessary, but it is easy to forget a bracket or a negation somewhere, and suddenly a module runs differently than intended. If I centralize the evaluation, I can for example just check params.skip_tools once and then set all of those tools to false with a .each{tool -> ...} closure (see the sketch below).
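For illustration, that centralized evaluation could be as small as this sketch (the parameter and tool names are assumed):

def seqinspector_tools = [FASTQC: true, MULTIQC: true, KRAKEN2: true]
def skip_tools = 'kraken2,multiqc' // would come from params.skip_tools

// evaluate params.skip_tools once, instead of re-parsing it in every module guard
skip_tools.tokenize(',').each { tool ->
    seqinspector_tools[tool.trim().toUpperCase()] = false
}
assert seqinspector_tools == [FASTQC: true, MULTIQC: false, KRAKEN2: false]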

@MatthiasZepper
Member Author

I have to admit, I would do this via a YAML file to make a dictionary to query. It would be explicit to the user which tools are in a given profile (key), and allow the user to add keys to the YAML to customise it. You can then validate it against the JSON schema too to make sure key values are valid for any given profile.

If you have time on Friday, it would be great if you could join the meeting to elaborate!

But in brief: I had not considered a YAML yet, but it sounds good to me. How we specify the profiles is something we can still decide - in particular for user-provided profiles it seems like a good idea.

But internally, you will probably still have a map to hold that information later? Or do you repeatedly want to read in the YAML files during a pipeline run and re-validate them against the schema as well?

I also think something like

    static Map<String, ToolTracker> ToolProfiles = [
            ALL: Utilities.buildToolTracker(Constants.Tool.values().toList(), []), // generate a tool tracker with all tools set to true
            NONE: Utilities.buildToolTracker([], Constants.Tool.values().toList()), // generate a tool tracker with all tools set to false
    ]

could prove helpful, because those profiles do not need to be updated once new modules are added.

@bentsherman

bentsherman commented Sep 30, 2024

I am confused about the permissible config syntax with regard to Nextflow's new syntax parser and thus did not touch the subject yet.

The new config parser still allows all of the existing ext config; it's just a convention I'm trying to move away from in favor of explicit pipeline logic and explicit process/workflow inputs. This change is less urgent and will likely take a while.

I suggest evaluating the profiles once, e.g. in the PIPELINE_INITIALISATION workflow and then use the results throughout the pipeline.

I'm with you there. It should be possible to replace all of this conditional logic with a single upfront calculation as you are trying to do.

Regarding the includes, you would have to include each function that you intend to use. But I thought you would only need these functions once at the beginning of the pipeline? In any case, you can also do a compound include:

include {
    completionEmail ; completionSummary ; dashedLine
} from '../../nf-core/utils_nfcore_pipeline'

But I wouldn't worry about it for now anyway since either approach will work. A module would just be better for sharing in the future.

But some of the default tools and two of the contamination screening tools are unsuited for Nanopore data and that is what our user wishes to QC check in that case. So either I have a dedicated default_nanopore and contamination_screening_nanopore profile, or I apply a general nanopore profile that has false values for all incompatible tools. In the latter case, I can simply do a default OR contamination_screening IAND nanopore and that's it. I can with way less predefined profiles achieve a much finer grained control.

I think I understand. So a profile can contain both true values and false values to denote (soft) requirements and (hard) incompatibilities. And the IAND means "enable the tools explicitly enabled by the other profile, and disable any tools explicitly disabled by it", where explicit disables override explicit enables. Does that sound right?

This is pretty interesting! I knew I would find some treasure if I dug deep enough. Here's how I think it could be modeled with sets:

  1. Each profile has two sets, a "soft enable" set and "hard disable" set. In other words, "I can run with these tools if it's alright with everyone else, but I absolutely can't run with these tools".
  2. The user simply specifies a list of profiles to apply
  3. The final tool selection is the union of the enable sets minus the union of the disable sets

Here's that 10-line function I promised:

class ToolProfile {
  Set<Tool> enable
  Set<Tool> disable
}

enum Tool {
  FASTQC,
  // ...
}

// TODO: pre-defined tool profiles

def getEnabledTools(profileNames, includes, excludes, toolProfiles) {
  def profiles = profileNames.collect { toolProfiles[it] }      // look up each named profile
  def allEnabled = profiles.collectMany { it.enable } as Set    // union of all enable sets
  def allDisabled = profiles.collectMany { it.disable } as Set  // union of all disable sets

  // note: the order of operations here communicates the precedence of each setting
  return allEnabled - allDisabled + includes - excludes
}

I'm still essentially using a record type to model the profile; you could also use a tuple or map but I agree the custom class is clearer. But now the tool selection is just set operations, no need for the custom boolean operations, and hopefully the intent is now clearer.

What do you think? Am I missing anything?

@bentsherman

To Mahesh's point, you could define the built-in profiles in a YAML and load them into a map as you have now. Just a matter of whether you'd rather have the definitions sit in code or in YAML.

If this tool selection logic ever becomes a module, the enums and built-in profiles would clearly have to be external and passed in as arguments to something like getEnabledTools()

@MatthiasZepper
Member Author

MatthiasZepper commented Sep 30, 2024

Regarding the includes, you would have to include each function that you intend to use. But I thought you would only need these functions once at the beginning of the pipeline? In any case, you can also do a compound include.

That is nifty and good to know. Yes, most functions will only be needed for the initialization of the pipeline. Perhaps there are some in the Utilities class that could come in handy elsewhere, but they could indeed just be ordinary functions.

And the IAND means "enable the tools explicitly enabled by the other profile, and disable any tools explicitly disabled by it", where explicit disables override explicit enables. Does that sound right?

Almost. AND is obviously just A∧B, but I also required an (A∧B)∨(A⊕B) operation. Since I did not know how to call it, I thought that inclusive AND or IAND should do.

IAND means: "Enable tools only if both profiles allow it OR retain whatever setting you got from the single profile that has it."

AND means: "Enable tools only if both profiles allow it. Therefore, disable in all other cases: Either one profile explicitly disables, or one profile lacks an explicit setting."
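As a standalone sketch of the two semantics on plain binary maps (the names and the map-based modelling are illustrative, not the actual ToolTracker implementation):

def AND = { Map a, Map b ->
    // missing settings count as disabled
    (a.keySet() + b.keySet()).collectEntries { k -> [k, (a[k] ?: false) && (b[k] ?: false)] }
}
def IAND = { Map a, Map b ->
    (a.keySet() + b.keySet()).collectEntries { k ->
        // common keys must agree; keys in only one map keep their setting
        def v = (a.containsKey(k) && b.containsKey(k)) ? (a[k] && b[k])
                                                       : (a.containsKey(k) ? a[k] : b[k])
        [k, v]
    }
}

def x = [FASTQC: true, KRAKEN2: true]
def y = [FASTQC: true, NANOPLOT: false]
assert AND(x, y)  == [FASTQC: true, KRAKEN2: false, NANOPLOT: false]
assert IAND(x, y) == [FASTQC: true, KRAKEN2: true,  NANOPLOT: false]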

I think I understand. So a profile can contain both true values and false values to denote (soft) requirements and (hard) incompatibilities.

Incompatibilities are indeed a prime reason for disabling a tool, but of course a minimal profile could also just disable tools that require too many resources or take too much time. When devising the ToolTracker class and its functions, I was thinking more in the direction of a toolbox for the developers.

As of now, I do not make any assumptions about the softness / hardness of an operation. The params.tool_selection allows the user to specify profiles and the operations that link them in a left-to-right manner. Only for params.tools_include and params.tools_exclude have I already hard-coded the operations, since obviously OR is needed to include and AND to exclude tools.

But, as I already wrote above: "Whether we should expose any of this functionality to the end user, as I did in this draft, is something we need to discuss in our next developer meeting". Probably, it would indeed suffice to perform one final enable IAND disable operation, after making a union of the respective sets from all profiles, including params.tools_include (union with enable) and params.tools_exclude (union with disable).

Here's that 10-line function I promised:

Thanks a lot! I like the simplicity of your approach! Working with a single Set seemed cumbersome to me, but a class that holds two of them should indeed cover 95% of the use cases and probably 200% of what we will ever actually implement in the pipeline.

I will think about it and discuss with the others, but in either case get rid of the mixed classes that bundle fields and methods. I was not aware of how much additional complexity that imposes on Nextflow.

@bentsherman

Well, hopefully this sketch gives you an idea of how you might use the existing data structures like sets to model whatever constraints you need. Even the priority resolution is clearly communicated by the order of these operations:

allEnabled - allDisabled + includes - excludes

So the disables override the enables, the manual selection overrides the profile selection, etc. I'm curious to see where you might take this idea, especially with the other constraints you mentioned like sequence type, etc.

I would argue that the bundled code + data approach is not only complicated to implement, but it also makes your job harder because it encourages you down a path where it's much easier to get tangled up. I think this is partly why many languages are moving towards the data-driven approach of compound data types + standalone functions (e.g. records in Java, dataclasses in Python).

On the Nextflow side, aside from the better support for enum and record types, I think we need to better educate users on how to use the standard library to full effect, rather than simply delegating to the Java/Groovy docs. This has been a good use case for something that could be a tutorial on domain modeling with basic data structures.


github-actions bot commented Oct 18, 2024

nf-core pipelines lint overall result: Failed ❌

Posted for pipeline commit 64cc996

+| ✅ 168 tests passed       |+
!| ❗  24 tests had warnings |!
-| ❌  10 tests failed       |-

❌ Test failures:

  • nextflow_config - Config variable (incorrectly) found: params.max_cpus
  • nextflow_config - Config variable (incorrectly) found: params.max_memory
  • nextflow_config - Config variable (incorrectly) found: params.max_time
  • nextflow_config - Old lines for loading custom profiles found. File should contain:

        // Load nf-core custom profiles from different Institutions
        includeConfig !System.getenv('NXF_OFFLINE') && params.custom_config_base ? "${params.custom_config_base}/nfcore_custom.config" : "/dev/null"
  • files_unchanged - .github/CONTRIBUTING.md does not match the template
  • files_unchanged - .github/PULL_REQUEST_TEMPLATE.md does not match the template
  • files_unchanged - .github/workflows/linting_comment.yml does not match the template
  • files_unchanged - .github/workflows/linting.yml does not match the template
  • files_unchanged - .gitignore does not match the template
  • actions_awsfulltest - .github/workflows/awsfulltest.yml is not triggered correctly

❗ Test warnings:

  • files_exist - File not found: conf/igenomes_ignored.config
  • nextflow_config - nf-validation has been detected in the pipeline. Please migrate to nf-schema: https://nextflow-io.github.io/nf-schema/latest/migration_guide/
  • readme - README contains the placeholder zenodo.XXXXXXX. This should be replaced with the zenodo doi (after the first release).
  • pipeline_todos - TODO string in main.nf: Remove this line if you don't need a FASTA file
  • pipeline_todos - TODO string in nextflow.config: Specify your pipeline's command line flags
  • pipeline_todos - TODO string in README.md: TODO nf-core:
  • pipeline_todos - TODO string in README.md: Include a figure that guides the user through the major workflow steps. Many nf-core
  • pipeline_todos - TODO string in README.md: Fill in short bullet-pointed list of the default steps in the pipeline
  • pipeline_todos - TODO string in README.md: Add citation for pipeline after first release. Uncomment lines below and update Zenodo doi and badge at the top of this file.
  • pipeline_todos - TODO string in README.md: Add bibliography of tools and data used in your pipeline
  • pipeline_todos - TODO string in usage.md: Add documentation about anything specific to running your pipeline. For general topics, please point to (and add to) the main nf-core website.
  • pipeline_todos - TODO string in main.nf: Optionally add in-text citation tools to this list.
  • pipeline_todos - TODO string in main.nf: Optionally add bibliographic entries to this list.
  • pipeline_todos - TODO string in main.nf: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!
  • pipeline_todos - TODO string in test_full.config: Specify the paths to your full test data ( on nf-core/test-datasets or directly in repositories, e.g. SRA)
  • pipeline_todos - TODO string in test_full.config: Give any required params for the test so that command line flags are not needed
  • pipeline_todos - TODO string in test.config: Specify the paths to your test data on nf-core/test-datasets
  • pipeline_todos - TODO string in test.config: Give any required params for the test so that command line flags are not needed
  • pipeline_todos - TODO string in base.config: Check the defaults for all processes
  • pipeline_todos - TODO string in base.config: Customise requirements for specific processes.
  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline
  • pipeline_todos - TODO string in awsfulltest.yml: You can customise AWS full pipeline tests as required
  • pipeline_todos - TODO string in ci.yml: You can customise CI pipeline run tests as required
  • nfcore_yml - nf-core version in .nf-core.yml is not set to the latest version. Should be 3.0.2 but was 2.14.1

✅ Tests passed:

Run details

  • nf-core/tools version 3.0.2
  • Run at 2024-10-22 13:00:41

@MatthiasZepper
Member Author

Closed in favor of a different approach that uses profiles in YAML format and nf-schema for validation.
