From d4a80ea83248ac5245e513f154973a248e2cabb0 Mon Sep 17 00:00:00 2001 From: Richard Lupat Date: Wed, 5 Jun 2024 04:11:31 +1000 Subject: [PATCH] Built site for gh-pages --- .nojekyll | 2 +- index.html | 6 +- search.json | 235 ++++- sessions/1_intro_run_nf.html | 4 + sessions/2_nf_dev_intro.html | 8 +- sitemap.xml | 32 +- workshops/00_setup.html | 4 + workshops/1.1_intro_nextflow.html | 4 + workshops/1.2_intro_nf_core.html | 4 + workshops/2.1_customise_and_run.html | 4 + workshops/2.2_troubleshooting.html | 4 + workshops/2.3_tips_and_tricks.html | 4 + workshops/3.1_creating_a_workflow.html | 4 + workshops/4.1_draft_future_sess.html | 4 + workshops/4_1_modules.html | 1310 ++++++++++++++++++++++++ 15 files changed, 1588 insertions(+), 41 deletions(-) create mode 100644 workshops/4_1_modules.html diff --git a/.nojekyll b/.nojekyll index a49e755..6c6660a 100644 --- a/.nojekyll +++ b/.nojekyll @@ -1 +1 @@ -338f3e2a \ No newline at end of file +25e786e2 \ No newline at end of file diff --git a/index.html b/index.html index d2c59ec..82b4eec 100644 --- a/index.html +++ b/index.html @@ -132,6 +132,10 @@
  • Creating a workflow +
  • +
  • + + Modularisation
  • @@ -205,7 +209,7 @@

    Curre 29th May 2024 -To-be-added +Developing Modularised Workflows Introduction to developing bioinformatics workflow with Nextflow (Part2) 5th Jun 2024 diff --git a/search.json b/search.json index 41f5fe4..bc8169e 100644 --- a/search.json +++ b/search.json @@ -4,14 +4,14 @@ "href": "sessions/2_nf_dev_intro.html", "title": "Developing bioinformatics workflows with Nextflow", "section": "", - "text": "This workshop is designed to provide participants with a fundamental understanding of developing bioinformatics pipelines using Nextflow. This workshop aims to provide participants with the necessary skills required to create a Nextflow pipeline from scratch or from the nf-core template.\n\nCourse Presenters\n\nRichard Lupat, Bioinformatics Core Facility\nMiriam Yeung, Cancer Genomics Translational Research Centre\n\n\n\nCourse Helpers\n\nSanduni Rajapaksa, Research Computing Facility\nSong Li, Bioinformatics Core Facility\n\n\n\nPrerequisites\n\nExperience with command line interface and cluster/slurm\nFamiliarity with the basic concept of workflows\nAccess to Peter Mac Cluster\nAttendance in the ‘Introduction to Nextflow and Running nf-core Workflows’ workshop, or an understanding of the Nextflow concepts outlined in the workshop material\n\n\n\nLearning Objectives:\nBy the end of this workshop, participants should be able to:\n\nDevelop a basic Nextflow workflow consisting of processes that use multiple scripting languages\nGain an understanding of Groovy and Nextflow syntax\nRead data of different types into a Nextflow workflow\nOutput Nextflow process results to a predefined directory\nRe-use and import processes, modules, and sub-workflows into a Nextflow workflow\nTest and set up profiles for a Nextflow workflow\nCreate conditional processes and conditional scripts within a process\nGain an understanding of Nextflow channel operators\nDevelop a basic Nextflow workflow with nf-core templates\nTroubleshoot known errors in workflow development\n\n\n\nSet up requirements\nPlease complete the Setup Instructions before the course.\nIf you have any trouble, please get in contact with us ASAP via Slack/Teams.\n\n\nWorkshop schedule\n\n\n\nLesson\nOverview\nDate\n\n\n\n\nSetup\nFollow these instructions to install VS Code and setup your workspace\nPrior to workshop\n\n\nSession kick off\nSession kick off: Discuss learning outcomes and finalising workspace setup\nEvery week\n\n\nBasic to Create a Nextflow Workflow\nIntroduction to nextflow channels, processes, data types and workflows\n29th May 2024\n\n\nDeveloping Reusable Workflows\nIntroduction to modules imports, sub-workflows, setting up test-profile, and common useful groovy functions\n5th Jun 2024\n\n\nWorking with nf-core Templates\nIntroduction to developing nextflow workflow with nf-core templates\n12th Jun 2024\n\n\nWorking with Nextflow Built-in Functions\nIntroduction to nextflow operators, metadata propagation, grouping, and splitting\n19th Jun 2024\n\n\n\n\n\nCredits and acknowledgement\nThis workshop is adapted from Fundamentals Training, Advanced Training, Developer Tutorials, and Nextflow Patterns materials from Nextflow and nf-core." + "text": "This workshop is designed to provide participants with a fundamental understanding of developing bioinformatics pipelines using Nextflow. 
This workshop aims to provide participants with the necessary skills required to create a Nextflow pipeline from scratch or from the nf-core template.\n\nCourse Presenters\n\nRichard Lupat, Bioinformatics Core Facility\nMiriam Yeung, Cancer Genomics Translational Research Centre\nSong Li, Bioinformatics Core Facility\n\n\n\nCourse Helpers\n\nSanduni Rajapaksa, Research Computing Facility\n\n\n\nPrerequisites\n\nExperience with command line interface and cluster/slurm\nFamiliarity with the basic concept of workflows\nAccess to Peter Mac Cluster\nAttendance in the ‘Introduction to Nextflow and Running nf-core Workflows’ workshop, or an understanding of the Nextflow concepts outlined in the workshop material\n\n\n\nLearning Objectives:\nBy the end of this workshop, participants should be able to:\n\nDevelop a basic Nextflow workflow consisting of processes that use multiple scripting languages\nGain an understanding of Groovy and Nextflow syntax\nRead data of different types into a Nextflow workflow\nOutput Nextflow process results to a predefined directory\nRe-use and import processes, modules, and sub-workflows into a Nextflow workflow\nTest and set up profiles for a Nextflow workflow\nCreate conditional processes and conditional scripts within a process\nGain an understanding of Nextflow channel operators\nDevelop a basic Nextflow workflow with nf-core templates\nTroubleshoot known errors in workflow development\n\n\n\nSet up requirements\nPlease complete the Setup Instructions before the course.\nIf you have any trouble, please get in contact with us ASAP via Slack/Teams.\n\n\nWorkshop schedule\n\n\n\nLesson\nOverview\nDate\n\n\n\n\nSetup\nFollow these instructions to install VS Code and setup your workspace\nPrior to workshop\n\n\nSession kick off\nSession kick off: Discuss learning outcomes and finalising workspace setup\nEvery week\n\n\nBasic to Create a Nextflow Workflow\nIntroduction to nextflow channels, processes, data types and workflows\n29th May 2024\n\n\nDeveloping Modularised Workflows\nIntroduction to modules imports, sub-workflows, setting up test-profile, and common useful groovy functions\n5th Jun 2024\n\n\nWorking with nf-core Templates\nIntroduction to developing nextflow workflow with nf-core templates\n12th Jun 2024\n\n\nWorking with Nextflow Built-in Functions\nIntroduction to nextflow operators, metadata propagation, grouping, and splitting\n19th Jun 2024\n\n\n\n\n\nCredits and acknowledgement\nThis workshop is adapted from Fundamentals Training, Advanced Training, Developer Tutorials, and Nextflow Patterns materials from Nextflow and nf-core." }, { "objectID": "index.html", "href": "index.html", "title": "Peter Mac Internal Nextflow Workshops", "section": "", - "text": "These workshops are designed to provide participants with a foundational understanding of Nextflow and nf-core pipelines. Participants are expected to have prior experience with the command-line interface and working with cluster systems like Slurm. 
The primary goal of the workshop is to equip researchers with the skills needed to use nextflow and nf-core pipelines for their research data.\n\nCourse Developers & Maintainers\n\nRichard Lupat, Bioinformatics Core Facility\nMiriam Yeung, Cancer Genomics Translational Research Centre\nSong Li, Bioinformatics Core Facility\nSanduni Rajapaksa, Research Computing Facility\n\n\n\nCurrent and Future Workshop Sessions\n\n\n\nLesson\nOverview\nDate\n\n\n\n\nBasic to Create a Nextflow Workflow\nIntroduction to developing bioinformatics workflow with Nextflow (Part1)\n29th May 2024\n\n\nTo-be-added\nIntroduction to developing bioinformatics workflow with Nextflow (Part2)\n5th Jun 2024\n\n\nTo-be-added\nIntroduction to developing bioinformatics workflow with Nextflow (Part3)\n12th Jun 2024\n\n\nTo-be-added\nIntroduction to developing bioinformatics workflow with Nextflow (Part4)\n19th Jun 2024\n\n\n\n\n\nPast Workshop Sessions\n\n\n\nSession\nOverview\nDate\n\n\n\n\nIntroduction to Nextflow and running nf-core workflows\nIntroduction to Nextflow: Introduce nextflow’s core features and concepts + nf-core + how to run it at PeterMac\n22nd Nov 2023\n\n\nIntroduction to Nextflow and running nf-core workflows\n(Re-run) Introduction to Nextflow: Introduce nextflow’s core features and concepts + nf-core + how to run it at PeterMac\n15th May 2024\n\n\n\n\n\nCredits and acknowledgement\nThis workshop is adapted from Customising Nf-Core Workshop materials from Sydney Informatics Hub" + "text": "These workshops are designed to provide participants with a foundational understanding of Nextflow and nf-core pipelines. Participants are expected to have prior experience with the command-line interface and working with cluster systems like Slurm. The primary goal of the workshop is to equip researchers with the skills needed to use nextflow and nf-core pipelines for their research data.\n\nCourse Developers & Maintainers\n\nRichard Lupat, Bioinformatics Core Facility\nMiriam Yeung, Cancer Genomics Translational Research Centre\nSong Li, Bioinformatics Core Facility\nSanduni Rajapaksa, Research Computing Facility\n\n\n\nCurrent and Future Workshop Sessions\n\n\n\nLesson\nOverview\nDate\n\n\n\n\nBasic to Create a Nextflow Workflow\nIntroduction to developing bioinformatics workflow with Nextflow (Part1)\n29th May 2024\n\n\nDeveloping Modularised Workflows\nIntroduction to developing bioinformatics workflow with Nextflow (Part2)\n5th Jun 2024\n\n\nTo-be-added\nIntroduction to developing bioinformatics workflow with Nextflow (Part3)\n12th Jun 2024\n\n\nTo-be-added\nIntroduction to developing bioinformatics workflow with Nextflow (Part4)\n19th Jun 2024\n\n\n\n\n\nPast Workshop Sessions\n\n\n\nSession\nOverview\nDate\n\n\n\n\nIntroduction to Nextflow and running nf-core workflows\nIntroduction to Nextflow: Introduce nextflow’s core features and concepts + nf-core + how to run it at PeterMac\n22nd Nov 2023\n\n\nIntroduction to Nextflow and running nf-core workflows\n(Re-run) Introduction to Nextflow: Introduce nextflow’s core features and concepts + nf-core + how to run it at PeterMac\n15th May 2024\n\n\n\n\n\nCredits and acknowledgement\nThis workshop is adapted from Customising Nf-Core Workshop materials from Sydney Informatics Hub" }, { "objectID": "workshops/2.3_tips_and_tricks.html", @@ -28,18 +28,25 @@ "text": "Objectives\n\n\n\n\nLearn about the core features of nf-core.\nLearn the terminology used by nf-core.\nUse Nextflow to pull and run the nf-core/testpipeline workflow\n\n\n\nIntroduction to nf-core: Introduce 
nf-core features and concepts, structures, tools, and example nf-core pipelines\n\n1.2.1. What is nf-core?\nnf-core is a community effort to collect a curated set of analysis workflows built using Nextflow.\nnf-core provides a standardized set of best practices, guidelines, and templates for building and sharing bioinformatics workflows. These workflows are designed to be modular, scalable, and portable, allowing researchers to easily adapt and execute them using their own data and compute resources.\nThe community is a diverse group of bioinformaticians, developers, and researchers from around the world who collaborate on developing and maintaining a growing collection of high-quality workflows. These workflows cover a range of applications, including transcriptomics, proteomics, and metagenomics.\nOne of the key benefits of nf-core is that it promotes open development, testing, and peer review, ensuring that the workflows are robust, well-documented, and validated against real-world datasets. This helps to increase the reliability and reproducibility of bioinformatics analyses and ultimately enables researchers to accelerate their scientific discoveries.\nnf-core is published in Nature Biotechnology: Nat Biotechnol 38, 276–278 (2020). Nature Biotechnology\nKey Features of nf-core workflows\n\nDocumentation\n\nnf-core workflows have extensive documentation covering installation, usage, and description of output files to ensure that you won’t be left in the dark.\n\nStable Releases\n\nnf-core workflows use GitHub releases to tag stable versions of the code and software, making workflow runs totally reproducible.\n\nPackaged software\n\nPipeline dependencies are automatically downloaded and handled using Docker, Singularity, Conda, or other software management tools. There is no need for any software installations.\n\nPortable and reproducible\n\nnf-core workflows follow best practices to ensure maximum portability and reproducibility. The large community makes the workflows exceptionally well-tested and easy to execute.\n\nCloud-ready\n\nnf-core workflows are tested on AWS\n\n\n\n1.2.2. Executing an nf-core workflow\nThe nf-core website has a full list of workflows and associated documentation to be explored.\nEach workflow has a dedicated page that includes expansive documentation that is split into 7 sections:\n\nIntroduction\n\nAn introduction and overview of the workflow\n\nResults\n\nExample output files generated from the full test dataset\n\nUsage docs\n\nDescriptions of how to execute the workflow\n\nParameters\n\nGrouped workflow parameters with descriptions\n\nOutput docs\n\nDescriptions and examples of the expected output files\n\nReleases & Statistics\n\nWorkflow version history and statistics\n\n\nAs nf-core is a community development project, the code for a pipeline can be changed at any time. To ensure that you have locked in a specific version of a pipeline, you can use Nextflow’s built-in functionality to pull a workflow. The Nextflow pull command can download and cache workflows from GitHub repositories:\nnextflow pull nf-core/<pipeline>\nNextflow run will also automatically pull the workflow if it was not already available locally:\nnextflow run nf-core/<pipeline>\nNextflow will pull the default git branch if a workflow version is not specified. This will be the master branch for nf-core workflows with a stable release. nf-core workflows use GitHub releases to tag stable versions of the code and software. 
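As a brief illustration (nf-core/rnaseq and release 3.11.1 are used here purely as an example; they happen to be the pipeline and version used later in this material), a specific tagged release can be requested at run time:\nnextflow run nf-core/rnaseq -r 3.11.1\n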
You will always be able to execute a previous version of a workflow once it is released using the -revision or -r flag.\nFor this section of the workshop, we will be using the nf-core/testpipeline as an example.\nAs we will be running some bioinformatics tools, we will need to make sure of the following:\n\nWe are not running on a login node\nThe singularity module is loaded (module load singularity/3.7.3)\n\n\n\n\n\n\n\nSetup an interactive session\n\n\n\nsrun --pty -p prod_short --mem 20GB --cpus-per-task 2 -t 0-2:00 /bin/bash\n\nEnsure the required modules are loaded\nmodule list\nCurrently Loaded Modulefiles:\n 1) java/jdk-17.0.6 2) nextflow/23.04.1 3) squashfs-tools/4.5 4) singularity/3.7.3\n\n\n\nWe will also create a separate output directory for this section.\ncd /scratch/users/<your-username>/nfWorkshop; mkdir ./lesson1.2 && cd $_\nThe base command we will be using for this section is:\nnextflow run nf-core/testpipeline -profile test,singularity --outdir my_results\n\n\n1.2.3. Workflow structure\nnf-core workflows start from a common template and follow the same structure. Although you won’t need to edit code in the workflow project directory, having a basic understanding of the project structure and some core terminology will help you understand how to configure its execution.\nLet’s take a look at the code for the nf-core/rnaseq pipeline.\nNextflow DSL2 workflows are built up of subworkflows and modules that are stored as separate .nf files.\nMost nf-core workflows consist of a single workflow file (there are a few exceptions). This is the main <workflow>.nf file that is used to bring everything else together. Instead of having one large monolithic script, it is broken up into a combination of subworkflows and modules.\nA subworkflow is a group of modules that are used in combination with each other and have a common purpose. Subworkflows improve workflow readability and help with the reuse of modules within a workflow. The nf-core community also shares subworkflows in the nf-core subworkflows GitHub repository. Local subworkflows are workflow-specific and are not shared in the nf-core subworkflows repository.\nLet’s take a look at the BAM_STATS_SAMTOOLS subworkflow.\nThis subworkflow is composed of the following modules: - SAMTOOLS_STATS - SAMTOOLS_IDXSTATS, and - SAMTOOLS_FLAGSTAT\nA module is a wrapper for a process; most modules will execute a single tool and contain the following definitions: - inputs - outputs, and - script block.\nLike subworkflows, modules can also be shared in the nf-core modules GitHub repository or stored as a local module. All modules from the nf-core repository are version controlled and tested to ensure reproducibility. Local modules are workflow-specific and are not shared in the nf-core modules repository.\n\n\n1.2.4. Viewing parameters\nEvery nf-core workflow has a full list of parameters on the nf-core website. When viewing these parameters online, you will also be shown a description and the type of the parameter. 
Some parameters will have additional text to help you understand when and how a parameter should be used.\n\n\n\n\n\nParameters and their descriptions can also be viewed in the command line using the run command with the --help parameter:\nnextflow run nf-core/<workflow> --help\n\n\n\n\n\n\nChallenge\n\n\n\nView the parameters for the nf-core/testpipeline workflow using the command line:\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nThe nf-core/testpipeline workflow parameters can be printed using the run command and the --help option:\nnextflow run nf-core/testpipeline --help\n\n\n\n\n\n1.2.5. Parameters in the command line\nParameters can be customized using the command line. Any parameter can be configured on the command line by prefixing the parameter name with a double dash (--):\nnextflow run nf-core/<workflow> --<parameter>\n\n\n\n\n\n\nTip\n\n\n\nNextflow options are prefixed with a single dash (-) and workflow parameters are prefixed with a double dash (--).\n\n\nDepending on the parameter type, you may be required to add additional information after your parameter flag. For example, for a string parameter, you would add the string after the parameter flag:\nnextflow run nf-core/<workflow> --<parameter> string\n\n\n\n\n\n\nChallenge\n\n\n\nGive the MultiQC report for the nf-core/testpipeline workflow the name of your favorite animal using the multiqc_title parameter using a command line flag:\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nAdd the --multiqc_title flag to your command and execute it. Use the -resume option to save time:\nnextflow run nf-core/testpipeline -profile test,singularity --multiqc_title koala --outdir my_results -resume\n\n\n\nIn this example, you can check your parameter has been applied by listing the files created in the results folder (my_results):\nls my_results/multiqc/\n\n\n1.2.6. Configuration files\nConfiguration files are .config files that can contain various workflow properties. Custom paths passed in the command-line using the -c option:\nnextflow run nf-core/<workflow> -profile test,docker -c <path/to/custom.config>\nMultiple custom .config files can be included at execution by separating them with a comma (,).\nCustom configuration files follow the same structure as the configuration file included in the workflow directory. Configuration properties are organized into scopes by grouping the properties in the same scope using the curly brackets notation. For example:\nalpha {\n x = 1\n y = 'string value..'\n}\nScopes allow you to quickly configure settings required to deploy a workflow on different infrastructure using different software management. For example, the executor scope can be used to provide settings for the deployment of a workflow on a HPC cluster. Similarly, the singularity scope controls how Singularity containers are executed by Nextflow. Multiple scopes can be included in the same .config file using a mix of dot prefixes and curly brackets. 
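As a minimal sketch of that syntax only (singularity and process are real Nextflow configuration scopes, but the values shown are illustrative rather than settings required for this workshop), a single custom .config file could contain:\nsingularity.enabled = true\nprocess {\n executor = 'slurm'\n queue = 'prod_short'\n}\nHere the singularity scope uses the dot prefix notation while the process scope uses curly brackets.\n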
A full list of scopes is described in detail here.\n\n\n\n\n\n\nChallenge\n\n\n\nGive the MultiQC report for the nf-core/testpipeline workflow the name of your favorite color using the multiqc_title parameter in a custom my_custom.config file:\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nCreate a custom my_custom.config file that contains your favourite colour, e.g., blue:\nparams {\n multiqc_title = \"blue\"\n}\nInclude the custom .config file in your execution command with the -c option:\nnextflow run nf-core/testpipeline --outdir my_results -profile test,singularity -resume -c my_custom.config\nCheck that it has been applied:\nls my_results/multiqc/\nWhy did this fail?\nYou cannot use the params scope in custom configuration files. Parameters can only be configured using the -params-file option and the command line. While the parameter is listed on the STDOUT, it was not applied to the executed command.\nWe will revisit this at the end of the module.\n\n\n\n\n\n1.2.7. Parameter files\nParameter files are used to define params options for a pipeline and are generally written in the YAML format. They are added to a pipeline with the --params-file flag.\nExample YAML:\n\"<parameter1_name>\": 1\n\"<parameter2_name>\": \"<string>\"\n\"<parameter3_name>\": true\n\n\n\n\n\n\nChallenge\n\n\n\nBased on the failed application of the parameter multiqc_title, create a my_params.yml setting multiqc_title to your favourite colour. Then re-run the pipeline with your my_params.yml\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nSet up my_params.yml\nmultiqc_title: \"black\"\nnextflow run nf-core/testpipeline -profile test,singularity --params-file my_params.yml --outdir Lesson1_2\n\n\n\n\n\n1.2.8. Default configuration files\nAll parameters will have a default setting that is defined using the nextflow.config file in the workflow project directory. By default, most parameters are set to null or false and are only activated by a profile or configuration file.\nThere are also several includeConfig statements in the nextflow.config file that are used to load additional .config files from the conf/ folder. Each additional .config file contains categorized configuration information for your workflow execution, some of which can be optionally included:\n\nbase.config\n\nIncluded by the workflow by default.\nGenerous resource allocations using labels.\nDoes not specify any method for software management and expects software to be available (or specified elsewhere).\n\nigenomes.config\n\nIncluded by the workflow by default.\nDefault configuration to access reference files stored on AWS iGenomes.\n\nmodules.config\n\nIncluded by the workflow by default.\nModule-specific configuration options (both mandatory and optional).\n\n\nNotably, configuration files can also contain the definition of one or more profiles. A profile is a set of configuration attributes that can be activated when launching a workflow by using the -profile command option:\nnextflow run nf-core/<workflow> -profile <profile>\nProfiles used by nf-core workflows include:\n\nSoftware management profiles\n\nProfiles for the management of software using software management tools, e.g., docker, singularity, and conda.\n\nTest profiles\n\nProfiles to execute the workflow with a standardized set of test data and parameters, e.g., test and test_full.\n\n\nMultiple profiles can be specified in a comma-separated (,) list when you execute your command. 
The order of profiles is important as they will be read from left to right:\nnextflow run nf-core/<workflow> -profile test,singularity\nnf-core workflows are required to define software containers and conda environments that can be activated using profiles.\n\n\n\n\n\n\nTip\n\n\n\nIf you’re computer has internet access and one of Conda, Singularity, or Docker installed, you should be able to run any nf-core workflow with the test profile and the respective software management profile ‘out of the box’. The test data profile will pull small test files directly from the nf-core/test-data GitHub repository and run it on your local system. The test profile is an important control to check the workflow is working as expected and is a great way to trial a workflow. Some workflows have multiple test profiles for you to test.\n\n\n\n\n\n\n\n\nKey points\n\n\n\n\nnf-core is a community effort to collect a curated set of analysis workflows built using Nextflow.\nNextflow can be used to pull nf-core workflows.\nnf-core workflows follow similar structures\nnf-core workflows are configured using parameters and profiles\n\n\n\n\nThese materials are adapted from Customising Nf-Core Workshop by Sydney Informatics Hub" }, { - "objectID": "workshops/4.1_draft_future_sess.html", - "href": "workshops/4.1_draft_future_sess.html", - "title": "Nextflow Development - Metadata Parsing", + "objectID": "workshops/2.2_troubleshooting.html", + "href": "workshops/2.2_troubleshooting.html", + "title": "Troubleshooting Nextflow run", "section": "", - "text": "Currently, we have defined the reads parameter as a string:\nparams.reads = \"/.../training/nf-training/data/ggal/gut_{1,2}.fq\"\nTo group the reads parameter, the fromFilePairs channel factory can be used. Add the following to the workflow block and run the workflow:\nreads_ch = Channel.fromFilePairs(\"$params.reads\")\nreads_ch.view()\nThe reads parameter is being converted into a file pair group using fromFilePairs, and is assigned to reads_ch. The reads_ch consists of a tuple of two items – the first is the grouping key of the matching pair (gut), and the second is a list of paths to each file:\n[gut, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\nGlob patterns can also be used to create channels of file pair groups. Inside the data directory, we have pairs of gut, liver, and lung files that can all be read into reads_ch.\n>>> ls \"/.../training/nf-training/data/ggal/\"\n\ngut_1.fq gut_2.fq liver_1.fq liver_2.fq lung_1.fq lung_2.fq transcriptome.fa\nRun the rnaseq.nf workflow specifying all .fq files inside /.../training/nf-training/data/ggal/ as the reads parameter via the command line:\nnextflow run rnaseq.nf --reads '/.../training/nf-training/data/ggal/*_{1,2}.fq'\nFile paths that include one or more wildcards (ie. *, ?, etc.) MUST be wrapped in single-quoted characters to avoid Bash expanding the glob on the command line.\nThe reads_ch now contains three tuple elements with unique grouping keys:\n[gut, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\n[liver, [/.../training/nf-training/data/ggal/liver_1.fq, /.../training/nf-training/data/ggal/liver_2.fq]]\n[lung, [/.../training/nf-training/data/ggal/lung_1.fq, /.../training/nf-training/data/ggal/lung_2.fq]]\nThe grouping key metadata can also be explicitly created without having to rely on file names, using the map channel operator. 
Let’s start by creating a samplesheet rnaseq_samplesheet.csv with column headings sample_name, fastq1, and fastq2, and fill in a custom sample_name, along with the paths to the .fq files.\nsample_name,fastq1,fastq2\ngut_sample,/.../training/nf-training/data/ggal/gut_1.fq,/.../training/nf-training/data/ggal/gut_2.fq\nliver_sample,/.../training/nf-training/data/ggal/liver_1.fq,/.../training/nf-training/data/ggal/liver_2.fq\nlung_sample,/.../training/nf-training/data/ggal/lung_1.fq,/.../training/nf-training/data/ggal/lung_2.fq\nLet’s now supply the path to rnaseq_samplesheet.csv to the reads parameter in rnaseq.nf.\nparams.reads = \"/.../rnaseq_samplesheet.csv\"\nPreviously, the reads parameter consisted of a string of the .fq files directly. Now, it is a string to a .csv file containing the .fq files. Therefore, the channel factory method that reads the input file also needs to be changed. Since the parameter is now a single file path, the fromPath method can first be used, which creates a channel of Path type object. The splitCsv channel operator can then be used to parse the contents of the channel.\nreads_ch = Channel.fromPath(params.reads)\nreads_ch.view()\n\nreads_ch = reads_ch.splitCsv(header:true)\nreads_ch.view()\nWhen using splitCsv in the above example, header is set to true. This will use the first line of the .csv file as the column names. Let’s run the pipeline containing the new input parameter.\n>>> nextflow run rnaseq.nf\n\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [distraught_avogadro] DSL2 - revision: 525e081ba2\nreads: rnaseq_samplesheet.csv\nreads: $params.reads\nexecutor > local (1)\n[4e/eeae2a] process > INDEX [100%] 1 of 1 ✔\n/.../rnaseq_samplesheet.csv\n[sample_name:gut_sample, fastq1:/.../training/nf-training/data/ggal/gut_1.fq, fastq2:/.../training/nf-training/data/ggal/gut_2.fq]\n[sample_name:liver_sample, fastq1:/.../training/nf-training/data/ggal/liver_1.fq, fastq2:/.../training/nf-training/data/ggal/liver_2.f]\n[sample_name:lung_sample, fastq1:/.../training/nf-training/data/ggal/lung_1.fq, fastq2:/.../training/nf-training/data/ggal/lung_2.fq]\nThe /.../rnaseq_samplesheet.csv is the output of reads_ch directly after the fromPath channel factory method was used. Here, the channel is a Path type object. After invoking the splitCsv channel operator, the reads_ch is now replaced with a channel consisting of three elements, where each element is a row in the .csv file, returned as a list. Since header was set to true, each element in the list is also mapped to the column names. This can be used when creating the custom grouping key.\nTo create grouping key metadata from the list output by splitCsv, the map channel operator can be used.\n reads_ch = reads_ch.map { row -> \n grp_meta = \"$row.sample_name\"\n [grp_meta, [row.fastq1, row.fastq2]]\n }\n reads_ch.view()\nHere, for each list in reads_ch, we assign it to a variable row. We then create custom grouping key metadata grp_meta based on the sample_name column from the .csv, which can be accessed via the row variable by . separation. After the custom metadata key is assigned, a tuple is created by assigning grp_meta as the first element, and the two .fq files as the second element, accessed via the row variable by . 
separation.\nLet’s run the pipeline containing the custom grouping key:\n>>> nextflow run rnaseq.nf\n\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [happy_torricelli] DSL2 - revision: e9e1499a97\nreads: rnaseq_samplesheet.csv\nreads: $params.reads\n[- ] process > INDEX -\n[gut_sample, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\n[liver_sample, [/home/sli/test/training/nf-training/data/ggal/liver_1.fq, /.../training/nf-training/data/ggal/liver_2.fq]]\n[lung_sample, [/.../training/nf-training/data/ggal/lung_1.fq, /.../training/nf-training/data/ggal/lung_2.fq]]\nThe custom grouping key can be created from multiple values in the samplesheet. For example, grp_meta = [sample : row.sample_name , file : row.fastq1] will create the metadata key using both the sample_name and fastq1 file names. The samplesheet can also be created to include multiple sample characteristics, such as lane, data_type, etc. Each of these characteristics can be used to ensure an adequate grouping key is created for that sample."
  },
  {
    "objectID": "workshops/2.2_troubleshooting.html",
    "href": "workshops/2.2_troubleshooting.html",
    "title": "Troubleshooting Nextflow run",
    "section": "",
    "text": "2.2.1. Nextflow log\nIt is important to keep a record of the commands you have run to generate your results. Nextflow helps with this by creating and storing metadata and logs about the run in hidden files and folders in your current directory (unless otherwise specified). This data can be used by Nextflow to generate reports. It can also be queried using the Nextflow log command:\nnextflow log\nThe log command has multiple options to facilitate queries and is especially useful while debugging a workflow and inspecting execution metadata. You can view all of the possible log options with the -h flag:\nnextflow log -h\nTo query a specific execution you can use the RUN NAME or a SESSION ID:\nnextflow log <run name>\nTo get more information, you can use the -f option with named fields. For example:\nnextflow log <run name> -f process,hash,duration\nThere are many other fields you can query. You can view a full list of fields with the -l option:\nnextflow log -l\n\n\n\n\n\n\nChallenge\n\n\n\nUse the log command to view the process, hash, and script fields for the tasks from your most recent Nextflow execution.\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nUse the log command to get a list of your recent executions:\nnextflow log\nTIMESTAMP DURATION RUN NAME STATUS REVISION ID SESSION ID COMMAND \n2023-11-21 22:43:14 14m 17s jovial_angela OK 3bec2331ca 319751c3-25a6-4085-845c-6da28cd771df nextflow run nf-core/rnaseq\n2023-11-21 23:05:49 1m 36s marvelous_shannon OK 3bec2331ca 319751c3-25a6-4085-845c-6da28cd771df nextflow run nf-core/rnaseq\n2023-11-21 23:10:00 1m 35s deadly_babbage OK 3bec2331ca 319751c3-25a6-4085-845c-6da28cd771df nextflow run nf-core/rnaseq\nQuery the process, hash, and script using the -f option for the most recent run:\nnextflow log marvelous_shannon -f process,hash,script\n\n[... truncated ...]\n\nNFCORE_RNASEQ:RNASEQ:SUBREAD_FEATURECOUNTS 7c/f936d4 \n featureCounts \\\n -B -C -g gene_biotype -t exon \\\n -p \\\n -T 2 \\\n -a chr22_with_ERCC92.gtf \\\n -s 2 \\\n -o HBR_Rep1_ERCC.featureCounts.txt \\\n HBR_Rep1_ERCC.markdup.sorted.bam\n\n cat <<-END_VERSIONS > versions.yml\n \"NFCORE_RNASEQ:RNASEQ:SUBREAD_FEATURECOUNTS\":\n subread: $( echo $(featureCounts -v 2>&1) | sed -e \"s/featureCounts v//g\")\n END_VERSIONS\n\n[... truncated ... 
]\n\nNFCORE_RNASEQ:RNASEQ:MULTIQC 7a/8449d7 \n multiqc \\\n -f \\\n \\\n \\\n .\n\n cat <<-END_VERSIONS > versions.yml\n \"NFCORE_RNASEQ:RNASEQ:MULTIQC\":\n multiqc: $( multiqc --version | sed -e \"s/multiqc, version //g\" )\n END_VERSIONS\n \n\n\n\n\n2.2.2. Execution cache and resume\nTask execution caching is an essential feature of modern workflow managers. As such, Nextflow provides an automated caching mechanism for every execution. When using the Nextflow -resume option, successfully completed tasks from previous executions are skipped and the previously cached results are used in downstream tasks.\nNextflow’s caching mechanism works by assigning a unique ID to each task. The task unique ID is generated as a 128-bit hash value composed of the complete file path, file size, and last modified timestamp. These IDs are used to create a separate execution directory where the tasks are executed and the outputs are stored. Nextflow will take care of the inputs and outputs in these folders for you.\nYou can re-launch the previously executed nf-core/rnaseq workflow with the -resume flag and observe the progress. Notice the time it takes to complete the workflow.\nnextflow run nf-core/rnaseq -r 3.11.1 \\\n --input samplesheet.csv \\\n --outdir ./my_results \\\n --fasta $materials/ref/chr22_with_ERCC92.fa \\\n --gtf $materials/ref/chr22_with_ERCC92.gtf \\\n -profile singularity \\\n --skip_markduplicates true \\\n --save_trimmed true \\\n --save_unaligned true \\\n --max_memory '6.GB' \\\n --max_cpus 2 \\\n -resume \n\n[80/ec6ff8] process > NFCORE_RNASEQ:RNASEQ:PREPARE_GENOME:GTF2BED (chr22_with_ERCC92.gtf) [100%] 1 of 1, cached: 1 ✔\n[1a/7bec9c] process > NFCORE_RNASEQ:RNASEQ:PREPARE_GENOME:GTF_GENE_FILTER (chr22_with_ERCC92.fa) [100%] 1 of 1, cached: 1 ✔\nExecuting this workflow will create a my_results directory with selected results files and add some further sub-directories into the work directory.\nIn the output above, the hexadecimal numbers, such as 80/ec6ff8, identify the unique task execution. 
These numbers are also the prefix of the work directories where each task is executed.\nYou can inspect the files produced by a task by looking inside the work directory and using these numbers to find the task-specific execution path:\nls work/80/ec6ff8ba69a8b5b8eede3679e9f978/\nIf you look inside the work directory of a FASTQC task, you will find the files that were staged and created when this task was executed:\n>>> ls -la work/e9/60b2e80b2835a3e1ad595d55ac5bf5/ \n\ntotal 15895\ndrwxrwxr-x 2 rlupat rlupat 4096 Nov 22 03:39 .\ndrwxrwxr-x 4 rlupat rlupat 4096 Nov 22 03:38 ..\n-rw-rw-r-- 1 rlupat rlupat 0 Nov 22 03:39 .command.begin\n-rw-rw-r-- 1 rlupat rlupat 9509 Nov 22 03:39 .command.err\n-rw-rw-r-- 1 rlupat rlupat 9609 Nov 22 03:39 .command.log\n-rw-rw-r-- 1 rlupat rlupat 100 Nov 22 03:39 .command.out\n-rw-rw-r-- 1 rlupat rlupat 10914 Nov 22 03:39 .command.run\n-rw-rw-r-- 1 rlupat rlupat 671 Nov 22 03:39 .command.sh\n-rw-rw-r-- 1 rlupat rlupat 231 Nov 22 03:39 .command.trace\n-rw-rw-r-- 1 rlupat rlupat 1 Nov 22 03:39 .exitcode\nlrwxrwxrwx 1 rlupat rlupat 63 Nov 22 03:39 HBR_Rep1_ERCC_1.fastq.gz -> HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz\n-rw-rw-r-- 1 rlupat rlupat 2368 Nov 22 03:39 HBR_Rep1_ERCC_1.fastq.gz_trimming_report.txt\n-rw-rw-r-- 1 rlupat rlupat 697080 Nov 22 03:39 HBR_Rep1_ERCC_1_val_1_fastqc.html\n-rw-rw-r-- 1 rlupat rlupat 490526 Nov 22 03:39 HBR_Rep1_ERCC_1_val_1_fastqc.zip\n-rw-rw-r-- 1 rlupat rlupat 6735205 Nov 22 03:39 HBR_Rep1_ERCC_1_val_1.fq.gz\nlrwxrwxrwx 1 rlupat rlupat 63 Nov 22 03:39 HBR_Rep1_ERCC_2.fastq.gz -> HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz\n-rw-rw-r-- 1 rlupat rlupat 2688 Nov 22 03:39 HBR_Rep1_ERCC_2.fastq.gz_trimming_report.txt\n-rw-rw-r-- 1 rlupat rlupat 695591 Nov 22 03:39 HBR_Rep1_ERCC_2_val_2_fastqc.html\n-rw-rw-r-- 1 rlupat rlupat 485732 Nov 22 03:39 HBR_Rep1_ERCC_2_val_2_fastqc.zip\n-rw-rw-r-- 1 rlupat rlupat 7088948 Nov 22 03:39 HBR_Rep1_ERCC_2_val_2.fq.gz\nlrwxrwxrwx 1 rlupat rlupat 102 Nov 22 03:39 HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz -> /data/seqliner/test-data/rna-seq/fastq/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz\nlrwxrwxrwx 1 rlupat rlupat 102 Nov 22 03:39 HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz -> /data/seqliner/test-data/rna-seq/fastq/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz\n-rw-rw-r-- 1 rlupat rlupat 109 Nov 22 03:39 versions.yml\nThe FASTQC process runs twice, executing in a different work directories for each set of inputs. Therefore, in the previous example, the work directory [e9/60b2e8] represents just one of the four sets of input data that was processed.\nIt’s very likely you will execute a workflow multiple times as you find the parameters that best suit your data. You can save a lot of spaces (and time) by resuming a workflow from the last step that was completed successfully and/or unmodified.\nIn practical terms, the workflow is executed from the beginning. However, before launching the execution of a process, Nextflow uses the task unique ID to check if the work directory already exists and that it contains a valid command exit state with the expected output files. If this condition is satisfied, the task execution is skipped and previously computed results are used as the process results.\nNotably, the -resume functionality is very sensitive. 
Even touching a file in the work directory can invalidate the cache.\n\n\n\n\n\n\nChallenge\n\n\n\nInvalidate the cache by touching a .fastq.gz file in a FASTQC task work directory (you can use the touch command). Execute the workflow again with the -resume option to show that the cache has been invalidated.\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nExecute the workflow for the first time (if you have not already).\nUse the task ID shown for the FASTQC process and use it to find and touch the HBR_Rep1_ERCC_1.fastq.gz file:\ntouch work/ff/21abfa87cc7cdec037ce4f36807d32/HBR_Rep1_ERCC_1.fastq.gz\nExecute the workflow again with the -resume command option:\nnextflow run nf-core/rnaseq -r 3.11.1 \\\n --input samplesheet.csv \\\n --outdir ./my_results \\\n --fasta $materials/ref/chr22_with_ERCC92.fa \\\n --gtf $materials/ref/chr22_with_ERCC92.gtf \\\n -profile singularity \\\n --skip_markduplicates true \\\n --save_trimmed true \\\n --save_unaligned true \\\n --max_memory '6.GB' \\\n --max_cpus 2 \\\n -resume \nYou should see that some tasks were invalidated and were executed again.\nWhy did this happen?\nIn this example, the caches of two FASTQC tasks were invalidated. The fastq file we touched is used in the pipeline in multiple places. Thus, touching the symlink for this file and changing the date of last modification disrupted two task executions.\n\n\n\n\n\n2.2.3. Troubleshoot warning and error messages\nWhile our previous workflow execution completed successfully, there were a couple of warning messages that may be cause for concern:\n-[nf-core/rnaseq] Pipeline completed successfully with skipped sampl(es)-\n-[nf-core/rnaseq] Please check MultiQC report: 2/2 samples failed strandedness check.-\nCompleted at: 20-Nov-2023 00:29:04\nDuration : 10m 15s\nCPU hours : 0.3 \nSucceeded : 72\n\n\n\n\n\n\nHandling dodgy error messages 🤬\n\n\n\nThe first warning message isn’t very descriptive (see this pull request). You might come across issues like this when running nf-core pipelines, too. Bug reports and user feedback are very important to open source software communities like nf-core. If you come across any issues, submit a GitHub issue or start a discussion in the relevant nf-core Slack channel so others are aware and it can be addressed by the pipeline’s developers.\n\n\n➤ Take a look at the MultiQC report, as directed by the second message. You can find the MultiQC report in the lesson2.1/ directory:\nls -la lesson2.1/multiqc/star_salmon/\ntotal 1402\ndrwxrwxr-x 4 rlupat rlupat 4096 Nov 22 00:29 .\ndrwxrwxr-x 3 rlupat rlupat 4096 Nov 22 00:29 ..\ndrwxrwxr-x 2 rlupat rlupat 8192 Nov 22 00:29 multiqc_data\ndrwxrwxr-x 5 rlupat rlupat 4096 Nov 22 00:29 multiqc_plots\n-rw-rw-r-- 1 rlupat rlupat 1419998 Nov 22 00:29 multiqc_report.html\n➤ Download the multiqc_report.html from the file navigator panel on the left side of your VS Code window by right-clicking on it and then selecting Download. Open the file on your computer.\nTake a look at the section labelled WARNING: Fail Strand Check\nThe warning we have received is indicating that the read strandedness we specified in our samplesheet.csv and the inferred strandedness identified by the RSeQC process in the pipeline do not match. 
It looks like the test samplesheet has incorrectly specified strandedness as forward in the samplesheet.csv when our raw reads actually show an equal distribution of sense and antisense reads.\nFor those who are not familiar with RNAseq data, incorrectly specified strandedness may negatively impact the read quantification step (process: Salmon quant) and give us inaccurate results. So, let’s clarify how the Salmon quant process is gathering strandedness information for our input files by default and find a way to address this with the parameters provided by the nf-core/rnaseq pipeline.\n\n\n\n2.2.4. Identify the run command for a process\nTo observe exactly what command is being run for a process, we can attempt to infer this information from the module’s main.nf script in the modules/ directory. However, given all the different parameters that may be applied at the process level, this may not be very clear.\n➤ Take a look at the Salmon quant main.nf file:\nnf-core-rnaseq-3.11.1/workflow/modules/nf-core/salmon/quant/main.nf\nUnless you are familiar with developing nf-core pipelines, it can be very hard to see what is actually happening in the code, given all the different variables and conditional arguments inside this script. Above the script block we can see strandedness is being applied using a few different conditional arguments. Instead of trying to infer how the $strandedness variable is being defined and applied to the process, let’s use the hidden command files saved for this task in the work/ directory.\n\n\n\n\n\n\nHidden files in the work directory!\n\n\n\nRemember that the pipeline’s results are cached in the work directory. In addition to the cached files, each task execution directory inside the work directory contains a number of hidden files:\n\n.command.sh: The command script run for the task.\n.command.run: The command wrapper used to run the task.\n.command.out: The task’s standard output log.\n.command.err: The task’s standard error log.\n.command.log: The wrapper execution output.\n.command.begin: A file created as soon as the job is launched.\n.exitcode: A file containing the task exit code (0 if successful)\n\n\n\nThe nextflow log command that we discussed previously has multiple options to facilitate queries and is especially useful while debugging a pipeline and inspecting pipeline execution metadata.\nTo understand how Salmon quant is interpreting strandedness, we’re going to use this command to track down the hidden .command.sh scripts for each Salmon quant task that was run. This will allow us to find out how Salmon quant handles strandedness and if there is a way for us to override this.\n➤ Use the Nextflow log command with the unique run name of the previously executed pipeline:\nnextflow log <run-name>\nThat command will list out all the work subdirectories for all processes run.\nWe now need to find the specific hidden .command.sh for the Salmon tasks. But how to find them? 🤔\n➤ Let’s add some custom bash code to query a Nextflow run with the run name from the previous lesson. First, save your run name in a bash variable. 
For example:\nrun_name=marvelous_shannon\n➤ And let’s save the tool of interest (salmon) in another bash variable to pull it from a run command:\ntool=salmon\n➤ Next, run the following bash command:\nnextflow log ${run_name} | while read line;\n do\n cmd=$(ls ${line}/.command.sh 2>/dev/null);\n if grep -q $tool $cmd;\n then \n echo $cmd; \n fi; \n done \nThat will list all process .command.sh scripts containing ‘salmon’. There are a few different processes that run Salmon to perform other steps in the workflow. We are looking for Salmon quant which performs the read quantification:\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/57/fba8f9a2385dac5fa31688ba1afa9b/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/30/0113a58c14ca8d3099df04ebf388f3/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/ec/95d6bd12d578c3bce22b5de4ed43fe/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/49/6fedcb09e666432ae6ddf8b1e8f488/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/b4/2ca8d05b049438262745cde92955e9/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/38/875d68dae270504138bb3d72d511a7/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/72/776810a99695b1c114cbb103f4a0e6/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/1c/dc3f54cc7952bf55e6742dd4783392/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/f3/5116a5b412bde7106645671e4c6ffb/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/17/fb0c791810f42a438e812d5c894ebf/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/4c/931a9b60b2f3cf770028854b1c673b/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/91/e1c99d8acb5adf295b37fd3bbc86a5/.command.sh\nCompared with the salmon quant main.nf file, we get a lot more fine scale details from the .command.sh process scripts:\n>>> cat main.nf\nsalmon quant \\\\\n --geneMap $gtf \\\\\n --threads $task.cpus \\\\\n --libType=$strandedness \\\\\n $reference \\\\\n $input_reads \\\\\n $args \\\\\n -o $prefix\n>>> cat .command.sh\nsalmon quant \\\n --geneMap chr22_with_ERCC92.gtf \\\n --threads 2 \\\n --libType=ISF \\\n -t genome.transcripts.fa \\\n -a HBR_Rep1_ERCC.Aligned.toTranscriptome.out.bam \\\n \\\n -o HBR_Rep1_ERCC\nLooking at the nf-core/rnaseq Parameter documentation and Salmon documentation, we found that we can override this default using the --salmon_quant_libtype A parameter to indicate our data is unstranded and override samplesheet.csv input.\n\n\n\n\n\n\nHow do I get rid of the strandedness check warning message?\n\n\n\nIf we want to get rid of the warning message Please check MultiQC report: 2/2 samples failed strandedness check, we’ll have to change the strandedness fields in our samplesheet.csv. Keep in mind, doing this will invalidate the pipeline’s cache and cause the pipeline to run from the beginning.\n\n\n\n\n\n2.2.5. Write a parameter file\nFrom the previous section we learn that Nextflow accepts either yaml or json formats for parameter files. Any of the pipeline-specific parameters can be supplied to a Nextflow pipeline in this way.\n\n\n\n\n\n\nChallenge\n\n\n\nFill in the parameters file below and save as workshop-params.yaml. 
This time, include the --salmon_quant_libtype A parameter.\n💡 YAML formatting tips!\n\nStrings need to be inside double quotes\nBooleans (true/false) and numbers do not require quotes\n\ninput: \"\"\noutdir: \"lesson2.2\"\nfasta: \"\"\ngtf: \"\"\nstar_index: \"\"\nsalmon_index: \"\"\nskip_markduplicates: \nsave_trimmed: \nsave_unaligned: \nsalmon_quant_libtype: \"A\" \n\n\n\n\n2.2.6. Apply the parameter file\n➤ Once your params file has been saved, run:\nnextflow run nf-core/rnaseq -r 3.11.1 \\\n -params-file workshop-params.yaml\n -profile singularity \\\n --max_memory '6.GB' \\\n --max_cpus 2 \\\n -resume \nThe number of pipeline-specific parameters we’ve added to our run command has been significantly reduced. The only -- parameters we’ve provided to the run command relate to how the pipeline is executed on our interative job. These resource limits won’t be applicable to others who will run the pipeline on a different infrastructure.\nAs the workflow runs a second time, you will notice 4 things:\n\nThe command is much tidier thanks to offloading some parameters to the params file\nThe -resume flag. Nextflow has lots of run options including the ability to use cached output!\nSome processes will be pulled from the cache. These processes remain unaffected by our addition of a new parameter.\n\nThis run of the pipeline will complete in a much shorter time.\n\n-[nf-core/rnaseq] Pipeline completed successfully with skipped sampl(es)-\n-[nf-core/rnaseq] Please check MultiQC report: 2/2 samples failed strandedness check.-\nCompleted at: 21-Apr-2023 05:58:06\nDuration : 1m 51s\nCPU hours : 0.3 (82.2% cached)\nSucceeded : 11\nCached : 55\n\n\nThese materials are adapted from Customising Nf-Core Workshop by Sydney Informatics Hub" }, { - "objectID": "workshops/4.1_draft_future_sess.html#metadata-parsing", - "href": "workshops/4.1_draft_future_sess.html#metadata-parsing", - "title": "Nextflow Development - Metadata Parsing", + "objectID": "workshops/1.1_intro_nextflow.html", + "href": "workshops/1.1_intro_nextflow.html", + "title": "Introduction to Nextflow", "section": "", - "text": "Currently, we have defined the reads parameter as a string:\nparams.reads = \"/.../training/nf-training/data/ggal/gut_{1,2}.fq\"\nTo group the reads parameter, the fromFilePairs channel factory can be used. Add the following to the workflow block and run the workflow:\nreads_ch = Channel.fromFilePairs(\"$params.reads\")\nreads_ch.view()\nThe reads parameter is being converted into a file pair group using fromFilePairs, and is assigned to reads_ch. The reads_ch consists of a tuple of two items – the first is the grouping key of the matching pair (gut), and the second is a list of paths to each file:\n[gut, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\nGlob patterns can also be used to create channels of file pair groups. Inside the data directory, we have pairs of gut, liver, and lung files that can all be read into reads_ch.\n>>> ls \"/.../training/nf-training/data/ggal/\"\n\ngut_1.fq gut_2.fq liver_1.fq liver_2.fq lung_1.fq lung_2.fq transcriptome.fa\nRun the rnaseq.nf workflow specifying all .fq files inside /.../training/nf-training/data/ggal/ as the reads parameter via the command line:\nnextflow run rnaseq.nf --reads '/.../training/nf-training/data/ggal/*_{1,2}.fq'\nFile paths that include one or more wildcards (ie. *, ?, etc.) 
MUST be wrapped in single-quoted characters to avoid Bash expanding the glob on the command line.\nThe reads_ch now contains three tuple elements with unique grouping keys:\n[gut, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\n[liver, [/.../training/nf-training/data/ggal/liver_1.fq, /.../training/nf-training/data/ggal/liver_2.fq]]\n[lung, [/.../training/nf-training/data/ggal/lung_1.fq, /.../training/nf-training/data/ggal/lung_2.fq]]\nThe grouping key metadata can also be explicitly created without having to rely on file names, using the map channel operator. Let’s start by creating a samplesheet rnaseq_samplesheet.csv with column headings sample_name, fastq1, and fastq2, and fill in a custom sample_name, along with the paths to the .fq files.\nsample_name,fastq1,fastq2\ngut_sample,/.../training/nf-training/data/ggal/gut_1.fq,/.../training/nf-training/data/ggal/gut_2.fq\nliver_sample,/.../training/nf-training/data/ggal/liver_1.fq,/.../training/nf-training/data/ggal/liver_2.fq\nlung_sample,/.../training/nf-training/data/ggal/lung_1.fq,/.../training/nf-training/data/ggal/lung_2.fq\nLet’s now supply the path to rnaseq_samplesheet.csv to the reads parameter in rnaseq.nf.\nparams.reads = \"/.../rnaseq_samplesheet.csv\"\nPreviously, the reads parameter consisted of a string of the .fq files directly. Now, it is a string to a .csv file containing the .fq files. Therefore, the channel factory method that reads the input file also needs to be changed. Since the parameter is now a single file path, the fromPath method can first be used, which creates a channel of Path type object. The splitCsv channel operator can then be used to parse the contents of the channel.\nreads_ch = Channel.fromPath(params.reads)\nreads_ch.view()\n\nreads_ch = reads_ch.splitCsv(header:true)\nreads_ch.view()\nWhen using splitCsv in the above example, header is set to true. This will use the first line of the .csv file as the column names. Let’s run the pipeline containing the new input parameter.\n>>> nextflow run rnaseq.nf\n\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [distraught_avogadro] DSL2 - revision: 525e081ba2\nreads: rnaseq_samplesheet.csv\nreads: $params.reads\nexecutor > local (1)\n[4e/eeae2a] process > INDEX [100%] 1 of 1 ✔\n/.../rnaseq_samplesheet.csv\n[sample_name:gut_sample, fastq1:/.../training/nf-training/data/ggal/gut_1.fq, fastq2:/.../training/nf-training/data/ggal/gut_2.fq]\n[sample_name:liver_sample, fastq1:/.../training/nf-training/data/ggal/liver_1.fq, fastq2:/.../training/nf-training/data/ggal/liver_2.f]\n[sample_name:lung_sample, fastq1:/.../training/nf-training/data/ggal/lung_1.fq, fastq2:/.../training/nf-training/data/ggal/lung_2.fq]\nThe /.../rnaseq_samplesheet.csv is the output of reads_ch directly after the fromPath channel factory method was used. Here, the channel is a Path type object. After invoking the splitCsv channel operator, the reads_ch is now replaced with a channel consisting of three elements, where each element is a row in the .csv file, returned as a list. Since header was set to true, each element in the list is also mapped to the column names. This can be used when creating the custom grouping key.\nTo create grouping key metadata from the list output by splitCsv, the map channel operator can be used.\n reads_ch = reads_ch.map { row -> \n grp_meta = \"$row.sample_name\"\n [grp_meta, [row.fastq1, row.fastq2]]\n }\n reads_ch.view()\nHere, for each list in reads_ch, we assign it to a variable row. 
We then create custom grouping key metadata grp_meta based on the sample_name column from the .csv, which can be accessed via the row variable by . separation. After the custom metadata key is assigned, a tuple is created by assigning grp_meta as the first element, and the two .fq files as the second element, accessed via the row variable by . separation.\nLet’s run the pipeline containing the custom grouping key:\n>>> nextflow run rnaseq.nf\n\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [happy_torricelli] DSL2 - revision: e9e1499a97\nreads: rnaseq_samplesheet.csv\nreads: $params.reads\n[- ] process > INDEX -\n[gut_sample, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\n[liver_sample, [/home/sli/test/training/nf-training/data/ggal/liver_1.fq, /.../training/nf-training/data/ggal/liver_2.fq]]\n[lung_sample, [/.../training/nf-training/data/ggal/lung_1.fq, /.../training/nf-training/data/ggal/lung_2.fq]]\nThe custom grouping key can be created from multiple values in the samplesheet. For example, grp_meta = [sample : row.sample_name , file : row.fastq1] will create the metadata key using both the sample_name and fastq1 file names. The samplesheet can also be created to include multiple sample characteristics, such as lane, data_type, etc. Each of these characteristics can be used to ensure an adequte grouping key is creaed for that sample." + "text": "Objectives\n\n\n\n\nLearn about the benefits of a workflow manager.\nLearn Nextflow terminology.\nLearn basic commands and options to run a Nextflow workflow" + }, + { + "objectID": "workshops/1.1_intro_nextflow.html#footnotes", + "href": "workshops/1.1_intro_nextflow.html#footnotes", + "title": "Introduction to Nextflow", + "section": "Footnotes", + "text": "Footnotes\n\n\nhttps://www.lexico.com/definition/workflow↩︎" }, { "objectID": "workshops/3.1_creating_a_workflow.html", @@ -63,25 +70,207 @@ "text": "Creating an RNAseq Workflow\n\n\n\n\n\n\nObjectives\n\n\n\n\nDevelop a Nextflow workflow\nRead data of different types into a Nextflow workflow\nOutput Nextflow process results to a predefined directory\n\n\n\n\n4.1.1. Define Workflow Parameters\nLet’s create a Nextflow script rnaseq.nf for a RNA-seq workflow. The code begins with a shebang, which declares Nextflow as the interpreter.\n#!/usr/bin/env nextflow\nOne way to define the workflow parameters is inside the Nextflow script.\nparams.reads = \"/.../training/nf-training/data/ggal/*_{1,2}.fq\"\nparams.transcriptome_file = \"/.../training/nf-training/data/ggal/transcriptome.fa\"\nparams.multiqc = \"/.../training/nf-training/multiqc\"\n\nprintln \"reads: $params.reads\"\nWorkflow parameters can be defined and accessed inside the Nextflow script by prepending the prefix params to a variable name, separated by a dot character, eg. params.reads.\nDifferent data types can be assigned as a parameter in Nextflow. The reads parameter is defined as multiple .fq files. The transcriptome_file parameter is defined as one file, /.../training/nf-training/data/ggal/transcriptome.fa. The multiqc parameter is defined as a directory, /.../training/nf-training/data/ggal/multiqc.\nThe Groovy println command is then used to print the contents of the reads parameter, which is access with the $ character.\nRun the script:\n>>> nextflow run rnaseq.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [astonishing_raman] DSL2 - revision: 8c9adc1772\nreads: /.../training/nf-training/data/ggal/*_{1,2}.fq\n\n\n\n4.1.2. 
Create a transcriptome index file\nCommands or scripts can be executed inside a process.\nprocess INDEX {\n input:\n path transcriptome\n\n output:\n path \"salmon_idx\"\n\n script:\n \"\"\"\n salmon index --threads $task.cpus -t $transcriptome -i salmon_idx\n \"\"\"\n}\nThe INDEX process takes an input path, and assigns that input as the variable transcriptome. The path type qualifier will allow Nextflow to stage the files in the process execution directory, where they can be accessed by the script via the defined variable name, ie. transcriptome. The code between the three double-quotes of the script block will be executed, and accesses the input transcriptome variable using $. The output is a path, with a filename salmon_idx. The output path can also be defined using wildcards, eg. path \"*_idx\".\nNote that the name of the input file is not used and is only referenced by the input variable name. This feature allows pipeline tasks to be self-contained and decoupled from the execution environment. As best practice, avoid referencing files that are not defined in the process script.\nTo execute the INDEX process, a workflow scope will need to be added.\nworkflow {\n index_ch = INDEX(params.transcriptome_file)\n}\nHere, the params.transcriptome_file parameter we defined earlier in the Nextflow script is used as an input into the INDEX process. The output of the process is assigned to the index_ch channel.\nRun the Nextflow script:\n>>> nextflow run rnaseq.nf\n\nERROR ~ Error executing process > 'INDEX'\n\nCaused by:\n Process `INDEX` terminated with an error exit status (127)\n\nCommand executed:\n\n salmon index --threads 1 -t transcriptome.fa -i salmon_index\n\nCommand exit status:\n 127\n\nCommand output:\n (empty)\n\nCommand error:\n .command.sh: line 2: salmon: command not found\n\nWork dir:\n /.../work/85/495a21afcaaf5f94780aff6b2a964c\n\nTip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`\n\n -- Check '.nextflow.log' file for details\nWhen a process execution exits with a non-zero exit status, the workflow will be stopped. Nextflow will output the cause of the error, the command that caused the error, the exit status, the standard output (if available), the command standard error, and the work directory where the process was executed.\nLet’s first look inside the process execution directory:\n>>> ls -a /.../work/85/495a21afcaaf5f94780aff6b2a964c \n\n. .command.begin .command.log .command.run .exitcode\n.. .command.err .command.out .command.sh transcriptome.fa\nWe can see that the input file transcriptome.fa has been staged inside this process execution directory by being symbolically linked. This allows it to be accessed by the script.\nInside the .command.err file, we can see that the salmon command was not found, resulting in the termination of the Nextflow workflow.\nSingularity containers can be used to execute the process within an environment that contains the package of interest. 
Create a config file nextflow.config containing the following:\nsingularity {\n enabled = true\n autoMounts = true\n cacheDir = \"/config/binaries/singularity/containers_devel/nextflow\"\n}\nThe container process directive can be used to specify the required container:\nprocess INDEX {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\"\n\n input:\n path transcriptome\n\n output:\n path \"salmon_idx\"\n\n script:\n \"\"\"\n salmon index --threads $task.cpus -t $transcriptome -i salmon_idx\n \"\"\"\n}\nRun the Nextflow script:\n>>> nextflow run rnaseq.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [distraught_goldwasser] DSL2 - revision: bdebf34e16\nexecutor > local (1)\n[37/7ef8f0] process > INDEX [100%] 1 of 1 ✔\nThe newly created nextflow.config file does not need to be specified in the nextflow run command. This file is automatically searched for and used by Nextflow.\nAn alternative to singularity containers is the use of a module. Since the script block is executed as a Bash script, it can contain any command or script normally executed on the command line. If there is a module present in the host environment, it can be loaded as part of the process script.\nprocess INDEX {\n input:\n path transcriptome\n\n output:\n path \"salmon_idx\"\n\n script:\n \"\"\"\n module purge\n module load salmon/1.3.0\n\n salmon index --threads $task.cpus -t $transcriptome -i salmon_idx\n \"\"\"\n}\nRun the Nextflow script:\n>>> nextflow run rnaseq.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [reverent_liskov] DSL2 - revision: b74c22049d\nexecutor > local (1)\n[ba/3c12ab] process > INDEX [100%] 1 of 1 ✔\n\n\n\n4.1.3. Collect Read Files By Pairs\nPreviously, we have defined the reads parameter to be the following:\nparams.reads = \"/.../training/nf-training/data/ggal/*_{1,2}.fq\"\nChallenge: Convert the reads parameter into a tuple channel called reads_ch, where the first element is a unique grouping key, and the second element is the paired .fq files. Then, view the contents of reads_ch\n\n\n\n\n\n\nAnswer\n\n\n\n\n\nreads_ch = Channel.fromFilePairs(\"$params.reads\")\nreads_ch.view()\nThe fromFilePairs channel factory will automatically group input files into a tuple with a unique grouping key. The view() channel operator can be used to view the contents of the channel.\n>>> nextflow run rnaseq.nf\n\n[gut, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\n[liver, [/.../training/nf-training/data/ggal/liver_1.fq, /.../training/nf-training/data/ggal/liver_2.fq]]\n[lung, [/.../training/nf-training/data/ggal/lung_1.fq, /.../training/nf-training/data/ggal/lung_2.fq]]\n\n\n\n\n\n4.1.4. Perform Expression Quantification\nLet’s add a new process QUANTIFICATION that uses both the indexed transcriptome file and the .fq file pairs to execute the salmon quant command.\nprocess QUANTIFICATION {\n input:\n path salmon_index\n tuple val(sample_id), path(reads)\n\n output:\n path \"$sample_id\"\n\n script:\n \"\"\"\n salmon quant --threads $task.cpus --libType=U \\\n -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id\n \"\"\"\n}\nThe QUANTIFICATION process takes two inputs: the first is the path to the salmon_index created from the INDEX process. The second input is set to match the output of fromFilePairs – a tuple where the first element is a value (ie. 
grouping key), and the second element is a list of paths to the .fq reads.\nIn the script block, the salmon quant command saves the output of the tool as $sample_id. This output is emitted by the QUANTIFICATION process, using $ to access the Nextflow variable.\nChallenge:\nSet the following as the execution container for QUANTIFICATION:\n/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\nAssign index_ch and reads_ch as the inputs to this process, and emit the process outputs as quant_ch. View the contents of quant_ch\n\n\n\n\n\n\nAnswer\n\n\n\n\n\nTo assign a container to a process, the container directive can be used.\nprocess QUANTIFICATION {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\"\n\n input:\n path salmon_index\n tuple val(sample_id), path(reads)\n\n output:\n path \"$sample_id\"\n\n script:\n \"\"\"\n salmon quant --threads $task.cpus --libType=U \\\n -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id\n \"\"\"\n}\nTo run the QUANTIFICATION process and emit the outputs as quant_ch, the following can be added to the end of the workflow block:\nquant_ch = QUANTIFICATION(index_ch, reads_ch)\nquant_ch.view()\nThe script can now be run:\n>>> nextflow run rnaseq.nf \nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [elated_cray] DSL2 - revision: abe41f4f69\nexecutor > local (4)\n[e5/e75095] process > INDEX [100%] 1 of 1 ✔\n[4c/68a000] process > QUANTIFICATION (1) [100%] 3 of 3 ✔\n/.../work/b1/d861d26d4d36864a17d2cec8d67c80/liver\n/.../work/b4/a6545471c1f949b2723d43a9cce05f/lung\n/.../work/4c/68a000f7c6503e8ae1fe4d0d3c93d8/gut\nIn the Nextflow output, we can see that the QUANTIFICATION process has been run three times, since the reads_ch consists of three elements. Nextflow will automatically run the QUANTIFICATION process on each of the elements in the input channel, creating separate process execution work directories for each execution.\n\n\n\n\n\n4.1.5. Quality Control\nNow, let’s implement a FASTQC quality control process for the input fastq reads.\nChallenge:\nCreate a process called FASTQC that takes reads_ch as an input, and declares the process input to be a tuple matching the structure of reads_ch, where the first element is assigned the variable sample_id, and the second element is assigned the variable reads. This FASTQC process will first create an output directory fastqc_${sample_id}_logs, then perform fastqc on the input reads and save the results in the newly created directory fastqc_${sample_id}_logs:\nmkdir fastqc_${sample_id}_logs\nfastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}\nTake fastqc_${sample_id}_logs as the output of the process, and assign it to the channel fastqc_ch. Finally, specify the process container to be the following:\n/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-fastqc-0.12.1--hdfd78af_0.img\n\n\n\n\n\n\nAnswer\n\n\n\n\n\nThe process FASTQC is created in rnaseq.nf. Since the input channel is a tuple, the process input declaration is a tuple containing elements that match the structure of the incoming channel. The first element of the tuple is assigned the variable sample_id, and the second element of the tuple is assigned the variable reads. 
The relevant container is specified using the container process directive.\nprocess FASTQC {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-fastqc-0.12.1--hdfd78af_0.img\"\n\n input:\n tuple val(sample_id), path(reads)\n\n output:\n path \"fastqc_${sample_id}_logs\"\n\n script:\n \"\"\"\n mkdir fastqc_${sample_id}_logs\n fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}\n \"\"\"\n}\nIn the workflow scope, the following can be added:\nfastqc_ch = FASTQC(reads_ch)\nThe FASTQC process is called, taking reads_ch as an input. The output of the process is assigned to be fastqc_ch.\n>>> nextflow run rnaseq.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [sad_jennings] DSL2 - revision: cfae7ccc0e\nexecutor > local (7)\n[b5/6bece3] process > INDEX [100%] 1 of 1 ✔\n[32/46f20b] process > QUANTIFICATION (3) [100%] 3 of 3 ✔\n[44/27aa8d] process > FASTQC (2) [100%] 3 of 3 ✔\nIn the Nextflow output, we can see that the FASTQC process has been run three times as expected, since the reads_ch consists of three elements.\n\n\n\n\n\n4.1.6. MultiQC Report\nSo far, the generated outputs have all been saved inside the Nextflow work directory. For the FASTQC process, the specified output directory is only created inside the process execution directory. To save results to a specified folder, the publishDir process directive can be used.\nLet’s create a new MULTIQC process in our workflow that takes the outputs from the QUANTIFICATION and FASTQC processes to create a final report using the multiqc tool, and publish the process outputs to a directory outside of the process execution directory.\nprocess MULTIQC {\n publishDir params.outdir, mode:'copy'\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-multiqc-1.21--pyhdfd78af_0.img\"\n\n input:\n path quantification\n path fastqc\n\n output:\n path \"*.html\"\n\n script:\n \"\"\"\n multiqc . --filename $quantification\n \"\"\"\n}\nIn the MULTIQC process, the multiqc command is performed on both quantification and fastqc inputs, and publishes the report to a directory defined by the outdir parameter. Only files that match the declaration in the output block are published, not all the outputs of a process. By default, files are published to the target folder creating a symbolic link to the file produced in the process execution directory. This behavior can be modified using the mode option, eg. 
copy, which copies the file from the process execution directory to the specified output directory.\nAdd the following to the end of workflow scope:\nmultiqc_ch = MULTIQC(quant_ch, fastqc_ch)\nRun the pipeline, specifying an output directory using the outdir parameter:\nnextflow run rnaseq.nf --outdir \"results\"\nA results directory containing the output multiqc reports will be created outside of the process execution directory.\n>>> ls results\ngut.html liver.html lung.html\n\n\n\n\n\n\n\nKey points\n\n\n\n\nCommands or scripts can be executed inside a process\nEnvironments can be defined using the container process directive\nThe input declaration for a process must match the structure of the channel that is being passed into that process\n\n\n\n\nThis workshop is adapted from Fundamentals Training, Advanced Training, Developer Tutorials, and Nextflow Patterns materials from Nextflow and nf-core\n^*Draft for Future Sessions" }, { - "objectID": "workshops/1.1_intro_nextflow.html", - "href": "workshops/1.1_intro_nextflow.html", - "title": "Introduction to Nextflow", + "objectID": "workshops/4.1_draft_future_sess.html", + "href": "workshops/4.1_draft_future_sess.html", + "title": "Nextflow Development - Metadata Parsing", "section": "", - "text": "Objectives\n\n\n\n\nLearn about the benefits of a workflow manager.\nLearn Nextflow terminology.\nLearn basic commands and options to run a Nextflow workflow" + "text": "Currently, we have defined the reads parameter as a string:\nparams.reads = \"/.../training/nf-training/data/ggal/gut_{1,2}.fq\"\nTo group the reads parameter, the fromFilePairs channel factory can be used. Add the following to the workflow block and run the workflow:\nreads_ch = Channel.fromFilePairs(\"$params.reads\")\nreads_ch.view()\nThe reads parameter is being converted into a file pair group using fromFilePairs, and is assigned to reads_ch. The reads_ch consists of a tuple of two items – the first is the grouping key of the matching pair (gut), and the second is a list of paths to each file:\n[gut, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\nGlob patterns can also be used to create channels of file pair groups. Inside the data directory, we have pairs of gut, liver, and lung files that can all be read into reads_ch.\n>>> ls \"/.../training/nf-training/data/ggal/\"\n\ngut_1.fq gut_2.fq liver_1.fq liver_2.fq lung_1.fq lung_2.fq transcriptome.fa\nRun the rnaseq.nf workflow specifying all .fq files inside /.../training/nf-training/data/ggal/ as the reads parameter via the command line:\nnextflow run rnaseq.nf --reads '/.../training/nf-training/data/ggal/*_{1,2}.fq'\nFile paths that include one or more wildcards (ie. *, ?, etc.) MUST be wrapped in single-quoted characters to avoid Bash expanding the glob on the command line.\nThe reads_ch now contains three tuple elements with unique grouping keys:\n[gut, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\n[liver, [/.../training/nf-training/data/ggal/liver_1.fq, /.../training/nf-training/data/ggal/liver_2.fq]]\n[lung, [/.../training/nf-training/data/ggal/lung_1.fq, /.../training/nf-training/data/ggal/lung_2.fq]]\nThe grouping key metadata can also be explicitly created without having to rely on file names, using the map channel operator. 
Let’s start by creating a samplesheet rnaseq_samplesheet.csv with column headings sample_name, fastq1, and fastq2, and fill in a custom sample_name, along with the paths to the .fq files.\nsample_name,fastq1,fastq2\ngut_sample,/.../training/nf-training/data/ggal/gut_1.fq,/.../training/nf-training/data/ggal/gut_2.fq\nliver_sample,/.../training/nf-training/data/ggal/liver_1.fq,/.../training/nf-training/data/ggal/liver_2.fq\nlung_sample,/.../training/nf-training/data/ggal/lung_1.fq,/.../training/nf-training/data/ggal/lung_2.fq\nLet’s now supply the path to rnaseq_samplesheet.csv to the reads parameter in rnaseq.nf.\nparams.reads = \"/.../rnaseq_samplesheet.csv\"\nPreviously, the reads parameter consisted of a string of the .fq files directly. Now, it is a string to a .csv file containing the .fq files. Therefore, the channel factory method that reads the input file also needs to be changed. Since the parameter is now a single file path, the fromPath method can first be used, which creates a channel of Path type object. The splitCsv channel operator can then be used to parse the contents of the channel.\nreads_ch = Channel.fromPath(params.reads)\nreads_ch.view()\n\nreads_ch = reads_ch.splitCsv(header:true)\nreads_ch.view()\nWhen using splitCsv in the above example, header is set to true. This will use the first line of the .csv file as the column names. Let’s run the pipeline containing the new input parameter.\n>>> nextflow run rnaseq.nf\n\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [distraught_avogadro] DSL2 - revision: 525e081ba2\nreads: rnaseq_samplesheet.csv\nreads: $params.reads\nexecutor > local (1)\n[4e/eeae2a] process > INDEX [100%] 1 of 1 ✔\n/.../rnaseq_samplesheet.csv\n[sample_name:gut_sample, fastq1:/.../training/nf-training/data/ggal/gut_1.fq, fastq2:/.../training/nf-training/data/ggal/gut_2.fq]\n[sample_name:liver_sample, fastq1:/.../training/nf-training/data/ggal/liver_1.fq, fastq2:/.../training/nf-training/data/ggal/liver_2.f]\n[sample_name:lung_sample, fastq1:/.../training/nf-training/data/ggal/lung_1.fq, fastq2:/.../training/nf-training/data/ggal/lung_2.fq]\nThe /.../rnaseq_samplesheet.csv is the output of reads_ch directly after the fromPath channel factory method was used. Here, the channel is a Path type object. After invoking the splitCsv channel operator, the reads_ch is now replaced with a channel consisting of three elements, where each element is a row in the .csv file, returned as a list. Since header was set to true, each element in the list is also mapped to the column names. This can be used when creating the custom grouping key.\nTo create grouping key metadata from the list output by splitCsv, the map channel operator can be used.\n reads_ch = reads_ch.map { row -> \n grp_meta = \"$row.sample_name\"\n [grp_meta, [row.fastq1, row.fastq2]]\n }\n reads_ch.view()\nHere, for each list in reads_ch, we assign it to a variable row. We then create custom grouping key metadata grp_meta based on the sample_name column from the .csv, which can be accessed via the row variable by . separation. After the custom metadata key is assigned, a tuple is created by assigning grp_meta as the first element, and the two .fq files as the second element, accessed via the row variable by . 
separation.\nLet’s run the pipeline containing the custom grouping key:\n>>> nextflow run rnaseq.nf\n\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [happy_torricelli] DSL2 - revision: e9e1499a97\nreads: rnaseq_samplesheet.csv\nreads: $params.reads\n[- ] process > INDEX -\n[gut_sample, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\n[liver_sample, [/home/sli/test/training/nf-training/data/ggal/liver_1.fq, /.../training/nf-training/data/ggal/liver_2.fq]]\n[lung_sample, [/.../training/nf-training/data/ggal/lung_1.fq, /.../training/nf-training/data/ggal/lung_2.fq]]\nThe custom grouping key can be created from multiple values in the samplesheet. For example, grp_meta = [sample : row.sample_name , file : row.fastq1] will create the metadata key using both the sample_name and fastq1 file names. The samplesheet can also be created to include multiple sample characteristics, such as lane, data_type, etc. Each of these characteristics can be used to ensure an adequate grouping key is created for that sample."
  },
  {
    "objectID": "workshops/4.1_draft_future_sess.html#metadata-parsing",
    "href": "workshops/4.1_draft_future_sess.html#metadata-parsing",
    "title": "Nextflow Development - Metadata Parsing",
    "section": "Metadata Parsing",
    "text": "Currently, we have defined the reads parameter as a string:\nparams.reads = \"/.../training/nf-training/data/ggal/gut_{1,2}.fq\"\nTo group the reads parameter, the fromFilePairs channel factory can be used. Add the following to the workflow block and run the workflow:\nreads_ch = Channel.fromFilePairs(\"$params.reads\")\nreads_ch.view()\nThe reads parameter is being converted into a file pair group using fromFilePairs, and is assigned to reads_ch. The reads_ch consists of a tuple of two items – the first is the grouping key of the matching pair (gut), and the second is a list of paths to each file:\n[gut, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\nGlob patterns can also be used to create channels of file pair groups. Inside the data directory, we have pairs of gut, liver, and lung files that can all be read into reads_ch.\n>>> ls \"/.../training/nf-training/data/ggal/\"\n\ngut_1.fq gut_2.fq liver_1.fq liver_2.fq lung_1.fq lung_2.fq transcriptome.fa\nRun the rnaseq.nf workflow specifying all .fq files inside /.../training/nf-training/data/ggal/ as the reads parameter via the command line:\nnextflow run rnaseq.nf --reads '/.../training/nf-training/data/ggal/*_{1,2}.fq'\nFile paths that include one or more wildcards (ie. *, ?, etc.) MUST be wrapped in single-quoted characters to avoid Bash expanding the glob on the command line.\nThe reads_ch now contains three tuple elements with unique grouping keys:\n[gut, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\n[liver, [/.../training/nf-training/data/ggal/liver_1.fq, /.../training/nf-training/data/ggal/liver_2.fq]]\n[lung, [/.../training/nf-training/data/ggal/lung_1.fq, /.../training/nf-training/data/ggal/lung_2.fq]]\nThe grouping key metadata can also be explicitly created without having to rely on file names, using the map channel operator. 
Let’s start by creating a samplesheet rnaseq_samplesheet.csv with column headings sample_name, fastq1, and fastq2, and fill in a custom sample_name, along with the paths to the .fq files.\nsample_name,fastq1,fastq2\ngut_sample,/.../training/nf-training/data/ggal/gut_1.fq,/.../training/nf-training/data/ggal/gut_2.fq\nliver_sample,/.../training/nf-training/data/ggal/liver_1.fq,/.../training/nf-training/data/ggal/liver_2.fq\nlung_sample,/.../training/nf-training/data/ggal/lung_1.fq,/.../training/nf-training/data/ggal/lung_2.fq\nLet’s now supply the path to rnaseq_samplesheet.csv to the reads parameter in rnaseq.nf.\nparams.reads = \"/.../rnaseq_samplesheet.csv\"\nPreviously, the reads parameter consisted of a string of the .fq files directly. Now, it is a string to a .csv file containing the .fq files. Therefore, the channel factory method that reads the input file also needs to be changed. Since the parameter is now a single file path, the fromPath method can first be used, which creates a channel of Path type object. The splitCsv channel operator can then be used to parse the contents of the channel.\nreads_ch = Channel.fromPath(params.reads)\nreads_ch.view()\n\nreads_ch = reads_ch.splitCsv(header:true)\nreads_ch.view()\nWhen using splitCsv in the above example, header is set to true. This will use the first line of the .csv file as the column names. Let’s run the pipeline containing the new input parameter.\n>>> nextflow run rnaseq.nf\n\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [distraught_avogadro] DSL2 - revision: 525e081ba2\nreads: rnaseq_samplesheet.csv\nreads: $params.reads\nexecutor > local (1)\n[4e/eeae2a] process > INDEX [100%] 1 of 1 ✔\n/.../rnaseq_samplesheet.csv\n[sample_name:gut_sample, fastq1:/.../training/nf-training/data/ggal/gut_1.fq, fastq2:/.../training/nf-training/data/ggal/gut_2.fq]\n[sample_name:liver_sample, fastq1:/.../training/nf-training/data/ggal/liver_1.fq, fastq2:/.../training/nf-training/data/ggal/liver_2.f]\n[sample_name:lung_sample, fastq1:/.../training/nf-training/data/ggal/lung_1.fq, fastq2:/.../training/nf-training/data/ggal/lung_2.fq]\nThe /.../rnaseq_samplesheet.csv is the output of reads_ch directly after the fromPath channel factory method was used. Here, the channel is a Path type object. After invoking the splitCsv channel operator, the reads_ch is now replaced with a channel consisting of three elements, where each element is a row in the .csv file, returned as a list. Since header was set to true, each element in the list is also mapped to the column names. This can be used when creating the custom grouping key.\nTo create grouping key metadata from the list output by splitCsv, the map channel operator can be used.\n reads_ch = reads_ch.map { row -> \n grp_meta = \"$row.sample_name\"\n [grp_meta, [row.fastq1, row.fastq2]]\n }\n reads_ch.view()\nHere, for each list in reads_ch, we assign it to a variable row. We then create custom grouping key metadata grp_meta based on the sample_name column from the .csv, which can be accessed via the row variable by . separation. After the custom metadata key is assigned, a tuple is created by assigning grp_meta as the first element, and the two .fq files as the second element, accessed via the row variable by . 
separation.\nLet’s run the pipeline containing the custom grouping key:\n>>> nextflow run rnaseq.nf\n\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [happy_torricelli] DSL2 - revision: e9e1499a97\nreads: rnaseq_samplesheet.csv\nreads: $params.reads\n[- ] process > INDEX -\n[gut_sample, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\n[liver_sample, [/home/sli/test/training/nf-training/data/ggal/liver_1.fq, /.../training/nf-training/data/ggal/liver_2.fq]]\n[lung_sample, [/.../training/nf-training/data/ggal/lung_1.fq, /.../training/nf-training/data/ggal/lung_2.fq]]\nThe custom grouping key can be created from multiple values in the samplesheet. For example, grp_meta = [sample : row.sample_name , file : row.fastq1] will create the metadata key using both the sample_name and fastq1 file names. The samplesheet can also be created to include multiple sample characteristics, such as lane, data_type, etc. Each of these characteristics can be used to ensure an adequte grouping key is creaed for that sample." }, { - "objectID": "workshops/2.2_troubleshooting.html", - "href": "workshops/2.2_troubleshooting.html", - "title": "Troubleshooting Nextflow run", + "objectID": "workshops/4_1_modules.html", + "href": "workshops/4_1_modules.html", + "title": "Nextflow Development - Developing Modularised Workflows", "section": "", - "text": "2.2.1. Nextflow log\nIt is important to keep a record of the commands you have run to generate your results. Nextflow helps with this by creating and storing metadata and logs about the run in hidden files and folders in your current directory (unless otherwise specified). This data can be used by Nextflow to generate reports. It can also be queried using the Nextflow log command:\nnextflow log\nThe log command has multiple options to facilitate the queries and is especially useful while debugging a workflow and inspecting execution metadata. You can view all of the possible log options with -h flag:\nnextflow log -h\nTo query a specific execution you can use the RUN NAME or a SESSION ID:\nnextflow log <run name>\nTo get more information, you can use the -f option with named fields. For example:\nnextflow log <run name> -f process,hash,duration\nThere are many other fields you can query. You can view a full list of fields with the -l option:\nnextflow log -l\n\n\n\n\n\n\nChallenge\n\n\n\nUse the log command to view with process, hash, and script fields for your tasks from your most recent Nextflow execution.\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nUse the log command to get a list of you recent executions:\nnextflow log\nTIMESTAMP DURATION RUN NAME STATUS REVISION ID SESSION ID COMMAND \n2023-11-21 22:43:14 14m 17s jovial_angela OK 3bec2331ca 319751c3-25a6-4085-845c-6da28cd771df nextflow run nf-core/rnaseq\n2023-11-21 23:05:49 1m 36s marvelous_shannon OK 3bec2331ca 319751c3-25a6-4085-845c-6da28cd771df nextflow run nf-core/rnaseq\n2023-11-21 23:10:00 1m 35s deadly_babbage OK 3bec2331ca 319751c3-25a6-4085-845c-6da28cd771df nextflow run nf-core/rnaseq\nQuery the process, hash, and script using the -f option for the most recent run:\nnextflow log marvelous_shannon -f process,hash,script\n\n[... 
truncated ...]\n\nNFCORE_RNASEQ:RNASEQ:SUBREAD_FEATURECOUNTS 7c/f936d4 \n featureCounts \\\n -B -C -g gene_biotype -t exon \\\n -p \\\n -T 2 \\\n -a chr22_with_ERCC92.gtf \\\n -s 2 \\\n -o HBR_Rep1_ERCC.featureCounts.txt \\\n HBR_Rep1_ERCC.markdup.sorted.bam\n\n cat <<-END_VERSIONS > versions.yml\n \"NFCORE_RNASEQ:RNASEQ:SUBREAD_FEATURECOUNTS\":\n subread: $( echo $(featureCounts -v 2>&1) | sed -e \"s/featureCounts v//g\")\n END_VERSIONS\n\n[... truncated ... ]\n\nNFCORE_RNASEQ:RNASEQ:MULTIQC 7a/8449d7 \n multiqc \\\n -f \\\n \\\n \\\n .\n\n cat <<-END_VERSIONS > versions.yml\n \"NFCORE_RNASEQ:RNASEQ:MULTIQC\":\n multiqc: $( multiqc --version | sed -e \"s/multiqc, version //g\" )\n END_VERSIONS\n \n\n\n\n\n\n2.2.2. Execution cache and resume\nTask execution caching is an essential feature of modern workflow managers. As such, Nextflow provides an automated caching mechanism for every execution. When using the Nextflow -resume option, successfully completed tasks from previous executions are skipped and the previously cached results are used in downstream tasks.\nNextflow caching mechanism works by assigning a unique ID to each task. The task unique ID is generated as a 128-bit hash value composing the the complete file path, file size, and last modified timestamp. These ID’s are used to create a separate execution directory where the tasks are executed and the outputs are stored. Nextflow will take care of the inputs and outputs in these folders for you.\nYou can re-launch the previously executed nf-core/rnaseq workflow again, but with a -resume flag, and observe the progress. Notice the time it takes to complete the workflow.\nnextflow run nf-core/rnaseq -r 3.11.1 \\\n --input samplesheet.csv \\\n --outdir ./my_results \\\n --fasta $materials/ref/chr22_with_ERCC92.fa \\\n --gtf $materials/ref/chr22_with_ERCC92.gtf \\\n -profile singularity \\\n --skip_markduplicates true \\\n --save_trimmed true \\\n --save_unaligned true \\\n --max_memory '6.GB' \\\n --max_cpus 2 \\\n -resume \n\n[80/ec6ff8] process > NFCORE_RNASEQ:RNASEQ:PREPARE_GENOME:GTF2BED (chr22_with_ERCC92.gtf) [100%] 1 of 1, cached: 1 ✔\n[1a/7bec9c] process > NFCORE_RNASEQ:RNASEQ:PREPARE_GENOME:GTF_GENE_FILTER (chr22_with_ERCC92.fa) [100%] 1 of 1, cached: 1 ✔\nExecuting this workflow will create a my_results directory with selected results files and add some further sub-directories into the work directory\nIn the schematic above, the hexadecimal numbers, such as 80/ec6ff8, identify the unique task execution. 
These numbers are also the prefix of the work directories where each task is executed.\nYou can inspect the files produced by a task by looking inside the work directory and using these numbers to find the task-specific execution path:\nls work/80/ec6ff8ba69a8b5b8eede3679e9f978/\nIf you look inside the work directory of a FASTQC task, you will find the files that were staged and created when this task was executed:\n>>> ls -la work/e9/60b2e80b2835a3e1ad595d55ac5bf5/ \n\ntotal 15895\ndrwxrwxr-x 2 rlupat rlupat 4096 Nov 22 03:39 .\ndrwxrwxr-x 4 rlupat rlupat 4096 Nov 22 03:38 ..\n-rw-rw-r-- 1 rlupat rlupat 0 Nov 22 03:39 .command.begin\n-rw-rw-r-- 1 rlupat rlupat 9509 Nov 22 03:39 .command.err\n-rw-rw-r-- 1 rlupat rlupat 9609 Nov 22 03:39 .command.log\n-rw-rw-r-- 1 rlupat rlupat 100 Nov 22 03:39 .command.out\n-rw-rw-r-- 1 rlupat rlupat 10914 Nov 22 03:39 .command.run\n-rw-rw-r-- 1 rlupat rlupat 671 Nov 22 03:39 .command.sh\n-rw-rw-r-- 1 rlupat rlupat 231 Nov 22 03:39 .command.trace\n-rw-rw-r-- 1 rlupat rlupat 1 Nov 22 03:39 .exitcode\nlrwxrwxrwx 1 rlupat rlupat 63 Nov 22 03:39 HBR_Rep1_ERCC_1.fastq.gz -> HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz\n-rw-rw-r-- 1 rlupat rlupat 2368 Nov 22 03:39 HBR_Rep1_ERCC_1.fastq.gz_trimming_report.txt\n-rw-rw-r-- 1 rlupat rlupat 697080 Nov 22 03:39 HBR_Rep1_ERCC_1_val_1_fastqc.html\n-rw-rw-r-- 1 rlupat rlupat 490526 Nov 22 03:39 HBR_Rep1_ERCC_1_val_1_fastqc.zip\n-rw-rw-r-- 1 rlupat rlupat 6735205 Nov 22 03:39 HBR_Rep1_ERCC_1_val_1.fq.gz\nlrwxrwxrwx 1 rlupat rlupat 63 Nov 22 03:39 HBR_Rep1_ERCC_2.fastq.gz -> HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz\n-rw-rw-r-- 1 rlupat rlupat 2688 Nov 22 03:39 HBR_Rep1_ERCC_2.fastq.gz_trimming_report.txt\n-rw-rw-r-- 1 rlupat rlupat 695591 Nov 22 03:39 HBR_Rep1_ERCC_2_val_2_fastqc.html\n-rw-rw-r-- 1 rlupat rlupat 485732 Nov 22 03:39 HBR_Rep1_ERCC_2_val_2_fastqc.zip\n-rw-rw-r-- 1 rlupat rlupat 7088948 Nov 22 03:39 HBR_Rep1_ERCC_2_val_2.fq.gz\nlrwxrwxrwx 1 rlupat rlupat 102 Nov 22 03:39 HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz -> /data/seqliner/test-data/rna-seq/fastq/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz\nlrwxrwxrwx 1 rlupat rlupat 102 Nov 22 03:39 HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz -> /data/seqliner/test-data/rna-seq/fastq/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz\n-rw-rw-r-- 1 rlupat rlupat 109 Nov 22 03:39 versions.yml\nThe FASTQC process runs twice, executing in a different work directories for each set of inputs. Therefore, in the previous example, the work directory [e9/60b2e8] represents just one of the four sets of input data that was processed.\nIt’s very likely you will execute a workflow multiple times as you find the parameters that best suit your data. You can save a lot of spaces (and time) by resuming a workflow from the last step that was completed successfully and/or unmodified.\nIn practical terms, the workflow is executed from the beginning. However, before launching the execution of a process, Nextflow uses the task unique ID to check if the work directory already exists and that it contains a valid command exit state with the expected output files. If this condition is satisfied, the task execution is skipped and previously computed results are used as the process results.\nNotably, the -resume functionality is very sensitive. 
Even touching a file in the work directory can invalidate the cache.\n\n\n\n\n\n\nChallenge\n\n\n\nInvalidate the cache by touching a .fastq.gz file in a FASTQC task work directory (you can use the touch command). Execute the workflow again with the -resume option to show that the cache has been invalidated.\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nExecute the workflow for the first time (if you have not already).\nUse the task ID shown for the FASTQC process and use it to find and touch the HBR_Rep1_ERCC_1.fastq.gz file:\ntouch work/ff/21abfa87cc7cdec037ce4f36807d32/HBR_Rep1_ERCC_1.fastq.gz\nExecute the workflow again with the -resume command option:\nnextflow run nf-core/rnaseq -r 3.11.1 \\\n --input samplesheet.csv \\\n --outdir ./my_results \\\n --fasta $materials/ref/chr22_with_ERCC92.fa \\\n --gtf $materials/ref/chr22_with_ERCC92.gtf \\\n -profile singularity \\\n --skip_markduplicates true \\\n --save_trimmed true \\\n --save_unaligned true \\\n --max_memory '6.GB' \\\n --max_cpus 2 \\\n -resume \nYou should see that some tasks were invalidated and were executed again.\nWhy did this happen?\nIn this example, the cache of two FASTQC tasks was invalidated. The fastq file we touched is used in the pipeline in multiple places. Thus, touching the symlink for this file and changing the date of last modification disrupted two task executions.\n\n\n\n\n\n2.2.3. Troubleshoot warning and error messages\nWhile our previous workflow execution completed successfully, there were a couple of warning messages that may be cause for concern:\n-[nf-core/rnaseq] Pipeline completed successfully with skipped sampl(es)-\n-[nf-core/rnaseq] Please check MultiQC report: 2/2 samples failed strandedness check.-\nCompleted at: 20-Nov-2023 00:29:04\nDuration : 10m 15s\nCPU hours : 0.3 \nSucceeded : 72\n\n\n\n\n\n\nHandling dodgy error messages 🤬\n\n\n\nThe first warning message isn’t very descriptive (see this pull request). You might come across issues like this when running nf-core pipelines, too. Bug reports and user feedback are very important to open source software communities like nf-core. If you come across any issues, submit a GitHub issue or start a discussion in the relevant nf-core Slack channel so others are aware and it can be addressed by the pipeline’s developers.\n\n\n➤ Take a look at the MultiQC report, as directed by the second message. You can find the MultiQC report in the lesson2.1/ directory:\nls -la lesson2.1/multiqc/star_salmon/\ntotal 1402\ndrwxrwxr-x 4 rlupat rlupat 4096 Nov 22 00:29 .\ndrwxrwxr-x 3 rlupat rlupat 4096 Nov 22 00:29 ..\ndrwxrwxr-x 2 rlupat rlupat 8192 Nov 22 00:29 multiqc_data\ndrwxrwxr-x 5 rlupat rlupat 4096 Nov 22 00:29 multiqc_plots\n-rw-rw-r-- 1 rlupat rlupat 1419998 Nov 22 00:29 multiqc_report.html\n➤ Download the multiqc_report.html from the file navigator panel on the left side of your VS Code window by right-clicking on it and then selecting Download. Open the file on your computer.\nTake a look at the section labelled WARNING: Fail Strand Check\nThe warning we have received is indicating that the read strandedness we specified in our samplesheet.csv and the inferred strandedness identified by the RSeQC process in the pipeline do not match. 
It looks like the test samplesheet has incorrectly specified strandedness as forward in the samplesheet.csv when our raw reads actually show an equal distribution of sense and antisense reads.\nFor those who are not familiar with RNAseq data, incorrectly specified strandedness may negatively impact the read quantification step (process: Salmon quant) and give us inaccurate results. So, let’s clarify how the Salmon quant process is gathering strandedness information for our input files by default and find a way to address this with the parameters provided by the nf-core/rnaseq pipeline.\n\n\n\n2.2.4. Identify the run command for a process\nTo observe exactly what command is being run for a process, we can attempt to infer this information from the module’s main.nf script in the modules/ directory. However, given all the different parameters that may be applied at the process level, this may not be very clear.\n➤ Take a look at the Salmon quant main.nf file:\nnf-core-rnaseq-3.11.1/workflow/modules/nf-core/salmon/quant/main.nf\nUnless you are familiar with developing nf-core pipelines, it can be very hard to see what is actually happening in the code, given all the different variables and conditional arguments inside this script. Above the script block we can see strandedness is being applied using a few different conditional arguments. Instead of trying to infer how the $strandedness variable is being defined and applied to the process, let’s use the hidden command files saved for this task in the work/ directory.\n\n\n\n\n\n\nHidden files in the work directory!\n\n\n\nRemember that the pipeline’s results are cached in the work directory. In addition to the cached files, each task execution directory inside the work directory contains a number of hidden files:\n\n.command.sh: The command script run for the task.\n.command.run: The command wrapper used to run the task.\n.command.out: The task’s standard output log.\n.command.err: The task’s standard error log.\n.command.log: The wrapper execution output.\n.command.begin: A file created as soon as the job is launched.\n.exitcode: A file containing the task exit code (0 if successful)\n\n\n\nThe nextflow log command that we discussed previously has multiple options to facilitate queries, and is especially useful while debugging a pipeline and inspecting pipeline execution metadata.\nTo understand how Salmon quant is interpreting strandedness, we’re going to use this command to track down the hidden .command.sh scripts for each Salmon quant task that was run. This will allow us to find out how Salmon quant handles strandedness and if there is a way for us to override this.\n➤ Use the Nextflow log command to get the unique run name information of the previously executed pipelines:\nnextflow log <run-name>\nThat command will list out all the work subdirectories for all processes run.\nAnd we now need to find the specific hidden .command.sh for Salmon tasks. But how to find them? 🤔\n➤ Let’s add some custom bash code to query a Nextflow run with the run name from the previous lesson. First, save your run name in a bash variable. 
For example:\nrun_name=marvelous_shannon\n➤ And let’s save the tool of interest (salmon) in another bash variable to pull it from a run command:\ntool=salmon\n➤ Next, run the following bash command:\nnextflow log ${run_name} | while read line;\n do\n cmd=$(ls ${line}/.command.sh 2>/dev/null);\n if grep -q $tool $cmd;\n then \n echo $cmd; \n fi; \n done \nThat will list all process .command.sh scripts containing ‘salmon’. There are a few different processes that run Salmon to perform other steps in the workflow. We are looking for Salmon quant which performs the read quantification:\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/57/fba8f9a2385dac5fa31688ba1afa9b/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/30/0113a58c14ca8d3099df04ebf388f3/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/ec/95d6bd12d578c3bce22b5de4ed43fe/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/49/6fedcb09e666432ae6ddf8b1e8f488/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/b4/2ca8d05b049438262745cde92955e9/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/38/875d68dae270504138bb3d72d511a7/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/72/776810a99695b1c114cbb103f4a0e6/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/1c/dc3f54cc7952bf55e6742dd4783392/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/f3/5116a5b412bde7106645671e4c6ffb/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/17/fb0c791810f42a438e812d5c894ebf/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/4c/931a9b60b2f3cf770028854b1c673b/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/91/e1c99d8acb5adf295b37fd3bbc86a5/.command.sh\nCompared with the salmon quant main.nf file, we get a lot more fine scale details from the .command.sh process scripts:\n>>> cat main.nf\nsalmon quant \\\\\n --geneMap $gtf \\\\\n --threads $task.cpus \\\\\n --libType=$strandedness \\\\\n $reference \\\\\n $input_reads \\\\\n $args \\\\\n -o $prefix\n>>> cat .command.sh\nsalmon quant \\\n --geneMap chr22_with_ERCC92.gtf \\\n --threads 2 \\\n --libType=ISF \\\n -t genome.transcripts.fa \\\n -a HBR_Rep1_ERCC.Aligned.toTranscriptome.out.bam \\\n \\\n -o HBR_Rep1_ERCC\nLooking at the nf-core/rnaseq Parameter documentation and Salmon documentation, we found that we can override this default using the --salmon_quant_libtype A parameter to indicate our data is unstranded and override samplesheet.csv input.\n\n\n\n\n\n\nHow do I get rid of the strandedness check warning message?\n\n\n\nIf we want to get rid of the warning message Please check MultiQC report: 2/2 samples failed strandedness check, we’ll have to change the strandedness fields in our samplesheet.csv. Keep in mind, doing this will invalidate the pipeline’s cache and cause the pipeline to run from the beginning.\n\n\n\n\n\n2.2.5. Write a parameter file\nFrom the previous section we learn that Nextflow accepts either yaml or json formats for parameter files. Any of the pipeline-specific parameters can be supplied to a Nextflow pipeline in this way.\n\n\n\n\n\n\nChallenge\n\n\n\nFill in the parameters file below and save as workshop-params.yaml. 
This time, include the --salmon_quant_libtype A parameter.\n💡 YAML formatting tips!\n\nStrings need to be inside double quotes\nBooleans (true/false) and numbers do not require quotes\n\ninput: \"\"\noutdir: \"lesson2.2\"\nfasta: \"\"\ngtf: \"\"\nstar_index: \"\"\nsalmon_index: \"\"\nskip_markduplicates: \nsave_trimmed: \nsave_unaligned: \nsalmon_quant_libtype: \"A\" \n\n\n\n\n2.2.6. Apply the parameter file\n➤ Once your params file has been saved, run:\nnextflow run nf-core/rnaseq -r 3.11.1 \\\n -params-file workshop-params.yaml \\\n -profile singularity \\\n --max_memory '6.GB' \\\n --max_cpus 2 \\\n -resume \nThe number of pipeline-specific parameters we’ve added to our run command has been significantly reduced. The only -- parameters we’ve provided to the run command relate to how the pipeline is executed on our interactive job. These resource limits won’t be applicable to others who will run the pipeline on a different infrastructure.\nAs the workflow runs a second time, you will notice 4 things:\n\nThe command is much tidier thanks to offloading some parameters to the params file\nThe -resume flag. Nextflow has lots of run options including the ability to use cached output!\nSome processes will be pulled from the cache. These processes remain unaffected by our addition of a new parameter.\n\nThis run of the pipeline will complete in a much shorter time.\n\n-[nf-core/rnaseq] Pipeline completed successfully with skipped sampl(es)-\n-[nf-core/rnaseq] Please check MultiQC report: 2/2 samples failed strandedness check.-\nCompleted at: 21-Apr-2023 05:58:06\nDuration : 1m 51s\nCPU hours : 0.3 (82.2% cached)\nSucceeded : 11\nCached : 55\n\n\nThese materials are adapted from Customising Nf-Core Workshop by Sydney Informatics Hub"
  },
  {
    "objectID": "workshops/4_1_modules.html",
    "href": "workshops/4_1_modules.html",
    "title": "Nextflow Development - Developing Modularised Workflows",
    "section": "",
    "text": "Objectives\n\n\n\n\nGain an understanding of Nextflow modules and subworkflows\nGain an understanding of Nextflow workflow structures\nExplore some groovy functions and libraries\nSetup config, profile, and some test data"
  },
  {
    "objectID": "workshops/4_1_modules.html#environment-setup",
    "href": "workshops/4_1_modules.html#environment-setup",
    "title": "Nextflow Development - Developing Modularised Workflows",
    "section": "Environment Setup",
    "text": "Environment Setup\nSet up an interactive shell to run our Nextflow workflow:\nsrun --pty -p prod_short --mem 8GB --mincpus 2 -t 0-2:00 bash\nLoad the required modules to run Nextflow:\nmodule load nextflow/23.04.1\nmodule load singularity/3.7.3\nSet the singularity cache environment variable:\nexport NXF_SINGULARITY_CACHEDIR=/config/binaries/singularity/containers_devel/nextflow\nSingularity images downloaded by workflow executions will now be stored in this directory.\nYou may want to include these, or other environment variables, in your .bashrc file (or alternate) that is loaded when you log in so you don’t need to export variables every session. A complete list of environment variables can be found here."
  },
  {
    "objectID": "workshops/4_1_modules.html#modularization",
    "href": "workshops/4_1_modules.html#modularization",
    "title": "Nextflow Development - Developing Modularised Workflows",
    "section": "5. Modularization",
    "text": "5. 
Modularization\nThe definition of module libraries simplifies the writing of complex data analysis workflows and makes re-use of processes much easier.\nUsing the rnaseq.nf example from previous section, you can convert the workflow’s processes into modules, then call them within the workflow scope.\n#!/usr/bin/env nextflow\n\nparams.reads = \"/scratch/users/.../nf-training/data/ggal/*_{1,2}.fq\"\nparams.transcriptome_file = \"/scratch/users/.../nf-training/ggal/transcriptome.fa\"\nparams.multiqc = \"/scratch/users/.../nf-training/multiqc\"\n\nreads_ch = Channel.fromFilePairs(\"$params.reads\")\n\nprocess INDEX {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\"\n\n input:\n path transcriptome\n\n output:\n path \"salmon_idx\"\n\n script:\n \"\"\"\n salmon index --threads $task.cpus -t $transcriptome -i salmon_idx\n \"\"\"\n}\n\nprocess QUANTIFICATION {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\"\n\n input:\n path salmon_index\n tuple val(sample_id), path(reads)\n\n output:\n path \"$sample_id\"\n\n script:\n \"\"\"\n salmon quant --threads $task.cpus --libType=U \\\n -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id\n \"\"\"\n}\n\nprocess FASTQC {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-fastqc-0.12.1--hdfd78af_0.img\"\n\n input:\n tuple val(sample_id), path(reads)\n\n output:\n path \"fastqc_${sample_id}_logs\"\n\n script:\n \"\"\"\n mkdir fastqc_${sample_id}_logs\n fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}\n \"\"\"\n}\n\nprocess MULTIQC {\n publishDir params.outdir, mode:'copy'\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-multiqc-1.21--pyhdfd78af_0.img\"\n\n input:\n path quantification\n path fastqc\n\n output:\n path \"*.html\"\n\n script:\n \"\"\"\n multiqc . --filename $quantification\n \"\"\"\n}\n\nworkflow {\n index_ch = INDEX(params.transcriptome_file)\n quant_ch = QUANTIFICATION(index_ch, reads_ch)\n quant_ch.view()\n\n fastqc_ch = FASTQC(reads_ch)\n multiqc_ch = MULTIQC(quant_ch, fastqc_ch)\n}" + }, + { + "objectID": "workshops/4_1_modules.html#modules", + "href": "workshops/4_1_modules.html#modules", + "title": "Nextflow Development - Developing Modularised Workflows", + "section": "5.1 Modules", + "text": "5.1 Modules\nNextflow DSL2 allows for the definition of stand-alone module scripts that can be included and shared across multiple workflows. Each module can contain its own process or workflow definition." + }, + { + "objectID": "workshops/4_1_modules.html#importing-modules", + "href": "workshops/4_1_modules.html#importing-modules", + "title": "Nextflow Development - Developing Modularised Workflows", + "section": "5.1.1. Importing modules", + "text": "5.1.1. Importing modules\nComponents defined in the module script can be imported into other Nextflow scripts using the include statement. This allows you to store these components in one or more file(s) that they can be re-used in multiple workflows.\nUsing the rnaseq.nf example, you can achieve this by:\nCreating a file called modules.nf in the top-level directory. Copying and pasting all process definitions for INDEX, QUANTIFICATION, FASTQC and MULTIQC into modules.nf. Removing the process definitions in the rnaseq.nf script. 
Importing the processes from modules.nf within the rnaseq.nf script anywhere above the workflow definition:\ninclude { INDEX } from './modules.nf'\ninclude { QUANTIFICATION } from './modules.nf'\ninclude { FASTQC } from './modules.nf'\ninclude { MULTIQC } from './modules.nf'\n\n\n\n\n\n\nTip\n\n\n\nIn general, you would use relative paths to define the location of the module scripts using the ./prefix.\n\n\nExercise\nCreate a modules.nf file with the INDEX, QUANTIFICATION, FASTQC and MULTIQC from rnaseq.nf. Then remove these processes from rnaseq.nf and include them in the workflow using the include definitions shown above.\n\n\n\n\n\n\nSolution\n\n\n\n\n\nThe rnaseq.nf script should look similar to this:\nparams.reads = \"/scratch/users/.../nf-training/data/ggal/*_{1,2}.fq\"\nparams.transcriptome_file = \"/scratch/users/.../nf-training/ggal/transcriptome.fa\"\nparams.multiqc = \"/scratch/users/.../nf-training/multiqc\"\n\nreads_ch = Channel.fromFilePairs(\"$params.reads\")\n\ninclude { INDEX } from './modules.nf'\ninclude { QUANTIFICATION } from './modules.nf'\ninclude { FASTQC } from './modules.nf'\ninclude { MULTIQC } from './modules.nf'\n\nworkflow {\n index_ch = INDEX(params.transcriptome_file)\n quant_ch = QUANTIFICATION(index_ch, reads_ch)\n quant_ch.view()\n\n fastqc_ch = FASTQC(reads_ch)\n multiqc_ch = MULTIQC(quant_ch, fastqc_ch)\n}\n\n\n\nRun the pipeline to check if the module import is successful\nnextflow run rnaseq.nf --outdir \"results\" -resume\n\n\n\n\n\n\nChallenge\nTry modularising the modules.nf even further to achieve a setup of one tool per module (can be one or more processes), similar to the setup used by most nf-core pipelines\nnfcore/rna-seq\n | modules\n | local\n | multiqc\n | deseq2_qc\n | nf-core\n | fastqc\n | salmon\n | index\n | main.nf\n | quant\n | main.nf" + }, + { + "objectID": "workshops/4_1_modules.html#multiple-imports", + "href": "workshops/4_1_modules.html#multiple-imports", + "title": "Nextflow Development - Developing Modularised Workflows", + "section": "5.1.2. Multiple imports", + "text": "5.1.2. Multiple imports\nIf a Nextflow module script contains multiple process definitions they can also be imported using a single include statement as shown in the example below:\nparams.reads = \"/scratch/users/.../nf-training/data/ggal/*_{1,2}.fq\"\nparams.transcriptome_file = \"/scratch/users/.../nf-training/ggal/transcriptome.fa\"\nparams.multiqc = \"/scratch/users/.../nf-training/multiqc\"\nreads_ch = Channel.fromFilePairs(\"$params.reads\")\n\ninclude { INDEX; QUANTIFICATION; FASTQC; MULTIQC } from './modules.nf'\n\nworkflow {\n index_ch = INDEX(params.transcriptome_file)\n quant_ch = QUANTIFICATION(index_ch, reads_ch)\n fastqc_ch = FASTQC(reads_ch)\n multiqc_ch = MULTIQC(quant_ch, fastqc_ch)\n}" + }, + { + "objectID": "workshops/4_1_modules.html#module-aliases", + "href": "workshops/4_1_modules.html#module-aliases", + "title": "Nextflow Development - Developing Modularised Workflows", + "section": "5.1.3 Module aliases", + "text": "5.1.3 Module aliases\nWhen including a module component it is possible to specify a name alias using the as declaration. 
This allows the inclusion and the invocation of the same component multiple times using different names:\nparams.reads = \"/scratch/users/.../nf-training/data/ggal/*_{1,2}.fq\"\nparams.transcriptome_file = \"/scratch/users/.../nf-training/ggal/transcriptome.fa\"\nparams.multiqc = \"/scratch/users/.../nf-training/multiqc\"\n\nreads_ch = Channel.fromFilePairs(\"$params.reads\")\n\ninclude { INDEX } from './modules.nf'\ninclude { QUANTIFICATION as QT } from './modules.nf'\ninclude { FASTQC as FASTQC_one } from './modules.nf'\ninclude { FASTQC as FASTQC_two } from './modules.nf'\ninclude { MULTIQC } from './modules.nf'\ninclude { TRIMGALORE } from './modules/trimgalore.nf'\n\nworkflow {\n index_ch = INDEX(params.transcriptome_file)\n quant_ch = QT(index_ch, reads_ch)\n fastqc_ch = FASTQC_one(reads_ch)\n trimgalore_out_ch = TRIMGALORE(reads_ch).reads\n fastqc_cleaned_ch = FASTQC_two(trimgalore_out_ch)\n\n multiqc_ch = MULTIQC(quant_ch, fastqc_ch)\n}\nprocess TRIMGALORE {\n container '/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-trim-galore-0.6.6--0.img' \n\n input:\n tuple val(sample_id), path(reads)\n \n output:\n tuple val(sample_id), path(\"*{3prime,5prime,trimmed,val}*.fq.gz\"), emit: reads\n tuple val(sample_id), path(\"*report.txt\") , emit: log , optional: true\n tuple val(sample_id), path(\"*unpaired*.fq.gz\") , emit: unpaired, optional: true\n tuple val(sample_id), path(\"*.html\") , emit: html , optional: true\n tuple val(sample_id), path(\"*.zip\") , emit: zip , optional: true\n\n script:\n \"\"\"\n trim_galore \\\\\n --paired \\\\\n --gzip \\\\\n ${reads[0]} \\\\\n ${reads[1]}\n \"\"\"\n\n}\nNote how the QUANTIFICATION process is now being refer to as QT, and FASTQC process is imported twice, each time with a different alias, and how these aliases are used to invoke the processes.\n\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [sharp_meitner] DSL2 - revision: 6afd5bf37c\nexecutor > local (16)\n[c7/56160a] process > INDEX [100%] 1 of 1 ✔\n[75/cb99dd] process > QT (3) [100%] 3 of 3 ✔\n[d9/e298c6] process > FASTQC_one (3) [100%] 3 of 3 ✔\n[5e/7ccc39] process > TRIMGALORE (3) [100%] 3 of 3 ✔\n[a3/3a1e2e] process > FASTQC_two (3) [100%] 3 of 3 ✔\n[e1/411323] process > MULTIQC (3) [100%] 3 of 3 ✔\n\n\n\n\n\n\nWarning\n\n\n\nWhat do you think will happen if FASTQC is imported only once without alias, but used twice within the workflow?\n\n\n\n\n\n\nAnswer\n\n\n\n\n\nProcess 'FASTQC' has been already used -- If you need to reuse the same component, include it with a different name or include it in a different workflow context" + }, + { + "objectID": "workshops/4_1_modules.html#workflow-definition", + "href": "workshops/4_1_modules.html#workflow-definition", + "title": "Nextflow Development - Developing Modularised Workflows", + "section": "5.2 Workflow definition", + "text": "5.2 Workflow definition\nThe workflow scope allows the definition of components that define the invocation of one or more processes or operators:\n\nparams.reads = \"/scratch/users/.../nf-training/data/ggal/*_{1,2}.fq\"\nparams.transcriptome_file = \"/scratch/users/.../nf-training/ggal/transcriptome.fa\"\nparams.multiqc = \"/scratch/users/.../nf-training/multiqc\"\n\nreads_ch = Channel.fromFilePairs(\"$params.reads\")\n\ninclude { INDEX } from './modules.nf'\ninclude { QUANTIFICATION as QT } from './modules.nf'\ninclude { FASTQC as FASTQC_one } from './modules.nf'\ninclude { FASTQC as FASTQC_two } from './modules.nf'\ninclude { MULTIQC } from 
'./modules.nf'\ninclude { TRIMGALORE } from './modules/trimgalore.nf'\n\nworkflow my_workflow {\n index_ch = INDEX(params.transcriptome_file)\n quant_ch = QT(index_ch, reads_ch)\n fastqc_ch = FASTQC_one(reads_ch)\n trimgalore_out_ch = TRIMGALORE(reads_ch).reads\n fastqc_cleaned_ch = FASTQC_two(trimgalore_out_ch)\n\n multiqc_ch = MULTIQC(quant_ch, fastqc_ch)\n}\n\nworkflow {\n my_workflow()\n}\nFor example, the snippet above defines a workflow named my_workflow, that is invoked via another workflow definition." + }, + { + "objectID": "workshops/4_1_modules.html#workflow-inputs", + "href": "workshops/4_1_modules.html#workflow-inputs", + "title": "Nextflow Development - Developing Modularised Workflows", + "section": "5.2.1 Workflow inputs", + "text": "5.2.1 Workflow inputs\nA workflow component can declare one or more input channels using the take statement. When the take statement is used, the workflow definition needs to be declared within the main block.\nFor example:\n\nparams.reads = \"/scratch/users/.../nf-training/data/ggal/*_{1,2}.fq\"\nparams.transcriptome_file = \"/scratch/users/.../nf-training/ggal/transcriptome.fa\"\nparams.multiqc = \"/scratch/users/.../nf-training/multiqc\"\n\nreads_ch = Channel.fromFilePairs(\"$params.reads\")\n\ninclude { INDEX } from './modules.nf'\ninclude { QUANTIFICATION as QT } from './modules.nf'\ninclude { FASTQC as FASTQC_one } from './modules.nf'\ninclude { FASTQC as FASTQC_two } from './modules.nf'\ninclude { MULTIQC } from './modules.nf'\ninclude { TRIMGALORE } from './modules/trimgalore.nf'\n\nworkflow my_workflow {\n take:\n transcriptome_file\n reads_ch\n\n main:\n index_ch = INDEX(transcriptome_file)\n quant_ch = QT(index_ch, reads_ch)\n fastqc_ch = FASTQC_one(reads_ch)\n trimgalore_out_ch = TRIMGALORE(reads_ch).reads\n fastqc_cleaned_ch = FASTQC_two(trimgalore_out_ch)\n\n multiqc_ch = MULTIQC(quant_ch, fastqc_ch)\n}\nThe input for the workflowcan then be specified as an argument:\nworkflow {\n my_workflow(Channel.of(params.transcriptome_file), reads_ch)\n}" + }, + { + "objectID": "workshops/4_1_modules.html#workflow-outputs", + "href": "workshops/4_1_modules.html#workflow-outputs", + "title": "Nextflow Development - Developing Modularised Workflows", + "section": "5.2.2 Workflow outputs", + "text": "5.2.2 Workflow outputs\nA workflow can declare one or more output channels using the emit statement. 
For example:\n\nparams.reads = \"/scratch/users/.../nf-training/data/ggal/*_{1,2}.fq\"\nparams.transcriptome_file = \"/scratch/users/.../nf-training/ggal/transcriptome.fa\"\nparams.multiqc = \"/scratch/users/.../nf-training/multiqc\"\n\nreads_ch = Channel.fromFilePairs(\"$params.reads\")\n\ninclude { INDEX } from './modules.nf'\ninclude { QUANTIFICATION as QT } from './modules.nf'\ninclude { FASTQC as FASTQC_one } from './modules.nf'\ninclude { FASTQC as FASTQC_two } from './modules.nf'\ninclude { MULTIQC } from './modules.nf'\ninclude { TRIMGALORE } from './modules/trimgalore.nf'\n\nworkflow my_workflow {\n take:\n transcriptome_file\n reads_ch\n\n main:\n index_ch = INDEX(transcriptome_file)\n quant_ch = QT(index_ch, reads_ch)\n fastqc_ch = FASTQC_one(reads_ch)\n trimgalore_out_ch = TRIMGALORE(reads_ch).reads\n fastqc_cleaned_ch = FASTQC_two(trimgalore_out_ch)\n multiqc_ch = MULTIQC(quant_ch, fastqc_ch)\n\n emit:\n quant_ch\n\n}\n\nworkflow {\n my_workflow(Channel.of(params.transcriptome_file), reads_ch)\n my_workflow.out.view()\n}\nAs a result, you can use the my_workflow.out notation to access the outputs of my_workflow in the invoking workflow.\nYou can also declare named outputs within the emit block.\n emit:\n my_wf_output = quant_ch\nworkflow {\n my_workflow(Channel.of(params.transcriptome_file), reads_ch)\n my_workflow.out.my_wf_output.view()\n}\nThe result of the above snippet can then be accessed using my_workflow.out.my_wf_output." + }, + { + "objectID": "workshops/4_1_modules.html#calling-named-workflows", + "href": "workshops/4_1_modules.html#calling-named-workflows", + "title": "Nextflow Development - Developing Modularised Workflows", + "section": "5.2.3 Calling named workflows", + "text": "5.2.3 Calling named workflows\nWithin a main.nf script (called rnaseq.nf in our example) you can also have multiple workflows. In which case you may want to call a specific workflow when running the code. 
For this you could use the entrypoint call -entry <workflow_name>.\nThe following snippet has two named workflows (quant_wf and qc_wf):\nparams.reads = \"/scratch/users/.../nf-training/data/ggal/*_{1,2}.fq\"\nparams.transcriptome_file = \"/scratch/users/.../nf-training/ggal/transcriptome.fa\"\nparams.multiqc = \"/scratch/users/.../nf-training/multiqc\"\n\nreads_ch = Channel.fromFilePairs(\"$params.reads\")\n\ninclude { INDEX } from './modules.nf'\ninclude { QUANTIFICATION as QT } from './modules.nf'\ninclude { FASTQC as FASTQC_one } from './modules.nf'\ninclude { FASTQC as FASTQC_two } from './modules.nf'\ninclude { MULTIQC } from './modules.nf'\ninclude { TRIMGALORE } from './modules/trimgalore.nf'\n\nworkflow quant_wf {\n index_ch = INDEX(params.transcriptome_file)\n quant_ch = QT(index_ch, reads_ch)\n}\n\nworkflow qc_wf {\n fastqc_ch = FASTQC_one(reads_ch)\n trimgalore_out_ch = TRIMGALORE(reads_ch).reads\n fastqc_cleaned_ch = FASTQC_two(trimgalore_out_ch)\n multiqc_ch = MULTIQC(quant_ch, fastqc_ch)\n}\n\nworkflow {\n quant_wf(Channel.of(params.transcriptome_file), reads_ch)\n qc_wf(reads_ch, quant_wf.out)\n}\nBy default, running the main.nf (called rnaseq.nf in our example) will execute the main workflow block.\nnextflow run runseq.nf --outdir \"results\"\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq4.nf` [goofy_mahavira] DSL2 - revision: 2125d44217\nexecutor > local (12)\n[38/e34e41] process > quant_wf:INDEX (1) [100%] 1 of 1 ✔\n[9e/afc9e0] process > quant_wf:QT (1) [100%] 1 of 1 ✔\n[c1/dc84fe] process > qc_wf:FASTQC_one (3) [100%] 3 of 3 ✔\n[2b/48680f] process > qc_wf:TRIMGALORE (3) [100%] 3 of 3 ✔\n[13/71e240] process > qc_wf:FASTQC_two (3) [100%] 3 of 3 ✔\n[07/cf203f] process > qc_wf:MULTIQC (1) [100%] 1 of 1 ✔\nNote that the process is now annotated with <workflow-name>:<process-name>\nBut you can choose which workflow to run by using the entry flag:\nnextflow run runseq.nf --outdir \"results\" -entry quant_wf\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq5.nf` [magical_picasso] DSL2 - revision: 4ddb8eaa12\nexecutor > local (4)\n[a7/152090] process > quant_wf:INDEX [100%] 1 of 1 ✔\n[cd/612b4a] process > quant_wf:QT (1) [100%] 3 of 3 ✔" + }, + { + "objectID": "workshops/4_1_modules.html#importing-subworkflows", + "href": "workshops/4_1_modules.html#importing-subworkflows", + "title": "Nextflow Development - Developing Modularised Workflows", + "section": "5.2.4 Importing Subworkflows", + "text": "5.2.4 Importing Subworkflows\nSimilar to module script, workflow or sub-workflow can also be imported into other Nextflow scripts using the include statement. This allows you to store these components in one or more file(s) that they can be re-used in multiple workflows.\nAgain using the rnaseq.nf example, you can achieve this by:\nCreating a file called subworkflows.nf in the top-level directory. Copying and pasting all workflow definitions for quant_wf and qc_wf into subworkflows.nf. Removing the workflow definitions in the rnaseq.nf script. Importing the sub-workflows from subworkflows.nf within the rnaseq.nf script anywhere above the workflow definition:\ninclude { QUANT_WF } from './subworkflows.nf'\ninclude { QC_WF } from './subworkflows.nf'\nExercise\nCreate a subworkflows.nf file with the QUANT_WF, and QC_WF from the previous sections. 
Then remove these processes from rnaseq.nf and include them in the workflow using the include definitions shown above.\n\n\n\n\n\n\nSolution\n\n\n\n\n\nThe rnaseq.nf script should look similar to this:\nparams.reads = \"/scratch/users/.../nf-training/data/ggal/*_{1,2}.fq\"\nparams.transcriptome_file = \"/scratch/users/.../nf-training/ggal/transcriptome.fa\"\nparams.multiqc = \"/scratch/users/.../nf-training/multiqc\"\n\nreads_ch = Channel.fromFilePairs(\"$params.reads\")\n\ninclude { QUANT_WF; QC_WF } from './subworkflows.nf'\n\nworkflow {\n QUANT_WF(Channel.of(params.transcriptome_file), reads_ch)\n QC_WF(reads_ch, QUANT_WF.out)\n}\nand the subworkflows.nf script should look similar to this:\ninclude { INDEX } from './modules.nf'\ninclude { QUANTIFICATION as QT } from './modules.nf'\ninclude { FASTQC as FASTQC_one } from './modules.nf'\ninclude { FASTQC as FASTQC_two } from './modules.nf'\ninclude { MULTIQC } from './modules.nf'\ninclude { TRIMGALORE } from './modules/trimgalore.nf'\n\nworkflow QUANT_WF{\n take:\n transcriptome_file\n reads_ch\n\n main:\n index_ch = INDEX(transcriptome_file)\n quant_ch = QT(index_ch, reads_ch)\n\n emit:\n quant_ch\n}\n\nworkflow QC_WF{\n take:\n reads_ch\n quant_ch\n\n main:\n fastqc_ch = FASTQC_one(reads_ch)\n trimgalore_out_ch = TRIMGALORE(reads_ch).reads\n fastqc_cleaned_ch = FASTQC_two(trimgalore_out_ch)\n multiqc_ch = MULTIQC(quant_ch, fastqc_ch)\n\n emit:\n multiqc_ch\n}\n\n\n\nRun the pipeline to check if the workflow import is successful\nnextflow run rnaseq.nf --outdir \"results\" -resume\n\n\n\n\n\n\nChallenge\nStructure modules and subworkflows similar to the setup used by most nf-core pipelines (e.g. nf-core/rnaseq)" + }, + { + "objectID": "workshops/4_1_modules.html#workflow-structure", + "href": "workshops/4_1_modules.html#workflow-structure", + "title": "Nextflow Development - Developing Modularised Workflows", + "section": "5.3 Workflow Structure", + "text": "5.3 Workflow Structure\nThere are three directories in a Nextflow workflow repository that have a special purpose:" + }, + { + "objectID": "workshops/4_1_modules.html#bin", + "href": "workshops/4_1_modules.html#bin", + "title": "Nextflow Development - Developing Modularised Workflows", + "section": "5.3.1 ./bin", + "text": "5.3.1 ./bin\nThe bin directory (if it exists) is always added to the $PATH for all tasks. If the tasks are performed on a remote machine, the directory is copied across to the new machine before the task begins. This Nextflow feature is designed to make it easy to include accessory scripts directly in the workflow without having to commit those scripts into the container. This feature also ensures that the scripts used inside of the workflow move on the same revision schedule as the workflow itself.\nIt is important to know that Nextflow will take care of updating $PATH and ensuring the files are available wherever the task is running, but will not change the permissions of any files in that directory. 
If a file is called by a task as an executable, the workflow developer must ensure that the file has the correct permissions to be executed.\nFor example, let’s say we have a small R script that produces a csv and a tsv:\n\n#!/usr/bin/env Rscript\nlibrary(tidyverse)\n\nplot <- ggplot(mpg, aes(displ, hwy, colour = class)) + geom_point()\nmtcars |> write_tsv(\"cars.tsv\")\nggsave(\"cars.png\", plot = plot)\nWe’d like to use this script in a simple workflow car.nf:\nprocess PlotCars {\n // container 'rocker/tidyverse:latest'\n container '/config/binaries/singularity/containers_devel/nextflow/r-dinoflow_0.1.1.sif'\n\n output:\n path(\"*.png\"), emit: \"plot\"\n path(\"*.tsv\"), emit: \"table\"\n\n script:\n \"\"\"\n cars.R\n \"\"\"\n}\n\nworkflow {\n PlotCars()\n\n PlotCars.out.table | view { \"Found a tsv: $it\" }\n PlotCars.out.plot | view { \"Found a png: $it\" }\n}\nTo do this, we can create the bin directory, write our R script into the directory. Finally, and crucially, we make the script executable:\nchmod +x bin/cars.R\n\n\n\n\n\n\nWarning\n\n\n\nAlways ensure that your scripts are executable. The scripts will not be available to your Nextflow processes without this step.\nYou will get the following error if permission is not set correctly.\nERROR ~ Error executing process > 'PlotCars'\n\nCaused by:\n Process `PlotCars` terminated with an error exit status (126)\n\nCommand executed:\n\n cars.R\n\nCommand exit status:\n 126\n\nCommand output:\n (empty)\n\nCommand error:\n .command.sh: line 2: /scratch/users/.../bin/cars.R: Permission denied\n\nWork dir:\n /scratch/users/.../work/6b/86d3d0060266b1ca515cc851d23890\n\nTip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`\n\n -- Check '.nextflow.log' file for details\n\n\nLet’s run the script and see what Nextflow is doing for us behind the scenes:\nnextflow run car.nf\nand then inspect the .command.run file that Nextflow has generated\nYou’ll notice a nxf_container_env bash function that appends our bin directory to $PATH:\nnxf_container_env() {\ncat << EOF\nexport PATH=\"\\$PATH:/scratch/users/<your-user-name>/.../bin\"\nEOF\n}\nWhen working on the cloud, Nextflow will also ensure that the bin directory is copied onto the virtual machine running your task in addition to the modification of $PATH." + }, + { + "objectID": "workshops/4_1_modules.html#templates", + "href": "workshops/4_1_modules.html#templates", + "title": "Nextflow Development - Developing Modularised Workflows", + "section": "5.3.2 ./templates", + "text": "5.3.2 ./templates\nIf a process script block is becoming too long, it can be moved to a template file. The template file can then be imported into the process script block using the template method. This is useful for keeping the process block tidy and readable. Nextflow’s use of $ to indicate variables also allows for directly testing the template file by running it as a script.\nFor example:\n# cat templates/my_script.sh\n\n#!/bin/bash\necho \"process started at `date`\"\necho $name\necho \"process completed\"\nprocess SayHiTemplate {\n debug true\n input: \n val(name)\n\n script: \n template 'my_script.sh'\n}\n\nworkflow {\n SayHiTemplate(\"Hello World\")\n}\nBy default, Nextflow looks for the my_script.sh template file in the templates directory located alongside the Nextflow script and/or the module script in which the process is defined. Any other location can be specified by using an absolute template path." 
+ }, + { + "objectID": "workshops/4_1_modules.html#lib", + "href": "workshops/4_1_modules.html#lib", + "title": "Nextflow Development - Developing Modularised Workflows", + "section": "5.3.3 ./lib", + "text": "5.3.3 ./lib\nIn the next chapter, we will start looking into adding small helper Groovy functions to the main.nf file. It may at times be helpful to bundle functionality into a new Groovy class. Any classes defined in the lib directory are available for use in the workflow - both main.nf and any imported modules.\nClasses defined in lib directory can be used for a variety of purposes. For example, the nf-core/rnaseq workflow uses five custom classes:\n\nNfcoreSchema.groovy for parsing the schema.json file and validating the workflow parameters.\nNfcoreTemplate.groovy for email templating and nf-core utility functions.\nUtils.groovy for provision of a single checkCondaChannels method.\nWorkflowMain.groovy for workflow setup and to call the NfcoreTemplate class.\nWorkflowRnaseq.groovy for the workflow-specific functions.\n\nThe classes listed above all provide utility executed at the beginning of a workflow, and are generally used to “set up” the workflow. However, classes defined in lib can also be used to provide functionality to the workflow itself." + }, + { + "objectID": "workshops/4_1_modules.html#groovy-functions-and-libraries", + "href": "workshops/4_1_modules.html#groovy-functions-and-libraries", + "title": "Nextflow Development - Developing Modularised Workflows", + "section": "6. Groovy Functions and Libraries", + "text": "6. Groovy Functions and Libraries\nNextflow is a domain specific language (DSL) implemented on top of the Groovy programming language, which in turn is a super-set of the Java programming language. This means that Nextflow can run any Groovy or Java code.\nYou have already been using some Groovy code in the previous sections, but now it’s time to learn more about it." + }, + { + "objectID": "workshops/4_1_modules.html#some-useful-groovy-introduction", + "href": "workshops/4_1_modules.html#some-useful-groovy-introduction", + "title": "Nextflow Development - Developing Modularised Workflows", + "section": "6.1 Some useful groovy introduction", + "text": "6.1 Some useful groovy introduction" + }, + { + "objectID": "workshops/4_1_modules.html#variables", + "href": "workshops/4_1_modules.html#variables", + "title": "Nextflow Development - Developing Modularised Workflows", + "section": "6.1.1 Variables", + "text": "6.1.1 Variables\nTo define a variable, simply assign a value to it:\nx = 1\nprintln x\n\nx = new java.util.Date()\nprintln x\n\nx = -3.1499392\nprintln x\n\nx = false\nprintln x\n\nx = \"Hi\"\nprintln x\n>> nextflow run variable.nf\n\nN E X T F L O W ~ version 23.04.1\nLaunching `variable.nf` [trusting_moriondo] DSL2 - revision: ee74c86d04\n1\nWed Jun 05 03:45:19 AEST 2024\n-3.1499392\nfalse\nHi\nLocal variables are defined using the def keyword:\ndef x = 'foo'\nThe def should be always used when defining variables local to a function or a closure." 
+ }, + { + "objectID": "workshops/4_1_modules.html#maps", + "href": "workshops/4_1_modules.html#maps", + "title": "Nextflow Development - Developing Modularised Workflows", + "section": "6.1.2 Maps", + "text": "6.1.2 Maps\nMaps are like lists that have an arbitrary key instead of an integer (allow key-value pair).\nmap = [a: 0, b: 1, c: 2]\nMaps can be accessed in a conventional square-bracket syntax or as if the key was a property of the map.\nmap = [a: 0, b: 1, c: 2]\n\nassert map['a'] == 0 \nassert map.b == 1 \nassert map.get('c') == 2 \nTo add data or to modify a map, the syntax is similar to adding values to a list:\nmap = [a: 0, b: 1, c: 2]\n\nmap['a'] = 'x' \nmap.b = 'y' \nmap.put('c', 'z') \nassert map == [a: 'x', b: 'y', c: 'z']\nMap objects implement all methods provided by the java.util.Map interface, plus the extension methods provided by Groovy." + }, + { + "objectID": "workshops/4_1_modules.html#if-statement", + "href": "workshops/4_1_modules.html#if-statement", + "title": "Nextflow Development - Developing Modularised Workflows", + "section": "6.1.3 If statement", + "text": "6.1.3 If statement\nThe if statement uses the same syntax common in other programming languages, such as Java, C, and JavaScript.\nif (< boolean expression >) {\n // true branch\n}\nelse {\n // false branch\n}\nThe else branch is optional. Also, the curly brackets are optional when the branch defines just a single statement.\nx = 1\nif (x > 10)\n println 'Hello'\nIn some cases it can be useful to replace the if statement with a ternary expression (aka a conditional expression):\nprintln list ? list : 'The list is empty'\nThe previous statement can be further simplified using the Elvis operator:\nprintln list ?: 'The list is empty'" + }, + { + "objectID": "workshops/4_1_modules.html#functions", + "href": "workshops/4_1_modules.html#functions", + "title": "Nextflow Development - Developing Modularised Workflows", + "section": "6.1.4 Functions", + "text": "6.1.4 Functions\nIt is possible to define a custom function into a script:\ndef fib(int n) {\n return n < 2 ? 1 : fib(n - 1) + fib(n - 2)\n}\n\nassert fib(10)==89\nA function can take multiple arguments separating them with a comma.\nThe return keyword can be omitted and the function implicitly returns the value of the last evaluated expression. Also, explicit types can be omitted, though not recommended:\ndef fact(n) {\n n > 1 ? n * fact(n - 1) : 1\n}\n\nassert fact(5) == 120" + }, + { + "objectID": "workshops/4_1_modules.html#grooovy-library", + "href": "workshops/4_1_modules.html#grooovy-library", + "title": "Nextflow Development - Developing Modularised Workflows", + "section": "6.2 Grooovy Library", + "text": "6.2 Grooovy Library" + }, + { + "objectID": "workshops/4_1_modules.html#testing", + "href": "workshops/4_1_modules.html#testing", + "title": "Nextflow Development - Developing Modularised Workflows", + "section": "7. Testing", + "text": "7. 
Testing" + }, + { + "objectID": "workshops/4_1_modules.html#stub", + "href": "workshops/4_1_modules.html#stub", + "title": "Nextflow Development - Developing Modularised Workflows", + "section": "7.1 Stub", + "text": "7.1 Stub\nYou can define a command stub, which replaces the actual process command when the -stub-run or -stub command-line option is enabled:\n\nprocess INDEX {\n input:\n path transcriptome\n\n output:\n path 'index'\n\n script:\n \"\"\"\n salmon index --threads $task.cpus -t $transcriptome -i index\n \"\"\"\n\n stub:\n \"\"\"\n mkdir index\n touch index/seq.bin\n touch index/info.json\n touch index/refseq.bin\n \"\"\"\n}\nThe stub block can be defined before or after the script block. When the pipeline is executed with the -stub-run option and a process’s stub is not defined, the script block is executed.\nThis feature makes it easier to quickly prototype the workflow logic without using the real commands. The developer can use it to provide a dummy script that mimics the execution of the real one in a quicker manner. In other words, it is a way to perform a dry-run." + }, + { + "objectID": "workshops/4_1_modules.html#test-profile", + "href": "workshops/4_1_modules.html#test-profile", + "title": "Nextflow Development - Developing Modularised Workflows", + "section": "7.2 Test profile", + "text": "7.2 Test profile" + }, + { + "objectID": "workshops/4_1_modules.html#nf-test", + "href": "workshops/4_1_modules.html#nf-test", + "title": "Nextflow Development - Developing Modularised Workflows", + "section": "7.3. nf-test", + "text": "7.3. nf-test\nIt is critical for reproducibility and long-term maintenance to have a way to systematically test that every part of your workflow is doing what it’s supposed to do. To that end, people often focus on top-level tests, in which the workflow is un on some test data from start to finish. This is useful but unfortunately incomplete. You should also implement module-level tests (equivalent to what is called ‘unit tests’ in general software engineering) to verify the functionality of individual components of your workflow, ensuring that each module performs as expected under different conditions and inputs.\nThe nf-test package provides a testing framework that integrates well with Nextflow and makes it straightforward to add both module-level and workflow-level tests to your pipeline. For more background information, read the blog post about nf-test on the nf-core blog.\nSee this tutorial for some examples.\n\nThis workshop is adapted from Fundamentals Training, Advanced Training, Developer Tutorials, and Nextflow Patterns materials from Nextflow and nf-core" }, { "objectID": "workshops/00_setup.html", diff --git a/sessions/1_intro_run_nf.html b/sessions/1_intro_run_nf.html index 37fa214..5a02e24 100644 --- a/sessions/1_intro_run_nf.html +++ b/sessions/1_intro_run_nf.html @@ -132,6 +132,10 @@
  • Creating a workflow +
  • +
  • + + Modularisation
  • diff --git a/sessions/2_nf_dev_intro.html b/sessions/2_nf_dev_intro.html index b599231..930c73a 100644 --- a/sessions/2_nf_dev_intro.html +++ b/sessions/2_nf_dev_intro.html @@ -132,6 +132,10 @@
  • Creating a workflow +
  • +
  • + + Modularisation
  • @@ -187,13 +191,13 @@

    Course Presenters

    Course Helpers

    @@ -253,7 +257,7 @@

    Workshop schedule

    29th May 2024 -Developing Reusable Workflows +Developing Modularised Workflows Introduction to modules imports, sub-workflows, setting up test-profile, and common useful groovy functions 5th Jun 2024 diff --git a/sitemap.xml b/sitemap.xml index a2b23ea..18d7675 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -2,46 +2,50 @@ https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/sessions/2_nf_dev_intro.html - 2024-05-29T01:20:43.219Z + 2024-06-04T18:11:31.022Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/index.html - 2024-05-29T01:20:42.278Z + 2024-06-04T18:11:30.249Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/2.3_tips_and_tricks.html - 2024-05-29T01:20:40.418Z + 2024-06-04T18:11:28.642Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/1.2_intro_nf_core.html - 2024-05-29T01:20:39.398Z + 2024-06-04T18:11:27.714Z - https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/4.1_draft_future_sess.html - 2024-05-29T01:20:37.648Z + https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/2.2_troubleshooting.html + 2024-06-04T18:11:26.090Z + + + https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/1.1_intro_nextflow.html + 2024-06-04T18:11:24.979Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/3.1_creating_a_workflow.html - 2024-05-29T01:20:36.497Z + 2024-06-04T18:11:24.315Z - https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/1.1_intro_nextflow.html - 2024-05-29T01:20:37.202Z + https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/4.1_draft_future_sess.html + 2024-06-04T18:11:25.367Z - https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/2.2_troubleshooting.html - 2024-05-29T01:20:38.501Z + https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/4_1_modules.html + 2024-06-04T18:11:26.903Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/00_setup.html - 2024-05-29T01:20:39.844Z + 2024-06-04T18:11:28.152Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/2.1_customise_and_run.html - 2024-05-29T01:20:41.860Z + 2024-06-04T18:11:29.910Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/sessions/1_intro_run_nf.html - 2024-05-29T01:20:42.751Z + 2024-06-04T18:11:30.632Z diff --git a/workshops/00_setup.html b/workshops/00_setup.html index cd6f8aa..2b6545c 100644 --- a/workshops/00_setup.html +++ b/workshops/00_setup.html @@ -166,6 +166,10 @@
  • Creating a workflow +
  • +
  • + + Modularisation
  • diff --git a/workshops/1.1_intro_nextflow.html b/workshops/1.1_intro_nextflow.html index 7423e12..47044d3 100644 --- a/workshops/1.1_intro_nextflow.html +++ b/workshops/1.1_intro_nextflow.html @@ -166,6 +166,10 @@
  • Creating a workflow +
  • +
  • + + Modularisation
  • diff --git a/workshops/1.2_intro_nf_core.html b/workshops/1.2_intro_nf_core.html index 784f099..6ad7aed 100644 --- a/workshops/1.2_intro_nf_core.html +++ b/workshops/1.2_intro_nf_core.html @@ -166,6 +166,10 @@
  • Creating a workflow +
  • +
  • + + Modularisation
  • diff --git a/workshops/2.1_customise_and_run.html b/workshops/2.1_customise_and_run.html index fbe40d9..8f54609 100644 --- a/workshops/2.1_customise_and_run.html +++ b/workshops/2.1_customise_and_run.html @@ -166,6 +166,10 @@
  • Creating a workflow +
  • +
  • + + Modularisation
  • diff --git a/workshops/2.2_troubleshooting.html b/workshops/2.2_troubleshooting.html index e03df02..03b617c 100644 --- a/workshops/2.2_troubleshooting.html +++ b/workshops/2.2_troubleshooting.html @@ -166,6 +166,10 @@
  • Creating a workflow +
  • +
  • + + Modularisation
  • diff --git a/workshops/2.3_tips_and_tricks.html b/workshops/2.3_tips_and_tricks.html index 177cc3e..d229d06 100644 --- a/workshops/2.3_tips_and_tricks.html +++ b/workshops/2.3_tips_and_tricks.html @@ -166,6 +166,10 @@
  • Creating a workflow +
  • +
  • + + Modularisation
  • diff --git a/workshops/3.1_creating_a_workflow.html b/workshops/3.1_creating_a_workflow.html index 962a1b5..d81c526 100644 --- a/workshops/3.1_creating_a_workflow.html +++ b/workshops/3.1_creating_a_workflow.html @@ -166,6 +166,10 @@
  • Creating a workflow +
  • +
  • + + Modularisation
  • diff --git a/workshops/4.1_draft_future_sess.html b/workshops/4.1_draft_future_sess.html index 70a6c0b..16dbc84 100644 --- a/workshops/4.1_draft_future_sess.html +++ b/workshops/4.1_draft_future_sess.html @@ -166,6 +166,10 @@
  • Creating a workflow +
  • +
  • + + Modularisation
  • diff --git a/workshops/4_1_modules.html b/workshops/4_1_modules.html new file mode 100644 index 0000000..24ab4fb --- /dev/null +++ b/workshops/4_1_modules.html @@ -0,0 +1,1310 @@ + + + + + + + + + +Peter Mac Nextflow Workshop - Nextflow Development - Developing Modularised Workflows + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    +
    + +
    + +
    + + + + +
    + +
    +
    +

    Nextflow Development - Developing Modularised Workflows

    +
    + + + +
    + + + + +
    + + +
    + +
    +
    +
    + +
    +
    +Objectives +
    +
    +
    +
      +
    • Gain an understanding of Nextflow modules and subworkflows
    • +
    • Gain an understanding of Nextflow workflow structures
    • +
    • Explore some groovy functions and libraries
    • +
    • Setup config, profile, and some test data
    • +
    +
    +
    +
    +

    Environment Setup

    +

    Set up an interactive shell to run our Nextflow workflow:

    +
    srun --pty -p prod_short --mem 8GB --mincpus 2 -t 0-2:00 bash
    +

    Load the required modules to run Nextflow:

    +
    module load nextflow/23.04.1
    +module load singularity/3.7.3
    +

    Set the singularity cache environment variable:

    +
    export NXF_SINGULARITY_CACHEDIR=/config/binaries/singularity/containers_devel/nextflow
    +

    Singularity images downloaded by workflow executions will now be stored in this directory.

    +

You may want to include these, or other environment variables, in your .bashrc file (or an alternative startup file) that is loaded when you log in, so you don't need to export the variables every session. A complete list of environment variables can be found here.
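For example, you could append the cache variable to your ~/.bashrc so that it is set automatically at login (a small sketch reusing the path above):
echo 'export NXF_SINGULARITY_CACHEDIR=/config/binaries/singularity/containers_devel/nextflow' >> ~/.bashrc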

    +
    +
    +

5. Modularisation

    +

    The definition of module libraries simplifies the writing of complex data analysis workflows and makes re-use of processes much easier.

    +

Using the rnaseq.nf example from the previous section, you can convert the workflow's processes into modules, then call them within the workflow scope.

    +
    #!/usr/bin/env nextflow
    +
    +params.reads = "/scratch/users/.../nf-training/data/ggal/*_{1,2}.fq"
    +params.transcriptome_file = "/scratch/users/.../nf-training/ggal/transcriptome.fa"
    +params.multiqc = "/scratch/users/.../nf-training/multiqc"
    +
    +reads_ch = Channel.fromFilePairs("$params.reads")
    +
    +process INDEX {
    +    container "/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img"
    +
    +    input:
    +    path transcriptome
    +
    +    output:
    +    path "salmon_idx"
    +
    +    script:
    +    """
    +    salmon index --threads $task.cpus -t $transcriptome -i salmon_idx
    +    """
    +}
    +
    +process QUANTIFICATION {
    +    container "/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img"
    +
    +    input:
    +    path salmon_index
    +    tuple val(sample_id), path(reads)
    +
    +    output:
    +    path "$sample_id"
    +
    +    script:
    +    """
    +    salmon quant --threads $task.cpus --libType=U \
    +    -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id
    +    """
    +}
    +
    +process FASTQC {
    +    container "/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-fastqc-0.12.1--hdfd78af_0.img"
    +
    +    input:
    +    tuple val(sample_id), path(reads)
    +
    +    output:
    +    path "fastqc_${sample_id}_logs"
    +
    +    script:
    +    """
    +    mkdir fastqc_${sample_id}_logs
    +    fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}
    +    """
    +}
    +
    +process MULTIQC {
    +    publishDir params.outdir, mode:'copy'
    +    container "/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-multiqc-1.21--pyhdfd78af_0.img"
    +
    +    input:
    +    path quantification
    +    path fastqc
    +
    +    output:
    +    path "*.html"
    +
    +    script:
    +    """
    +    multiqc . --filename $quantification
    +    """
    +}
    +
    +workflow {
    +  index_ch = INDEX(params.transcriptome_file)
    +  quant_ch = QUANTIFICATION(index_ch, reads_ch)
    +  quant_ch.view()
    +
    +  fastqc_ch = FASTQC(reads_ch)
    +  multiqc_ch = MULTIQC(quant_ch, fastqc_ch)
    +}
    +
    +
    +

    5.1 Modules

    +

    Nextflow DSL2 allows for the definition of stand-alone module scripts that can be included and shared across multiple workflows. Each module can contain its own process or workflow definition.

    +
    +
    +

    5.1.1. Importing modules

    +

Components defined in a module script can be imported into other Nextflow scripts using the include statement. This allows you to store these components in one or more files so that they can be re-used in multiple workflows.

    +

    Using the rnaseq.nf example, you can achieve this by:

    +

Create a file called modules.nf in the top-level directory. Copy and paste all process definitions for INDEX, QUANTIFICATION, FASTQC and MULTIQC into modules.nf. Remove those process definitions from the rnaseq.nf script. Finally, import the processes from modules.nf within the rnaseq.nf script, anywhere above the workflow definition:

    +
    include { INDEX } from './modules.nf'
    +include { QUANTIFICATION } from './modules.nf'
    +include { FASTQC } from './modules.nf'
    +include { MULTIQC } from './modules.nf'
    +
    +
    +
    + +
    +
    +Tip +
    +
    +
    +

In general, you would use relative paths to define the location of the module scripts using the ./ prefix.

    +
    +
    +

    Exercise

    +

Create a modules.nf file with the INDEX, QUANTIFICATION, FASTQC and MULTIQC processes from rnaseq.nf. Then remove these processes from rnaseq.nf and include them in the workflow using the include definitions shown above.

    +
    + +
    +
    +

    The rnaseq.nf script should look similar to this:

    +
    params.reads = "/scratch/users/.../nf-training/data/ggal/*_{1,2}.fq"
    +params.transcriptome_file = "/scratch/users/.../nf-training/ggal/transcriptome.fa"
    +params.multiqc = "/scratch/users/.../nf-training/multiqc"
    +
    +reads_ch = Channel.fromFilePairs("$params.reads")
    +
    +include { INDEX } from './modules.nf'
    +include { QUANTIFICATION } from './modules.nf'
    +include { FASTQC } from './modules.nf'
    +include { MULTIQC } from './modules.nf'
    +
    +workflow {
    +  index_ch = INDEX(params.transcriptome_file)
    +  quant_ch = QUANTIFICATION(index_ch, reads_ch)
    +  quant_ch.view()
    +
    +  fastqc_ch = FASTQC(reads_ch)
    +  multiqc_ch = MULTIQC(quant_ch, fastqc_ch)
    +}
    +
    +
    +
    +

    Run the pipeline to check if the module import is successful

    +
    nextflow run rnaseq.nf --outdir "results" -resume
    +
    +
    +
    + +
    +
    +

    Challenge

    +

Try modularising modules.nf even further to achieve a setup of one tool per module (a module can contain one or more processes), similar to the setup used by most nf-core pipelines; an example set of include paths is sketched after the layout below.

    +
    nfcore/rna-seq
    +  | modules
    +    | local
    +      | multiqc
    +      | deseq2_qc
    +    | nf-core
    +      | fastqc
    +      | salmon
    +        | index
    +          | main.nf
    +        | quant
    +          | main.nf
    +
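With a layout like this, the include statements in the main script would point at each tool's main.nf. A sketch of what that might look like (the process names and exact paths here are assumptions based on the structure above, not taken from nf-core/rnaseq):
include { FASTQC       } from './modules/nf-core/fastqc/main.nf'
include { SALMON_INDEX } from './modules/nf-core/salmon/index/main.nf'
include { SALMON_QUANT } from './modules/nf-core/salmon/quant/main.nf'
include { MULTIQC      } from './modules/local/multiqc/main.nf'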
    +
    +
    +
    +
    +

    5.1.2. Multiple imports

    +

    If a Nextflow module script contains multiple process definitions they can also be imported using a single include statement as shown in the example below:

    +
    params.reads = "/scratch/users/.../nf-training/data/ggal/*_{1,2}.fq"
    +params.transcriptome_file = "/scratch/users/.../nf-training/ggal/transcriptome.fa"
    +params.multiqc = "/scratch/users/.../nf-training/multiqc"
    +reads_ch = Channel.fromFilePairs("$params.reads")
    +
    +include { INDEX; QUANTIFICATION; FASTQC; MULTIQC } from './modules.nf'
    +
    +workflow {
    +  index_ch = INDEX(params.transcriptome_file)
    +  quant_ch = QUANTIFICATION(index_ch, reads_ch)
    +  fastqc_ch = FASTQC(reads_ch)
    +  multiqc_ch = MULTIQC(quant_ch, fastqc_ch)
    +}
    +
    +
    +

    5.1.3 Module aliases

    +

    When including a module component it is possible to specify a name alias using the as declaration. This allows the inclusion and the invocation of the same component multiple times using different names:

    +
    params.reads = "/scratch/users/.../nf-training/data/ggal/*_{1,2}.fq"
    +params.transcriptome_file = "/scratch/users/.../nf-training/ggal/transcriptome.fa"
    +params.multiqc = "/scratch/users/.../nf-training/multiqc"
    +
    +reads_ch = Channel.fromFilePairs("$params.reads")
    +
    +include { INDEX } from './modules.nf'
    +include { QUANTIFICATION as QT } from './modules.nf'
    +include { FASTQC as FASTQC_one } from './modules.nf'
    +include { FASTQC as FASTQC_two } from './modules.nf'
    +include { MULTIQC } from './modules.nf'
    +include { TRIMGALORE } from './modules/trimgalore.nf'
    +
    +workflow {
    +  index_ch = INDEX(params.transcriptome_file)
    +  quant_ch = QT(index_ch, reads_ch)
    +  fastqc_ch = FASTQC_one(reads_ch)
    +  trimgalore_out_ch = TRIMGALORE(reads_ch).reads
    +  fastqc_cleaned_ch = FASTQC_two(trimgalore_out_ch)
    +
    +  multiqc_ch = MULTIQC(quant_ch, fastqc_ch)
    +}
    +
    process TRIMGALORE {
    +  container '/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-trim-galore-0.6.6--0.img' 
    +
    +  input:
    +    tuple val(sample_id), path(reads)
    +  
    +  output:
    +    tuple val(sample_id), path("*{3prime,5prime,trimmed,val}*.fq.gz"), emit: reads
    +    tuple val(sample_id), path("*report.txt")                        , emit: log     , optional: true
    +    tuple val(sample_id), path("*unpaired*.fq.gz")                   , emit: unpaired, optional: true
    +    tuple val(sample_id), path("*.html")                             , emit: html    , optional: true
    +    tuple val(sample_id), path("*.zip")                              , emit: zip     , optional: true
    +
    +  script:
    +    """
    +    trim_galore \\
    +      --paired \\
    +      --gzip \\
    +      ${reads[0]} \\
    +      ${reads[1]}
    +    """
    +
    +}
    +

Note how the QUANTIFICATION process is now referred to as QT, how the FASTQC process is imported twice, each time with a different alias, and how these aliases are used to invoke the processes.

    +
    
    +N E X T F L O W  ~  version 23.04.1
    +Launching `rnaseq.nf` [sharp_meitner] DSL2 - revision: 6afd5bf37c
    +executor >  local (16)
    +[c7/56160a] process > INDEX          [100%] 1 of 1 ✔
    +[75/cb99dd] process > QT (3)         [100%] 3 of 3 ✔
    +[d9/e298c6] process > FASTQC_one (3) [100%] 3 of 3 ✔
    +[5e/7ccc39] process > TRIMGALORE (3) [100%] 3 of 3 ✔
    +[a3/3a1e2e] process > FASTQC_two (3) [100%] 3 of 3 ✔
    +[e1/411323] process > MULTIQC (3)    [100%] 3 of 3 ✔
    +
    +
    +
    + +
    +
    +Warning +
    +
    +
    +

    What do you think will happen if FASTQC is imported only once without alias, but used twice within the workflow?

    +
    + +
    +
    +
    Process 'FASTQC' has been already used -- If you need to reuse the same component, include it with a different name or include it in a different workflow context
    +
    +
    +
    +
    +
    +
    +
    +

    5.2 Workflow definition

    +

    The workflow scope allows the definition of components that define the invocation of one or more processes or operators:

    +
    
    +params.reads = "/scratch/users/.../nf-training/data/ggal/*_{1,2}.fq"
    +params.transcriptome_file = "/scratch/users/.../nf-training/ggal/transcriptome.fa"
    +params.multiqc = "/scratch/users/.../nf-training/multiqc"
    +
    +reads_ch = Channel.fromFilePairs("$params.reads")
    +
    +include { INDEX } from './modules.nf'
    +include { QUANTIFICATION as QT } from './modules.nf'
    +include { FASTQC as FASTQC_one } from './modules.nf'
    +include { FASTQC as FASTQC_two } from './modules.nf'
    +include { MULTIQC } from './modules.nf'
    +include { TRIMGALORE } from './modules/trimgalore.nf'
    +
    +workflow my_workflow {
    +  index_ch = INDEX(params.transcriptome_file)
    +  quant_ch = QT(index_ch, reads_ch)
    +  fastqc_ch = FASTQC_one(reads_ch)
    +  trimgalore_out_ch = TRIMGALORE(reads_ch).reads
    +  fastqc_cleaned_ch = FASTQC_two(trimgalore_out_ch)
    +
    +  multiqc_ch = MULTIQC(quant_ch, fastqc_ch)
    +}
    +
    +workflow {
    +  my_workflow()
    +}
    +

For example, the snippet above defines a workflow named my_workflow, which is invoked from another (unnamed) workflow definition.

    +
    +
    +

    5.2.1 Workflow inputs

    +

A workflow component can declare one or more input channels using the take statement. When the take statement is used, the body of the workflow needs to be declared within the main block.

    +

    For example:

    +
    
    +params.reads = "/scratch/users/.../nf-training/data/ggal/*_{1,2}.fq"
    +params.transcriptome_file = "/scratch/users/.../nf-training/ggal/transcriptome.fa"
    +params.multiqc = "/scratch/users/.../nf-training/multiqc"
    +
    +reads_ch = Channel.fromFilePairs("$params.reads")
    +
    +include { INDEX } from './modules.nf'
    +include { QUANTIFICATION as QT } from './modules.nf'
    +include { FASTQC as FASTQC_one } from './modules.nf'
    +include { FASTQC as FASTQC_two } from './modules.nf'
    +include { MULTIQC } from './modules.nf'
    +include { TRIMGALORE } from './modules/trimgalore.nf'
    +
    +workflow my_workflow {
    +  take:
    +  transcriptome_file
    +  reads_ch
    +
    +  main:
    +  index_ch = INDEX(transcriptome_file)
    +  quant_ch = QT(index_ch, reads_ch)
    +  fastqc_ch = FASTQC_one(reads_ch)
    +  trimgalore_out_ch = TRIMGALORE(reads_ch).reads
    +  fastqc_cleaned_ch = FASTQC_two(trimgalore_out_ch)
    +
    +  multiqc_ch = MULTIQC(quant_ch, fastqc_ch)
    +}
    +

The inputs for the workflow can then be specified as arguments:

    +
    workflow {
    +  my_workflow(Channel.of(params.transcriptome_file), reads_ch)
    +}
    +
    +
    +

    5.2.2 Workflow outputs

    +

    A workflow can declare one or more output channels using the emit statement. For example:

    +
    
    +params.reads = "/scratch/users/.../nf-training/data/ggal/*_{1,2}.fq"
    +params.transcriptome_file = "/scratch/users/.../nf-training/ggal/transcriptome.fa"
    +params.multiqc = "/scratch/users/.../nf-training/multiqc"
    +
    +reads_ch = Channel.fromFilePairs("$params.reads")
    +
    +include { INDEX } from './modules.nf'
    +include { QUANTIFICATION as QT } from './modules.nf'
    +include { FASTQC as FASTQC_one } from './modules.nf'
    +include { FASTQC as FASTQC_two } from './modules.nf'
    +include { MULTIQC } from './modules.nf'
    +include { TRIMGALORE } from './modules/trimgalore.nf'
    +
    +workflow my_workflow {
    +  take:
    +  transcriptome_file
    +  reads_ch
    +
    +  main:
    +  index_ch = INDEX(transcriptome_file)
    +  quant_ch = QT(index_ch, reads_ch)
    +  fastqc_ch = FASTQC_one(reads_ch)
    +  trimgalore_out_ch = TRIMGALORE(reads_ch).reads
    +  fastqc_cleaned_ch = FASTQC_two(trimgalore_out_ch)
    +  multiqc_ch = MULTIQC(quant_ch, fastqc_ch)
    +
    +  emit:
    +  quant_ch
    +
    +}
    +
    +workflow {
    +  my_workflow(Channel.of(params.transcriptome_file), reads_ch)
    +  my_workflow.out.view()
    +}
    +

    As a result, you can use the my_workflow.out notation to access the outputs of my_workflow in the invoking workflow.

    +

    You can also declare named outputs within the emit block.

    +
      emit:
    +  my_wf_output = quant_ch
    +
    workflow {
    +  my_workflow(Channel.of(params.transcriptome_file), reads_ch)
    +  my_workflow.out.my_wf_output.view()
    +}
    +

    The result of the above snippet can then be accessed using my_workflow.out.my_wf_output.

    +
    +
    +

    5.2.3 Calling named workflows

    +

Within a main.nf script (called rnaseq.nf in our example) you can also have multiple workflows, in which case you may want to call a specific workflow when running the code. For this you can use the -entry <workflow_name> command-line option.

    +

    The following snippet has two named workflows (quant_wf and qc_wf):

    +
    params.reads = "/scratch/users/.../nf-training/data/ggal/*_{1,2}.fq"
    +params.transcriptome_file = "/scratch/users/.../nf-training/ggal/transcriptome.fa"
    +params.multiqc = "/scratch/users/.../nf-training/multiqc"
    +
    +reads_ch = Channel.fromFilePairs("$params.reads")
    +
    +include { INDEX } from './modules.nf'
    +include { QUANTIFICATION as QT } from './modules.nf'
    +include { FASTQC as FASTQC_one } from './modules.nf'
    +include { FASTQC as FASTQC_two } from './modules.nf'
    +include { MULTIQC } from './modules.nf'
    +include { TRIMGALORE } from './modules/trimgalore.nf'
    +
    +workflow quant_wf {
    +  index_ch = INDEX(params.transcriptome_file)
    +  quant_ch = QT(index_ch, reads_ch)
    +}
    +
    +workflow qc_wf {
    +  fastqc_ch = FASTQC_one(reads_ch)
    +  trimgalore_out_ch = TRIMGALORE(reads_ch).reads
    +  fastqc_cleaned_ch = FASTQC_two(trimgalore_out_ch)
    +  multiqc_ch = MULTIQC(quant_ch, fastqc_ch)
    +}
    +
    +workflow {
    +  quant_wf(Channel.of(params.transcriptome_file), reads_ch)
    +  qc_wf(reads_ch, quant_wf.out)
    +}
    +

By default, running the main.nf (called rnaseq.nf in our example) will execute the main, unnamed workflow block.

    +
nextflow run rnaseq.nf --outdir "results"
    +
    N E X T F L O W  ~  version 23.04.1
    +Launching `rnaseq4.nf` [goofy_mahavira] DSL2 - revision: 2125d44217
    +executor >  local (12)
    +[38/e34e41] process > quant_wf:INDEX (1)   [100%] 1 of 1 ✔
    +[9e/afc9e0] process > quant_wf:QT (1)      [100%] 1 of 1 ✔
    +[c1/dc84fe] process > qc_wf:FASTQC_one (3) [100%] 3 of 3 ✔
    +[2b/48680f] process > qc_wf:TRIMGALORE (3) [100%] 3 of 3 ✔
    +[13/71e240] process > qc_wf:FASTQC_two (3) [100%] 3 of 3 ✔
    +[07/cf203f] process > qc_wf:MULTIQC (1)    [100%] 1 of 1 ✔
    +

Note that each process is now annotated with <workflow-name>:<process-name>.

    +

    But you can choose which workflow to run by using the entry flag:

    +
nextflow run rnaseq.nf --outdir "results" -entry quant_wf
    +
    N E X T F L O W  ~  version 23.04.1
    +Launching `rnaseq5.nf` [magical_picasso] DSL2 - revision: 4ddb8eaa12
    +executor >  local (4)
    +[a7/152090] process > quant_wf:INDEX  [100%] 1 of 1 ✔
    +[cd/612b4a] process > quant_wf:QT (1) [100%] 3 of 3 ✔
    +
    +
    +

    5.2.4 Importing Subworkflows

    +

Similar to module scripts, workflows or sub-workflows can also be imported into other Nextflow scripts using the include statement. This allows you to store these components in one or more files so that they can be re-used in multiple workflows.

    +

    Again using the rnaseq.nf example, you can achieve this by:

    +

Create a file called subworkflows.nf in the top-level directory. Copy and paste the workflow definitions for quant_wf and qc_wf into subworkflows.nf (renaming them QUANT_WF and QC_WF to match the includes below). Remove those workflow definitions from the rnaseq.nf script. Finally, import the sub-workflows from subworkflows.nf within the rnaseq.nf script, anywhere above the workflow definition:

    +
    include { QUANT_WF } from './subworkflows.nf'
    +include { QC_WF } from './subworkflows.nf'
    +

    Exercise

    +

Create a subworkflows.nf file with the QUANT_WF and QC_WF workflows from the previous sections. Then remove these workflow definitions from rnaseq.nf and include them in the workflow using the include definitions shown above.

    +
    + +
    +
    +

    The rnaseq.nf script should look similar to this:

    +
    params.reads = "/scratch/users/.../nf-training/data/ggal/*_{1,2}.fq"
    +params.transcriptome_file = "/scratch/users/.../nf-training/ggal/transcriptome.fa"
    +params.multiqc = "/scratch/users/.../nf-training/multiqc"
    +
    +reads_ch = Channel.fromFilePairs("$params.reads")
    +
    +include { QUANT_WF; QC_WF } from './subworkflows.nf'
    +
    +workflow {
    +  QUANT_WF(Channel.of(params.transcriptome_file), reads_ch)
    +  QC_WF(reads_ch, QUANT_WF.out)
    +}
    +

    and the subworkflows.nf script should look similar to this:

    +
    include { INDEX } from './modules.nf'
    +include { QUANTIFICATION as QT } from './modules.nf'
    +include { FASTQC as FASTQC_one } from './modules.nf'
    +include { FASTQC as FASTQC_two } from './modules.nf'
    +include { MULTIQC } from './modules.nf'
    +include { TRIMGALORE } from './modules/trimgalore.nf'
    +
    +workflow QUANT_WF{
    +  take:
    +  transcriptome_file
    +  reads_ch
    +
    +  main:
    +  index_ch = INDEX(transcriptome_file)
    +  quant_ch = QT(index_ch, reads_ch)
    +
    +  emit:
    +  quant_ch
    +}
    +
    +workflow QC_WF{
    +  take:
    +  reads_ch
    +  quant_ch
    +
    +  main:
    +  fastqc_ch = FASTQC_one(reads_ch)
    +  trimgalore_out_ch = TRIMGALORE(reads_ch).reads
    +  fastqc_cleaned_ch = FASTQC_two(trimgalore_out_ch)
    +  multiqc_ch = MULTIQC(quant_ch, fastqc_ch)
    +
    +  emit:
    +  multiqc_ch
    +}
    +
    +
    +
    +

    Run the pipeline to check if the workflow import is successful

    +
    nextflow run rnaseq.nf --outdir "results" -resume
    +
    +
    +
    + +
    +
    +

    Challenge

    +

    Structure modules and subworkflows similar to the setup used by most nf-core pipelines (e.g. nf-core/rnaseq)

    +
    +
    +
    +
    +
    +

    5.3 Workflow Structure

    +

    There are three directories in a Nextflow workflow repository that have a special purpose:

    +
    +
    +

    5.3.1 ./bin

    +

    The bin directory (if it exists) is always added to the $PATH for all tasks. If the tasks are performed on a remote machine, the directory is copied across to the new machine before the task begins. This Nextflow feature is designed to make it easy to include accessory scripts directly in the workflow without having to commit those scripts into the container. This feature also ensures that the scripts used inside of the workflow move on the same revision schedule as the workflow itself.

    +

    It is important to know that Nextflow will take care of updating $PATH and ensuring the files are available wherever the task is running, but will not change the permissions of any files in that directory. If a file is called by a task as an executable, the workflow developer must ensure that the file has the correct permissions to be executed.

    +

For example, let's say we have a small R script that produces a tsv and a png:

    +
    
    +#!/usr/bin/env Rscript
    +library(tidyverse)
    +
    +plot <- ggplot(mpg, aes(displ, hwy, colour = class)) + geom_point()
    +mtcars |> write_tsv("cars.tsv")
    +ggsave("cars.png", plot = plot)
    +

    We’d like to use this script in a simple workflow car.nf:

    +
    process PlotCars {
    +    // container 'rocker/tidyverse:latest'
    +    container '/config/binaries/singularity/containers_devel/nextflow/r-dinoflow_0.1.1.sif'
    +
    +    output:
    +    path("*.png"), emit: "plot"
    +    path("*.tsv"), emit: "table"
    +
    +    script:
    +    """
    +    cars.R
    +    """
    +}
    +
    +workflow {
    +    PlotCars()
    +
    +    PlotCars.out.table | view { "Found a tsv: $it" }
    +    PlotCars.out.plot | view { "Found a png: $it" }
    +}
    +

To do this, we can create the bin directory and write our R script into it. Finally, and crucially, we make the script executable:

    +
    chmod +x bin/cars.R
    +
    +
    +
    + +
    +
    +Warning +
    +
    +
    +

    Always ensure that your scripts are executable. The scripts will not be available to your Nextflow processes without this step.

    +

    You will get the following error if permission is not set correctly.

    +
    ERROR ~ Error executing process > 'PlotCars'
    +
    +Caused by:
    +  Process `PlotCars` terminated with an error exit status (126)
    +
    +Command executed:
    +
    +  cars.R
    +
    +Command exit status:
    +  126
    +
    +Command output:
    +  (empty)
    +
    +Command error:
    +  .command.sh: line 2: /scratch/users/.../bin/cars.R: Permission denied
    +
    +Work dir:
    +  /scratch/users/.../work/6b/86d3d0060266b1ca515cc851d23890
    +
    +Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`
    +
    + -- Check '.nextflow.log' file for details
    +
    +
    +

    Let’s run the script and see what Nextflow is doing for us behind the scenes:

    +
    nextflow run car.nf
    +

and then inspect the .command.run file that Nextflow has generated.

    +

    You’ll notice a nxf_container_env bash function that appends our bin directory to $PATH:

    +
    nxf_container_env() {
    +cat << EOF
    +export PATH="\$PATH:/scratch/users/<your-user-name>/.../bin"
    +EOF
    +}
    +

    When working on the cloud, Nextflow will also ensure that the bin directory is copied onto the virtual machine running your task in addition to the modification of $PATH.

    +
    +
    +

    5.3.2 ./templates

    +

    If a process script block is becoming too long, it can be moved to a template file. The template file can then be imported into the process script block using the template method. This is useful for keeping the process block tidy and readable. Nextflow’s use of $ to indicate variables also allows for directly testing the template file by running it as a script.

    +

    For example:

    +
    # cat templates/my_script.sh
    +
    +#!/bin/bash
    +echo "process started at `date`"
    +echo $name
    +echo "process completed"
    +
    process SayHiTemplate {
    +    debug true
    +    input: 
    +      val(name)
    +
    +    script: 
    +      template 'my_script.sh'
    +}
    +
    +workflow {
    +    SayHiTemplate("Hello World")
    +}
    +
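Because the template is a plain bash script, you can also test it directly from the command line before wiring it into a process. A small sketch (the value given to name here is just an assumption for testing):
name="Hello World" bash templates/my_script.sh
The variable is supplied as an environment variable, which bash substitutes where the template uses $name.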

    By default, Nextflow looks for the my_script.sh template file in the templates directory located alongside the Nextflow script and/or the module script in which the process is defined. Any other location can be specified by using an absolute template path.

    +
    +
    +

    5.3.3 ./lib

    +

    In the next chapter, we will start looking into adding small helper Groovy functions to the main.nf file. It may at times be helpful to bundle functionality into a new Groovy class. Any classes defined in the lib directory are available for use in the workflow - both main.nf and any imported modules.

    +

Classes defined in the lib directory can be used for a variety of purposes. For example, the nf-core/rnaseq workflow uses five custom classes:

    +
      +
    • NfcoreSchema.groovy for parsing the schema.json file and validating the workflow parameters.
    • +
    • NfcoreTemplate.groovy for email templating and nf-core utility functions.
    • +
    • Utils.groovy for provision of a single checkCondaChannels method.
    • +
    • WorkflowMain.groovy for workflow setup and to call the NfcoreTemplate class.
    • +
    • WorkflowRnaseq.groovy for the workflow-specific functions.
    • +
    +

    The classes listed above all provide utility executed at the beginning of a workflow, and are generally used to “set up” the workflow. However, classes defined in lib can also be used to provide functionality to the workflow itself.
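As a minimal sketch of that second use case, a helper class saved as lib/Helpers.groovy (the class and method names are hypothetical, not part of nf-core) could be called directly from main.nf or any module:
// lib/Helpers.groovy -- hypothetical example
class Helpers {
    // Build a simple sample label from a metadata map
    static String sampleLabel(Map meta) {
        return "${meta.id}_${meta.type ?: 'unknown'}"
    }
}
Inside main.nf you could then call Helpers.sampleLabel([id: 'sampleA', type: 'tumour']) without any include statement, since classes in lib are added to the classpath automatically.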

    +
    +
    +

    6. Groovy Functions and Libraries

    +

    Nextflow is a domain specific language (DSL) implemented on top of the Groovy programming language, which in turn is a super-set of the Java programming language. This means that Nextflow can run any Groovy or Java code.

    +

    You have already been using some Groovy code in the previous sections, but now it’s time to learn more about it.

    +
    +
    +

6.1 Some useful Groovy basics

    +
    +
    +

    6.1.1 Variables

    +

    To define a variable, simply assign a value to it:

    +
    x = 1
    +println x
    +
    +x = new java.util.Date()
    +println x
    +
    +x = -3.1499392
    +println x
    +
    +x = false
    +println x
    +
    +x = "Hi"
    +println x
    +
    >> nextflow run variable.nf
    +
    +N E X T F L O W  ~  version 23.04.1
    +Launching `variable.nf` [trusting_moriondo] DSL2 - revision: ee74c86d04
    +1
    +Wed Jun 05 03:45:19 AEST 2024
    +-3.1499392
    +false
    +Hi
    +

    Local variables are defined using the def keyword:

    +
    def x = 'foo'
    +

The def keyword should always be used when defining variables local to a function or a closure.


    6.1.2 Maps


Maps are like lists, except that each value is associated with an arbitrary key instead of an integer index (i.e. they store key-value pairs).

map = [a: 0, b: 1, c: 2]

Maps can be accessed using conventional square-bracket syntax or as if the key were a property of the map.

map = [a: 0, b: 1, c: 2]

assert map['a'] == 0
assert map.b == 1
assert map.get('c') == 2

    To add data or to modify a map, the syntax is similar to adding values to a list:

map = [a: 0, b: 1, c: 2]

map['a'] = 'x'
map.b = 'y'
map.put('c', 'z')
assert map == [a: 'x', b: 'y', c: 'z']

    Map objects implement all methods provided by the java.util.Map interface, plus the extension methods provided by Groovy.
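For example, two commonly used Groovy extension methods, each and collect:

map = [a: 0, b: 1, c: 2]

// Iterate over every key-value pair
map.each { key, value -> println "${key} = ${value}" }

// Build a list from the entries
assert map.collect { key, value -> value + 1 } == [1, 2, 3]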


    6.1.3 If statement


    The if statement uses the same syntax common in other programming languages, such as Java, C, and JavaScript.

if (< boolean expression >) {
    // true branch
}
else {
    // false branch
}

    The else branch is optional. Also, the curly brackets are optional when the branch defines just a single statement.

x = 1
if (x > 10)
    println 'Hello'

    In some cases it can be useful to replace the if statement with a ternary expression (aka a conditional expression):

println list ? list : 'The list is empty'

    The previous statement can be further simplified using the Elvis operator:

println list ?: 'The list is empty'

    6.1.4 Functions


It is possible to define a custom function in a script:

def fib(int n) {
    return n < 2 ? 1 : fib(n - 1) + fib(n - 2)
}

assert fib(10) == 89

A function can take multiple arguments, separated by commas.
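For example, a small sketch of a function taking two arguments (the name pow is just illustrative):

def pow(int base, int exp) {
    def result = 1
    exp.times { result *= base }
    return result
}

assert pow(2, 10) == 1024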


The return keyword can be omitted; the function implicitly returns the value of the last evaluated expression. Explicit types can also be omitted, although this is not recommended:

def fact(n) {
    n > 1 ? n * fact(n - 1) : 1
}

assert fact(5) == 120

6.2 Groovy Library


    7. Testing


    7.1 Stub


    You can define a command stub, which replaces the actual process command when the -stub-run or -stub command-line option is enabled:

process INDEX {
  input:
    path transcriptome

  output:
    path 'index'

  script:
    """
    salmon index --threads $task.cpus -t $transcriptome -i index
    """

  stub:
    """
    mkdir index
    touch index/seq.bin
    touch index/info.json
    touch index/refseq.bin
    """
}

    The stub block can be defined before or after the script block. When the pipeline is executed with the -stub-run option and a process’s stub is not defined, the script block is executed.


    This feature makes it easier to quickly prototype the workflow logic without using the real commands. The developer can use it to provide a dummy script that mimics the execution of the real one in a quicker manner. In other words, it is a way to perform a dry-run.
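For example, assuming the INDEX process above is part of a workflow defined in main.nf (a hypothetical file name), the pipeline can be dry-run with:

>> nextflow run main.nf -stub-run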


    7.2 Test profile


7.3 nf-test


It is critical for reproducibility and long-term maintenance to have a way to systematically test that every part of your workflow is doing what it is supposed to do. To that end, people often focus on top-level tests, in which the workflow is run on some test data from start to finish. This is useful but unfortunately incomplete. You should also implement module-level tests (equivalent to what are called ‘unit tests’ in general software engineering) to verify the functionality of individual components of your workflow, ensuring that each module performs as expected under different conditions and inputs.


    The nf-test package provides a testing framework that integrates well with Nextflow and makes it straightforward to add both module-level and workflow-level tests to your pipeline. For more background information, read the blog post about nf-test on the nf-core blog.
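As a minimal sketch, a module-level test for the INDEX process above could look like the following (the file would typically be named main.nf.test; the test name and the data/transcriptome.fa path are hypothetical):

nextflow_process {

    name "Test INDEX"
    script "main.nf"
    process "INDEX"

    test("Should build a salmon index from a transcriptome") {

        when {
            process {
                """
                input[0] = file("data/transcriptome.fa")
                """
            }
        }

        then {
            // The task should complete successfully
            assert process.success
        }
    }
}

The test can then be executed with the nf-test test command.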


    See this tutorial for some examples.


This workshop is adapted from Fundamentals Training, Advanced Training, Developer Tutorials, and Nextflow Patterns materials from Nextflow and nf-core.
