Nextflow workflow to identify somatic variation.
This workflow calls variants from the alignment files of a paired tumor/normal sample.
This workflow can be used for the following:
- Alignment QC and statistics.
- Somatic short variant calling (SNV and Indels).
- Somatic structural variants calling (SV).
- Modified sites calling (mod).
Recommended requirements:
- CPUs = 64
- Memory = 256GB
Minimum requirements:
- CPUs = 16
- Memory = 48GB
Approximate run time: Variable depending on sequencing modality (targeted or whole genome sequencing), as well as coverage and the individual analyses requested. For instance, a complete analysis of a 60X/30X Tumor/Normal pair with default settings takes approximately 6h 30m using the recommended requirements.
ARM processor support: False
These are instructions to install and run the workflow on command line. You can also access the workflow via the EPI2ME Desktop application.
The workflow uses Nextflow to manage compute and software resources, therefore Nextflow will need to be installed before attempting to run the workflow.
The workflow can currently be run using either
Docker
or Singularity
to provide isolation of the required software.
Both methods are automated out-of-the-box provided
either Docker or Singularity is installed.
This is controlled by the
-profile
parameter as exemplified below.
It is not required to clone or download the git repository in order to run the workflow. More information on running EPI2ME workflows can be found on our website.
The following command can be used to obtain the workflow. This will pull the repository in to the assets folder of Nextflow and provide a list of all parameters available for the workflow as well as an example command:
nextflow run epi2me-labs/wf-somatic-variation --help
To update a workflow to the latest version on the command line use the following command:
nextflow pull epi2me-labs/wf-somatic-variation
A demo dataset is provided for testing of the workflow. It can be downloaded and unpacked using the following commands:
wget https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-somatic-variation/wf-somatic-variation-demo.tar.gz
tar -xzvf wf-somatic-variation-demo.tar.gz
The workflow can then be run with the downloaded demo data using:
nextflow run epi2me-labs/wf-somatic-variation \
--snv \
--sv \
--mod \
--sample_name 'MYSAMPLE' \
--ref 'wf-somatic-variation-demo/GCA_000001405.15_GRCh38_no_alt_analysis_set_chr20.fna' \
--bed 'wf-somatic-variation-demo/demo.bed' \
--bam_normal 'wf-somatic-variation-demo/demo_normal.bam' \
--bam_tumor 'wf-somatic-variation-demo/demo_tumor.bam' \
--override_basecaller_cfg '[email protected]' \
--normal_min_coverage 0 \
--tumor_min_coverage 0 \
-profile standard
For further information about running a workflow on the command line see https://labs.epi2me.io/wfquickstart/
This workflow is designed to take input sequences that have been produced from Oxford Nanopore Technologies devices.
Find related protocols in the Nanopore community.
This workflow accepts BAM files (aligned or unaligned) as input.
The --bam_tumor
and --bam_normal
input parameters for this workflow accept the path to a single BAM file or folder containing multiple BAM files for the tumor sample and the normal sample, respectively. The normal sample is optional for some components. A sample name can be supplied with --sample
.
(i) (ii)
input_reads.bam ─── input_directory
├── reads0.bam
└── reads1.bam
Nextflow parameter name | Type | Description | Help | Default |
---|---|---|---|---|
snv | boolean | Call for somatic small variants. | If this option is selected, small variant calling will be carried out using ClairS. | False |
sv | boolean | Call for somatic structural variants. | If this option is selected, the workflow will call somatic structural variants using severus. | False |
mod | boolean | Enable output of differentially modified sites and differentially modified regions [requires input BAMs with ML and MM tags]. | If this option is selected, modified bases will be aggregated with modkit and differential modifications will be computed with DSS. | False |
Nextflow parameter name | Type | Description | Help | Default |
---|---|---|---|---|
sample_name | string | Sample name to be displayed in workflow outputs. | The sample name will be used from the workflow to correctly name output files. | SAMPLE |
bam_normal | string | BAM or unaligned BAM (uBAM) files for the normal sample to use in the analysis. | This accepts one of two cases: (i) the path to a single BAM file; (ii) the path to a top-level directory containing BAM files. A sample name can be supplied with --sample . |
|
bam_tumor | string | BAM or unaligned BAM (uBAM) files for the tumor sample to use in the analysis. | This accepts one of two cases: (i) the path to a single BAM file; (ii) the path to a top-level directory containing BAM files. A sample name can be supplied with --sample . |
|
ref | string | Path to a reference FASTA file. | Reference against which to compare reads for variant calling. | |
bed | string | An optional BED file enumerating regions to process for variant calling. | ||
tr_bed | string | An optional BED file enumerating simple repeat regions. | This command provides a bed file specifying the location of the simple repetitive elements in the genome of choice. This file should be a standard bed file, as described in the UCSC specification. | |
out_dir | string | Directory for output of all workflow results. | output | |
annotation | boolean | Perform SnpEff annotation. | If this option is deselected, VCFs will not be annotated with SnpEff. | True |
include_all_ctgs | boolean | Call for variants on all sequences in the reference, otherwise small variants will only be called on chr{1..22,X,Y,MT}. | Enabling this option will call for variants on all contigs of the input reference sequence. Typically this option is not required as standard human reference sequences contain decoy and unplaced contigs that are usually omitted for the purpose of variant calling. This option might be useful for non-standard reference sequence databases. | False |
igv | boolean | Visualize outputs in the EPI2ME IGV visualizer. | Enabling this option will visualize the output VCF files in the EPI2ME desktop app IGV visualizer. | False |
Nextflow parameter name | Type | Description | Help | Default |
---|---|---|---|---|
depth_intervals | boolean | Output a bedGraph file with entries for each genomic interval featuring homogeneous depth. | The output bedGraph file will have an entry for each genomic interval in which all positions have the same alignment depth. By default this workflow outputs summary depth information from your aligned reads. Per-base depth outputs are slower to generate but may be required for some downstream applications. | False |
tumor_min_coverage | number | Minimum read coverage for the tumor sample required to run analysis. | A tumor BAM below this coverage value will fail the coverage threshold, and will not be processed in downstream analyses. | 20 |
normal_min_coverage | number | Minimum read coverage for the normal sample required to run analysis. | A normal BAM below this coverage value will fail the coverage threshold, and will not be processed in downstream analyses. | 20 |
Nextflow parameter name | Type | Description | Help | Default |
---|---|---|---|---|
hybrid_mode_vcf | string | Enable hybrid calling mode that combines the de novo calling results and genotyping results at the positions in the VCF file given. | ||
genotyping_mode_vcf | string | VCF file input containing candidate sites to be genotyped. Variants will only be called at the sites in the VCF file if provided. | ||
normal_vcf | string | VCF file input containing normal sites for the given sample. | Pointing to a pre-computed normal VCF file will prevent the workflow from calling the germline sites for the normal sample, reducing processing time. | |
skip_haplotype_filter | boolean | Skip haplotype filtering of variants. | Setting this will skip haplotype filtering of variants. | False |
fast_mode | boolean | Fast germline variants calling in Clair3 (does not emit germline calls). | Setting this will speed up the germline calling from Clair3 by relaxing the variant calling parameters; this matches ClairS default behaviour, and therefore will not emit germline VCFs. | False |
germline | boolean | The workflow will perform germline calling and tumor phasing as default; set to false to disable this (greatly speeds up execution). | True | |
phase_normal | boolean | Phase and tag the normal, in addition to the tumor dataset. | False | |
liquid_tumor | boolean | The sample is a liquid tumor | Setting this to true will have ClairS to use specific presets and model (where available) for liquid tumors, increasing accuracy. | False |
Nextflow parameter name | Type | Description | Help | Default |
---|---|---|---|---|
min_sv_length | integer | Minimum SV size to call. | Provide the minimum size of the structural variants to call with severus. | 50 |
Nextflow parameter name | Type | Description | Help | Default |
---|---|---|---|---|
force_strand | boolean | Require modkit to call strand-aware modifications. | False | |
diff_mod | boolean | Detect differentially modified loci and regions with DSS. | True |
Nextflow parameter name | Type | Description | Help | Default |
---|---|---|---|---|
ubam_map_threads | integer | Set max number of threads to use for aligning reads from uBAM (limited by config executor cpus). | 8 | |
ubam_sort_threads | integer | Set max number of threads to use for sorting and indexing aligned reads from uBAM (limited by config executor cpus). | 3 | |
ubam_bam2fq_threads | integer | Set max number of threads to use for uncompressing uBAM and generating FASTQ for alignment (limited by config executor cpus). | 1 | |
severus_threads | integer | Total number of threads to use for Severus (minimum of 4 and limited by config executor cpus). |
8 | |
dss_threads | integer | Total number of threads to use in the DSS differential modification analysis (limited by config executor cpus). | 1 | |
modkit_threads | integer | Total number of threads to use in modkit modified base calling (limited by config executor cpus). | 4 | |
haplotype_filter_threads | integer | Set max number of threads to use for the haplotype filtering stage in SNV workflow (limited by config executor cpus). | 4 |
Nextflow parameter name | Type | Description | Help | Default |
---|---|---|---|---|
override_basecaller_cfg | string | Name of the model to use for converting signal and selecting a small variant calling model. | The workflow will attempt to find the basecaller model from the headers of your input data, providing a value for this option will override the model found in the data. If the model cannot be found in the header, it must be provided with this option as the basecaller model is required for small variant calling. The basecaller model is used to automatically select the appropriate small variant calling model. The model list shows all models that are compatible for small variant calling with this workflow. You should select 'custom' to override the basecaller_cfg with clair3_model_path. |
Output files may be aggregated including information for all samples or provided per sample. Per-sample files will be prefixed with respective aliases and represented below as {{ alias }}.
Title | File path | Description | Per sample or aggregated |
---|---|---|---|
Workflow alignment statistics report | {{ alias }}.wf-somatic-variation-readQC-report.html | Report of the alignment statistics for each tumor/normal paired sample. | per-sample |
Workflow SNV report | {{ alias }}.wf-somatic-snv-report.html | Report of the SNV for each tumor/normal paired sample. | per-sample |
Workflow SV report | {{ alias }}.wf-somatic-sv-report.html | Report of the SV for each tumor/normal paired sample. | per-sample |
Workflow MOD report | {{ alias }}.wf-somatic-mod-report.html | Report of the modified bases for each tumor/normal paired sample. | per-sample |
Somatic short variant VCF | {{ alias }}.wf-somatic-snv.vcf.gz | VCF file with the somatic SNVs for the sample. | per-sample |
Somatic short variant VCF index | {{ alias }}.wf-somatic-snv.vcf.gz.tbi | The index of the resulting VCF file with the somatic SNVs for the sample. | per-sample |
Somatic structural variant VCF | {{ alias }}.wf-somatic-sv.vcf.gz | VCF file with the somatic SVs for the sample. | per-sample |
Somatic structural variant VCF index | {{ alias }}.wf-somatic-sv.vcf.gz.tbi | The index of the resulting VCF file with the somatic SVs for the sample. | per-sample |
Modified bases BEDMethyl (normal) | {{ alias }}.wf-somatic-mod.normal.bedmethyl.gz | BED file with the aggregated modification counts for the normal sample. | per-sample |
Modified bases summary (normal) | {{ alias }}.normal.mod_summary.tsv | Summary modification stats for the normal sample. | per-sample |
Single-change BEDMethyl (normal) | {{ alias }}/mod/{{ mod type }}/bedMethyl/{{ mod type }}.{{ alias }}.wf-somatic-mod.normal.bedmethyl.gz | BED file with the aggregated modification counts for a single modification type (e.g. 5mc) for the normal sample. | per-sample |
Single-change DSS input file (normal) | {{ alias }}/mod/{{ mod type }}/DSS/{{ mod type }}.{{ alias }}_normal.dss.tsv | Input text file for DSS for a single modification type (e.g. 5mc) for the normal sample. | per-sample |
Modified bases BEDMethyl (tumor) | {{ alias }}.wf-somatic-mod.tumor.bedmethyl.gz | BED file with the aggregated modification counts for the tumor sample. | per-sample |
Modified bases summary (tumor) | {{ alias }}.normal.mod_summary.tsv | Summary modification stats for the tumor sample. | per-sample |
Single-change BEDMethyl (tumor) | {{ alias }}/mod/{{ mod type }}/bedMethyl/{{ mod type }}.{{ alias }}.wf-somatic-mod.tumor.bedmethyl.gz | BED file with the aggregated modification counts for a single modification type (e.g. 5mc) for the tumor sample. | per-sample |
Single-change DSS input file (tumor) | {{ alias }}/mod/{{ mod type }}/DSS/{{ mod type }}.{{ alias }}_tumor.dss.tsv | Input text file for DSS for a single modification type (e.g. 5mc) for the tumor sample. | per-sample |
Differentially modified loci (DML) per change type | {{ alias }}/mod/{{ mod type }}/DML/{{ alias }}.{{ mod type }}.dml.tsv | Differentially modified loci from DSS for a single modification type (e.g. 5mc). | per-sample |
Differentially modified regions (DMR) per change type | {{ alias }}/mod/{{ mod type }}/DMR/{{ alias }}.{{ mod type }}.dmr.tsv | Differentially modified regions from DSS for a single modification type (e.g. 5mc). | per-sample |
Alignment file (normal) | {{ alias }}/bam/normal/reads.bam | BAM or CRAM file with the aligned reads for the normal sample. | per-sample |
Alignment file index (normal) | {{ alias }}/bam/normal/reads.bam.bai | The index of the resulting BAM or CRAM file with the aligned reads for the normal sample. | per-sample |
Alignment file (tumor) | {{ alias }}/bam/tumor/reads.bam | BAM or CRAM file with the aligned reads for the tumor sample. | per-sample |
Alignment file index (tumor) | {{ alias }}/bam/tumor/reads.bam.bai | The index of the resulting BAM or CRAM file with the aligned reads for the tumor sample. | per-sample |
Haplotagged alignment file (normal) | {{ alias }}_normal.ht.bam | BAM or CRAM file with the haplotagged reads for the normal sample. | per-sample |
Haplotagged alignment file index (normal) | {{ alias }}_normal.ht.bam.bai | The index of the resulting BAM or CRAM file with the haplotagged reads for the normal sample. | per-sample |
Haplotagged alignment file (tumor) | {{ alias }}_tumor.ht.bam | BAM or CRAM file with the haplotagged reads for the tumor sample. | per-sample |
Haplotagged alignment file index (tumor) | {{ alias }}_tumor.ht.bam.bai | The index of the resulting BAM or CRAM file with the haplotagged reads for the tumor sample. | per-sample |
This workflow is designed to perform variant calling of small variants, structural variants and modified bases aggregation from paired tumor/normal BAM files for a single sample.
Per-sample files will be prefixed with respective aliases and represented below as {{ alias }}.
The workflow relies on three primary input files:
- A reference genome in fasta format
- A single tumor sample in the format of one BAM file, or a folder of BAM files (either aligned or unaligned)
- An optional single normal sample in the format of one BAM file, or a folder of BAM files (either aligned or unaligned)
The BAM files can be generated from:
- POD5/FAST5 files using the wf-basecalling workflow, or
- fastq files using wf-alignment.
Both workflows will generate aligned BAM files that are ready to be used with wf-somatic-variation
.
It is possible to run the workflow without the BAM file of the "normal" sample. See tumor-only mode for more details.
The workflow starts by performing multiple checks of the input BAM files, as well as computing:
- The depth of sequencing of each BAM file with mosdepth.
- The read alignment statistics for each BAM file with fastcat.
After computing the coverage, the workflow will check that the input BAM files have a depth greater than
--tumor_min_coverage
and --normal_min_coverage
for the tumor and normal BAM files, respectively.
It is necessary that both BAM files have passed the respective thresholds. In cases where the user
sets the minimum coverage to 0
, the check will be skipped and the workflow will proceed directly to the
downstream analyses.
The workflow currently implements a deconstructed version of ClairS to identify somatic variants in a paired tumor/normal sample. This workflow takes advantage of the parallel nature of Nextflow, providing optimal efficiency in high-performance, distributed systems.
Currently, ClairS supports the following basecalling models:
Workflow basecalling model | ClairS model |
---|---|
[email protected] | ont_r10_dorado_5khz |
[email protected] | ont_r10_dorado_4khz |
[email protected] | ont_r10_guppy |
[email protected] | ont_r9_guppy |
[email protected] | ont_r9_guppy |
dna_r9.4.1_450bps_hac_prom | ont_r9_guppy |
dna_r9.4.1_450bps_hac | ont_r9_guppy |
Any other model provided will prevent the workflow to start.
Currently, indel calling is supported only for dna_r10
basecalling models.
When the user specifies an r9 model the workflow will automatically skip
the indel processes and perform only the SNV calling.
The workflow uses Clair3
to call germline sites on both the normal and tumor
sample, which are then used internally to refine the somatic variant calling.
This mode is computationally demanding, and it's behaviour can be changed with
a few options:
- Reduce the accuracy of the variant calling with
--fast_mode
. - Provide a pre-computed VCF file reporting the germline calls for the normal sample with
--normal_vcf
. - Disable the germline calling altogether with
--germline false
.
SNVs called using the GRCh37 or GRCh38 genomes can be annotated using SnpEff
by setting --annotation true
. Furthermore, the workflow will add annotations from
the ClinVar database.
The workflow can run the snv
component in tumor-only mode.
See tumor-only mode for more details.
The workflow allows for the calling of somatic SVs using long-read sequencing data. Starting from the paired cancer/control samples, the workflow will call the somatic SVs using Severus.
The behaviour of Severus can be tweaked with the options:
--min_sv_length
: minimum size of SVs to call--min_support
: minimum number of reads to support an SV call--vaf_threshold
: sites with variant allele frequency (VAF) below this value will be filtered out--severus_args
: additional arguments for Severus
SVs called using the GRCh37 or GRCh38 genomes can be annotated using SnpEff
by setting --annotation true
.
Modified base calling can be performed by specifying --mod
. The workflow
will aggregate the modified bases using modkit.
The default behaviour of the workflow is to run modkit with the
--cpg --combine-strands
options set.
It is possible to report strand-aware modifications by providing --force_strand
.
Users can further change the behaviour of modkit
by passing options directly
to modkit via the --modkit_args
option. This will override any preset,
and allow full control over the run of modkit. For more details on the usage
of modkit pileup
, checkout the software documentation.
The workflow will perform differential modification analyses using DSS
when the user provides both tumor and normal samples.
The outputs for the differentially modified loci (DML) for each modification are saved in subfolders with the structure {{ alias }}/mod/{{ mod type }}/DML/{{ alias }}.{{ mod type }}.dml.tsv
, where alias
is the sample name provided with --sample_name
, and mod type
is the modification type analysed (e.g. 5mC). The differentially modified regions (DMR) can be found in a similar path: {{ alias }}/mod/{{ mod type }}/DMR/{{ alias }}.{{ mod type }}.dmr.tsv
DSS is very resource intensive, and might easily run out of memory. Therefore, it is possible to skip this step by setting --diff_mod false
, saving compute time and allowing the workflow to run to completion.
It is possible to run a reduced version of the workflow using only the tumor BAM files. Currently, only the following components can run in tumor-only mode:
- base workflow: BAM coverage and QC statistics
--mod
: the workflow will run modkit on the tumor BAM file, but will skip the differentially modified region and loci detection--snv
: the workflow will use ClairS-TO, instead of ClairS, to call SNVs.
When sequencing specific regions or genes, the runtime can vary substantially.
The table below provides a guideline on the average compute time for a given number of genes or regions.
All analyses are run using up to 128GB of RAM and 16 cores, computing --sv
and --snv
on a 75X/55X tumor/normal coverage in the regions of interest.
Number of genes | Region size | CPU/h | Runtime |
---|---|---|---|
1-10 | 200Kb-1Mb | ~2.0 | 8m-15m |
11-100 | 1Mb-6Mb | ~7.0 | 25m-45m |
100-1000 | 20Mb-60Mb | ~14.0 | 45m-1h:30m |
- If the workflow fails please run it with the demo data set to ensure the workflow itself is working. This will help us determine if the issue is related to the environment, input parameters or a bug.
- See how to interpret some common nextflow exit codes here.
- To run the workflow with a non-human organism, proceed as follows:
- EPI2ME Desktop Application: disable the
Annotation
option. - Command line: set
--annotation false
.
- EPI2ME Desktop Application: disable the
- Short somatic Indel calling is supported only for
dna_r10
basecalling models. - Renaming, moving or deleting the input BAM, reference genome or the output directory from the location provided at runtime will cause IGV not to load.
- The workflow expects either an uncompressed or bgzip-compressed reference. If the user provides a reference compressed not with bgzip, the workflow will run to completion, but won't be able to generate the necessary indexes to visualize the outputs in IGV.
-
Does the workflow calls 5hmC, on top of 5mC? - Yes, the workflow does call 5hmC, but only if you performed the basecalling with the appropriate module; for more details, check out the dorado github page.
-
Can I run the workflow in tumor-only mode? - Yes, but currently this mode is only available for
--mod
and--snv
.
See the EPI2ME website for lots of other resources and blog posts.