Adding funcotator stand-alone WDL to supported area (#5999)
* Moving stand-alone funcotator.wdl to supported area of scripts.
* Updated funcotator.wdl to be more correct / full-featured.
* Added Readme.md for Funcotator WDL folder.
jonn-smith authored Jun 13, 2019
1 parent 4581998 commit 730164d
Showing 2 changed files with 149 additions and 52 deletions.
91 changes: 91 additions & 0 deletions scripts/funcotator_wdl/README.md
@@ -0,0 +1,91 @@
# Running the Funcotator WDL

## Background Information
Funcotator (**FUNC**tional ann**OTATOR**) is a functional annotation tool in the core GATK toolset and was designed to handle both somatic and germline use cases. It analyzes given variants for their function (as retrieved from a set of data sources) and produces the analysis in a specified output file. Funcotator reads in a VCF file, labels each variant with one of twenty-three distinct variant classifications, produces gene information (e.g. affected gene, predicted variant amino acid sequence, etc.), and associations to information in datasources. Default supported datasources include GENCODE (gene information and protein change prediction), dbSNP, gnomAD, and COSMIC (among others). The corpus of datasources is extensible and user-configurable and includes cloud-based datasources supported with Google Cloud Storage. Funcotator produces either a Variant Call Format (VCF) file (with annotations in the INFO field) or a Mutation Annotation Format (MAF) file.

Funcotator allows the user to add their own annotations to variants based on a set of data sources. Each data source can be customized to annotate a variant based on several matching criteria. This allows a user to create their own custom annotations easily, without modifying any Java code.

## Setup

To run the Funcotator WDL you must have access to a Cromwell server that can run your job.

Once your Cromwell instance is active, you will need to generate input arguments to pass to funcotator.wdl. These arguments are passed in as a JSON file (see below for a non-working example).

Once a JSON file has been created you can submit your job to a Cromwell server directly (e.g. using a tool such as Cromshell) or through Terra/FireCloud.
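As a rough illustration of the submission step, the sketch below shows two common ways to launch the workflow: `cromshell submit` against a running server, and Cromwell's `run` mode with a local jar. The file names, the `CROMWELL_JAR` variable, and the `DRY_RUN` switch are placeholders introduced for this example. By default the commands are only printed, not executed.

```shell
#!/usr/bin/env bash
# Sketch: two ways to launch the workflow. File names are placeholders.

WDL="funcotator.wdl"
INPUTS="funcotator.inputs.json"

# Print the command when DRY_RUN=1 (the default here); execute it otherwise.
run() {
    if [[ "${DRY_RUN:-1}" == "1" ]]; then
        echo "$*"
    else
        "$@"
    fi
}

# Option 1: submit to a running Cromwell server through Cromshell.
submit_with_cromshell() {
    run cromshell submit "$WDL" "$INPUTS"
}

# Option 2: run with a local Cromwell jar (the jar path is a placeholder).
run_with_cromwell_jar() {
    run java -jar "${CROMWELL_JAR:-cromwell.jar}" run "$WDL" --inputs "$INPUTS"
}

submit_with_cromshell
run_with_cromwell_jar
```

Set `DRY_RUN=0` to actually run the chosen command once the paths point at real files.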

## WDL Input Parameters

The input parameters to the Funcotator WDL are as follows:

### Required Inputs:
String gatk_docker - GATK Docker image in which to run

File ref_fasta - Reference FASTA file.

File ref_fasta_index - Reference FASTA file index.

File ref_dict - Reference FASTA file sequence dictionary.

File variant_vcf_to_funcotate - Variant Call Format (VCF) file containing the variants to annotate.

File variant_vcf_to_funcotate_index - Index file corresponding to the input VCF file.

String reference_version - Version of the reference being used. Either `hg19` or `hg38`.

String output_file_base_name - Base name of the desired output file. The task appends a file extension based on `output_format` and `compress`.

String output_format - Output file format (either VCF or MAF).

Boolean compress - Whether to compress the resulting output file.

Boolean use_gnomad - If true, will enable the gnomAD data sources in the data source tar.gz, if they exist.
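The way `output_format` and `compress` combine into the final file names can be sketched in shell. The VCF branches mirror the expressions in the `Funcotate` task (`.vcf.gz` with a `.tbi` index when compressed, `.vcf` with an `.idx` index otherwise). The `.maf` extension is an assumption, since the line defining the MAF name is not shown in this diff. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of how the Funcotate task derives its output file names.
# The VCF rules mirror the WDL expressions; the ".maf" extension is assumed.

# Usage: funcotator_output_names <base_name> <VCF|MAF> <true|false>
# Prints the output file name and its index name, one per line.
funcotator_output_names() {
    local base_name="$1" output_format="$2" compress="$3"
    local out idx

    if [[ "$output_format" == "MAF" ]]; then
        out="${base_name}.maf"      # assumed extension
        idx="${out}.idx"            # placeholder index, as in the WDL
    elif [[ "$compress" == "true" ]]; then
        out="${base_name}.vcf.gz"
        idx="${out}.tbi"
    else
        out="${base_name}.vcf"
        idx="${out}.idx"
    fi

    printf '%s\n%s\n' "$out" "$idx"
}

funcotator_output_names "variants.funcotated" "VCF" "true"
```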


### Optional Inputs:
File? interval_list - Intervals to be used for traversal. If specified will only traverse the given intervals.

File? data_sources_tar_gz - Path to tar.gz containing the data sources for Funcotator to create annotations.

String? transcript_selection_mode - Method of detailed transcript selection. This will select the transcript for detailed annotation (either `CANONICAL` or `BEST_EFFECT`).

Array[String]? transcript_selection_list - Set of transcript IDs to use for annotation to override selected transcript.

Array[String]? annotation_defaults - Annotations to include in all annotated variants if the annotation is not specified in the data sources (in the format `<ANNOTATION>:<VALUE>`). This will add the specified annotation to every annotated variant if it is not already present.

Array[String]? annotation_overrides - Override values for annotations (in the format `<ANNOTATION>:<VALUE>`). Replaces existing annotations of the given name with given values.

File? gatk4_jar_override - Override Jar file containing GATK 4.0. Use this when overriding the docker JAR or when using a backend without docker.

String? funcotator_extra_args - Extra command-line arguments to pass through to Funcotator. (e.g. " --exclude-field foo_field --exclude-field bar_field ")
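The array-valued inputs end up as repeated command-line flags. The task defines flag prefixes such as `--transcript-list` and `--annotation-default` for this purpose. The sketch below shows that expansion in plain shell. The helper name and the sample annotation values are hypothetical, and values containing spaces would need extra quoting that this sketch does not handle.

```shell
#!/usr/bin/env bash
# Sketch: expand a list of values into repeated command-line flags.

# Usage: repeat_flag <flag> [value ...]
# Prints "<flag> v1 <flag> v2 ...", or an empty line if no values are given.
repeat_flag() {
    local flag="$1"
    shift
    local args=()
    local value
    for value in "$@"; do
        args+=("$flag" "$value")
    done
    echo "${args[*]}"
}

# Hypothetical <ANNOTATION>:<VALUE> pairs.
annotation_defaults=("Center:example.org" "Source:demo")

repeat_flag --annotation-default "${annotation_defaults[@]}"
```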

## Example JSON File (Non-Working)

The following is an example of a JSON input file. It will not work as-is but is provided as a starting point for you to create your own input file:

```
{
"Funcotator.gatk_docker": "broadinstitute/gatk:latest",
"Funcotator.ref_fasta": "gs://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta",
"Funcotator.ref_fasta_index": "gs://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.fai",
"Funcotator.ref_dict": "gs://broad-references/hg38/v0/Homo_sapiens_assembly38.dict",
"Funcotator.reference_version": "hg38",
"Funcotator.output_format": "VCF",
"Funcotator.compress": false,
"Funcotator.use_gnomad": false,
"Funcotator.data_sources_tar_gz": "gs://broad-public-datasets/funcotator/funcotator_dataSources.v1.6.20190124s.tar.gz",
"Funcotator.variant_vcf_to_funcotate": "variants.vcf",
"Funcotator.variant_vcf_to_funcotate_index": "variants.vcf.idx",
"Funcotator.output_file_base_name": "variants.funcotated"
}
```
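Before submitting, it can help to confirm that an inputs file names every expected workflow input. The sketch below is a plain-text check with `grep`, not a JSON parser. The key list is an assumption taken from the example JSON above (`ref_dict`, `output_file_base_name`, and so on), and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: report which expected workflow inputs are absent from an inputs JSON.
# This is a text search with grep; it does not validate JSON syntax.

# Key names assumed from the example JSON in this README.
REQUIRED_KEYS=(
    gatk_docker ref_fasta ref_fasta_index ref_dict
    variant_vcf_to_funcotate variant_vcf_to_funcotate_index
    reference_version output_file_base_name output_format
    compress use_gnomad
)

# Usage: missing_inputs <inputs.json>
# Prints each missing key on its own line; returns non-zero if any is missing.
missing_inputs() {
    local json_file="$1" key status=0
    for key in "${REQUIRED_KEYS[@]}"; do
        # The closing quote keeps "ref_fasta" from matching "ref_fasta_index".
        if ! grep -q "\"Funcotator\.${key}\"" "$json_file"; then
            echo "$key"
            status=1
        fi
    done
    return "$status"
}

# Example: missing_inputs funcotator.inputs.json || echo "inputs are incomplete"
```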

## Further Information
- https://software.broadinstitute.org/gatk/documentation/article?id=11193
- https://software.broadinstitute.org/gatk/documentation/tooldocs/4.1.2.0/org_broadinstitute_hellbender_tools_funcotator_Funcotator.php


@@ -1,36 +1,34 @@
# Run Funcotator on a set of called variants from Mutect 2.
# Run Funcotator on a set of called variants.
#
# Description of inputs:
#
# Required:
# String gatk_docker - GATK Docker image in which to run
# File ref_fasta - Reference FASTA file.
# File ref_fasta_index - Reference FASTA file index.
# File ref_fasta_dict - Reference FASTA file sequence dictionary.
# File variant_vcf_to_funcotate - Variant Context File (VCF) containing the variants to annotate.
# File variant_vcf_to_funcotate_index - Index file corresponding to the input Variant Context File (VCF) containing the variants to annotate.
# String reference_version - Version of the reference being used. Either `hg19` or `hg38`.
# String output_file_name - Path to desired output file.
# String output_format - Output file format (either VCF or MAF).
# Boolean compress - Whether to compress the resulting output file.
# Boolean use_gnomad - If true, will enable the gnomAD data sources in the data source tar.gz, if they exist.
# String gatk_docker - GATK Docker image in which to run
# File ref_fasta - Reference FASTA file.
# File ref_fasta_index - Reference FASTA file index.
# File ref_fasta_dict - Reference FASTA file sequence dictionary.
# File variant_vcf_to_funcotate - Variant Context File (VCF) containing the variants to annotate.
# File variant_vcf_to_funcotate_index - Index file corresponding to the input Variant Context File (VCF) containing the variants to annotate.
# String reference_version - Version of the reference being used. Either `hg19` or `hg38`.
# String output_file_name - Path to desired output file.
# String output_format - Output file format (either VCF or MAF).
# Boolean compress - Whether to compress the resulting output file.
# Boolean use_gnomad - If true, will enable the gnomAD data sources in the data source tar.gz, if they exist.
#
# Optional:
# interval_list - Intervals to be used for traversal. If specified will only traverse the given intervals.
# data_sources_tar_gz - Path to tar.gz containing the data sources for Funcotator to create annotations.
# transcript_selection_mode - Method of detailed transcript selection. This will select the transcript for detailed annotation (either `CANONICAL` or `BEST_EFFECT`).
# transcript_selection_list - Set of transcript IDs to use for annotation to override selected transcript.
# annotation_defaults - Annotations to include in all annotated variants if the annotation is not specified in the data sources (in the format <ANNOTATION>:<VALUE>). This will add the specified annotation to every annotated variant if it is not already present.
# annotation_overrides - Override values for annotations (in the format <ANNOTATION>:<VALUE>). Replaces existing annotations of the given name with given values.
# gatk4_jar_override - Override Jar file containing GATK 4.0. Use this when overriding the docker JAR or when using a backend without docker.
# funcotator_extra_args - Extra command-line arguments to pass through to Funcotator. (e.g. " --exclude-field foo_field --exclude-field bar_field ")
# File? interval_list - Intervals to be used for traversal. If specified will only traverse the given intervals.
# File? data_sources_tar_gz - Path to tar.gz containing the data sources for Funcotator to create annotations.
# String? transcript_selection_mode - Method of detailed transcript selection. This will select the transcript for detailed annotation (either `CANONICAL` or `BEST_EFFECT`).
# Array[String]? transcript_selection_list - Set of transcript IDs to use for annotation to override selected transcript.
# Array[String]? annotation_defaults - Annotations to include in all annotated variants if the annotation is not specified in the data sources (in the format <ANNOTATION>:<VALUE>). This will add the specified annotation to every annotated variant if it is not already present.
# Array[String]? annotation_overrides - Override values for annotations (in the format <ANNOTATION>:<VALUE>). Replaces existing annotations of the given name with given values.
# File? gatk4_jar_override - Override Jar file containing GATK 4.0. Use this when overriding the docker JAR or when using a backend without docker.
# String? funcotator_extra_args - Extra command-line arguments to pass through to Funcotator. (e.g. " --exclude-field foo_field --exclude-field bar_field ")
#
# This WDL needs to decide whether to use the ``gatk_jar`` or ``gatk_jar_override`` for the jar location. As of cromwell-0.24,
# this logic *must* go into each task. Therefore, there is a lot of duplicated code. This allows users to specify a jar file
# independent of what is in the docker file. See the README.md for more info.
#
# NOTE: This only does VCF output right now!
#
workflow Funcotator {
String gatk_docker
File ref_fasta
@@ -94,15 +92,21 @@ task Funcotate {
File ref_fasta
File ref_fasta_index
File ref_dict

File input_vcf
File input_vcf_idx

String reference_version

String output_file_base_name
String output_format

Boolean compress
Boolean use_gnomad

File? data_sources_tar_gz
# This should be updated when a new version of the data sources is released
# TODO: Make this dynamically chosen in the command.
File? data_sources_tar_gz = "gs://broad-public-datasets/funcotator/funcotator_dataSources.v1.6.20190124s.tar.gz"

String? control_id
String? case_id
@@ -125,10 +129,10 @@ task Funcotate {
String output_maf_index = output_maf + ".idx"

String output_vcf = output_file_base_name + if compress then ".vcf.gz" else ".vcf"
String output_vcf_index = output_vcf + if compress then ".tbi" else ".idx"
String output_vcf_idx = output_vcf + if compress then ".tbi" else ".idx"

String output_file = if output_format == "MAF" then output_maf else output_vcf
String output_file_index = if output_format == "MAF" then output_maf_index else output_vcf_index
String output_file_index = if output_format == "MAF" then output_maf_index else output_vcf_idx

String transcript_selection_arg = if defined(transcript_selection_list) then " --transcript-list " else ""
String annotation_def_arg = if defined(annotation_defaults) then " --annotation-default " else ""
@@ -153,43 +157,43 @@ task Funcotate {

Boolean use_ssd = false

# This should be updated when a new version of the data sources is released
# TODO: Make this dynamically chosen in the command.
# TODO: Make this pull from google cloud, rather than from the FTP:
String default_datasources_version = "funcotator_dataSources.v1.6.20190124s"
# Mem is in units of GB but our command and memory runtime values are in MB
Int default_ram_mb = 1024 * 3
Int machine_mem = if defined(mem) then mem *1024 else default_ram_mb
Int command_mem = machine_mem - 1024

# You may have to change the following two parameter values depending on the task requirements
Int default_ram_mb = 3000
# WARNING: In the workflow, you should calculate the disk space as an input to this task (disk_space_gb). Please see [TODO: Link from Jose] for examples.
Int default_disk_space_gb = 100
# Calculate disk size:
Float ref_size_gb = size(ref_fasta, "GiB") + size(ref_fasta_index, "GiB") + size(ref_dict, "GiB")
Float vcf_size_gb = size(input_vcf, "GiB") + size(input_vcf_idx, "GiB")
Float ds_size_gb = size(data_sources_tar_gz, "GiB")

# Mem is in units of GB but our command and memory runtime values are in MB
Int machine_mem = if defined(mem) then mem *1000 else default_ram_mb
Int command_mem = machine_mem - 1000
Int default_disk_space_gb = ceil( ref_size_gb + (ds_size_gb * 2) + (vcf_size_gb * 10) ) + 20

# Silly hack to allow us to use the dollar sign in the command section:
String dollar = "$"

command <<<
set -e
export GATK_LOCAL_JAR=${default="/root/gatk.jar" gatk_override}

# Handle our data sources:
DATA_SOURCES_TAR_GZ=${data_sources_tar_gz}
if [[ ! -e $DATA_SOURCES_TAR_GZ ]] ; then
# We have to download the data sources:
echo "Data sources gzip does not exist: $DATA_SOURCES_TAR_GZ"
echo "Downloading default data sources..."
wget ftp://[email protected]/bundle/funcotator/${default_datasources_version}.tar.gz
tar -zxf ${default_datasources_version}.tar.gz
DATA_SOURCES_FOLDER=${default_datasources_version}
else
# Extract the tar.gz:
echo "Extracting data sources zip file..."
mkdir datasources_dir
tar zxvf ${data_sources_tar_gz} -C datasources_dir --strip-components 1
DATA_SOURCES_FOLDER="$PWD/datasources_dir"
# =======================================
# Hack to validate our WDL inputs:
#
# NOTE: This happens here so that we don't waste time copying down the data sources if there's an error.

if [[ "${output_format}" != "MAF" ]] && [[ "${output_format}" != "VCF" ]] ; then
echo "ERROR: Output format must be MAF or VCF."
exit 1
fi

# =======================================
# Handle our data sources:

# Extract the tar.gz:
echo "Extracting data sources tar/gzip file..."
mkdir datasources_dir
tar zxvf ${data_sources_tar_gz} -C datasources_dir --strip-components 1
DATA_SOURCES_FOLDER="$PWD/datasources_dir"

# Handle gnomAD:
if ${use_gnomad} ; then
echo "Enabling gnomAD..."
@@ -205,6 +209,7 @@ task Funcotate {
done
fi

# =======================================
# Run Funcotator:
gatk --java-options "-Xmx${command_mem}m" Funcotator \
--data-sources-path $DATA_SOURCES_FOLDER \
@@ -226,6 +231,7 @@ task Funcotate {
${filter_funcotations_args} \
${extra_args_arg}

# =======================================
# Make sure we have a placeholder index for MAF files so this workflow doesn't fail:
if [[ "${output_format}" == "MAF" ]] ; then
touch ${output_maf_index}
@@ -238,7 +244,7 @@ task Funcotate {
memory: machine_mem + " MB"
disks: "local-disk " + select_first([disk_space_gb, default_disk_space_gb]) + if use_ssd then " SSD" else " HDD"
preemptible: select_first([preemptible_attempts, 3])
maxRetries: select_first([max_retries, 3])
maxRetries: select_first([max_retries, 0])
cpu: select_first([cpu, 1])
}
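
The default resource arithmetic in this task can be checked by hand. The sketch below reproduces the 1024-based memory lines and the size-based disk formula from the diff: disk is `ceil(ref + 2 * data_sources + 10 * vcf) + 20` GB, and the Java heap is the machine memory minus 1024 MB. The function names and the sample sizes are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of the Funcotate task's default resource arithmetic.

# Usage: default_disk_gb <ref_gib> <data_sources_gib> <vcf_gib>
# Computes ceil(ref + 2 * data_sources + 10 * vcf) + 20.
default_disk_gb() {
    awk -v ref="$1" -v ds="$2" -v vcf="$3" 'BEGIN {
        total = ref + (ds * 2) + (vcf * 10)
        rounded = int(total)
        if (rounded < total) rounded++    # ceil for non-negative sizes
        print rounded + 20
    }'
}

# Usage: memory_mb [mem_gb]
# Prints the machine memory and the Java heap size (-Xmx), both in MB.
memory_mb() {
    local mem_gb="${1:-}"
    local machine_mem=$(( 1024 * 3 ))     # default when mem is not set
    if [[ -n "$mem_gb" ]]; then
        machine_mem=$(( mem_gb * 1024 ))
    fi
    echo "$machine_mem $(( machine_mem - 1024 ))"
}

default_disk_gb 3.2 14.5 0.4    # hypothetical sizes in GiB
memory_mb                       # default memory
memory_mb 8                     # with mem = 8 GB
```

With the sample sizes, the sum is 3.2 + 29 + 4 = 36.2, which rounds up to 37, giving 57 GB of disk. The default memory is 3072 MB with a 2048 MB heap.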
