Adding funcotator stand-alone WDL to supported area (#5999)

* Moving stand-alone funcotator.wdl to supported area of scripts.
* Updated funcotator.wdl to be more correct / full-featured.
* Added Readme.md for Funcotator WDL folder.
broadinstitute · Jun 13, 2019 · 730164d
1 parent 4581998
commit 730164d

Showing 2 changed files with 149 additions and 52 deletions.
diff --git a/scripts/funcotator_wdl/README.md b/scripts/funcotator_wdl/README.md
@@ -0,0 +1,91 @@
+# Running the Funcotator WDL
+
+## Background Information
+Funcotator (**FUNC**tional ann**OTATOR**) is a functional annotation tool in the core GATK toolset and was designed to handle both somatic and germline use cases. It analyzes given variants for their function (as retrieved from a set of data sources) and produces the analysis in a specified output file.  Funcotator reads in a VCF file, labels each variant with one of twenty-three distinct variant classifications, produces gene information (e.g. affected gene, predicted variant amino acid sequence, etc.), and associations to information in datasources. Default supported datasources include GENCODE (gene information and protein change prediction), dbSNP, gnomAD, and COSMIC (among others). The corpus of datasources is extensible and user-configurable and includes cloud-based datasources supported with Google Cloud Storage. Funcotator produces either a Variant Call Format (VCF) file (with annotations in the INFO field) or a Mutation Annotation Format (MAF) file.
+
+Funcotator allows the user to add their own annotations to variants based on a set of data sources.  Each data source can be customized to annotate a variant based on several matching criteria.  This allows a user to create their own custom annotations easily, without modifying any Java code.
+
+## Setup 
+
+To run the Funcotator WDL you must have access to a Cromwell server that can run your job.
+
+Once your Cromwell instance is active, you will need to generate input arguments to pass to funcotator.wdl.  These arguments are passed in as a JSON file (see below for a non-working example).
+
+Once a JSON file has been created you can submit your job to a Cromwell server directly (e.g. using a tool such as Cromshell) or through Terra/FireCloud.
+
+## WDL Input Parameters
+
+The input parameters to the Funcotator WDL are as follows:
+
+### Required Inputs:
+String gatk_docker                  - GATK Docker image in which to run
+
+File ref_fasta                      - Reference FASTA file.
+
+File ref_fasta_index                - Reference FASTA file index.
+
+File ref_fasta_dict                 - Reference FASTA file sequence dictionary.
+
+File variant_vcf_to_funcotate       - Variant Call Format (VCF) file containing the variants to annotate.
+
+File variant_vcf_to_funcotate_index - Index file corresponding to the input Variant Call Format (VCF) file containing the variants to annotate.
+
+String reference_version            - Version of the reference being used.  Either `hg19` or `hg38`.
+
+String output_file_name             - Path to desired output file.
+
+String output_format                - Output file format (either VCF or MAF).
+
+Boolean compress                    - Whether to compress the resulting output file.
+
+Boolean use_gnomad                  - If true, will enable the gnomAD data sources in the data source tar.gz, if they exist.
+
+
+### Optional Inputs:
+File? interval_list                      - Intervals to be used for traversal.  If specified will only traverse the given intervals.
+
+File? data_sources_tar_gz                - Path to tar.gz containing the data sources for Funcotator to create annotations.
+
+String? transcript_selection_mode        - Method of detailed transcript selection.  This will select the transcript for detailed annotation (either `CANONICAL` or `BEST_EFFECT`).
+
+Array[String]? transcript_selection_list - Set of transcript IDs to use for annotation to override selected transcript.
+
+Array[String]? annotation_defaults       - Annotations to include in all annotated variants if the annotation is not specified in the data sources (in the format `<ANNOTATION>:<VALUE>`).  This will add the specified annotation to every annotated variant if it is not already present.
+
+Array[String]? annotation_overrides      - Override values for annotations (in the format `<ANNOTATION>:<VALUE>`).  Replaces existing annotations of the given name with given values.
+
+File? gatk4_jar_override                 - Override Jar file containing GATK 4.0.  Use this when overriding the docker JAR or when using a backend without docker.
+
+String? funcotator_extra_args            - Extra command-line arguments to pass through to Funcotator.  (e.g. " --exclude-field foo_field --exclude-field bar_field ")
+
+## Example JSON File (Non-Working)
+
+The following is an example of a JSON input file.  It will not work as-is but is provided as a starting point for you to create your own input file:
+
+```
+{
+  "Funcotator.gatk_docker": "broadinstitute/gatk:latest",
+  
+  "Funcotator.ref_fasta": "gs://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta",
+  "Funcotator.ref_fasta_index": "gs://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.fai",
+  "Funcotator.ref_dict": "gs://broad-references/hg38/v0/Homo_sapiens_assembly38.dict",
+  
+  "Funcotator.reference_version": "hg38",
+  "Funcotator.output_format": "VCF",
+
+  "Funcotator.compress": false,
+  "Funcotator.use_gnomad": false,
+  "Funcotator.data_sources_tar_gz": "gs://broad-public-datasets/funcotator/funcotator_dataSources.v1.6.20190124s.tar.gz",
+
+  "Funcotator.variant_vcf_to_funcotate": "variants.vcf",
+  "Funcotator.variant_vcf_to_funcotate_index": "variants.vcf.idx",
+  
+  "Funcotator.output_file_base_name": "variants.funcotated"
+}
+```
+
+## Further Information
+ - https://software.broadinstitute.org/gatk/documentation/article?id=11193
+ - https://software.broadinstitute.org/gatk/documentation/tooldocs/4.1.2.0/org_broadinstitute_hellbender_tools_funcotator_Funcotator.php
+
+
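The inputs file described in the README can also be generated rather than written by hand. The following is a minimal sketch, not part of the commit: `build_inputs` is a hypothetical helper, and the key names are copied from the README's example JSON without being checked against the workflow's actual input declarations. It namespaces each key with the workflow name and emits real JSON booleans for the `Boolean` inputs.

```python
import json


def build_inputs(workflow: str, values: dict) -> str:
    """Prefix each key with the workflow name and serialize to JSON.

    Hypothetical helper for illustration; it does not validate the keys
    against the WDL.
    """
    namespaced = {f"{workflow}.{key}": value for key, value in values.items()}
    return json.dumps(namespaced, indent=2)


# Key names taken from the README example (unverified against the WDL).
example = build_inputs(
    "Funcotator",
    {
        "gatk_docker": "broadinstitute/gatk:latest",
        "reference_version": "hg38",
        "output_format": "VCF",
        "compress": False,    # Python bool is written as JSON false
        "use_gnomad": False,
        "variant_vcf_to_funcotate": "variants.vcf",
        "variant_vcf_to_funcotate_index": "variants.vcf.idx",
    },
)
print(example)
```

Writing the returned string to a file gives a starting point that still needs the reference and output entries filled in for a real run.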
diff --git a/...ts/mutect2_wdl/unsupported/funcotator.wdl → scripts/funcotator_wdl/funcotator.wdl b/...ts/mutect2_wdl/unsupported/funcotator.wdl → scripts/funcotator_wdl/funcotator.wdl
@@ -1,36 +1,34 @@
-# Run Funcotator on a set of called variants from Mutect 2.
+# Run Funcotator on a set of called variants.
 #
 # Description of inputs:
 #
 #   Required:
-#     String gatk_docker                  - GATK Docker image in which to run
-#     File ref_fasta                      - Reference FASTA file.
-#     File ref_fasta_index                - Reference FASTA file index.
-#     File ref_fasta_dict                 - Reference FASTA file sequence dictionary.
-#     File variant_vcf_to_funcotate       - Variant Context File (VCF) containing the variants to annotate.
-#     File variant_vcf_to_funcotate_index - Index file corresponding to the input Variant Context File (VCF) containing the variants to annotate.
-#     String reference_version            - Version of the reference being used.  Either `hg19` or `hg38`.
-#     String output_file_name             - Path to desired output file.
-#     String output_format                - Output file format (either VCF or MAF).
-#     Boolean compress				      - Whether to compress the resulting output file.
-#     Boolean use_gnomad                  - If true, will enable the gnomAD data sources in the data source tar.gz, if they exist.
+#     String gatk_docker                       - GATK Docker image in which to run
+#     File ref_fasta                           - Reference FASTA file.
+#     File ref_fasta_index                     - Reference FASTA file index.
+#     File ref_fasta_dict                      - Reference FASTA file sequence dictionary.
+#     File variant_vcf_to_funcotate            - Variant Call Format (VCF) file containing the variants to annotate.
+#     File variant_vcf_to_funcotate_index      - Index file corresponding to the input Variant Call Format (VCF) file containing the variants to annotate.
+#     String reference_version                 - Version of the reference being used.  Either `hg19` or `hg38`.
+#     String output_file_name                  - Path to desired output file.
+#     String output_format                     - Output file format (either VCF or MAF).
+#     Boolean compress                         - Whether to compress the resulting output file.
+#     Boolean use_gnomad                       - If true, will enable the gnomAD data sources in the data source tar.gz, if they exist.
 #
 #   Optional:
-#     interval_list                       - Intervals to be used for traversal.  If specified will only traverse the given intervals.
-#     data_sources_tar_gz                 - Path to tar.gz containing the data sources for Funcotator to create annotations.
-#     transcript_selection_mode           - Method of detailed transcript selection.  This will select the transcript for detailed annotation (either `CANONICAL` or `BEST_EFFECT`).
-#     transcript_selection_list           - Set of transcript IDs to use for annotation to override selected transcript.
-#     annotation_defaults                 - Annotations to include in all annotated variants if the annotation is not specified in the data sources (in the format <ANNOTATION>:<VALUE>).  This will add the specified annotation to every annotated variant if it is not already present.
-#     annotation_overrides                - Override values for annotations (in the format <ANNOTATION>:<VALUE>).  Replaces existing annotations of the given name with given values.
-#     gatk4_jar_override                  - Override Jar file containing GATK 4.0.  Use this when overriding the docker JAR or when using a backend without docker.
-#     funcotator_extra_args               - Extra command-line arguments to pass through to Funcotator.  (e.g. " --exclude-field foo_field --exclude-field bar_field ")
+#     File? interval_list                      - Intervals to be used for traversal.  If specified will only traverse the given intervals.
+#     File? data_sources_tar_gz                - Path to tar.gz containing the data sources for Funcotator to create annotations.
+#     String? transcript_selection_mode        - Method of detailed transcript selection.  This will select the transcript for detailed annotation (either `CANONICAL` or `BEST_EFFECT`).
+#     Array[String]? transcript_selection_list - Set of transcript IDs to use for annotation to override selected transcript.
+#     Array[String]? annotation_defaults       - Annotations to include in all annotated variants if the annotation is not specified in the data sources (in the format <ANNOTATION>:<VALUE>).  This will add the specified annotation to every annotated variant if it is not already present.
+#     Array[String]? annotation_overrides      - Override values for annotations (in the format <ANNOTATION>:<VALUE>).  Replaces existing annotations of the given name with given values.
+#     File? gatk4_jar_override                 - Override Jar file containing GATK 4.0.  Use this when overriding the docker JAR or when using a backend without docker.
+#     String? funcotator_extra_args            - Extra command-line arguments to pass through to Funcotator.  (e.g. " --exclude-field foo_field --exclude-field bar_field ")
 #
 # This WDL needs to decide whether to use the ``gatk_jar`` or ``gatk_jar_override`` for the jar location.  As of cromwell-0.24,
 # this logic *must* go into each task.  Therefore, there is a lot of duplicated code.  This allows users to specify a jar file
 # independent of what is in the docker file.  See the README.md for more info.
 #
-# NOTE: This only does VCF output right now!
-#
 workflow Funcotator {
     String gatk_docker
     File ref_fasta
@@ -94,15 +92,21 @@ task Funcotate {
      File ref_fasta
      File ref_fasta_index
      File ref_dict
+
      File input_vcf
      File input_vcf_idx
+
      String reference_version
+
      String output_file_base_name
      String output_format
+
      Boolean compress
      Boolean use_gnomad
 
-     File? data_sources_tar_gz
+     # This should be updated when a new version of the data sources is released
+     # TODO: Make this dynamically chosen in the command.
+     File? data_sources_tar_gz = "gs://broad-public-datasets/funcotator/funcotator_dataSources.v1.6.20190124s.tar.gz"
 
      String? control_id
      String? case_id
@@ -125,10 +129,10 @@ task Funcotate {
      String output_maf_index = output_maf + ".idx"
 
      String output_vcf = output_file_base_name + if compress then ".vcf.gz" else ".vcf"
-     String output_vcf_index = output_vcf +  if compress then ".tbi" else ".idx"
+     String output_vcf_idx = output_vcf +  if compress then ".tbi" else ".idx"
 
      String output_file = if output_format == "MAF" then output_maf else output_vcf
-     String output_file_index = if output_format == "MAF" then output_maf_index else output_vcf_index
+     String output_file_index = if output_format == "MAF" then output_maf_index else output_vcf_idx
 
      String transcript_selection_arg = if defined(transcript_selection_list) then " --transcript-list " else ""
      String annotation_def_arg = if defined(annotation_defaults) then " --annotation-default " else ""
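The naming logic in this hunk can be mirrored in a few lines of Python to check which files a run should produce. This is an illustrative sketch, not part of the commit: `output_names` is a hypothetical helper, and the `.maf` suffix is an assumption because the definition of `output_maf` falls outside the lines shown.

```python
from typing import Tuple


def output_names(base: str, output_format: str, compress: bool) -> Tuple[str, str]:
    """Return (output_file, output_file_index) following the hunk's rules."""
    # Assumption: the MAF path is the base name plus ".maf" (not shown in the diff).
    output_maf = base + ".maf"
    output_maf_index = output_maf + ".idx"

    # Compression only changes the VCF names, as in the WDL expressions.
    output_vcf = base + (".vcf.gz" if compress else ".vcf")
    output_vcf_idx = output_vcf + (".tbi" if compress else ".idx")

    if output_format == "MAF":
        return output_maf, output_maf_index
    return output_vcf, output_vcf_idx


print(output_names("variants.funcotated", "VCF", True))
```

Any value other than `"MAF"` falls through to the VCF names, which matches the `if output_format == "MAF" then ... else ...` expressions in the task.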
@@ -153,43 +157,43 @@ task Funcotate {
 
      Boolean use_ssd = false
 
-     # This should be updated when a new version of the data sources is released
-     # TODO: Make this dynamically chosen in the command.
-     # TODO: Make this pull from google cloud, rather than from the FTP:
-     String default_datasources_version = "funcotator_dataSources.v1.6.20190124s"
+     # Mem is in units of GB but our command and memory runtime values are in MB
+     Int default_ram_mb = 1024 * 3
+     Int machine_mem = if defined(mem) then mem *1024 else default_ram_mb
+     Int command_mem = machine_mem - 1024
 
-     # You may have to change the following two parameter values depending on the task requirements
-     Int default_ram_mb = 3000
-     # WARNING: In the workflow, you should calculate the disk space as an input to this task (disk_space_gb).  Please see [TODO: Link from Jose] for examples.
-     Int default_disk_space_gb = 100
+     # Calculate disk size:
+     Float ref_size_gb = size(ref_fasta, "GiB") + size(ref_fasta_index, "GiB") + size(ref_dict, "GiB")
+     Float vcf_size_gb = size(input_vcf, "GiB") + size(input_vcf_idx, "GiB")
+     Float ds_size_gb = size(data_sources_tar_gz, "GiB")
 
-     # Mem is in units of GB but our command and memory runtime values are in MB
-     Int machine_mem = if defined(mem) then mem *1000 else default_ram_mb
-     Int command_mem = machine_mem - 1000
+     Int default_disk_space_gb = ceil( ref_size_gb + (ds_size_gb * 2) + (vcf_size_gb * 10) ) + 20
 
+     # Silly hack to allow us to use the dollar sign in the command section:
      String dollar = "$"
 
      command <<<
          set -e
          export GATK_LOCAL_JAR=${default="/root/gatk.jar" gatk_override}
 
-         # Handle our data sources:
-         DATA_SOURCES_TAR_GZ=${data_sources_tar_gz}
-         if [[ ! -e $DATA_SOURCES_TAR_GZ ]] ; then
-             # We have to download the data sources:
-             echo "Data sources gzip does not exist: $DATA_SOURCES_TAR_GZ"
-             echo "Downloading default data sources..."
-             wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/funcotator/${default_datasources_version}.tar.gz
-             tar -zxf ${default_datasources_version}.tar.gz
-             DATA_SOURCES_FOLDER=${default_datasources_version}
-         else
-             # Extract the tar.gz:
-             echo "Extracting data sources zip file..."
-             mkdir datasources_dir
-             tar zxvf ${data_sources_tar_gz} -C datasources_dir --strip-components 1
-             DATA_SOURCES_FOLDER="$PWD/datasources_dir"
+         # =======================================
+         # Hack to validate our WDL inputs:
+         #
+         # NOTE: This happens here so that we don't waste time copying down the data sources if there's an error.
+
+         if [[ "${output_format}" != "MAF" ]] && [[ "${output_format}" != "VCF" ]] ; then
+            echo "ERROR: Output format must be MAF or VCF."
          fi
 
+         # =======================================
+         # Handle our data sources:
+
+         # Extract the tar.gz:
+         echo "Extracting data sources tar/gzip file..."
+         mkdir datasources_dir
+         tar zxvf ${data_sources_tar_gz} -C datasources_dir --strip-components 1
+         DATA_SOURCES_FOLDER="$PWD/datasources_dir"
+
          # Handle gnomAD:
          if ${use_gnomad} ; then
              echo "Enabling gnomAD..."
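The input check added in this hunk prints an error when `output_format` is neither `MAF` nor `VCF`, but the lines shown contain no `exit`, so the command would appear to continue on to the data source extraction. The sketch below assumes that failing fast is the intent. It is written in Python purely for illustration, and `validate_output_format` is a hypothetical name, not something in the WDL.

```python
VALID_OUTPUT_FORMATS = ("MAF", "VCF")


def validate_output_format(output_format: str) -> None:
    """Raise instead of only printing, so the caller stops before any heavy work."""
    # The comparison is case-sensitive, like the string tests in the command block.
    if output_format not in VALID_OUTPUT_FORMATS:
        raise ValueError("ERROR: Output format must be MAF or VCF.")


validate_output_format("VCF")  # passes silently

try:
    validate_output_format("vcf")
except ValueError as err:
    print(err)
```

In the shell command itself, the equivalent would be an `exit 1` after the `echo`, placed before the tar extraction.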
@@ -205,6 +209,7 @@ task Funcotate {
              done
          fi
 
+         # =======================================
          # Run Funcotator:
          gatk --java-options "-Xmx${command_mem}m" Funcotator \
              --data-sources-path $DATA_SOURCES_FOLDER \
@@ -226,6 +231,7 @@ task Funcotate {
              ${filter_funcotations_args} \
              ${extra_args_arg}
 
+         # =======================================
          # Make sure we have a placeholder index for MAF files so this workflow doesn't fail:
          if [[ "${output_format}" == "MAF" ]] ; then
             touch ${output_maf_index}
@@ -238,7 +244,7 @@ task Funcotate {
          memory: machine_mem + " MB"
          disks: "local-disk " + select_first([disk_space_gb, default_disk_space_gb]) + if use_ssd then " SSD" else " HDD"
          preemptible: select_first([preemptible_attempts, 3])
-         maxRetries: select_first([max_retries, 3])
+         maxRetries: select_first([max_retries, 0])
          cpu: select_first([cpu, 1])
      }
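The memory and disk defaults introduced in this commit feed directly into the `memory` and `disks` runtime strings above. A rough Python rendering of that arithmetic follows, as a sketch under stated assumptions: the function names are hypothetical, file sizes are passed in as byte counts instead of being read with WDL's `size()`, and `mem_gb` stands in for the optional `mem` input.

```python
import math
from typing import Dict, Optional, Tuple

GIB = 1024 ** 3  # bytes per GiB


def memory_mb(mem_gb: Optional[int] = None) -> Tuple[int, int]:
    """Return (machine_mem, command_mem) in MB, as computed in the task."""
    default_ram_mb = 1024 * 3
    machine_mem = mem_gb * 1024 if mem_gb is not None else default_ram_mb
    command_mem = machine_mem - 1024  # leave 1024 MB outside the Java heap
    return machine_mem, command_mem


def default_disk_space_gb(ref_bytes: int, vcf_bytes: int, ds_bytes: int) -> int:
    """ceil(ref + 2 * data sources + 10 * VCF) + 20, with sizes in GiB."""
    ref_gb = ref_bytes / GIB
    vcf_gb = vcf_bytes / GIB
    ds_gb = ds_bytes / GIB
    return math.ceil(ref_gb + (ds_gb * 2) + (vcf_gb * 10)) + 20


def runtime_strings(machine_mem: int, disk_gb: int, use_ssd: bool = False) -> Dict[str, str]:
    """Build the memory and disks strings the way the runtime block concatenates them."""
    disk_type = "SSD" if use_ssd else "HDD"
    return {"memory": f"{machine_mem} MB", "disks": f"local-disk {disk_gb} {disk_type}"}


machine_mem, command_mem = memory_mb()
disk_gb = default_disk_space_gb(3 * GIB, 1 * GIB, 15 * GIB)
print(runtime_strings(machine_mem, disk_gb), command_mem)
```

One consequence of the subtraction is that a `mem` of 1 leaves `command_mem` at 0, so in practice the input would need to be at least 2.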