diff --git a/docs/funcotator/forum_info/forum_post_tutorial.md b/docs/funcotator/forum_info/forum_post_tutorial.md index b3b01835ab8..eacb1ff59e9 100644 --- a/docs/funcotator/forum_info/forum_post_tutorial.md +++ b/docs/funcotator/forum_info/forum_post_tutorial.md @@ -12,6 +12,8 @@ This page explains what **Funcotator** is and how to run it. 2. [1.1.2 Pre-Packaged Data Sources](#1.1.2) 1. [1.1.2.1 Downloading Pre-Packaged Data Sources](#1.1.2.1) 2. [1.1.2.2 gnomAD](#1.1.2.2) + 1. [1.1.2.2.1 Enabling gnomAD](#1.1.2.2.1) + 2. [1.1.2.2.2 Included gnomAD Fields](#1.1.2.2.2) 3. [1.1.3 Data Source Downloader Tool](#1.1.3) 4. [1.1.4 Disabling Data Sourcesl](#1.1.4) 5. [1.1.5 User-Defined Data Sources](#1.1.5) @@ -112,10 +114,16 @@ Versioned gzip archives of data source files are provided here: ### 1.1.2.2 - gnomAD -The pre-packaged data sources include gnomAD, a large database of known variants. gnomAD is split into two parts - one based on exome data, one based on whole genome data. +The pre-packaged data sources include a subset of gnomAD, a large database of known variants. This subset contains a greatly reduced subset of the INFO fields, primarily containing allele frequency data. gnomAD is split into two parts - one based on exome data, one based on whole genome data. These two data sources are not equivalent and for complete coverage using gnomAD, we recommend annotating with both. Due to the size of gnomAD, it cannot be included in the data sources package directly. Instead, the configuration data are present and point to a Google bucket in which the gnomAD data reside. This will cause [Funcotator](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_funcotator_Funcotator.php "Funcotator") to actively connect to that bucket when it is run. For this reason, **gnomAD is disabled by default**. 
+ +Because [Funcotator](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_funcotator_Funcotator.php "Funcotator") will query the Internet **when gnomAD is enabled, performance will be impacted** by the machine's Internet connection speed. +If this degradation is significant, you can localize gnomAD to the machine running [Funcotator](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_funcotator_Funcotator.php "Funcotator") to improve performance (however due to the size of gnomAD this may be impractical). + + +### 1.1.2.2.1 - Enabling gnomAD To enable gnomAD, simply change directories to your data sources directory and untar the gnomAD tar.gz files: ``` cd DATA_SOURCES_DIR @@ -123,8 +131,56 @@ tar -zxf gnomAD_exome.tar.gz tar -zxf gnomAD_genome.tar.gz ``` -Because [Funcotator](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_funcotator_Funcotator.php "Funcotator") will query the Internet when gnomAD is enabled, performance will be impacted by the machine's Internet connection speed. -If this degradation is significant, you can localize gnomAD to the machine running [Funcotator](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_funcotator_Funcotator.php "Funcotator") to improve performance (however due to the size of gnomAD this may be impractical). + +### 1.1.2.2.2 - Included gnomAD Fields +The fields included in the pre-packaged gnomAD subset are the following: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+| Field Name | Field Description |
+| --- | --- |
+| AF | Allele Frequency, for each ALT allele, in the same order as listed |
+| AF_afr | Alternate allele frequency in samples of African-American ancestry |
+| AF_afr_female | Alternate allele frequency in female samples of African-American ancestry |
+| AF_afr_male | Alternate allele frequency in male samples of African-American ancestry |
+| AF_amr | Alternate allele frequency in samples of Latino ancestry |
+| AF_amr_female | Alternate allele frequency in female samples of Latino ancestry |
+| AF_amr_male | Alternate allele frequency in male samples of Latino ancestry |
+| AF_asj | Alternate allele frequency in samples of Ashkenazi Jewish ancestry |
+| AF_asj_female | Alternate allele frequency in female samples of Ashkenazi Jewish ancestry |
+| AF_asj_male | Alternate allele frequency in male samples of Ashkenazi Jewish ancestry |
+| AF_eas | Alternate allele frequency in samples of East Asian ancestry |
+| AF_eas_female | Alternate allele frequency in female samples of East Asian ancestry |
+| AF_eas_jpn | Alternate allele frequency in samples of Japanese ancestry |
+| AF_eas_kor | Alternate allele frequency in samples of Korean ancestry |
+| AF_eas_male | Alternate allele frequency in male samples of East Asian ancestry |
+| AF_eas_oea | Alternate allele frequency in samples of non-Korean, non-Japanese East Asian ancestry |
+| AF_female | Alternate allele frequency in female samples |
+| AF_fin | Alternate allele frequency in samples of Finnish ancestry |
+| AF_fin_female | Alternate allele frequency in female samples of Finnish ancestry |
+| AF_fin_male | Alternate allele frequency in male samples of Finnish ancestry |
+| AF_male | Alternate allele frequency in male samples |
+| AF_nfe | Alternate allele frequency in samples of non-Finnish European ancestry |
+| AF_nfe_bgr | Alternate allele frequency in samples of Bulgarian ancestry |
+| AF_nfe_est | Alternate allele frequency in samples of Estonian ancestry |
+| AF_nfe_female | Alternate allele frequency in female samples of non-Finnish European ancestry |
+| AF_nfe_male | Alternate allele frequency in male samples of non-Finnish European ancestry |
+| AF_nfe_nwe | Alternate allele frequency in samples of North-Western European ancestry |
+| AF_nfe_onf | Alternate allele frequency in samples of non-Finnish but otherwise indeterminate European ancestry |
+| AF_nfe_seu | Alternate allele frequency in samples of Southern European ancestry |
+| AF_nfe_swe | Alternate allele frequency in samples of Swedish ancestry |
+| AF_oth | Alternate allele frequency in samples of uncertain ancestry |
+| AF_oth_female | Alternate allele frequency in female samples of uncertain ancestry |
+| AF_oth_male | Alternate allele frequency in male samples of uncertain ancestry |
+| AF_popmax | Maximum allele frequency across populations (excluding samples of Ashkenazi, Finnish, and indeterminate ancestry) |
+| AF_raw | Alternate allele frequency in samples, before removing low-confidence genotypes |
+| AF_sas | Alternate allele frequency in samples of South Asian ancestry |
+| AF_sas_female | Alternate allele frequency in female samples of South Asian ancestry |
+| AF_sas_male | Alternate allele frequency in male samples of South Asian ancestry |
+| OriginalAlleles* | A list of the original alleles (including REF) of the variant prior to liftover. If the alleles were not changed during liftover, this attribute will be omitted. |
+| OriginalContig* | The name of the source contig/chromosome prior to liftover. |
+| OriginalStart* | The position of the variant on the source contig prior to liftover. |
+| ReverseComplementedAlleles* | The REF and the ALT alleles have been reverse complemented in liftover since the mapping from the previous reference to the current one was on the negative strand. |
+| SwappedAlleles* | The REF and the ALT alleles have been swapped in liftover due to changes in the reference. It is possible that not all INFO annotations reflect this swap, and in the genotypes, only the GT, PL, and AD fields have been modified. You should check the TAGS_TO_REVERSE parameter that was used during the LiftOver to be sure. |
+\* - only available in *hg38* ### 1.1.3 - Data Source Downloader Tool @@ -406,7 +462,9 @@ This effect has not yet been quantified, but in most cases should not be appreci #### 1.5 - Comparisons with Oncotator -Oncotator is an older functional annotation tool developed by The Broad Institute. Funcotator and Oncotator are fundamentally different tools with some similarities. Some comparison highlights between Oncotator and Funcotator are in the following two tables: +Oncotator is an older functional annotation tool developed by The Broad Institute. Funcotator and Oncotator are fundamentally different tools with some similarities. + +While I maintain that a direct comparison should not be made, to address some inevitable questions some comparison highlights between Oncotator and Funcotator are in the following two tables: #### 1.5.1 - Funcotator / Oncotator Feature Comparison @@ -428,7 +486,7 @@ Oncotator is an older functional annotation tool developed by The Broad Institut Default config speed germline (muts/min) (hg19)A very long time.... Default config speed somatic (muts/min) (hg38)N/A Default config speed germline (muts/min) (hg38)N/A -DocumentationTutorial; Specifications forum post; inclusion in workshop materialsMinimal support in forum +DocumentationTutorial; Specifications forum post; inclusion in workshop materialsMinimal support in forum ManuscriptPlannedYes HGVS supportNoYes BigWig datasource supportNoLinux only @@ -462,8 +520,8 @@ Oncotator is an older functional annotation tool developed by The Broad Institut #### 1.5.2 - Oncotator Bugs Compared With Funcotator - - + + diff --git a/scripts/mutect2_wdl/mutect2.wdl b/scripts/mutect2_wdl/mutect2.wdl index dce0e8d52a3..dba538b397a 100755 --- a/scripts/mutect2_wdl/mutect2.wdl +++ b/scripts/mutect2_wdl/mutect2.wdl @@ -52,12 +52,17 @@ ## ## Funcotator parameters (see Funcotator help for more details). ## funco_reference_version: "hg19" for hg19 or b37. "hg38" for hg38. 
Default: "hg19" -## funco_transcript_selection_list: Transcripts (one GENCODE ID per line) to give priority during selection process. -## funco_transcript_selection_mode: How to select transcripts in Funcotator. ALL, CANONICAL, or BEST_EFFECT +## funco_output_format: "MAF" to produce a MAF file, "VCF" to produce a VCF file. Default: "MAF" +## funco_compress: (Only valid if funco_output_format == "VCF" ) If true, will compress the output of Funcotator. If false, produces an uncompressed output file. Default: false +## funco_use_gnomad_AF: If true, will include gnomAD allele frequency annotations in output by connecting to the internet to query gnomAD (this impacts performance). If false, will not annotate with gnomAD. Default: false ## funco_data_sources_tar_gz: Funcotator datasources tar gz file. Bucket location is recommended when running on the cloud. +## funco_transcript_selection_mode: How to select transcripts in Funcotator. ALL, CANONICAL, or BEST_EFFECT +## funco_transcript_selection_list: Transcripts (one GENCODE ID per line) to give priority during selection process. ## funco_annotation_defaults: Default values for annotations, when values are unspecified. Specified as :. For example: "Center:Broad" ## funco_annotation_overrides: Values for annotations, even when values are unspecified. Specified as :. For example: "Center:Broad" ## funcotator_excluded_fields: Annotations that should not appear in the output (VCF or MAF). Specified as . For example: "ClinVar_ALLELEID" ## funco_filter_funcotations: If true, will only annotate variants that have passed filtering (. or PASS value in the FILTER column). If false, will annotate all variants in the input file. Default: true ## funcotator_extra_args: Any additional arguments to pass to Funcotator. Default: "" ## ## Outputs : ## - One VCF file and its index with primary filtering applied; secondary filtering and functional annotation if requested; a bamout.bam @@ -119,22 +124,28 @@ workflow Mutect2 { File? 
default_config_file String? oncotator_extra_args - # funcotator inputs + # Funcotator inputs Boolean? run_funcotator - Boolean run_funcotator_or_default = select_first([run_funcotator, false]) + String? funco_reference_version + String? funco_output_format + Boolean? funco_compress + Boolean? funco_use_gnomad_AF File? funco_data_sources_tar_gz String? funco_transcript_selection_mode File? funco_transcript_selection_list Array[String]? funco_annotation_defaults Array[String]? funco_annotation_overrides Array[String]? funcotator_excluded_fields + Boolean? funco_filter_funcotations String? funcotator_extra_args - File? gatk_override + Boolean run_funcotator_or_default = select_first([run_funcotator, false]) + String funco_default_output_format = "MAF" # runtime String gatk_docker + File? gatk_override String basic_bash_docker = "ubuntu:16.04" String? oncotator_docker String oncotator_docker_or_default = select_first([oncotator_docker, "broadinstitute/oncotator:1.9.9.0"]) @@ -413,30 +424,40 @@ workflow Mutect2 { if (run_funcotator_or_default) { File funcotate_vcf_input = select_first([FilterAlignmentArtifacts.filtered_vcf, FilterByOrientationBias.filtered_vcf, Filter.filtered_vcf]) File funcotate_vcf_input_index = select_first([FilterAlignmentArtifacts.filtered_vcf_index, FilterByOrientationBias.filtered_vcf_index, Filter.filtered_vcf_index]) - call FuncotateMaf { + call Funcotate { input: - input_vcf = funcotate_vcf_input, - input_vcf_idx = funcotate_vcf_input_index, ref_fasta = ref_fasta, ref_fasta_index = ref_fai, ref_dict = ref_dict, + input_vcf = funcotate_vcf_input, + input_vcf_idx = funcotate_vcf_input_index, reference_version = select_first([funco_reference_version, "hg19"]), + output_file_base_name = basename(funcotate_vcf_input, ".vcf") + ".annotated", + output_format = if defined(funco_output_format) then "" + funco_output_format else funco_default_output_format, + compress = if defined(funco_compress) then funco_compress else false, + use_gnomad = if 
defined(funco_use_gnomad_AF) then funco_use_gnomad_AF else false, + data_sources_tar_gz = funco_data_sources_tar_gz, - case_id = M2.tumor_sample[0], + control_id = M2.normal_sample[0], + case_id = M2.tumor_sample[0], + sequencing_center = sequencing_center, + sequence_source = sequence_source, transcript_selection_mode = funco_transcript_selection_mode, transcript_selection_list = funco_transcript_selection_list, annotation_defaults = funco_annotation_defaults, annotation_overrides = funco_annotation_overrides, + funcotator_excluded_fields = funcotator_excluded_fields, + filter_funcotations = filter_funcotations_or_default, + + extra_args = funcotator_extra_args, + gatk_docker = gatk_docker, + gatk_override = gatk_override, - filter_funcotations = filter_funcotations_or_default, - funcotator_excluded_fields = funcotator_excluded_fields, - sequencing_center = sequencing_center, - sequence_source = sequence_source, - disk_space_gb = ceil(size(funcotate_vcf_input, "GB") * large_input_to_output_multiplier) + onco_tar_size + disk_pad, + preemptible_attempts = preemptible_attempts, max_retries = max_retries, - extra_args = funcotator_extra_args + disk_space_gb = ceil(size(funcotate_vcf_input, "GB") * large_input_to_output_multiplier) + onco_tar_size + disk_pad } } @@ -445,7 +466,8 @@ workflow Mutect2 { File filtered_vcf_index = select_first([FilterAlignmentArtifacts.filtered_vcf_index, FilterByOrientationBias.filtered_vcf_index, Filter.filtered_vcf_index]) File? contamination_table = CalculateContamination.contamination_table File? oncotated_m2_maf = oncotate_m2.oncotated_m2_maf - File? funcotated_maf = FuncotateMaf.funcotated_output + File? funcotated_file = Funcotate.funcotated_output_file + File? funcotated_file_index = Funcotate.funcotated_output_file_index File? preadapter_detail_metrics = CollectSequencingArtifactMetrics.pre_adapter_metrics File? bamout = MergeBamOuts.merged_bam_out File? 
bamout_index = MergeBamOuts.merged_bam_out_index @@ -1181,43 +1203,65 @@ task SumFloats { } } -task FuncotateMaf { - # inputs +task Funcotate { + + # ============== + # Inputs File ref_fasta File ref_fasta_index File ref_dict File input_vcf File input_vcf_idx String reference_version - String output_format = "MAF" + String output_file_base_name + String output_format + Boolean compress + Boolean use_gnomad + + # This should be updated when a new version of the data sources is released + # TODO: Make this dynamically chosen in the command. + File? data_sources_tar_gz = "gs://broad-public-datasets/funcotator/funcotator_dataSources.v1.6.20190124s.tar.gz" + + String? control_id + String? case_id String? sequencing_center String? sequence_source - String case_id - String? control_id - - File? data_sources_tar_gz String? transcript_selection_mode File? transcript_selection_list Array[String]? annotation_defaults Array[String]? annotation_overrides Array[String]? funcotator_excluded_fields - Boolean filter_funcotations + Boolean? filter_funcotations File? interval_list String? 
extra_args # ============== # Process input args: + + String output_maf = output_file_base_name + ".maf" + String output_maf_index = output_maf + ".idx" + + String output_vcf = output_file_base_name + if compress then ".vcf.gz" else ".vcf" + String output_vcf_index = output_vcf + if compress then ".tbi" else ".idx" + + String output_file = if output_format == "MAF" then output_maf else output_vcf + String output_file_index = if output_format == "MAF" then output_maf_index else output_vcf_index + + String transcript_selection_arg = if defined(transcript_selection_list) then " --transcript-list " else "" String annotation_def_arg = if defined(annotation_defaults) then " --annotation-default " else "" String annotation_over_arg = if defined(annotation_overrides) then " --annotation-override " else "" - String filter_funcotations_args = if (filter_funcotations) then " --remove-filtered-variants " else "" + String filter_funcotations_args = if defined(filter_funcotations) && (filter_funcotations) then " --remove-filtered-variants " else "" String excluded_fields_args = if defined(funcotator_excluded_fields) then " --exclude-field " else "" - String final_output_filename = basename(input_vcf, ".vcf") + ".maf.annotated" - # ============== - # runtime + String interval_list_arg = if defined(interval_list) then " -L " else "" + + String extra_args_arg = select_first([extra_args, ""]) + # ============== + # Runtime options: String gatk_docker + File? gatk_override Int? mem Int? preemptible_attempts @@ -1227,56 +1271,67 @@ task FuncotateMaf { Boolean use_ssd = false - # This should be updated when a new version of the data sources is released - String default_datasources_version = "funcotator_dataSources.v1.4.20180615" - # You may have to change the following two parameter values depending on the task requirements Int default_ram_mb = 3000 - # WARNING: In the workflow, you should calculate the disk space as an input to this task (disk_space_gb). 
+ # WARNING: In the workflow, you should calculate the disk space as an input to this task (disk_space_gb). Please see [TODO: Link from Jose] for examples. Int default_disk_space_gb = 100 # Mem is in units of GB but our command and memory runtime values are in MB Int machine_mem = if defined(mem) then mem *1000 else default_ram_mb Int command_mem = machine_mem - 1000 + String dollar = "$" + command <<< set -e export GATK_LOCAL_JAR=${default="/root/gatk.jar" gatk_override} - DATA_SOURCES_TAR_GZ=${data_sources_tar_gz} - if [[ ! -e $DATA_SOURCES_TAR_GZ ]] ; then - # We have to download the data sources: - echo "Data sources gzip does not exist: $DATA_SOURCES_TAR_GZ" - echo "Downloading default data sources..." - wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/funcotator/${default_datasources_version}.tar.gz - tar -zxf ${default_datasources_version}.tar.gz - DATA_SOURCES_FOLDER=${default_datasources_version} - else - # Extract the tar.gz: - mkdir datasources_dir - tar zxvf ${data_sources_tar_gz} -C datasources_dir --strip-components 1 - DATA_SOURCES_FOLDER="$PWD/datasources_dir" + # Extract our data sources: + echo "Extracting data sources zip file..." + mkdir datasources_dir + tar zxvf ${data_sources_tar_gz} -C datasources_dir --strip-components 1 + DATA_SOURCES_FOLDER="$PWD/datasources_dir" + + # Handle gnomAD: + if ${use_gnomad} ; then + echo "Enabling gnomAD..." 
+ for potential_gnomad_gz in gnomAD_exome.tar.gz gnomAD_genome.tar.gz ; do + if [[ -f ${dollar}{DATA_SOURCES_FOLDER}/${dollar}{potential_gnomad_gz} ]] ; then + cd ${dollar}{DATA_SOURCES_FOLDER} + tar -zvxf ${dollar}{potential_gnomad_gz} + cd - + else + echo "ERROR: Cannot find gnomAD folder: ${dollar}{potential_gnomad_gz}" 1>&2 + false + fi + done fi + # Run Funcotator: gatk --java-options "-Xmx${command_mem}m" Funcotator \ --data-sources-path $DATA_SOURCES_FOLDER \ --ref-version ${reference_version} \ --output-file-format ${output_format} \ -R ${ref_fasta} \ -V ${input_vcf} \ - -O ${final_output_filename} \ - ${"-L " + interval_list} \ + -O ${output_file} \ + ${interval_list_arg} ${default="" interval_list} \ + --annotation-default normal_barcode:${default="Unknown" control_id} \ + --annotation-default tumor_barcode:${default="Unknown" case_id} \ + --annotation-default Center:${default="Unknown" sequencing_center} \ + --annotation-default source:${default="Unknown" sequence_source} \ ${"--transcript-selection-mode " + transcript_selection_mode} \ - ${"--transcript-list " + transcript_selection_list} \ - --annotation-default normal_barcode:${control_id} \ - --annotation-default tumor_barcode:${case_id} \ - --annotation-default Center:${default="Unknown" sequencing_center} \ - --annotation-default source:${default="Unknown" sequence_source} \ + ${transcript_selection_arg}${default="" sep=" --transcript-list " transcript_selection_list} \ ${annotation_def_arg}${default="" sep=" --annotation-default " annotation_defaults} \ ${annotation_over_arg}${default="" sep=" --annotation-override " annotation_overrides} \ ${excluded_fields_args}${default="" sep=" --exclude-field " funcotator_excluded_fields} \ ${filter_funcotations_args} \ - ${extra_args} + ${extra_args_arg} + + # Make sure we have a placeholder index for MAF files so this workflow doesn't fail: + if [[ "${output_format}" == "MAF" ]] ; then + touch ${output_maf_index} + fi >>> runtime { @@ -1290,6 +1345,7 @@ 
task FuncotateMaf { } output { - File funcotated_output = "${final_output_filename}" + File funcotated_output_file = "${output_file}" + File funcotated_output_file_index = "${output_file_index}" } - } \ No newline at end of file + } diff --git a/scripts/mutect2_wdl/unsupported/README.md b/scripts/mutect2_wdl/unsupported/README.md deleted file mode 100644 index 3bedc319287..00000000000 --- a/scripts/mutect2_wdl/unsupported/README.md +++ /dev/null @@ -1,107 +0,0 @@ -### Mutect2 autovalidation - -## Introduction -The Mutect2 autovalidation comprises a sensitivity validation and a specificity validation. - -In the sensitivity validation, we mix (in vitro, not in silico) several Hapmap samples in roughly equal proportions to simulate a tumor with varying allele fractions, sequence the resulting mixture, and run Mutect2 in tumor-only mode. Sensitivity to "somatic" variations is then defined as sensitivity to the known germline variants of the constituent samples. A validation consists of 5-plex, 10-plex, and 20-plex mixtures, each with several replicates. - -In the specificity validation, we make several replicates of a non-tumor sample and for each pair of these replicates we run Mutect2 in tumor-normal mode, with one replicate arbitrarily assigned as the "tumor." Since every call is by definition a false positive, this yields a measure of specificity. - - -## Requirements - -The following files from a clone of the gatk git repository, copied into a single directory: -* scripts/mutect2_wdl/mutect2.wdl -* scripts/mutect2_wdl/unsupported/hapmap_sensitivity.wdl -* scripts/mutect2_wdl/unsupported/hapmap_sensitivity_all_plexes.wdl -* scripts/mutect2_wdl/unsupported/mutect2-replicate-validation.wdl -* scripts/mutect2_wdl/unsupported/calculate_sensitivity.py - -Additionally, the gatk git repository has a script called gatk (in the root directory of the repo) that is used to invoke the gatk. 
If running on the cloud this is in the gatk docker image and you don't have to do anything. If running on SGE, you must copy this script to a directory that is in your $PATH. - -The following resources: -* Three preprocessed Hapmap vcfs -- one each for the 5-plex, 10-plex and 20-plex mixtures. These are produced by preprocess_hapmap.wdl but as long as the sample composition of the mixtures remains the same they do not need to be generated again. That is, the proportions need not be the same, but the same 5, 10, and 20 Hapmap samples must be present. -* A reference .fasta file, along with accompanying .fasta.fai and .dict files. -* A gatk4 java .jar file. -* Three lists of .bam files -- one each for 5-plex, 10-plex and 20-plex replicates -- where each row has the format -* A list of .bam files of the specificity validation's replicates, where each row has the format , with one row for each *ordered* pair i, j eg (1,2), (1,3), (2,1), (2,3), (3,1), (3,2) if there are three replicates. -* An intervals file. -* A gnomAD vcf. -* A Mutect2 panel of normals .vcf corresponding to the intervals and sequencing protocol of the specificity replicates. - -## Preparing the wdl input files -In the same directory as your wdl scripts, fill in a file called sensitivity.json as follows: - -``` -{ - "HapmapSensitivityAllPlexes.gatk_override": "[Path to a gatk jar file. 
Omitting this line uses the gatk jar in the docker image.]", - "HapmapSensitivityAllPlexes.gatk_docker": "[gatk docker image eg broadinstitute/gatk:4.beta.3 -- this is not used in SGE but you still have to fill it in.]", - "HapmapSensitivityAllPlexes.intervals": "[path to intervals file]", - "HapmapSensitivityAllPlexes.ref_fasta": "[path to reference .fasta file]", - "HapmapSensitivityAllPlexes.ref_fai": "[path to reference .fasta.fai file]", - "HapmapSensitivityAllPlexes.ref_dict": "[path to reference .dict file]", - "HapmapSensitivityAllPlexes.five_plex_bam_list": "[path to 5-plex bams list]", - "HapmapSensitivityAllPlexes.ten_plex_bam_list": "[path to 10-plex bams list]", - "HapmapSensitivityAllPlexes.twenty_plex_bam_list": "[path to 20-plex bams list]", - "HapmapSensitivityAllPlexes.max_depth": "The maximum depth to consider for sensitivity. 1000 is a reasonable default.", - "HapmapSensitivityAllPlexes.depth_bins": "Discrete depths at which to bin statistics. [0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800] is reasonable for many exomes", - "HapmapSensitivityAllPlexes.depth_bin_width": "The width of depth bins. 
Half the spacing betweens depths is reasonable.", - "HapmapSensitivityAllPlexes.scatter_count": "[How many ways to scatter runs on Mutect2 on each bam file]", - "HapmapSensitivityAllPlexes.run_orientation_bias_filter": "true/false depending on whether you wish to run this filter", - "HapmapSensitivityAllPlexes.artifact_modes": The artifact modes of the orientation bias filter eg: ["G/T", "C/T"], - "HapmapSensitivityAllPlexes.five_plex_preprocessed": "[path to preprocessed 5-plex vcf]", - "HapmapSensitivityAllPlexes.five_plex_preprocessed_idx": "[path to preprocessed 5-plex vcf index]", - "HapmapSensitivityAllPlexes.ten_plex_preprocessed": "[path to preprocessed 10-plex vcf]", - "HapmapSensitivityAllPlexes.ten_plex_preprocessed_idx": "[path to preprocessed 10-plex vcf index]", - "HapmapSensitivityAllPlexes.twenty_plex_preprocessed": "[path to preprocessed 20-plex vcf]", - "HapmapSensitivityAllPlexes.twenty_plex_preprocessed_idx": "[path to preprocessed 20-plex vcf index]", - "HapmapSensitivityAllPlexes.python_script": "path to calculate_sensitivity.py", - "HapmapSensitivityAllPlexes.m2_extra_args": "optionally, any additional Mutect2 command line arguments", - "HapmapSensitivityAllPlexes.m2_extra_filtering_args": "--maxEventsInHaplotype 100 --max_germline_posterior 1.0" -} -``` - -Note the extra filtering arguments hard-coded into these inputs. These are necessary to disable filtering of germline variants, because the "somatic" variants here are actually germline variants. - -In the same directory as your wdl scripts, fill in a file called specificity.json as follows: - -``` -{ - "Mutect2ReplicateValidation.gatk_override": "[Path to a gatk jar file. 
Omitting this line uses the gatk jar in the docker image.]", - "Mutect2ReplicateValidation.gatk_docker": "[gatk docker image eg broadinstitute/gatk:4.beta.3 -- this is not used in SGE but you still have to fill it in.]", - "Mutect2ReplicateValidation.ref_fasta": "[path to reference .fasta file]", - "Mutect2ReplicateValidation.ref_fai": "[path to reference .fasta.fai file]", - "Mutect2ReplicateValidation.ref_dict": "[path to reference .dict file]", - "Mutect2ReplicateValidation.replicate_pair_list": "[path to replicate bams list]", - "Mutect2ReplicateValidation.intervals": "[path to intervals file]", - "Mutect2ReplicateValidation.pon": "[path to panel of normals vcf]", - "Mutect2ReplicateValidation.pon_index": "[path to panel of normals vcf index]", - "Mutect2ReplicateValidation.gnomad": "[path to panel of gnomAD vcf]", - "Mutect2ReplicateValidation.gnomad_index": "[path to panel of gnomAD vcf index]", - "Mutect2ReplicateValidation.scatter_count": "[How many ways to scatter runs on Mutect2 on each bam file]", - "Mutect2ReplicateValidation.run_orientation_bias_filter": "true/false depending on whether you wish to run this filter", - "Mutect2ReplicateValidation.artifact_modes": The artifact modes of the orientation bias filter eg: ["G/T", "C/T"], - "Mutect2ReplicateValidation.preemptible_attempts": "2", - "Mutect2ReplicateValidation.m2_extra_args": "optionally, any additional Mutect2 command line arguments", - "Mutect2ReplicateValidation.m2_extra_filtering_args": "optionally, any additional Mutect2 command line arguments" -} -``` - -Note that the docker image path is not used when the validations are run on an SGE cluster. When running on SGE, a valid docker path must still be given or else cromwell will fail. - -To summarize the differences between running in the cloud and on SGE: -* Your jsons must include a valid gatk_docker in both cases, however, when running on SGE this docker image is not actually used. 
-* When running in SGE you must put a gatk_override jar file in your jsons. When running in the cloud you may include one but if you omit this line from your jsons the gatk jar in the docker image will be used.
-* When running in SGE you must make sure to copy the gatk script in the root directory of the gatk git repo into a folder that is in your bash $PATH variable.
-
-## Running in Cromwell
-* Run hapmap_sensitivity_all_plexes.wdl with the parameters in sensitivity.json
-* Run mutect2-replicate-validation.wdl with the parameters in specificity.json
-
-## Outputs
-The sensitivity validation outputs include many vcfs of true positives and false negatives used for debugging and improving Mutect. The relevant outputs for validation results are:
-* {snp, indel}\_{table, plot}\_{5, 10, 20, all}\_plex: tables in tsv format and graphs in png format of sensitivity versus depth and allele fraction for snvs and indels at each plex and aggregated over all plexes.
-* {snp, indel}\_jaccard\_{5, 10, 20}\_plex: matrices in tsv format of the snv and indel jaccard index between each pair of replicates for each plex. The jaccard index is the overlap of callsets divided by the union.
-* summary\_{5, 10, 20}\_plex: the overall sensitivity (not binned by depth and allele fraction) for snvs and indels for each replicate of each plex.
-
-The specificty validation's primary output, summary.txt, is a tsv file containing the rates of snv and indel false positives for each replicate pair.
diff --git a/scripts/mutect2_wdl/unsupported/funcotator.wdl b/scripts/mutect2_wdl/unsupported/funcotator.wdl
index e2d8106aa2f..ed522ae981e 100644
--- a/scripts/mutect2_wdl/unsupported/funcotator.wdl
+++ b/scripts/mutect2_wdl/unsupported/funcotator.wdl
@@ -3,25 +3,27 @@
 # Description of inputs:
 #
 # Required:
-# gatk_docker - GATK Docker image in which to run
-# ref_fasta - Reference FASTA file.
-# ref_fasta_index - Reference FASTA file index.
-# ref_fasta_dict - Reference FASTA file sequence dictionary.
-# variant_vcf_to_funcotate - Variant Context File (VCF) containing the variants to annotate.
-# reference_version - Version of the reference being used. Either `hg19` or `hg38`.
-# output_file_name - Path to desired output file.
-# compress - Whether to compress the resulting output file.
-# Boolean use_gnomad - If true, will enable the gnomAD data sources in the data source tar.gz, if they exist.
+# String gatk_docker - GATK Docker image in which to run
+# File ref_fasta - Reference FASTA file.
+# File ref_fasta_index - Reference FASTA file index.
+# File ref_fasta_dict - Reference FASTA file sequence dictionary.
+# File variant_vcf_to_funcotate - Variant Context File (VCF) containing the variants to annotate.
+# File variant_vcf_to_funcotate_index - Index file corresponding to the input Variant Context File (VCF) containing the variants to annotate.
+# String reference_version - Version of the reference being used. Either `hg19` or `hg38`.
+# String output_file_name - Path to desired output file.
+# String output_format - Output file format (either VCF or MAF).
+# Boolean compress - Whether to compress the resulting output file.
+# Boolean use_gnomad - If true, will enable the gnomAD data sources in the data source tar.gz, if they exist.
 #
 # Optional:
-# interval_list - Intervals to be used for traversal. If specified will only traverse the given intervals.
-# data_sources_tar_gz - Path to tar.gz containing the data sources for Funcotator to create annotations.
-# transcript_selection_mode - Method of detailed transcript selection. This will select the transcript for detailed annotation (either `CANONICAL` or `BEST_EFFECT`).
-# transcript_selection_list - Set of transcript IDs to use for annotation to override selected transcript.
-# annotation_defaults - Annotations to include in all annotated variants if the annotation is not specified in the data sources (in the format <ANNOTATION>:<VALUE>). This will add the specified annotation to every annotated variant if it is not already present.
-# annotation_overrides - Override values for annotations (in the format <ANNOTATION>:<VALUE>). Replaces existing annotations of the given name with given values.
-# gatk4_jar_override - Override Jar file containing GATK 4.0. Use this when overriding the docker JAR or when using a backend without docker.
-# funcotator_extra_args - Extra command-line arguments to pass through to Funcotator. (e.g. " --exclude-field foo_field --exclude-field bar_field ")
+# interval_list - Intervals to be used for traversal. If specified will only traverse the given intervals.
+# data_sources_tar_gz - Path to tar.gz containing the data sources for Funcotator to create annotations.
+# transcript_selection_mode - Method of detailed transcript selection. This will select the transcript for detailed annotation (either `CANONICAL` or `BEST_EFFECT`).
+# transcript_selection_list - Set of transcript IDs to use for annotation to override selected transcript.
+# annotation_defaults - Annotations to include in all annotated variants if the annotation is not specified in the data sources (in the format <ANNOTATION>:<VALUE>). This will add the specified annotation to every annotated variant if it is not already present.
+# annotation_overrides - Override values for annotations (in the format <ANNOTATION>:<VALUE>). Replaces existing annotations of the given name with given values.
+# gatk4_jar_override - Override Jar file containing GATK 4.0. Use this when overriding the docker JAR or when using a backend without docker.
+# funcotator_extra_args - Extra command-line arguments to pass through to Funcotator. (e.g. " --exclude-field foo_field --exclude-field bar_field ")
 #
 # This WDL needs to decide whether to use the ``gatk_jar`` or ``gatk_jar_override`` for the jar location. As of cromwell-0.24,
 # this logic *must* go into each task. Therefore, there is a lot of duplicated code. This allows users to specify a jar file
@@ -35,10 +37,12 @@ workflow Funcotator {
     File ref_fasta_index
     File ref_dict
     File variant_vcf_to_funcotate
+    File variant_vcf_to_funcotate_index
     String reference_version
     String output_file_base_name
-    Boolean compress
-    Boolean use_gnomad
+    String output_format
+    Boolean compress
+    Boolean use_gnomad
     File? interval_list
     File? data_sources_tar_gz
@@ -46,58 +50,69 @@ workflow Funcotator {
     Array[String]? transcript_selection_list
     Array[String]? annotation_defaults
     Array[String]? annotation_overrides
-    File? gatk4_jar_override
-    String? funcotator_extra_args
+    File? gatk4_jar_override
+
     call Funcotate {
         input:
+            gatk_docker = gatk_docker,
             ref_fasta = ref_fasta,
             ref_fasta_index = ref_fasta_index,
             ref_dict = ref_dict,
             input_vcf = variant_vcf_to_funcotate,
+            input_vcf_idx = variant_vcf_to_funcotate_index,
             reference_version = reference_version,
-            interval_list = interval_list,
             output_file_base_name = output_file_base_name,
-            compress = compress,
-            output_format = "VCF",
+            output_format = output_format,
+            compress = compress,
+            use_gnomad = use_gnomad,
+
+            interval_list = interval_list,
             data_sources_tar_gz = data_sources_tar_gz,
             transcript_selection_mode = transcript_selection_mode,
             transcript_selection_list = transcript_selection_list,
             annotation_defaults = annotation_defaults,
             annotation_overrides = annotation_overrides,
-            gatk_override = gatk4_jar_override,
-            gatk_docker = gatk_docker,
-            use_gnomad = use_gnomad,
-            extra_args = funcotator_extra_args
+            extra_args = funcotator_extra_args,
+
+            gatk_override = gatk4_jar_override
     }
     output {
-        File funcotated_vcf_out = Funcotate.funcotated_vcf
-        File funcotated_vcf_out_idx = Funcotate.funcotated_vcf_index
+        File funcotated_file_out = Funcotate.funcotated_output_file
+        File funcotated_file_out_idx = Funcotate.funcotated_output_file_index
     }
 }
+################################################################################
 task Funcotate {
-    # inputs
+
+    # ==============
+    # Inputs
     File ref_fasta
     File ref_fasta_index
     File ref_dict
     File input_vcf
+    File input_vcf_idx
     String reference_version
     String output_file_base_name
     String output_format
     Boolean compress
-    String output_vcf = output_file_base_name + if compress then ".vcf.gz" else ".vcf"
-    String output_vcf_index = output_vcf + if compress then ".tbi" else ".idx"
+    Boolean use_gnomad
-    Boolean use_gnomad
     File? data_sources_tar_gz
+
+    String? control_id
+    String? case_id
+    String? sequencing_center
+    String? sequence_source
     String? transcript_selection_mode
-    Array[String]? transcript_selection_list
+    File? transcript_selection_list
     Array[String]? annotation_defaults
     Array[String]? annotation_overrides
+    Array[String]? funcotator_excluded_fields
     Boolean? filter_funcotations
     File? interval_list
@@ -105,20 +120,34 @@ task Funcotate {
     # ==============
     # Process input args:
+
+    String output_maf = output_file_base_name + ".maf"
+    String output_maf_index = output_maf + ".idx"
+
+    String output_vcf = output_file_base_name + if compress then ".vcf.gz" else ".vcf"
+    String output_vcf_index = output_vcf + if compress then ".tbi" else ".idx"
+
+    String output_file = if output_format == "MAF" then output_maf else output_vcf
+    String output_file_index = if output_format == "MAF" then output_maf_index else output_vcf_index
+
     String transcript_selection_arg = if defined(transcript_selection_list) then " --transcript-list " else ""
     String annotation_def_arg = if defined(annotation_defaults) then " --annotation-default " else ""
     String annotation_over_arg = if defined(annotation_overrides) then " --annotation-override " else ""
     String filter_funcotations_args = if defined(filter_funcotations) && (filter_funcotations) then " --remove-filtered-variants " else ""
+    String excluded_fields_args = if defined(funcotator_excluded_fields) then " --exclude-field " else ""
+
     String interval_list_arg = if defined(interval_list) then " -L " else ""
-    String extra_args_arg = select_first([extra_args, ""])
-    # ==============
-    # runtime
+    String extra_args_arg = select_first([extra_args, ""])
+    # ==============
+    # Runtime options:
     String gatk_docker
+
     File? gatk_override
     Int? mem
     Int? preemptible_attempts
+    Int? max_retries
     Int? disk_space_gb
     Int? cpu
@@ -126,6 +155,7 @@ task Funcotate {
     # This should be updated when a new version of the data sources is released
     # TODO: Make this dynamically chosen in the command.
+    # TODO: Make this pull from google cloud, rather than from the FTP:
     String default_datasources_version = "funcotator_dataSources.v1.6.20190124s"
     # You may have to change the following two parameter values depending on the task requirements
@@ -137,12 +167,13 @@ task Funcotate {
     Int machine_mem = if defined(mem) then mem *1000 else default_ram_mb
     Int command_mem = machine_mem - 1000
-    String dollar = "$"
+    String dollar = "$"
     command <<<
         set -e
         export GATK_LOCAL_JAR=${default="/root/gatk.jar" gatk_override}
+        # Handle our data sources:
         DATA_SOURCES_TAR_GZ=${data_sources_tar_gz}
         if [[ ! -e $DATA_SOURCES_TAR_GZ ]] ; then
             # We have to download the data sources:
@@ -153,50 +184,66 @@ task Funcotate {
             DATA_SOURCES_FOLDER=${default_datasources_version}
         else
             # Extract the tar.gz:
+            echo "Extracting data sources zip file..."
             mkdir datasources_dir
             tar zxvf ${data_sources_tar_gz} -C datasources_dir --strip-components 1
             DATA_SOURCES_FOLDER="$PWD/datasources_dir"
         fi
-        if ${use_gnomad} ; then
-            for potential_gnomad_gz in gnomAD_exome.tar.gz gnomAD_genome.tar.gz ; do
-                if [[ -f ${dollar}{DATA_SOURCES_FOLDER}/${dollar}{potential_gnomad_gz} ]] ; then
-                    cd ${dollar}{DATA_SOURCES_FOLDER}
-                    tar -zvxf ${dollar}{potential_gnomad_gz}
-                    cd -
-                else
-                    echo "ERROR: Cannot find gnomAD folder: ${dollar}{potential_gnomad_gz}" 1>&2
-                    false
-                fi
-            done
-        fi
+        # Handle gnomAD:
+        if ${use_gnomad} ; then
+            echo "Enabling gnomAD..."
+            for potential_gnomad_gz in gnomAD_exome.tar.gz gnomAD_genome.tar.gz ; do
+                if [[ -f ${dollar}{DATA_SOURCES_FOLDER}/${dollar}{potential_gnomad_gz} ]] ; then
+                    cd ${dollar}{DATA_SOURCES_FOLDER}
+                    tar -zvxf ${dollar}{potential_gnomad_gz}
+                    cd -
+                else
+                    echo "ERROR: Cannot find gnomAD folder: ${dollar}{potential_gnomad_gz}" 1>&2
+                    false
+                fi
+            done
+        fi
+        # Run Funcotator:
         gatk --java-options "-Xmx${command_mem}m" Funcotator \
             --data-sources-path $DATA_SOURCES_FOLDER \
             --ref-version ${reference_version} \
             --output-file-format ${output_format} \
             -R ${ref_fasta} \
             -V ${input_vcf} \
-            -O ${output_vcf} \
+            -O ${output_file} \
             ${interval_list_arg} ${default="" interval_list} \
+            --annotation-default normal_barcode:${default="Unknown" control_id} \
+            --annotation-default tumor_barcode:${default="Unknown" case_id} \
+            --annotation-default Center:${default="Unknown" sequencing_center} \
+            --annotation-default source:${default="Unknown" sequence_source} \
             ${"--transcript-selection-mode " + transcript_selection_mode} \
             ${transcript_selection_arg}${default="" sep=" --transcript-list " transcript_selection_list} \
             ${annotation_def_arg}${default="" sep=" --annotation-default " annotation_defaults} \
             ${annotation_over_arg}${default="" sep=" --annotation-override " annotation_overrides} \
+            ${excluded_fields_args}${default="" sep=" --exclude-field " funcotator_excluded_fields} \
             ${filter_funcotations_args} \
             ${extra_args_arg}
+
+        # Make sure we have a placeholder index for MAF files so this workflow doesn't fail:
+        if [[ "${output_format}" == "MAF" ]] ; then
+            touch ${output_maf_index}
+        fi
     >>>
     runtime {
         docker: gatk_docker
+        bootDiskSizeGb: 20
         memory: machine_mem + " MB"
         disks: "local-disk " + select_first([disk_space_gb, default_disk_space_gb]) + if use_ssd then " SSD" else " HDD"
         preemptible: select_first([preemptible_attempts, 3])
+        maxRetries: select_first([max_retries, 3])
         cpu: select_first([cpu, 1])
     }
     output {
-        File funcotated_vcf = "${output_vcf}"
-        File funcotated_vcf_index = "${output_vcf_index}"
+        File funcotated_output_file = "${output_file}"
+        File funcotated_output_file_index = "${output_file_index}"
     }
 }
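As a reading aid for the output-naming logic this change adds to the `Funcotate` task (`output_file` and `output_file_index`), the same selection can be sketched in plain shell. The function name `funcotator_output_names` is hypothetical and not part of the WDL; it only mirrors the expressions shown in the diff, and it assumes that any format other than `MAF` falls through to VCF naming, as the WDL conditional does.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of funcotator.wdl): mirrors the task's
# output_file / output_file_index expressions.
# Arguments: <output_file_base_name> <output_format> <compress: true|false>
funcotator_output_names() {
    local base="$1" format="$2" compress="$3"
    local out idx
    if [[ "$format" == "MAF" ]]; then
        # MAF output ignores the compress flag and gets a placeholder .idx
        out="${base}.maf"
        idx="${out}.idx"
    elif [[ "$compress" == "true" ]]; then
        out="${base}.vcf.gz"
        idx="${out}.tbi"
    else
        out="${base}.vcf"
        idx="${out}.idx"
    fi
    printf '%s %s\n' "$out" "$idx"
}

funcotator_output_names sample MAF true    # sample.maf sample.maf.idx
funcotator_output_names sample VCF true    # sample.vcf.gz sample.vcf.gz.tbi
funcotator_output_names sample VCF false   # sample.vcf sample.vcf.idx
```

Note that in the MAF case the index file is not produced by Funcotator itself; the task `touch`es it so that the declared index output exists.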
| Issue | Fixed in Funcotator | Fixed in Oncotator | Notes |
| --- | --- | --- | --- |
| Collapsing ONP counts into one number | N/A | No | |
| Variants resulting in protein changes that do not overlap the variant codon itself are not rendered properly | Yes | No | |
| Appris ranking not properly sorted | Yes | No | |
| Using protein-coding status of gene for sorting (instead of transcript) | Yes | No | |