workflow/kfdrc_consensus_calling.cwl

cwlVersion: v1.0
class: Workflow
id: kfdrc_consensus_calling
label: Kids First DRC Simple Variant Consensus Calling Workflow
doc: |
  # Kids First DRC Consensus Calling Workflow
  This workflow is used by the Kids First (KF) Data Resource Center (DRC) to create consensus calls from outputs generated by our somatic variant callers.

  ![data service logo](https://github.com/d3b-center/d3b-research-workflows/raw/master/doc/kfdrc-logo-sm.png)

  This workflow takes the protected VCF outputs from the [Kids First DRC Somatic Workflow](workflow/kfdrc-somatic-variant-workflow.cwl) and creates protected and public consensus VCF and MAF files.
  The general outline is as follows:

  1. Prep MNP Variants
     - Strelka2 outputs multi-nucleotide polymorphisms (MNPs) as consecutive single-nucleotide polymorphisms
     - In order to preserve MNPs, we gather MNP calls from the other caller inputs and search for evidence supporting these consecutive SNP calls as MNP candidates
     - Once found, the Strelka2 SNP calls supporting an MNP are converted to a single MNP call
     - This is done to preserve the predicted gene model as accurately as possible in our consensus calls
  1. Consensus merge
     - Calls are gathered from all four callers
     - By default, calls with support from 2+ callers OR calls that are marked as `HotSpotAllele` in the `INFO` field are retained
     - Retained calls then have their `MQ` and `MQ0` values calculated from the input tumor cram
     - `GT` fields are estimated as "majority rules," and when no majority exists, set as `0/1` by default
     - `AD`, `DP`, and `AF` are calculated as the average value across callers
     - `ADR`, `DPR`, and `AFR` fields are added as the range of values from the previous point, to give the observer a sense of confidence in the value
  1. VEP Annotate Consensus (see [Kids First DRC Somatic Variant Annotation Workflow](https://github.com/kids-first/kf-somatic-workflow/blob/master/docs/kfdrc_annotation_wf.md) for details)
  1. Echtvar Annotation
     - Additional annotation is performed to augment the VEP annotation
     - While VEP does have extensive gnomAD allele frequency annotation, it is limited to exome values. The gnomAD AF-only resource we use supplements this by adding WGS frequencies as an additional `INFO/AF` field
  1. Soft filter variants
     - A soft filter is added based on criteria provided
     - By default, we perform soft filtering as outlined in the [KFDRC Annotation Subworkflow](kfdrc_annotation_subworkflow.md#workflow_description_and_kf_recommended_inputs)
  1. VCF2MAF protected
     - Here, for convenience of analysis we convert the resultant, soft-filtered VCF (AKA, "Protected VCF") into MAF format
  1. Hard filter VCF
     - The Protected VCF is hard filtered on `PASS` and `HotSpotAllele` for reasons outlined in the `Soft filter variants` step
     - This VCF is known as the "Public VCF"
  1. VCF2MAF public
  1. Rename outputs

  ## Workflow Description and KF Recommended Inputs

  ### General workflow inputs, all file references can be obtained [here](https://cavatica.sbgenomics.com/u/kfdrc-harmonization/kf-references/):
  - indexed_reference_fasta: Homo_sapiens_assembly38.fasta
  - strelka2_vcf
  - mutect2_vcf
  - lancet_vcf
  - vardict_vcf
  - cram #Tumor cram recommended for MQ score calculation
  - input_tumor_name
  - input_normal_name
  - output_basename
  - tool_name: "consensus_somatic"
  - ncallers: # Optional number of callers required for consensus, recommend `2`
  - consensus_ram: `3`
  - echtvar_anno_zips: gnomad.v3.1.1.custom.echtvar.zip # echtvar-formatted population stats for public filtering
  - vep_cache: homo_sapiens_merged_vep_105_indexed_GRCh38.tar.gz
  - gatk_filter_name: `[NORM_DP_LOW, GNOMAD_AF_HIGH]`
  - gatk_filter_expression: `[ vc.getGenotype('`_insert_norm_sample_id_here_`').getDP() <= 7, gnomad_3_1_1_AF != '.' && gnomad_3_1_1_AF > 0.001 && gnomad_3_1_1_FILTER=='PASS']`
  - bcftools_public_filter: `FILTER="PASS"|INFO/HotSpotAllele=1`
  - retain_info: "gnomad_3_1_1_AC,gnomad_3_1_1_AN,gnomad_3_1_1_AF,gnomad_3_1_1_nhomalt,gnomad_3_1_1_AC_popmax,gnomad_3_1_1_AN_popmax,gnomad_3_1_1_AF_popmax,gnomad_3_1_1_nhomalt_popmax,gnomad_3_1_1_AC_controls_and_biobanks,gnomad_3_1_1_AN_controls_and_biobanks,gnomad_3_1_1_AF_controls_and_biobanks,gnomad_3_1_1_AF_non_cancer,gnomad_3_1_1_primate_ai_score,gnomad_3_1_1_splice_ai_consequence,gnomad_3_1_1_AF_non_cancer_afr,gnomad_3_1_1_AF_non_cancer_ami,gnomad_3_1_1_AF_non_cancer_asj,gnomad_3_1_1_AF_non_cancer_eas,gnomad_3_1_1_AF_non_cancer_fin,gnomad_3_1_1_AF_non_cancer_mid,gnomad_3_1_1_AF_non_cancer_nfe,gnomad_3_1_1_AF_non_cancer_oth,gnomad_3_1_1_AF_non_cancer_raw,gnomad_3_1_1_AF_non_cancer_sas,gnomad_3_1_1_AF_non_cancer_amr,gnomad_3_1_1_AF_non_cancer_popmax,gnomad_3_1_1_AF_non_cancer_all_popmax,gnomad_3_1_1_FILTER,MQ,MQ0,CAL,HotSpotAllele"
  - retain_fmt: # csv string with FORMAT fields that you want to keep
  - retain_ann: "HGVSg"
  - maf_center: "."
  - `custom_enst`: `kf_isoform_override.tsv`. As of VEP 104, several genes have had their canonical transcripts redefined. While the VCF will have all possible isoforms, this affects MAF file output and may result in representative protein changes that defy historical expectations
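
  As a rough illustration of how the recommended inputs above fit together, a job file might look like the sketch below. This is not an official KFDRC job file: the file name `consensus_job.yaml`, every `/path/to/...` location, and the sample names are placeholders, and index files (`.fai`, `.dict`, `.tbi`, `.crai`) are assumed to sit next to their primary files. Depending on the runner, they may need to be listed explicitly under `secondaryFiles`.

  ```yaml
  # consensus_job.yaml (hypothetical name); all paths and sample IDs are placeholders
  indexed_reference_fasta: {class: File, path: /path/to/Homo_sapiens_assembly38.fasta}
  strelka2_vcf: {class: File, path: /path/to/sample.strelka2.vcf.gz}
  mutect2_vcf: {class: File, path: /path/to/sample.mutect2.vcf.gz}
  lancet_vcf: {class: File, path: /path/to/sample.lancet.vcf.gz}
  vardict_vcf: {class: File, path: /path/to/sample.vardict.vcf.gz}
  cram: {class: File, path: /path/to/tumor.cram}  # tumor cram, used for MQ calculation
  input_tumor_name: TUMOR_SAMPLE_ID
  input_normal_name: NORMAL_SAMPLE_ID
  output_basename: example_sample
  ncallers: 2
  consensus_ram: 3
  vep_cache: {class: File, path: /path/to/homo_sapiens_merged_vep_105_indexed_GRCh38.tar.gz}
  echtvar_anno_zips:
    - {class: File, path: /path/to/gnomad.v3.1.1.custom.echtvar.zip}
  custom_enst: {class: File, path: /path/to/kf_isoform_override.tsv}
  # The two arrays are paired by position: one filter name per expression
  gatk_filter_name: [NORM_DP_LOW, GNOMAD_AF_HIGH]
  gatk_filter_expression:
    - "vc.getGenotype('NORMAL_SAMPLE_ID').getDP() <= 7"
    - "gnomad_3_1_1_AF != '.' && gnomad_3_1_1_AF > 0.001 && gnomad_3_1_1_FILTER=='PASS'"
  ```

  Assuming a CWL runner such as `cwltool` is installed, this could be run as `cwltool workflow/kfdrc_consensus_calling.cwl consensus_job.yaml`. The sample ID passed to `vc.getGenotype()` should match `input_normal_name`.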


  ## Workflow outputs
  - `annotated_protected_outputs`: Array of files containing MAF format of PASS hits, `PASS` VCF with annotation pipeline soft `FILTER`-added values, and VCF index
  - `annotated_public_outputs`: Same as above, except MAF and VCF have had entries with soft `FILTER` values removed
requirements:
- class: ScatterFeatureRequirement
- class: SubworkflowFeatureRequirement
- class: MultipleInputFeatureRequirement
- class: StepInputExpressionRequirement
- class: InlineJavascriptRequirement

inputs:
  indexed_reference_fasta: {type: 'File', secondaryFiles: ['.fai', '^.dict'], "sbg:suggestedValue": {class: File, path: 60639014357c3a53540ca7a3,
      name: Homo_sapiens_assembly38.fasta, secondaryFiles: [{class: File, path: 60639016357c3a53540ca7af, name: Homo_sapiens_assembly38.fasta.fai},
        {class: File, path: 60639019357c3a53540ca7e7, name: Homo_sapiens_assembly38.dict}]}}
  strelka2_vcf: {type: 'File', secondaryFiles: ['.tbi']}
  mutect2_vcf: {type: 'File', secondaryFiles: ['.tbi']}
  lancet_vcf: {type: 'File', secondaryFiles: ['.tbi']}
  vardict_vcf: {type: 'File', secondaryFiles: ['.tbi']}
  cram: {type: 'File', secondaryFiles: ['.crai'], doc: "Tumor cram recommended for MQ score calculation"}
  input_tumor_name: string
  input_normal_name: string
  output_basename: string
  tool_name: {type: 'string?', default: "consensus_somatic", doc: "A helpful file name building component"}
  ncallers: {type: 'int?', doc: "Optional number of callers required for consensus [2]", default: 2}
  hotspot_source: {type: 'string?', doc: "Optional description of hotspot definition source"}
  contig_bed: {type: 'File?', doc: "Optional BED file containing names of target contigs / chromosomes"}
  consensus_ram: {type: 'int?', doc: "Set min memory in GB for consensus merge step", default: 3}
  vep_cache: {type: 'File', doc: "tar gzipped cache from ensembl/local converted cache", "sbg:suggestedValue": {class: File, path: 6332f8e47535110eb79c794f,
      name: homo_sapiens_merged_vep_105_indexed_GRCh38.tar.gz}}
  dbnsfp: {type: 'File?', secondaryFiles: [.tbi, ^.readme.txt], doc: "VEP-formatted plugin file, index, and readme file containing
      dbNSFP annotations"}
  dbnsfp_fields: {type: 'string?', doc: "csv string with desired fields to annotate if dbnsfp provided. Use ALL to grab all"}
  merged: {type: 'boolean?', doc: "Set to true if merged cache used", default: true}
  cadd_indels: {type: 'File?', secondaryFiles: [.tbi], doc: "VEP-formatted plugin file and index containing CADD indel annotations"}
  cadd_snvs: {type: 'File?', secondaryFiles: [.tbi], doc: "VEP-formatted plugin file and index containing CADD SNV annotations"}
  run_cache_existing: {type: 'boolean?', doc: "Run the check_existing flag for cache"}
  run_cache_af: {type: 'boolean?', doc: "Run the allele frequency flags for cache"}

  # annotation vars
  genomic_hotspots: {type: 'File[]?', doc: "Tab-delimited BED formatted file(s) containing hg38 genomic positions corresponding to
      hotspots", "sbg:suggestedValue": [{class: File, path: 607713829360f10e3982a423, name: tert.bed}]}
  protein_snv_hotspots: {type: 'File[]?', doc: "Column-name-containing, tab-delimited file(s) containing protein names and amino acid
      positions corresponding to hotspots", "sbg:suggestedValue": [{class: File, path: 66980e845a58091951d53984, name: kfdrc_protein_snv_cancer_hotspots_20240718.txt}]}
  protein_indel_hotspots: {type: 'File[]?', doc: "Column-name-containing, tab-delimited file(s) containing protein names and amino
      acid position ranges corresponding to hotspots", "sbg:suggestedValue": [{class: File, path: 663d2bcc27374715fccd8c6f, name: protein_indel_cancer_hotspots_v2.ENS105_liftover.tsv}]}
  retain_info: {type: 'string?', doc: "csv string with INFO fields that you want to keep", default: "gnomad_3_1_1_AC,gnomad_3_1_1_AN,gnomad_3_1_1_AF,gnomad_3_1_1_nhomalt,gnomad_3_1_1_AC_popmax,gnomad_3_1_1_AN_popmax,gnomad_3_1_1_AF_popmax,gnomad_3_1_1_nhomalt_popmax,gnomad_3_1_1_AC_controls_and_biobanks,gnomad_3_1_1_AN_controls_and_biobanks,gnomad_3_1_1_AF_controls_and_biobanks,gnomad_3_1_1_AF_non_cancer,gnomad_3_1_1_primate_ai_score,gnomad_3_1_1_splice_ai_consequence,gnomad_3_1_1_AF_non_cancer_afr,gnomad_3_1_1_AF_non_cancer_ami,gnomad_3_1_1_AF_non_cancer_asj,gnomad_3_1_1_AF_non_cancer_eas,gnomad_3_1_1_AF_non_cancer_fin,gnomad_3_1_1_AF_non_cancer_mid,gnomad_3_1_1_AF_non_cancer_nfe,gnomad_3_1_1_AF_non_cancer_oth,gnomad_3_1_1_AF_non_cancer_raw,gnomad_3_1_1_AF_non_cancer_sas,gnomad_3_1_1_AF_non_cancer_amr,gnomad_3_1_1_AF_non_cancer_popmax,gnomad_3_1_1_AF_non_cancer_all_popmax,gnomad_3_1_1_FILTER,MQ,MQ0,CAL,HotSpotAllele"}
  retain_fmt: {type: 'string?', doc: "csv string with FORMAT fields that you want to keep"}
  retain_ann: {type: 'string?', doc: "csv string of annotations (within the VEP CSQ/ANN) to retain as extra columns in MAF", default: "HGVSg"}
  add_common_fields: {type: 'boolean?', doc: "Set to true if input is a strelka2 vcf that hasn't had common fields added", default: false}
  bcftools_strip_columns: {type: 'string?', doc: "csv string of columns to strip if needed to avoid conflict, i.e INFO/AF"}
  echtvar_anno_zips: {type: 'File[]?', doc: "Annotation ZIP files for echtvar anno", "sbg:suggestedValue": [{class: File, path: 65c64d847dab7758206248c6,
        name: gnomad.v3.1.1.custom.echtvar.zip}]}
  bcftools_public_filter: {type: 'string?', doc: "Will hard filter final result to create a public version", default: FILTER="PASS"|INFO/HotSpotAllele=1}
  gatk_filter_name: {type: 'string[]', doc: "Array of names for each filter tag to add, recommend: [\"NORM_DP_LOW\", \"GNOMAD_AF_HIGH\"\
      ]"}
  gatk_filter_expression: {type: 'string[]', doc: "Array of filter expressions to establish criteria to tag variants with. See https://gatk.broadinstitute.org/hc/en-us/articles/360036730071-VariantFiltration,
      recommend: [\"vc.getGenotype('\" + inputs.input_normal_name + \"').getDP() <= 7\", \"gnomad_3_1_1_AF != '.' && gnomad_3_1_1_AF
      > 0.001 && gnomad_3_1_1_FILTER=='PASS'\"]"}
  disable_hotspot_annotation: {type: 'boolean?', doc: "Disable Hotspot Annotation and skip this task.", default: false}
  maf_center: {type: 'string?', doc: "Sequencing center of variant called", default: "."}
  custom_enst: {type: 'File?', doc: "Use a file with ens tx IDs for each gene to override VEP PICK", "sbg:suggestedValue": {class: File,
      path: 663d2bcc27374715fccd8c65, name: kf_isoform_override.tsv}}

outputs:
  annotated_protected_outputs: {type: 'File[]', outputSource: annotate/annotated_protected}
  annotated_public_outputs: {type: 'File[]', outputSource: annotate/annotated_public}

steps:
  prep_mnp_variants:
    run: ../tools/prep_mnp_variants.cwl
    in:
      strelka2_vcf: strelka2_vcf
      other_vcfs: [mutect2_vcf, lancet_vcf, vardict_vcf]
      output_basename: output_basename
    out: [output_vcfs]

  consensus_merge:
    run: ../tools/consensus_merge.cwl
    in:
      strelka2_vcf:
        source: prep_mnp_variants/output_vcfs
        valueFrom: '$(self[0])'
      mutect2_vcf: mutect2_vcf
      lancet_vcf: lancet_vcf
      vardict_vcf: vardict_vcf
      cram: cram
      ncallers: ncallers
      ram: consensus_ram
      reference: indexed_reference_fasta
      output_basename: output_basename
      hotspot_source: hotspot_source
      contig_bed: contig_bed
    out: [output]

  annotate:
    run: ../kf-annotation-tools/workflows/kfdrc-somatic-snv-annot-workflow.cwl
    in:
      indexed_reference_fasta: indexed_reference_fasta
      input_vcf: consensus_merge/output
      input_tumor_name: input_tumor_name
      input_normal_name: input_normal_name
      add_common_fields: add_common_fields
      retain_info: retain_info
      retain_fmt: retain_fmt
      retain_ann: retain_ann
      echtvar_anno_zips: echtvar_anno_zips
      bcftools_strip_columns: bcftools_strip_columns
      bcftools_public_filter: bcftools_public_filter
      dbnsfp: dbnsfp
      dbnsfp_fields: dbnsfp_fields
      merged: merged
      cadd_indels: cadd_indels
      cadd_snvs: cadd_snvs
      run_cache_af: run_cache_af
      run_cache_existing: run_cache_existing
      gatk_filter_name: gatk_filter_name
      gatk_filter_expression: gatk_filter_expression
      vep_cache: vep_cache
      disable_hotspot_annotation: disable_hotspot_annotation
      genomic_hotspots: genomic_hotspots
      protein_snv_hotspots: protein_snv_hotspots
      protein_indel_hotspots: protein_indel_hotspots
      maf_center: maf_center
      custom_enst: custom_enst
      output_basename: output_basename
      tool_name: tool_name
    out: [annotated_protected, annotated_public]

$namespaces:
  sbg: https://sevenbridges.com
"sbg:license": Apache License 2.0
"sbg:publisher": KFDRC

"sbg:links":
- id: 'https://github.com/kids-first/kf-somatic-workflow/releases/tag/v5.2.1'
  label: github-release