kfdrc_consensus_calling.cwl
cwlVersion: v1.0
class: Workflow
id: kfdrc_consensus_calling
label: Kids First DRC Simple Variant Consensus Calling Workflow
doc: |
  # Kids First DRC Consensus Calling Workflow

  This workflow is used by the Kids First (KF) Data Resource Center (DRC) to create consensus calls from outputs generated by our somatic variant callers.

  ![data service logo](https://github.com/d3b-center/d3b-research-workflows/raw/master/doc/kfdrc-logo-sm.png)

  This workflow takes the protected VCF outputs from the [Kids First DRC Somatic Workflow](workflow/kfdrc-somatic-variant-workflow.cwl) and creates protected and public consensus VCF and MAF files.

  The general outline is as follows:
  1. Prep MNP Variants
     - Strelka2 outputs multi-nucleotide polymorphisms (MNPs) as consecutive single-nucleotide polymorphisms (SNPs)
     - In order to preserve MNPs, we gather MNP calls from the other caller inputs and search for evidence supporting these consecutive SNP calls as MNP candidates
     - Once found, the Strelka2 SNP calls supporting an MNP are converted to a single MNP call
     - This is done to preserve the predicted gene model as accurately as possible in our consensus calls
  1. Consensus merge
     - Calls are gathered from all four callers
     - By default, calls with support from 2+ callers OR calls that are marked as `HotSpotAllele` in the `INFO` field are retained
     - Retained calls then have their `MQ` and `MQ0` values calculated from the input tumor cram
     - `GT` fields are estimated as "majority rules," and when no majority exists, set as `0/1` by default
     - `AD`, `DP`, and `AF` are calculated as the average value across callers
     - `ADR`, `DPR`, and `AFR` fields are added as the range of those values across callers, to give the observer a sense of confidence in the value
  1. VEP Annotate Consensus (see [Kids First DRC Somatic Variant Annotation Workflow](https://github.com/kids-first/kf-somatic-workflow/blob/master/docs/kfdrc_annotation_wf.md) for details)
  1. Echtvar Annotation
     - Additional annotation is performed to augment the VEP annotation
     - While VEP does have extensive gnomAD allele frequency annotation, it is limited to exome values. The gnomAD AF-only resource we use adds WGS frequencies as an additional `INFO/AF` field
  1. Soft filter variants
     - A soft filter is added based on criteria provided
     - By default, we perform soft filtering as outlined in the [KFDRC Annotation Subworkflow](kfdrc_annotation_subworkflow.md#workflow_description_and_kf_recommended_inputs)
  1. VCF2MAF protected
     - For convenience of analysis, we convert the resultant soft-filtered VCF (AKA "Protected VCF") into MAF format
  1. Hard filter VCF
     - The Protected VCF is hard filtered to keep calls that are `PASS` or marked `HotSpotAllele`, for reasons outlined in the `Soft filter variants` step
     - This VCF is known as the "Public VCF"
  1. VCF2MAF public
  1. Rename outputs

  ## Workflow Description and KF Recommended Inputs

  ### General workflow inputs, all file references can be obtained [here](https://cavatica.sbgenomics.com/u/kfdrc-harmonization/kf-references/):

  - indexed_reference_fasta: Homo_sapiens_assembly38.fasta
  - strelka2_vcf
  - mutect2_vcf
  - lancet_vcf
  - vardict_vcf
  - cram # Tumor cram recommended for MQ score calculation
  - input_tumor_name
  - input_normal_name
  - output_basename
  - tool_name: "consensus_somatic"
  - ncallers: # Optional number of callers required for consensus, recommend `2`
  - consensus_ram: `3`
  - echtvar_anno_zips: gnomad.v3.1.1.custom.echtvar.zip # echtvar archive of gnomAD population stats, used for filtering
  - vep_cache: homo_sapiens_merged_vep_105_indexed_GRCh38.tar.gz
  - gatk_filter_name: `[NORM_DP_LOW, GNOMAD_AF_HIGH]`
  - gatk_filter_expression: `[ vc.getGenotype('`_insert_norm_sample_id_here_`').getDP() <= 7, gnomad_3_1_1_AF != '.' && gnomad_3_1_1_AF > 0.001 && gnomad_3_1_1_FILTER=='PASS']`
  - bcftools_public_filter: `FILTER="PASS"|INFO/HotSpotAllele=1`
  - retain_info: "gnomad_3_1_1_AC,gnomad_3_1_1_AN,gnomad_3_1_1_AF,gnomad_3_1_1_nhomalt,gnomad_3_1_1_AC_popmax,gnomad_3_1_1_AN_popmax,gnomad_3_1_1_AF_popmax,gnomad_3_1_1_nhomalt_popmax,gnomad_3_1_1_AC_controls_and_biobanks,gnomad_3_1_1_AN_controls_and_biobanks,gnomad_3_1_1_AF_controls_and_biobanks,gnomad_3_1_1_AF_non_cancer,gnomad_3_1_1_primate_ai_score,gnomad_3_1_1_splice_ai_consequence,gnomad_3_1_1_AF_non_cancer_afr,gnomad_3_1_1_AF_non_cancer_ami,gnomad_3_1_1_AF_non_cancer_asj,gnomad_3_1_1_AF_non_cancer_eas,gnomad_3_1_1_AF_non_cancer_fin,gnomad_3_1_1_AF_non_cancer_mid,gnomad_3_1_1_AF_non_cancer_nfe,gnomad_3_1_1_AF_non_cancer_oth,gnomad_3_1_1_AF_non_cancer_raw,gnomad_3_1_1_AF_non_cancer_sas,gnomad_3_1_1_AF_non_cancer_amr,gnomad_3_1_1_AF_non_cancer_popmax,gnomad_3_1_1_AF_non_cancer_all_popmax,gnomad_3_1_1_FILTER,MQ,MQ0,CAL,HotSpotAllele"
  - retain_fmt: # csv string with FORMAT fields that you want to keep
  - retain_ann: "HGVSg"
  - maf_center: "."
  - `custom_enst`: `kf_isoform_override.tsv`. As of VEP 104, several genes have had their canonical transcripts redefined. While the VCF will have all possible isoforms, this affects MAF file output and may result in representative protein changes that defy historical expectations
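
  The inputs above could be collected into a job file along the following lines. This is only a minimal sketch: every path and sample ID below (`TUMOR_ID`, `NORMAL_ID`, `sample.*.vcf.gz`, `sample_consensus`) is a placeholder rather than a real resource, and it assumes the index files (`.fai`, `.dict`, `.tbi`, `.crai`) sit next to their parent files so the runner can resolve them as secondary files. Inputs with defaults (`ncallers`, `consensus_ram`, `retain_info`, and so on) are omitted.

  ```yaml
  # Hypothetical job file; all paths and sample names are placeholders.
  indexed_reference_fasta: {class: File, path: Homo_sapiens_assembly38.fasta}
  strelka2_vcf: {class: File, path: sample.strelka2.vcf.gz}
  mutect2_vcf: {class: File, path: sample.mutect2.vcf.gz}
  lancet_vcf: {class: File, path: sample.lancet.vcf.gz}
  vardict_vcf: {class: File, path: sample.vardict.vcf.gz}
  cram: {class: File, path: sample.tumor.cram}   # tumor cram, used for MQ/MQ0
  input_tumor_name: TUMOR_ID
  input_normal_name: NORMAL_ID
  output_basename: sample_consensus
  vep_cache: {class: File, path: homo_sapiens_merged_vep_105_indexed_GRCh38.tar.gz}
  echtvar_anno_zips:
    - {class: File, path: gnomad.v3.1.1.custom.echtvar.zip}
  gatk_filter_name: [NORM_DP_LOW, GNOMAD_AF_HIGH]
  gatk_filter_expression:
    # The normal sample ID must match input_normal_name.
    - "vc.getGenotype('NORMAL_ID').getDP() <= 7"
    - "gnomad_3_1_1_AF != '.' && gnomad_3_1_1_AF > 0.001 && gnomad_3_1_1_FILTER=='PASS'"
  custom_enst: {class: File, path: kf_isoform_override.tsv}
  ```

  `gatk_filter_name` and `gatk_filter_expression` are parallel arrays, so the first expression is the one tagged `NORM_DP_LOW` and the second is the one tagged `GNOMAD_AF_HIGH`.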

  ## Workflow outputs

  - `annotated_protected_outputs`: Array of files containing MAF format of PASS hits, `PASS` VCF with annotation pipeline soft `FILTER`-added values, and VCF index
  - `annotated_public_outputs`: Same as above, except MAF and VCF have had entries with soft `FILTER` values removed
requirements:
- class: ScatterFeatureRequirement
- class: SubworkflowFeatureRequirement
- class: MultipleInputFeatureRequirement
- class: StepInputExpressionRequirement
- class: InlineJavascriptRequirement
inputs:
  indexed_reference_fasta: {type: 'File', secondaryFiles: ['.fai', '^.dict'], "sbg:suggestedValue": {class: File, path: 60639014357c3a53540ca7a3,
      name: Homo_sapiens_assembly38.fasta, secondaryFiles: [{class: File, path: 60639016357c3a53540ca7af, name: Homo_sapiens_assembly38.fasta.fai},
        {class: File, path: 60639019357c3a53540ca7e7, name: Homo_sapiens_assembly38.dict}]}}
  strelka2_vcf: {type: 'File', secondaryFiles: ['.tbi']}
  mutect2_vcf: {type: 'File', secondaryFiles: ['.tbi']}
  lancet_vcf: {type: 'File', secondaryFiles: ['.tbi']}
  vardict_vcf: {type: 'File', secondaryFiles: ['.tbi']}
  cram: {type: 'File', secondaryFiles: ['.crai'], doc: "Tumor cram recommended for MQ score calculation"}
  input_tumor_name: string
  input_normal_name: string
  output_basename: string
  tool_name: {type: 'string?', default: "consensus_somatic", doc: "A helpful file name building component"}
  ncallers: {type: 'int?', doc: "Optional number of callers required for consensus [2]", default: 2}
  hotspot_source: {type: 'string?', doc: "Optional description of hotspot definition source"}
  contig_bed: {type: 'File?', doc: "Optional BED file containing names of target contigs / chromosomes"}
  consensus_ram: {type: 'int?', doc: "Set min memory in GB for consensus merge step", default: 3}
  vep_cache: {type: 'File', doc: "tar gzipped cache from ensembl/local converted cache", "sbg:suggestedValue": {class: File, path: 6332f8e47535110eb79c794f,
      name: homo_sapiens_merged_vep_105_indexed_GRCh38.tar.gz}}
  dbnsfp: {type: 'File?', secondaryFiles: [.tbi, ^.readme.txt], doc: "VEP-formatted plugin file, index, and readme file containing
      dbNSFP annotations"}
  dbnsfp_fields: {type: 'string?', doc: "csv string with desired fields to annotate if dbnsfp provided. Use ALL to grab all"}
  merged: {type: 'boolean?', doc: "Set to true if merged cache used", default: true}
  cadd_indels: {type: 'File?', secondaryFiles: [.tbi], doc: "VEP-formatted plugin file and index containing CADD indel annotations"}
  cadd_snvs: {type: 'File?', secondaryFiles: [.tbi], doc: "VEP-formatted plugin file and index containing CADD SNV annotations"}
  run_cache_existing: {type: 'boolean?', doc: "Run the check_existing flag for cache"}
  run_cache_af: {type: 'boolean?', doc: "Run the allele frequency flags for cache"}
  # annotation vars
  genomic_hotspots: {type: 'File[]?', doc: "Tab-delimited BED formatted file(s) containing hg38 genomic positions corresponding to
      hotspots", "sbg:suggestedValue": [{class: File, path: 607713829360f10e3982a423, name: tert.bed}]}
  protein_snv_hotspots: {type: 'File[]?', doc: "Column-name-containing, tab-delimited file(s) containing protein names and amino acid
      positions corresponding to hotspots", "sbg:suggestedValue": [{class: File, path: 66980e845a58091951d53984, name: kfdrc_protein_snv_cancer_hotspots_20240718.txt}]}
  protein_indel_hotspots: {type: 'File[]?', doc: "Column-name-containing, tab-delimited file(s) containing protein names and amino
      acid position ranges corresponding to hotspots", "sbg:suggestedValue": [{class: File, path: 663d2bcc27374715fccd8c6f, name: protein_indel_cancer_hotspots_v2.ENS105_liftover.tsv}]}
  retain_info: {type: 'string?', doc: "csv string with INFO fields that you want to keep", default: "gnomad_3_1_1_AC,gnomad_3_1_1_AN,gnomad_3_1_1_AF,gnomad_3_1_1_nhomalt,gnomad_3_1_1_AC_popmax,gnomad_3_1_1_AN_popmax,gnomad_3_1_1_AF_popmax,gnomad_3_1_1_nhomalt_popmax,gnomad_3_1_1_AC_controls_and_biobanks,gnomad_3_1_1_AN_controls_and_biobanks,gnomad_3_1_1_AF_controls_and_biobanks,gnomad_3_1_1_AF_non_cancer,gnomad_3_1_1_primate_ai_score,gnomad_3_1_1_splice_ai_consequence,gnomad_3_1_1_AF_non_cancer_afr,gnomad_3_1_1_AF_non_cancer_ami,gnomad_3_1_1_AF_non_cancer_asj,gnomad_3_1_1_AF_non_cancer_eas,gnomad_3_1_1_AF_non_cancer_fin,gnomad_3_1_1_AF_non_cancer_mid,gnomad_3_1_1_AF_non_cancer_nfe,gnomad_3_1_1_AF_non_cancer_oth,gnomad_3_1_1_AF_non_cancer_raw,gnomad_3_1_1_AF_non_cancer_sas,gnomad_3_1_1_AF_non_cancer_amr,gnomad_3_1_1_AF_non_cancer_popmax,gnomad_3_1_1_AF_non_cancer_all_popmax,gnomad_3_1_1_FILTER,MQ,MQ0,CAL,HotSpotAllele"}
  retain_fmt: {type: 'string?', doc: "csv string with FORMAT fields that you want to keep"}
  retain_ann: {type: 'string?', doc: "csv string of annotations (within the VEP CSQ/ANN) to retain as extra columns in MAF", default: "HGVSg"}
  add_common_fields: {type: 'boolean?', doc: "Set to true if input is a strelka2 vcf that hasn't had common fields added", default: false}
  bcftools_strip_columns: {type: 'string?', doc: "csv string of columns to strip if needed to avoid conflict, i.e INFO/AF"}
  echtvar_anno_zips: {type: 'File[]?', doc: "Annotation ZIP files for echtvar anno", "sbg:suggestedValue": [{class: File, path: 65c64d847dab7758206248c6,
        name: gnomad.v3.1.1.custom.echtvar.zip}]}
  bcftools_public_filter: {type: 'string?', doc: "Will hard filter final result to create a public version", default: FILTER="PASS"|INFO/HotSpotAllele=1}
  gatk_filter_name: {type: 'string[]', doc: "Array of names for each filter tag to add, recommend: [\"NORM_DP_LOW\", \"GNOMAD_AF_HIGH\"\
      ]"}
  gatk_filter_expression: {type: 'string[]', doc: "Array of filter expressions to establish criteria to tag variants with. See https://gatk.broadinstitute.org/hc/en-us/articles/360036730071-VariantFiltration,
      recommend: [\"vc.getGenotype('\" + inputs.input_normal_name + \"').getDP() <= 7\", \"gnomad_3_1_1_AF != '.' && gnomad_3_1_1_AF
      > 0.001 && gnomad_3_1_1_FILTER=='PASS'\"]"}
  disable_hotspot_annotation: {type: 'boolean?', doc: "Disable Hotspot Annotation and skip this task.", default: true}
  maf_center: {type: 'string?', doc: "Sequencing center of variant called", default: "."}
  custom_enst: {type: 'File?', doc: "Use a file with ens tx IDs for each gene to override VEP PICK", "sbg:suggestedValue": {class: File,
      path: 663d2bcc27374715fccd8c65, name: kf_isoform_override.tsv}}
outputs:
  annotated_protected_outputs: {type: 'File[]', outputSource: annotate/annotated_protected}
  annotated_public_outputs: {type: 'File[]', outputSource: annotate/annotated_public}
steps:
  prep_mnp_variants:
    run: ../tools/prep_mnp_variants.cwl
    in:
      strelka2_vcf: strelka2_vcf
      other_vcfs: [mutect2_vcf, lancet_vcf, vardict_vcf]
      output_basename: output_basename
    out: [output_vcfs]
  consensus_merge:
    run: ../tools/consensus_merge.cwl
    in:
      strelka2_vcf:
        source: prep_mnp_variants/output_vcfs
        valueFrom: '$(self[0])'
      mutect2_vcf: mutect2_vcf
      lancet_vcf: lancet_vcf
      vardict_vcf: vardict_vcf
      cram: cram
      ncallers: ncallers
      ram: consensus_ram
      reference: indexed_reference_fasta
      output_basename: output_basename
      hotspot_source: hotspot_source
      contig_bed: contig_bed
    out: [output]
  annotate:
    run: ../kf-annotation-tools/workflows/kfdrc-somatic-snv-annot-workflow.cwl
    in:
      indexed_reference_fasta: indexed_reference_fasta
      input_vcf: consensus_merge/output
      input_tumor_name: input_tumor_name
      input_normal_name: input_normal_name
      add_common_fields: add_common_fields
      retain_info: retain_info
      retain_fmt: retain_fmt
      retain_ann: retain_ann
      echtvar_anno_zips: echtvar_anno_zips
      bcftools_strip_columns: bcftools_strip_columns
      bcftools_public_filter: bcftools_public_filter
      dbnsfp: dbnsfp
      dbnsfp_fields: dbnsfp_fields
      merged: merged
      cadd_indels: cadd_indels
      cadd_snvs: cadd_snvs
      run_cache_af: run_cache_af
      run_cache_existing: run_cache_existing
      gatk_filter_name: gatk_filter_name
      gatk_filter_expression: gatk_filter_expression
      vep_cache: vep_cache
      disable_hotspot_annotation: disable_hotspot_annotation
      genomic_hotspots: genomic_hotspots
      protein_snv_hotspots: protein_snv_hotspots
      protein_indel_hotspots: protein_indel_hotspots
      maf_center: maf_center
      custom_enst: custom_enst
      output_basename: output_basename
      tool_name: tool_name
    out: [annotated_protected, annotated_public]
$namespaces:
  sbg: https://sevenbridges.com
"sbg:license": Apache License 2.0
"sbg:publisher": KFDRC
"sbg:links":
- id: 'https://github.com/kids-first/kf-somatic-workflow/releases/tag/v5.2.1'
  label: github-release