This document summarizes the key components in DepMap omics' processing pipeline.
Note that input references, indices, and parameters used in the WDL workflows for the following pipelines in any given quarter can be found in data/[release quarter]/[workflow name].json.
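As a minimal sketch of how one of these configuration files could be located and read, assuming they are plain JSON objects: the helper name `load_workflow_config` and its `data_dir` default are illustrative and are not part of the repository.

```python
import json
from pathlib import Path


def load_workflow_config(release_quarter, workflow_name, data_dir="data"):
    """Read data/[release quarter]/[workflow name].json and return it as a dict.

    The function name and the data_dir default are assumptions made for
    illustration; only the path layout comes from the text above.
    """
    config_path = Path(data_dir) / release_quarter / f"{workflow_name}.json"
    with config_path.open() as handle:
        return json.load(handle)
```

Called with a release quarter and a workflow name, the helper returns the parsed inputs, so a single parameter (a reference path, an interval list) can be looked up by key.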
We are currently running the following workflows to generate datasets from WGS data:
WGS_pipeline generates relative and absolute copy number segments, a mutation MAF file, structural variant (SV) calls, and various genomic features, including loss of heterozygosity (LoH), LoH fraction, ploidy estimate, Whole Genome Doubling (WGD), Chromosomal Instability (CIN), and Microsatellite Instability (MSI) score. This workflow runs the following subtasks:
- gatk cnv:
  - outputs relative segment and copy number from WES/WGS data
  - https://software.broadinstitute.org/gatk/documentation/article?id=11682
- mutect2 (from broadinstitute/gatk:4.2.6.1):
  - outputs mutation calls from RNAseq/WES/WGS data
  - https://gatk.broadinstitute.org/hc/en-us/articles/360036490432-Mutect2
  - see our documentation on the mutation pipeline for details regarding filtering, annotation, and more
- PureCN:
  - https://github.com/lima1/PureCN
  - computes absolute copy number, as well as features including loss of heterozygosity (LoH), LoH fraction, ploidy estimate, Whole Genome Doubling (WGD), and Chromosomal Instability (CIN), from WES/WGS data
  - We filter out calls for cell lines with ploidy > 5 or if non-aberrant, goodness of fit < 70%, since PureCN cannot produce confident predictions for them.
  - Details on how PureCN is run for DepMap data
- MSIsensor2:
  - https://github.com/niu-lab/msisensor2
  - computes Microsatellite Instability (MSI) score from WES/WGS data
- Manta:
  - https://github.com/Illumina/manta
  - calls structural variants from WES/WGS data
- Manta SV annotator:
  - https://github.com/acranej/MantaSVAnnotator
  - annotates structural variants generated by Manta
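The PureCN filtering rule in the list above can be sketched as follows. The source sentence is ambiguous, so this sketch assumes one reading of it: a call is dropped if ploidy > 5, if the sample is non-aberrant, or if goodness of fit is below 70%. The `PureCNCall` record and its field names are hypothetical and do not correspond to PureCN's actual output columns.

```python
from dataclasses import dataclass

MAX_PLOIDY = 5.0            # calls with ploidy above this are dropped
MIN_GOODNESS_OF_FIT = 70.0  # percent; calls below this are dropped


@dataclass
class PureCNCall:
    # Hypothetical record; field names are not PureCN's output columns.
    cell_line: str
    ploidy: float
    non_aberrant: bool
    goodness_of_fit: float  # percent, 0-100


def is_confident(call):
    """Return True if the call passes all three assumed exclusion rules."""
    if call.ploidy > MAX_PLOIDY:
        return False
    if call.non_aberrant:
        return False
    if call.goodness_of_fit < MIN_GOODNESS_OF_FIT:
        return False
    return True


def filter_calls(calls):
    """Keep only the calls considered confident under the assumed reading."""
    return [call for call in calls if is_confident(call)]
```

If the intended rule instead ties the goodness-of-fit threshold to non-aberrant samples only, the second and third checks would need to be combined into a single condition.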
Aggregate_CN_seg_files aggregates copy number segment outputs.
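The aggregation step itself runs as a WDL workflow, but the idea can be illustrated in a few lines of Python. This is only a sketch under stated assumptions: every per-sample segment file is a tab-delimited table with a header row and the same columns, and the added `sample_id` column name is a hypothetical choice, not the pipeline's actual output schema.

```python
import csv


def aggregate_segment_files(sample_to_path, output_path, delimiter="\t"):
    """Concatenate per-sample segment tables into one table with a sample column.

    Assumes every input file has a header row with identical columns; the
    'sample_id' column name is a hypothetical choice for this sketch.
    """
    writer = None
    with open(output_path, "w", newline="") as out_handle:
        for sample_id, path in sample_to_path.items():
            with open(path, newline="") as in_handle:
                reader = csv.DictReader(in_handle, delimiter=delimiter)
                for row in reader:
                    if writer is None:
                        # Write the header once, taken from the first file with rows.
                        fieldnames = ["sample_id"] + list(reader.fieldnames)
                        writer = csv.DictWriter(
                            out_handle,
                            fieldnames=fieldnames,
                            delimiter=delimiter,
                            lineterminator="\n",
                        )
                        writer.writeheader()
                    writer.writerow({"sample_id": sample_id, **row})
```

Tagging each row with its sample keeps the combined table self-describing, so downstream code can group segments by sample without consulting file names.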
WES:
The same subtasks above are run on our WES data. We use both Illumina ICE and Agilent intervals for our WES data. You can find their respective PON files and interval files as parameters in our workflow configurations in data/[release quarter]/[workflow name].json.
For CN, PONs are made from GTEx project normals, since they were sequenced in the same fashion as CCLE samples and with the same set of baits.
PONs for each bait set were created with XY-only lines using the workflow gatk/CNV_Somatic_Panel_Workflow.
We generate both expression and fusion datasets from RNAseq data. Specifically, we use the GTEx pipeline to generate the expression dataset, and STAR-Fusion to generate gene fusion calls. This task also contains a flag that lets you specify whether to delete the intermediates (fastqs), which can be large and costly to store. The following two workflows are run in this order:
RNA_pipeline imports and runs the following sub-processes to generate RNA expression and fusion data matrices.
- star (from docker image gcr.io/broad-cga-francois-gtex/gtex_rnaseq:V10):
  - aligns RNAseq bam files for downstream processing
  - https://www.ncbi.nlm.nih.gov/pubmed/23104886
  - https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf
  - STAR and RSEM indices are generated using GENCODE (v38)'s "comprehensive gene annotations" GTF and the GRCh38 reference genome for RNA-seq alignment provided in GTEx's pipeline, which includes ERCC spike-in and excludes ALT, HLA, and Decoy contigs. The STAR index is generated with flag --sjdbOverhang 100.
- rsem (from docker image gcr.io/broad-cga-francois-gtex/gtex_rnaseq:V10):
  - quantifies gene and isoform abundances from RNAseq data
  - https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-323
- star fusion (from docker image trinityctat/starfusion:1.7.0):
  - generates fusion predictions from RNAseq data
  - https://github.com/STAR-Fusion/STAR-Fusion/wiki
  - http://biorxiv.org/content/early/2017/03/24/120295
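To make the STAR index setting mentioned above concrete, here is a small sketch that assembles (but does not run) a `STAR --runMode genomeGenerate` command with `--sjdbOverhang 100`. The helper name, the thread count, and the file paths a caller would pass are placeholders; the actual inputs used by the pipeline live in the workflow configuration files.

```python
def star_index_command(genome_dir, fasta_path, gtf_path, sjdb_overhang=100, threads=4):
    """Build (not run) a STAR genomeGenerate command as an argument list.

    The paths and thread count are placeholders supplied by the caller;
    only the --sjdbOverhang value of 100 comes from the text above.
    """
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--genomeDir", str(genome_dir),
        "--genomeFastaFiles", str(fasta_path),
        "--sjdbGTFfile", str(gtf_path),
        "--sjdbOverhang", str(sjdb_overhang),
        "--runThreadN", str(threads),
    ]
```

Returning a list rather than a single string means the result can be passed directly to `subprocess.run(cmd, check=True)` without shell quoting concerns.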
RNA_aggregate aggregates expression and fusion data files into their respective aggregated file.
Finally, we save the workflow configurations used in the pipeline runs.
Remarks:
- For the copy number pipeline, we have parametrized both a Chromosome XX version and a Chromosome XY version. We recommend using the XY version, as it covers the entire genome.