This document summarizes the key components in DepMap omics' processing pipeline.

Note that input references, indices, and parameters used in the WDL workflows for the following pipelines in any given quarter can be found in data/[release quarter]/[workflow name].json.
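As a rough illustration of the path convention above, the following sketch builds the location of a workflow's saved inputs and loads them. The helper names, the `root` argument, and the example quarter `22Q4` are assumptions made for illustration; only the `data/[release quarter]/[workflow name].json` layout comes from this document.

```python
import json
from pathlib import Path


def workflow_config_path(release_quarter: str, workflow_name: str, root: str = "data") -> Path:
    """Build <root>/<release quarter>/<workflow name>.json (hypothetical helper)."""
    return Path(root) / release_quarter / f"{workflow_name}.json"


def load_workflow_config(release_quarter: str, workflow_name: str, root: str = "data") -> dict:
    """Read the saved references, indices, and parameters for one workflow."""
    with workflow_config_path(release_quarter, workflow_name, root).open() as handle:
        return json.load(handle)


if __name__ == "__main__":
    # "22Q4" is an illustrative quarter name, not one confirmed by this document.
    print(workflow_config_path("22Q4", "WGS_pipeline").as_posix())
```

Calling `load_workflow_config` on a checked-out copy of the repository would return the JSON inputs as a dictionary, which makes it easy to compare parameters between quarters.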

Copy Numbers and Somatic Mutations

We are currently running the following workflows to generate datasets from WGS data:

WGS_pipeline generates relative and absolute copy number segments, a mutation MAF file, structural variant (SV) calls, and various genomic features including loss of heterozygosity (LoH), LoH fraction, ploidy estimate, Whole Genome Doubling (WGD), Chromosomal Instability (CIN), and Microsatellite Instability (MSI) score. This workflow runs the following subtasks:

gatk cnv:
- outputs relative segment and copy number from WES/WGS data
- https://software.broadinstitute.org/gatk/documentation/article?id=11682
mutect2 (from broadinstitute/gatk:4.2.6.1):
- outputs mutation calls from RNAseq/WES/WGS data
- https://gatk.broadinstitute.org/hc/en-us/articles/360036490432-Mutect2
- see our documentation on the mutation pipeline for details regarding filtering, annotation, and more
PureCN:
- https://github.com/lima1/PureCN
- computes absolute copy number, as well as features including loss of heterozygosity (LoH), LoH fraction, ploidy estimate, Whole Genome Doubling (WGD), and Chromosomal Instability (CIN) from WES/WGS data
- We filter out calls for cell lines that have ploidy > 5, are flagged as non-aberrant, or have goodness of fit < 70%, since PureCN is not able to produce confident predictions for them.
- Details on how PureCN is run for DepMap data
MSIsensor2:
- https://github.com/niu-lab/msisensor2
- computes Microsatellite Instability (MSI) score from WES/WGS data
Manta:
- https://github.com/Illumina/manta
- calls structural variants from WES/WGS data
Manta SV annotator:
- https://github.com/acranej/MantaSVAnnotator
- annotates structural variants generated by Manta
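The PureCN filtering rule in the list above can be sketched as a simple predicate. This is one possible reading, assuming the three conditions (ploidy, non-aberrant flag, goodness of fit) are combined with a logical OR; the `PureCNCall` record and its field names are hypothetical and do not reflect PureCN's actual output columns.

```python
from dataclasses import dataclass

MAX_PLOIDY = 5.0             # calls with ploidy above this are dropped
MIN_GOODNESS_OF_FIT = 70.0   # percent; calls below this are dropped


@dataclass
class PureCNCall:
    """Hypothetical per-cell-line summary; field names are illustrative."""
    cell_line: str
    ploidy: float
    non_aberrant: bool
    goodness_of_fit: float  # percent


def is_confident(call: PureCNCall) -> bool:
    """Return True if the call passes all three filters (assumed OR of failures)."""
    fails = (
        call.ploidy > MAX_PLOIDY
        or call.non_aberrant
        or call.goodness_of_fit < MIN_GOODNESS_OF_FIT
    )
    return not fails


def keep_confident(calls: list[PureCNCall]) -> list[PureCNCall]:
    """Drop calls that PureCN is unlikely to have predicted confidently."""
    return [call for call in calls if is_confident(call)]
```

Keeping the thresholds as named constants makes it straightforward to align them with whatever values the workflow configuration for a given quarter specifies.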

Aggregate_CN_seg_files aggregates copy number segment outputs.
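A minimal sketch of what such an aggregation step could look like, assuming each sample's segment output is a tab-separated table with a header row and that aggregation means concatenating rows while tagging each with its sample. The function name, the `sample_id` column, and the example column names are illustrative assumptions; the actual Aggregate_CN_seg_files workflow may work differently.

```python
import csv
import io


def aggregate_segments(per_sample_tables: dict[str, str]) -> list[dict[str, str]]:
    """Concatenate per-sample TSV tables, adding a 'sample_id' field to each row.

    per_sample_tables maps a sample identifier to the TSV text of its segments.
    """
    rows = []
    for sample_id, tsv_text in per_sample_tables.items():
        reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
        for row in reader:
            rows.append({"sample_id": sample_id, **row})
    return rows


if __name__ == "__main__":
    # Column names below are placeholders, not the pipeline's real schema.
    example = {
        "sample_1": "chrom\tstart\tend\n1\t100\t200\n",
        "sample_2": "chrom\tstart\tend\n2\t300\t400\n2\t400\t900\n",
    }
    for row in aggregate_segments(example):
        print(row)
```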

WES:

The same subtasks above were run on our WES data. We used both Illumina ICE and Agilent intervals for our WES data. You can find their respective PON files and interval files as parameters in our workflow configurations in data/[release quarter]/[workflow name].json.

Panel of Normals (PONs)

For copy number (CN), PONs are made from normals from the GTEx project, as they were sequenced in the same fashion as CCLE samples with the same set of baits. PONs for each bait set were created from XY-only lines using the workflow gatk/CNV_Somatic_Panel_Workflow.

Expression and Fusion

We generate both expression and fusion datasets from RNAseq data. Specifically, we use the GTEx pipeline to generate the expression dataset, and STAR-Fusion to generate gene fusion calls. This task also contains a flag that lets you delete the intermediate files (fastqs), which can be large and costly to store. The following two workflows are run in this order:

RNA_pipeline imports and runs the following sub-processes to generate RNA expression and fusion data matrices.

star (from docker image gcr.io/broad-cga-francois-gtex/gtex_rnaseq:V10):
- aligns RNAseq bam files for downstream processing
- https://www.ncbi.nlm.nih.gov/pubmed/23104886
- https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf
- STAR and RSEM indices are generated using GENCODE (v38)'s "comprehensive gene annotations" GTF and the GRCh38 reference genome for RNA-seq alignment provided in GTEx's pipeline, which includes ERCC spike-in and excludes ALT, HLA, and Decoy contigs. The STAR index is generated with flag --sjdbOverhang 100.
rsem (from docker image gcr.io/broad-cga-francois-gtex/gtex_rnaseq:V10):
- quantifies gene and isoform abundances from RNAseq data
- https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-323
star fusion (from docker image trinityctat/starfusion:1.7.0):
- generates fusion prediction from RNAseq data
- https://github.com/STAR-Fusion/STAR-Fusion/wiki
- http://biorxiv.org/content/early/2017/03/24/120295
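The STAR index generation mentioned in the list above can be sketched as a command line assembled in Python. The flags are standard STAR options for `genomeGenerate` mode; the file paths and thread count are placeholders, and this is not necessarily the exact invocation used in the GTEx docker image.

```python
import shlex


def star_index_command(
    genome_dir: str,
    fasta: str,
    gtf: str,
    sjdb_overhang: int = 100,
    threads: int = 4,
) -> list[str]:
    """Assemble a STAR genomeGenerate command (paths and thread count are placeholders)."""
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", fasta,
        "--sjdbGTFfile", gtf,
        "--sjdbOverhang", str(sjdb_overhang),  # value stated in this document
        "--runThreadN", str(threads),
    ]


if __name__ == "__main__":
    # Hypothetical file names standing in for the GRCh38 FASTA and GENCODE v38 GTF.
    command = star_index_command("star_index", "genome.fasta", "annotation.gtf")
    print(shlex.join(command))
```

Building the command as a list rather than a single string avoids shell quoting problems if it is later passed to `subprocess.run`.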

RNA_aggregate aggregates expression and fusion data files into their respective aggregated files.

Finally, we save the workflow configurations used in the pipeline runs.

Remarks:

For the copy number pipeline, we have parametrized both a Chromosome XX version and a Chromosome XY version. We recommend using the XY version, as it covers the entire genome.
