This document summarizes the key components in DepMap omics' processing pipeline.

Note that input references, indices, and parameters used in the WDL workflows for the following pipelines in any given quarter can be found in data/[release quarter]/[workflow name].json.
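As a minimal sketch of how one of these configuration files could be located and read, the snippet below builds the path from a release quarter and a workflow name and parses the JSON. The helper names `config_path` and `load_workflow_config` are hypothetical and are not part of the repository. Nothing is assumed about the keys stored in the file.

```python
import json
from pathlib import Path


def config_path(release_quarter, workflow_name, root="data"):
    """Build data/[release quarter]/[workflow name].json, the layout described above."""
    return Path(root) / release_quarter / f"{workflow_name}.json"


def load_workflow_config(release_quarter, workflow_name, root="data"):
    """Parse a saved workflow configuration into a dict.

    The keys depend on the workflow; this sketch does not assume any of them.
    """
    with open(config_path(release_quarter, workflow_name, root)) as handle:
        return json.load(handle)
```

A call would look like `load_workflow_config(quarter, "WGS_pipeline")`, where `quarter` is whichever release directory exists under `data/`.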

Copy Numbers and Somatic Mutations

We are currently running the following workflows to generate datasets from WGS data:

WGS_pipeline generates relative and absolute copy number segments, a mutation MAF file, structural variant (SV) calls, and various genomic features, including loss of heterozygosity (LoH), LoH fraction, ploidy estimate, Whole Genome Doubling (WGD), Chromosomal Instability (CIN), and Microsatellite Instability (MSI) score. This workflow runs the following subtasks:

Aggregate_CN_seg_files aggregates copy number segment outputs.
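Conceptually, aggregating segment outputs amounts to concatenating per-sample segment tables into one table that records which sample each row came from. The sketch below illustrates that idea only. It is not the Aggregate_CN_seg_files implementation, and the tab-separated layout, the `sample_id` column name, and the helper name are all assumptions.

```python
import csv
import io


def aggregate_seg_tables(tables, id_column="sample_id"):
    """Concatenate per-sample tab-separated segment tables.

    `tables` maps a sample identifier to the text of that sample's table.
    All tables are assumed to share one header (an assumption of this sketch).
    Returns (column names, list of row dicts); values stay as strings.
    """
    header = None
    rows = []
    for sample, text in tables.items():
        reader = csv.DictReader(io.StringIO(text), delimiter="\t")
        fields = list(reader.fieldnames or [])
        if header is None:
            header = fields
        elif fields != header:
            raise ValueError(f"header mismatch in table for {sample}")
        for row in reader:
            # Prefix every segment with the sample it belongs to.
            rows.append({id_column: sample, **row})
    return [id_column] + (header or []), rows
```

Raising on a header mismatch is a deliberate choice in this sketch: silently merging tables with different columns would produce an aggregate that is hard to audit.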

WES:

The subtasks above were also run on our WES data, for which we used both Illumina ICE and Agilent intervals. Their respective PON files and interval files are set as parameters in our workflow configurations in data/[release quarter]/[workflow name].json.

Panel of Normals (PONs)

For CN, PONs are made from normals from the GTEx project, as these were sequenced in the same fashion as CCLE samples, with the same set of baits. PONs for each bait set were created from XY-only lines using the workflow gatk/CNV_Somatic_Panel_Workflow.

Expression and Fusion

We generate both expression and fusion datasets from RNAseq data. Specifically, we use the GTEx pipeline to generate the expression dataset and STAR-Fusion to generate gene fusion calls. The pipeline also has a flag that specifies whether to delete the intermediates (FASTQs), which can be large and costly to store. The following two workflows are run in this order:

RNA_pipeline imports and runs the following sub-processes to generate RNA expression and fusion data matrices.

RNA_aggregate aggregates expression and fusion data files into their respective aggregated files.
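To make the aggregation idea concrete, the sketch below combines per-sample expression values into a single matrix over the union of genes. It is an illustration only, not the RNA_aggregate code: the input shape (one gene-to-value mapping per sample), the sample-by-gene orientation, the `None` placeholder for missing genes, and the function name are assumptions.

```python
def build_expression_matrix(per_sample, missing=None):
    """Combine {sample: {gene: value}} into (genes, {sample: [values]}).

    Genes are sorted so every sample's row uses the same column order.
    Genes absent from a sample are filled with `missing`.
    """
    genes = sorted({gene for values in per_sample.values() for gene in values})
    matrix = {
        sample: [values.get(gene, missing) for gene in genes]
        for sample, values in per_sample.items()
    }
    return genes, matrix
```

Sorting the gene list keeps the column order identical across samples and across runs, which makes aggregated files easier to compare.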

Finally, we save the workflow configurations used in the pipeline runs.

Remarks:

  • For the copy number pipeline, we have parametrized both a Chromosome XX version and a Chromosome XY version. We recommend using the XY version, as it covers the entire genome.