This Snakemake workflow generates allele-specific expression (ASE) data for validation purposes. It outputs an RNA alignment file containing known ASE genes that serve as a ground-truth set. A phased VCF and an expression matrix are also generated. This workflow was used to validate the IMPALA software, which can be found here.
The command below clones the repository. You can then run AlleleSynth from within the cloned directory.
git clone https://github.com/Glenn032787/AlleleSynth.git
To run this workflow, you must have Snakemake (v6.12.3) and Singularity (v3.5.2-1.1.el7) installed. You can install Snakemake using this guide and Singularity using this guide. The remaining dependencies are downloaded automatically by the Snakemake workflow.
The main inputs required are the reference genome and gene annotation. Example files for hg38 chromosome 22 are included in the ref directory.
Additional gene annotation and index files are required.
- kallisto genome index
- Only needed if kallisto is used instead of STAR for alignment
kallisto index -i reference.idx reference.fa
- Picard genome index
- Only needed if perfect phasing is done
java -jar picard.jar CreateSequenceDictionary R=reference.fa O=reference.dict
- Ensembl transcript to HGNC gene
- Converts Ensembl transcript IDs to HGNC symbols
- Ensembl 100 genes are included in ref/ensembl100_transcript2gene.tsv
- Gene BED file
- BED file of gene locations
- The hg38 gene BED file is included in ref/biomart_ensembl100_GRCh38.sorted.bed.gz
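Before starting a run, it can help to confirm that the reference files listed above are actually present. The following is a minimal Python sketch, not part of the workflow itself. The helper name `missing_paths` is hypothetical, and the example dictionary only uses paths mentioned in this README.

```python
from pathlib import Path


def missing_paths(ref_paths):
    """Return {key: path} for every entry whose path does not exist on disk."""
    return {key: path for key, path in ref_paths.items() if not Path(path).exists()}


if __name__ == "__main__":
    # Paths taken from this README; adjust them to your own reference files.
    refs = {
        "ref_genome": "ref/chr22.fa",
        "annotation_gtf": "ref/chr22.gtf",
        "gene_annotation": "ref/biomart_ensembl100_GRCh38.sorted.bed.gz",
        "ensembl2hgnc": "ref/ensembl100_transcript2gene.tsv",
    }
    for key, path in missing_paths(refs).items():
        print(f"missing {key}: {path}")
```

Entries that are a file prefix rather than a real file or directory would be reported as missing by this check, so treat the result as a hint rather than a hard failure.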
There are four main outputs for each run. These can be found in the output/{sampleName}/final directory. The outputs can be used as input for the IMPALA workflow.
- RNA alignment file
- Expression matrix
- Phased VCF
- List of genes expressed and ASE/BAE status
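As a rough illustration of the layout described above, this Python sketch builds the per-sample final directory and lists whatever files it contains. The helper names `final_dir` and `list_final_outputs` and the sample names are hypothetical. The sketch makes no assumption about the file names the workflow writes.

```python
from pathlib import Path


def final_dir(sample_name, root="output"):
    """Return the output/{sampleName}/final path for one sample."""
    return Path(root) / sample_name / "final"


def list_final_outputs(sample_name, root="output"):
    """List file names in the sample's final directory, or [] if it does not exist."""
    directory = final_dir(sample_name, root)
    if not directory.is_dir():
        return []
    return sorted(p.name for p in directory.iterdir() if p.is_file())


if __name__ == "__main__":
    for sample in ["sampleName1", "sampleName2"]:  # hypothetical sample names
        print(sample, final_dir(sample), list_final_outputs(sample))
```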
There are two config files that need to be edited before running the workflow. Both config files are found in the config directory.
The config/params.yaml file is used to specify parameters for the simulated data.
# Number of SNPs in each allele for the normal and tumor genomes
snps:
  normalSNPcount: 2500000
  tumorSNPcount: 20000
# Number of ASE genes
numASE: 500
# Proportion of tumor vs. normal RNA reads
tumorContent: 0.75
# Parameters for RNA simulation
rnaReads:
  readLength: 150
  percentExpressed: 0.4 # Proportion of genes expressed
  depth: 5 # Controls gene expression
  ASEfoldchange: 2 # Difference in expression between alleles for ASE
# If true, simulates long reads for phasing (takes much longer)
# If false, uses perfect phasing (every SNP is phased)
simulatePhasing: FALSE
# If true, uses STAR alignment with WASP filtering
# If false, uses kallisto alignment
WASPfilter: TRUE
# Parameters for nanopore simulation
longRead:
  numReads: 3000000
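The parameters above can be sanity-checked before a long run. The Python sketch below works on a plain dictionary that mirrors config/params.yaml, and the helper names `check_params` and `allele_fractions` are hypothetical. It also shows one possible reading of ASEfoldchange, assuming it is the expression ratio between the two alleles of an ASE gene. The README does not state how the workflow applies this value, so treat that part as an assumption.

```python
def check_params(params):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    tumor = params.get("tumorContent")
    if tumor is None or not 0.0 <= tumor <= 1.0:
        problems.append("tumorContent should be a proportion between 0 and 1")
    rna = params.get("rnaReads", {})
    expressed = rna.get("percentExpressed")
    if expressed is None or not 0.0 <= expressed <= 1.0:
        problems.append("rnaReads.percentExpressed should be between 0 and 1")
    if rna.get("ASEfoldchange", 0) <= 0:
        problems.append("rnaReads.ASEfoldchange should be positive")
    if params.get("numASE", 0) < 0:
        problems.append("numASE should not be negative")
    return problems


def allele_fractions(fold_change):
    """Assumed interpretation: split expression so that major / minor == fold_change."""
    major = fold_change / (fold_change + 1.0)
    return major, 1.0 - major


if __name__ == "__main__":
    example = {
        "numASE": 500,
        "tumorContent": 0.75,
        "rnaReads": {"readLength": 150, "percentExpressed": 0.4,
                     "depth": 5, "ASEfoldchange": 2},
    }
    print(check_params(example))
    print(allele_fractions(example["rnaReads"]["ASEfoldchange"]))
```

Under this assumed interpretation, a fold change of 2 corresponds to roughly two thirds of the reads coming from one allele and one third from the other.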
The config/refPaths.yaml file is used to specify paths to the input files. Where to download or how to generate these files is described in the input section.
ref_genome:
    "ref/chr22.fa"
nanosim_model:
    "ref/nanosimModel/human_NA12878_DNA_FAB49712_guppy"
annotation_gtf:
    "ref/chr22.gtf"
gene_annotation:
    "ref/biomart_ensembl100_GRCh38.sorted.bed.gz"
ensembl2hgnc:
    "ref/ensembl100_transcript2gene.tsv"
kalliso_index:
    "/path/to/kalliso/index"
chrom_length:
    "/path/to/chrom/length"
picardIndex:
    "/path/to/picard/index"
Use the following command to run the workflow with Singularity. The -c parameter specifies the maximum number of threads. The -B parameter specifies the paths to bind into the container. The sample parameter specifies the sample name; listing multiple sample names generates multiple sets of ASE data.
snakemake -c 30 --use-singularity --singularity-args "-B /projects,/home,/gsc" --config sample='["sampleName1", "sampleName2"]'
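When the sample list is generated by a script, shell quoting of the list is easy to get wrong. The following Python sketch assembles the same command as an argument list, so no shell quoting is needed. The helper name `build_command` is hypothetical. It assumes Snakemake accepts a JSON-style list as the value of sample, as in the command above.

```python
import json
import shlex


def build_command(samples, threads=30, bind_paths=("/projects", "/home", "/gsc")):
    """Build the snakemake argument list for one or more sample names."""
    sample_arg = "sample=" + json.dumps(list(samples))
    return [
        "snakemake", "-c", str(threads),
        "--use-singularity",
        "--singularity-args", "-B " + ",".join(bind_paths),
        "--config", sample_arg,
    ]


if __name__ == "__main__":
    cmd = build_command(["sampleName1", "sampleName2"])
    # Print a shell-safe version for inspection. To launch it, pass cmd to
    # subprocess.run(cmd, check=True) from the AlleleSynth directory.
    print(shlex.join(cmd))
```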
The pipeline was originally written by Glenn Chang with help and input from:
- Members of the Jones lab (Canada's Michael Smith Genome Sciences Centre, Vancouver, Canada).
- Special thanks to Steven Jones, Kieran O'Niell, Vannessa Porter and Luka Cuilibrk
AlleleSynth is licensed under the terms of the GNU GPL v3.