Skip to content

aberthel/AlleleSeq2

This branch is 1 commit ahead of trgaleev/AlleleSeq2:master.

Folders and files

NameName
Last commit message
Last commit date

Latest commit

0d11d43 · Oct 27, 2021
Apr 7, 2021
Apr 6, 2021
Oct 27, 2021
Apr 8, 2021
Feb 12, 2017
Mar 6, 2019
Mar 6, 2019
Apr 6, 2021
Apr 6, 2021
Apr 6, 2021
Oct 27, 2021
Feb 12, 2017
Mar 6, 2019
Mar 6, 2019
Feb 12, 2017
May 18, 2017
Mar 31, 2017
Jul 20, 2018
Jul 20, 2018
Mar 6, 2019
Mar 6, 2019
Mar 6, 2019
Mar 6, 2019
Mar 6, 2019
Mar 6, 2019
Mar 6, 2019
Mar 6, 2019
Nov 20, 2017
Mar 6, 2019
Apr 6, 2021
Mar 6, 2019
Jul 27, 2018
Jul 27, 2018
Mar 6, 2019
Jul 20, 2018

Repository files navigation

AlleleSeq2

alt text

Generate personal genomes, STAR indices and other helper files:

Makefile options (can be specified in makePersonalGenome.mk or as command-line arguments)

Dependencies, system parameters/paths:

VCF2DIPLOID_DIR: vcf2diploid (available from http://alleleseq.gersteinlab.org/tools.html)
PL: path to AlleleSeq2
LIFTOVER: UCSC liftOver tool
BEDTOOLS_intersectBed: Bedtools intersectBed
SAMTOOLS: Samtools
STAR: STAR aligner

Other options:

N_THREADS: number or threads (for STAR genomeGenerate)
REFGENOME_VERSION: reference genome version, 'GRCh37' or 'GRCh38'
REFGENOME: path to the reference genome .fasta file
FILE_PATH_VCF: path to VCF
VCF_SAMPLE_ID: sample name in VCF
FILE_PATH_BAM: path to WGS bam
OUTPUT_DIR: output folder

Example:

make -f makePersonalGenome.mk \
        N_THREADS=8 \
        VCF_SAMPLE_ID=sge_Aug_encodev2_2_local \
        REFGENOME_VERSION=GRCh38 \
        OUTPUT_DIR=pgenome_ENC-003 \
        FILE_PATH_BAM=ENCFF200XCY.bam \
        FILE_PATH_VCF=enc003.spliced.scrubbed.vcf 

(1) Calling AS hetSNVs from a single sample

Makefile options (can be specified in PIPELINE.mk or as command-line arguments):

Dependencies:

python2

scipy
numpy
pandas

R

VGAM
ggplot2

Dependencies, system parameters/paths:

PL: path to AlleleSeq2
SAMTOOLS: samtools
PICARD: Broad picard tools
STAR: STAR aligner FASTQC: FastQC quality control tool
CUTADAPT: Cutadapt to remove adapter sequences (ATAC-seq samples)

Other options:

READS_R1: path to input .fastq file (R1)
READS_R2: path to input .fastq file (R2, if PE sequencing)
PGENOME_DIR: path to personal genome folder
VCF_SAMPLE_ID: sample name in VCF
ALIGNMENT_MODE: 'ASE' for RNA-seq, 'ASB' for ChIP-seq' and 'ASCA' for ATAC-seq
RM_DUPLICATE_READS: 'on' to remove duplicate reads with picard tools
STAR_readFilesCommand: --readFilesIn option in STAR
REFGENOME_VERSION: reference genome version, 'GRCh37' or 'GRCh38'
Annotation_diploid: path to diploid GENCODE gene annotation (for 'ASE')
FDR_CUTOFF: FDR threshold
Cntthresh_tot: threshold for the total number of reads mapped to hetSNV
Cntthresh_min: threshold for the minimal number of reads mapped to each allele

Example:

make -f PIPELINE.mk \
        PGENOME_DIR=pgenome_ENC-003 \
        REFGENOME_VERSION=GRCh38 \
        NTHR=8 \
        READS_R1=ENCFF337ZBN.fastq.gz \
        READS_R2=ENCFF481IQE.fastq.gz \
        STAR_readFilesCommand=zcat \
        ALIGNMENT_MODE=ASE \
        RM_DUPLICATE_READS=on \
        FDR_CUTOFF=0.10 \
	VCF_SAMPLE_ID=sge_Aug_encodev2_2_local

The main output file containing AS hetSNVs is
ENCFF337ZBN_ENCFF481IQE_interestingHets.FDR-0.10.betabinom.chrs1-22.6-tot_0-min_cnt.tsv

(2) Pool replicates/tissues

Makefile options (can be specified in PIPELINE_aggregated_counts.mk or as command-line arguments)

Dependencies, system parameters/paths:

PL: path to AlleleSeq2
PGENOME_DIR: path to personal genome folder
VCF_SAMPLE_ID: sample name in VCF

Other options:

INPUT_UNIQ_READS_PILEUP_FILES: list of .mpileup files with uniquely mapped reads to aggregate generated in (1)
INPUT_MMAP_READS_PILEUP_FILES: list of .mpileup files with multi-mapping reads to aggregate generated in (1)
PREFIX: prefix for output file names
Cntthresh_tot: threshold for the total number of reads mapped to hetSNV
Cntthresh_min: threshold for the minimal number of reads mapped to each allele
FDR_CUTOFF: FDR threshold

Example:

make -f PIPELINE_aggregated_counts.mk \
        PGENOME_DIR=pgenome_ENC-003 \
        INPUT_UNIQ_READS_PILEUP_FILES="../ENCSR238ZZD_ENCFF719MSG_1_ENCFF120MML_2_1_1/ENCFF719MSG_ENCFF120MML_hap1_uniqreads.mpileup    ../ENCSR238ZZD_ENCFF719MSG_1_ENCFF120MML_2_1_1/ENCFF719MSG_ENCFF120MML_hap2_uniqreads.mpileup ../ENCSR238ZZD_ENCFF337ZBN_1_ENCFF481IQE_2_1_1/ENCFF337ZBN_ENCFF481IQE_hap1_uniqreads.mpileup ../ENCSR238ZZD_ENCFF337ZBN_1_ENCFF481IQE_2_1_1/ENCFF337ZBN_ENCFF481IQE_hap2_uniqreads.mpileup" \
        INPUT_MMAP_READS_PILEUP_FILES="../ENCSR238ZZD_ENCFF719MSG_1_ENCFF120MML_2_1_1/ENCFF719MSG_ENCFF120MML_hap1_mmapreads.mpileup ../ENCSR238ZZD_ENCFF719MSG_1_ENCFF120MML_2_1_1/ENCFF719MSG_ENCFF120MML_hap2_mmapreads.mpileup ../ENCSR238ZZD_ENCFF337ZBN_1_ENCFF481IQE_2_1_1/ENCFF337ZBN_ENCFF481IQE_hap1_mmapreads.mpileup ../ENCSR238ZZD_ENCFF337ZBN_1_ENCFF481IQE_2_1_1/ENCFF337ZBN_ENCFF481IQE_hap2_mmapreads.mpileup" \
        PREFIX=ENCSR238ZZD \
        FDR_CUTOFF=0.10 \
        VCF_SAMPLE_ID=sge_Aug_encodev2_2_local

The main output file is ENCSR238ZZD_interestingHets.FDR-0.10.betabinom.chrs1-22.6-tot_0-min_cnt.tsv

(3) Aggregate across genomic elements

Makefile options (can be specified in PIPELINE_aggregate_over_genomic_regions.mk or as command-line arguments)

Dependencies, system parameters/paths:

PL: path to AlleleSeq2
BEDTOOLS_intersectBed: Bedtools intersectBed
SAMTOOLS: Samtools

Other options:

PREFIX: prefix for output file names
Cntthresh_tot: threshold for the total number of reads mapped to hetSNV
Cntthresh_min: threshold for the minimal number of reads mapped to each allele
FDR_CUTOFF: FDR threshold COUNTS_FILE: filtered read count file generated in (1) or (2). REGIONS_FILE: path to a .bed file with genomic elments (e.g. gene, cCRE) coordinates
UNIQ_ALN_FILES: .bam(s) with uniquely mapping reads generated in (1)
MMAP_ALN_FILES: .bam(s) with multimapping reads generated in (1)

Example:

make -f   PIPELINE_aggregate_over_genomic_regions.mk \
	  PREFIX=ENCSR238ZZD \
	  REGIONS_FILE="gencode.v24_all_genes.bed" \
	  COUNTS_FILE=../ENCSR238ZZD_combined/ENCSR238ZZD_filtered_counts.tsv \
	  UNIQ_ALN_FILES='../ENCSR238ZZD_ENCFF719MSG_1_ENCFF120MML_2_1_1/ENCFF719MSG_ENCFF120MML_ASE-params_crdsorted_uniqreads_over_hetSNVs.bam ../ENCSR238ZZD_ENCFF337ZBN_1_ENCFF481IQE_2_1_1/ENCFF337ZBN_ENCFF481IQE_ASE-params_crdsorted_uniqreads_over_hetSNVs.bam' \
	  MMAP_ALN_FILES='../ENCSR238ZZD_ENCFF719MSG_1_ENCFF120MML_2_1_1/ENCFF719MSG_ENCFF120MML_ASE-params_crdsorted_mmapreads_over_hetSNVs.bam ../ENCSR238ZZD_ENCFF337ZBN_1_ENCFF481IQE_2_1_1/ENCFF337ZBN_ENCFF481IQE_ASE-params_crdsorted_mmapreads_over_hetSNVs.bam' \
	  FDR_CUTOFF=0.10  

The main output file listing AS genomic elements is ENCSR238ZZD_interesting_regions.FDR-0.10.betabinom.chrs1-22.6-tot_0-min_cnt.tsv

(4) Additional scripts to calculate hap-specific read coverage are provided under the directory calc_read_coverage

(5) Scripts to query the ENTEx catalog are provided under the directory query

The catalog file is hetSNVs_default_AS.tsv.

(6) Additional scripts to perform high-powered AS calling on samples with low read count are provided under the directory high_power

This requires files generated from (1), calling AS hetSNVs from a single sample. Where indicated, use the same values for options as in (1)

Two methods are included: the one-sided betabinomial test and the relaxed-FDR method. By default, both are calculated. To run only one, use "make onesided" or "make fdr", respectively.

Makefile options (can be specified in PIPELINE_high_power.mk or as command-line arguments):

Dependencies:

python2

pandas

R

VGAM

Dependencies, system parameters/paths:

PL: path to AlleleSeq2 (same as in (1))

Other options:

READS_R1: path to input .fastq file (R1) (same as in (1)) READS_R2: path to input .fastq file (R2, if PE sequencing) (same as in (1)) FDR_CUTOFF: FDR threshold (same as in (1)) Cntthresh_tot: threshold for the total number of reads mapped to hetSNV (same as in (1)) Cntthresh_min: threshold for the minimal number of reads mapped to each allele (same as in (1)) AS_PRIOR: path to tab-separated file of hetSNVs that are likely to be AS. Required columns are: "chr", "ref_coord", and "preferred_allele" (1 if ref allele is favored, 0 otherwise) RELAXED_FDR_CUTOFF: for relaxed-FDR method, the less stringent FDR threshold FDR_GRANULARITY: maximum p-value at which to check FDR with fine granularity (0.01 is usually sufficient for FDR = 0.1, 0.03 is sufficient for FDR = 0.2)

Example:

make -f PIPELINE_high_power.mk \
        READS_R1=ENCFF000FAH.fastq.gz \
        FDR_CUTOFF=0.10 \
        FDR_RELAXED=0.2 \
        FDR_GRANULARITY=0.03 \
        AS_PRIOR=AS_prior.tsv

The main output files are ENCFF000FAH_interestingHets.FDR-0.10.betabinom.chrs1-22.6-tot_0-min_cnt_relaxed_fdr.tsv and ENCFF000FAH_interestingHets.FDR-0.10.betabinom.chrs1-22.6-tot_0-min_cnt_onesided.tsv

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 41.2%
  • Makefile 30.3%
  • R 25.4%
  • Shell 3.1%