AlleleSeq2

Generate personal genomes, STAR indices and other helper files:

Makefile options (can be specified in `makePersonalGenome.mk` or as command-line arguments)

Dependencies, system parameters/paths:

VCF2DIPLOID_DIR: vcf2diploid (available from http://alleleseq.gersteinlab.org/tools.html)
PL: path to AlleleSeq2
LIFTOVER: UCSC liftOver tool
BEDTOOLS_intersectBed: Bedtools intersectBed
SAMTOOLS: Samtools
STAR: STAR aligner

Other options:

N_THREADS: number or threads (for STAR genomeGenerate)
REFGENOME_VERSION: reference genome version, 'GRCh37' or 'GRCh38'
REFGENOME: path to the reference genome .fasta file
FILE_PATH_VCF: path to VCF
VCF_SAMPLE_ID: sample name in VCF
FILE_PATH_BAM: path to WGS bam
OUTPUT_DIR: output folder

Example:

make -f makePersonalGenome.mk \
        N_THREADS=8 \
        VCF_SAMPLE_ID=sge_Aug_encodev2_2_local \
        REFGENOME_VERSION=GRCh38 \
        OUTPUT_DIR=pgenome_ENC-003 \
        FILE_PATH_BAM=ENCFF200XCY.bam \
        FILE_PATH_VCF=enc003.spliced.scrubbed.vcf

(1) Calling AS hetSNVs from a single sample

Makefile options (can be specified in `PIPELINE.mk` or as command-line arguments):

Dependencies:

python2

scipy
numpy
pandas

R

VGAM
ggplot2

Dependencies, system parameters/paths:

PL: path to AlleleSeq2
SAMTOOLS: samtools
PICARD: Broad picard tools
STAR: STAR aligner FASTQC: FastQC quality control tool
CUTADAPT: Cutadapt to remove adapter sequences (ATAC-seq samples)

Other options:

READS_R1: path to input .fastq file (R1)
READS_R2: path to input .fastq file (R2, if PE sequencing)
PGENOME_DIR: path to personal genome folder
VCF_SAMPLE_ID: sample name in VCF
ALIGNMENT_MODE: 'ASE' for RNA-seq, 'ASB' for ChIP-seq' and 'ASCA' for ATAC-seq
RM_DUPLICATE_READS: 'on' to remove duplicate reads with picard tools
STAR_readFilesCommand: --readFilesIn option in STAR
REFGENOME_VERSION: reference genome version, 'GRCh37' or 'GRCh38'
Annotation_diploid: path to diploid GENCODE gene annotation (for 'ASE')
FDR_CUTOFF: FDR threshold
Cntthresh_tot: threshold for the total number of reads mapped to hetSNV
Cntthresh_min: threshold for the minimal number of reads mapped to each allele

Example:

make -f PIPELINE.mk \
        PGENOME_DIR=pgenome_ENC-003 \
        REFGENOME_VERSION=GRCh38 \
        NTHR=8 \
        READS_R1=ENCFF337ZBN.fastq.gz \
        READS_R2=ENCFF481IQE.fastq.gz \
        STAR_readFilesCommand=zcat \
        ALIGNMENT_MODE=ASE \
        RM_DUPLICATE_READS=on \
        FDR_CUTOFF=0.10 \
	VCF_SAMPLE_ID=sge_Aug_encodev2_2_local

The main output file containing AS hetSNVs is
ENCFF337ZBN_ENCFF481IQE_interestingHets.FDR-0.10.betabinom.chrs1-22.6-tot_0-min_cnt.tsv

(2) Pool replicates/tissues

Makefile options (can be specified in `PIPELINE_aggregated_counts.mk` or as command-line arguments)

Dependencies, system parameters/paths:

PL: path to AlleleSeq2
PGENOME_DIR: path to personal genome folder
VCF_SAMPLE_ID: sample name in VCF

Other options:

INPUT_UNIQ_READS_PILEUP_FILES: list of .mpileup files with uniquely mapped reads to aggregate generated in (1)
INPUT_MMAP_READS_PILEUP_FILES: list of .mpileup files with multi-mapping reads to aggregate generated in (1)
PREFIX: prefix for output file names
Cntthresh_tot: threshold for the total number of reads mapped to hetSNV
Cntthresh_min: threshold for the minimal number of reads mapped to each allele
FDR_CUTOFF: FDR threshold

Example:

make -f PIPELINE_aggregated_counts.mk \
        PGENOME_DIR=pgenome_ENC-003 \
        INPUT_UNIQ_READS_PILEUP_FILES="../ENCSR238ZZD_ENCFF719MSG_1_ENCFF120MML_2_1_1/ENCFF719MSG_ENCFF120MML_hap1_uniqreads.mpileup    ../ENCSR238ZZD_ENCFF719MSG_1_ENCFF120MML_2_1_1/ENCFF719MSG_ENCFF120MML_hap2_uniqreads.mpileup ../ENCSR238ZZD_ENCFF337ZBN_1_ENCFF481IQE_2_1_1/ENCFF337ZBN_ENCFF481IQE_hap1_uniqreads.mpileup ../ENCSR238ZZD_ENCFF337ZBN_1_ENCFF481IQE_2_1_1/ENCFF337ZBN_ENCFF481IQE_hap2_uniqreads.mpileup" \
        INPUT_MMAP_READS_PILEUP_FILES="../ENCSR238ZZD_ENCFF719MSG_1_ENCFF120MML_2_1_1/ENCFF719MSG_ENCFF120MML_hap1_mmapreads.mpileup ../ENCSR238ZZD_ENCFF719MSG_1_ENCFF120MML_2_1_1/ENCFF719MSG_ENCFF120MML_hap2_mmapreads.mpileup ../ENCSR238ZZD_ENCFF337ZBN_1_ENCFF481IQE_2_1_1/ENCFF337ZBN_ENCFF481IQE_hap1_mmapreads.mpileup ../ENCSR238ZZD_ENCFF337ZBN_1_ENCFF481IQE_2_1_1/ENCFF337ZBN_ENCFF481IQE_hap2_mmapreads.mpileup" \
        PREFIX=ENCSR238ZZD \
        FDR_CUTOFF=0.10 \
        VCF_SAMPLE_ID=sge_Aug_encodev2_2_local

The main output file is ENCSR238ZZD_interestingHets.FDR-0.10.betabinom.chrs1-22.6-tot_0-min_cnt.tsv

(3) Aggregate across genomic elements

Makefile options (can be specified in `PIPELINE_aggregate_over_genomic_regions.mk` or as command-line arguments)

Dependencies, system parameters/paths:

PL: path to AlleleSeq2
BEDTOOLS_intersectBed: Bedtools intersectBed
SAMTOOLS: Samtools

Other options:

PREFIX: prefix for output file names
Cntthresh_tot: threshold for the total number of reads mapped to hetSNV
Cntthresh_min: threshold for the minimal number of reads mapped to each allele
FDR_CUTOFF: FDR threshold COUNTS_FILE: filtered read count file generated in (1) or (2). REGIONS_FILE: path to a .bed file with genomic elments (e.g. gene, cCRE) coordinates
UNIQ_ALN_FILES: .bam(s) with uniquely mapping reads generated in (1)
MMAP_ALN_FILES: .bam(s) with multimapping reads generated in (1)

Example:

make -f   PIPELINE_aggregate_over_genomic_regions.mk \
	  PREFIX=ENCSR238ZZD \
	  REGIONS_FILE="gencode.v24_all_genes.bed" \
	  COUNTS_FILE=../ENCSR238ZZD_combined/ENCSR238ZZD_filtered_counts.tsv \
	  UNIQ_ALN_FILES='../ENCSR238ZZD_ENCFF719MSG_1_ENCFF120MML_2_1_1/ENCFF719MSG_ENCFF120MML_ASE-params_crdsorted_uniqreads_over_hetSNVs.bam ../ENCSR238ZZD_ENCFF337ZBN_1_ENCFF481IQE_2_1_1/ENCFF337ZBN_ENCFF481IQE_ASE-params_crdsorted_uniqreads_over_hetSNVs.bam' \
	  MMAP_ALN_FILES='../ENCSR238ZZD_ENCFF719MSG_1_ENCFF120MML_2_1_1/ENCFF719MSG_ENCFF120MML_ASE-params_crdsorted_mmapreads_over_hetSNVs.bam ../ENCSR238ZZD_ENCFF337ZBN_1_ENCFF481IQE_2_1_1/ENCFF337ZBN_ENCFF481IQE_ASE-params_crdsorted_mmapreads_over_hetSNVs.bam' \
	  FDR_CUTOFF=0.10

The main output file listing AS genomic elements is ENCSR238ZZD_interesting_regions.FDR-0.10.betabinom.chrs1-22.6-tot_0-min_cnt.tsv

(4) Additional scripts to calculate hap-specific read coverage are provided under the directory calc_read_coverage

(5) Scripts to query the ENTEx catalog are provided under the directory query

The catalog file is hetSNVs_default_AS.tsv.

(6) Additional scripts to perform high-powered AS calling on samples with low read count are provided under the directory high_power

This requires files generated from (1), calling AS hetSNVs from a single sample. Where indicated, use the same values for options as in (1)

Two methods are included: the one-sided betabinomial test and the relaxed-FDR method. By default, both are calculated. To run only one, use "make onesided" or "make fdr", respectively.

Makefile options (can be specified in `PIPELINE_high_power.mk` or as command-line arguments):

Dependencies:

python2

pandas

R

VGAM

Dependencies, system parameters/paths:

PL: path to AlleleSeq2 (same as in (1))

Other options:

READS_R1: path to input .fastq file (R1) (same as in (1)) READS_R2: path to input .fastq file (R2, if PE sequencing) (same as in (1)) FDR_CUTOFF: FDR threshold (same as in (1)) Cntthresh_tot: threshold for the total number of reads mapped to hetSNV (same as in (1)) Cntthresh_min: threshold for the minimal number of reads mapped to each allele (same as in (1)) AS_PRIOR: path to tab-separated file of hetSNVs that are likely to be AS. Required columns are: "chr", "ref_coord", and "preferred_allele" (1 if ref allele is favored, 0 otherwise) RELAXED_FDR_CUTOFF: for relaxed-FDR method, the less stringent FDR threshold FDR_GRANULARITY: maximum p-value at which to check FDR with fine granularity (0.01 is usually sufficient for FDR = 0.1, 0.03 is sufficient for FDR = 0.2)

Example:

make -f PIPELINE_high_power.mk \
        READS_R1=ENCFF000FAH.fastq.gz \
        FDR_CUTOFF=0.10 \
        FDR_RELAXED=0.2 \
        FDR_GRANULARITY=0.03 \
        AS_PRIOR=AS_prior.tsv

The main output files are ENCFF000FAH_interestingHets.FDR-0.10.betabinom.chrs1-22.6-tot_0-min_cnt_relaxed_fdr.tsv and ENCFF000FAH_interestingHets.FDR-0.10.betabinom.chrs1-22.6-tot_0-min_cnt_onesided.tsv

Name	Name	Last commit message	Last commit date
Latest commit aberthel added high power calculations Oct 27, 2021 0d11d43 · Oct 27, 2021 History 106 Commits
calc_read_coverage	calc_read_coverage	updated readme	Apr 7, 2021
docs	docs	README update	Apr 6, 2021
high_power	high_power	added high power calculations	Oct 27, 2021
query	query	Update README.md	Apr 8, 2021
.gitignore	.gitignore	mapping to diploid genome, pileups	Feb 12, 2017
FalsePos.py	FalsePos.py	minor edits/rearrangements, not affecting results, except reps pool o…	Mar 6, 2019
FalsePos_regions.py	FalsePos_regions.py	minor edits/rearrangements, not affecting results, except reps pool o…	Mar 6, 2019
PIPELINE.mk	PIPELINE.mk	minor clean up and README update	Apr 6, 2021
PIPELINE_aggregate_over_genomic_regions.mk	PIPELINE_aggregate_over_genomic_regions.mk	minor clean up and README update	Apr 6, 2021
PIPELINE_aggregated_counts.mk	PIPELINE_aggregated_counts.mk	minor clean up and README update	Apr 6, 2021
README.md	README.md	added high power calculations	Oct 27, 2021
STAR_custom_parameters_sample_file	STAR_custom_parameters_sample_file	mapping to diploid genome, pileups	Feb 12, 2017
alleledb_alleleseqBetabinomial.R	alleledb_alleleseqBetabinomial.R	minor edits/rearrangements, not affecting results, except reps pool o…	Mar 6, 2019
alleledb_calcOverdispersion.R	alleledb_calcOverdispersion.R	minor edits/rearrangements, not affecting results, except reps pool o…	Mar 6, 2019
binom.py	binom.py	binom.py from AlleleSeq_v1.2a	Feb 12, 2017
calculate_rd.py	calculate_rd.py	changes to rd calcn procedure	May 18, 2017
feed_sorted_check_names.py	feed_sorted_check_names.py	deal with diff chr names in input files	Mar 31, 2017
filter_by_counts.py	filter_by_counts.py	h1/h2 to hap1/hap2	Jul 20, 2018
filter_by_pval.py	filter_by_pval.py	h1/h2 to hap1/hap2	Jul 20, 2018
filter_cnv_sites.py	filter_cnv_sites.py	minor edits/rearrangements, not affecting results, except reps pool o…	Mar 6, 2019
filter_non-autosomal_chr.py	filter_non-autosomal_chr.py	minor edits/rearrangements, not affecting results, except reps pool o…	Mar 6, 2019
filter_phase_warnings.py	filter_phase_warnings.py	minor edits/rearrangements, not affecting results, except reps pool o…	Mar 6, 2019
filter_regions_by_counts.py	filter_regions_by_counts.py	minor edits/rearrangements, not affecting results, except reps pool o…	Mar 6, 2019
filter_regions_by_pval.py	filter_regions_by_pval.py	minor edits/rearrangements, not affecting results, except reps pool o…	Mar 6, 2019
filter_regions_non-autosomal_chr.py	filter_regions_non-autosomal_chr.py	minor edits/rearrangements, not affecting results, except reps pool o…	Mar 6, 2019
filter_regions_w_mmaps.py	filter_regions_w_mmaps.py	minor edits/rearrangements, not affecting results, except reps pool o…	Mar 6, 2019
filter_sites_w_mmaps.py	filter_sites_w_mmaps.py	minor edits/rearrangements, not affecting results, except reps pool o…	Mar 6, 2019
get_hetSNV_bed.py	get_hetSNV_bed.py	get phasing info from mpileups	Nov 20, 2017
intersect2counts.py	intersect2counts.py	minor edits/rearrangements, not affecting results, except reps pool o…	Mar 6, 2019
makePersonalGenome.mk	makePersonalGenome.mk	minor clean up and README update	Apr 6, 2021
pileup2counts.py	pileup2counts.py	minor edits/rearrangements, not affecting results, except reps pool o…	Mar 6, 2019
plot_AllelicRatio_distribution.R	plot_AllelicRatio_distribution.R	fixed filenames for ratio plots, switched back to R	Jul 27, 2018
plot_AllelicRatio_distribution.py	plot_AllelicRatio_distribution.py	fixed filenames for ratio plots, switched back to R	Jul 27, 2018
read_pileup.py	read_pileup.py	minor edits/rearrangements, not affecting results, except reps pool o…	Mar 6, 2019
refbed2hapcoords.py	refbed2hapcoords.py	h1/h2 to hap1/hap2	Jul 20, 2018

aberthel/AlleleSeq2

Folders and files

Latest commit

History

Repository files navigation

AlleleSeq2

Generate personal genomes, STAR indices and other helper files:

Makefile options (can be specified in makePersonalGenome.mk or as command-line arguments)

Dependencies, system parameters/paths:

Other options:

Example:

(1) Calling AS hetSNVs from a single sample

Makefile options (can be specified in PIPELINE.mk or as command-line arguments):

Dependencies:

python2

R

Dependencies, system parameters/paths:

Other options:

Example:

(2) Pool replicates/tissues

Makefile options (can be specified in PIPELINE_aggregated_counts.mk or as command-line arguments)

Dependencies, system parameters/paths:

Other options:

Example:

(3) Aggregate across genomic elements

Makefile options (can be specified in PIPELINE_aggregate_over_genomic_regions.mk or as command-line arguments)

Dependencies, system parameters/paths:

Other options:

Example:

(4) Additional scripts to calculate hap-specific read coverage are provided under the directory calc_read_coverage

(5) Scripts to query the ENTEx catalog are provided under the directory query

(6) Additional scripts to perform high-powered AS calling on samples with low read count are provided under the directory high_power

This requires files generated from (1), calling AS hetSNVs from a single sample. Where indicated, use the same values for options as in (1)

Two methods are included: the one-sided betabinomial test and the relaxed-FDR method. By default, both are calculated. To run only one, use "make onesided" or "make fdr", respectively.

Makefile options (can be specified in PIPELINE_high_power.mk or as command-line arguments):

Dependencies:

python2

R

Dependencies, system parameters/paths:

Other options:

Example:

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Makefile options (can be specified in `makePersonalGenome.mk` or as command-line arguments)

Makefile options (can be specified in `PIPELINE.mk` or as command-line arguments):

Makefile options (can be specified in `PIPELINE_aggregated_counts.mk` or as command-line arguments)

Makefile options (can be specified in `PIPELINE_aggregate_over_genomic_regions.mk` or as command-line arguments)

Makefile options (can be specified in `PIPELINE_high_power.mk` or as command-line arguments):

Packages