Bayesian reconstruction of excised mRNA (BREM)

Authors:
Marjan Hosseini
Devin J. McConnell
Derek Aguiar

The code for BREM method is provided in 'brem.py' file. The junction files for one gene (A2ML1) is added in the github. To run BREM with default configurations, simply run the requirements.txt. Then run 'brem.py'. More detailed guide is the follow.

Usage

Clone the repository:

git clone https://github.com/aguiarlab/BREM.git
cd BREM

Install the project dependencies and activate the environment:

conda env create -f brem_env.yml
conda activate brem_env

Run the code:

python3 brem.py

Or alternatively for changing model parameters or training other genes run:

python3 brem.py -k clusters_no -i max iteration  -e eta hyper parameter -a alpha hyper-parameter -r r -s s -p path -g gene

- clusters_no: number of clusters, this number should be larger than the number of minimum node cover of the interval graph of the intron excisions
- max iteration: the maximum number of iterations for the Gibbs updates
- eta corresponds to eta hyper-parameter (See the model section)  
- alpha corresponds to alpha hyper-parameter (See the model section)  
- r corresponds to r hyper-parameter (See the model section)  
- s corresponds to s hyper-parameter (See the model section) 
- path is the relative or full path to the folder that contains the genes samples junctions reads as a zip file. Here one gene (A2ML1.zip) has been uploaded for example to be run and the path to it is the default path.
- gene: the name of the gene (Here by default, gene is A2ML1).

Prepare the input:

STAR

For creating the EGA and simulated data BAM files we ran the STAR [1] aligner for fast and accurate alignment.

STAR --runThreadN 20 --genomeDir ../genome_data/genome_index/ --outFileNamePrefix ./person_${i}_ --twopassMode  Basic --outSAMstrandField intronMotif --outSAMtype BAM SortedByCoordinate  --readFilesIn ${1}/person_${i}_1.fa ${1}/person_${i}_2.fa

In this example $${1}$ is the directory where the files are located. The input file 'person_*_1.fa' is a collection of genes for the ith sample on forward sequence and the second is the collection of genes on the backward sequence. This allows us to use the twopassMode and we also used the intronMotif in order to obtain spliced alignments (XS). Once the files have been aligned they are then separated out into the individual bam files of just one gene to work on at a time.

Regtools

Regtools [2] was used for an efficient filtering of the junctions.

regtools junctions extract -s 0 -a 6 -m 50 -M 500000 %s -o %s.junc

Here the two %s are the bam file and output name respectively. On all data used in this project, EGA, Geuvadis, and simulations, we used the following flags:

* -s: finds XS/unstranded flags
* -a: minimum anchor length into exon (6 bp)
* -m: minimum intron size (50 bp)
* -M: Maximum intron size (500000 bp)

Portcullis

portcullis prep -t 20 -v --force -o %s_portcullis/1-prep/ GRCh38.primary_assembly.genome.fa %s/%s.bam

First step of portcullis [3] is prep and it takes in the fasta file from the reference genome used, here is an example from the data simulations. It is important to note that portcullis was run on our simulations and both experimental results. Here %s is there name of the folder to direct output to and the bam file name.

portcullis junc -t 20 -v -o %s_portcullis/2-junc/portcullis_all --intron_gff %s_portcullis/1-prep/

The next step that is being done is extraction of the junctions into a gff format. Here %s is just giving it the name of the folder to look into.

portcullis filt -t 20 -v -n --max_length 500000 --min_cov 30 -o %s_portcullis/3-filt/portcullis_filtered --intron_gff %s_portcullis/1-prep/ %s_portcullis/2-junc/portcullis_all.junctions.tab

Finally, we set some filtering using portcullis. We only keep introns that have a max length of 500000, do not use the machine learning filtering, and have a minimum coverage of 30. Here again %s is just point to the gene folder to look into for the files. After portcullis is complete we do an overlap check between the junctions found from regtools and portcullis and only keep the ones that overlap with at least 90%.

Model:

BREM Probabilistic Graphical Model

Main variables and parameters include:

V is the set of unique intron excisions, indexed by v and its size of denoted by |V|.
N is the number of samples and are indexed by i.
J_i is the number of intron excisions in i^th sample.
K is the number of clusters (indexed by k).
For the j^th intron excision in the i^th sample, we assign a cluster k.
Graph G = (V, E), where V is the set of unique intron excision and there is an edge between two intron excisions iff they intersect each other.
Ω is the set of all the independent sets in G.
r, s are priors for π (Beta distribution):
- Increase in mean of Beta(r,s) results in increase in cluster size |SIE|.
The structure of a (clusters) SIEs consists of the inclusion or exclusion of intron excisions.
For cluster k, β_k is a |V|-dimensional Dirichlet which represents the distribution of the cluster k over the intron excisions.
- η = (η₁, η₂, ..., η_|V|) is β variable prior.
For the i^th sample, variable θ_i is a K-dimensional Dirichlet distribution and represents the proportions of the clusters in sample i.
- α = (α₁, α₂, ..., α_N) is θ variable prior.
Variable z_ij is the cluster assignment for j^th intron excision in i^th sample. It can take a natural value between 1 and K and follows a Multinomial:
In the ith sample, the jth intron excision is w_ij and is observed and follows a Multinomial distribution:

References

[1] Dobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., ... & Gingeras, T. R. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1), 15-21.

[2] Feng, Y. Y., Ramu, A., Cotto, K. C., Skidmore, Z. L., Kunisaki, J., Conrad, D. F., ... & Griffith, M. (2018). RegTools: Integrated analysis of genomic and transcriptomic data for discovery of splicing variants in cancer. BioRxiv, 436634.

[3] Mapleson, D., Venturini, L., Kaithakottil, G., & Swarbreck, D. (2018). Efficient and accurate detection of splice junctions from RNA-seq with Portcullis. GigaScience, 7(12), giy131.

Name		Name	Last commit message	Last commit date
Latest commit History 46 Commits
BREM		BREM
docs		docs
.gitignore		.gitignore
A2ML1.zip		A2ML1.zip
README.md		README.md
brem.py		brem.py
brem_env.yml		brem_env.yml
extract_junctions.sh		extract_junctions.sh
utilities.py		utilities.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Bayesian reconstruction of excised mRNA (BREM)

Usage

Prepare the input:

STAR

Regtools

Portcullis

Model:

References

About

Releases

Packages

Contributors 2

Languages

bayesomicslab/BSEEJ

Folders and files

Latest commit

History

Repository files navigation

Bayesian reconstruction of excised mRNA (BREM)

Usage

Prepare the input:

STAR

Regtools

Portcullis

Model:

References

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages