Authors:
Marjan Hosseini
Devin J. McConnell
Derek Aguiar
The code for BREM method is provided in 'brem.py' file. The junction files for one gene (A2ML1) is added in the github. To run BREM with default configurations, simply run the requirements.txt. Then run 'brem.py'. More detailed guide is the follow.
- Clone the repository:
git clone https://github.com/aguiarlab/BREM.git
cd BREM
- Install the project dependencies and activate the environment:
conda env create -f brem_env.yml
conda activate brem_env
- Run the code:
python3 brem.py
Or alternatively for changing model parameters or training other genes run:
python3 brem.py -k clusters_no -i max iteration -e eta hyper parameter -a alpha hyper-parameter -r r -s s -p path -g gene
- clusters_no: number of clusters, this number should be larger than the number of minimum node cover of the interval graph of the intron excisions
- max iteration: the maximum number of iterations for the Gibbs updates
- eta corresponds to eta hyper-parameter (See the model section)
- alpha corresponds to alpha hyper-parameter (See the model section)
- r corresponds to r hyper-parameter (See the model section)
- s corresponds to s hyper-parameter (See the model section)
- path is the relative or full path to the folder that contains the genes samples junctions reads as a zip file. Here one gene (A2ML1.zip) has been uploaded for example to be run and the path to it is the default path.
- gene: the name of the gene (Here by default, gene is A2ML1).
For creating the EGA and simulated data BAM files we ran the STAR [1] aligner for fast and accurate alignment.
STAR --runThreadN 20 --genomeDir ../genome_data/genome_index/ --outFileNamePrefix ./person_${i}_ --twopassMode Basic --outSAMstrandField intronMotif --outSAMtype BAM SortedByCoordinate --readFilesIn ${1}/person_${i}_1.fa ${1}/person_${i}_2.fa
In this example $${1}$ is the directory where the files are located. The input file 'person_*_1.fa' is a collection of genes for the ith sample on forward sequence and the second is the collection of genes on the backward sequence. This allows us to use the twopassMode and we also used the intronMotif in order to obtain spliced alignments (XS). Once the files have been aligned they are then separated out into the individual bam files of just one gene to work on at a time.
Regtools [2] was used for an efficient filtering of the junctions.
regtools junctions extract -s 0 -a 6 -m 50 -M 500000 %s -o %s.junc
Here the two %s are the bam file and output name respectively. On all data used in this project, EGA, Geuvadis, and simulations, we used the following flags:
* -s: finds XS/unstranded flags
* -a: minimum anchor length into exon (6 bp)
* -m: minimum intron size (50 bp)
* -M: Maximum intron size (500000 bp)
portcullis prep -t 20 -v --force -o %s_portcullis/1-prep/ GRCh38.primary_assembly.genome.fa %s/%s.bam
First step of portcullis [3] is prep and it takes in the fasta file from the reference genome used, here is an example from the data simulations. It is important to note that portcullis was run on our simulations and both experimental results. Here %s is there name of the folder to direct output to and the bam file name.
portcullis junc -t 20 -v -o %s_portcullis/2-junc/portcullis_all --intron_gff %s_portcullis/1-prep/
The next step that is being done is extraction of the junctions into a gff format. Here %s is just giving it the name of the folder to look into.
portcullis filt -t 20 -v -n --max_length 500000 --min_cov 30 -o %s_portcullis/3-filt/portcullis_filtered --intron_gff %s_portcullis/1-prep/ %s_portcullis/2-junc/portcullis_all.junctions.tab
Finally, we set some filtering using portcullis. We only keep introns that have a max length of 500000, do not use the machine learning filtering, and have a minimum coverage of 30. Here again %s is just point to the gene folder to look into for the files. After portcullis is complete we do an overlap check between the junctions found from regtools and portcullis and only keep the ones that overlap with at least 90%.
BREM Probabilistic Graphical Model
Main variables and parameters include:
-
V is the set of unique intron excisions, indexed by v and its size of denoted by |V|.
-
N is the number of samples and are indexed by i.
-
Ji is the number of intron excisions in ith sample.
-
K is the number of clusters (indexed by k).
-
For the jth intron excision in the ith sample, we assign a cluster k.
-
Graph G = (V, E), where V is the set of unique intron excision and there is an edge between two intron excisions iff they intersect each other.
-
Ω is the set of all the independent sets in G.
-
r, s are priors for π (Beta distribution):
- Increase in mean of Beta(r,s) results in increase in cluster size |SIE|.
-
The structure of a (clusters) SIEs consists of the inclusion or exclusion of intron excisions.
-
For cluster k, βk is a |V|-dimensional Dirichlet which represents the distribution of the cluster k over the intron excisions.
- η = (η1, η2, ..., η|V|) is β variable prior.
-
For the ith sample, variable θi is a K-dimensional Dirichlet distribution and represents the proportions of the clusters in sample i.
- α = (α1, α2, ..., αN) is θ variable prior.
-
Variable zij is the cluster assignment for jth intron excision in ith sample. It can take a natural value between 1 and K and follows a Multinomial:
-
In the ith sample, the jth intron excision is wij and is observed and follows a Multinomial distribution:
[1] Dobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., ... & Gingeras, T. R. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1), 15-21.
[2] Feng, Y. Y., Ramu, A., Cotto, K. C., Skidmore, Z. L., Kunisaki, J., Conrad, D. F., ... & Griffith, M. (2018). RegTools: Integrated analysis of genomic and transcriptomic data for discovery of splicing variants in cancer. BioRxiv, 436634.
[3] Mapleson, D., Venturini, L., Kaithakottil, G., & Swarbreck, D. (2018). Efficient and accurate detection of splice junctions from RNA-seq with Portcullis. GigaScience, 7(12), giy131.