(The current version does not contain RNA sequence specificities. Experimental data will be provided upon submission. Consequently, only scripts that work without RNA binding specificities can be executed.)
This directory contains the code for "Reconstructing sequence specificities of RNA binding proteins across eukaryotes".
We used a joint linear embedding approach to model the relationship between protein sequence and RNA sequence specificity.
Recommended: install Anaconda and create a virtual environment with Python 2.7, adding the dependencies listed in dependencies.txt.
Additional requirements:
- python3 with scikit-learn==0.23.2 (only for agglomerative_clustering.py)
- Hmmer (http://hmmer.org/)
- conservation_code (Capra JA and Singh M. Predicting functionally important residues from sequence conservation. Bioinformatics, 23(15):1875-82, 2007; https://compbio.cs.princeton.edu/conservation/)
- pymol (https://pymol.org/2/) to visualize individual pdbs
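Before starting, it can help to confirm that the external tools are reachable from the shell. The following is a minimal sketch, not part of the repository: the helper name missing_tools is hypothetical, and the executable names in the usage line (python3, hmmsearch, pymol) are assumptions about how the tools are installed on your system.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository).
# Prints the name of every given executable that is not found on PATH,
# one per line, and returns non-zero if at least one is missing.
missing_tools() {
    local status=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}
```

Usage, with assumed executable names: missing_tools python3 hmmsearch pymol
An empty output means that all listed tools were found.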
Note: Executing the "full" pipeline (i.e. every intermediate step) requires very long running times and large amounts of memory. Parallel execution is recommended!
To run every step and modify intermediate results, set $full=1 in the individual bash scripts.
RUN: Before reconstructing the figures with fig1.sh to fig5.sh, execute the data processing scripts (rncmpt_data.sh, performance_calc.sh, interface_importance.sh, jple_reconstruction.sh, cisbp-recstats.sh, arabidopsis.sh).
The following order is recommended/required:
- rncmpt_data.sh
- fig1.sh
- performance_calc.sh
- interface_importance.sh
- fig2.sh
- jple_reconstruction.sh
- fig3.sh
- cisbp-recstats.sh
- arabidopsis.sh
- fig4.sh
- fig5.sh
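The order above can be sketched as a small driver that runs the scripts one after another and stops at the first failure. This is an illustration only: the function run_in_order and the variable PIPELINE are hypothetical names, not part of the repository, and the sketch assumes that every script is started with bash from the directory that contains it.

```shell
#!/usr/bin/env bash
# Hypothetical driver (not part of this repository).
# Runs the given scripts in order with bash and stops at the first failure.
run_in_order() {
    local script
    for script in "$@"; do
        echo "running $script"
        bash "$script" || { echo "failed: $script" >&2; return 1; }
    done
}

# Script order copied from the list above.
PIPELINE="rncmpt_data.sh fig1.sh performance_calc.sh interface_importance.sh fig2.sh jple_reconstruction.sh fig3.sh cisbp-recstats.sh arabidopsis.sh fig4.sh fig5.sh"
```

Usage, from the directory containing the scripts: run_in_order $PIPELINE
Stopping at the first failure avoids building figures from incomplete intermediate results.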
cisbp_reconstruction.sh executes scripts to locally generate the data available at http://cisbp-rna.ccbr.utoronto.ca/, e.g. PWMs jpgs for confidently reconstructed specificities.