This repo contains code for promoter discovery and biodiversity mining using natural language processing. This effort is being pursued through three specific aims:
- Aim 1: Develop a natural language processing-based model for promoter identification
- Aim 2: Extend the model to identify inducible promoter sequences
- Aim 3: Experimentally validate promoter predictions
Promoter sequences were collected from three main databases: EPDnew, RegulonDB, and DBTBS.
All data files are found in and/or will be written to data/
data/DBTBS/
- Contains raw data from DBTBS: Bacillus subtilis promoter database
data/EPDnew/
- Contains raw data from EPDnew: Eukaryote promoter database (promoter data for 15 different organisms)
data/RegulonDB/
- Contains raw data from RegulonDB: Escherichia coli promoter database
data/parsed_promoter_data/
- Promoter data parsed from each database
data/20191114promoter_identification_ML_curation
- Manually curated information on other state-of-the-art ML models for promoter prediction
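A quick way to confirm that the data layout described above is in place before running any parsing code is sketched below. This is only an illustration, not part of the repo: the function name `missing_data_dirs` and the constant `EXPECTED_DATA_DIRS` are hypothetical, and the check assumes it is run from the repository root.

```python
from pathlib import Path

# Subdirectories of data/ that this README lists (hypothetical constant name).
EXPECTED_DATA_DIRS = [
    "DBTBS",
    "EPDnew",
    "RegulonDB",
    "parsed_promoter_data",
]


def missing_data_dirs(repo_root="."):
    """Return the documented data/ subdirectories that do not exist yet."""
    data_root = Path(repo_root) / "data"
    return [name for name in EXPECTED_DATA_DIRS if not (data_root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_data_dirs()
    if missing:
        print("Missing under data/:", ", ".join(missing))
    else:
        print("All documented data directories are present.")
```

The curation entry (data/20191114promoter_identification_ML_curation) is left out of the check because the listing above does not say whether it is a file or a directory.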
All code is found in and/or will be written to src/, in either notebook or script form.
Notebooks:
src/notebooks/20191125_promoter_database_parsing.ipynb
- Notebook containing code for parsing promoter data
All raw and edited figures will be written to figs/