Family 7 glycoside hydrolases (GH7s), also known as family 7 cellulases (Cel7s), are principal enzymes for cellulose degradation, both in nature and in industry. In this work, machine learning (ML) is applied to relate the amino acid sequence of GH7s to function by identifying the key sequence features that the ML algorithms use and that correlate with functional subtypes.
The strategies utilized in this work may be adapted to uncover sequence-function relationships in other protein families.
- Python (>=3)
- pandas (0.24.2)
- numpy (1.16.2)
- scipy (1.1.0)
- biopython (1.73)
- scikit-learn (0.20.3)
- imbalanced-learn (0.4.3)
- matplotlib (3.0.2)
- seaborn (0.9.0)
- pydot_ng (2.0.0)
- subtype_hmm.py: Use hidden Markov models (HMMs) to discriminate GH7 functional subtypes (CBH vs EG)
- subtype_ml.py: Use supervised machine learning to discriminate GH7 functional subtypes
- subtype_rules.py: Derive position-specific classification rules for discriminating GH7 functional subtypes
- cbm_ml.py: Use supervised machine learning to predict the presence of carbohydrate binding modules (CBMs) in GH7s
- bioinformatics.py: Contains ad hoc functions for bioinformatic analysis
- plots_and_analysis.py: Analyzes results and plots the figures in the manuscript
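
As a rough illustration of what a position-specific classification rule can look like, the sketch below scans every column of a set of aligned sequences and picks the single (column, residue) pair that best separates the two subtypes. It is not the code in subtype_rules.py: the function name `best_position_rule`, the rule form, and the toy sequences are all assumptions made up for this example, and the toy sequences carry no biological meaning.

```python
def best_position_rule(aligned_seqs, labels, positive="CBH"):
    """Return (accuracy, column, residue) for the best single-position rule.

    Rule form (an assumption for this sketch): if the residue at `column`
    equals `residue`, predict `positive`; otherwise predict the other class.
    """
    n_cols = len(aligned_seqs[0])
    if any(len(seq) != n_cols for seq in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")

    best = (0.0, -1, "")
    for col in range(n_cols):
        # Try every residue observed in this alignment column.
        for residue in sorted({seq[col] for seq in aligned_seqs}):
            correct = sum(
                (seq[col] == residue) == (label == positive)
                for seq, label in zip(aligned_seqs, labels)
            )
            accuracy = correct / len(labels)
            # Strict comparison keeps the first rule found on ties.
            if accuracy > best[0]:
                best = (accuracy, col, residue)
    return best


# Made-up aligned toy sequences, for illustration only.
toy_seqs = ["ACWGT", "ACWGS", "ACYGT", "ACYAT"]
toy_labels = ["CBH", "CBH", "EG", "EG"]

accuracy, column, residue = best_position_rule(toy_seqs, toy_labels)
print(f"column {column} == {residue!r} -> CBH (accuracy {accuracy:.2f})")
```

On real data the same scan would run over the columns of a multiple sequence alignment, and the accuracy would be estimated on held-out sequences rather than on the sequences used to pick the rule.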
- Sequence datasets are in fasta/
- Sequences split into five folds used for validation and design of the HMM, as well as the final trained HMMs, are in hmm_train_test/
- Datasets containing results presented in the paper (Gado et al., 2019) are in results_final/
- Figures and tables in the manuscript are in plots/
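
The five-fold split mentioned above for the HMM data can be pictured with the minimal sketch below. It assumes a stratified split (each fold keeps roughly the same CBH/EG ratio), which is an assumption of this example and not something this README states; the function name `stratified_folds`, the seed, and the toy labels are likewise hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign sample indices to n_folds lists, balancing classes per fold."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin into the folds.
        for i, idx in enumerate(indices):
            folds[i % n_folds].append(idx)
    return folds


# Made-up labels, for illustration only.
toy_labels = ["CBH"] * 10 + ["EG"] * 5
folds = stratified_folds(toy_labels)

for k, test_idx in enumerate(folds):
    held_out = set(test_idx)
    train_idx = [i for i in range(len(toy_labels)) if i not in held_out]
    print(f"fold {k}: {len(train_idx)} training, {len(test_idx)} test sequences")
```

Each fold serves once as the test set while the other four are used for training, so every sequence is tested exactly once.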
If you find this work useful, please cite this paper:
Gado JE, Harrison BE, Sandgren M, Ståhlberg J, Beckham GT, and Payne CM. Machine learning reveals sequence-function relationships in family 7 glycoside hydrolases. J. Biol. Chem. (2021).