Skip to content

Latest commit

 

History

History
123 lines (106 loc) · 7.47 KB

README.md

File metadata and controls

123 lines (106 loc) · 7.47 KB

DeepUTR

Motivation

messenger RNA (mRNA) degradation plays critical roles in post-transcriptional gene regulation. A major component of mRNA degradation is determined by 3′-UTR elements. Hence, researchers are interested in studying mRNA dynamics as a function of 3′-UTR elements. A recent study measured the mRNA degradation dynamics of tens of thousands of 3′-UTR sequences using a massively parallel reporter assay. However, the computational approach used to model mRNA degradation was based on a simplifying assumption of a linear degradation rate. Consequently, the underlying mechanism of 3′-UTR elements is still not fully understood.

Results

Here, we developed deep neural networks to predict mRNA degradation dynamics and interpreted the networks to identify regulatory elements in the 3′-UTR and their positional effect. Given an input of a 110 nt-long 3′-UTR sequence and an initial mRNA level, the model predicts mRNA levels of eight consecutive time points. Our deep neural networks significantly improved prediction performance of mRNA degradation dynamics compared with extant methods for the task. Moreover, we demonstrated that models predicting the dynamics of two identical 3′-UTR sequences, differing by their poly(A) tail, performed better than single-task models. On the interpretability front, by using Integrated Gradients, our convolutional neural networks (CNNs) models identified known and novel cis-regulatory sequence elements of mRNA degradation. By applying a novel systematic evaluation of model interpretability, we demonstrated that the recurrent neural network models are inferior to the CNN models in terms of interpretability and that random initialization ensemble improves both prediction and interoperability performance. Moreover, using a mutagenesis analysis, we newly discovered the positional effect of various 3′-UTR elements.

Usage

usage: DeepUTR.py [-h] [--train TRAIN] [--predict PREDICT]
                  [--evaluate EVALUATE] [--model_type MODEL_TYPE]
                  [--NN_type NN_TYPE] [--data_type DATA_TYPE]
                  [--conventional_model CONVENTIONAL_MODEL]
                  [--input_model_path_1 INPUT_MODEL_PATH_1]
                  [--input_model_path_2 INPUT_MODEL_PATH_2]
                  [--input_sequences INPUT_SEQUENCES]
                  [--input_A_minus_initial INPUT_A_MINUS_INITIAL]
                  [--input_A_plus_initial INPUT_A_PLUS_INITIAL]
                  [--input_A_minus_labels INPUT_A_MINUS_LABELS]
                  [--input_A_plus_labels INPUT_A_PLUS_LABELS]
                  [--input_split_indices INPUT_SPLIT_INDICES]
                  [--output_path OUTPUT_PATH]

DeepUTR - train, predict, or evaluate. Note: folder path must end with '/'

optional arguments:
  -h, --help            show this help message and exit
  --train TRAIN         Perform model type training. Options: '0' (default) -
                        do not perform training. '1' - perform training.
  --predict PREDICT     Perform model type prediction. Avilable only for NN
                        models. Options: '0' (default) - do not perform
                        prediction. '1' - perform prediction.
  --evaluate EVALUATE   Perform model type evaluation. Options: '0' (default)
                        - do not perform evaluation. '1' - perform evaluation.
  --model_type MODEL_TYPE
                        Model_type. Options: 'dynamics' (default) - mRNA
                        degradation dynamics model. 'rate' - mRNA degradation
                        rate model.
  --NN_type NN_TYPE     Neural network architecture. Options: 'CNN' (default)
                        - CNN architecture. 'RNN' - RNN architecture.
  --data_type DATA_TYPE
                        Input 3'UTR sequences data type. Options: 'minus'
                        (default) - non-tailed with poly(A). 'plus' - tailed
                        with poly(A). 'minus_plus' - both non-tailed and
                        tailed with poly(A); for multi-task models.
  --conventional_model CONVENTIONAL_MODEL
                        Conventional model type. Conventional models must be
                        created and trained before used. If a conventional
                        model is used, then 'NN_type' is ignored and
                        'data_type' can not support '-+' input. options:
                        'false' (default) - do not use conventional model.
                        'lasso' - Lasso model. 'RF' - Random Forest model.
  --input_model_path_1 INPUT_MODEL_PATH_1
                        Path of the model. For ensemble model path provide the
                        dirctory path containing the ensemble components (only
                        NN models support ensemble). Default path for NN
                        models are the DeepUTR trained models dirctory
                        corresponding to model_type, NN_type, data_type. For
                        conventional model, you must provide model path, and
                        this is the path of the A- model
  --input_model_path_2 INPUT_MODEL_PATH_2
                        Used only for conventional models. Path of the model
                        A+ model. If needed you must provide this model path
  --input_sequences INPUT_SEQUENCES
                        Path of the sequences input. Default path is the UTR-
                        seq dataset sequences.
  --input_A_minus_initial INPUT_A_MINUS_INITIAL
                        Path of the A- initial mRNA level input for mRNA
                        degradation dynamics models. Used only for predicting.
                        None default path is provided.
  --input_A_plus_initial INPUT_A_PLUS_INITIAL
                        Path of the A+ initial mRNA level input for mRNA
                        degradation dynamics models. Used only for predicting.
                        None default path is provided.
  --input_A_minus_labels INPUT_A_MINUS_LABELS
                        Path of the A- labels input. Used for training and
                        evaluation. Default path is the UTR-seq A- dataset
                        labels.
  --input_A_plus_labels INPUT_A_PLUS_LABELS
                        Path of the A+ labels input. Used for training and
                        evaluation. Default path is the UTR-seq A+ dataset
                        labels.
  --input_split_indices INPUT_SPLIT_INDICES
                        Path of the file containing indices of splitting the
                        dataset for train, validation, and test. Used for
                        training and evaluation. Options: 'false' - do not
                        split the dataset (in this case, in training, the
                        dataset will be splitted randomly). 'random' - random
                        split (see code for detailes). Default option is the
                        path of the file containing indices of splitting the
                        UTR-seq dataset.
  --output_path OUTPUT_PATH
                        Path for the outputs. Default path is the files
                        dirctory of DeepUTR.

Requirements:

The code was tested with:
Python interpreter == 3.6.6
Python packages required for using DeepUTR:
numpy == 1.18.5
pandas == 1.1.2
scikit-learn == 0.23.0
scipy == 1.4.1
tensorflow == 2.3.0

Python packages required for running integrated gradient and TF-MoDISco analysis:
modisco==0.5.9.0
logomaker==0.8
tqdm==4.46.1
zipp==3.1.0

Note: There might be other packages needed. Please contact us in case of any problem.