---
title: "Expression Prediction"
author: "Anshul Kundaje, Oana Ursu, and Sean Davis"
date: "7/8/2018"
output: html_document
---
# Background
In this short set of exercises, you will use histone mark signals near a gene to predict its expression:
$$y = h_1 + h_2 + h_3 + \dots$$
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, cache=TRUE, message=FALSE)
```
We will try a couple of different approaches:
1. Penalized regression
2. Random forests
## Penalized regression
Adapted from http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/153-penalized-regression-essentials-ridge-lasso-elastic-net/.
### Ridge regression
Ridge regression shrinks the regression coefficients, so that variables with a minor contribution to the outcome have coefficients close to zero. The shrinkage is achieved by penalizing the regression model with a penalty term called the L2-norm, which is the sum of the squared coefficients. The amount of the penalty can be fine-tuned using a constant called lambda (λ), and selecting a good value for λ is critical. When λ = 0, the penalty term has no effect and ridge regression produces the classical least squares coefficients. As λ increases toward infinity, the impact of the shrinkage penalty grows and the ridge regression coefficients approach zero.
Note that, in contrast to ordinary least squares regression, ridge regression is highly affected by the scale of the predictors. Therefore, it is better to standardize (i.e., scale) the predictors before applying ridge regression (James et al. 2014), so that all the predictors are on the same scale. The standardization of a predictor x can be achieved using the formula x' = x / sd(x), where sd(x) is the standard deviation of x. As a consequence, all standardized predictors have a standard deviation of one, so the final fit does not depend on the scale on which the predictors were measured.
One important advantage of ridge regression is that it still performs well, compared to ordinary least squares, when you have large multivariate data in which the number of predictors (p) is larger than the number of observations (n). One disadvantage is that it includes all the predictors in the final model, unlike stepwise regression methods, which generally select models involving a reduced set of variables. Ridge regression shrinks the coefficients towards zero, but it never sets any of them exactly to zero. Lasso regression is an alternative that overcomes this drawback.
### Lasso regression
Lasso stands for _Least Absolute Shrinkage and Selection Operator_. It shrinks the regression coefficients toward zero by penalizing the regression model with a penalty term called the L1-norm, which is the sum of the absolute values of the coefficients. In lasso regression, the penalty has the effect of forcing some of the coefficient estimates, those with a minor contribution to the model, to be exactly zero. This means the lasso can also be seen as an alternative to subset selection methods for performing variable selection in order to reduce model complexity. As in ridge regression, selecting a good value of λ for the lasso is critical.
One obvious advantage of lasso regression over ridge regression is that it produces simpler and more interpretable models that incorporate only a reduced set of the predictors. However, neither ridge regression nor the lasso universally dominates the other. Generally, the lasso tends to perform better when a few predictors have large coefficients and the remaining predictors have very small coefficients, while ridge regression performs better when the outcome is a function of many predictors, all with coefficients of roughly equal size (James et al. 2014).
Cross-validation can be used to identify which of these two techniques works better on a particular data set.
### Elastic Net
Elastic Net produces a regression model that is penalized with both the L1-norm and the L2-norm. The consequence is that it effectively shrinks coefficients (as in ridge regression) and sets some coefficients to zero (as in the lasso).
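To make the three penalties concrete, here is a rough summary in symbols, written in the same parameterization that `glmnet` uses later in this lab ($\lambda$ is the overall penalty strength, $\alpha$ is the mixing parameter, and $\mathrm{RSS}(\beta)$ is the residual sum of squares):

$$\text{ridge:}\qquad \min_{\beta}\; \mathrm{RSS}(\beta) + \lambda \sum_{j} \beta_j^2$$

$$\text{lasso:}\qquad \min_{\beta}\; \mathrm{RSS}(\beta) + \lambda \sum_{j} |\beta_j|$$

$$\text{elastic net:}\qquad \min_{\beta}\; \mathrm{RSS}(\beta) + \lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right]$$

Setting $\alpha = 1$ recovers the lasso penalty and $\alpha = 0$ the ridge penalty (up to a constant factor of 1/2).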
# Install packages
Do this once, at the beginning of the lab.
```{r eval=FALSE}
install.packages("caret", dependencies=T)
install.packages(c("randomForest", "glmnet"), dependencies=T)
```
```{r}
require(caret)
require(randomForest)
require(glmnet)
```
# Load the preprocessed data
```{r}
# Feature matrix: histone-mark signal features for each gene
fullFeatureSet <- read.table("http://seandavi.github.io/ITR/expression-prediction/features.txt")
# Target vector: the expression value we want to predict for each gene
target <- scan(url("http://seandavi.github.io/ITR/expression-prediction/target.txt"), skip=1)
```
You can see the full list of features in `fullFeatureSet` using:
```{r}
colnames(fullFeatureSet)
```
Take a few minutes to try to understand the relationships between the predictor variables, their scale, etc.
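For example (a minimal sketch; which summaries and plots you find most useful is up to you, and this assumes all columns of `fullFeatureSet` are numeric):

```{r}
# Scale and spread of each predictor
summary(fullFeatureSet)
# Pairwise correlations between the histone-mark features
round(cor(fullFeatureSet), 2)
# Distributions side by side; note how different the scales can be
boxplot(fullFeatureSet, las=2, main="Feature distributions")
```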
- For more background on the elastic net, see https://web.stanford.edu/~hastie/TALKS/enet_talk.pdf
# Machine Learning
## Lasso
In this section, we will run the lasso procedure.
```{r}
features <- fullFeatureSet
```
```{r}
library(caret)
# How to split into train/validation using cross-validation
fitControl <- trainControl(
    method="repeatedcv",
    number=10,
    ## repeated once
    repeats=1,
    verboseIter=T)
```
We are going to try a number of different models, so here we create
a range of parameters to investigate.
- alpha is the elastic net mixing parameter: $$\frac{1-\alpha}{2}\lVert\beta\rVert_2^2+\alpha\lVert\beta\rVert_1$$
- lambda is the regularization parameter governing the relative importance of minimizing the error versus keeping the betas small
```{r lassogrid}
# setting alpha=1 specifies LASSO regression penalty
lassoGrid <- expand.grid(alpha=1, lambda=10^seq(-6, 0, 1))
```
Now, we train the model using cross-validation to find the "best"
parameters.
```{r}
lassoFit <- train(features, target, method="glmnet",
trControl=fitControl, tuneGrid=lassoGrid)
lassoModel <- lassoFit$finalModel
```
What metric is being used? Hint: print `names(lassoFit)` to see the elements of the fit object and find the one that holds the metric.
Printing the `lassoFit` object gives the overall performance.
```{r}
names(lassoFit)
print(lassoFit)
```
We can also look at the performance per cross-validation fold (for the best model):
```{r}
print(lassoFit$resample)
```
We can inspect the coefficients for different values of the
L1 penalty lambda. Play around with the value of `s` and see what happens.
```{r}
print(coef(lassoModel, s=1e-4))
print(coef(lassoModel, s=1))
```
We can also plot the entire regularization path.
The numbers labeling the curves are the feature (column) indices; to get the name of a
feature, you can use `colnames(features)[10]`, for instance.
The numbers along the top axis are the numbers of nonzero coefficients.
```{r}
plot(lassoModel, "lambda", label=T)
```
The final, trained model is in `lassoFit`.
We can plot the predictions vs original targets.
```{r}
lassoPreds <- predict(lassoFit, newdata=features)
plot(target, lassoPreds)
```
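To put numbers on the fit rather than judging the scatterplot by eye, caret's `postResample()` computes RMSE, R-squared, and MAE for a set of predictions. Keep in mind that we are predicting on the same data the model was trained on, so these values will be optimistic compared to the cross-validation estimates.

```{r}
# Training-set RMSE, R-squared, and MAE for the lasso predictions
postResample(pred=lassoPreds, obs=target)
```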
## Random Forests
```{r}
library(caret)
```
```{r}
# mtry is the number of predictors randomly sampled at each split;
# floor(p/3) is the randomForest default for regression
rfGrid <- expand.grid(mtry=floor(ncol(features)/3))
```
```{r}
randomforestFit <- train(features, target, method="rf",
trControl=fitControl, tuneGrid=rfGrid,
ntree=100)
rfModel <- randomforestFit$finalModel
```
The overall performance is:
```{r}
print(randomforestFit)
```
We can also look at the performance per cross-validation fold.
```{r}
print(randomforestFit$resample)
```
The variable importance gives a measure of the relative contribution of each variable
(histone mark) to the expression prediction. Larger values reflect greater feature importance.
```{r}
print(rfModel$importance[order(rfModel$importance, decreasing=T),])
```
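A quick way to visualize these importances is a bar plot; a minimal sketch with base graphics:

```{r}
imp <- rfModel$importance[, 1]
# Widen the left margin so the histone-mark names fit
par(mar=c(5, 8, 4, 2))
barplot(sort(imp), horiz=TRUE, las=1,
        main="Random forest variable importance")
```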
The final trained model is in `randomforestFit`. Again, we can plot the predictions against the original targets.
```{r}
randomforestPreds <- predict(randomforestFit, newdata=features)
plot(target, randomforestPreds)
```
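Because both models were tuned with the same `fitControl` object, we can collect their cross-validation results and view them side by side with caret's `resamples()`. (For a strictly paired, fold-by-fold comparison you would fix the resampling indices via the `index` argument of `trainControl()`; this sketch just puts the two sets of resampled metrics next to each other.)

```{r}
# Gather the cross-validation metrics from both fits
resamps <- resamples(list(lasso=lassoFit, rf=randomforestFit))
summary(resamps)
# Compare RMSE across folds
dotplot(resamps, metric="RMSE")
```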
# Further exploration
- How does the performance of random forests compare to the lasso?
(you can look at the output of `print(lassoFit)` and `print(randomforestFit)`)
- How do other models perform? You can try other models by changing the
`method` parameter in the `train` call. Some suggestions for models:
linear regression ("lm") and regression trees ("rpart2").
- Construct a proper test set and re-run the analyses
- How do the individual histone marks contribute to the accuracy of the
predictions? You can formulate hypotheses about which marks are important
and only include those in the feature matrix when learning your model to
see how they do. We provide some code below to help you with this.
We can experiment with the weights that lasso regression produces when given
a subset of the features. First, create a character vector containing the names
of a subset of the features:
```{r}
featureSubset <- c("Control", "H3k4me1", "H3k4me2", "H2az", "H3k27me3",
"H3k36me3", "H3k9me1", "H3k9me3", "H4k20me1")
```
Now create the variable `features`, which contains this subset of the features:
```{r}
features <- fullFeatureSet[featureSubset]
```
Now, rerun the lasso regression with the subset.
```{r}
lassoFit <- train(features, target, method="glmnet",
trControl=fitControl, tuneGrid=lassoGrid)
lassoModel <- lassoFit$finalModel
```
We now generate a plot in which the y-axis shows the coefficients
assigned to the various features by the lasso, the bottom x-axis is the
log of the regularization parameter lambda, and the top x-axis is the number
of non-zero coefficients for that particular value of the regularization parameter.
The numbers labeling the lines correspond to the indices of the features in
`featureSubset`.
```{r}
plot(lassoModel, "lambda", label=T)
```
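To tie the numbered curves in the plot back to concrete histone marks, you can also inspect which coefficients are non-zero at the value of lambda chosen by cross-validation (stored by caret in `lassoFit$bestTune`); a short sketch:

```{r}
# Coefficients of the subset model at the cross-validated best lambda
coef(lassoModel, s=lassoFit$bestTune$lambda)
```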