-
Notifications
You must be signed in to change notification settings - Fork 36
/
Copy pathwinequality-red.Rmd
174 lines (135 loc) · 6.25 KB
/
winequality-red.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
---
title: "Bayesian variable selection for red wine quality ranking data"
author: "[Aki Vehtari](https://users.aalto.fi/~ave/)"
date: "First version 2018-02-27. Last modified `r format(Sys.Date())`."
output:
html_document:
fig_caption: yes
toc: TRUE
toc_depth: 2
number_sections: TRUE
toc_float:
smooth_scroll: FALSE
bibliography: modelsel.bib
csl: harvard-cite-them-right.csl
link-citations: yes
---
# Setup {.unnumbered}
```{r setup, include=FALSE}
knitr::opts_chunk$set(cache=FALSE, message=FALSE, error=FALSE, warning=FALSE, comment=NA, out.width='95%')
```
**Load packages**
```{r}
library(rstan)
library(rstanarm)
library(loo)
library(bayesplot)
theme_set(bayesplot::theme_default())
library(projpred)
SEED=170701694
```
# Introduction
This notebook was inspired by Eric Novik's slides "Deconstructing Stan Manual Part 1: Linear". The idea is to demonstrate how easy it is to do good variable selection with [`rstanarm`](https://cran.r-project.org/package=rstanarm), [`loo`](https://cran.r-project.org/package=loo), and [`projpred`](https://cran.r-project.org/package=projpred).
In this notebook we illustrate Bayesian inference for model selection, including PSIS-LOO [@Vehtari+etal:PSIS-LOO:2017] and projection predictive approach [@Piironen+etal:projpred:2020; @Piironen+Vehtari:2017a] which makes decision theoretically justified inference after model selection..
# Wine quality data
We use [Wine quality data set from UCI Machine Learning repository](https://archive.ics.uci.edu/ml/datasets/wine+qualitycandy)
```{r}
d <- read.delim("winequality-red.csv", sep = ";")
dim(d)
```
Remove duplicated
```{r}
d <- d[!duplicated(d), ] # remove the duplicates
(p <- ncol(d))
(n <- nrow(d))
names(d)
prednames <- names(d)[1:(p-1)]
```
We scale the covariates so that when looking at the marginal posteriors for the effects they are on the same scale.
```{r}
ds <- scale(d)
df <- as.data.frame(ds)
```
# Fit regression model
The `rstanarm` package provides `stan_glm` which accepts same arguments as `glm`, but makes full Bayesian inference using Stan ([mc-stan.org](https://mc-stan.org)). By default a weakly informative Gaussian prior is used for weights.
```{r, results='hide'}
formula <- formula(paste("quality ~", paste(prednames, collapse = " + ")))
fitg <- stan_glm(formula, data = df, QR=TRUE, seed=SEED, refresh=0)
```
Let's look at the summary:
```{r}
summary(fitg)
```
We didn't get divergences, Rhat's are less than 1.1 and n_eff's are useful (see, e.g., [RStan workflow](http://mc-stan.org/users/documentation/case-studies/rstan_workflow.html)).
```{r}
mcmc_areas(as.matrix(fitg), pars=prednames, prob_outer = .95)
```
Several 95% posterior intervals are not overlapping 0, so maybe there is something useful here.
# Cross-validation checking
In case of collinear variables it is possible that marginal posteriors overlap 0, but the covariates can still useful for prediction. With many variables it will be difficult to analyse joint posterior to see which variables are jointly relevant. We can easily test whether any of the covariates are useful by using cross-validation to compare to a null model,
```{r}
fitg0 <- stan_glm(quality ~ 1, data = df, seed=SEED, refresh=0)
```
We use fast Pareto smoothed importance sampling leave-one-out cross-validation [@Vehtari+etal:PSIS-LOO:2017]
```{r}
(loog <- loo(fitg))
(loog0 <- loo(fitg0))
loo_compare(loog0, loog)
```
Based on cross-validation covariates together have a high predictive power. If we need just the predictions we can stop here, but if we want to learn more about the relevance of the covariates we can continue with variable selection.
# Projection predictive variable selection
We make the projective predictive variable selection [@Piironen+etal:projpred:2020; @Piironen+Vehtari:2017a] using `projpred` package. A fast PSIS-LOO [@Vehtari+etal:PSIS-LOO:2017] is used to choose the model size.
```{r, results='hide'}
fitg_cv <- cv_varsel(fitg, method='forward', cv_method='loo', n_loo=nrow(df))
```
We can now look at the estimated predictive performance of smaller models compared to the full model.
```{r}
plot(fitg_cv, stats = c('elpd', 'rmse'))
```
Three or four variables seems to be needed to get the same performance as the full model.
We can get a loo-cv based recommendation for the model size to choose.
```{r}
(nsel <- suggest_size(fitg_cv, alpha=0.1))
(vsel <- solution_terms(fitg_cv)[1:nsel])
```
projpred recommends to use four variables: alcohol, volatile.acidity, sulphates, and chlorides.
Next we form the projected posterior for the chosen model. This projected model can be used in the future to make predictions by using only the selected variables.
```{r}
projg <- project(fitg_cv, nv = nsel, ns = 4000)
round(colMeans(as.matrix(projg)), 1)
round(posterior_interval(as.matrix(projg)), 1)
```
The marginals of projected posteriors look like this.
```{r}
mcmc_areas(as.matrix(projg), pars = vsel)
```
# Alternative regularized horseshoe prior
We also test regularized horseshoe prior [@Piironen+Vehtari:RHS:2017] which has more prior mass near 0.
```{r, results='hide'}
p0 <- 5 # prior guess for the number of relevant variables
tau0 <- p0/(p-p0) * 1/sqrt(n)
hs_prior <- hs(df=1, global_df=1, global_scale=tau0)
fitrhs <- stan_glm(formula, data = df, prior=hs_prior,
seed=SEED, refresh=0)
```
```{r}
mcmc_areas(as.matrix(fitrhs), pars=prednames, prob_outer = .95)
```
Many of the variables are shrunk more towards 0, but still based on these marginals it is not as easy to select the most useful variables as it is with projpred.
The posteriors with normal and regularized horseshoe priors are clearly different, but does this have an effect to the predictions? In case of collinearity prior may have a strong effect on posterior, but a weak effect on posterior predictions. We can use loo to compare
```{r}
(loorhs <- loo(fitrhs))
loo_compare(loog, loorhs)
```
There is no difference in predictive performance and thus we don't need to repeat the projpred variable selection for the model with regularized horseshoe prior.
<br />
# References {.unnumbered}
<div id="refs"></div>
# Licenses {.unnumbered}
* Code © 2017-2018, Aki Vehtari, licensed under BSD-3.
* Text © 2017-2018, Aki Vehtari, licensed under CC-BY-NC 4.0.
# Original Computing Environment {.unnumbered}
```{r}
sessionInfo()
```
<br />