-
Notifications
You must be signed in to change notification settings - Fork 5
/
primer_on_logistic_regression_in_r.qmd
122 lines (92 loc) · 4.13 KB
/
primer_on_logistic_regression_in_r.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
# Primer on Logistic Regression in `R` {.unnumbered}
```{r}
#| echo: false
#| eval: true
#| message: false
library("tidyverse")
library("broom")
load(file = "data/_raw/gravier.RData")
```
## Logistic Regression
```{r}
gravier_wide <- gravier |>
bind_cols() |>
as_tibble()
gravier_wide
```
What if we wanted to understand how a continous variable like e.g. `gene_expression_level` of a given gene, determined whether the plant would survive or sucomb to heat shock at a given temperature? In that case, we would be interested in understanding `survived` yes/no as a function of `gene_expression_level`:
$$survived_{yes/no} \sim gene\_expresssion\_level$$
Moreover, we are interested in estimating the *probability* of survival given a certain `gene_expression_level`.
### Data
As before, let' us simply simulate some data and let's say that we have 100 plants at 50 degrees celsius and the mean `gene_expression_level` for plants, which did not survive (denoted `0`) was `90` and for those who did survive (denoted `1`), it was `100`:
```{r}
set.seed(718802)
mean_survived_no <- 90
mean_survived_yes <- 100
survival_data <- tibble(
gene_expression_level = c(rnorm(n = 50,
mean = mean_survived_no,
sd = 3),
rnorm(n = 50,
mean = mean_survived_yes,
sd = 3)),
survived = c(rep(x = 0, times = 50), rep(x = 1, times = 50))
)
survival_data |>
sample_n(10)
```
### Visualising
As mentioned, we are interested in `survived` as a function of `gene_expression_level`, visualising this looks like so:
```{r}
#| echo: true
#| eval: true
#| message: false
#| fig-align: center
survival_data |>
ggplot(aes(x = gene_expression_level,
y = survived)) +
geom_point(alpha = 0.5,
size = 3)
```
Doing what we did before with a straight line evidently isn't super meaningful. What we're interested in understanding is at what `gene_expression_level` does the plant not survive/survive? Clearly, at e.g. `80`, the plant does not survive and at `105` it clear does! But what about `95`? Well, that isn't really clear. Here, plants are both observed to survive and not-survive. What about `93`? Here, most do survive, but not all, although it seems that there is more chance of not-surviving and vice versa for `97`, than surviving. It's this *chance* we're interested in. So, at different values of `gene_expression_levels`, what is the probability of surviving-/not-surviving respectively? This is the quesion a logistic regression answers, so let's get to it!
### Modelling
To do a logistic regression, we use the `glm()`-function:
```{r}
my_glm_mdl <- glm(formula = survived ~ gene_expression_level,
family = binomial(link = "logit"),
data = survival_data)
my_glm_mdl
```
Now, with the model in place, we can visualise. Below here, the points are the observed data and the line is the model of how the probability of survival changes with `gene_expression_level`:
```{r}
survival_data |>
mutate(my_glm_model = pluck(my_glm_mdl, "fitted.values")) |>
ggplot(aes(x = gene_expression_level, y = survived)) +
geom_point(alpha = 0.5,
size = 3) +
geom_line(aes(y = my_glm_model))
```
This allows us to answer the question from before:
- At `80`, the plant does not survive?
- At `105` it clearly does!
- What about `95`?
- What about `93`?
- Is it vice versa for `97`?
```{r}
predict.glm(object = my_glm_mdl,
newdata = tibble(gene_expression_level = c(80, 105, 95, 93, 97)),
type = "response")
```
So:
- At `80`, the plant does not survive? **TRUE, the probability of survival is close to zero**
- At `105` it clearly does! **TRUE, the probability of survival is close to one**
- What about `95`? **Around 60% survival probability**
- What about `93`? **Around 20% survival probability**
- Is it vice versa for `97`? **Around 90% survival probability**
Again, this model object is a bit quirke, so `broom` to the rescue:
```{r}
my_glm_mdl |>
tidy(conf.int = TRUE,
conf.level = 0.95) |>
mutate(estimate = exp(estimate))
```