-
Notifications
You must be signed in to change notification settings - Fork 0
/
09_logistic_analysis.qmd
189 lines (150 loc) · 6.34 KB
/
09_logistic_analysis.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
---
title: "09_logistic_regression"
format: html
editor: visual
---
```{r}
library("tidyverse")
library("ggpubr")
library("forcats")
library("purrr")
library("broom")
library("ggplot2")
library("dplyr")
```
```{r}
dat_aug_03 <- read_tsv("~/projects/Group_19_project/data/03_dat_aug.tsv")
```
## Child Pugh Score as a predictor of mortality
In the following analysis we would like to check the CPS as a predictor to mortality. We will try to figure out whether a higher CPS is associated with higher odds of mortality. To do that, we will use logistic regression.
```{r}
df_logistic <- dat_aug_03 |>
mutate(`Mortality (1 = y/0 = n)` = ifelse(Mortality == "No death", 0, 1)) |>
pivot_wider(
names_from = Liver_disease,
values_from = Liver_disease
) |>
mutate(across(10:19, ~ifelse(is.na(.), 0, 1)))
```
we started by adding the mortality variable to our dataframe. The new variable is binary and its outcome is 0 for no, and 1 for yes.
```{r}
df_logistic
```
Lets try to plot the variables that we're about to analyze:
```{r}
mortality_CPS <- df_logistic |>
ggplot(mapping = aes(x = CPS,
fill = Mortality,
na.rm = TRUE)) +
geom_bar(alpha = 0.6,
position = "fill") +
scale_fill_viridis_d() +
labs(title = "Mortality by CPS",
x = "CPS",
y = "Mortality") +
theme(plot.title = element_text(face = "bold", size = 13))
mortality_CPS
```
In the following bar chart, we can see the proportions for each CPS and mortality. We can see some trend as the yellow stacks of the bars, which represent the no death observations, are on a decrease as the CPS level increase. In other words, we can see a connection between death and high CPS level. In the following analysis we will try to find this connection.
```{r}
ggsave("mortality_CPS.png",
path = "../results/plots/",
plot = mortality_CPS,
width = 7,
height = 3)
```
### Creation of a first model using the 'lg()' function
We chose logistic regression since our outcome is binary (Mortality: yes=1, no=0). We start by creating a logistic regression model:
```{r}
logistic_model <- glm(`Mortality (1 = y/0 = n)` ~ CPS, data=df_logistic, family = "binomial")
logistic_intervals <- confint(logistic_model)
logistic_summary <- summary(logistic_model)
```
The logistic interval are the variables that we are going to based the odds ratio on.
```{r}
logistic_summary
```
Next, we would like to see our results tidy and readable.
```{r}
logistic_intervals<-as.tibble(logistic_intervals) |>
mutate(term = c("(intercept)", "CPS"))|>
relocate(term)|>
rename(conf_high = `97.5 %`,
conf_low = `2.5 %`)
logistic_intervals
```
```{r}
logistic_results <- tidy(logistic_model)
logistic_results
```
We would like to have all of the results in one tibble, therefore, we can join the tibbles of the results and the intervals by the column term:
```{r}
logistic_result_intervals<- inner_join(logistic_results, logistic_intervals,
by = join_by(term == term))
logistic_result_intervals
```
The results that we have are the log-odds ratio for mortality. We will now exponentiate those results to make it easier to interpret.
```{r}
logistic_result_intervals<- logistic_result_intervals |>
mutate(exp_estimate = exp(estimate),
exp_conf_low = exp(conf_low),
exp_conf_high = exp(conf_high))
logistic_result_intervals
```
```{r}
boxplot_logistic_confidence_interval <- ggplot(
data = logistic_result_intervals,
mapping = aes(x = exp_estimate,
y = term)) +
geom_errorbarh(aes(xmin = exp_conf_low,
xmax = exp_conf_high)) +
geom_point()+
geom_hline(yintercept = 0, linetype = "dashed", color = "black", size = 0.5) +
geom_vline(xintercept = 1, linetype = "dashed", color = "red", size = 0.5) +
labs(
title = "CPS associated with Mortality",
x = "Estimates(95% CIs)",
y ="") +
theme(plot.title = element_text(face = "bold", size = 12.5))+
scale_x_continuous(limits = c(0, 2))
boxplot_logistic_confidence_interval
ggsave("boxplot_logistic_confidence_interval.png",
path = "../results/plots/",
plot = boxplot_logistic_confidence_interval,
width = 7,
height = 2)
```
Now that we have an odds-ratio results, we can understand from the logistic regression that for every one CPS increase, the likelihood for mortality increase by 50%. This is pretty significant information and we can use it to keep high CPS patients under a tighter hospitalization. Furthermore, the results for this analysis are statistically significant since the p-value much lower than 0.05. Also, accoarding to the confidence interval, we can say that 95% confidence that the odds ration will be between 1.39 and 1.62. These results are well fit with the intuition that we had in the 05_analysis_mortality, where we suspected that CPS is connected with mortality. Let us recall:
```{r}
bar_plot_CPS_Mortality<- dat_aug_03 |>
ggplot(aes(x = Mortality,
y = CPS,
fill = Mortality)) +
geom_boxplot(alpha = 0.7) +
labs(title = "Boxplot chart of CPS values",
subtitle = "stratified by mortality",
y = "CPS") +
scale_fill_viridis_d() +
theme_minimal() +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank()) +
scale_x_discrete(guide = guide_axis(angle = 25))
bar_plot_CPS_Mortality
```
```{r}
logistic_sigmoid_func <- ggplot(df_logistic, aes(x=CPS, y=`Mortality (1 = y/0 = n)`)) +
geom_point(alpha=.5) +
stat_smooth(method="glm", se=FALSE, method.args = list(family=binomial)) +
theme_minimal()
logistic_sigmoid_func
```
The logistic curve is not balanced. One possible suggestion can be due to the fact that we have 190 deaths vs. 875 survivors in our dataset.
```{r}
ggsave("logistic_sigmoid_func.png",
path = "../results/plots/",
plot = logistic_sigmoid_func,
width = 7,
height = 5)
```
## Conclusion
This analysis assesses that high CPS values have a crutial impact on patients mortality. Of course that we could have some intuition for that before, as we could see in the box plot and the bar chart for this analysis, but according to our data, the odds ratio for it is 1.5. In other words, for every unit increase in CPS, the patients have an increased likelihood for mortality by 50%. This demonstrates how dangerous can cirrhosis be if not treating it as it should.