-
Notifications
You must be signed in to change notification settings - Fork 0
/
index.Rmd
1307 lines (861 loc) · 43.3 KB
/
index.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "CTS Mortality Risk Prediction Project"
author: "Stephanie Tring"
output:
html_document:
toc: true
toc_float: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
echo=TRUE,
warning=FALSE,
eval=TRUE,
cache=FALSE,
error=FALSE,
message=FALSE)
```
# Introduction
In this study a research study conducted by the USC Public Health Data Science Department, OSHPD hospitalization records from the California Teachers Study (CTS) were analyzed. The data from the CTS study contains extensive hospital recording and survey data on 48324 participants spanning from 2011-2018.
The primary objective of this project are:
1. Develop a prediction model to predict short-term risk of death based on prior in-patient hospitalization records from this CTS study.
2. Assess whether time window after hospitalliation plays a role in predicting risk of death.
3. Identify whether certain co-morbidities contributes to predicting the risk of death.
4. To determine whether there are any temporal(seasonal) or spatial (zip code) correlation with hospitalization-related deaths.
# Methods
1. To analyze the data, the initial dataset will be assess under exploratory data analysis methods (EDA). Under EDA, the data will be explored based on key factors such as dimension of the dataset to asses number of participants and number of variables as well as understanding the main aim the study, which is the variable deceased.
2. Preprocessing of the data. The dataset will be preprocessed to clean/manage to perform downstream analysis. This includes removing of duplicates and any other dataset that would skew or create inaccurate analysis.
3. Data Analysis of specific aims. The dataset will be categorized and parsed to study the main objectives such as timepoints of time to deceased of participants, co-morbidities in relations to deceased and time to deceased, length of stay in hospital in relation to deceased. Basic visualization and statistical pearson's correlation analysis will be performed on these topics.
4. Prediction model will be created to predict risk to death based on hospitalization records as well as specific variables as listed above. Random Forest model will be used to predict overall risk of death as well as predict risk of death based on time to deceased.
5. Spatial significance in relation to death rate will be analyzed. As economic status is highly correlated to zip code, this information will be assessed over temporal data to check whether there is a spatial relationship with rate of death as this may also infer socioeconimic status relationship with mortality rate.
# EDA
Loading data
```{r read in data, echo=FALSE, warning=FALSE, message= FALSE}
library(dplyr)
params <- c("S:/Researcher Projects/Self Service Analysis Projects/10017_USCDAT/2022/Stephanie Tring/",
"O:/Datasets/10017_USCDAT/v01/10017_USCDAT_v01_20210609_2220_formats.csv",
"O:/Datasets/10017_USCDAT/v01/10017_USCDAT_v01_20210609_2220_analytic_data.csv",
"O:/Datasets/10017_USCDAT/uscdat_oshpd_formats.csv",
"O:/Datasets/10017_USCDAT/uscdat_oshpd_2022.csv")
setwd(params[1])
# reading in data types for survey data
data_types_file <- read.csv(file = params[2],
na.strings = "",
colClasses = "character")
# reading in data types for hospitalization data
oshpd_data_types_file <- read.csv(file = params[4],
na.strings = "",
colClasses = "character")
# creating a named character vector to use in assigning character and Date types for survey data
data_types <- data_types_file[, 2]
names(data_types) <- data_types_file[, 1]
data_types <- data_types[data_types=="character" | data_types=="Date"]
# creating a named character vector to use in assigning Date types for hospitalization data
oshpd_data_types <- oshpd_data_types_file[, 2]
names(oshpd_data_types) <- oshpd_data_types_file[, 1]
oshpd_data_types <- oshpd_data_types[oshpd_data_types=="Date"]
# reading in survey data and assigning data types
analytic_data <- read.csv(file = params[3],
na.strings = "",
colClasses = data_types)
# reading in hospitalization data file
oshpd_data <- read.csv(file = params[5],
na.strings = "")
# converting date fields
oshpd_data[names(oshpd_data_types)] <- lapply(oshpd_data[names(oshpd_data_types)], as.Date, "%m/%d/%Y")
# dropping columns that already exist in analytic_data
oshpd_data[c("date_of_birth_dt", "date_of_death_dt", "cause_of_death_cde", "cause_of_death_dsc",
"qnr_1_fill_dt", "qnr_2_fill_dt", "ses_quartile_ind", "first_moveout_ca_dt")] <- list(NULL)
# joining survey data and hospitalization data on participant_key
combined_data <- inner_join(analytic_data, oshpd_data) # n = 154315
# counting unique participants in the data
length(unique(combined_data$participant_key)) # n = 48324
```
There are 48324 participants in this study.
```{r length frequency of data, echo=FALSE, warning=FALSE, message=FALSE}
library(tidyverse)
library(kableExtra)
#checking how many participants and how many variables
#dim(combined_data)
#find the % of death among participants and convert to factor for analysis
combined_data%>%
count(deceased)
# remove duplicate data
workingdata <- combined_data %>%
dplyr::distinct(participant_key, .keep_all=TRUE)
#dim(workingdata)
workingdata%>%
count(deceased)%>%
kbl(col.names = c("Deceased", "Number of Participants")) %>%
kable_styling()
```
This data set has 154315 participants and 164 variables. Of the total participants, 83404 were noted as deceased, however after removing duplicates, the data has 48324 participants and 164 variables with 18474 recorded as deceased and 29850 alive.
## Main Variables Analyses
```{r EDA set timepoints for length of stay, echo=FALSE, warning=FALSE}
#look at the basic stats of the participants based on lentgh of stay and set timepoints
range(workingdata$length_of_stay_day_cnt)
#set timepoints for lenght of time after admission/discharge
workingdata$days_stayed = workingdata$discharge_dt - workingdata$admission_dt
workingdata$days_stayed <-as.numeric(workingdata$days_stayed)
#summary(daystodeceased)
#remove negative values
workingdata <- workingdata %>%
filter(days_stayed >=0 | days_stayed %in% NA)
#summary(workingdata$days_stayed)
```
```{r EDA set timepoints from discharge, echo=FALSE, warning=FALSE}
#set timepoints for lenght of time after discharge and deceased
workingdata$daystodeceased_dc = workingdata$date_of_death_dt - workingdata$discharge_dt
workingdata$daystodeceased_dc <-as.numeric(workingdata$daystodeceased)
#summary(daystodeceased_dc)
#remove negative values
workingdata <- workingdata %>%
filter(daystodeceased_dc >0 | daystodeceased_dc %in% NA)
#summary(workingdata$daystodeceased_dc)
```
```{r EDA set timepoints setting, echo=FALSE, warning=FALSE}
#set timepoints for lenght of time after admission/discharge and deceased
#30days
#180days
#365days or 1yr
# 365-1825 or 1-5yrs
# >1825 or >5yrs
workingdata<-
workingdata %>%
mutate(deceased30days = ifelse(daystodeceased_dc<=30, 1,0),
deceased180days = ifelse(daystodeceased_dc >30 & daystodeceased_dc <=180, 1,0),
deceased365days = ifelse(daystodeceased_dc >181 & daystodeceased_dc<=365, 1,0),
deceased1825days = ifelse(daystodeceased_dc >365 & daystodeceased_dc <=1825, 1,0),
deceasedafter1825days = ifelse(daystodeceased_dc>1825, 1,0))
#set timepoints as factors for plotting
workingdata$deceased30days[is.na(workingdata$deceased30days)]<- "2"
workingdata$deceased30days <- factor(workingdata$deceased30days,
levels= c(0,1,2),
labels= c("No","Deceased","Alive"))
workingdata$deceased180days[is.na(workingdata$deceased180days)]<- "2"
workingdata$deceased180days <- factor(workingdata$deceased180days,
levels= c(0,1,2),
labels= c("No","Deceased","Alive"))
workingdata$deceased365days[is.na(workingdata$deceased365days)]<- "2"
workingdata$deceased365days <- factor(workingdata$deceased365days,
levels= c(0,1,2),
labels= c("No","Deceased", "Alive"))
workingdata$deceased1825days[is.na(workingdata$deceased1825days)]<- "2"
workingdata$deceased1825days <- factor(workingdata$deceased1825days,
levels= c(0,1,2),
labels= c("No","Deceased","Alive"))
workingdata$deceasedafter1825days[is.na(workingdata$deceasedafter1825days)]<- "2"
workingdata$deceasedafter1825days <- factor(workingdata$deceasedafter1825days,
levels= c(0,1,2),
labels= c("No","Deceased","Alive"))
```
# Number of Days to Deceased after Discharged
```{r EDA set timepoints plot, echo=FALSE, warning=FALSE}
summary(workingdata$daystodeceased_dc)
library(ggplot2)
workingdata$deceasedtimepoints <-
ifelse(workingdata$daystodeceased_dc <=30, "1mo",
ifelse(workingdata$daystodeceased_dc >30 & workingdata$daystodeceased_dc <=180, "1mo to 6mo" ,
ifelse(workingdata$daystodeceased_dc >181 & workingdata$daystodeceased_dc<=365, "6mo to 1yr",
ifelse(workingdata$daystodeceased_dc >365 & workingdata$daystodeceased_dc <=1825,"1yr to 5yrs", "after 5yrs"))))
count_tp <- workingdata %>%
count(deceasedtimepoints)
count_tp <- na.omit(count_tp)
count_tp %>%
ggplot(aes(x=deceasedtimepoints, y=n, color=deceasedtimepoints)) +
geom_bar(stat="identity", fill="white") +
labs(title= "Number of Days to Deceased After Discharged") +
labs( x= "Number of Days", y= "Number of Participants")
```
Patients had the highest number of mortality after 5years after discharged. As 5 years is a significant amount of time, this may not be related to variables related to the hospitalization data and therefore may not be accurate for predicting risk of death. The second highest is 1 to 5 years.
```{r factor deceasedtimepoints, echo=FALSE, warning=FALSE}
#class(workingdata$deceasedtimepoints)
#class
workingdata$deceasedtimepoints <- as.factor(workingdata$deceasedtimepoints)
#class(workingdata$deceasedtimepoints)
#factor
```
# Length of Stay
```{r Variables Length of Hospital Stay, echo=FALSE, warning=FALSE}
summary(workingdata$days_stayed)
library(tidyr)
range(workingdata$length_of_stay_day_cnt)
workingdata$staytimepoints <-
ifelse(workingdata$length_of_stay_day_cnt <30, "1mo",
ifelse(workingdata$length_of_stay_day_cnt >=30 & workingdata$length_of_stay_day_cnt <60, "1-2mo",
ifelse(workingdata$length_of_stay_day_cnt >=60 & workingdata$length_of_stay_day_cnt <90, "2-3mo",">3mo")))
stay1mo = workingdata[workingdata$staytimepoints == "1mo",];
stay2mo = workingdata[workingdata$staytimepoints == "1-2mo",];
stay3mo = workingdata[workingdata$staytimepoints == "2-3mo",];
stayover3mo = workingdata[workingdata$staytimepoints == ">3mo",]
stay1modec = as.data.frame(table(stay1mo$deceased));stay1modec
stay2modec = as.data.frame(table(stay2mo$deceased));stay2modec
stay3modec = as.data.frame(table(stay3mo$deceased));stay3modec
stayover3modec = as.data.frame(table(stayover3mo$deceased));stayover3modec
staytable= t(cbind(stay1modec[,2], stay2modec[,2], stay3modec[,2], stayover3modec[,2]))
staytable= as.data.frame(staytable)
colnames(staytable)= c("Alive", "Deceased")
row.names(staytable)= c("1mo", "1-2mo", "2-3mo", ">3mo")
staydata=(staytable) %>%
mutate(ind=factor(row_number())) %>%
gather(deceased, proportion, -ind)
staydata$ind <- factor(staydata$ind,
levels= c(1,2,3,4),
labels= c("1mo", "1-2mo", "2-3mo", ">3mo"))
stayplot=ggplot(staydata, aes(x=ind, y= proportion, fill=deceased)) +
geom_col(position="fill", colour= "blue") +
scale_y_continuous(labels =scales :: percent) +
scale_fill_brewer(palette = "Set3") +
ggtitle( "Proportion of Deceased based on Hospital Lenght of Stay") +
xlab ("Number of Days in Hospital") + ylab ("Proportion of Participants")
stayplot
```
The time to deceased ranges from 1 to 6739 days after discharged
```{r Statistics for length of stay, echo=FALSE, warning=FALSE}
library(tidyverse)
#library(ggstatsplot)
workingdata$deceased= as.numeric(workingdata$deceased)
cor.test(workingdata$deceased, workingdata$length_of_stay_day_cnt, method ="pearson")
```
Based on the visual analysis of lenth of stay in relation to number of deceased, patients staying for 1 to 2 months were among the highest group that are deceased. Based on the Pearson correlation analysis, Length of stay is not signigicantly correlated to number deceased.
```{r staytimepoints factor, echo=FALSE, warning=FALSE}
workingdata$staytimepoints <- as.factor(workingdata$staytimepoints)
#class(workingdata$staytimepoints)
#factor
```
# Age
```{r Variables Age, echo=FALSE, warning=FALSE}
library(lubridate)
workingdata <-
workingdata %>%
mutate(agedata=as.numeric(year(date_of_death_dt)-year(date_of_birth_dt)))
summary(workingdata$agedata)
#workingdata <- workingdata %>%
# mutate(agedata = as.numeric(round(workingdata$date_of_death_dt-workingdata$date_of_birth_dt)/365))
workingdata<- workingdata %>%
mutate(agedata = ifelse(is.na(workingdata$agedata),
as.numeric(round((as.Date("2018-12-31")-workingdata$date_of_birth_dt)/365)), workingdata$agedata))
workingdata$agedata <-
ifelse(workingdata$agedata <50, "<50",
ifelse(workingdata$agedata >=50 & workingdata$agedata <76, "50-75",
ifelse(workingdata$agedata >=76 & workingdata$agedata <90, "76-90",">90")))
workingdata$agedata <- factor(workingdata$agedata, levels = c("<50", "50-75","76-90", ">90"))
#workingdata$deceased<- factor(workingdata$deceased, levels = c(0, 1))
agetable<- table(workingdata$deceased, workingdata$agedata)
kable(head(agetable), format="markdown", digits= 2, row.names = TRUE)
barplot(prop.table(agetable, 2), xlab= "Age Group", ylab="Deceased Proportion", main= "Deceased by Age",
col=c("purple","green"), legen=c("Alive", "Deceased"), args.legend=list(x="topright"))
```
```{r Statistics for age, echo=FALSE, warning=FALSE}
library(tidyverse)
#library(ggstatsplot)
workingdata$deceased= as.numeric(workingdata$deceased)
workingdata$agedata= as.numeric(workingdata$agedata)
cor.test(workingdata$deceased, workingdata$agedata, method ="pearson")
```
The age group is 32 to 92 years old. Based on the visualization plot, participants in groups 90 and older are among the highest that have died. Based on the Pearson correlation analysis, there is moderate association between age and death.
# Co-morbidities
```{r view for CCS, echo=FALSE, warning=FALSE}
#view the ccs code diagnosis codes
#head(workingdata %>% count(diag_ccs_code1, diag_ccs_code2,diag_ccs_code3, diag_ccs_code4, diag_ccs_code5, sort=TRUE, 10))
#view the number of patient in each major diagnosis category
workingdata %>% count(major_diag_cat_cde, sort=TRUE)
```
Top 10 most frequent diagnoses are:
8 Other infections; including parasitic
5 HIV infection
6 Hepatitis
13 Cancer of stomach
0 Undefined
1 Tuberculosis
4 Mycoses
14 Cancer of colon
9 Sexually transmitted infections (not HIV or hepatitis)
10 Immunizations and screening for infectious disease
### CCS Among Deceased {.tabset}
```{r view for CCS among deceased, echo=FALSE, warning=FALSE}
#view diagnosis among deceased
#ccscode1
ccs1= workingdata %>% filter (workingdata$deceased ==1) %>% select (deceased, diag_ccs_code1, diag_ccs1) %>%
group_by(diag_ccs1) %>% summarise(counts=n()) %>% ungroup() %>% arrange(desc(counts))
head(ccs1,5)
#ccscode2
ccs2= workingdata %>% filter (workingdata$deceased ==1) %>% select (deceased, diag_ccs_code2, diag_ccs2) %>%
group_by(diag_ccs2) %>% summarise(counts=n()) %>% ungroup() %>% arrange(desc(counts))
head(ccs2,5)
#ccscode3
ccs3= workingdata %>% filter (workingdata$deceased ==1) %>% select (deceased, diag_ccs_code3, diag_ccs3) %>%
group_by(diag_ccs3) %>% summarise(counts=n()) %>% ungroup() %>% arrange(desc(counts))
head(ccs3,5)
#ccscode4
ccs4= workingdata %>% filter (workingdata$deceased ==1) %>% select (deceased, diag_ccs_code4, diag_ccs4) %>%
group_by(diag_ccs4) %>% summarise(counts=n()) %>% ungroup() %>% arrange(desc(counts))
head(ccs4,5)
#ccscode5
ccs5= workingdata %>% filter (workingdata$deceased ==1) %>% select (deceased, diag_ccs_code5, diag_ccs5) %>%
group_by(diag_ccs5) %>% summarise(counts=n()) %>% ungroup() %>% arrange(desc(counts))
head(ccs5,5)
```
```{r Statistics for ccs1, echo=FALSE, warning=FALSE}
workingdata$deceased= as.numeric(workingdata$deceased)
#workingdata$ccs1= as.numeric(workingdata$ccs1)
cor.test(workingdata$deceased, workingdata$major_diag_cat_cde, method ="pearson")
```
Among the 5 ccs codes, essential hypertension, cardiac dysrhthmias, disorders of lipid metabolism and osteoarthritis are among the highest diagnoses within patients that have died. Based on the Pearson's correlation analysis, there is a small association between diagnosis based on theses ccs code and death.
### Co-morbidities based on CCS code per timepoint to deceased {.tabset}
#### 1: CCS code 1 {.tabset}
```{r view for CCS among deceased by ccs1, echo=FALSE, warning=FALSE}
#ccs code1 among deceased in 30 days
ccs1_30 <-
workingdata %>%
filter(deceased30days == "Deceased") %>%
count(diag_ccs_code1, diag_ccs1, sort=TRUE)
head(ccs1_30, 5)
#ccs code1 among deceased in 31-180 days
ccs1_180 <-
workingdata %>%
filter(deceased180days == "Deceased") %>%
count(diag_ccs_code1, diag_ccs1, sort=TRUE)
head(ccs1_180, 5)
#ccs code1 among deceased in 181-365 days
ccs1_365 <-
workingdata %>%
filter(deceased365days == "Deceased") %>%
count(diag_ccs_code1, diag_ccs1, sort=TRUE)
head(ccs1_365, 5)
#ccs code1 among deceased in 366-1825 days
ccs1_1825 <-
workingdata %>%
filter(deceased1825days == "Deceased") %>%
count(diag_ccs_code1, diag_ccs1, sort=TRUE)
head(ccs1_1825, 5)
#ccs code1 among deceased in after1825 days
ccs1after1825 <-
workingdata %>%
filter(deceasedafter1825days == "Deceased") %>%
count(diag_ccs_code1, diag_ccs1, sort=TRUE)
head(ccs1after1825, 5)
```
#### 2: CCS code 2 {.tabset}
```{r view for CCS among deceased by ccs2, echo=FALSE, warning=FALSE}
#ccs code2 among deceased in 30 days
ccs2_30 <-
workingdata %>%
filter(deceased30days == "Deceased") %>%
count(diag_ccs_code2, diag_ccs2, sort=TRUE)
head(ccs2_30, 5)
#ccs code2 among deceased in 31-180 days
ccs2_180 <-
workingdata %>%
filter(deceased180days == "Deceased") %>%
count(diag_ccs_code2, diag_ccs2, sort=TRUE)
head(ccs2_180, 5)
#ccs code2 among deceased in 181-365 days
ccs2_365 <-
workingdata %>%
filter(deceased365days == "Deceased") %>%
count(diag_ccs_code2, diag_ccs2, sort=TRUE)
head(ccs2_365, 5)
#ccs code2 among deceased in 366-1825 days
ccs2_1825 <-
workingdata %>%
filter(deceased1825days == "Deceased") %>%
count(diag_ccs_code2, diag_ccs2, sort=TRUE)
head(ccs2_1825, 5)
#ccs code2 among deceased in after1825 days
ccs2after1825 <-
workingdata %>%
filter(deceasedafter1825days == "Deceased") %>%
count(diag_ccs_code2, diag_ccs2, sort=TRUE)
head(ccs2after1825, 5)
```
#### 3: CCS code 3 {.tabset}
```{r view for CCS among deceased by ccs3, echo=FALSE, warning=FALSE}
#ccs code3 among deceased in 30 days
ccs3_30 <-
workingdata %>%
filter(deceased30days == "Deceased") %>%
count(diag_ccs_code3, diag_ccs3, sort=TRUE)
head(ccs3_30, 5)
#ccs code3 among deceased in 31-180 days
ccs3_180 <-
workingdata %>%
filter(deceased180days == "Deceased") %>%
count(diag_ccs_code3, diag_ccs3, sort=TRUE)
head(ccs3_180, 5)
#ccs code3 among deceased in 181-365 days
ccs3_365 <-
workingdata %>%
filter(deceased365days == "Deceased") %>%
count(diag_ccs_code3, diag_ccs3, sort=TRUE)
head(ccs3_365, 5)
#ccs code3 among deceased in 366-1825 days
ccs3_1825 <-
workingdata %>%
filter(deceased1825days == "Deceased") %>%
count(diag_ccs_code3, diag_ccs3, sort=TRUE)
head(ccs3_1825, 5)
#ccs code3 among deceased in after1825 days
ccs3after1825 <-
workingdata %>%
filter(deceasedafter1825days == "Deceased") %>%
count(diag_ccs_code3, diag_ccs3, sort=TRUE)
head(ccs3after1825, 5)
```
#### 4: CCS code 4 {.tabset}
```{r view for CCS among deceased by ccs4, echo=FALSE, warning=FALSE}
#ccs code4 among deceased in 30 days
ccs4_30 <-
workingdata %>%
filter(deceased30days == "Deceased") %>%
count(diag_ccs_code4, diag_ccs4, sort=TRUE)
head(ccs4_30, 5)
#ccs code4 among deceased in 31-180 days
ccs4_180 <-
workingdata %>%
filter(deceased180days == "Deceased") %>%
count(diag_ccs_code4, diag_ccs4, sort=TRUE)
head(ccs4_180, 5)
#ccs code4 among deceased in 181-365 days
ccs4_365 <-
workingdata %>%
filter(deceased365days == "Deceased") %>%
count(diag_ccs_code4, diag_ccs4, sort=TRUE)
head(ccs4_365, 5)
#ccs code4 among deceased in 366-1825 days
ccs4_1825 <-
workingdata %>%
filter(deceased1825days == "Deceased") %>%
count(diag_ccs_code4, diag_ccs4, sort=TRUE)
head(ccs4_1825, 5)
#ccs code3 among deceased in after1825 days
ccs4after1825 <-
workingdata %>%
filter(deceasedafter1825days == "Deceased") %>%
count(diag_ccs_code4, diag_ccs4, sort=TRUE)
head(ccs4after1825, 5)
```
#### 5: CCS code 5 {.tabset}
```{r view for CCS among deceased by ccs5, echo=FALSE, warning=FALSE}
#ccs code5 among deceased in 30 days
ccs5_30 <-
workingdata %>%
filter(deceased30days == "Deceased") %>%
count(diag_ccs_code5, diag_ccs5, sort=TRUE)
head(ccs5_30, 5)
#ccs code5 among deceased in 31-180 days
ccs5_180 <-
workingdata %>%
filter(deceased180days == "Deceased") %>%
count(diag_ccs_code5, diag_ccs5, sort=TRUE)
head(ccs5_180, 5)
#ccs code5 among deceased in 181-365 days
ccs5_365 <-
workingdata %>%
filter(deceased365days == "Deceased") %>%
count(diag_ccs_code5, diag_ccs5, sort=TRUE)
head(ccs5_365, 5)
#ccs code5 among deceased in 366-1825 days
ccs5_1825 <-
workingdata %>%
filter(deceased1825days == "Deceased") %>%
count(diag_ccs_code5, diag_ccs5, sort=TRUE)
head(ccs5_1825, 5)
#ccs code5 among deceased in after1825 days
ccs5after1825 <-
workingdata %>%
filter(deceasedafter1825days == "Deceased") %>%
count(diag_ccs_code5, diag_ccs5, sort=TRUE)
head(ccs5after1825, 5)
```
```{r view diet data, echo=FALSE, warning=FALSE}
#summary(workingdata$diet_plant)
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# -2.999 -0.665 -0.067 0.073 0.652 9.889 4686
#summary(workingdata$diet_highprotfat)
#Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# -3.066 -0.670 -0.124 0.026 0.551 12.416 4686
#summary(workingdata$diet_highcarb)
#Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#-4.379 -0.752 -0.204 -0.072 0.454 7.284 4686
#summary(workingdata$diet_saladwine)
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# -4.579 -0.579 0.041 0.102 0.727 6.268 4686
```
# Diet
## Diet Data {.tabset}
```{r cat diet data, echo=FALSE, warning=FALSE}
#plant diet
#workingdata$diet_plant <- na.omit(workingdata$diet_plant)
workingdata$plantdiet <-
ifelse(workingdata$diet_plant <=0.073, "low",
ifelse(workingdata$diet_plant >0.073 & workingdata$diet_plant <=0.652, "mid", "high"
))
#high protein fat diet
#workingdata$diet_highprotfat <- na.omit(workingdata$diet_highprotfat)
workingdata$protdiet <-
ifelse(workingdata$diet_highprotfat <=-0.124, "low",
ifelse(workingdata$diet_highprotfat >-0.124 & workingdata$diet_highprotfat <=0.551, "mid", "high"
))
#carb diet
#workingdata$diet_highcarb <-na.omit(workingdata$diet_highcarb)
workingdata$carbdiet <-
ifelse(workingdata$diet_highcarb <=-0.204, "low",
ifelse(workingdata$diet_highcarb >-0.204 & workingdata$diet_highcarb <=0.454, "mid", "high"
))
#salad wine diet
#workingdata$diet_saladwine <- na.omit(workingdata$diet_saladwine)
workingdata$saladwinediet <-
ifelse(workingdata$diet_saladwine <=0.041, "low",
ifelse(workingdata$diet_saladwine >0.041 & workingdata$diet_saladwine <0.727, "mid", "high"
))
#head(workingdata$saladwinediet, 100)
```
```{r factor diet data, echo=FALSE, warning=FALSE}
workingdata$protdiet <-as.factor(workingdata$protdiet)
workingdata$carbdiet <-as.factor(workingdata$carbdiet)
workingdata$saladwinediet <-as.factor(workingdata$saladwinediet)
workingdata$plantdiet <-as.factor(workingdata$plantdiet)
#head(workingdata$saladwinediet, 100)
#class(workingdata$saladwinediet)
#factor
```
### 1: High Carb Diet {.tabset}
```{r view for diet data high carb, echo=FALSE, warning=FALSE}
#high carb
highcarb= workingdata %>% filter (workingdata$deceased ==1) %>% select (deceased, carbdiet) %>%
group_by(carbdiet) %>% summarise(counts=n()) %>% ungroup() %>% arrange(desc(counts))
head(highcarb,10)
#workingdata$highcarb= as.numeric(workingdata$deceased)
#workingdata$ccs1= as.numeric(workingdata$ccs1)
cor.test(workingdata$deceased, workingdata$diet_highcarb, method ="pearson")
```
### 2: High Protein Diet {.tabset}
```{r view for diet data prot fat, echo=FALSE, warning=FALSE}
#high protein fat
highprotfat= workingdata %>% filter (workingdata$deceased ==1) %>% select (deceased, protdiet) %>%
group_by(protdiet) %>% summarise(counts=n()) %>% ungroup() %>% arrange(desc(counts))
head(highprotfat,10)
cor.test(workingdata$deceased, workingdata$diet_highprotfat, method ="pearson")
```
### 3: High Plant Diet {.tabset}
```{r view for diet data plant, echo=FALSE, warning=FALSE}
#high plant
highplant= workingdata %>% filter (workingdata$deceased ==1) %>% select (deceased, plantdiet) %>%
group_by(plantdiet) %>% summarise(counts=n()) %>% ungroup() %>% arrange(desc(counts))
head(highplant,10)
cor.test(workingdata$deceased, workingdata$diet_plant, method ="pearson")
```
### 4: High Salad/Wine Diet {.tabset}
```{r view for diet data salad wine, echo=FALSE, warning=FALSE}
#salad wine
saladwine= workingdata %>% filter (workingdata$deceased ==1) %>% select (deceased, saladwinediet) %>%
group_by(saladwinediet) %>% summarise(counts=n()) %>% ungroup() %>% arrange(desc(counts))
head(saladwine,10)
cor.test(workingdata$deceased, workingdata$diet_saladwine, method ="pearson")
```
Based on the Pearson's Correlation test, all diet has some correlation to the rate of deceased with high carb diet having the strongest correlation among the surveyed diets.
# Prediction model
```{r facotring data cleanup for prediction model , echo=FALSE, warning=FALSE}
workingdata$hbp_self_q1 <- as.factor(workingdata$hbp_self_q1)
workingdata$hbpmed_totyrs <-as.factor(workingdata$hbpmed_totyrs)
workingdata$mammo_ever_q1 <-as.factor(workingdata$mammo_ever_q1)
workingdata$sit_hrs <- as.factor (workingdata$sit_hrs)
workingdata$sleep_hrs <-as.factor(workingdata$sleep_hrs)
workingdata$serv_fats_q1 <-as.factor(workingdata$serv_fats_q1)
workingdata$vit_reg_no <-as.factor(workingdata$vit_reg_no)
workingdata$ses_quartile_ind <-as.factor(workingdata$ses_quartile_ind)
workingdata$brca <-as.factor(workingdata$brca)
#workingdata$deceased<-as.integer(workingdata$deceased)
```
```{r setting up prediction model, echo=FALSE, warning=FALSE, include= FALSE}
modeldata <-
workingdata %>%
select(deceased30days, deceased, deceased180days, deceased365days, deceased1825days, deceasedafter1825days, agedata, diet_highcarb, diet_highprotfat, diet_plant, diet_saladwine, major_diag_cat_cde, payer_cat_cde, payer_coverage_typ, total_charges_amt, patient_disposition_cde, meno_stattype, oralcntr_yrs, oralcntr_ever_q1, bmi_q1, hbp_self_q1, hbpmed_totyrs, mammo_ever_q1, allex_life_hrs, allex_hrs_q1, sit_hrs, sleep_hrs, serv_fats_q1, dief_ethnic, vit_mulvit_q1, vit_reg_no, smoke_lifeexpo, smoke_totpackyrs, smoke_totyrs, alchl_g_dayrecen, diag_ccs_code1,facility_zip5_cde, participant_race, ses_quartile_ind, brca)
na.omit(modeldata)
modeldata[is.na(modeldata)]= 0
modeldata <-modeldata[complete.cases(modeldata),]
#modeldata= modeldata %>% mutate_each(funs(as.factor))
#modeldata <-as.numeric(modeldata$deceased)
#modeldata <-as.factor(modeldata$deceasedtimepoints)
#sum(!complete.cases(modeldata))
#4744 missing values
#str(modeldata)
#fill in missing values with median of each variable
#=modeldata
#for(i in 1:ncol(modeldata)) {
# modeldata[ , i][is.na(modeldata[ , i])] <- 0, na.rm=TRUE)
#}
#need to change char to factor and add levels
```
## Overall Random Forest Prediction Model {.tabset}
### Performance
```{r random forest seed, echo=FALSE, warning=FALSE}
library(datasets)
library(randomForest)
library(caTools)
library(mlr)
library(pROC)
#model3<-makeClassifTask(id="deceased predict", data=modeldata, target="deceased")
#model3test <-subsetTask(model3, subset=data_valid)
#model3_RF<-randomForest(deceased ~., data=modeldata[traindata,], mtry=sqrt(50), ntree=500, strata=modeldata$deceased[traindata])
#varImpPlot(model3_RF)
#importance(model3_RF)
modeldata$deceased <-as.factor(modeldata$deceased)
set.seed(207)
index=sample(1:nrow(modeldata),floor(0.75*nrow(modeldata)))
train=modeldata[index,]
test=modeldata[-index,]
#ind<-sample.split(modeldata, SplitRatio= 0.75)
#train<-modeldata [ind==TRUE]
#test <-modeldata [ind==FALSE]
#train<-na.omit(train)
#test<-na.omit(test)
task<-makeClassifTask(data=train, target="deceased")
lrnr<-makeLearner("classif.randomForest")
rdesc <-makeResampleDesc("Holdout",split=0.75)
mod<-train(lrnr, task)
rf<-randomForest(deceased ~., data=train, proximity=TRUE)
mod
mod$err.rate[,1]
pred<-predict(mod$learner.model, task=task, subset=test)
res<-resample(learner=lrnr, task=task, resampling=rdesc)
#get prediction on test set
#getPredictionResponse(res$pred)
#compute accuracy
performance (res$pred, acc)
```
```{r random forest test1, echo=FALSE, warning=FALSE}
library(pROC)
model1 <-randomForest(deceased ~ ., data=train, importance=TRUE)
#model1
#Call:
# randomForest(formula = deceased ~ ., data = train, importance = TRUE)
# Type of random forest: classification
# Number of trees: 500
#No. of variables tried at each split: 5
# OOB estimate of error rate: 6.68%
#Confusion matrix:
# 0 1 class.error
#0 239 0 0.0000000
#1 38 292 0.1151515
```
### Model/ Variable Importance
```{r random forest test2, echo=FALSE, warning=FALSE}
model2 <-randomForest(deceased ~ ., data=train, ntree=500, mtry=5, importance=TRUE)
model2
```
```{r random forest pred, echo=FALSE, warning=FALSE}
predTrain <-predict(model2, train, type="class")