-
Notifications
You must be signed in to change notification settings - Fork 8
/
Copy pathR-Analysis-Skeleton.R
171 lines (91 loc) · 5.07 KB
/
R-Analysis-Skeleton.R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
# This script is for the Data Analysis in R section of the Practical Computing Skills for Biomedical Researchers boot camp
# load tidyverse library ---------
### Download data and skeleton script from GitHub into project ----
# Read in downloaded data using readr package
# Show the first few lines of the data and the dimensions
# Optionally bring up data in a viewer window.
# Create adults only dataset
# remove the original nh object
# investigate Race (categorical variable)
#create factor variables
### Descriptive Statistics -------------
# measures of the center
# histogram to see distribution
#measures of spread
# summary of dataframe
#descriptive stats for discrete variables
### ** EXERCISE 1 ** --------------
# ** YOUR TURN **
#A. Calculate the median Income in the NHA dataset
#B. Find the variance of Weight
#C. What is the range of Pulse?
#D. Create a density plot showing BMI colored and faceted by Education level
### T-tests ---------
# Exploratory data analysis
#normality is best assessed using qq plots for each sex
# equal variance, independent samples t-test
# if we didn't have normality met, run wilcox.test
### ** EXERCISE 2 ** ----------
# ** YOUR TURN **
# Still using the adults (`nha`) dataset, use a test of 2 means to assess whether single or married/cohabitating people (using the RelationshipStatus variable) drink more alcohol (AlcoholYear). (Please feel free to ask for help)
#A. First, create plot to explore the distribution of the variables and the relationship between these two variables.
#B. Next, consider test assumptions to decide what analysis to run.
#C. Run the analysis. Is the association between relationship status and alcohol consumption statistically significant? If so, which group drinks more often?
### ANOVA and LM ----------
# t-test with equal variance for BMI explained by relationship status
# Same question run in a linear model framework
# same question run as an ANOVA
### ANOVA with 3 groups ----------
# levels of smoking status
# linear model for the relationship between smoking status and BMI
#change reference category to Never
# re-create the fit and ANOVA summary
# check out Tukey's multiple comparisons
# plot results
# plot results without NA bc ANOVA automatically removed those
### Linear model with 2 continuous variables -----------
## Create LM fit object for the relationship between height and weight.
#plot these results
# check assumptions of LM
### Multiple regression --------
#t-test for Testosterone ~ PhysActive
#lm for Testosterone ~ PhysActive
#lm for Testosterone ~ PhysActive + Age
#lm for Testosterone ~ PhysActive + Age + Gender
### ** EXERCISE 3 ** ----------
# ** YOUR TURN **
#The `Work` variable is coded "Looking" (n=159), "NotWorking" (n=1317), and "Working" (n=2230).
#A. Fit a linear model of `Income` against `Work`. Assign this to an object called `fit`. What does the `fit` object tell you when you display it directly?
#B. Run an `anova()` to get the ANOVA table. Is the model significant?
#C. Run a Tukey test to get the pairwise contrasts. (Hint: `TukeyHSD()` on `aov()` on the fit). What do you conclude?
#D. Instead of thinking of this as ANOVA, think of it as a linear model. After you've thought about it, get some `summary()` statistics on the fit. Do these results jibe with the ANOVA model?
### DISCRETE VARIABLE ANALYSES -------
### Contingency tables ------
#cross tabulation of Gender and Diabetes
# Add marginal totals
# Get the proportional table
# proportional table over the first (row) margin only.
#chi square for diabetes and gender
# fisher's exact test for diabetes and gender
# relationship between race and health insurance
# plot for categorical data
### Logistic regression --------
#Look at levels of Race. The default ordering is alphabetical
# Let's relevel that where the group with the highest rate of insurance is "baseline"
#mutate to relevel Race with White as reference category
#logistic regression of insurance ~ race
#logistic regression of Insured with lots of predictors
### ** EXERCISE 4 ** -------
# ** YOUR TURN **
#1. What's the relationship between diabetes and participating in rigorous physical activity or sports?
#A. Create a contingency table with Diabetes status in rows and physical activity status in columns.
#B. Display that table with margins.
#C. Show the proportions of diabetics and nondiabetics, separately, who are physically active or not.
# - Is this relationship significant?
# - Create a mosaic plot to visualize the relationship
#2. Model the same association in a logistic regression framework to assess the risk of diabetes using physical activity as a predictor.
#<!-- - First, make Diabetes a factor variable if you haven't already (`nha$Diabetes <- factor(nha$Diabetes)`). -->
#A. Fit a model with just physical activity as a predictor, and display a model summary.
#B. Add gender to the model, and show a summary.
#C. Continue adding weight and age to the model. What happens to the gender association?
#D. Continue and add income to the model. What happens to the original association with physical activity?