---
title: OLS Simulations
author: Jeffrey Arnold
date: April 12, 2017
editor_options:
  chunk_output_type: console
---
$$
\DeclareMathOperator{\diag}{diag}
$$
# Prerequisites
Before starting, load the necessary packages and define some functions that will be used later.
```{r message=FALSE}
library("tidyverse")
library("stringr")
library("broom")
library("modelr")
```
A useful result is that an $n \times n$ covariance matrix ($\Sigma$) can be decomposed into an $n \times 1$ vector of standard deviations ($\sigma$) and an $n \times n$ correlation matrix ($R$),
$$
\Sigma = \diag(\sigma) \, R \, \diag(\sigma) .
$$
This is useful because it is often easier to reason about the correlations between random variables and their scales separately.
In this application the results depend on the correlations, so it helps to be able to set them directly:
```{r sdcor2cov}
sdcor2cov <- function(s, r = diag(length(s))) {
  # convert the standard deviation vector into a diagonal matrix
  s <- diag(s, nrow = length(s), ncol = length(s))
  s %*% r %*% s
}
```
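For example, standard deviations of 1 and 2 with a correlation of 0.5 give variances of 1 and 4 and a covariance of $0.5 \times 1 \times 2 = 1$:
```{r}
sdcor2cov(c(1, 2), matrix(c(1, 0.5, 0.5, 1), nrow = 2))
```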
Next, generate a data frame of possibly correlated multivariate normal variables.
The multivariate normal distribution is specified in terms of a mean vector (`mu`),
a vector of standard deviations for the variables (`sigma`), and the correlation
matrix between the variables (`R`):
```{r mvnorm_df}
mvnorm_df <- function(n, mu, sigma = rep(1, length(mu)),
                      R = diag(length(mu)), empirical = TRUE) {
  as_tibble(MASS::mvrnorm(n, mu = mu, Sigma = sdcor2cov(sigma, R),
                          empirical = empirical))
}
```
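For example, draw two variables with standard deviations 1 and 2 and a correlation of 0.7; because `empirical = TRUE`, the sample moments match the inputs exactly:
```{r}
R <- matrix(c(1, 0.7, 0.7, 1), nrow = 2)
d <- mvnorm_df(100, mu = c(0, 0), sigma = c(1, 2), R = R)
cor(d)
```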
# Classical Model
$$
\begin{aligned}[t]
y &= X \beta + \epsilon \\
\epsilon_i &\sim N(0, \sigma^2)
\end{aligned}
$$
The following function will simulate data from a linear model with i.i.d. normally distributed errors.
```{r}
sim_lm_normal <- function(data, formula, beta, sigma = 1) {
  n <- nrow(data)
  # expected value of y given the design matrix and coefficients
  E_y <- model.matrix(formula, data = data) %*% beta
  # i.i.d. normal errors
  eps <- rnorm(n, mean = 0, sd = sigma)
  # append y and the expected value of y to the data frame
  data[["y"]] <- as.numeric(E_y + eps)
  data[["ev"]] <- as.numeric(E_y)
  data
}
```
Set some parameters. We'll draw samples from the following model,
$$
\begin{aligned}[t]
y &= x_1 + \epsilon \\
\epsilon &\sim N(0, 1) \\
x_1 &\sim N(0, 1)
\end{aligned}
$$
Note that we sample $x_1$ from a normal distribution only for convenience.
The covariates in a linear regression **do not need to be normally distributed**.
```{r}
n <- 100
k <- 1
beta <- c(0, rep(1, k))
X <- mvnorm_df(n, mu = 0, sigma = 1)
sigma_y <- 1
```
```{r}
ex <- sim_lm_normal(X, ~ V1, beta, sigma = sigma_y)
ggplot(ex, aes(x = V1, y = y)) +
  geom_point() + geom_smooth(method = "lm")
```
Write a function to estimate the model:
```{r}
est_ols <- function(data, formula, conf.level = 0.95) {
  lm(formula, data = data) %>%
    tidy(conf.int = TRUE, conf.level = conf.level)
}
```
This runs `lm` with `formula` and returns a tidy data frame of coefficient estimates and confidence intervals:
```{r}
data <- sim_lm_normal(X, ~ V1, beta)
est_ols(data, y ~ V1)
```
First, write a function that simulates a sample of size `n` and runs a linear regression on it.
```{r}
sim1 <- function(n, .iter = NULL) {
  # set parameters
  k <- 1
  beta <- c(0, rep(1, k))
  sigma_y <- 1
  f <- ~ V1
  # simulate data
  data <- sim_lm_normal(mvnorm_df(n, mu = rep(0, k), sigma = rep(1, k)),
                        f, beta, sigma = sigma_y)
  # estimate model
  out <- est_ols(data, update(f, y ~ .))
  # add simulation id
  out[[".iter"]] <- .iter
  # output results
  out
}
```
```{r}
sim1(100)
```
Run a simulation function (`.f`) `.iter` times with one set of settings.
```{r}
run_one_sim <- function(.iter, .f, ...) {
  # empty list to store results
  ret <- vector(mode = "list", length = .iter)
  for (i in seq_len(.iter)) {
    # .f and ... allow us to pass any function and any arguments
    # (through ...) to that function
    ret[[i]] <- .f(..., .iter = i)
  }
  # stack list of data frames -> data frame
  bind_rows(ret)
  # equivalently: map_dfr(seq_len(.iter), function(i) .f(..., .iter = i))
}
```
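For example, three replications of `sim1` with samples of size 50; the `.iter` column indexes the replicate:
```{r}
run_one_sim(3, sim1, n = 50)
```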
Run multiple simulations, each with a different set of options:
```{r}
run_sims <- function(.iter, .f, .args = list(), ...) {
  # empty list to store results
  ret <- vector(mode = "list", length = length(.args))
  # label each setting by its name, or by its position if unnamed
  sim_names <- names(.args)
  if (is.null(sim_names)) sim_names <- as.character(seq_along(.args))
  # run .iter iterations for each set of arguments
  for (i in seq_along(.args)) {
    ret[[i]] <- do.call(run_one_sim,
                        c(list(.f = .f, .iter = .iter), .args[[i]]))
    # store the argument list as a list-column, one copy per row
    ret[[i]][[".arg"]] <- rep(list(.args[[i]]), nrow(ret[[i]]))
    ret[[i]][[".sim"]] <- sim_names[[i]]
  }
  # stack list of data frames -> data frame
  bind_rows(ret)
}
```
For example:
```{r}
iter <- 2 ^ 11
sim1_args <- list(list(n = 16), list(n = 32), list(n = 64), list(n = 128),
                  list(n = 256), list(n = 512), list(n = 1024))
sim1_result <- run_sims(iter, sim1, sim1_args)
```
```{r}
beta <- 1  # true value of the coefficient on V1
sim1_sample <- sim1_result %>%
  filter(term != "(Intercept)") %>%
  mutate(n = map_dbl(.arg, "n")) %>%
  group_by(n) %>%
  summarise(beta_hat_mean = mean(estimate),
            beta_hat_sd = sd(estimate),
            se_mean = mean(std.error),
            se_sd = sd(std.error),
            coverage = mean(conf.low <= beta & conf.high >= beta)) %>%
  mutate(beta_hat_bias = (beta_hat_mean - beta),
         se_bias = (se_mean - beta_hat_sd))
sim1_sample
```
With a correctly specified model and i.i.d. normal errors, $\hat{\beta}$ should be approximately unbiased, the mean standard error should be close to the standard deviation of $\hat{\beta}$ across simulations, and coverage should be near 0.95.
Plot the sampling distribution of $\hat{\beta}$ for each sample size:
```{r}
sim1_result %>%
  filter(term != "(Intercept)") %>%
  mutate(n = map_dbl(.arg, "n")) %>%
  ggplot(aes(x = estimate, colour = factor(n))) +
  geom_density()
```
## Sample Size, P-values, and Power
Draw a sample of size `n` from a normal distribution with mean `mean` and standard deviation `sd`,
and run a t-test against the null hypothesis that the mean is 0.
```{r}
sim_mean <- function(n, mean = 0, sd = 1, .iter = NULL) {
  out <- select(tidy(t.test(rnorm(n, mean = mean, sd = sd))),
                estimate, p.value)
  out[[".iter"]] <- .iter
  out
}
```
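For example, a single replication with `n = 30`:
```{r}
sim_mean(30)
```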
Now run `.iter` replications of this t-test:
```{r}
sim_means <- function(n, .iter, mean = 0, sd = 1) {
  map_df(seq_len(.iter), ~ sim_mean(n, mean = mean, sd = sd, .iter = .x))
}
```
Run 1,000 simulations with `mean = 0.5` and `n = 100`:
```{r}
n <- 100
alpha <- 0.05
sim2_results <- sim_means(n, mean = 0.5, .iter = 1000) %>%
  mutate(significant = (p.value < alpha))
```
Plot the distribution of p-values:
```{r}
ggplot(sim2_results, aes(x = p.value)) +
  geom_density()
```
Plot the distribution of estimates:
```{r}
ggplot(sim2_results, aes(x = estimate)) +
  geom_density()
```
Plot the distribution of estimates for "significant" and "insignificant" tests:
```{r}
ggplot(sim2_results, aes(x = estimate, fill = significant)) +
  geom_histogram()
```
## Extensions
- As you vary `n`, what happens to the Type I error rate, the distribution of
  p-values, and the distribution of estimates conditional on a significant
  p-value?
- Set `mean = 0.2`. As you vary `n`, what happens to the Type II error rate,
  the distribution of p-values, and the distribution of estimates conditional
  on a significant p-value? (A starting sketch follows this list.)
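As a minimal sketch for the second extension (the grid of sample sizes and the iteration count here are arbitrary choices), trace power, the share of significant tests, as `n` grows:
```{r}
power_by_n <- map_dfr(c(25, 50, 100, 200, 400), function(n) {
  sims <- sim_means(n, .iter = 500, mean = 0.2)
  tibble(n = n, power = mean(sims$p.value < alpha))
})
ggplot(power_by_n, aes(x = n, y = power)) +
  geom_point() +
  geom_line()
```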