---
title: OLS Simulations
author: Jeffrey Arnold
date: April 12, 2017
editor_options:
  chunk_output_type: console
---
$$
\DeclareMathOperator{\diag}{diag}
$$
# Prerequisites
Before starting, load the necessary packages and define some functions that will be used later.
```{r message=FALSE}
library("tidyverse")
library("stringr")
library("broom")
library("modelr")
```
A useful result is that an $n \times n$ covariance matrix ($\Sigma$) can be decomposed into an $n \times 1$ vector of standard deviations ($\sigma$) and an $n \times n$ correlation matrix ($R$),
$$
\Sigma = \diag(\sigma) \, R \, \diag(\sigma) .
$$
This is useful because it is often easier to reason about the correlations between random variables and their scales separately.
In this application the results depend on the correlations, so it helps to be able to set them directly:
```{r sdcor2cov}
sdcor2cov <- function(s, r = diag(length(s))) {
  # convert the standard deviation vector into a diagonal matrix
  s <- diag(s, nrow = length(s), ncol = length(s))
  s %*% r %*% s
}
```
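For example, standard deviations of 1 and 2 with a correlation of 0.5 give variances of 1 and 4 and a covariance of $0.5 \times 1 \times 2 = 1$:
```{r}
sdcor2cov(c(1, 2), matrix(c(1, 0.5, 0.5, 1), nrow = 2))
```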
Next, generate a data frame of possibly correlated multivariate normal variables.
The multivariate normal distribution is specified in terms of a mean vector (`mu`),
a vector of standard deviations for the variables (`sigma`), and the correlation
matrix between the variables (`R`):
```{r mvnorm_df}
mvnorm_df <- function(n, mu, sigma = rep(1, length(mu)),
                      R = diag(length(mu)), empirical = TRUE) {
  as_tibble(MASS::mvrnorm(n, mu = mu, Sigma = sdcor2cov(sigma, R),
                          empirical = empirical))
}
```
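For example, draw two variables with standard deviations 1 and 2 and a correlation of 0.7; because `empirical = TRUE`, the sample moments match the inputs exactly:
```{r}
R <- matrix(c(1, 0.7, 0.7, 1), nrow = 2)
d <- mvnorm_df(100, mu = c(0, 0), sigma = c(1, 2), R = R)
cor(d)
```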
# Classical Model
$$
\begin{aligned}[t]
y &= X \beta + \epsilon \\
\epsilon_i &\sim N(0, \sigma^2)
\end{aligned}
$$
The following function will simulate data from a linear model with i.i.d. normally distributed errors.
```{r}
sim_lm_normal <- function(data, formula, beta, sigma = 1) {
  n <- nrow(data)
  # expected value of y given the design matrix and coefficients
  E_y <- model.matrix(formula, data = data) %*% beta
  # i.i.d. normal errors
  eps <- rnorm(n, mean = 0, sd = sigma)
  # append y and the expected value of y to the data frame
  data[["y"]] <- as.numeric(E_y + eps)
  data[["ev"]] <- as.numeric(E_y)
  data
}
```
Set some parameters. We'll draw samples from the following model,
$$
\begin{aligned}[t]
y &= x_1 + \epsilon \\
\epsilon &\sim N(0, 1) \\
x_1 &\sim N(0, 1)
\end{aligned}
$$
Note that we sample $x_1$ from a normal distribution only for convenience.
The covariates in a linear regression **do not need to be normally distributed**.
```{r}
n <- 100
k <- 1
beta <- c(0, rep(1, k))
X <- mvnorm_df(n, mu = 0, sigma = 1)
sigma_y <- 1
```
```{r}
ex <- sim_lm_normal(X, ~ V1, beta, sigma = sigma_y)
ggplot(ex, aes(x = V1, y = y)) +
  geom_point() + geom_smooth(method = "lm")
```
Write a function to estimate the model:
```{r}
est_ols <- function(data, formula, conf.level = 0.95) {
  lm(formula, data = data) %>%
    tidy(conf.int = TRUE, conf.level = conf.level)
}
```
This runs `lm` with `formula` and returns a tidy data frame of coefficient estimates and confidence intervals:
```{r}
data <- sim_lm_normal(X, ~ V1, beta)
est_ols(data, y ~ V1)
```
First, write a function that simulates a sample of size `n` and runs a linear regression on it.
```{r}
sim1 <- function(n, .iter = NULL) {
  # set parameters
  k <- 1
  beta <- c(0, rep(1, k))
  sigma_y <- 1
  f <- ~ V1
  # simulate data
  data <- sim_lm_normal(mvnorm_df(n, mu = rep(0, k), sigma = rep(1, k)),
                        f, beta, sigma = sigma_y)
  # estimate model
  out <- est_ols(data, update(f, y ~ .))
  # add simulation id
  out[[".iter"]] <- .iter
  # output results
  out
}
```
```{r}
sim1(100)
```
Run a simulation function (`.f`) `.iter` times with one set of settings.
```{r}
run_one_sim <- function(.iter, .f, ...) {
  # empty list to store results
  ret <- vector(mode = "list", length = .iter)
  for (i in seq_len(.iter)) {
    # .f and ... allow us to pass any function and any arguments
    # (through ...) to that function
    ret[[i]] <- .f(..., .iter = i)
  }
  # stack list of data frames -> data frame
  bind_rows(ret)
  # equivalently: map_dfr(seq_len(.iter), function(i) .f(..., .iter = i))
}
```
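For example, three replications of `sim1` with samples of size 50; the `.iter` column indexes the replicate:
```{r}
run_one_sim(3, sim1, n = 50)
```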
Run multiple simulations, each with a different set of options:
```{r}
run_sims <- function(.iter, .f, .args = list(), ...) {
  # empty list to store results
  ret <- vector(mode = "list", length = length(.args))
  # label each setting by its name, or by its position if unnamed
  sim_names <- names(.args)
  if (is.null(sim_names)) sim_names <- as.character(seq_along(.args))
  # run .iter iterations for each set of arguments
  for (i in seq_along(.args)) {
    ret[[i]] <- do.call(run_one_sim,
                        c(list(.f = .f, .iter = .iter), .args[[i]]))
    # store the argument list as a list-column, one copy per row
    ret[[i]][[".arg"]] <- rep(list(.args[[i]]), nrow(ret[[i]]))
    ret[[i]][[".sim"]] <- sim_names[[i]]
  }
  # stack list of data frames -> data frame
  bind_rows(ret)
}
```
For example:
```{r}
iter <- 2 ^ 11
sim1_args <- list(list(n = 16), list(n = 32), list(n = 64), list(n = 128),
                  list(n = 256), list(n = 512), list(n = 1024))
sim1_result <- run_sims(iter, sim1, sim1_args)
```
```{r}
beta <- 1  # true value of the coefficient on V1
sim1_sample <- sim1_result %>%
  filter(term != "(Intercept)") %>%
  mutate(n = map_dbl(.arg, "n")) %>%
  group_by(n) %>%
  summarise(beta_hat_mean = mean(estimate),
            beta_hat_sd = sd(estimate),
            se_mean = mean(std.error),
            se_sd = sd(std.error),
            coverage = mean(conf.low <= beta & conf.high >= beta)) %>%
  mutate(beta_hat_bias = (beta_hat_mean - beta),
         se_bias = (se_mean - beta_hat_sd))
sim1_sample
```
With a correctly specified model and i.i.d. normal errors, $\hat{\beta}$ should be approximately unbiased, the mean standard error should be close to the standard deviation of $\hat{\beta}$ across simulations, and coverage should be near 0.95.
Plot the sampling distribution of $\hat{\beta}$ for each sample size:
```{r}
sim1_result %>%
  filter(term != "(Intercept)") %>%
  mutate(n = map_dbl(.arg, "n")) %>%
  ggplot(aes(x = estimate, colour = factor(n))) +
  geom_density()
```
## Sample Size, P-values, and Power
Draw a sample of size `n` from a normal distribution with mean `mean` and standard deviation `sd`,
and run a t-test against the null hypothesis that the mean is 0.
```{r}
sim_mean <- function(n, mean = 0, sd = 1, .iter = NULL) {
  out <- select(tidy(t.test(rnorm(n, mean = mean, sd = sd))),
                estimate, p.value)
  out[[".iter"]] <- .iter
  out
}
```
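For example, a single replication with `n = 30`:
```{r}
sim_mean(30)
```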
Now run `.iter` replications of this t-test:
```{r}
sim_means <- function(n, .iter, mean = 0, sd = 1) {
  map_df(seq_len(.iter), ~ sim_mean(n, mean = mean, sd = sd, .iter = .x))
}
```
Run 1,000 simulations with `mean = 0.5` and `n = 100`:
```{r}
n <- 100
alpha <- 0.05
sim2_results <- sim_means(n, mean = 0.5, .iter = 1000) %>%
  mutate(significant = (p.value < alpha))
```
Plot the distribution of p-values:
```{r}
ggplot(sim2_results, aes(x = p.value)) +
  geom_density()
```
Plot the distribution of estimates:
```{r}
ggplot(sim2_results, aes(x = estimate)) +
  geom_density()
```
Plot the distribution of estimates for "significant" and "insignificant" tests:
```{r}
ggplot(sim2_results, aes(x = estimate, fill = significant)) +
  geom_histogram()
```
## Extensions
- As you vary `n`, what happens to the Type I error rate, the distribution of
  p-values, and the distribution of estimates conditional on a significant
  p-value?
- Set `mean = 0.2`. As you vary `n`, what happens to the Type II error rate,
  the distribution of p-values, and the distribution of estimates conditional
  on a significant p-value? (A starting sketch follows this list.)
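As a minimal sketch for the second extension (the grid of sample sizes and the iteration count here are arbitrary choices), trace power, the share of significant tests, as `n` grows:
```{r}
power_by_n <- map_dfr(c(25, 50, 100, 200, 400), function(n) {
  sims <- sim_means(n, .iter = 500, mean = 0.2)
  tibble(n = n, power = mean(sims$p.value < alpha))
})
ggplot(power_by_n, aes(x = n, y = power)) +
  geom_point() +
  geom_line()
```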