# Power calculation in network studies
In survey and study design, calculating the required sample size is critical. Tools and methods for calculating required sample sizes abound; nonetheless, sample-size calculation for studies involving social networks is still underdeveloped. This chapter illustrates how we can use computer simulations to estimate the required sample size. Chapter @part2-power provides a general overview of power analysis.
## Example 1: Spillover effects in egocentric studies[^credit-ego-power]
Suppose we want to run an intervention on a particular population, and we are interested in the effect of that intervention on the egos' alters. In economics, this is known as the "spillover effect," and it is actively studied.
For the power calculation, we assume that alters are only exposed if their egos acquire the behavior. Furthermore, for this first run, we assume that there is no social reinforcement or influence between alters; we will relax this assumption later. To calculate power, we will do the following:
1. Simulate egos' behavior as Bernoulli draws, with the treated egos' success probability implied by the assumed odds.
2. Randomly drop some egos as a result of attrition.
3. Simulate alters' behavior using their egos as the treatment.
4. Fit a logistic regression based on the previous model.
5. Accept/reject the null and store the result.
The previous steps will be repeated 500 times for each value of $n$ we analyze. We will finish by plotting power against sample size. Let's start by writing down the simulation parameters:
```{r part-01-power-params1}
# Design
n_sims <- 500 # Number of simulations
n_a    <- 4   # Number of alters per ego
sizes  <- seq(from = 100, to = 200, by = 25) # Ego sample sizes to try

# Assumptions
odds_h_1  <- 2.0 # Odds of increase under the alternative hypothesis
attrition <- .3  # Expected attrition rate
baseline  <- .2  # Low baseline prevalence of 1s

# Parameters
alpha    <- .05 # Significance level
beta_pow <- 0.2 # Target type II error rate (power = 1 - beta_pow)
```
As we discuss in @part2-power, it is always a good idea to encapsulate the simulation into a function:
```{r part-01-power-simfun}
# The odds under the alternative turned into a probability
theta_h_1 <- plogis(log(odds_h_1))

# Simulation function
sim_data <- function(n) {

  # Treatment assignment (ceiling/floor keep the lengths consistent if n is odd)
  tr <- c(rep(1, ceiling(n / 2)), rep(0, floor(n / 2)))

  # Step 1: Sampling population of egos
  y_ego <- runif(n) < ifelse(tr == 1, theta_h_1, 0.5)

  # Step 2: Simulating attrition
  todrop <- order(runif(n))[seq_len(floor(n * attrition))]
  y_ego  <- y_ego[-todrop]
  tr     <- tr[-todrop]
  n      <- n - length(todrop)

  # Step 3: Simulating alters' behavior. We assume the same effect size as in
  # the ego, and that alters are only exposed if their ego responded
  tr_alter <- rep(y_ego * tr, n_a)
  y_alter  <- runif(n * n_a) < ifelse(tr_alter, theta_h_1, 0.5)

  # Step 4: Computing the test statistics
  res_ego   <- tryCatch(glm(y_ego ~ tr, family = binomial("logit")), error = function(e) e)
  res_alter <- tryCatch(glm(y_alter ~ tr_alter, family = binomial("logit")), error = function(e) e)

  if (inherits(res_ego, "error") || inherits(res_alter, "error"))
    return(c(ego = NA, alter = NA))

  # Step 5: Reject?
  c(
    ego   = summary(res_ego)$coefficients["tr", "Pr(>|z|)"] < alpha,
    alter = summary(res_alter)$coefficients["tr_alter", "Pr(>|z|)"] < alpha
  )
}
```
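As a quick check of the function (a single random draw, so the particular values will vary), one call returns a named pair of reject/do-not-reject indicators for the ego- and alter-level tests:

```{r part-01-power-simfun-check, warning=FALSE}
# One simulated study with 100 egos
sim_data(100)
```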
Now that we have the data generating function, we can run the simulations to approximate statistical power given the sample size. The results will be stored in the matrix `spower`. Since we are simulating data, it is crucial to set the seed so we can reproduce the results.
```{r part-01-powersim1, warning=FALSE}
# We always set the seed
set.seed(88)

# Making space, and running!
spower <- NULL
for (s in sizes) {

  # Run the simulation for size s
  simres <- rowMeans(replicate(n_sims, sim_data(s)), na.rm = TRUE)

  # And store the results
  spower <- rbind(spower, simres)
}
```
The following figure shows the approximate power for finding effects at both levels, ego and alter:
```{r part-01-power-plot1, warning=FALSE}
library(ggplot2)

spower <- rbind(
  data.frame(size = sizes, power = spower[, "ego"],   type = "ego"),
  data.frame(size = sizes, power = spower[, "alter"], type = "alter")
)

spower |>
  ggplot(aes(x = size, y = power, colour = type)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x) +
  labs(x = "Number of Egos", y = "Approx. Power", colour = "Node type") +
  geom_hline(yintercept = 1 - beta_pow)
```
As shown in Chapter @part2-power, we can use a linear regression model to predict sample size as a function of statistical power:
```{r part-01-power-ols-example1}
# Fitting the model
power_model <- glm(
  size ~ power + I(power^2),
  data = spower, family = gaussian(), subset = type == "alter"
)

summary(power_model)

# Predict
predict(power_model, newdata = data.frame(power = .8), type = "response") |>
  ceiling()
```
From the figure, it becomes apparent that, although there is not enough power to identify effects at the ego level, each ego brings in four alters, so the alter-level sample is large enough to reach a statistical power above 0.8 with a relatively small number of egos.
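As a rough cross-check of the model-based prediction above, we can also read the smallest simulated size that clears the power target directly off the simulation grid (this is only as fine-grained as the `sizes` we tried):

```{r part-01-power-grid-check}
# Smallest simulated number of egos whose alter-level power reaches 1 - beta_pow
with(
  subset(spower, type == "alter"),
  if (any(power >= 1 - beta_pow)) min(size[power >= 1 - beta_pow]) else NA
)
```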
[^credit-ego-power]: The original problem was posed by [Dr. Shinduk Lee](https://faculty.utah.edu/u6037777-SHINDUK_LEE/hm/index.hml) from the School of Nursing at the University of Utah.
## Example 2: Spillover effects in a pre-post design
Now the dynamics are different. Instead of having a treated and a control group, we have a single group over which we will measure behavioral change. We will simulate individuals in their initial state, still 0/1, and then simulate that the intervention makes them more likely to have $y = 1$. We will also assume that subjects generally don't change their behavior and that the baseline prevalence of zeros is higher. The simulation steps are as follows:
1. For each individual in the population, draw the underlying probability that $y = 1$. With that probability, assign the value of $y$. This applies to both ego and alter.
2. Randomly drop some egos, and their corresponding alters, due to attrition.
3. Simulate alters' behavior using their egos as the treatment. The ego's underlying probability is increased by the chosen odds; the alter's is increased only if the ego adopted the behavior.
4. To control for the underlying probability that an individual has $y = 1$, use conditional logistic regression (also known as matched case-control logit) to estimate the treatment effects; a toy sketch of this call follows the list.
5. Accept/reject the null and store the result.
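Before embedding it in the simulation, here is a minimal sketch of the conditional-logit call on made-up toy data (purely illustrative, not part of the study design); `strata(g)` pairs each subject's pre and post observations so that only within-subject change informs the estimate:

```{r part-01-clogit-toy}
library(survival)

# Three hypothetical subjects, each observed pre (tr = 0) and post (tr = 1)
toy <- data.frame(
  y  = c(0, 1,  1, 0,  0, 1), # outcomes at pre/post for subjects 1-3
  tr = c(0, 1,  0, 1,  0, 1), # 0 = pre, 1 = post
  g  = c(1, 1,  2, 2,  3, 3)  # subject (stratum) id
)

clogit(y ~ tr + strata(g), data = toy, method = "exact")
```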
```{r part-01-params2}
beta_pars <- c(4, 6) # Shape parameters of the baseline Beta distribution
odds_h_1  <- 2.0     # Odds of increase under the alternative hypothesis
```
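As a quick sanity check, the Beta(4, 6) baseline implies an expected underlying probability of $4/(4+6) = 0.4$ for $y = 1$, consistent with zeros being more prevalent at baseline:

```{r part-01-beta-check}
# Mean of a Beta(a, b) draw is a / (a + b)
beta_pars[1] / sum(beta_pars)
```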
```{r part-01-sim2}
# Simulation function
library(survival)

sim_data_prepost <- function(n) {

  # Step 1: Sampling population of egos (underlying probabilities and baseline outcome)
  y_ego_star <- rbeta(n, beta_pars[1], beta_pars[2])
  y_ego_0    <- runif(n) < y_ego_star

  # Step 2: Simulating attrition
  todrop     <- order(runif(n))[seq_len(floor(n * attrition))]
  y_ego_0    <- y_ego_0[-todrop]
  n          <- n - length(todrop)
  y_ego_star <- y_ego_star[-todrop]

  # Step 3: Simulating alters' baseline. We assume the same data-generating
  # process as in the ego
  y_alter_star <- rbeta(n * n_a, beta_pars[1], beta_pars[2])
  y_alter_0    <- runif(n * n_a) < y_alter_star

  # Simulating the post measurement: alters are only "treated" if their ego
  # adopted the behavior
  y_ego_1   <- runif(n) < plogis(qlogis(y_ego_star) + log(odds_h_1))
  tr_alter  <- as.integer(rep(y_ego_1, n_a))
  y_alter_1 <- runif(n * n_a) <
    plogis(qlogis(y_alter_star) + log(odds_h_1) * tr_alter)

  # Step 4: Computing the test statistics
  y_ego_0   <- as.integer(y_ego_0)
  y_ego_1   <- as.integer(y_ego_1)
  y_alter_0 <- as.integer(y_alter_0)
  y_alter_1 <- as.integer(y_alter_1)

  d <- data.frame(
    y  = c(y_ego_0, y_ego_1),
    tr = c(rep(0, n), rep(1, n)),
    g  = c(1:n, 1:n)
  )

  res_ego <- tryCatch(
    clogit(y ~ tr + strata(g), data = d, method = "exact"),
    error = function(e) e
  )

  d <- data.frame(
    y  = c(y_alter_0, y_alter_1),
    tr = c(rep(0, n * n_a), tr_alter),
    g  = c(1:(n * n_a), 1:(n * n_a))
  )

  res_alter <- tryCatch(
    clogit(y ~ tr + strata(g), data = d, method = "exact"),
    error = function(e) e
  )

  if (inherits(res_ego, "error") || inherits(res_alter, "error"))
    return(c(ego = NA, alter = NA))

  # Step 5: Reject?
  c(
    ego       = summary(res_ego)$coefficients["tr", "Pr(>|z|)"] < alpha,
    alter     = summary(res_alter)$coefficients["tr", "Pr(>|z|)"] < alpha,
    ego_test  = coef(res_ego),
    alter_glm = coef(res_alter)
  )
}
```
```{r part-01-powersim2, warning=FALSE, cache = TRUE}
# We always set the seed
set.seed(88)

# Making space and running!
spower <- NULL
for (s in sizes) {

  # Run the simulation for size s
  simres <- rowMeans(
    replicate(n_sims, sim_data_prepost(s)),
    na.rm = TRUE
  )

  # And store the results
  spower <- rbind(spower, simres)
}
```
```{r part-01-power-plot2, warning=FALSE}
library(ggplot2)

spowerd <- rbind(
  data.frame(size = sizes, power = spower[, "ego"],   type = "ego"),
  data.frame(size = sizes, power = spower[, "alter"], type = "alter")
)

spowerd |>
  ggplot(aes(x = size, y = power, colour = type)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x) +
  labs(x = "Number of Egos", y = "Approx. Power", colour = "Node type") +
  geom_hline(yintercept = 1 - beta_pow)
```
As shown in Chapter @part2-power, we can use a linear regression model to predict sample size as a function of statistical power:
```{r part-01-power-ols-example2}
# Fitting the model
power_model <- glm(
  size ~ power + I(power^2),
  data = spowerd, family = gaussian(), subset = type == "alter"
)

summary(power_model)

# Predict
predict(power_model, newdata = data.frame(power = .8), type = "response") |>
  ceiling()
```
## Example 3: First difference
Now, instead of looking at a dichotomous outcome, let's evaluate what happens if the variable is continuous. The effects we are interested in identifying are the ego and alter effects, $\gamma_{ego}$ and $\gamma_{alter}$, respectively. Furthermore, the data-generating process is
\begin{align*}
y_{itg} & = \alpha_i + \kappa_g + X_i\beta + \varepsilon_{itg} \\
y_{i(t+1)g} & = \alpha_i + \kappa_g + X_i\beta + D_{i}^{ego}\gamma_{ego} + D_i^{alter}\gamma_{alter} + \varepsilon_{i(t+1)g}
\end{align*}
where $D_i^{ego}$ and $D_i^{alter}$ indicate whether individual $i$ is an ego or an alter. Here, ego and alter behavior are correlated through a fixed effect; in other words, within each group we assume a shared baseline level of the outcome. The main difference is that egos and alters may respond to the treatment with different effect sizes. Another way of approaching the group-level correlation would be through an autocorrelation process, as in a spatial autocorrelation model; nonetheless, estimating such models is computationally expensive, so we opted for the fixed-effects approach.
For simplicity, we assume that there is no time effect. Two essential components, $\alpha_i$ and $\kappa_g$, are individual- and group-level unobserved fixed effects. The most straightforward approach is to use a first-difference estimator:
$$
(y_{i(t+1)g} - y_{itg}) = D_{i}^{ego}\gamma_{ego} + D_i^{alter}\gamma_{alter} + \varepsilon'_i, \quad \varepsilon'_i = \varepsilon_{i(t+1)g} - \varepsilon_{itg}
$$
By taking the first difference, the time-invariant components ($\alpha_i$, $\kappa_g$, and $X_i\beta$) drop out of the equation, and we can proceed with a regular linear model.
```{r part-01-params3}
effect_size_ego   <- 0.5  # Treatment effect on egos
effect_size_alter <- 0.25 # Treatment effect on alters
sizes <- seq(10, 100, by = 10) # Ego sample sizes to try
```
```{r part-01-sim3}
# Simulation function
sim_data_prepost <- function(n) {

  # Applying attrition
  n <- floor(n * (1 - attrition))

  # Step 1: Sampling fixed effects
  alpha_i <- rnorm(n * (n_a + 1))          # Individual-level effect
  kappa_g <- rep(rnorm(n), each = n_a + 1) # Group-level effect shared by an ego and its alters

  # Step 2: Generating the outcome at t = 0 and t = 1
  is_ego   <- rep(c(1, rep(0, n_a)), n)
  is_alter <- 1 - is_ego

  y_0 <- alpha_i + kappa_g + rnorm(n * (n_a + 1))
  y_1 <- alpha_i + kappa_g +
    is_ego * effect_size_ego +
    is_alter * effect_size_alter +
    rnorm(n * (n_a + 1))

  # Step 3: Computing the test statistic on the first difference
  res <- tryCatch(
    glm(I(y_1 - y_0) ~ -1 + is_ego + is_alter, family = gaussian("identity")),
    error = function(e) e
  )

  if (inherits(res, "error"))
    return(c(ego = NA, alter = NA))

  # Step 4: Reject?
  c(
    ego   = summary(res)$coefficients["is_ego", "Pr(>|t|)"] < alpha,
    alter = summary(res)$coefficients["is_alter", "Pr(>|t|)"] < alpha,
    coef(res)[1],
    coef(res)[2]
  )
}
```
```{r part-01-powersim3, warning=FALSE, cache = TRUE}
# We always set the seed
set.seed(88)

# Making space and running!
spower <- NULL
for (s in sizes) {

  # Run the simulation for size s
  simres <- rowMeans(
    replicate(n_sims, sim_data_prepost(s)),
    na.rm = TRUE
  )

  # And store the results
  spower <- rbind(spower, simres)
}
```
```{r part-01-power-plot3, warning=FALSE}
library(ggplot2)

spowerd <- rbind(
  data.frame(size = sizes, power = spower[, "ego"],   type = "ego"),
  data.frame(size = sizes, power = spower[, "alter"], type = "alter")
)

spowerd |>
  ggplot(aes(x = size, y = power, colour = type)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x) +
  labs(
    x = "Number of Egos", y = "Approx. Power", colour = "Node type",
    caption = sprintf(
      "Ego effect: %.2f; Alter effect: %.2f",
      effect_size_ego, effect_size_alter
    )
  ) +
  geom_hline(yintercept = 1 - beta_pow)
```
From an inferential point of view, we could still use a demeaning operator to estimate individual-level effects. In particular, we would demean at the group level and then fit a fixed-effects model to estimate the individual-level parameters.
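A minimal sketch of that idea on toy data (the variable names and effect size below are illustrative assumptions, not part of the chapter's design):

```{r part-01-demean-sketch}
# Toy data: 50 groups of 5 individuals, one covariate, and a group fixed effect
set.seed(1)
g  <- rep(1:50, each = 5) # group id
kg <- rnorm(50)[g]        # group-level fixed effect
x  <- rnorm(length(g))
y  <- kg + 0.5 * x + rnorm(length(g))

# Demean the outcome and the covariate within groups, then fit a plain linear model
y_dm <- y - ave(y, g)
x_dm <- x - ave(x, g)
summary(lm(y_dm ~ x_dm - 1))
```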
<!--
## Example 4: Autocorrelated behavior
In this scenario, ego and alter behavior are correlated. To model this, we will use a spatial autocorrelation model (SAR). SAR models are common in spatial statistics. A general form is:
$$
\mathbf{y} = \rho W \mathbf{y} + \mathbf{X} \beta + \varepsilon,\quad \varepsilon \sim \text{MVN}(0, \sigma^2 I_n),
$$
where $\rho$ is the autocorrelation coefficient, $W$ is a row-stochastic square matrix, and $\varepsilon$ is multivariate normal. A row-stochastic matrix is one whose rows add up to one. In social network analysis, $W$ is usually operationalized as the inverse of the geodesic matrix; thus, the closer two individuals are, the higher the influence they have over each other. In our case, such a system can be simulated using the following equation:
$$
\mathbf{y} = (I_n - \rho W)^{-1}\mathbf{X} \beta + (I_n - \rho W)^{-1}\varepsilon
$$
```{r gen-obs, eval = FALSE}
# Creating a social network. Assuming full connection,
# We can use this as a baseline
library(Matrix)
n <- 500
N <- n * (n_a + 1)
G <- matrix(1, n_a + 1, n_a + 1)
diag(G) <- 0
G <- as(G, "dgCMatrix")
G <- kronecker(
as(diag(n), "dgCMatrix"), G)
G <- G / rowSums(G)
# set.seed(44)
Sig <- 1
epsilon <- rnorm(N, sd = Sig)
rho <- 2
I_n <- as(matrix(0, N, N), "dgCMatrix")
diag(I_n) <- 1
y <- solve(I_n - rho * G) %*% epsilon |> as.vector()
```
```{r, eval=FALSE}
library(spatialreg)
ans <- lagsarlm(
res ~ 1,
listw = spdep::mat2listw(G),
data = data.frame(res = y)
)
```
-->