# Statistically Significant Trends with Multiple Years of Complex Survey Data {-}
*Contributed by Thomas Yokota <<[email protected]>>*
University of Palermo Professor [Vito Muggeo](mailto:[email protected]) wrote the joinpoint analysis section of the code below to demonstrate that [the `segmented` package](https://cran.r-project.org/web/packages/segmented/index.html) eliminates the need for [external (registration-only, Windows-only, workflow-disrupting) software](http://surveillance.cancer.gov/joinpoint/download.html). [`survey` package](http://r-survey.r-forge.r-project.org/survey/) creator Professor [Thomas Lumley](mailto:[email protected]) wrote [the `svypredmeans` function](https://cran.r-project.org/web/packages/survey/NEWS) to replicate SUDAAN's `PREDMARG` statement and match the CDC's published estimates to the decimal. [Dr. Richard Lowry, M.D.](mailto:[email protected]) at the Centers for Disease Control & Prevention wrote [the original linear trend analysis](http://www.cdc.gov/healthyyouth/yrbs/pdf/yrbs_conducting_trend_analyses.pdf) and then patiently answered our many questions. Thanks to everyone.
The purpose of this analysis is to make statistically valid statements such as, *"there was a significant linear decrease in the prevalence of high school-aged Americans who have ever smoked a cigarette across the period 1999-2011"* with complex sample survey data.
This step-by-step walkthrough follows the methodology presented in the [Centers for Disease Control and Prevention's (CDC) linear trend analysis](http://www.cdc.gov/healthyyouth/yrbs/pdf/yrbs_conducting_trend_analyses.pdf), using free and open source methods rather than proprietary or restricted software.
The example below uses only linearized designs (created with [the `svydesign` function](http://r-survey.r-forge.r-project.org/survey/html/svydesign.html)). For more detail about how to reproduce this analysis with a replicate-weighted design (created with [the `svrepdesign` function](http://r-survey.r-forge.r-project.org/survey/html/svrepdesign.html)), see the methods note following the design-construction section below.
## Data Importation {-}
Prior to running this analysis script, the Youth Risk Behavior Surveillance System (YRBSS) 2005-2015 single-year files must all be loaded as R data files (.rds) on your local machine. Running the download automation script below will create the appropriate files. If you need assistance with the data-loading step, first review [the main YRBSS blog post](http://www.asdfree.com/search/label/youth%20risk%20behavior%20surveillance%20system%20%28yrbss%29).
```{r eval = FALSE }
options( survey.lonely.psu = "adjust" )
library(survey)
library(lodown)
# retrieve a listing of all available extracts for the youth risk behavioral surveillance system
yrbss_cat <- get_catalog( "yrbss" , output_dir = file.path( path.expand( "~" ) , "YRBSS" ) )
# limit the catalog to only years 2005-2015
yrbss_cat <- subset( yrbss_cat , year %in% seq( 2005 , 2015 , 2 ) )
# download the yrbss microdata
lodown( "yrbss" , yrbss_cat )
```
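As a quick sanity check (not part of the original workflow, and assuming the default `output_dir` used above), confirm that one `main.rds` file now exists per requested year:

```{r eval = FALSE }
# expect one "YYYY main.rds" file for each of the six requested years
list.files( file.path( path.expand( "~" ) , "YRBSS" ) , pattern = "main\\.rds$" )
```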
## Load Required Packages, Options, External Functions {-}
```{r eval = FALSE }
# install.packages( c( "segmented" , "ggplot2" , "ggthemes" , "texreg" ) )
# Muggeo V. (2008) Segmented: an R package to fit regression models with broken-line relationships. R News, 8, 1: 20-25.
library(segmented)
library(ggplot2)
library(ggthemes)
library(texreg)
```
## Harmonize and Stack Multiple Years of Survey Data {-}
This step is clearly dataset-specific. In order for your trend analysis to work, you'll need to figure out how to align the variables from multiple years of data into a trendable, stacked `data.frame` object.
```{r eval = FALSE }
# initiate an empty `y` object
y <- NULL
# loop through each year of YRBSS microdata
for ( year in seq( 2005 , 2015 , 2 ) ){
# load the current year
x <- readRDS(
file.path( path.expand( "~" ) , "YRBSS" ,
paste0( year , " main.rds" ) ) )
# tack on a `year` column
x$year <- year
# the 2005 file lacks a pre-computed race-ethnicity column,
# so initiate it as missing so the column subset below matches across years
if( year == 2005 ) x$raceeth <- NA
# stack that year of data alongside the others,
# keeping only the analysis columns shared across years
y <- rbind( x[ c( "q2" , "q3" , "q4" , "qn10" , "year" , "psu" , "stratum" , "weight" , "raceeth" ) ] , y )
# clear the single-year of microdata from RAM
rm( x )
}
# convert every column to numeric type
y[ , ] <- sapply( y[ , ] , as.numeric )
# construct year-specific recodes so that
# "rode with a drunk driver" // grade // sex // race-ethnicity align across years
y <-
transform(
y ,
rode_with_drunk_driver = qn10 ,
raceeth =
ifelse( year == 2005 ,
ifelse( q4 %in% 6 , 1 ,
ifelse( q4 %in% 3 , 2 ,
ifelse( q4 %in% c( 4 , 7 ) , 3 ,
ifelse( q4 %in% c( 1 , 2 , 5 , 8 ) , 4 , NA ) ) ) ) ,
ifelse( raceeth %in% 5 , 1 ,
ifelse( raceeth %in% 3 , 2 ,
ifelse( raceeth %in% c( 6 , 7 ) , 3 ,
ifelse( raceeth %in% c( 1 , 2 , 4 , 8 ) , 4 , NA ) ) ) ) ) ,
grade = ifelse( q3 == 5 , NA , as.numeric( q3 ) ) ,
sex = ifelse( q2 %in% 1:2 , q2 , NA )
)
# again remove unnecessary variables, keeping only the complex sample survey design columns
# plus independent/dependent variables to be used in the regression analyses
y <- y[ c( "year" , "psu" , "stratum" , "weight" , "rode_with_drunk_driver" , "raceeth" , "sex" , "grade" ) ]
# set males ( q2 == 2 ) as the reference group
y$sex <- relevel( factor( y$sex ) , ref = "2" )
# set rode with a drunk driver=yes // white // 9th graders as the reference groups
for ( i in c( 'rode_with_drunk_driver' , 'raceeth' , 'grade' ) ) y[ , i ] <- relevel( factor( y[ , i ] ) , ref = "1" )
```
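A quick hedged check (not in the original script): confirm that all six survey years stacked and that the harmonized recodes populated in each year.

```{r eval = FALSE }
# one block of records per survey year, with no missing years
table( y$year , useNA = "always" )
# the harmonized race-ethnicity categories should appear in every year
table( y$raceeth , y$year , useNA = "ifany" )
```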
## Construct a Multi-Year Stacked Complex Survey Design Object {-}
Before constructing a multi-year stacked design object, check out `?contr.poly` - this function implements the orthogonal polynomials used in our trend analysis below. For more detail on this subject, see [page 216 of Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences](https://www.google.com/search?q=The+polynomials+we+have+used+as+predictors+to+this+point+are+natural+polynomials%2C+generated+from+the+linear+predictor+by+centering+and+the+powering+the+linear+predictor.&ie=utf-8&oe=utf-8) by Jacob Cohen, Patricia Cohen, Stephen G. West, and Leona S. Aiken: *"The polynomials we have used as predictors to this point are natural polynomials, generated from the linear predictor by centering and the powering the linear predictor."*
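As a short illustration (not part of the original analysis), the columns returned by `contr.poly` are orthonormal - that property is what lets the linear, quadratic, and cubic terms be tested separately in the regressions below:

```{r eval = FALSE }
# each column is one orthogonal polynomial contrast across six timepoints
contr.poly( 6 )
# the crossproduct is (numerically) an identity matrix,
# confirming the contrast columns are orthonormal
round( crossprod( contr.poly( 6 ) ) , 10 )
```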
```{r eval = FALSE }
# extract a linear contrast vector of length six,
# because we have six distinct years of yrbss data `seq( 2005 , 2015 , 2 )`
c6l <- contr.poly( 6 )[ , 1 ]
# also extract a quadratic (squared) contrast vector
c6q <- contr.poly( 6 )[ , 2 ]
# just in case, extract a cubic contrast vector
c6c <- contr.poly( 6 )[ , 3 ]
# for each record in the data set, tack on the linear, quadratic, and cubic contrast value
# these contrast values will serve as replacements for the raw `year` variable in any regression.
# year^1 term (linear)
y$t6l <- c6l[ match( y$year , seq( 2005 , 2015 , 2 ) ) ]
# year^2 term (quadratic)
y$t6q <- c6q[ match( y$year , seq( 2005 , 2015 , 2 ) ) ]
# year^3 term (cubic)
y$t6c <- c6c[ match( y$year , seq( 2005 , 2015 , 2 ) ) ]
# construct a complex sample survey design object
# stacking multiple years and accounting for `year` in the nested strata
des <-
svydesign(
id = ~psu ,
strata = ~interaction( stratum , year ) ,
data = y ,
weights = ~weight ,
nest = TRUE
)
```
Now we've got a multi-year stack of complex survey designs with linear, quadratic, and cubic contrast values appended. If you'd like more detail about stacking multiple years of complex survey data, review [the CDC's manual](http://www.cdc.gov/healthyyouth/yrbs/pdf/yrbs_combining_data.pdf) on the topic. Hopefully we won't need anything beyond cubic, but let's find out.
**Methods note** about how to stack replication designs: This is only relevant if you are trying to create a `des` object like the one above but only have replicate weights and do not have the clustering information (psu). It is straightforward to construct a replication design *from* a linearized design (see [`as.svrepdesign`](http://r-survey.r-forge.r-project.org/survey/html/as.svrepdesign.html)). However, [for privacy reasons, going in the opposite direction is much more challenging](http://www.asdfree.com/2014/09/how-to-provide-variance-calculation-on.html). Therefore, you'll need to do some dataset-specific homework on how best to *stack* multiple years of a replicate-weighted design before you can construct a multiple-year-stacked survey design like the object above.
If you'd like to experiment with how the two approaches differ (theoretically, very little), these publicly-available survey data sets include both replicate weights and, separately, clustering information:
- [Medical Expenditure Panel Survey](http://www.asdfree.com/search/label/medical%20expenditure%20panel%20survey%20%28meps%29)
- [National Health and Nutrition Examination Survey](http://www.asdfree.com/search/label/national%20health%20and%20nutrition%20examination%20survey%20%28nhanes%29)
- [Consumer Expenditure Survey](http://www.asdfree.com/search/label/consumer%20expenditure%20survey%20%28ce%29)
In most cases, omitting the `year` variable from the `strata = ~interaction( stratum , year )` construction of `des` above will make your standard errors larger (more conservative), so you can probably just `rbind( file_with_repweights_year_one , file_with_repweights_year_two , ... )` so long as the survey design has not changed in structure over the time period that you are analyzing. Once you have the rbound replicate-weights object for every year, you can construct one big multi-year `svrepdesign` object, as sketched below. Make sure you include `scale`, `rscales`, `rho`, and whatever else the `svrepdesign()` call asks for. If you are worried you missed something, check `attributes( your_single_year_replication_design_object )`. This is likely to be a decent approach in most cases.
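Here is a minimal sketch of that approach. Every name and parameter below is an illustrative assumption - the replicate-weight column pattern, the `type`, and the `scale` must be copied from your survey's documentation or from `attributes( your_single_year_replication_design_object )`:

```{r eval = FALSE }
# # stack the single-year files, assuming identical column layouts
# stacked <- rbind( file_with_repweights_year_one , file_with_repweights_year_two )
# # construct one multi-year replication design.
# # `repweights` , `type` , and `scale` here are placeholders
# des_rep <-
# 	svrepdesign(
# 		data = stacked ,
# 		weights = ~ weight ,
# 		repweights = "repw[0-9]+" ,
# 		type = "JK1" ,
# 		scale = 1 ,
# 		combined.weights = TRUE
# 	)
```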
If you need to be very conservative with your computation of trend statistical significance, you might attempt to re-construct fake clusters for yourself using a clustering algorithm. Search for "malicious" in [this confidentiality explanation document](https://github.com/ajdamico/asdfree/blob/master/Confidentiality/how%20to%20create%20de-identified%20replicate%20weights.R). The purpose here isn't to identify individual respondents in the dataset; it's to get a variable like `psu` above that gives you reasonable standard errors. Look for the object `your.replicate.weights` in that script. You could reconstruct a fake psu for each record in your data set with something as easy as:
```{r eval = FALSE }
# # fake_psu should be a one-record-per-person vector object
# # that can immediately be appended onto your data set.
# # note the `$cluster` extraction: `kmeans()` itself returns a list
# fake_psu <- kmeans( your.replicate.weights , 20 )$cluster
```
where 20 is the (completely made up) number of clusters x strata. Hopefully the methodology documents (or the people who wrote them) will at least tell you how many clusters there were in the original sample, even if the clusters themselves were not disclosed. Once you've made fake clusters, they will surely be worse than the real clusters (i.e., conservative standard errors) and you can construct a multiple-year survey design with:
```{r eval = FALSE }
# des <- svydesign( id = ~ your_fake_psus , strata = ~ year , data = y , weights = ~ weight , nest = TRUE )
```
This approach will probably be conservative.
## Review the unadjusted results {-}
Here's the change over time in the prevalence of riding with a drunk driver among youth. The unadjusted prevalence rates (Figure 1) suggest a significant change over the period.
```{r eval = FALSE }
# immediately remove records with a missing `rode_with_drunk_driver` status
des_ns <- subset( des , !is.na( rode_with_drunk_driver ) )
# calculate unadjusted (covariate-free) "rode with a drunk driver" rates by year
# note that this parallels the unadjusted "ever smoked" statistics at the top of
# pdf page 6 of http://www.cdc.gov/healthyyouth/yrbs/pdf/yrbs_conducting_trend_analyses.pdf
unadjusted <- svyby( ~ rode_with_drunk_driver , ~ year , svymean , design = des_ns , vartype = c( 'ci' , 'se' ) )
# coerce that result into a `data.frame` object
my_plot <- data.frame( unadjusted )
# plot the unadjusted decline in smoking
ggplot( my_plot , aes( x = year, y = rode_with_drunk_driver1 ) ) +
geom_point() +
geom_errorbar( aes( ymax = ci_u.rode_with_drunk_driver1 , ymin = ci_l.rode_with_drunk_driver1 ) , width = .2 ) +
geom_line() +
theme_tufte() +
ggtitle( "Figure 1. Unadjusted smoking prevalence 1999-2011" ) +
theme( plot.title = element_text( size = 9 , face = "bold" ) )
```
## Calculate the Number of Joinpoints Needed {-}
Using the orthogonal coefficients (linear, quadratic, cubic terms) that we previously added to our `data.frame` object before constructing the multi-year stacked survey design, let's now determine how many joinpoints will be needed for a trend analysis.
Epidemiological models typically control for possible confounding variables such as sex and race, so let's add them in alongside the linear, quadratic, and cubic year terms.
Calculate the "ever smoked" binomial regression, adjusted by sex, age, race-ethnicity, and a linear year contrast.
```{r eval = FALSE }
linyear <-
svyglm(
I( rode_with_drunk_driver == 1 ) ~ sex + raceeth + grade + t6l ,
design = subset( des_ns , rode_with_drunk_driver %in% 1:2 ) ,
family = quasibinomial
)
summary( linyear )
```
The linear year-contrast variable `t6l` is hugely significant here, so there is probably going to be some sort of trend. A linear trend needs zero joinpoints. If the linear term were the only significant term (out of linear, quadratic, cubic, etc.), then we would not need to calculate a joinpoint at all. In other words, we would not need to figure out where to best break our time trend into two, three, or even four segments.
The linear trend is significant, so we should keep going.
-----
**Interpretation note** about segments of time: The linear term `t6l` was significant, so we probably have a significant linear trend somewhere to report. Now we need to figure out when that significant linear trend started and when it ended. It might be semantically true that there was a significant linear change over the entire period of our data; however, it's inexact and unrefined to give up after only detecting a linear trend. The purpose of the following few steps is really to *cordon off* different time points from one another. In the CDC's smoking example, there actually was not any detectable decrease from 1991-1999; the entirety of the decline in smoking occurred over the period from 1999-2011. So these next (methodologically tricky) steps serve to provide you and your audience with a more careful statement of statistical significance. It's not technically wrong to conclude that smoking declined over the period of 1991-2011, it's just imprecise.
Think of it as the difference between "humans first walked on the moon in the sixties" and "humans first walked on the moon in 1969" - both statements are correct, but the latter exhibits greater scientific precision.
-----
Calculate the "ever smoked" binomial regression, adjusted by sex, age, race-ethnicity, and both linear and quadratic year contrasts. Notice the addition of `t11q`.
```{r eval = FALSE }
quadyear <-
svyglm(
I( rode_with_drunk_driver == 1 ) ~ sex + raceeth + grade + t6l + t6q ,
design = subset( des_ns , rode_with_drunk_driver %in% 1:2 ) ,
family = quasibinomial
)
summary( quadyear )
```
The linear year-contrast variable is hugely significant here, but the quadratic year-contrast variable is also significant. Therefore, we need a joinpoint analysis. A significant quadratic trend needs one joinpoint.
Since both linear and quadratic terms are significant, we should move ahead and test whether the cubic term is also significant.
Calculate the "ever smoked" binomial regression, adjusted by sex, age, race-ethnicity, and linear, quadratic, and cubic year contrasts. Notice the addition of `t11c`.
```{r eval = FALSE }
cubyear <-
svyglm(
I( rode_with_drunk_driver == 1 ) ~ sex + raceeth + grade + t6l + t6q + t6c ,
design = subset( des_ns , rode_with_drunk_driver %in% 1:2 ) ,
family = quasibinomial
)
summary( cubyear )
```
The cubic year-contrast term is *not* significant in this model. Therefore, we should stop testing the shape of this line. In other words, we can stop at a quadratic trend and *do not* need a cubic trend. That means we can stop at a single joinpoint. Remember: a linear trend requires zero joinpoints, a quadratic trend typically requires one joinpoint, a cubic trend usually requires two, and on and on.
Note: if the cubic trend *were* significant, then we would increase the number of joinpoints to *two* instead of *one*; but since the cubic term is not significant, we should stop with the previous regression. If we kept getting significant trends, we would continue testing whether higher-order terms remain significant: a significant `year^4` term requires three joinpoints, `year^5` requires four, and so on. If those terms continued to be significant, we would need to return to the contrast-construction step above and add additional `year^n` contrast columns to the model.
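To make that stopping rule concrete, here is a hedged sketch. The helper name, the 0.05 cutoff, and the loop structure are illustrative assumptions, not part of the CDC methodology:

```{r eval = FALSE }
# fit progressively higher-order year contrasts until the newest term is
# no longer significant; joinpoints needed = highest significant order - 1
count_joinpoints <-
	function( design , terms , cutoff = 0.05 ){
		n_joinpoints <- 0
		for ( k in seq_along( terms ) ){
			f <-
				as.formula(
					paste(
						"I( rode_with_drunk_driver == 1 ) ~ sex + raceeth + grade +" ,
						paste( terms[ seq_len( k ) ] , collapse = " + " )
					)
				)
			fit <- svyglm( f , design = design , family = quasibinomial )
			p_value <- summary( fit )$coefficients[ terms[ k ] , "Pr(>|t|)" ]
			if ( p_value >= cutoff ) break
			n_joinpoints <- k - 1
		}
		n_joinpoints
	}
# count_joinpoints( subset( des_ns , rode_with_drunk_driver %in% 1:2 ) , c( "t6l" , "t6q" , "t6c" ) )
```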
Just for coherence's sake, let's assemble all of these results into a single table where you can see the linear, quadratic, and cubic models side-by-side. The quadratic trend best describes the relationship between prevalence of smoking and change-over-time. The decision to test beyond linear trends, however, is a decision for the individual researcher to make. It is a decision that can be driven by theoretical issues, existing literature, or the availability of data.
<center>
```{r eval = FALSE }
htmlreg( list( linyear , quadyear , cubyear ) , doctype = FALSE , html.tag = FALSE , inline.css = TRUE ,
	head.tag = FALSE , body.tag = FALSE , center = FALSE , single.row = TRUE , caption = "Table 1. Testing for linear trends" )
```
</center>
## Calculate the Adjusted Prevalence and Predicted Marginals {-}
First, calculate the survey-year-independent predictor effects and store these results into a separate object.
```{r eval = FALSE }
marginals <-
svyglm(
formula = I( rode_with_drunk_driver == 1 ) ~ sex + raceeth + grade ,
design = des_ns ,
family = quasibinomial
)
```
Second, run these marginals through the `svypredmeans` function written by Dr. Thomas Lumley. For any archaeology fans out there, this function emulates the `PREDMARG` statement in the ancient language of SUDAAN.
```{r eval = FALSE }
( means_for_joinpoint <- svypredmeans( marginals , ~factor( year ) ) )
```
Finally, clean up these results a bit in preparation for a joinpoint analysis.
```{r eval = FALSE }
# coerce the results to a data.frame object
means_for_joinpoint <- as.data.frame( means_for_joinpoint )
# extract the row names as the survey year
means_for_joinpoint$year <- as.numeric( rownames( means_for_joinpoint ) )
# must be sorted, just in case it's not already
means_for_joinpoint <- means_for_joinpoint[ order( means_for_joinpoint$year ) , ]
# rename columns so they do not conflict with variables in memory
names( means_for_joinpoint ) <- c( 'mean' , 'se' , 'yr' )
# the above line is only because the ?segmented function (used below)
# does not work if an object of the same name is also in memory.
another_plot <- means_for_joinpoint
another_plot$ci_l.mean <- another_plot$mean - (1.96 * another_plot$se)
another_plot$ci_u.mean <- another_plot$mean + (1.96 * another_plot$se)
ggplot(another_plot, aes(x = yr, y = mean)) +
geom_point() +
geom_errorbar(aes(ymax = ci_u.mean, ymin = ci_l.mean), width=.2) +
geom_line() +
theme_tufte() +
ggtitle("Figure 2. Adjusted smoking prevalence 1999-2011") +
theme(plot.title = element_text(size=9, face="bold"))
```
## Identify the Breakpoint/Changepoint {-}
The original CDC analysis recommended some [external software from the National Cancer Institute](http://surveillance.cancer.gov/joinpoint/), which only runs on selected platforms. Dr. Vito Muggeo wrote this within-R solution using his [segmented](https://cran.r-project.org/web/packages/segmented/index.html) package available on CRAN. Let's take a look at how confident we are in the value at each adjusted timepoint. Carrying out a trend analysis requires creating new weights to fit a piecewise linear regression. Figure 3 shows the relationship between variance at each datum and weighting. Larger circles display greater uncertainty and therefore lower weight.
```{r eval = FALSE }
ggplot( means_for_joinpoint , aes( x = yr , y = mean ) ) +
geom_point( aes( size = se ) ) +
theme_tufte() +
ggtitle( "Figure 3. Standard Error at each timepoint\n(smaller dots indicate greater confidence in each adjusted value)"
)
```
First, create that weight variable.
```{r eval = FALSE }
means_for_joinpoint$wgt <- with( means_for_joinpoint, ( mean / se ) ^ 2 )
```
Second, fit a piecewise linear regression.
```{r eval = FALSE }
# estimate the 'starting' linear model with the usual "lm" function using the log values and the weights.
o <- lm( log( mean ) ~ yr , weights = wgt , data = means_for_joinpoint )
```
Now that the regression has been structured correctly, estimate the year that our complex survey trend should be broken into two segments (the changepoint/breakpoint/joinpoint).
```{r eval = FALSE }
# the segmented() function uses a random process in its algorithm.
# setting the random seed ensures reproducibility
set.seed( 2015 )
# add a segmented variable (`yr` in this example) with 1 breakpoint
os <- segmented( o , ~yr )
# `os` is now a `segmented` object, which means it includes information on the fitted model,
# such as parameter estimates, standard errors, residuals.
summary( os )
```
See the `Estimated Break-Point(s)` in that result? That's the critical number from this joinpoint analysis.
Note that the above number is not an integer! The [R `segmented` package](https://cran.r-project.org/web/packages/segmented/index.html) uses an iterative procedure (described in the article below) and therefore between-year solutions are returned. The joinpoint software implements two estimating algorithms: the grid-search and the Hudson algorithm. For more detail about these methods, see [Muggeo V. (2003) Estimating regression models with unknown break-points. Statistics in Medicine, 22: 3055-3071.](http://onlinelibrary.wiley.com/doi/10.1002/sim.1545/abstract).
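If you also want an interval around that between-year estimate, the `segmented` package provides a `confint` method for fitted `segmented` objects (a hedged aside; the exact output layout depends on the package version):

```{r eval = FALSE }
# approximate confidence interval for the estimated breakpoint
confint( os )
```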
```{r eval = FALSE }
# figuring out the breakpoint year was the purpose of this joinpoint analysis.
( your_breakpoint <- round( as.vector( os$psi[, "Est." ] ) ) )
# so. that's a joinpoint. that's where the two line segments join. okay?
# obtain the annual percent change (APC) estimates for each time segment
slope( os , APC = TRUE )
```
The returned CIs for the annual percent change (APC) may differ from those returned by [NCI's Joinpoint Software](http://surveillance.cancer.gov/joinpoint/); for further details, see [Muggeo V. (2010) A Comment on 'Estimating average annual per cent change in trend analysis' by Clegg et al., Statistics in Medicine; 28, 3670-3682. Statistics in Medicine, 29, 1958-1960.](http://onlinelibrary.wiley.com/doi/10.1002/sim.3850/abstract)
In the CDC's original smoking analysis, this approach returned similar results to the NCI's Joinpoint Regression Program, estimating a changepoint at `year=1999` - and, more precisely, that the decreasing trend in smoking prevalence from that point forward proceeded at an APC of -3.92 percent. That is `slope2` in the output above.
## Make statistically defensible statements about linear trends with complex survey data {-}
After identifying the changepoint, we can create two regression models (one for each time segment). (If we had two joinpoints, we would need three regression models.) The first model covers the timepoints leading up to (and including) the changepoint; the second model includes the timepoints from the changepoint forward. With the 2005-2015 data here and a changepoint at `year=2009` (which the code below assumes), start with 2005, 2007, and 2009 - the three timepoints before (and including) the changepoint.
```{r eval = FALSE }
# calculate a three-timepoint linear contrast vector
c3l <- contr.poly( 3 )[ , 1 ]
# tack the three-timepoint linear contrast vector onto the current survey design object
des_ns <- update( des_ns , t3l = c3l[ match( year , seq( 2005 , 2009 , 2 ) ) ] )
pre_91_99 <-
svyglm(
I( rode_with_drunk_driver == 1 ) ~ sex + raceeth + grade + t3l ,
design = subset( des_ns , rode_with_drunk_driver %in% 1:2 & year <= 2009 ) ,
family = quasibinomial
)
summary( pre_91_99 )
```
Compare with the sentence on [pdf page 6 of the original document](http://www.cdc.gov/healthyyouth/yrbs/pdf/yrbs_conducting_trend_analyses.pdf#page=6). **In the CDC's smoking example, T5L_L had a p-value=0.52261 and beta=0.03704; therefore, there was "no significant change in the prevalence of ever smoking a cigarette during 1991-1999."** The `t3l` term in the output above is the analogous test for the pre-changepoint segment here.
Then move on to 2009, 2011, 2013, and 2015 - the four timepoints from the changepoint forward.
```{r eval = FALSE }
# calculate a four-timepoint linear contrast vector
c4l <- contr.poly( 4 )[ , 1 ]
# tack the four-timepoint linear contrast vector onto the current survey design object
des_ns <- update( des_ns , t4l = c4l[ match( year , seq( 2009 , 2015 , 2 ) ) ] )
post_99_11 <-
svyglm(
I( rode_with_drunk_driver == 1 ) ~ sex + raceeth + grade + t4l ,
design = subset( des_ns , rode_with_drunk_driver %in% 1:2 & year >= 2009 ) ,
family = quasibinomial
)
summary( post_99_11 )
```
Compare with the sentence on [pdf page 6 of the original document](http://www.cdc.gov/healthyyouth/yrbs/pdf/yrbs_conducting_trend_analyses.pdf#page=6). **In the CDC's smoking example, T7L_R had a p-value<0.0001 and beta=-0.99165; therefore, there was a "significant linear decrease in the prevalence of ever smoking a cigarette during 1999-2011."** The `t4l` term above is the analogous test for the post-changepoint segment here.
Note also that, in the CDC example, the 1999-2011 time period saw a linear decrease, which supports the APC estimate obtained from `slope` above. Here's everything displayed as a single coherent table.
```{r eval = FALSE }
htmlreg( list( pre_91_99 , post_99_11 ) , doctype = FALSE , html.tag = FALSE , inline.css = TRUE ,
	head.tag = FALSE , body.tag = FALSE , center = FALSE , single.row = TRUE , caption = "Table 2. Linear trends pre-post changepoint" )
```
*This analysis may complement qualitative evaluation of prevalence changes observed in surveillance data by providing quantitative evidence, such as when a changepoint occurred. It does not explain why or how changes in trend occur.*