chis.Rmd

# California Health Interview Survey (CHIS) {-}

[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0) <img src='https://img.shields.io/badge/Tested%20Locally-Windows%20Laptop-brightgreen' alt='Local Testing Badge'>

California's National Health Interview Survey (CHIS), a healthcare survey for the nation's largest state.

* One adult, one teenage (12-17), and one child table, each with one row per sampled respondent.

* A complex survey designed to generalize to the civilian non-institutionalized population of California.

* Released annually since 2011, and biennially since 2001.

* Administered by the [UCLA Center for Health Policy Research](http://healthpolicy.ucla.edu/).

---

## Recommended Reading {-}

Four Strengths & Weaknesses:

✔️ [Neighborhood-level estimates](https://healthpolicy.ucla.edu/our-work/askchis-ne)

✔️ [Oversamples allow targeted research questions](https://healthpolicy.ucla.edu/sites/default/files/2023-09/whatsnewchis2021-2022_final_09182023.pdf)

❌ [Low response rates compared to nationwide surveys](https://www.cdc.gov/brfss/annual_data/2023/pdf/2023-dqr-508.pdf#page=4)

❌ [Two-year data periods reduces precision of trend analyses](https://healthpolicy.ucla.edu/sites/default/files/2023-09/chis-2021-2022-sample-design_final_09072023.pdf)

<br>

Three Example Findings:

1. [In 2021, adults with limited English proficiency were less likely to use video or telephone telehealth](http://doi.org/10.1001/jamanetworkopen.2024.10691).

2. [The share of non-citizen kids reporting excellent health increased from 2013-2015 to 2017-2019](https://calbudgetcenter.org/resources/california-sees-health-gains-for-undocumented-residents-after-medi-cal-expansion/).

3. [Adults working from home had worse health behaviors and mental health than other workers in 2021](https://doi.org/10.1002/ajim.23556).

<br>

Two Methodology Documents:

> [CHIS 2021-2022 Methodology Report Series, Report 1: Sample Design DESIGN](https://healthpolicy.ucla.edu/sites/default/files/2023-09/chis_2021-2022_methodologyreport1_sampledesign_final_09112023.pdf)

> [CHIS 2021-2022 Methodology Report Series, Report 5: Weighting and Variance Estimation](https://healthpolicy.ucla.edu/sites/default/files/2023-09/chis_2021-2022_methodologyreport5_weighting_final_09192023.pdf)

<br>

One Haiku:

```{r}
# strike gold, movie star
# play, third wish cali genie
# statewide health survey
```

---

## Function Definitions {-}

Define a function to unzip and import each Stata file:

```{r eval = FALSE , results = "hide" }
library(haven)

chis_import <-
	function( this_fn ){
		
		these_files <- unzip( this_fn , exdir = tempdir() )

		stata_fn <- grep( "ADULT\\.|CHILD\\.|TEEN\\." , these_files , value = TRUE )

		this_tbl <- read_stata( stata_fn )

		this_df <- data.frame( this_tbl )

		names( this_df ) <- tolower( names( this_df ) )

		# remove labelled classes
		labelled_cols <- 
			sapply( this_df , function( w ) class( w )[1] == 'haven_labelled' )

		this_df[ labelled_cols ] <-
			sapply( this_df[ labelled_cols ] , as.numeric )

		this_df
	}
```	

---

## Download, Import, Preparation {-}

1. Register at the UCLA Center for Health Policy Research at https://healthpolicy.ucla.edu/user/register.

2. Choose Year: `2022`, Age Group: `Adult` and `Teen` and `Child`, File Type: `Stata`.

3. Download the 2022 Adult, Teen, and Child Stata files (version `Oct 2023`).

Import the adult, teen, and child stata tables into `data.frame` objects:
```{r eval = FALSE , results = "hide" }
chis_adult_df <- 
	chis_import( file.path( path.expand( "~" ) , "adult_stata_2022.zip" ) )

chis_teen_df <- 
	chis_import( file.path( path.expand( "~" ) , "teen_stata_2022.zip" ) )

chis_child_df <- 
	chis_import( file.path( path.expand( "~" ) , "child_stata_2022.zip" ) )
```

Harmonize the general health condition variable across the three `data.frame` objects:
```{r eval = FALSE , results = "hide" }
chis_adult_df[ , 'general_health' ] <-
	c( 1 , 2 , 3 , 4 , 4 )[ chis_adult_df[ , 'ab1' ] ]

chis_teen_df[ , 'general_health' ] <- chis_teen_df[ , 'tb1_p1' ]

chis_child_df[ , 'general_health' ] <-
	c( 1 , 2 , 3 , 4 , 4 )[ chis_child_df[ , 'ca6' ] ]
```

Add four age categories across the three `data.frame` objects:
```{r eval = FALSE , results = "hide" }
chis_adult_df[ , 'age_categories' ] <-
	ifelse( chis_adult_df[ , 'srage_p1' ] >= 65 , 4 , 3 )

chis_teen_df[ , 'age_categories' ] <- 2

chis_child_df[ , 'age_categories' ] <- 1
```

Harmonize the usual source of care variable across the three `data.frame` objects:
```{r eval = FALSE , results = "hide" }
chis_adult_df[ , 'no_usual_source_of_care' ] <-
	as.numeric( chis_adult_df[ , 'ah1v2' ] == 2 )

chis_teen_df[ , 'no_usual_source_of_care' ] <-
	as.numeric( chis_teen_df[ , 'tf1v2' ] == 2 )

chis_child_df[ , 'no_usual_source_of_care' ] <-
	as.numeric( chis_child_df[ , 'cd1v2' ] == 2 )
```

Add monthly fruit and vegetable counts to the adult `data.frame` object, blanking the other two:
```{r eval = FALSE , results = "hide" }
chis_adult_df[ , 'adult_fruits_past_month' ] <- chis_adult_df[ , 'ae2' ]

chis_adult_df[ , 'adult_veggies_past_month' ] <- chis_adult_df[ , 'ae7' ]

chis_teen_df[ , c( 'adult_fruits_past_month' , 'adult_veggies_past_month' ) ] <- NA

chis_child_df[ , c( 'adult_fruits_past_month' , 'adult_veggies_past_month' ) ] <- NA
```

Specify which variables to keep in each of the `data.frame` objects, then stack them:
```{r eval = FALSE , results = "hide" }
variables_to_keep <-
	c(
		grep( '^rakedw' , names( chis_adult_df ) , value = TRUE ) ,
		'general_health' , 'age_categories' , 'adult_fruits_past_month' , 
		'adult_veggies_past_month' , 'srsex' , 'povll2_p1v2' , 'no_usual_source_of_care'
	)

chis_df <- 
	rbind( 
		chis_child_df[ variables_to_keep ] , 
		chis_teen_df[ variables_to_keep ] , 
		chis_adult_df[ variables_to_keep ] 
	)
```

### Save Locally \ {-}

Save the object at any point:

```{r eval = FALSE , results = "hide" }
# chis_fn <- file.path( path.expand( "~" ) , "CHIS" , "this_file.rds" )
# saveRDS( chis_df , file = chis_fn , compress = FALSE )
```

Load the same object:

```{r eval = FALSE , results = "hide" }
# chis_df <- readRDS( chis_fn )
```

### Survey Design Definition {-}
Construct a complex sample survey design:

```{r eval = FALSE , results = "hide" }
library(survey)

chis_design <-
	svrepdesign(
		data = chis_df , 
		weights = ~ rakedw0 , 
		repweights = "rakedw[1-9]" ,
		type = "other" , 
		scale = 1 ,
		rscales = 1 , 
		mse = TRUE
	)
```

### Variable Recoding {-}

Add new columns to the data set:
```{r eval = FALSE , results = "hide" }
chis_design <- 
	update( 
		chis_design , 
		
		one = 1 ,
		
		gender = factor( srsex , levels = 1:2 , labels = c( 'male' , 'female' ) ) ,
		
		age_categories =
			factor( 
				age_categories , 
				levels = 1:4 , 
				labels = 
					c( 'children under 12' , 'teens age 12-17' , 'adults age 18-64' , 'seniors' )
			) ,
		
		general_health =
			factor(
				general_health ,
				levels = 1:4 ,
				labels = c( 'Excellent' , 'Very good' , 'Good' , 'Fair/Poor' )
			)
	)
```

---

## Analysis Examples with the `survey` library \ {-}

### Unweighted Counts {-}

Count the unweighted number of records in the survey sample, overall and by groups:
```{r eval = FALSE , results = "hide" }
sum( weights( chis_design , "sampling" ) != 0 )

svyby( ~ one , ~ general_health , chis_design , unwtd.count )
```

### Weighted Counts {-}
Count the weighted size of the generalizable population, overall and by groups:
```{r eval = FALSE , results = "hide" }
svytotal( ~ one , chis_design )

svyby( ~ one , ~ general_health , chis_design , svytotal )
```

### Descriptive Statistics {-}

Calculate the mean (average) of a linear variable, overall and by groups:
```{r eval = FALSE , results = "hide" }
svymean( ~ povll2_p1v2 , chis_design )

svyby( ~ povll2_p1v2 , ~ general_health , chis_design , svymean )
```

Calculate the distribution of a categorical variable, overall and by groups:
```{r eval = FALSE , results = "hide" }
svymean( ~ gender , chis_design )

svyby( ~ gender , ~ general_health , chis_design , svymean )
```

Calculate the sum of a linear variable, overall and by groups:
```{r eval = FALSE , results = "hide" }
svytotal( ~ povll2_p1v2 , chis_design )

svyby( ~ povll2_p1v2 , ~ general_health , chis_design , svytotal )
```

Calculate the weighted sum of a categorical variable, overall and by groups:
```{r eval = FALSE , results = "hide" }
svytotal( ~ gender , chis_design )

svyby( ~ gender , ~ general_health , chis_design , svytotal )
```

Calculate the median (50th percentile) of a linear variable, overall and by groups:
```{r eval = FALSE , results = "hide" }
svyquantile( ~ povll2_p1v2 , chis_design , 0.5 )

svyby( 
	~ povll2_p1v2 , 
	~ general_health , 
	chis_design , 
	svyquantile , 
	0.5 ,
	ci = TRUE 
)
```

Estimate a ratio:
```{r eval = FALSE , results = "hide" }
svyratio( 
	numerator = ~ adult_fruits_past_month , 
	denominator = ~ adult_veggies_past_month , 
	chis_design ,
	na.rm = TRUE
)
```

### Subsetting {-}

Restrict the survey design to seniors:
```{r eval = FALSE , results = "hide" }
sub_chis_design <- subset( chis_design , age_categories == 'seniors' )
```
Calculate the mean (average) of this subset:
```{r eval = FALSE , results = "hide" }
svymean( ~ povll2_p1v2 , sub_chis_design )
```

### Measures of Uncertainty {-}

Extract the coefficient, standard error, confidence interval, and coefficient of variation from any descriptive statistics function result, overall and by groups:
```{r eval = FALSE , results = "hide" }
this_result <- svymean( ~ povll2_p1v2 , chis_design )

coef( this_result )
SE( this_result )
confint( this_result )
cv( this_result )

grouped_result <-
	svyby( 
		~ povll2_p1v2 , 
		~ general_health , 
		chis_design , 
		svymean 
	)
	
coef( grouped_result )
SE( grouped_result )
confint( grouped_result )
cv( grouped_result )
```

Calculate the degrees of freedom of any survey design object:
```{r eval = FALSE , results = "hide" }
degf( chis_design )
```

Calculate the complex sample survey-adjusted variance of any statistic:
```{r eval = FALSE , results = "hide" }
svyvar( ~ povll2_p1v2 , chis_design )
```

Include the complex sample design effect in the result for a specific statistic:
```{r eval = FALSE , results = "hide" }
# SRS without replacement
svymean( ~ povll2_p1v2 , chis_design , deff = TRUE )

# SRS with replacement
svymean( ~ povll2_p1v2 , chis_design , deff = "replace" )
```

Compute confidence intervals for proportions using methods that may be more accurate near 0 and 1. See `?svyciprop` for alternatives:
```{r eval = FALSE , results = "hide" }
svyciprop( ~ no_usual_source_of_care , chis_design ,
	method = "likelihood" )
```

### Regression Models and Tests of Association {-}

Perform a design-based t-test:
```{r eval = FALSE , results = "hide" }
svyttest( povll2_p1v2 ~ no_usual_source_of_care , chis_design )
```

Perform a chi-squared test of association for survey data:
```{r eval = FALSE , results = "hide" }
svychisq( 
	~ no_usual_source_of_care + gender , 
	chis_design 
)
```

Perform a survey-weighted generalized linear model:
```{r eval = FALSE , results = "hide" }
glm_result <- 
	svyglm( 
		povll2_p1v2 ~ no_usual_source_of_care + gender , 
		chis_design 
	)

summary( glm_result )
```

---

## Replication Example {-}

This matches the proportions and counts from [AskCHIS](https://healthpolicy.ucla.edu/our-work/askchis). The standard errors do not match precisely, but the team at UCLA confirmed [this survey design definition](https://healthpolicy.ucla.edu/sites/default/files/2023-10/sample-code-to-analyze-chis-data.pdf) to be correct, and that the minor standard error and confidence interval differences should not impact any analyses from a statistical perspective:

```{r eval = FALSE , results = "hide" }

chis_adult_design <-
	svrepdesign(
		data = chis_adult_df , 
		weights = ~ rakedw0 , 
		repweights = "rakedw[1-9]" ,
		type = "other" , 
		scale = 1 ,
		rscales = 1 , 
		mse = TRUE
	)
	
chis_adult_design <-
	update(
		chis_adult_design ,
		ab1 = 
			factor( 
				ab1 , 
				levels = 1:5 , 
				labels = c( 'Excellent' , 'Very good' , 'Good' , 'Fair' , 'Poor' )
			)
	)
	
this_proportion <- svymean( ~ ab1 , chis_adult_design )

stopifnot( round( coef( this_proportion ) , 3 ) == c( 0.183 , 0.340 , 0.309 , 0.139 , 0.029 ) )

this_count <- svytotal( ~ ab1 , chis_adult_design )

stopifnot( 
	round( coef( this_count ) , -3 ) == c( 5414000 , 10047000 , 9138000 , 4106000 , 855000 )
)
```

---

## Analysis Examples with `srvyr` \ {-}

The R `srvyr` library calculates summary statistics from survey data, such as the mean, total or quantile using [dplyr](https://github.com/tidyverse/dplyr/)-like syntax. [srvyr](https://github.com/gergness/srvyr) allows for the use of many verbs, such as `summarize`, `group_by`, and `mutate`, the convenience of pipe-able functions, the `tidyverse` style of non-standard evaluation and more consistent return types than the `survey` package. [This vignette](https://cran.r-project.org/web/packages/srvyr/vignettes/srvyr-vs-survey.html) details the available features. As a starting point for CHIS users, this code replicates previously-presented examples:

```{r eval = FALSE , results = "hide" }
library(srvyr)
chis_srvyr_design <- as_survey( chis_design )
```
Calculate the mean (average) of a linear variable, overall and by groups:
```{r eval = FALSE , results = "hide" }
chis_srvyr_design %>%
	summarize( mean = survey_mean( povll2_p1v2 ) )

chis_srvyr_design %>%
	group_by( general_health ) %>%
	summarize( mean = survey_mean( povll2_p1v2 ) )
```