ces.Rmd

# Consumer Expenditure Survey (CES) {-}

[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0) <a href="https://github.com/asdfree/ces/actions"><img src="https://github.com/asdfree/ces/actions/workflows/r.yml/badge.svg" alt="Github Actions Badge"></a> 

A household budget survey designed to guide major economic indicators like the Consumer Price Index.

* One table of survey responses per quarter with one row per sampled household (consumer unit). Additional tables containing one record per expenditure.

* A complex sample survey designed to generalize to the civilian non-institutional U.S. population.

* Released annually since 1996.

* Administered by the [Bureau of Labor Statistics](http://www.bls.gov/).

---

## Recommended Reading {-}

Four Example Strengths & Limitations:

✔️ [Detailed expenditure categories](https://www.bls.gov/cex/cecomparison.htm#cedc)

✔️ [Respondents diary spending for two consecutive 1-week periods](https://www.bls.gov/respondents/cex/)

❌ [Measures purchases but not consumption](https://www.bls.gov/opub/hom/cex/concepts.htm)

❌ [Consumer unit definition differs from households or families in other surveys](https://www.bls.gov/opub/mlr/2021/article/consumer-expenditure-survey-methods-symposium-and-microdata-users-workshop-2020.htm)

<br>

Three Example Findings:

1. [In 2022, one third of total nationwide expenditures were attributed to housing-related expenses](https://www.bls.gov/opub/reports/consumer-expenditures/2022/home.htm).

2. [Between 2015 and early 2022, male household heads consumed a greater proportion of resources (33%) compared to female household heads (28%), who, in turn, consume more than children (23%)](https://doi.org/10.1007/s11150-024-09739-0).

3. [In 2020, if income increased by $100, spending on all food and alcohol increased by $14 on average](http://dx.doi.org/10.22004/ag.econ.344014).

<br>

Two Methodology Documents:

> [Consumer Expenditure Surveys Public Use Microdata Getting Started Guide](https://www.bls.gov/cex/pumd-getting-started-guide.htm)
 
> [Wikipedia Entry](https://en.wikipedia.org/wiki/Consumer_Expenditure_Survey)

<br>

One Haiku:

```{r}
# price indices and
# you spent how much on beans, jack?
# pocketbook issues
```

---

## Download, Import, Preparation {-}

Download both the prior and current year of interview microdata:
```{r eval = FALSE , results = "hide" }
library(httr)

tf_prior_year <- tempfile()

this_url_prior_year <- "https://www.bls.gov/cex/pumd/data/stata/intrvw22.zip"

dl_prior_year <- GET( this_url_prior_year , user_agent( "email@address.com" ) )

writeBin( content( dl_prior_year ) , tf_prior_year )

unzipped_files_prior_year <- unzip( tf_prior_year , exdir = tempdir() )

tf_current_year <- tempfile()

this_url_current_year <- "https://www.bls.gov/cex/pumd/data/stata/intrvw23.zip"

dl_current_year <- GET( this_url_current_year , user_agent( "email@address.com" ) )

writeBin( content( dl_current_year ) , tf_current_year )

unzipped_files_current_year <- unzip( tf_current_year , exdir = tempdir() )

unzipped_files <- c( unzipped_files_current_year , unzipped_files_prior_year )
```

Import and stack all 2023 quarterly files plus 2024's first quarter:
```{r eval = FALSE , results = "hide" }
library(haven)

fmli_files <- grep( "fmli2[3-4]" , unzipped_files , value = TRUE )

fmli_tbls <- lapply( fmli_files , read_dta )

fmli_dfs <- lapply( fmli_tbls , data.frame )

fmli_dfs <- 
	lapply( 
		fmli_dfs , 
		function( w ){ names( w ) <- tolower( names( w ) ) ; w }
	)

fmli_cols <- lapply( fmli_dfs , names )

intersecting_cols <- Reduce( intersect , fmli_cols )

fmli_dfs <- lapply( fmli_dfs , function( w ) w[ intersecting_cols ] )

ces_df <- do.call( rbind , fmli_dfs )
```

Scale the weight columns based on the number of months in 2023:
```{r eval = FALSE , results = "hide" }
ces_df[ , c( 'qintrvyr' , 'qintrvmo' ) ] <-
	sapply( ces_df[ , c( 'qintrvyr' , 'qintrvmo' ) ] , as.numeric )

weight_columns <- grep( 'wt' , names( ces_df ) , value = TRUE )

ces_df <-
	transform(
		ces_df ,
		mo_scope =
			ifelse( qintrvyr %in% 2023 & qintrvmo %in% 1:3 , qintrvmo - 1 ,
			ifelse( qintrvyr %in% 2024 , 4 - qintrvmo , 3 ) )
	)

for ( this_column in weight_columns ){
	ces_df[ is.na( ces_df[ , this_column ] ) , this_column ] <- 0
	
	ces_df[ , paste0( 'popwt_' , this_column ) ] <-
		( ces_df[ , this_column ] * ces_df[ , 'mo_scope' ] / 12 )	
	
}
```

Combine previous quarter and current quarter variables into a single variable:
```{r eval = FALSE , results = "hide" }

expenditure_variables <- 
	gsub( "pq$" , "" , grep( "pq$" , names( ces_df ) , value = TRUE ) )

# confirm that for every variable ending in pq,
# there's the same variable ending in cq
stopifnot( all( paste0( expenditure_variables , 'cq' ) %in% names( ces_df ) ) )

# confirm none of the variables without the pq or cq suffix exist
if( any( expenditure_variables %in% names( ces_df ) ) ) stop( "variable conflict" )

for( this_column in expenditure_variables ){

	ces_df[ , this_column ] <-
		rowSums( ces_df[ , paste0( this_column , c( 'pq' , 'cq' ) ) ] , na.rm = TRUE )
	
	# annualize the quarterly spending
	ces_df[ , this_column ] <- 4 * ces_df[ , this_column ]
	
	ces_df[ is.na( ces_df[ , this_column ] ) , this_column ] <- 0

}
```

Append any interview survey UCC found at https://www.bls.gov/cex/ce_source_integrate.xlsx:

```{r eval = FALSE , results = "hide" }
ucc_exp <- c( "450110" , "450210" )

mtbi_files <- grep( "mtbi2[3-4]" , unzipped_files , value = TRUE )

mtbi_tbls <- lapply( mtbi_files , read_dta )

mtbi_dfs <- lapply( mtbi_tbls , data.frame )

mtbi_dfs <- 
	lapply( 
		mtbi_dfs , 
		function( w ){ names( w ) <- tolower( names( w ) ) ; w }
	)

mtbi_dfs <- lapply( mtbi_dfs , function( w ) w[ c( 'newid' , 'cost' , 'ucc' , 'ref_yr' ) ] )

mtbi_df <- do.call( rbind , mtbi_dfs )

mtbi_df <- subset( mtbi_df , ( ref_yr %in% 2023 ) & ( ucc %in% ucc_exp ) )

mtbi_agg <- aggregate( cost ~ newid , data = mtbi_df , sum )

names( mtbi_agg ) <- c( 'newid' , 'new_car_truck_exp' )

before_nrow <- nrow( ces_df )

ces_df <-
	merge(
		ces_df ,
		mtbi_agg ,
		all.x = TRUE
	)

stopifnot( nrow( ces_df ) == before_nrow )

ces_df[ is.na( ces_df[ , 'new_car_truck_exp' ] ) , 'new_car_truck_exp' ] <- 0
```

### Save Locally \ {-}

Save the object at any point:

```{r eval = FALSE , results = "hide" }
# ces_fn <- file.path( path.expand( "~" ) , "CES" , "this_file.rds" )
# saveRDS( ces_df , file = ces_fn , compress = FALSE )
```

Load the same object:

```{r eval = FALSE , results = "hide" }
# ces_df <- readRDS( ces_fn )
```

### Survey Design Definition {-}
Construct a multiply-imputed, complex sample survey design:

Separate the `ces_df` data.frame into five implicates, each differing from the others only in the multiply-imputed variables:
```{r eval = FALSE , results = "hide" }
library(survey)
library(mitools)

# create a vector containing all of the multiply-imputed variables
# (leaving the numbers off the end)
mi_vars <- gsub( "5$" , "" , grep( "[a-z]5$" , names( ces_df ) , value = TRUE ) )

# loop through each of the five variables..
for ( i in 1:5 ){

	# copy the 'ces_df' table over to a new temporary data frame 'x'
	x <- ces_df

	# loop through each of the multiply-imputed variables..
	for ( j in mi_vars ){
	
		# copy the contents of the current column (for example 'welfare1')
		# over to a new column ending in 'mi' (for example 'welfaremi')
		x[ , paste0( j , 'mi' ) ] <- x[ , paste0( j , i ) ]
		
		# delete the all five of the imputed variable columns
		x <- x[ , !( names( x ) %in% paste0( j , 1:5 ) ) ]

	}
	
	assign( paste0( 'imp' , i ) , x )

}

ces_design <- 
	svrepdesign( 
		weights = ~ finlwt21 , 
		repweights = "^wtrep[0-9][0-9]$" , 
		data = imputationList( list( imp1 , imp2 , imp3 , imp4 , imp5 ) ) , 
		type = "BRR" ,
		combined.weights = TRUE ,
		mse = TRUE
	)
```

### Variable Recoding {-}

Add new columns to the data set:
```{r eval = FALSE , results = "hide" }
ces_design <- 
	update( 
		ces_design , 
		
		one = 1 ,
		
		any_food_stamp = as.numeric( jfs_amtmi > 0 ) ,
		
		bls_urbn = factor( bls_urbn , levels = 1:2 , labels = c( 'urban' , 'rural' ) ) ,
		
		sex_ref = factor( sex_ref , levels = 1:2 , labels = c( 'male' , 'female' ) )
		
	)
```

---

## Analysis Examples with the `survey` library \ {-}

### Unweighted Counts {-}

Count the unweighted number of records in the survey sample, overall and by groups:
```{r eval = FALSE , results = "hide" }
MIcombine( with( ces_design , svyby( ~ one , ~ one , unwtd.count ) ) )

MIcombine( with( ces_design , svyby( ~ one , ~ bls_urbn , unwtd.count ) ) )
```

### Weighted Counts {-}
Count the weighted size of the generalizable population, overall and by groups:
```{r eval = FALSE , results = "hide" }
MIcombine( with( ces_design , svytotal( ~ one ) ) )

MIcombine( with( ces_design ,
	svyby( ~ one , ~ bls_urbn , svytotal )
) )
```

### Descriptive Statistics {-}

Calculate the mean (average) of a linear variable, overall and by groups:
```{r eval = FALSE , results = "hide" }
MIcombine( with( ces_design , svymean( ~ totexp ) ) )

MIcombine( with( ces_design ,
	svyby( ~ totexp , ~ bls_urbn , svymean )
) )
```

Calculate the distribution of a categorical variable, overall and by groups:
```{r eval = FALSE , results = "hide" }
MIcombine( with( ces_design , svymean( ~ sex_ref ) ) )

MIcombine( with( ces_design ,
	svyby( ~ sex_ref , ~ bls_urbn , svymean )
) )
```

Calculate the sum of a linear variable, overall and by groups:
```{r eval = FALSE , results = "hide" }
MIcombine( with( ces_design , svytotal( ~ totexp ) ) )

MIcombine( with( ces_design ,
	svyby( ~ totexp , ~ bls_urbn , svytotal )
) )
```

Calculate the weighted sum of a categorical variable, overall and by groups:
```{r eval = FALSE , results = "hide" }
MIcombine( with( ces_design , svytotal( ~ sex_ref ) ) )

MIcombine( with( ces_design ,
	svyby( ~ sex_ref , ~ bls_urbn , svytotal )
) )
```

Calculate the median (50th percentile) of a linear variable, overall and by groups:
```{r eval = FALSE , results = "hide" }
MIcombine( with( ces_design ,
	svyquantile(
		~ totexp ,
		0.5 , se = TRUE 
) ) )

MIcombine( with( ces_design ,
	svyby(
		~ totexp , ~ bls_urbn , svyquantile ,
		0.5 , se = TRUE ,
		ci = TRUE 
) ) )
```

Estimate a ratio:
```{r eval = FALSE , results = "hide" }
MIcombine( with( ces_design ,
	svyratio( numerator = ~ totexp , denominator = ~ fincbtxmi )
) )
```

### Subsetting {-}

Restrict the survey design to california residents:
```{r eval = FALSE , results = "hide" }
sub_ces_design <- subset( ces_design , state == '06' )
```
Calculate the mean (average) of this subset:
```{r eval = FALSE , results = "hide" }
MIcombine( with( sub_ces_design , svymean( ~ totexp ) ) )
```

### Measures of Uncertainty {-}

Extract the coefficient, standard error, confidence interval, and coefficient of variation from any descriptive statistics function result, overall and by groups:
```{r eval = FALSE , results = "hide" }
this_result <-
	MIcombine( with( ces_design ,
		svymean( ~ totexp )
	) )

coef( this_result )
SE( this_result )
confint( this_result )
cv( this_result )

grouped_result <-
	MIcombine( with( ces_design ,
		svyby( ~ totexp , ~ bls_urbn , svymean )
	) )

coef( grouped_result )
SE( grouped_result )
confint( grouped_result )
cv( grouped_result )
```

Calculate the degrees of freedom of any survey design object:
```{r eval = FALSE , results = "hide" }
degf( ces_design$designs[[1]] )
```

Calculate the complex sample survey-adjusted variance of any statistic:
```{r eval = FALSE , results = "hide" }
MIcombine( with( ces_design , svyvar( ~ totexp ) ) )
```

Include the complex sample design effect in the result for a specific statistic:
```{r eval = FALSE , results = "hide" }
# SRS without replacement
MIcombine( with( ces_design ,
	svymean( ~ totexp , deff = TRUE )
) )

# SRS with replacement
MIcombine( with( ces_design ,
	svymean( ~ totexp , deff = "replace" )
) )
```

Compute confidence intervals for proportions using methods that may be more accurate near 0 and 1. See `?svyciprop` for alternatives:
```{r eval = FALSE , results = "hide" }
# MIsvyciprop( ~ any_food_stamp , ces_design ,
# 	method = "likelihood" )
```

### Regression Models and Tests of Association {-}

Perform a design-based t-test:
```{r eval = FALSE , results = "hide" }
# MIsvyttest( totexp ~ any_food_stamp , ces_design )
```

Perform a chi-squared test of association for survey data:
```{r eval = FALSE , results = "hide" }
# MIsvychisq( ~ any_food_stamp + sex_ref , ces_design )
```

Perform a survey-weighted generalized linear model:
```{r eval = FALSE , results = "hide" }
glm_result <- 
	MIcombine( with( ces_design ,
		svyglm( totexp ~ any_food_stamp + sex_ref )
	) )
	
summary( glm_result )
```

---

## Replication Example {-}
This example matches the _number of consumer units_ and the _Cars and trucks, new_ rows of [Table R-1](https://www.bls.gov/cex/tables/calendar-year/mean/cu-all-detail-2023.xlsx):
```{r eval = FALSE , results = "hide" }
result <-
	MIcombine( with( ces_design , svytotal( ~ as.numeric( popwt_finlwt21 / finlwt21 ) ) ) )

stopifnot( round( coef( result ) , -3 ) == 134556000 )

results <- 
	sapply( 
		weight_columns , 
		function( this_column ){
			sum( ces_df[ , 'new_car_truck_exp' ] * ces_df[ , this_column ] ) / 
			sum( ces_df[ , paste0( 'popwt_' , this_column ) ] )
		}
	)

stopifnot( round( results[1] , 2 ) == 2896.03 )

standard_error <- sqrt( ( 1 / 44 ) * sum( ( results[-1] - results[1] )^2 ) )

stopifnot( round( standard_error , 2 ) == 225.64 )

# note the minor differences
MIcombine( with( ces_design , svymean( ~ cartkn ) ) )
```