nbs.Rmd

# National Beneficiary Survey (NBS) {-}

[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0) <a href="https://github.com/asdfree/nbs/actions"><img src="https://github.com/asdfree/nbs/actions/workflows/r.yml/badge.svg" alt="Github Actions Badge"></a> 

The principal microdata for U.S. disability researchers interested in Social Security program performance.

* One table with one row per respondent.

* A complex sample designed to generalize to Americans between age 18 and full retirement age, covered by either Social Security Disability Insurance (SSDI) or Supplemental Security Income (SSI).

* Released at irregular intervals, with 2004, 2005, 2006, 2010, 2015, 2017, and 2019 available.

* Administered by the [Social Security Administration](http://www.ssa.gov/).

---

## Recommended Reading {-}

Four Example Strengths & Limitations:

✔️ [Instrument designed to reduce challenges related to communication, stamina, cognitive barriers](https://www.ssa.gov/disabilityresearch/documents/NBS_R5_UsersGuideReport_508C.pdf#page=31)

✔️ [Longitudinal 2019 sample includes beneficiaries working at prior round (2017) interview](https://www.ssa.gov/disabilityresearch/documents/NBS_R7_DataQualityReport.pdf#page=15)

❌ [Not designed to produce regional or state-level estimates](https://aspe.hhs.gov/reports/disability-data-national-surveys-0#NBS)

❌ [May overstate beneficiary poverty status and understate beneficiary income](https://www.mathematica.org/publications/developing-income-related-statistics-on-federal-disability-beneficiaries-using-nationally)

<br>

Three Example Findings:

1. [Large gaps in income and expenditure between Social Security Disability Insurance recipient households and working households generally increase with the number of dependents](https://www.nber.org/programs-projects/projects-and-centers/retirement-and-disability-research-center/center-papers/nb23-07).

2. [The share of Social Security Disability Insurance beneficiaries who had work goals or work expectations rose from 34% in 2005 to 43% in 2015](https://www.mathematica.org/publications/declining-employment-among-a-growing-group-of-work-oriented-beneficiaries-2005-2015).

3. [In 2010, 9% of disabled-worker beneficiaries had a 4-year degree, 28% less than high school](https://www.ssa.gov/policy/docs/issuepapers/ip2015-01.html).

<br>

Two Methodology Documents:

> [National Beneficiary Survey: Disability Statistics, 2015](https://www.ssa.gov/policy/docs/statcomps/nbs/2015/nbs-statistics-2015.pdf)

> [National Beneficiary Survey - General Waves Round 7: User's Guide](https://www.ssa.gov/disabilityresearch/documents/NBS_R7_Users%20Guide%20Report.pdf)

<br>

One Haiku:

```{r}
# social safety net
# poverty acrobatics
# trap or trampoline
```

---

## Download, Import, Preparation {-}

Download and import the round 7 file:
```{r eval = FALSE , results = "hide" }
library(haven)

zip_tf <- tempfile()

zip_url <- "https://www.ssa.gov/disabilityresearch/documents/R7NBSPUF_STATA.zip"
	
download.file( zip_url , zip_tf , mode = 'wb' )

nbs_tbl <- read_stata( zip_tf )

nbs_df <- data.frame( nbs_tbl )

names( nbs_df ) <- tolower( names( nbs_df ) )

nbs_df[ , 'one' ] <- 1
```

### Save Locally \ {-}

Save the object at any point:

```{r eval = FALSE , results = "hide" }
# nbs_fn <- file.path( path.expand( "~" ) , "NBS" , "this_file.rds" )
# saveRDS( nbs_df , file = nbs_fn , compress = FALSE )
```

Load the same object:

```{r eval = FALSE , results = "hide" }
# nbs_df <- readRDS( nbs_fn )
```

### Survey Design Definition {-}
Construct a complex sample survey design:

```{r eval = FALSE , results = "hide" }
library(survey)

options( survey.lonely.psu = "adjust" )

# representative beneficiary sample
nbs_design <-
	svydesign(
		id = ~ r7_a_psu_pub , 
		strata = ~ r7_a_strata , 
		weights = ~ r7_wtr7_ben , 
		data = subset( nbs_df , r7_wtr7_ben > 0 ) 
	)
	
# cross-sectional successful worker sample
nbs_design <- 
	svydesign(
		id = ~ r7_a_psu_pub , 
		strata = ~ r7_a_strata , 
		weights = ~ r7_wtr7_cssws , 
		data = subset( nbs_df , r7_wtr7_cssws > 0 ) 
	)
	
# longitudinal successful worker sample
lngsws_design <-
	svydesign(
		id = ~ r7_a_psu_pub , 
		strata = ~ r7_a_strata , 
		weights = ~ r7_wtr7_lngsws , 
		data = subset( nbs_df , r7_wtr7_lngsws > 0 ) 
	)
	
```

### Variable Recoding {-}

Add new columns to the data set:
```{r eval = FALSE , results = "hide" }
nbs_design <- 
	update( 
		nbs_design , 
		
		male = as.numeric( r7_orgsampinfo_sex == 1 ) ,
		
		age_categories = 
			factor( 
				r7_c_intage_pub ,
				labels = 
					c( "18-25" , "26-40" , "41-55" , "56 and older" )
			)
		
	)
```

---

## Analysis Examples with the `survey` library \ {-}

### Unweighted Counts {-}

Count the unweighted number of records in the survey sample, overall and by groups:
```{r eval = FALSE , results = "hide" }
sum( weights( nbs_design , "sampling" ) != 0 )

svyby( ~ one , ~ age_categories , nbs_design , unwtd.count )
```

### Weighted Counts {-}
Count the weighted size of the generalizable population, overall and by groups:
```{r eval = FALSE , results = "hide" }
svytotal( ~ one , nbs_design )

svyby( ~ one , ~ age_categories , nbs_design , svytotal )
```

### Descriptive Statistics {-}

Calculate the mean (average) of a linear variable, overall and by groups:
```{r eval = FALSE , results = "hide" }
svymean( ~ r7_n_totssbenlastmnth_pub , nbs_design , na.rm = TRUE )

svyby( ~ r7_n_totssbenlastmnth_pub , ~ age_categories , nbs_design , svymean , na.rm = TRUE )
```

Calculate the distribution of a categorical variable, overall and by groups:
```{r eval = FALSE , results = "hide" }
svymean( ~ r7_c_hhsize_pub , nbs_design , na.rm = TRUE )

svyby( ~ r7_c_hhsize_pub , ~ age_categories , nbs_design , svymean , na.rm = TRUE )
```

Calculate the sum of a linear variable, overall and by groups:
```{r eval = FALSE , results = "hide" }
svytotal( ~ r7_n_totssbenlastmnth_pub , nbs_design , na.rm = TRUE )

svyby( ~ r7_n_totssbenlastmnth_pub , ~ age_categories , nbs_design , svytotal , na.rm = TRUE )
```

Calculate the weighted sum of a categorical variable, overall and by groups:
```{r eval = FALSE , results = "hide" }
svytotal( ~ r7_c_hhsize_pub , nbs_design , na.rm = TRUE )

svyby( ~ r7_c_hhsize_pub , ~ age_categories , nbs_design , svytotal , na.rm = TRUE )
```

Calculate the median (50th percentile) of a linear variable, overall and by groups:
```{r eval = FALSE , results = "hide" }
svyquantile( ~ r7_n_totssbenlastmnth_pub , nbs_design , 0.5 , na.rm = TRUE )

svyby( 
	~ r7_n_totssbenlastmnth_pub , 
	~ age_categories , 
	nbs_design , 
	svyquantile , 
	0.5 ,
	ci = TRUE , na.rm = TRUE
)
```

Estimate a ratio:
```{r eval = FALSE , results = "hide" }
svyratio( 
	numerator = ~ r7_n_ssilastmnth_pub , 
	denominator = ~ r7_n_totssbenlastmnth_pub , 
	nbs_design ,
	na.rm = TRUE
)
```

### Subsetting {-}

Restrict the survey design to currently covered by Medicare:
```{r eval = FALSE , results = "hide" }
sub_nbs_design <- subset( nbs_design , r7_c_curmedicare == 1 )
```
Calculate the mean (average) of this subset:
```{r eval = FALSE , results = "hide" }
svymean( ~ r7_n_totssbenlastmnth_pub , sub_nbs_design , na.rm = TRUE )
```

### Measures of Uncertainty {-}

Extract the coefficient, standard error, confidence interval, and coefficient of variation from any descriptive statistics function result, overall and by groups:
```{r eval = FALSE , results = "hide" }
this_result <- svymean( ~ r7_n_totssbenlastmnth_pub , nbs_design , na.rm = TRUE )

coef( this_result )
SE( this_result )
confint( this_result )
cv( this_result )

grouped_result <-
	svyby( 
		~ r7_n_totssbenlastmnth_pub , 
		~ age_categories , 
		nbs_design , 
		svymean ,
		na.rm = TRUE 
	)
	
coef( grouped_result )
SE( grouped_result )
confint( grouped_result )
cv( grouped_result )
```

Calculate the degrees of freedom of any survey design object:
```{r eval = FALSE , results = "hide" }
degf( nbs_design )
```

Calculate the complex sample survey-adjusted variance of any statistic:
```{r eval = FALSE , results = "hide" }
svyvar( ~ r7_n_totssbenlastmnth_pub , nbs_design , na.rm = TRUE )
```

Include the complex sample design effect in the result for a specific statistic:
```{r eval = FALSE , results = "hide" }
# SRS without replacement
svymean( ~ r7_n_totssbenlastmnth_pub , nbs_design , na.rm = TRUE , deff = TRUE )

# SRS with replacement
svymean( ~ r7_n_totssbenlastmnth_pub , nbs_design , na.rm = TRUE , deff = "replace" )
```

Compute confidence intervals for proportions using methods that may be more accurate near 0 and 1. See `?svyciprop` for alternatives:
```{r eval = FALSE , results = "hide" }
svyciprop( ~ male , nbs_design ,
	method = "likelihood" )
```

### Regression Models and Tests of Association {-}

Perform a design-based t-test:
```{r eval = FALSE , results = "hide" }
svyttest( r7_n_totssbenlastmnth_pub ~ male , nbs_design )
```

Perform a chi-squared test of association for survey data:
```{r eval = FALSE , results = "hide" }
svychisq( 
	~ male + r7_c_hhsize_pub , 
	nbs_design 
)
```

Perform a survey-weighted generalized linear model:
```{r eval = FALSE , results = "hide" }
glm_result <- 
	svyglm( 
		r7_n_totssbenlastmnth_pub ~ male + r7_c_hhsize_pub , 
		nbs_design 
	)

summary( glm_result )
```

---

## Replication Example {-}

This example matches the percentages and t-tests from the final ten rows of [Exhibit 4](https://www.ssa.gov/disabilityresearch/documents/TTW5_2_BeneChar.pdf#page=20):

```{r eval = FALSE , results = "hide" }
ex_4 <-
	data.frame(
		variable_label =
			c( 'coping with stress' , 'concentrating' , 
			'getting around outside of the home' , 
			'shopping for personal items' , 'preparing meals' , 
			'getting into or out of bed' , 'bathing or dressing' , 
			'getting along with others' , 
			'getting around inside the house' , 'eating' ) ,
		variable_name =
			c( "r3_i60_i" , "r3_i59_i" , "r3_i47_i" , "r3_i53_i" , 
			"r3_i55_i" , "r3_i49_i" , "r3_i51_i" , "r3_i61_i" , 
			"r3_i45_i" , "r3_i57_i" ) ,
		overall =
			c( 61 , 58 , 47 , 39 , 37 , 34 , 30 , 27 , 23 , 14 ) ,
		di_only =
			c( 60 , 54 , 47 , 36 , 35 , 36 , 30 , 23 , 24 , 13 ) ,
		concurrent =
			c( 63 , 63 , 47 , 43 , 41 , 34 , 33 , 31 , 23 , 15 ) ,
		concurrent_vs_di =
			c( F , T , F , F , F , F , F , T , F , F ) ,
		ssi =
			c( 61 , 62 , 47 , 40 , 39 , 33 , 29 , 31 , 22 , 15 ) ,
		ssi_vs_di =
			c( F , T , F , F , F , F , F , T , F , F )
	)
```		

Download, import, and recode the round 3 file:
```{r eval = FALSE , results = "hide" }
r3_tf <- tempfile()

r3_url <- "https://www.ssa.gov/disabilityresearch/documents/nbsr3pufstata.zip"
	
download.file( r3_url , r3_tf , mode = 'wb' )

r3_tbl <- read_stata( r3_tf )

r3_df <- data.frame( r3_tbl )

names( r3_df ) <- tolower( names( r3_df ) )

r3_design <- 
	svydesign(
		id = ~ r3_a_psu_pub , 
		strata = ~ r3_a_strata , 
		weights = ~ r3_wtr3_ben , 
		data = subset( r3_df , r3_wtr3_ben > 0 ) 
	)
	
r3_design <-
	update(
		r3_design ,
		
		benefit_type =
			factor(
				r3_orgsampinfo_bstatus ,
				levels = c( 2 , 3 , 1 ) ,
				labels = c( 'di_only' , 'concurrent' , 'ssi' )
			)

	)
```

Calculate the final ten rows of exhibit 4 and confirm each statistics and t-test matches:
```{r eval = FALSE , results = "hide" }
for( i in seq( nrow( ex_4 ) ) ){

	this_formula <- as.formula( paste( "~" , ex_4[ i , 'variable_name' ] ) )

	overall_percent <- svymean( this_formula , r3_design )
	
	stopifnot( 100 * round( coef( overall_percent ) , 2 ) == ex_4[ i , 'overall_percent' ] )
	
	benefit_percent <- svyby( this_formula , ~ benefit_type , r3_design , svymean )
	
	stopifnot(
		all.equal( 
			100 * as.numeric( round( coef( benefit_percent ) , 2 ) ) , 
			as.numeric( ex_4[ i , c( 'di_only' , 'concurrent' , 'ssi' ) ] )
		)
	)
	
	ttest_formula <- as.formula( paste( ex_4[ i , 'variable_name' ] , "~ benefit_type" ) )
	
	di_only_con_design <-
		subset( r3_design , benefit_type %in% c( 'di_only' , 'concurrent' ) )
	
	con_ttest <- svyttest( ttest_formula , di_only_con_design )

	stopifnot(
		all.equal( 
			as.logical( con_ttest$p.value < 0.05 ) , 
			as.logical( ex_4[ i , 'concurrent_vs_di' ] )
		)
	)
	
	di_only_ssi_design <-
		subset( r3_design , benefit_type %in% c( 'di_only' , 'ssi' ) )
	
	ssi_ttest <- svyttest( ttest_formula , di_only_ssi_design )

	stopifnot(
		all.equal(
			as.logical( ssi_ttest$p.value < 0.05 ) , 
			as.logical( ex_4[ i , 'ssi_vs_di' ] ) 
		)
	)

}

```

---

## Analysis Examples with `srvyr` \ {-}

The R `srvyr` library calculates summary statistics from survey data, such as the mean, total or quantile using [dplyr](https://github.com/tidyverse/dplyr/)-like syntax. [srvyr](https://github.com/gergness/srvyr) allows for the use of many verbs, such as `summarize`, `group_by`, and `mutate`, the convenience of pipe-able functions, the `tidyverse` style of non-standard evaluation and more consistent return types than the `survey` package. [This vignette](https://cran.r-project.org/web/packages/srvyr/vignettes/srvyr-vs-survey.html) details the available features. As a starting point for NBS users, this code replicates previously-presented examples:

```{r eval = FALSE , results = "hide" }
library(srvyr)
nbs_srvyr_design <- as_survey( nbs_design )
```
Calculate the mean (average) of a linear variable, overall and by groups:
```{r eval = FALSE , results = "hide" }
nbs_srvyr_design %>%
	summarize( mean = survey_mean( r7_n_totssbenlastmnth_pub , na.rm = TRUE ) )

nbs_srvyr_design %>%
	group_by( age_categories ) %>%
	summarize( mean = survey_mean( r7_n_totssbenlastmnth_pub , na.rm = TRUE ) )
```