22-enem.Rmd

# Exame Nacional do Ensino Medio (ENEM) {-}

[![Build Status](https://travis-ci.org/asdfree/enem.svg?branch=master)](https://travis-ci.org/asdfree/enem) [![Build status](https://ci.appveyor.com/api/projects/status/github/asdfree/enem?svg=TRUE)](https://ci.appveyor.com/project/ajdamico/enem)

*Contributed by Dr. Djalma Pessoa <<pessoad@gmail.com>>*

The Exame Nacional do Ensino Medio (ENEM) contains the standardized test results of most Brazilian high school students.

* An annual table with one row per student.

* Updated annually since 1998.

* Maintained by the Brazil's [Instituto Nacional de Estudos e Pesquisas Educacionais Anisio Teixeira (INEP)](http://www.inep.gov.br/)

## Simplified Download and Importation {-}

The R `lodown` package easily downloads and imports all available ENEM microdata by simply specifying `"enem"` with an `output_dir =` parameter in the `lodown()` function. Depending on your internet connection and computer processing speed, you might prefer to run this step overnight.

```{r eval = FALSE }
library(lodown)
lodown( "enem" , output_dir = file.path( path.expand( "~" ) , "ENEM" ) )
```

`lodown` also provides a catalog of available microdata extracts with the `get_catalog()` function. After requesting the ENEM catalog, you could pass a subsetted catalog through the `lodown()` function in order to download and import specific extracts (rather than all available extracts).

```{r eval = FALSE , results = "hide" }
library(lodown)
# examine all available ENEM microdata files
enem_cat <-
	get_catalog( "enem" ,
		output_dir = file.path( path.expand( "~" ) , "ENEM" ) )

# 2015 only
enem_cat <- subset( enem_cat , year == 2015 )
# download the microdata to your local computer
enem_cat <- lodown( "enem" , enem_cat )
```

## Analysis Examples with SQL and `RSQLite` \ {-}

Connect to a database:

```{r eval = FALSE }
library(DBI)
dbdir <- file.path( path.expand( "~" ) , "ENEM" , "SQLite.db" )
db <- dbConnect( RSQLite::SQLite() , dbdir )
```

```{r eval = FALSE }

```

### Variable Recoding {-}

Add new columns to the data set:
```{r eval = FALSE }
dbSendQuery( db , "ALTER TABLE microdados_enem_2015 ADD COLUMN female INTEGER" )

dbSendQuery( db , 
	"UPDATE microdados_enem_2015 
	SET female = 
		CASE WHEN tp_sexo = 2 THEN 1 ELSE 0 END" 
)

dbSendQuery( db , "ALTER TABLE microdados_enem_2015 ADD COLUMN fathers_education INTEGER" )

dbSendQuery( db , 
	"UPDATE microdados_enem_2015 
	SET fathers_education = 
		CASE WHEN q001 = 1 THEN '01 - nao estudou'
			WHEN q001 = 2 THEN '02 - 1 a 4 serie'
			WHEN q001 = 3 THEN '03 - 5 a 8 serie'
			WHEN q001 = 4 THEN '04 - ensino medio incompleto'
			WHEN q001 = 5 THEN '05 - ensino medio'
			WHEN q001 = 6 THEN '06 - ensino superior incompleto'
			WHEN q001 = 7 THEN '07 - ensino superior'
			WHEN q001 = 8 THEN '08 - pos-graduacao'
			WHEN q001 = 9 THEN '09 - nao estudou' ELSE NULL END" 
)
```

### Unweighted Counts {-}

Count the unweighted number of records in the SQL table, overall and by groups:
```{r eval = FALSE , results = "hide" }
dbGetQuery( db , "SELECT COUNT(*) FROM microdados_enem_2015" )

dbGetQuery( db ,
	"SELECT
		fathers_education ,
		COUNT(*) 
	FROM microdados_enem_2015
	GROUP BY fathers_education"
)
```

### Descriptive Statistics {-}

Calculate the mean (average) of a linear variable, overall and by groups:
```{r eval = FALSE , results = "hide" }
dbGetQuery( db , "SELECT AVG( nota_mt ) FROM microdados_enem_2015" )

dbGetQuery( db , 
	"SELECT 
		fathers_education , 
		AVG( nota_mt ) AS mean_nota_mt
	FROM microdados_enem_2015 
	GROUP BY fathers_education" 
)
```

Calculate the distribution of a categorical variable:
```{r eval = FALSE , results = "hide" }
dbGetQuery( db , 
	"SELECT 
		uf_residencia , 
		COUNT(*) / ( SELECT COUNT(*) FROM microdados_enem_2015 ) 
			AS share_uf_residencia
	FROM microdados_enem_2015 
	GROUP BY uf_residencia" 
)
```

Calculate the sum of a linear variable, overall and by groups:
```{r eval = FALSE , results = "hide" }
dbGetQuery( db , "SELECT SUM( nota_mt ) FROM microdados_enem_2015" )

dbGetQuery( db , 
	"SELECT 
		fathers_education , 
		SUM( nota_mt ) AS sum_nota_mt 
	FROM microdados_enem_2015 
	GROUP BY fathers_education" 
)
```

Calculate the 25th, median, and 75th percentiles of a linear variable, overall and by groups:
```{r eval = FALSE , results = "hide" }
RSQLite::initExtension( db )

dbGetQuery( db , 
	"SELECT 
		LOWER_QUARTILE( nota_mt ) , 
		MEDIAN( nota_mt ) , 
		UPPER_QUARTILE( nota_mt ) 
	FROM microdados_enem_2015" 
)

dbGetQuery( db , 
	"SELECT 
		fathers_education , 
		LOWER_QUARTILE( nota_mt ) AS lower_quartile_nota_mt , 
		MEDIAN( nota_mt ) AS median_nota_mt , 
		UPPER_QUARTILE( nota_mt ) AS upper_quartile_nota_mt
	FROM microdados_enem_2015 
	GROUP BY fathers_education" 
)
```

### Subsetting {-}

Limit your SQL analysis to took mathematics exam with `WHERE`:
```{r eval = FALSE , results = "hide" }
dbGetQuery( db ,
	"SELECT
		AVG( nota_mt )
	FROM microdados_enem_2015
	WHERE in_presenca_mt = 1"
)
```

### Measures of Uncertainty {-}

Calculate the variance and standard deviation, overall and by groups:
```{r eval = FALSE , results = "hide" }
RSQLite::initExtension( db )

dbGetQuery( db , 
	"SELECT 
		VARIANCE( nota_mt ) , 
		STDEV( nota_mt ) 
	FROM microdados_enem_2015" 
)

dbGetQuery( db , 
	"SELECT 
		fathers_education , 
		VARIANCE( nota_mt ) AS var_nota_mt ,
		STDEV( nota_mt ) AS stddev_nota_mt
	FROM microdados_enem_2015 
	GROUP BY fathers_education" 
)
```

### Regression Models and Tests of Association {-}

Perform a t-test:
```{r eval = FALSE , results = "hide" }
enem_slim_df <- 
	dbGetQuery( db , 
		"SELECT 
			nota_mt , 
			female ,
			uf_residencia
		FROM microdados_enem_2015" 
	)

t.test( nota_mt ~ female , enem_slim_df )
```

Perform a chi-squared test of association:
```{r eval = FALSE , results = "hide" }
this_table <-
	table( enem_slim_df[ , c( "female" , "uf_residencia" ) ] )

chisq.test( this_table )
```

Perform a generalized linear model:
```{r eval = FALSE , results = "hide" }
glm_result <- 
	glm( 
		nota_mt ~ female + uf_residencia , 
		data = enem_slim_df
	)

summary( glm_result )
```

## Analysis Examples with `dplyr` \ {-}

The R `dplyr` library offers an alternative grammar of data manipulation to base R and SQL syntax. [dplyr](https://github.com/tidyverse/dplyr/) offers many verbs, such as `summarize`, `group_by`, and `mutate`, the convenience of pipe-able functions, and the `tidyverse` style of non-standard evaluation. [This vignette](https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html) details the available features. As a starting point for ENEM users, this code replicates previously-presented examples:

```{r eval = FALSE , results = "hide" }
library(dplyr)
library(dbplyr)
dplyr_db <- dplyr::src_sqlite( dbdir )
enem_tbl <- tbl( dplyr_db , 'microdados_enem_2015' )
```
Calculate the mean (average) of a linear variable, overall and by groups:
```{r eval = FALSE , results = "hide" }
enem_tbl %>%
	summarize( mean = mean( nota_mt ) )

enem_tbl %>%
	group_by( fathers_education ) %>%
	summarize( mean = mean( nota_mt ) )
```

---

## Replication Example {-}

```{r eval = FALSE , results = "hide" }
dbGetQuery( db , "SELECT COUNT(*) FROM microdados_enem_2015" )
```