To learn more about TJI, visit our website at www.texasjusticeinitiative.org
The data itself lives in our data.world account
Many different datasets and files are used by the TJI website and our analyses. All non-manual data processing steps live in this repo.
This repo contains all scripts and notebooks that clean, scrape, merge, or otherwise process data files. There are two main folders:
- data_scraping/ - reads data from anywhere on the internet and writes csvs to TJI's data.world account
- data_cleaning/ - files should both be READ FROM and WRITTEN TO the TJI data.world account. Any dataset not on data.world should be scraped or manually added to data.world first.
- The output of a cleaning script should be a file whose name begins with the clean_ prefix.
Regenerating data for the TJI website (repo) requires two steps:
- Run these notebooks to generate the cleaned Officer Involved Shooting (OIS) datasets:
- data_cleaning/clean_ois_civilians_shot.ipynb
- data_cleaning/clean_ois_officers_shot.ipynb
- Run this notebook to generate the cleaned Custodial Death Report data:
- data_cleaning/clean_cdr.ipynb
- Notes:
- The raw data is manually maintained by Eva in Google Drive and automatically synced to data.world, but this data needs to be cleaned before it is ready for analysis or website use.
- These notebooks both read from and write to data.world -- see later in this README for details.
- This will read the cleaned datasets and generate several output files on your local machine:
cdr_compressed.json
cdr_full.csv
shot_civilians_compressed.json
shot_civilians_full.csv
shot_officers_compressed.json
shot_officers_full.csv
- Move these files into the data/ folder of the website repo and create a PR.
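Before opening the PR, it can help to confirm that all six output files were actually generated. The following is a minimal sketch, not part of the TJI tooling: the function names and the two directory arguments are hypothetical, and it copies rather than moves the files so the local originals stay in place.

```python
import shutil
from pathlib import Path

# File names taken from the list of output files above.
EXPECTED_OUTPUTS = [
    "cdr_compressed.json",
    "cdr_full.csv",
    "shot_civilians_compressed.json",
    "shot_civilians_full.csv",
    "shot_officers_compressed.json",
    "shot_officers_full.csv",
]


def missing_outputs(output_dir):
    """Return the expected output files that are absent from output_dir."""
    output_dir = Path(output_dir)
    return [name for name in EXPECTED_OUTPUTS if not (output_dir / name).is_file()]


def copy_outputs(output_dir, website_data_dir):
    """Copy every expected output into the website repo's data/ folder.

    Raises FileNotFoundError if any expected file is missing, so a partial
    regeneration never reaches the website repo.
    """
    missing = missing_outputs(output_dir)
    if missing:
        raise FileNotFoundError(f"Missing output files: {', '.join(missing)}")
    website_data_dir = Path(website_data_dir)
    website_data_dir.mkdir(parents=True, exist_ok=True)
    for name in EXPECTED_OUTPUTS:
        shutil.copy2(Path(output_dir) / name, website_data_dir / name)
    return [website_data_dir / name for name in EXPECTED_OUTPUTS]
```

For example, `copy_outputs(".", "../website/data")` would copy the files from the current directory, where the website repo path is a placeholder for wherever it is checked out locally.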
Data cleaning and compression for OIS and CDR data are currently automated via a daily cronjob. See the automation documentation for details.
To run tests in this repo, please follow the instructions guide.
Project: Texas Deaths in Custody from 2005-present - tji/tx-deaths-in-custody-2005-2015
- File: cleaned_custodial_death_reports.csv
- Description: All Texas custodial deaths since 2005 (a "custodial death" is a death in jail, prison, custody, or the process of arrest -- see Wikipedia)
- Generation pipeline:
- (Manual) TJI staff manually parse and enter the data into a master spreadsheet, CDR Reports All.xlsx, in Google Drive, which is synced to data.world here
- A member of TJI runs this notebook to create the final file: data_cleaning/clean_cdr.ipynb
- Quirks
- The Texas Department of Criminal Justice, which runs Texas prisons and a few state jails, until 2013 did NOT file custodial death reports for prisoners that died in an inpatient setting. In practice, this means that a good number of deaths from natural causes of state prisoners were not reported from 2005-2012 (you can see this clearly in the exploratory analysis here). Thus, if you simply plot custodial deaths over time, you'll see a jump from 2012 to 2013 for this reason.
- The form that was used to report custodial deaths changed in 2016, and by 2017 all records use the new form. The forms differ, but many questions are nearly the same. You can see the forms in this repo. The cleaning script attempts to match fields and options across form versions so the output file only has data that is consistent across all versions. See the form_version column in the output file to see what version was used for entering that record.
- Diligent collection of custodial deaths in Texas began in 2005, but inconsistent data exists as far back as 1980. To see these older files, explore the older_versions tab of the raw data file here.
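The quirks above matter most when counting deaths over time. As a rough sketch, assuming the cleaned file has a date column (the name death_date and its ISO format are hypothetical; form_version is the column named above), yearly counts can be computed and the years before 2013 flagged as under-reported:

```python
import csv
from collections import Counter

# Hypothetical column name and format; check the real header of
# cleaned_custodial_death_reports.csv before relying on this.
DATE_COLUMN = "death_date"        # assumed ISO dates, e.g. 2012-06-30
VERSION_COLUMN = "form_version"   # column named in this README

# TDCJ did not report inpatient deaths until 2013 (see the first quirk).
FIRST_FULLY_REPORTED_YEAR = 2013


def load_rows(path):
    """Read a CSV file into a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def deaths_by_year(rows):
    """Count records per calendar year, sorted by year."""
    counts = Counter(int(row[DATE_COLUMN][:4]) for row in rows)
    return dict(sorted(counts.items()))


def underreported_years(counts):
    """Years whose totals likely omit some state prisoner deaths."""
    return [year for year in counts if year < FIRST_FULLY_REPORTED_YEAR]


def records_by_form_version(rows):
    """Count records per reporting form version."""
    return Counter(row[VERSION_COLUMN] for row in rows)
```

A plot of `deaths_by_year(rows)` could then mark or annotate the years returned by `underreported_years` so the 2012 to 2013 jump is not read as a real increase.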
Project: Officer Involved Shootings - tji/officer-involved-shootings
- File: shot_civilians.csv
- Description: Civilians shot by police, late 2015 - present
- Generation pipeline:
- A TJI bot monitors the Texas Attorney General's website for new OIS reports.
- New reports are emailed to TJI staff.
- TJI staff manually parse and enter the data into a master spreadsheet, OIS.xlsx, in Google Drive, which is synced to data.world here
- A member of TJI runs this notebook to create the final file: data_cleaning/clean_ois_civilians_shot.ipynb
- Quirks
- There is one record for every shot civilian. Thus, if a single incident results in multiple civilians shot, there will be multiple rows with largely duplicate information (e.g. address, date, officer details, etc). Incident-level analysis should de-duplicate, say by matching on date and address.
- It's hard to know exactly how many officers were on scene. In theory, there are two pieces of information in each record that reveal this information. First, there is a checkbox on the form called "multiple officers involved," which is checked about 80% of the time. Second, there are spaces in the form for the details (agency, gender, race, age, etc) of each officer involved. However, when "multiple officers involved" is checked, only ~half the time do details for more than one officer exist. Similarly, sometimes "multiple officers involved" is NOT checked, yet details for multiple officers exist. It's unclear what to make of this information.
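One way to quantify the mismatch between the checkbox and the officer details is to classify each record. This is only a sketch under stated assumptions: the column names (multiple_officers_involved, and per-officer columns ending in a slot number such as officer_age_1) and the accepted checkbox spellings are hypothetical and need to be adapted to the real file.

```python
import re
from collections import Counter

# Hypothetical column names; the real headers in shot_civilians.csv may differ.
FLAG_COLUMN = "multiple_officers_involved"
# Assumes per-officer detail columns end in a slot number, e.g. officer_age_1.
OFFICER_COLUMN = re.compile(r"^officer_\w+_(\d+)$")


def is_checked(value):
    """Interpret a checkbox value; the accepted spellings are an assumption."""
    return str(value).strip().lower() in {"true", "1", "yes"}


def officers_with_details(row):
    """Count officer slots that have at least one non-empty detail field."""
    slots = set()
    for column, value in row.items():
        match = OFFICER_COLUMN.match(column)
        if match and value is not None and str(value).strip():
            slots.add(int(match.group(1)))
    return len(slots)


def classify(row):
    """Label a record by whether the checkbox agrees with the details."""
    flagged = is_checked(row.get(FLAG_COLUMN, ""))
    detailed = officers_with_details(row) > 1
    if flagged and not detailed:
        return "flag_without_details"
    if detailed and not flagged:
        return "details_without_flag"
    return "consistent"


def summarize(rows):
    """Count how many records fall into each category."""
    return Counter(classify(row) for row in rows)
```

The summary does not resolve which source is right; it only shows how often the two disagree, which is useful context when reporting officer counts.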
- File: shot_officers.csv
- Description: Peace officers shot in the line of duty, late 2015 - present
- Generation pipeline:
- Identical to shot_civilians.csv above, except that in the last step, a different notebook is run: data_cleaning/clean_ois_officers_shot.ipynb
- Quirks
- Analogous to the previous file, there is one record for every shot officer. Thus, if a single incident results in multiple officers shot, there will be multiple rows with largely duplicate information (e.g. address, date, civilian details, etc). Incident-level analysis should de-duplicate, say by matching on date and address.
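Both OIS files have one row per person shot, so incident-level analysis needs the de-duplication described in the quirks. A minimal sketch of matching on date and address follows. The column names date_incident and incident_address are hypothetical, and normalizing case and whitespace in the address is an added assumption, not something the cleaning notebooks are known to do.

```python
# Hypothetical column names; replace with the real headers in
# shot_civilians.csv or shot_officers.csv.
DATE_COLUMN = "date_incident"
ADDRESS_COLUMN = "incident_address"


def incident_key(row):
    """Build a (date, address) key, ignoring case and extra whitespace."""
    address = " ".join(row[ADDRESS_COLUMN].lower().split())
    return (row[DATE_COLUMN].strip(), address)


def group_by_incident(rows):
    """Group person-level rows into incidents keyed by date and address."""
    incidents = {}
    for row in rows:
        incidents.setdefault(incident_key(row), []).append(row)
    return incidents
```

With this grouping, `len(group_by_incident(rows))` gives the number of incidents, and each value lists the people shot in that incident. Exact matching will still split incidents whose addresses are spelled differently across reports.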
Project: Auxiliary Datasets - tji/auxiliary-datasets
- File: texas_counties.csv
- Description: List of Texas counties and their "seat" city
- Generation pipeline:
- Run this notebook: data_scraping/scrape_texas_county_names.ipynb
- Data is fetched from this Wikipedia page
- File: census_data_by_county.csv
- Description: Extensive US Census data (2010 and 2016), one row per Texas county.
- Generation pipeline:
- Run this notebook: data_scraping/scrape_census_data_by_county.ipynb
- Data is fetched from the US Census QuickFacts (e.g. here)
- File: num_officers_by_agency.csv
- Description: Number of officers in each Texas police department
- Generation pipeline:
- TJI staff request data from TCOLE
- TCOLE emails an Excel file, which TJI staff place in Google Drive (TCOLE.xlsx)
- The first/only sheet of the Excel file is uploaded to data.world as raw_num_officers_by_agency.csv
- Run this notebook to generate the final data file: data_cleaning/clean_num_officers_by_agency.ipynb
- File: agencies_and_counties.csv
- Description: List of Texas police agencies (names are normalized) and the county they belong to
- Generation pipeline:
- This is also generated in the flow that creates num_officers_by_agency.csv above, via the same notebook: data_cleaning/clean_num_officers_by_agency.ipynb
- File: list_of_texas_officers.csv
- Description: Names, agencies, and demographics of all police officers in Texas.
- Generation pipeline:
- TJI staff request data from TCOLE
- TCOLE sends an Excel file, which TJI staff place in Google Drive (Current appointed POs with certs and service time and gender- Ruth.xlsx)
- The Excel file is uploaded to data.world as raw_list_of_texas_officers.csv
- Run this notebook: data_cleaning/clean_list_of_texas_officers.ipynb
- File: ucr_crime_by_county_2016.xls (and 2015 and 2014)
- Description: Crime by county in Texas from the FBI's Uniform Crime Report
- Generation pipeline: