This repository contains an analysis of the impact of wildfire smoke and its potential relationship to asthma rates in the city of Norman, OK. Specifically, we extract wildfire data from the USGS repository and historical AQI data from the US Environmental Protection Agency (EPA) Air Quality Service (AQS) API, perform EDA on wildfires in and around Norman, and develop a predictive model to estimate smoke from 2025-2050. Additionally, we extract survey asthma data from the Oklahoma State Department of Health Statistics website and develop a predictive model to forecast asthma rates in Norman using the forecasted smoke estimates. The full project report is included in this repository as "final_report.pdf".
To perform the wildfire smoke analysis, the complete wildfire dataset was retrieved from a US government repository. The schema of the downloaded data can be found below in the data schema section labelled "USGS_Wildland_Fire_Combined_Dataset.json"; however, this JSON was too large to store directly in Git. Preliminary processing filtered the wildfire data to wildfires occurring between 1961 and 2021 and within 650 (or 1800) miles of Norman, OK. The distance was computed using a wildfire user module developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program.
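The year and distance filter described above can be sketched as follows. This is not the course module itself, which is not reproduced here; it substitutes a plain haversine great-circle distance, assumes approximate coordinates for Norman, and assumes each fire is represented by a single point already expressed in latitude/longitude degrees. The helper name `within_range` is hypothetical.

```python
import math

# Approximate coordinates for Norman, OK (an assumption for illustration).
NORMAN_LAT, NORMAN_LON = 35.2226, -97.4395
EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def within_range(fire_year, lat, lon, max_miles=650,
                 first_year=1961, last_year=2021):
    """Hypothetical filter: keep fires inside the year window and distance cutoff."""
    if not first_year <= fire_year <= last_year:
        return False
    return haversine_miles(NORMAN_LAT, NORMAN_LON, lat, lon) <= max_miles
```

Passing `max_miles=1800` gives the wider cutoff mentioned above; the real module may compute distance differently (for example, to the nearest point of the fire perimeter).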
The AQI data acquisition notebook extracts the historical AQI data from the US Environmental Protection Agency (EPA) Air Quality Service (AQS) API. The documentation for the API provides definitions of the different call parameters and examples of the various calls that can be made to the API. Specifically, this analysis makes API requests for AQI data from Cleveland County (home to Norman) and nearby Oklahoma County (home to Oklahoma City) in Oklahoma.
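As a rough illustration of such a request, the sketch below only builds the URL for a daily-summary call by county and does not contact the API. The function name `build_daily_request`, the one-year date window, and the choice of parameter code (88101, PM2.5) are assumptions made for this example; the notebook may request different pollutants or use a different endpoint. The email and key are the credentials issued by the AQS signup process.

```python
from urllib.parse import urlencode

API_BASE = "https://aqs.epa.gov/data/api"
DAILY_BY_COUNTY = "/dailyData/byCounty"

# FIPS codes: Oklahoma is state 40; Cleveland County is 027, Oklahoma County is 109.
STATE_FIPS = "40"
COUNTIES = {"Cleveland": "027", "Oklahoma": "109"}


def build_daily_request(email, key, county_fips, year, param_codes):
    """Return the request URL for one county and one calendar year (hypothetical helper)."""
    query = urlencode(
        {
            "email": email,
            "key": key,
            "param": ",".join(param_codes),  # comma-separated parameter codes
            "bdate": f"{year}0101",          # begin date, YYYYMMDD
            "edate": f"{year}1231",          # end date, YYYYMMDD
            "state": STATE_FIPS,
            "county": county_fips,
        },
        safe=",",  # keep the comma between parameter codes readable
    )
    return f"{API_BASE}{DAILY_BY_COUNTY}?{query}"


# Placeholder credentials; replace with the values from your own AQS signup.
url = build_daily_request("user@example.com", "YOUR_KEY",
                          COUNTIES["Cleveland"], 2020, ["88101"])
```

The resulting URL can be passed to any HTTP client; looping over both counties and each year of interest would cover the full historical range.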
The asthma data acquisition involved querying the Oklahoma State Department of Health Statistics website. This data is licensed for usage in "...monitoring the health of the people of Oklahoma" (full license here). The extracted data is present in the "data_raw/" folder of this repository, consisting of the extracted asthma and smoking data from 2000, 2003-2010, and 2011-2023.
Modeling ideation in "modeling_asthma.ipynb" was performed in collaboration with fellow student Jake Flynn, while the AQI estimate analysis was performed in collaboration with fellow student Sid Gurajala. Lastly, this assignment and significant portions of the "data_acquisition_aqi.ipynb" and "data_acquisition_wildfire.ipynb" files were developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. This code is provided under the Creative Commons CC-BY license. Revision 1.2 - September 16, 2024
To reproduce the entire analysis, follow the guide detailed below.
Visit the USGS website and download the "USGS_Wildland_Fire_Combined_Dataset.json", and place it in the "data_raw/" folder of this repository.
Execute all the cells in "data_acquisition_wildfire.ipynb", which loads and formats the wildfire dataset into a much more compatible format.
Execute all the cells in the "data_acquisition_aqi.ipynb", which performs the API calls to extract the AQI data from the US Environmental Protection Agency (EPA) Air Quality Service (AQS) API.
Execute all the cells in the "data_processing_wildfire.ipynb", which filters the wildfire data and computes the yearly smoke estimate scores.
Execute all the cells in the "visualization_smoke.ipynb", which generates the three part 1 output plots related to wildfire smoke.
Execute all the cells in the "modeling_smoke.ipynb", which performs the modeling and forecasting of the smoke estimate.
Since the raw asthma data is already contained in the "data_raw/" folder, start by executing the "data_processing_asthma.ipynb", which cleans, parses, and performs preliminary EDA on the asthma data.
Execute all the cells in the "modeling_asthma.ipynb", which creates the linear model to predict asthma rates from the smoke estimate, and forecasts the asthma predictions.
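For readers who prefer to run the steps above non-interactively, the following is a minimal sketch that builds one `jupyter nbconvert` command per notebook in the order listed. It assumes Jupyter is installed, that the notebooks live in "notebooks/", and that each notebook resolves its data paths correctly when executed this way; `run_all` and `nbconvert_command` are hypothetical helper names, not part of the repository.

```python
import subprocess
from pathlib import Path

# Execution order taken from the reproduction steps above.
NOTEBOOK_ORDER = [
    "data_acquisition_wildfire.ipynb",
    "data_acquisition_aqi.ipynb",
    "data_processing_wildfire.ipynb",
    "visualization_smoke.ipynb",
    "modeling_smoke.ipynb",
    "data_processing_asthma.ipynb",
    "modeling_asthma.ipynb",
]


def nbconvert_command(notebook, notebook_dir="notebooks"):
    """Build the command that executes one notebook in place."""
    path = Path(notebook_dir) / notebook
    return ["jupyter", "nbconvert", "--to", "notebook",
            "--execute", "--inplace", str(path)]


def run_all(dry_run=True):
    """Return the commands in order; run them only when dry_run is False."""
    commands = [nbconvert_command(nb) for nb in NOTEBOOK_ORDER]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # stop at the first failing notebook
    return commands
```

Calling `run_all()` only lists the commands; `run_all(dry_run=False)` executes them and stops on the first error.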
Note that all files below denoted with (*) have been omitted from the actual repository because they are too large to upload to Git. For the files in data_intermediate/, a subset denoted by the suffix "_SMALL" has been included for representation purposes.
├── data_clean/ # Folder containing the cleaned data
│ ├── asthma_non-smoker_survey_cleaned.csv # CSV with the final, cleaned and processed survey non-smoker asthma data
│ ├── forecasted_smoke_estimates.csv # CSV with the forecasted smoke impact estimates from 2025-2050
│ ├── norman_aqi_yearly_average.csv # CSV with the final, cleaned and processed AQI estimate data
│ └── norman_wildfires_SI_yearly_average.csv # CSV with the final, cleaned and processed wildfire smoke estimate data
├── data_intermediate/ # Folder containing the intermediate data (Note the real files were too large to be stored in Git, so the files included here are smaller, sample files)
│ ├── *full_wildfires_SMALL.json # JSON with all of the extracted wildfire data
│ └── *norman_wildfires_SI_SMALL.json # JSON with the processed and filtered wildfire data
├── data_raw/ # Folder containing the raw data
│ ├── oklahoma_brfss_2000_raw.csv # CSV with raw survey asthma / smoker data from 2000
│ ├── oklahoma_brfss_2003-2010_raw.csv # CSV with raw survey asthma / smoker data from 2003-2010
│ ├── oklahoma_brfss_2011-2023_raw.csv # CSV with raw survey asthma / smoker data from 2011-2023
│ └── *USGS_Wildland_Fire_Combined_Dataset.json # JSON containing all of the raw wildfire data (Note this file was too large to upload, but can be downloaded directly from [here](https://www.sciencebase.gov/catalog/item/61aa537dd34eb622f699df81))
├── notebooks/ # Source code
│ ├── data_acquisition_aqi.ipynb # Notebook to make the api calls to extract, store, and process the aqi data
│ ├── data_acquisition_wildfire.ipynb # Notebook to extract the wildfire data from the raw JSON
│ ├── data_processing_asthma.ipynb # Notebook to perform the data formatting, processing, and filtering of the asthma data
│ ├── data_processing_wildfire.ipynb # Notebook to perform the data processing and smoke estimates for the wildfire smoke data
│ ├── modeling_asthma.ipynb # Notebook to generate the linear regression prediction model, and forecast asthma from the smoke estimate
│ ├── modeling_smoke.ipynb # Notebook to perform the predictive modeling of the smoke estimates
│ └── visualization_smoke.ipynb # Notebook to perform the plotting of the three smoke figures from the original analysis
├── final_report.pdf # The comprehensive final report
├── LICENSE # License documentation
├── .gitignore # git ignore for the repo
└── README.md # README for the repo
| Column Name | Data Type | Data Description
| --- | --- | ---
| 'Year' | 'int' | The year
| 'Percentage' | 'float64' | The percentage of non-smoker survey respondents with asthma
| Column Name | Data Type | Data Description
| --- | --- | ---
| 'Year' | 'int' | The year (since this is the forecast, from 2025-2050)
| 'Smoke_Estimate' | 'float64' | The predicted Smoke Estimate score
| 'Upper_bound' | 'float64' | The 95% CI upper bound of the Smoke Estimate score
| 'Lower_bound' | 'float64' | The 95% CI lower bound of the Smoke Estimate score
| Column Name | Data Type | Data Description
| --- | --- | ---
| 'Year' | 'int' | The year
| 'Average_AQI_Estimate' | 'float64' | The estimated average daily AQI for that year
| Column Name | Data Type | Data Description
| --- | --- | ---
| 'Year' | 'int' | The year
| 'Smoke_Estimate' | 'float64' | The estimated smoke score for that year
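The cleaned tables above share a 'Year' column, which is what lets the smoke estimates be related to the asthma percentages. The sketch below joins two such tables on 'Year' and fits an ordinary least-squares line in plain Python. It is an illustration only: the helper names are hypothetical, the numbers are made up rather than taken from the project data, and the notebook's actual model may include other terms or use a modeling library.

```python
def join_on_year(smoke_rows, asthma_rows):
    """Pair Smoke_Estimate with Percentage for years present in both tables."""
    asthma_by_year = {row["Year"]: row["Percentage"] for row in asthma_rows}
    return [
        (row["Smoke_Estimate"], asthma_by_year[row["Year"]])
        for row in smoke_rows
        if row["Year"] in asthma_by_year
    ]


def fit_line(pairs):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical values for illustration only; not taken from the project data.
smoke = [
    {"Year": 2000, "Smoke_Estimate": 1.0},
    {"Year": 2001, "Smoke_Estimate": 5.0},  # no asthma survey row, so it is dropped
    {"Year": 2003, "Smoke_Estimate": 2.0},
    {"Year": 2004, "Smoke_Estimate": 3.0},
]
asthma = [
    {"Year": 2000, "Percentage": 6.0},
    {"Year": 2003, "Percentage": 8.0},
    {"Year": 2004, "Percentage": 10.0},
]
pairs = join_on_year(smoke, asthma)
slope, intercept = fit_line(pairs)
```

Applying `slope * smoke_estimate + intercept` to each forecasted 'Smoke_Estimate' value would then give a corresponding asthma forecast under this simplified model.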
Note that this JSON contains a large number of fields, most of which were defined in the original dataset and most of which we do not end up using. Listed below are the relevant attributes used during analysis.
{
"type": "object",
"description": "Full fire data for a specific fire",
"properties":{
"attributes":{
"type": "object",
"description": "Full fire data for a specific fire",
"properties":{
"items":{
"OBJECT_ID": {
"type": "int",
"description": "The ID for this specific fire"
},
"FIRE_YEAR": {
"type": "int",
"description": "The year the fire occurred"
},
"GIS_Acres": {
"type": "float64",
"description": "The amount of burned GIS acres"
}
}
}
},
"geometry":{
"type": "object",
"description": "GIS Ring data for the fire location",
"properties":{
"attributes":{
"rings": {
"type": "array",
"description": "The lat and lon of the point"
}
}
}
}
}
}
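To make the schema above concrete, here is a minimal sketch that pulls the listed fields out of one fire record. It assumes each record is a dictionary with an "attributes" object and a "geometry" object holding "rings", as described above; the sample record is hypothetical, and `summarize_feature` is an illustrative helper rather than code from the notebooks.

```python
def summarize_feature(feature):
    """Extract the fields used in the analysis from one fire record."""
    attrs = feature["attributes"]
    rings = feature.get("geometry", {}).get("rings", [])
    first_ring = rings[0] if rings else []  # list of coordinate pairs, if present
    return {
        "OBJECT_ID": int(attrs["OBJECT_ID"]),
        "FIRE_YEAR": int(attrs["FIRE_YEAR"]),
        "GIS_Acres": float(attrs["GIS_Acres"]),
        "ring_point_count": len(first_ring),
    }


# Hypothetical record shaped like the schema; the values are made up.
sample_feature = {
    "attributes": {"OBJECT_ID": 1, "FIRE_YEAR": 1999, "GIS_Acres": 120.5},
    "geometry": {"rings": [[[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]]},
}
summary = summarize_feature(sample_feature)
```

Records without geometry yield a `ring_point_count` of 0 instead of raising an error, which is a design choice of this sketch.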
Note that this JSON contains 32 fields, 30 of which were defined in the original dataset, and most of which we do not end up using. This JSON has a structure very similar to that of the full_wildfires.json described above, with a few new attributes. Listed below are the relevant attributes used during analysis.
{
"type": "array",
"description": "Full fire data for a specific fire",
"properties":{
"items":{
"OBJECT_ID": {
"type": "int",
"description": "The ID for this specific fire"
},
"FIRE_YEAR": {
"type": "int",
"description": "The year the fire occurred"
},
"GIS_Acres": {
"type": "float64",
"description": "The amount of burned GIS acres"
},
"distance_from_norman": {
"type": "float64",
"description": "The distance of the fire from Norman, in miles"
},
"Smoke_Impact": {
"type": "float64",
"description": "The computed Smoke Impact score"
}
}
}
}
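Records with this shape can be rolled up into one value per year. The sketch below averages 'Smoke_Impact' by 'FIRE_YEAR'; using a simple mean is an assumption suggested by the output file name "norman_wildfires_SI_yearly_average.csv", and the processing notebook may weight or aggregate fires differently. The sample records are hypothetical.

```python
from collections import defaultdict


def yearly_average_smoke(fires):
    """Average the Smoke_Impact score of all fires within each FIRE_YEAR."""
    totals = defaultdict(lambda: [0.0, 0])  # year -> [running sum, fire count]
    for fire in fires:
        bucket = totals[fire["FIRE_YEAR"]]
        bucket[0] += fire["Smoke_Impact"]
        bucket[1] += 1
    return {year: total / count for year, (total, count) in sorted(totals.items())}


# Hypothetical records; the values are made up for illustration.
fires = [
    {"OBJECT_ID": 1, "FIRE_YEAR": 2000, "Smoke_Impact": 2.0},
    {"OBJECT_ID": 2, "FIRE_YEAR": 2000, "Smoke_Impact": 4.0},
    {"OBJECT_ID": 3, "FIRE_YEAR": 2001, "Smoke_Impact": 1.5},
]
averages = yearly_average_smoke(fires)
```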
In their raw form as downloaded from the OK2Share website, these datasets were not in a clean tabular format and therefore required extensive post-processing (see data_processing_asthma.ipynb). Shown below is a sample of the input format (the actual raw data for 2000). To obtain an HTML version, visit the OK2Share website and select the relevant smoking and asthma fields in the BRFSS crosstab survey data request.
| Column 1 | Column 2 | Column 3 | Column 4 | Column 5 | Column 6
| --- | --- | --- | --- | --- | ---
| Year of interview 2000 | | | | |
| | | | Currently smoking | |
| | | | No | Yes | Total
| Reported current asthma | | | | |
| No | n | | 761 | 194 | 955
| | N | | 521502 | 131652 | 653154
| | % | | 79.8 | 20.2 | 100
| | CI | | ( 77.0 - 82.7) | ( 17.3 - 23.0) | n/a
| Yes | n | | 50 | 15 | 65
| | N | | 29322 | 10388 | 39710
| | % | | 73.8 | 26.2 | 100
| | CI | | ( 61.6 - 86.1) | ( 13.9 - 38.4) | n/a
| Total | n | | 811 | 209 | 1020
| | N | | 550824 | 142040 | 692864
| | % | | 79.5 | 20.5 | 100
| | CI | | ( 76.7 - 82.3) | ( 17.7 - 23.3) | n/a