health_dev: Subnational reproductive, maternal, newborn, child and adolescent health and development atlas for India, version 1.1
2022-07-26
Table 1. Files and their descriptions within the health_dev GitHub repository for the paper Subnational reproductive, maternal, newborn, child, and adolescent health and development atlas for India.
Name | Type | Description |
---|---|---|
out | Folder | Folder containing the prediction and uncertainty gridded datasets (raster files) produced by the prediction R script and the out-of-sample cross-validation summary statistics (csv files) produced by the validation R script. |
rda | Folder | Folder to contain INLA objects and the model summary statistics (saved as rda files) produced from the modelling R script. Files within this folder will be required to run the prediction and validation R scripts. |
shp | Folder | This folder contains the shapefiles required to run all R scripts in this repository. These should be the administrative boundaries of the study area as polygons and the locations of the clusters in the study area as points (lat/lon). These shapefiles can be obtained from the DHS Program at www.dhsprogram.com. |
tif | Folder | This folder contains the raster files for all geospatial covariates. Files within this folder are required to run the prediction R script. Examples of geospatial covariate datasets can be found at www.hub.worldpop.org/project/categories?id=14. |
covariates | csv | This file contains a demo of the format of the data extracted from the geospatial covariates considered when modelling the health and development indicators. This file is required to run all R scripts in this repository. Examples of geospatial covariate datasets can be found at https://hub.worldpop.org/project/categories?id=14. |
indicators | csv | This file contains a demo of the health and development indicators to model. This file is required to run all R scripts in this repository. The indicators were extracted from the India NFHS-4 (National Family Health Survey 4) 2015-16 DHS (Demographic and Health Surveys) database (1-3), which is publicly available after registration on the Measure DHS website (www.dhsprogram.com). |
modelling | R | R script for modelling the health and development indicators. The files required to run this script are the covariates and indicators csv files and the files in the shp folder. This script outputs an INLA object and the model summary statistics (both saved as rda files). Further description of the methodology is given in the sections below. |
prediction | R | R script for predicting the health and development indicators. The files required to run this script are the covariates and indicators csv files, the files in the shp folder, the files in the tif folder, and the files in the rda folder. This script outputs a prediction gridded dataset (tif file) and an uncertainty gridded dataset (tif file) for the target indicator, both of which are saved to the out folder. |
validation | R | R script for out-of-sample (k-fold) validation of the models of the health and development indicators. The files required to run this script are the covariates and indicators csv files, the files in the shp folder, and the files in the rda folder. This script outputs k-fold summary statistics as csv files. Further description of the methodology is given in the sections below. |
The geospatial covariate selection has two stages. In the first stage, we check for multicollinearity amongst the geospatial covariates. In the second stage, we employ a backward stepwise model selection method.
To check for multicollinearity, a Pearson correlation matrix for the geospatial covariates is created and any pair of covariates whose Pearson correlation coefficient exceeds the chosen threshold is flagged. The flagged covariates are then individually fitted in non-Bayesian binomial generalised linear models (GLMs), and the Bayesian information criterion (BIC) of each model is calculated. Within each flagged pair, the covariate whose model has the lower BIC is retained and the covariate whose model has the greater BIC is omitted for the target indicator. To further ensure that multicollinearity is not a problem among the remaining geospatial covariates, variance inflation factors (VIFs) are calculated. Any covariate with a VIF > 4 is omitted.
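The flagging step of the multicollinearity check can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the repository's R code. The covariate names, the demo values, the threshold of 0.8 and the use of the absolute value of the coefficient are all assumptions made for the example.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def flag_correlated_pairs(covariates, threshold):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold.

    `covariates` maps a covariate name to its values at the cluster locations.
    Using |r| (rather than r) is an assumption of this sketch.
    """
    flagged = []
    for a, b in combinations(sorted(covariates), 2):
        r = pearson(covariates[a], covariates[b])
        if abs(r) > threshold:
            flagged.append((a, b, r))
    return flagged


# Hypothetical covariate values at five cluster locations.
demo = {
    "elevation": [1.0, 2.0, 3.0, 4.0, 5.0],
    "temperature": [2.1, 3.9, 6.2, 8.1, 9.8],
    "night_lights": [5.0, 1.0, 4.0, 2.0, 3.0],
}
flagged = flag_correlated_pairs(demo, threshold=0.8)  # 0.8 is illustrative only
for name_a, name_b, r in flagged:
    print(f"{name_a} ~ {name_b}: r = {r:.3f}")
```

Each flagged pair would then go through the BIC comparison described above to decide which of the two covariates to keep.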
After checking for multicollinearity, a backward model selection algorithm is used to select the best (sub)set of geospatial covariates for the target indicator. The algorithm is as follows. The remaining geospatial covariates are fitted in a non-Bayesian binomial GLM and the BIC is calculated. A covariate is removed from the model and the BIC is recalculated. If the recalculated BIC is less than the previously calculated BIC, this subset of covariates is preferred. These steps are performed iteratively until the recalculated BIC is no longer less than the BIC from the previous iteration. At this point, the best (sub)set of geospatial covariates has been obtained, and it is used when constructing the Bayesian point-referenced spatial binomial GLM in INLA.
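The backward selection loop can be sketched as below, in Python for illustration only. The sketch makes two assumptions that the description above does not settle: at each iteration every single-covariate removal is tried and the one giving the lowest BIC is kept, and the BIC is supplied by a caller-provided function. The function `toy_bic`, the covariate names and the gain values are hypothetical stand-ins for fitting a binomial GLM.

```python
def backward_select(covariates, bic):
    """Backward stepwise selection driven by a BIC function.

    `bic` is a callable that takes a tuple of covariate names and returns the
    BIC of the model fitted with them (supplied by the caller).
    """
    current = tuple(covariates)
    current_bic = bic(current)
    while len(current) > 1:
        # Try dropping each covariate in turn and keep the best candidate.
        candidates = [tuple(c for c in current if c != drop) for drop in current]
        best = min(candidates, key=bic)
        best_bic = bic(best)
        if best_bic >= current_bic:
            break  # no removal lowers the BIC, so stop
        current, current_bic = best, best_bic
    return current, current_bic


# Hypothetical stand-in for a GLM fit: each covariate improves the fit by a
# fixed amount, and each covariate in the model costs a penalty of 3.
GAIN = {"elevation": 10.0, "travel_time": 1.0, "rainfall": 5.0, "night_lights": 0.5}


def toy_bic(subset):
    return 100.0 - sum(GAIN[c] for c in subset) + 3.0 * len(subset)


selected, selected_bic = backward_select(list(GAIN), toy_bic)
print(selected, selected_bic)
```

In practice the callable would refit the non-Bayesian binomial GLM for each candidate subset and return its BIC.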
The constructed Bayesian point-referenced spatial binomial GLM is given as follows.
The number of occurrences of the target indicator event at cluster location $s_i$, denoted $Y(s_i)$ for $i = 1, \dots, n$, follows a binomial distribution with $N(s_i)$, the total number of surveys conducted within the cluster location, and $p(s_i)$, the proportion of events happening in the cluster:

$$Y(s_i) \sim \mathrm{Binomial}\big(N(s_i),\, p(s_i)\big)$$

With a logit link, $p(s_i)$ is calculated from a linear combination of the fixed effects $x(s_i)^{T}\beta$, the spatial random effects $\omega(s_i)$ and the independent and identically distributed (iid) random effects $\epsilon(s_i)$:

$$\mathrm{logit}\big(p(s_i)\big) = x(s_i)^{T}\beta + \omega(s_i) + \epsilon(s_i)$$

In the fixed effects, $x(s_i)$ is the vector of geospatial covariates selected by the backward model selection algorithm described above and $\beta$ is a vector of regression coefficients to be estimated. The spatial random effects $\omega = (\omega(s_1), \dots, \omega(s_n))^{T}$ follow a multivariate normal distribution with zero mean and covariance matrix $\Sigma$. In this study, the elements of $\Sigma$ are calculated with the exponential covariance function, which depends on the spatial variance $\sigma^2$, the spatial decay parameter $\phi$ and the Euclidean distance matrix $D$ between the cluster locations:

$$\Sigma_{ij} = \sigma^2 \exp(-\phi D_{ij})$$

The parameters $\sigma^2$ and $\phi$ are unknown and are estimated in INLA. The iid random effects $\epsilon(s_i)$ follow a normal distribution with a mean of zero and an unknown variance $\sigma_\epsilon^2$, which is estimated along with the other parameters mentioned above.
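To make the covariance structure concrete, the sketch below builds an exponential covariance matrix from cluster coordinates and shows the inverse of the logit link. It is written in Python for illustration and assumes the common parameterisation in which the covariance between two locations is the spatial variance multiplied by exp(-decay × distance). The coordinates and parameter values are hypothetical. INLA's SPDE approach approximates the spatial field on a mesh, so this is not what the modelling script evaluates directly.

```python
import math


def exponential_covariance(coords, sigma2, phi):
    """Covariance matrix with entries sigma2 * exp(-phi * d_ij).

    `coords` is a list of (x, y) cluster locations, `sigma2` the spatial
    variance and `phi` the spatial decay parameter; d_ij is the Euclidean
    distance between locations i and j.
    """
    n = len(coords)
    cov = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d = math.dist(coords[i], coords[j])
            cov[i][j] = sigma2 * math.exp(-phi * d)
    return cov


def inverse_logit(eta):
    """Map a linear predictor value to a proportion in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical cluster coordinates and parameter values.
coords = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
cov = exponential_covariance(coords, sigma2=2.0, phi=0.1)
for row in cov:
    print([round(v, 4) for v in row])
```

The diagonal equals the spatial variance, and the covariance decays towards zero as the distance between clusters grows, at a rate set by the decay parameter.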
Additional components must be constructed before fitting the model in INLA. First, a mesh of the study domain is constructed from the shapefile and the cluster coordinates in the target indicator file. Using this mesh object, a stochastic partial differential equation (SPDE) object is defined with functions in INLA, where the priors of the spatial decay parameter and the spatial variance parameter are specified. With the mesh object, the INLA "A" matrices are created and stacked with the INLA stack functions. Finally, these components, along with the model, are passed to the INLA function for fitting.
The prediction R script loads the INLA object (saved from the modelling R script) and generates posterior samples from it. It then reads in the raster files corresponding to the geospatial covariates of the model for the target indicator and compiles them into a prediction data frame. Finally, the predicted values are computed from the prediction data frame, the INLA mesh objects and the INLA posterior sample objects, and are assigned to the cells of the raster file, producing the high-resolution (5 x 5 km) prediction and uncertainty gridded datasets (surfaces) as tif files.
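The last step, turning posterior samples into a prediction value and an uncertainty value for one grid cell, can be sketched as follows. This is a Python illustration under stated assumptions: the samples are draws of the linear predictor on the logit scale for a single cell, the prediction is the posterior mean and the uncertainty is the posterior standard deviation. The text above does not say which uncertainty summary the prediction script writes out, and the function name and sample values are hypothetical.

```python
import math
import statistics


def summarise_cell(linear_predictor_samples):
    """Summarise posterior samples for one grid cell.

    Each sample is a draw of the linear predictor (logit scale). Returns the
    posterior mean and standard deviation of the back-transformed proportion.
    """
    probs = [1.0 / (1.0 + math.exp(-eta)) for eta in linear_predictor_samples]
    return statistics.mean(probs), statistics.stdev(probs)


# Hypothetical posterior draws for a single cell.
demo_samples = [-1.0, 0.0, 1.0]
mean_p, sd_p = summarise_cell(demo_samples)
print(round(mean_p, 4), round(sd_p, 4))
```

Applying the same summary to every cell gives one surface of means and one surface of uncertainty values, which correspond to the two tif outputs.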
The validation R script assesses the performance of the model constructed for the target indicator in the modelling R script using k-fold cross-validation and computes evaluation metrics. k-fold cross-validation works by first partitioning the dataset into k parts, then training the model on k-1 parts and testing the trained model on the remaining part, so that each part is used once for testing. The model is the Bayesian point-referenced spatial generalised linear model constructed in the modelling R script (i.e., with the same (sub)set of geospatial covariates) for the target indicator. For each fold, the following evaluation metrics are calculated:
the Pearson's correlation coefficient, the root mean squared error, the mean absolute error, the percentage bias, and the coverage rate. In these evaluation metrics, $y$ denotes the observed values, i.e., the proportions of the target indicator partitioned for testing, and $\hat{y}$ denotes the predicted mean values from the Bayesian point-referenced spatial binomial generalised linear model.
The notation $r$ denotes the Pearson's correlation coefficient, which is calculated from the covariance of the observed and predicted values and the standard deviations of the observed and predicted values:

$$r = \frac{\mathrm{cov}(y, \hat{y})}{\sigma_{y}\,\sigma_{\hat{y}}}$$

Here, note that the vectors $y = (y_1, \dots, y_m)$ and $\hat{y} = (\hat{y}_1, \dots, \hat{y}_m)$, where $m$ is the number of observations partitioned for testing. Better predictive performance is reflected by a greater Pearson's correlation coefficient. The root mean squared error (RMSE), mean absolute error (MAE) and percentage bias have straightforward calculations that do not require additional explanation. Better predictive performance is reflected by smaller RMSE, MAE and percentage bias values. The coverage rate ranges from 0 to 100 and is calculated as

$$\text{coverage rate} = \frac{100}{m} \sum_{i=1}^{m} I_i$$

where the indicator $I_i$ is defined as follows:

$$I_i = \begin{cases} 1 & \text{if } \hat{y}_i^{(0.025)} \le y_i \le \hat{y}_i^{(0.975)} \\ 0 & \text{otherwise} \end{cases}$$

where $\hat{y}_i^{(0.025)}$ and $\hat{y}_i^{(0.975)}$ represent the 0.025 quantile and the 0.975 quantile of the ith predicted value. Put simply, for $i = 1, \dots, m$, $I_i = 1$ if the observed value lies between the 0.025 quantile and the 0.975 quantile of the predicted value, and $I_i = 0$ otherwise. Better predictive performance is reflected by a higher coverage rate.
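The five evaluation metrics can be sketched for a single fold as below, in Python for illustration. The Pearson correlation and the coverage rate follow the descriptions above. The RMSE and MAE use their standard definitions. The percentage bias is taken here as 100 × sum(predicted − observed) / sum(observed), which is one common definition and an assumption of this sketch. All data values and names are hypothetical.

```python
import math


def evaluation_metrics(obs, pred, lower, upper):
    """Compute fold-level metrics from observed and predicted proportions.

    `lower` and `upper` hold the 0.025 and 0.975 quantiles of each prediction.
    """
    m = len(obs)
    mean_o, mean_p = sum(obs) / m, sum(pred) / m
    sxy = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred))
    sxx = sum((o - mean_o) ** 2 for o in obs)
    syy = sum((p - mean_p) ** 2 for p in pred)
    errors = [p - o for o, p in zip(obs, pred)]
    inside = sum(1 for o, lo, hi in zip(obs, lower, upper) if lo <= o <= hi)
    return {
        "pearson": sxy / math.sqrt(sxx * syy),
        "rmse": math.sqrt(sum(e ** 2 for e in errors) / m),
        "mae": sum(abs(e) for e in errors) / m,
        # Assumed definition of percentage bias (see lead-in).
        "pbias": 100.0 * sum(errors) / sum(obs),
        "coverage": 100.0 * inside / m,
    }


# Hypothetical test-fold values.
obs = [0.2, 0.4, 0.6, 0.8]
pred = [0.25, 0.35, 0.65, 0.75]
lower = [0.10, 0.41, 0.50, 0.60]
upper = [0.30, 0.50, 0.70, 0.90]
metrics = evaluation_metrics(obs, pred, lower, upper)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this toy fold the second observation (0.4) falls below its lower quantile (0.41), so three of the four observations are covered.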
The validation R script returns csv files with the evaluation metrics calculated for each fold for the model of the target indicator being validated.
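The k-fold partitioning used by the validation script can be sketched as below, in Python for illustration. The sketch assumes a simple random partition of the cluster indices with a fixed seed. The description above does not say how the folds are formed, so the shuffling, the seed and the values of n and k are assumptions.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for each of k folds over n observations."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # random partition is an assumption
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


# Hypothetical example: 10 clusters split into 5 folds.
splits = list(k_fold_indices(n=10, k=5))
for fold_number, (train, test) in enumerate(splits, start=1):
    print(fold_number, sorted(test))
```

Each fold's test indices would be held out, the model refitted on the training indices, and the evaluation metrics computed on the held-out predictions, giving one row of summary statistics per fold.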
This work was funded by the Children's Investment Fund Foundation (CIFF) (R-2009-05106). The authors acknowledge the support of the PMO Team at WorldPop and would like to thank the EME and India Programme Team at CIFF for their input and continuous support, and all staff at CIFF who provided feedback at each stage of this work. Moreover, the authors would like to thank the DHS Program staff for their input on the construction of some of the indicators. This work was approved by the ethics and research governance committee at the University of Southampton (ERGO 64920).
Chan, H.M.T., Dreoni, I., Tejedor-Garavito, N., Kerr, D., Bonnie, A., Tatem, A.J. and Pezzulo, C. 2022. health_dev: Subnational reproductive, maternal, newborn, child and adolescent health and development atlas for India, version 1.1. WorldPop, University of Southampton.
1. International Institute for Population Sciences (IIPS) and ICF [Producers]. 2017. National Family Health Survey NFHS-4, 2015-16: India [Datasets IABR74DT.dta; IACR74DT.dta; IAHR74DT.dta; IAIR74DT.dta; IAKR74DT.dta; IAMR74DT.dta; IAPR74DT.dta; IAGE71FL.shp]. Mumbai: IIPS. ICF [Distributor], 2017. www.dhsprogram.com
2. International Institute for Population Sciences (IIPS) and ICF. 2017. India National Family Health Survey NFHS-4 2015-16. Mumbai, India: IIPS and ICF. Available at http://dhsprogram.com/pubs/pdf/FR339/FR339.pdf
3. The DHS Program Code Share Project, Code Library. DHS Program GitHub site. https://github.com/DHSProgram. 2022.