This repository contains research work on parallelizing MI (multiple imputation) algorithms using R.
Run time is a concern for any MI algorithm because several factors make it expensive: the number of iterations in sequential regression, the identification of nearest neighbors, the number of variables and the variable selection scheme, and iterative estimation procedures such as Fisher scoring. Parallelization is a promising way to reduce computation time. This repository compares the requirements and implementations for different operating systems and describes the currently available solutions in terms of usability and run time.
The /src
directory contains the R code of the simulation studies conducted for this project. The /poster
directory contains the LaTeX code and a PDF file of the scientific poster, which was presented as part of the lecture Statistical Analysis of Incomplete Data by Dr. Florian Meinfelder and Paul Messer in the winter term 2021/2022 at Otto-Friedrich-University Bamberg.
The files /src/benchmark_core.R
and /src/benchmark_differentM.R
are the main studies of this project. Their results are shown on the poster. The file src/utils.R
contains helper functions, while src/parallel_functions.R
contains the wrapper functions for the parallel execution of MI algorithms. The mice
package is used for calculating imputations. Lastly, src/dataGenerator.R
contains the code that creates the simulated data used in this study. The helper scripts are sourced by the main benchmark scripts.
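To illustrate the idea behind such wrapper functions, here is a minimal sketch and not the code from src/parallel_functions.R. It assumes the mice package is installed on every worker; the function name `parallel_mice_sketch`, its arguments, and the use of the `nhanes` example data shipped with mice are all hypothetical choices for this example. Each worker runs one single-imputation `mice()` call with its own seed, and the results are merged with `ibind()`.

```r
library(parallel)
library(mice)

# Hypothetical wrapper (not from this repository): run m imputations as m
# independent single-imputation mice() calls on a PSOCK cluster, then merge.
parallel_mice_sketch <- function(data, m = 4, n_cores = 2, seed = 123) {
  cl <- makeCluster(n_cores)            # PSOCK cluster, available on all platforms
  on.exit(stopCluster(cl))              # always shut the workers down
  seeds <- seed + seq_len(m)            # assumption: one distinct seed per imputation
  imps <- parLapply(cl, seeds, function(s, data) {
    library(mice)                       # attach mice on the worker
    mice(data, m = 1, seed = s, printFlag = FALSE)
  }, data = data)
  Reduce(ibind, imps)                   # combine into a single mids object
}

imp <- parallel_mice_sketch(nhanes, m = 4, n_cores = 2)
```

The returned object can be used like any other `mids` object, for example with `complete(imp, 1)`. Using consecutive seeds is a simplification; `parallel::clusterSetRNGStream()` is one alternative for giving workers separate random number streams.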
The src/C++/
directory contains the C++ source code and the R code for a comparison implementation. Here, stochastic regression is implemented from scratch and used to perform multiple imputation. The results are documented in this folder.
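For readers unfamiliar with the technique, the following base R sketch shows what stochastic regression imputation does for a single incomplete variable. It is an illustration only and not the repository's C++ implementation; the function name `stoch_reg_impute` and the simulated data are assumptions made for this example. The sketch also keeps the regression parameters fixed across imputations, so it does not reflect parameter uncertainty the way a fully proper MI procedure would.

```r
# Illustrative stochastic regression imputation of y given x (base R only).
# Hypothetical helper, not taken from the repository.
stoch_reg_impute <- function(y, x) {
  miss <- is.na(y)
  obs <- data.frame(y = y[!miss], x = x[!miss])
  fit <- lm(y ~ x, data = obs)                    # fit on the observed cases
  sigma_hat <- summary(fit)$sigma                 # residual standard deviation
  pred <- predict(fit, newdata = data.frame(x = x[miss]))
  # Add random noise to the predictions instead of imputing the plain fit
  y[miss] <- pred + rnorm(sum(miss), mean = 0, sd = sigma_hat)
  y
}

# Simulated example data (assumed for illustration)
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
y[sample(n, 50)] <- NA

# Repeat the imputation m times to obtain m completed versions of y
m <- 5
completed <- replicate(m, stoch_reg_impute(y, x), simplify = FALSE)
```

Because each of the m imputations is independent of the others, this loop is a natural candidate for parallel execution.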
The src/examples
directory contains explorative code to get familiar with the topic.
We also provide the .RData
files with the benchmark results in case you don't want to run the whole benchmark on your own system. These files are located in the /data
folder.
In the /docs
folder we provide a guide to setting up Ubuntu Linux and R under WSL on a Windows machine. The FORK variants of the parallel algorithms require a Unix-like operating system. Everything else also runs on Windows.
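The operating system requirement comes from R's parallel package, where FORK clusters are only available on Unix-alikes while PSOCK clusters work everywhere. The sketch below, which is an assumption-laden illustration rather than code from this repository, picks the cluster type based on the platform; the variable names `cluster_type` and `big_offset` are made up for the example.

```r
library(parallel)

# FORK is only available on Unix-alikes; PSOCK also works on Windows.
cluster_type <- if (.Platform$OS.type == "unix") "FORK" else "PSOCK"
cl <- makeCluster(2, type = cluster_type)

big_offset <- 10  # hypothetical object that the workers need

# FORK workers inherit the parent's workspace when they are created.
# PSOCK workers start as fresh R sessions, so objects must be sent explicitly.
if (cluster_type == "PSOCK") {
  clusterExport(cl, "big_offset", envir = environment())
}

res <- parLapply(cl, 1:4, function(i) i + big_offset)
stopCluster(cl)
```

With PSOCK, packages likewise have to be loaded on each worker, which is part of the extra setup and overhead compared with FORK.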