Code repository for: 'The effect of dataset size on neural network performance within systematic reviewing'
A repository of code accompanying a study into the effect of dataset size on neural network performance within systematic reviewing. The code in this repository can be used to reproduce the simulation study reported in the paper. In the simulation study, the systematic review process implemented in ASReview is applied to dataset samples of different sizes, using a neural network classifier. All results were generated using ASReview v0.17.
Running this simulation study requires Python 3.6+. After installing Python, ASReview can be installed using

```
pip install asreview
```
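Since the results were generated with ASReview v0.17, pinning that version is the most reliable way to reproduce them exactly:

```
pip install asreview==0.17
```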
Gensim is also required to run the simulation; it can be installed with

```
pip install --upgrade gensim
```
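Note that the neural network classifier in ASReview is built on TensorFlow. If TensorFlow is not pulled in automatically by your ASReview installation, it can be added with

```
pip install tensorflow
```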
Three different systematic review datasets were used to perform the simulation study:
- Nudging - Systematic review study performed by Nagtegaal et al. on nudging healthcare professionals towards evidence-based medicine: Dataset - Paper
- Software - Systematic review study performed by Hall et al. on software fault detection: Dataset - Paper
- Depression - Systematic review study performed by Brouwer et al. on depressive relapse: Dataset - Paper
Smaller datasets were sampled from the original datasets; the samples used in the simulation study are included in this repository. The full datasets are not included here, but can be obtained from the links above.
This section can be skipped if the dataset samples included in this repository are used. The original datasets should be placed in the data folder and named 'Brouwer_2019.csv', 'Hall_2012.csv', and 'Nagtegaal_2019.csv'. Then run the data_generation notebook in the scripts folder to generate the samples from the original datasets; a sketch of the sampling step is given below.
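For orientation only, the sampling step boils down to something like the following sketch. The sample size, the 'label_included' column name, and the stratified sampling are illustrative assumptions; the data_generation notebook is the authoritative implementation.

```python
# Hypothetical sketch of drawing a smaller sample from a full dataset;
# see the data_generation notebook for the actual procedure.
import pandas as pd

def sample_dataset(path, n, seed=42):
    """Draw a random sample of roughly n records, keeping the inclusion
    rate close to that of the full dataset (stratified by label)."""
    df = pd.read_csv(path)
    # 'label_included' is an assumed name for the inclusion label column.
    return (
        df.groupby("label_included", group_keys=False)
          .apply(lambda g: g.sample(frac=n / len(df), random_state=seed))
    )

# Example: draw a 500-record sample from the depression dataset.
sample = sample_dataset("data/Brouwer_2019.csv", n=500)
sample.to_csv("data/Brouwer_2019_500.csv", index=False)
```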
The commands needed to run the simulation are all included in the jobs.sh file; running this file performs the full simulation. Warning: running the full simulation can take multiple days. The simulation process can be safely interrupted with a keyboard interrupt (Ctrl+C) and can be resumed by running jobs.sh again.
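For illustration, each simulation run in jobs.sh corresponds to an invocation of ASReview's simulation mode on one dataset sample, roughly along these lines (the file names, state file path, and any additional options are assumptions; see jobs.sh for the actual commands):

```
asreview simulate data/Nagtegaal_2019_500.csv \
    -m nn-2-layer -e doc2vec \
    --state_file output/simulation/Nagtegaal_2019_500.h5
```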
The metrics used to evaluate the simulation outcome are written by jobs.sh to the tables folder (contained in the output folder). Plots for visual analysis can be generated by running the results notebook in the scripts folder.
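Assuming the notebook file is named results.ipynb (the exact name may differ), it can be opened with

```
jupyter notebook scripts/results.ipynb
```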
The scripts in this repository are MIT licensed.