This repository is a modification of the data preprocessing code used in *Benchmarking machine learning models on multi-centre eICU critical care dataset* by Seyedmostafa Sheikhalishahi, Vevake Balaraman and Venet Osmani.
The original repository - eICU_Benchmark_updated
Dataset being preprocessed - eICU (paper by Tom J. Pollard et al.)
The eICU_Benchmark_updated code is modified to adhere to the exclusion criteria and features used in *Multitask learning and benchmarking with clinical time series data* by Hrayr Harutyunyan, Hrant Khachatrian, David C. Kale, Greg Ver Steeg, and Aram Galstyan, which performs clinical prediction tasks on the MIMIC-III dataset (MIMIC-III paper).
Hence, the term "eligible patients" refers to patients in the eICU dataset who adhere to the exclusion criteria in Harutyunyan et al.
There are only two scripts to run:
- `data_extraction_root.py` - extracts the timeseries data for each eligible patient (the data to be trained on)
- `create_tasks.py` - creates listfiles that store the ground truths for the clinical tasks (the labels)
- Ensure the eICU dataset CSVs are available on disk.
- Clone the repository.
- The following command will:
  - Creates a directory for each eligible patient stay
  - Writes patient demographics into `pats.csv`
  - Writes nurse charting info into `nc.csv`
  - Writes lab measurements into `lab.csv`
  - Writes nurse assessment info into `na.csv`
  - Merges these four CSV files into one `timeseries.csv` for each patient stay (see the sketch after the command below)
  - Creates a `root/` directory within the task directory
  - Truncates the data to each patient's ICU stay, renames files to a unique episode number, clips features, and discards patients with fewer than 15 records
  - Creates a `timeseries_info.csv` file which stores the information used to determine eligibility for each task
python data_extraction_root.py --eicu_dir "directory of csv files" --output_dir "directory to save the extracted data" --task_dir "directory to save modified data and task listfiles"
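The merge step above can be pictured with a short pandas sketch. This is illustrative only: the function name, the `itemoffset` time column, and the exact file layout are assumptions rather than the repository's actual implementation.

```python
import os
import pandas as pd

def merge_stay_tables(stay_dir):
    """Illustrative merge of the four per-stay CSVs into timeseries.csv.

    Assumes pats.csv, nc.csv, lab.csv, and na.csv already exist in stay_dir
    and share an 'itemoffset' time column; the real column names used by
    data_extraction_root.py may differ.
    """
    parts = []
    for name in ("pats.csv", "nc.csv", "lab.csv", "na.csv"):
        path = os.path.join(stay_dir, name)
        if os.path.exists(path):
            parts.append(pd.read_csv(path))
    if not parts:
        return None
    merged = pd.concat(parts, ignore_index=True, sort=False)
    # Order events chronologically before writing out the merged table.
    merged = merged.sort_values("itemoffset")
    merged.to_csv(os.path.join(stay_dir, "timeseries.csv"), index=False)
    return merged
```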
- The following command will:
  - Create train, test, and val listfiles for each specified task. Listfiles typically contain three components:
    1. `stay` - the patient timeseries file name
    2. `period_length` - the amount of time the model is allowed to look at (except for IHM, which is always 48 hours)
    3. `y_true` - the ground truth (except for pheno, which has multiple labels)
  - Create regional splits for each specified task
python create_labels.py --eicu_dir "directory of csv files" --task_dir "directory to save modified data and task listfiles" [--regional] [--all | --ihm | --los | --decomp | --pheno ] [--skip_labeling]
Use `--all` for all tasks, `--regional` to perform regional splits. If you want to only do a specific subset of tasks, use the corresponding flags. If the root directory has already been fully populated, use `--skip_labeling`.
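For reference, the listfile layout described above can be inspected with a few lines of pandas. The path and file name below are hypothetical; only the three-column layout (`stay`, `period_length`, `y_true`) comes from the description above.

```python
import pandas as pd

# Hypothetical path: substitute a listfile created inside your task_dir.
listfile = pd.read_csv("task_dir/decompensation/train_listfile.csv")

# Expected columns, per the description above: stay, period_length, y_true
# (phenotyping listfiles carry multiple label columns instead of a single y_true).
print(listfile.columns.tolist())
print(listfile.head())

# Example: keep only samples whose observation window is at least 24 hours,
# assuming period_length is recorded in hours.
long_enough = listfile[listfile["period_length"] >= 24]
print(len(long_enough), "of", len(listfile), "rows have at least 24 h of data")
```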
- For training, validation, and testing you need to have the data in the same folder as the task folder (train and validation data in `train/` and test data in `test/`). This can be accomplished by symbolically linking `train/` and `test/` folders to the `root/` directory.

For Linux, fill in the `ABSOLUTE_PATH_TO_SPLITS` variable in `symlink.sh`, then run:
chmod +x symlink.sh
./symlink.sh
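If you prefer not to use the shell script, the same links can be created with a short Python sketch. This is only a rough equivalent of what `symlink.sh` is described as doing; the placeholder path and directory names are assumptions.

```python
import os

# Placeholder: the same value you would put into symlink.sh.
ABSOLUTE_PATH_TO_SPLITS = "/absolute/path/to/task_dir"

root = os.path.join(ABSOLUTE_PATH_TO_SPLITS, "root")
for split in ("train", "test"):
    link = os.path.join(ABSOLUTE_PATH_TO_SPLITS, split)
    # Point train/ and test/ at the populated root/ directory.
    if not os.path.lexists(link):
        os.symlink(root, link)
```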