Risk Factors Identification and Prediction of Long COVID

Overview

A machine learning model built in Python to answer the following questions:

What are clinical and sociodemographic risk factors for long COVID?
Can we predict long COVID in patients?

Final project for CPSC 66 Machine Learning (Fall 2021), with Emma Jin and Victoria Song.

A summary of our project follows. Our presentation slides and formal research paper are also included in this repository.

Problem Context

While COVID-19 in its acute form is widely discussed and researched, long COVID, in which some COVID symptoms persist even after COVID tests return negative, is poorly understood. Reported persisting symptoms include loss of smell or taste, shortness of breath, heart problems, anxiety, headaches, and fatigue. As of July 2021, long COVID can qualify as a disability under the Americans with Disabilities Act (ADA).

Goals

Use machine learning to:

Identify long COVID risk factors
Predict long COVID in patients

Methods and Results

Datasets: COVID-19 Fall 2020 & Winter 2021 Community Supplement, MCBS (Medicare Current Beneficiary Survey)

20,000+ patients
390 columns

Data Pre-processing

Dimensionality reduction: Removed obviously irrelevant features such as 'user_id' and week of interview (feature selection) and consolidated functionally similar features (feature extraction), reducing the complexity of the data
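The two reduction steps can be sketched with pandas. This is a minimal illustration on a toy table, not the project's actual code: apart from 'user_id', every column name below is a hypothetical stand-in, and summing related columns is only one possible way to consolidate similar features.

```python
import pandas as pd

# Toy stand-in for the survey data. Apart from 'user_id', the column
# names are hypothetical and chosen only for illustration.
df = pd.DataFrame({
    "user_id": [101, 102, 103],
    "interview_week": [1, 2, 3],
    "symptom_cough": [1, 0, 1],
    "symptom_fever": [0, 0, 1],
    "age_group": [2, 3, 1],
})

# Feature selection: drop identifiers and administrative fields
# that carry no predictive signal.
df = df.drop(columns=["user_id", "interview_week"])

# Feature extraction: merge functionally similar columns into a
# single summary column (here, a simple count), then drop the originals.
symptom_cols = ["symptom_cough", "symptom_fever"]
df["symptom_count"] = df[symptom_cols].sum(axis=1)
df = df.drop(columns=symptom_cols)

print(df)
```

Five columns shrink to two here; the same pattern scales to a wide survey table.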

Resampling: Upsampled long COVID–negative patients in training set due to low representation in original dataset, enhancing quality of training data
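One common way to upsample an under-represented class is `sklearn.utils.resample`, which draws rows with replacement. The sketch below assumes a label column named 'long_covid' (a hypothetical name) and balances the classes exactly; the ratio used in the project is not stated here.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical training frame; 'long_covid' is an assumed label name.
train = pd.DataFrame({
    "severity": [3, 2, 3, 1, 2, 1],
    "long_covid": [1, 1, 1, 1, 0, 0],
})

# Identify whichever class is under-represented.
counts = train["long_covid"].value_counts()
minority_label = counts.idxmin()
majority = train[train["long_covid"] != minority_label]
minority = train[train["long_covid"] == minority_label]

# Draw minority rows with replacement until both classes are the same size.
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)

# Recombine and shuffle so the classes are interleaved.
balanced = pd.concat([majority, minority_up]).sample(frac=1, random_state=0)
print(balanced["long_covid"].value_counts())
```

Resampling only the training split keeps duplicated rows out of the test set, so evaluation still reflects the original class balance.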

Building the ML model

Using scikit-learn, implemented several machine learning models, including:

Logistic regression
Decision tree
Random forest
SVM
Naive Bayes
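A rough sketch of training these five model families side by side in scikit-learn follows. It runs on synthetic data from `make_classification`, and the specific estimator choices (for example `GaussianNB` for Naive Bayes and `SVC` with default settings) are assumptions, not necessarily what the notebook uses.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed survey features.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# One estimator per model family; hyperparameters are illustrative.
models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
}

# Fit each model and record its held-out accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: {scores[name]:.3f}")
```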

Used validation curves and confusion matrices to tune the models' hyperparameters, which improved accuracy for logistic regression, SVM, and Naive Bayes.
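A validation curve scores a model across a range of values for one hyperparameter, using cross-validation. The sketch below tunes the regularization strength `C` of a logistic regression on synthetic data; the parameter, its range, and the fold count are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import validation_curve

# Synthetic stand-in for the preprocessed training data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate values for the inverse regularization strength C.
param_range = np.logspace(-3, 2, 6)

# Each row of the returned arrays holds the per-fold scores for one C value.
train_scores, val_scores = validation_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    param_name="C",
    param_range=param_range,
    cv=5,
    scoring="accuracy",
)

# Pick the value with the best mean cross-validated accuracy.
mean_val = val_scores.mean(axis=1)
best_C = param_range[mean_val.argmax()]
print(f"best C: {best_C}")
```

Plotting the mean training and validation scores against `param_range` shows where the model starts to underfit or overfit.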

Ultimately, our random forest model achieved 72.3% accuracy and 81.5% precision for predicting long COVID in patients.
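For reference, accuracy and precision both derive from the confusion matrix. The small example below uses made-up labels (not project data) to show the relationship: accuracy is the share of all predictions that are correct, and precision is the share of predicted positives that are truly positive.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

# Made-up labels for illustration only: 1 = long COVID, 0 = no long COVID.
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1]

# For binary labels, ravel() yields counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / total
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f}")
```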

Analyzing Risk Factors

To find the most significant risk factors, we compared the potential risk factors by their Gini feature importance. Among the sociodemographic and clinical factors in our analysis, the top risk factor for long COVID was severity of COVID in its acute stage.
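In scikit-learn, Gini (mean decrease in impurity) importances are exposed on a fitted forest as `feature_importances_`. The sketch below ranks features on synthetic data with placeholder names, so the resulting order says nothing about the real risk factors.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder feature names; the data is synthetic, not the survey data.
feature_names = ["factor_a", "factor_b", "factor_c", "factor_d"]
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances are normalized to sum to 1 across features.
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)

print(importances)
```

Impurity-based importances can favor features with many distinct values, so they are best read as a relative ranking.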

Installation and Usage

Clone this repository.

git clone https://github.com/elliot-d-kim/long-covid-prediction.git

Navigate to the project directory.

cd long-covid-prediction

Install Python 3.8.
Install other dependencies.

pip install -r requirements.txt

Navigate to the code directory.

cd code

Launch the Jupyter notebook server, which opens in the browser.

jupyter notebook

Run all cells up to, but not including, the Logistic Regression section.
Build whichever models are of interest by running their respective sections.

Technologies Used

Python 3.8
scikit-learn
Pandas
NumPy
JupyterLab

Contributors

Meiqing Emma Jin
Xinwei Victoria Song
Elliot Kim

Acknowledgements

This project was the final project for Professor Ben Mitchell's CPSC 66 Machine Learning course in Fall 2021. Thanks to Professor Mitchell for his guidance.
