A machine learning model built in Python to answer the following questions:
- What are clinical and sociodemographic risk factors for long COVID?
- Can we predict long COVID in patients?
Final project for CPSC 66 Machine Learning (Fall 2021), with Emma Jin and Victoria Song.
A summary of our project follows. Our presentation slides and formal research paper are also included in this repository.
- Problem Context
- Goals
- Methods and Results
- Installation and Usage
- Technologies Used
- Contributors
- Acknowledgements
While COVID-19 in its acute form is widely discussed and researched, long COVID (when some COVID symptoms persist even after COVID tests return negative) is poorly understood. Reported persisting symptoms include loss of smell or taste, shortness of breath, heart problems, anxiety, headaches, and fatigue. As of July 2021, long COVID can qualify as a disability under the Americans with Disabilities Act (ADA).
Use machine learning to:
- Identify long COVID risk factors
- Predict long COVID in patients
Datasets: Medicare Current Beneficiary Survey (MCBS) COVID-19 Community Supplements, Fall 2020 and Winter 2021
- 20,000+ patients
- 390 columns
Dimensionality reduction: Removed clearly irrelevant features, such as 'user_id' and week of interview (feature selection), and consolidated functionally similar features (feature extraction) to reduce the complexity of the data.
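The two steps above can be sketched with pandas. This is a minimal illustration on a toy table, not the project's actual preprocessing code: apart from 'user_id', every column name here (`interview_week`, `symptom_cough`, `symptom_fever`, `age_group`, `symptom_count`) is a hypothetical stand-in, and summing related columns is only one possible way to consolidate them.

```python
import pandas as pd

# Toy frame standing in for the survey data. All column names except
# 'user_id' are hypothetical and chosen only for illustration.
df = pd.DataFrame({
    "user_id": [101, 102, 103, 104],
    "interview_week": [1, 2, 2, 3],
    "symptom_cough": [1, 0, 1, 0],
    "symptom_fever": [0, 0, 1, 1],
    "age_group": [2, 3, 1, 2],
})

# Feature selection: drop identifier and administrative columns.
reduced = df.drop(columns=["user_id", "interview_week"])

# Feature extraction: consolidate functionally similar columns into one.
# Summing is an assumption; other aggregations would work equally well.
symptom_cols = [c for c in reduced.columns if c.startswith("symptom_")]
reduced["symptom_count"] = reduced[symptom_cols].sum(axis=1)
reduced = reduced.drop(columns=symptom_cols)

print(reduced.columns.tolist())
```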
Resampling: Upsampled long COVID–negative patients in the training set, who were underrepresented in the original dataset, to balance the classes in the training data.
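One common way to upsample an underrepresented class is `sklearn.utils.resample` with replacement. The sketch below assumes a hypothetical binary label column `long_covid` and a hypothetical feature `severity`; it may differ from how the notebook does it. Only the training split is resampled, so the test set keeps its original class distribution.

```python
import pandas as pd
from sklearn.utils import resample

# Toy training set; 'severity' and 'long_covid' are hypothetical column names.
train = pd.DataFrame({
    "severity": [1, 2, 3, 1, 2, 3, 2, 3],
    "long_covid": [1, 1, 1, 1, 1, 1, 0, 0],
})

majority = train[train["long_covid"] == 1]
minority = train[train["long_covid"] == 0]  # underrepresented negatives

# Draw minority rows with replacement until both classes are the same size.
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)

# Recombine and shuffle so the classes are interleaved.
balanced = (
    pd.concat([majority, minority_up])
    .sample(frac=1, random_state=0)
    .reset_index(drop=True)
)

print(balanced["long_covid"].value_counts())
```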
Using scikit-learn, implemented several machine learning models, including:
- Logistic regression
- Decision tree
- Random forest
- SVM
- Naive Bayes
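As a rough sketch of how these five model families can be trained side by side in scikit-learn, the example below fits each on synthetic data from `make_classification`. The data, the default hyperparameters, and the choice of `GaussianNB` as the naive Bayes variant are all assumptions made for illustration; they are not taken from the project notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the survey features and long COVID labels.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),  # variant assumed; the text does not say which
}

# Fit every model on the same split and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```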
Used validation curves and confusion matrices to tune the models' hyperparameters, which improved accuracy for logistic regression, SVM, and naive Bayes.
Ultimately, our random forest model achieved 72.3% accuracy and 81.5% precision in predicting long COVID in patients.
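A validation curve scores a model across a range of values for one hyperparameter, and a confusion matrix shows which kinds of errors the tuned model makes. The sketch below applies both to logistic regression on synthetic data; the swept parameter (`C`), its range, and the 5-fold cross-validation are assumptions for illustration rather than the settings used in the project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split, validation_curve

# Synthetic stand-in data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Sweep the inverse regularization strength C over six values.
param_range = np.logspace(-3, 2, 6)
train_scores, val_scores = validation_curve(
    LogisticRegression(max_iter=1000),
    X_train,
    y_train,
    param_name="C",
    param_range=param_range,
    cv=5,
    scoring="accuracy",
)

# Pick the value with the best mean cross-validated accuracy.
mean_val = val_scores.mean(axis=1)
best_C = param_range[mean_val.argmax()]

# Refit with the chosen value and inspect the errors on held-out data.
model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
cm = confusion_matrix(y_test, model.predict(X_test))

print("best C:", best_C)
print(cm)  # rows are true classes, columns are predicted classes
```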
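For reference, accuracy is the share of all predictions that are correct, while precision is the share of predicted positives that are truly positive. The small example below uses made-up labels (not project data) to show how both follow from the confusion matrix counts.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

# Made-up labels for illustration only: 1 = long COVID, 0 = no long COVID.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

# For binary labels, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # correct / all predictions
precision = tp / (tp + fp)                  # true positives / predicted positives

print(accuracy, accuracy_score(y_true, y_pred))    # 0.75 0.75
print(precision, precision_score(y_true, y_pred))  # 0.75 0.75
```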
To find the most significant risk factors, we compared the potential risk factors by their Gini feature importance. Among the sociodemographic and clinical factors in our analysis, the top risk factor for long COVID was severity of COVID in its acute stage.
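In scikit-learn, a fitted random forest exposes Gini-based (mean decrease in impurity) importances through its `feature_importances_` attribute. The sketch below ranks features on synthetic data; the feature names are placeholders, and the forest settings are assumptions rather than the project's configuration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with placeholder feature names.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# criterion="gini" is the default; it is spelled out here for clarity.
forest = RandomForestClassifier(
    n_estimators=100, criterion="gini", random_state=0
).fit(X, y)

# Importances are normalized to sum to 1; sort to rank the features.
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)

print(importances)
```

Impurity-based importances can favor features with many distinct values, so a ranking like this is best read as a guide rather than a definitive ordering.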
- Clone this repository.
git clone https://github.com/elliot-d-kim/long-covid-prediction.git
- Navigate to the project directory.
cd long-covid-prediction
- Install Python 3.8.
- Install other dependencies.
pip install -r requirements.txt
- Navigate to the code directory.
cd code
- Launch the Jupyter notebook server, which opens in the browser.
jupyter notebook
- Run every cell up to, but not including, the Logistic Regression section.
- Build whichever models are of interest by running their respective sections.
- Python 3.8
- scikit-learn
- Pandas
- NumPy
- JupyterLab
- Meiqing Emma Jin
- Xinwei Victoria Song
- Elliot Kim
- This was the final project for Professor Ben Mitchell's CPSC 66 Machine Learning course in Fall 2021. Thanks to Professor Mitchell for his guidance.