presentation.qmd

---
title: "Stroke Prediction Based on Demographics and Medical History"
subtitle: "INFO 511 - Fall 2024 - Final Project"
author: "Danielle Stea, Erika Kirkpatrick, Kai Shuen Neo, Sahand Motameni, Rohit Kalakala"
title-slide-attributes:
  data-background-image: images/watercolour_sys02_img34_teacup-ocean.jpg
  data-background-size: stretch
  data-background-opacity: "0.7"
  data-slide-number: none
format:
  revealjs:
    theme:  ['data/customtheming.scss']
  
editor: visual
execute:
  echo: false
jupyter: python3
---

```{python}
#| label: load-packages
#| include: false

# Load packages here
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.metrics import r2_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
```

```{python}
#| label: setup
#| include: false
#| 
# Set up plot theme and figure resolution
sns.set_theme(style="whitegrid")
sns.set_context("notebook", font_scale=1.1)

import matplotlib.pyplot as plt
plt.rcParams['figure.dpi'] = 300
plt.rcParams['savefig.dpi'] = 300
plt.rcParams['figure.figsize'] = (6, 6 * 0.618)
```

```{python}
#| label: load-data
#| include: false
# Load data in Python
stroke = pd.read_csv("data/healthcare-dataset-stroke-data.csv")
```

## Introduction

-   Strokes can be a deadly medical condition, and even if the patient survives there can be life long consequences as a result of the stroke.

-   This is why we wanted to look into a dataset that may allow us to help predict the risk factors that cause a stroke.

## Dataset

-   The dataset used is from a study published in China in 2020.
-   It represents a case group of individuals who had a stroke, and a control group of those who did not.

## Data

-   Observations include age, gender, marital status, and employment status, as well as medical information such as BMI, glucose levels, smoking history, etc.

```{python}
#| label: loaddata
#| echo: false
import pandas as pd
stroke = pd.read_csv("data/healthcare-dataset-stroke-data.csv")
print(stroke.head())
```

## Data

-   Qualitative variables: gender, smoking status, employment status, marital status.
-   Quantitative variables: age, BMI, glucose index.

```{python}
#| label: loaddatainfo
#| echo: false
import pandas as pd
stroke = pd.read_csv("data/healthcare-dataset-stroke-data.csv")
print(stroke.info())
```

## Visualization for Stroke Distribution

-   Across the population, there is a higher proportion of people who do not suffer from stroke as compared to those who do suffer from stroke.

![](images/stroke.png){fig-align="center"}

## Visualization for Gender Distribution

-   As for gender analysis, there is a higher proportion of females as compared to males surveyed.

![](images/gender.png){fig-align="center"}

## Visualization for Residence Type Distribution

-   In terms of residence type, the distribution for rural vs urban are highly equivalent.

![](images/residence.png){fig-align="center"}

## Visualization for Marital Status Distribution

-   As for marital status, there is a higher proportion of people surveyed who are married, as compared to those who are not married.

![](images/marital_status.png){fig-align="center"}

## Visualization for Work Type Distribution

![](images/work_type.png){fig-align="center"}

## Visualization for Smoking Status Distribution

![](images/smoking.png){fig-align="center"}

# **Methods and Results**

## Exploratory Data Analysis

-   Missing values were dropped from the dataset.
-   Categorical variables were transformed into factors.

![](images/proportion_of_missing_values.png){fig-align="center"}

## Variable Correlation Heatmap

-   The correlation matrix of all variables, other the ID, is displayed on the heatmap.

![](images/variable_correlation_heatmap.png){fig-align="center"}

## **Machine Learning**

-   After preparing the dataset, we split it into training (80%) and testing (20%) sets.
-   We trained 5 different machine learning models on this dataset: Random Forest, Decision Tree, Support Vector Machine (SVM), Artificial Neural Network (ANN), Logistic Regression.
-   To evaluate the performance of these models, we employed 5-fold cross-validation.

## **Results**

![](images/model_performance.png){fig-align="center"}


## **Model Performance Evaluation**

-   If model simplicity, speed, and interpretability are essential, Logistic Regression is an excellent option.
-   However, if model robustness and predictive power are the priority, Random Forest would be the better choice.

# **Conclusion**

## Summary

-   There was not a particularly strong correlation between the stroke outcome and the covariates listed in the dataset.

-   The strongest correlation to a stroke outcome was with age at 0.23.

-   Other notable correlations to a stroke outcome were hypertension, heart disease and glucose level, all at 0.14.

## Conclusion

-   It appears that the strongest correlations to a stroke outcome occur with participants’ physical properties rather than their demographics or living status.
-   The model results show that individuals in this dataset could be classified by their stroke outcome based on the covariates age, gender, marital status, and employment status, BMI, glucose levels, and smoking history with a high degree of accuracy.
-   Since there did not appear to be a particularly strong correlation to any one covariate, it was likely a combination of these factors associated with a stroke outcome.

## Future work

-   To increase the population of people being surveyed as one of the limitations we encountered with this dataset is that there was actually a very small proportion of people who had a stroke compared to those that did not: 4.87% (N=249) of individuals had a stroke, while 95.1% (N=4861) did not.

-   This created an unequal distribution within both the training and testing dataset, which may have artificially boosted the classification accuracy.

## Future work

-   To increase the diversity of the population of people being surveyed as the context upon which the study was conducted was not able to return insights that we could consider to be generalizable across the general public internationally.

-   The dataset was from a study that was conducted in China, with the population studied mainly being the Chinese population. As such, observations gathered from the study would be considered more applicable towards the Asian population instead of the international population. 

## **References**

-   2020: Pathan, Muhammad Salman & Zhang, Jianbiao & John, Deepu & Nag, Avishek & Dev, Soumyabrata. (2020). Identifying Stroke Indicators Using Rough Sets. IEEE Access. 8. 10.1109/ACCESS.2020.3039439