Implemented a Support Vector Machine model for diagnosing thyroid disease.
- Specifically, the SVM is used to distinguish between 4 different thyroid states:
- Hyperthyroid
- Hypothyroid
- Euthyroid-sick
- Euthyroid (negative)
- In the U.S. alone, an estimated 27 million Americans have Thyroid disease
- More than half of this population are undiagnosed or misdiagnosed
- Women are seven times more likely to contract thyroid problems than men and nearly half of all women and a quarter of all men in the US will die with evidence of an inflamed thyroid.
- The symptoms of this disease often vary from person to person and are non-specific, so a correct diagnosis can easily be missed or misdiagnosed for irrelevant issues.
- Finding an accurate solution to this problem is a must
- Provide an efficient solution for healthcare practitioners via Support Vector Machines for diagnosing/classifying a particular thyroid disease that a person may have.
- This tool will cause an immense decrease in misdiagnoses as it is capable of distinguishing between problems of the thyroid gland and other illnesses in the body.
- As well as providing the ability to detect the disease before it forms into a more destructive anomaly.
The thyroid is a gland that produces thyroid hormone. This hormone regulates vital body functions:
- Breathing
- Body weight
- Heart rate
- Muscle Strength
-
Euthyroid: A normal functioning thyroid gland
-
Euthyroid Sick: The serum levels of thyroid hormones are low in clinically euthyroid patients with nonthyroidal systemic illness
-
Hypothyroidism: The thyroid fails to make enough thyroid hormones and slows down many of your body's functions
-
Hyperthyroidism: The thyroid makes more thyroid hormone than the body needs, which speeds up many of your body's functions.
- Goal: gain a solid understanding of the data and make corrections before training model
- Imputation of missing values
- For Continuous values: substituted in with the mode of the column
- Mode is the most frequently occurring number Robust to outliers
- For Categorical values (Sex): wasn’t many missing, so deleted those rows
- Why? Because the sex of the person matters in respect to their diagnosis
- For Continuous values: substituted in with the mode of the column
- Balanced data: leaned heavily towards the negative samples
- Once found, able to train model
Data can be found here: https://archive.ics.uci.edu/ml/datasets/thyroid+disease
- A geometric approach to classification and regression tasks
- Main Objective: to find a hyperplane that distinctly classifies the data points.
- Hyperplanes are decision boundaries that help classify the data points.
- If a data point falls on either side of the hyperplane, it is attributed to different classes.
- There exist many possible hyperplanes, but the one that has the maximum margin, or distance, between data points of the classes is ideal.
- This maximum marginal distance provides confidence that future data points are assigned to the correct class
- SVMs are typically used for binary classification; classifying between two classes
- Strategy involves training a single classifier per class, with the samples of that class as positive samples and all other samples as negatives.
- Essentially, it decomposes a multiclass classification problem into a multiple binary classification problems
- Obtained an accuracy of about 69%
- For further details on the entire process, visit the notebook here
Python Version: 3.6.5
Pandas: Powerful data structures for data analysis, time series, and statistics
pip install pandas
Numpy: a general-purpose array-processing package designed to efficiently manipulate large multi-dimensional arrays
pip install numpy
Scikit-Learn: Simple and efficient tools for data mining and data analysis
pip install scikit-learn
Seaborn: A library for making statistical graphics in Python
pip install seaborn
Matplotlib: Python plotting package
pip install matplotlib