This script uses several machine learning models to predict the severity of mammographic masses from features such as age, mass shape, margin, and density.
- pandas
- numpy
- scikit-learn (imported as sklearn; for preprocessing, model selection, and various classifiers)
- tensorflow (for Keras neural network model)
The dataset used in this script has the following columns:
- BI-RADS assessment
- Age
- Mass Shape
- Margin
- Density
- Severity
The data is loaded from a local path (/Users/amath/Downloads/MLCourse-2/mammographic_masses.data.txt).
- Unknown values (marked ?) are treated as NaN.
- Rows with any NaN values are removed from the dataset.
- Features are scaled using StandardScaler from scikit-learn.
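
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the script itself: the sample rows, the column names in `COLUMNS`, and the choice of the four feature columns are assumptions made for the example, and an in-memory string stands in for the dataset file.

```python
from io import StringIO

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical rows in a comma-separated layout; "?" marks an unknown value.
SAMPLE = """5,67,3,5,3,1
4,43,1,1,?,1
5,58,4,5,3,1
4,28,1,1,2,0
5,74,1,5,?,1
4,65,1,?,3,0
4,70,2,1,3,0
"""

# Assumed column names, following the column order listed in this README.
COLUMNS = ["BI-RADS", "age", "shape", "margin", "density", "severity"]

# In a real run, pass the dataset path instead of StringIO(SAMPLE).
df = pd.read_csv(StringIO(SAMPLE), na_values=["?"], names=COLUMNS)

# Drop every row that contains at least one NaN.
clean = df.dropna()

# Assumption: the four features are age, shape, margin, and density.
feature_names = ["age", "shape", "margin", "density"]
X = clean[feature_names].to_numpy(dtype=float)
y = clean["severity"].to_numpy(dtype=int)

# Standardize each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.shape, y.shape)
```

Passing `na_values=["?"]` lets pandas convert the placeholder to NaN while parsing, so no separate replacement step is needed before `dropna()`.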
- Decision Tree Classifier
- Random Forest Classifier
- Support Vector Machine (with various kernels: linear, rbf, sigmoid, poly)
- K-Nearest Neighbors (tested for k values ranging from 1 to 50)
- Multinomial Naive Bayes
- Logistic Regression
- Neural Network (using Keras)
  - First dense layer with 64 neurons, fed by the 4 input features
  - Dropout layer with 50% dropout rate
  - Hidden layer with 64 neurons and ReLU activation
  - Dropout layer with 50% dropout rate
  - Output layer with 1 neuron and sigmoid activation (binary classification)
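
The network described above might look roughly like this in Keras. It is a sketch under assumptions: the function name `build_model`, the ReLU activation on the first layer, the `adam` optimizer, and the `binary_crossentropy` loss are not stated in this README and are chosen here only as common defaults for binary classification.

```python
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Dense, Dropout


def build_model(n_features: int = 4) -> Sequential:
    """Build a small binary classifier matching the layer list above."""
    model = Sequential([
        Input(shape=(n_features,)),
        # Assumption: ReLU on the first layer (the README does not say).
        Dense(64, activation="relu"),
        Dropout(0.5),
        Dense(64, activation="relu"),
        Dropout(0.5),
        # A single sigmoid unit outputs a value between 0 and 1.
        Dense(1, activation="sigmoid"),
    ])
    # Assumption: loss and optimizer are typical choices, not confirmed ones.
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model


model = build_model()
model.summary()
```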
After training, the script prints each model's accuracy, measured with 10-fold cross-validation.
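
As an illustration of this evaluation step, the sketch below scores the scikit-learn classifiers with `cross_val_score` and `cv=10`. It assumes a synthetic dataset from `make_classification` as a stand-in for the real features, and it assumes `MinMaxScaler` for the Naive Bayes model, since `MultinomialNB` rejects negative inputs when fitting; the README itself only mentions `StandardScaler`. The dictionary names and the random forest size are also illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with 4 features; replace with the preprocessed dataset.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=10, random_state=0),
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression()),
    # Assumption: scale to [0, 1] (clipped) because MultinomialNB needs non-negative inputs.
    "naive_bayes": make_pipeline(MinMaxScaler(clip=True), MultinomialNB()),
}
for kernel in ("linear", "rbf", "sigmoid", "poly"):
    models[f"svm_{kernel}"] = make_pipeline(StandardScaler(), SVC(kernel=kernel))

# Mean accuracy over 10 folds for each model.
scores = {
    name: cross_val_score(model, X, y, cv=10).mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")

# K-Nearest Neighbors for k = 1..50, keeping the best-scoring k.
knn_scores = {
    k: cross_val_score(
        make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k)),
        X, y, cv=10,
    ).mean()
    for k in range(1, 51)
}
best_k = max(knn_scores, key=knn_scores.get)
print(f"best k: {best_k} ({knn_scores[best_k]:.3f})")
```

Wrapping each scaler in a pipeline means it is refit on the training folds only, which keeps information from the held-out fold out of the scaling step.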
- Ensure you have all the necessary libraries installed.
- Replace the dataset path with the correct path on your machine.
- Run the script. After execution, you'll see the accuracy results for each model.
- Hyperparameter tuning for improved model accuracy.
- Exploration of additional preprocessing steps, like feature engineering.
- Inclusion of visualizations to understand the significance of each feature.
- Saving the best-performing model for future predictions.
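
Two of the ideas above, hyperparameter tuning and saving the best model, could be combined along these lines. This is only a sketch: the synthetic data, the parameter grid, the choice of an SVM as the model to tune, and the file name `best_model.joblib` are all hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with 4 features; replace with the preprocessed dataset.
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# make_pipeline names the SVC step "svc", hence the "svc__" parameter prefix.
pipeline = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__C": [0.1, 1.0, 10.0], "svc__kernel": ["linear", "rbf"]}

# Exhaustive search over the grid, scored by 10-fold cross-validated accuracy.
search = GridSearchCV(pipeline, param_grid, cv=10, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))

# Persist the refit best pipeline (scaler included) and load it back.
model_path = Path(tempfile.mkdtemp()) / "best_model.joblib"
joblib.dump(search.best_estimator_, model_path)
restored = joblib.load(model_path)
same = bool((restored.predict(X) == search.best_estimator_.predict(X)).all())
print("restored model matches:", same)
```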