This repository contains a machine learning project focused on multi-modal modeling with text, categorical, and numerical data. The objective is to predict the "readmitted" status of patients from the features in the "8k_diabetes_v2.csv" dataset.
The project is divided into several parts, as outlined in the project description:
- Exploratory Data Analysis (EDA): In this part, we analyze the dataset to understand its characteristics, identify patterns, and gain insights into the data. EDA plays a crucial role in shaping our approach to modeling.
- Pre-processing Categorical Data: Categorical features require encoding and other transformations before they can be used in machine learning models. This section explains how we pre-process the categorical data and the justification for the chosen methods; a combined preprocessing sketch for the categorical and numerical features follows this list.
- Pre-processing Numerical Data: This section discusses handling missing values and other preprocessing steps for the numerical features. Different algorithms may require different treatment, and the reasons behind these choices are explained (see the preprocessing sketch after this list).
- Text Data Modeling with tf-idf: We build a model that makes predictions from the text data, specifically the "diag_desc_combined" field, using Term Frequency-Inverse Document Frequency (tf-idf) features; a baseline sketch appears after this list.
- Model Stacking: This section covers feeding the tf-idf predictions for the text field into downstream algorithms via model stacking, as sketched below.
- Experimentation with Multiple Algorithms: We explore several modeling algorithms, each chosen for specific reasons, and run experiments to compare their performance; a comparison loop is sketched below.
- Final Model Selection: The final choice of the model is discussed, along with an analysis of its strengths and weaknesses. We also address where the model may not perform well and potential areas for improvement.
To get started with this project, follow these steps:
- Clone this repository to your local machine.
- Download the dataset "8k_diabetes_v2.csv" from the provided source and place it in the project directory.
- Refer to the Jupyter notebooks or Python scripts in this repository to explore the code and models.
- Follow the instructions within each notebook or script to run the code and perform your own analyses.
Make sure you have the following Python libraries installed to run the code in this project:
- NumPy
- Pandas
- Matplotlib
- Seaborn
- Scikit-learn
- NLTK
- TensorFlow
- XGBoost (or other preferred machine learning libraries)
If you'd like to contribute to this project, feel free to fork the repository and submit pull requests. We welcome improvements, bug fixes, or additional features.