- Introduction
- Dataset Overview
- Project Structure
- Installation
- Exploratory Data Analysis (EDA)
- Preprocessing
- Feature Engineering
- Modeling
- Evaluation
- Conclusion
- Acknowledgements
Customer churn is a significant issue in the telecommunications industry. Understanding why customers leave and predicting churn can help companies develop new strategies to retain customers. This project builds a machine learning model to predict customer churn from the customer data provided in the CSV file.
The dataset contains information about a fictional telecommunications company's customers (CX). It includes customer demographic information, account details, and the services they have subscribed to.
- Features Include:
- CustomerID
- Demographic: Gender, Senior Citizen status, Partner, and Dependents.
- Account Information: Tenure, Contract type, Payment method, Monthly charges, and Total charges.
- Services: Phone service, Multiple lines, Internet service, Online security, Online backup, Device protection, Tech support, Streaming TV, and Streaming movies.
- Target Variable:
- Churn: Whether the customer churned (Yes) or not (No).
ML-churn/
├── data/
│ └── telco_churn.csv
├── images/
│ ├── churn_distribution.png
│ ├── numerical_distributions.png
│ └── correlation_matrix.png
├── src/
│ ├── __init__.py
│ ├── data_loading.py
│ ├── eda.py
│ ├── preprocessing.py
│ ├── feature_engineering.py
│ ├── modeling.py
│ ├── evaluation.py
│ └── visualization.py
├── main.py
├── requirements.txt
└── README.md
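- data/: the dataset (telco_churn.csv)
- images/: saved plots (churn distribution, numerical distributions, correlation matrix)
- src/: source modules for data loading, EDA, preprocessing, feature engineering, modeling, evaluation, and visualization
- main.py: entry point that runs the main logic
- requirements.txt: packages needed to run the project
- README.md: project documentation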
- Clone the repo:
git clone https://github.com/INAHIDC/ML-churn.git
cd ML-churn
- Install the dependencies (skip this step if you already have them). Use a virtual environment (venv) for this:
pip install -r requirements.txt
- Run it:
python main.py
Understanding the data is important before moving on to modeling.
We start by examining the distribution of the target variable.
- Observation: The dataset is imbalanced, with a higher number of customers who did not churn.
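The imbalance can be checked numerically as well as visually. The snippet below is a minimal sketch, assuming the data is loaded into a pandas DataFrame with a `Churn` column holding `Yes`/`No`; the helper name `churn_distribution` and the four-row sample are made up for illustration and are not taken from the project code.

```python
import pandas as pd


def churn_distribution(df: pd.DataFrame, target: str = "Churn") -> pd.Series:
    """Return the share of each class in the target column (hypothetical helper)."""
    return df[target].value_counts(normalize=True)


# Made-up sample; in the project the data would come from data/telco_churn.csv.
sample = pd.DataFrame({"Churn": ["No", "No", "No", "Yes"]})
shares = churn_distribution(sample)
print(shares)
```

A strongly uneven split is a hint that accuracy alone may be misleading, which is why precision and recall are also worth reporting.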
We plotted histograms for numerical features like `tenure`, `MonthlyCharges`, and `TotalCharges`.
- Observation:
- Tenure: many customers are either new or have been with the company for a long time.
- MonthlyCharges: covers a wide range of monthly charges.
- TotalCharges: similar to tenure, since it reflects the cumulative amount charged.
We analyzed correlations between numerical variables.
- Observation: `TotalCharges` and `tenure` are positively correlated, which makes sense as longer tenure typically results in higher total charges.
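The correlation between tenure and total charges can be reproduced on a toy example. This is only a sketch: the five rows are invented, and `TotalCharges` is built as `tenure * MonthlyCharges` as a simplifying assumption, not as a statement about how the real column was computed.

```python
import pandas as pd

# Invented values; the real numbers come from data/telco_churn.csv.
sample = pd.DataFrame({
    "tenure": [1, 12, 24, 48, 72],
    "MonthlyCharges": [70.0, 55.5, 80.0, 60.0, 90.0],
})
# Simplifying assumption: total charges accumulate at the monthly rate.
sample["TotalCharges"] = sample["tenure"] * sample["MonthlyCharges"]

# Pearson correlation matrix of the numerical columns.
corr = sample[["tenure", "MonthlyCharges", "TotalCharges"]].corr()
print(corr.round(2))
```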
Preprocessing includes handling missing values, encoding categorical variables, and standardizing numerical features.
- Replaced empty strings with `NaN`.
- Dropped rows with missing values.
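The two missing-value steps above can be sketched with pandas as follows. The helper name `clean_missing` is hypothetical, and treating whitespace-only strings as empty is an assumption made for this sketch; the project's own `preprocessing.py` may handle this differently.

```python
import numpy as np
import pandas as pd


def clean_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Replace empty strings with NaN, then drop incomplete rows (hypothetical helper)."""
    # Assumption: whitespace-only strings also count as empty.
    cleaned = df.replace(r"^\s*$", np.nan, regex=True)
    return cleaned.dropna().reset_index(drop=True)


# Made-up sample with one blank TotalCharges value.
sample = pd.DataFrame({
    "tenure": [0, 5, 10],
    "TotalCharges": [" ", "250.5", "700.0"],
})
cleaned = clean_missing(sample)
# Once the blanks are gone, the column can be converted to numbers.
cleaned["TotalCharges"] = pd.to_numeric(cleaned["TotalCharges"])
print(cleaned)
```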
- Label encoding: applied to binary categorical variables.
- One-hot encoding: applied to categorical variables with more than two categories.
- Standardization: applied to numerical features to normalize the data.
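A minimal sketch of the encoding and scaling steps is shown below, assuming pandas and scikit-learn's `StandardScaler`. The function name `encode_and_scale`, the column groupings, and the sample rows are illustrative assumptions rather than the project's actual code.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def encode_and_scale(df, binary_cols, multi_cols, numeric_cols):
    """Label-encode, one-hot encode, and standardize (hypothetical helper)."""
    out = df.copy()
    for col in binary_cols:
        # Label encoding: map the categories to 0, 1 in sorted order.
        mapping = {cat: i for i, cat in enumerate(sorted(out[col].unique()))}
        out[col] = out[col].map(mapping)
    # One-hot encoding: one indicator column per category.
    out = pd.get_dummies(out, columns=multi_cols, dtype=int)
    # Standardization: zero mean and unit variance per numerical column.
    out[numeric_cols] = StandardScaler().fit_transform(out[numeric_cols].astype(float))
    return out


# Made-up sample rows.
sample = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes", "No"],
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"],
    "MonthlyCharges": [20.0, 40.0, 60.0, 80.0],
})
encoded = encode_and_scale(sample, ["Partner"], ["Contract"], ["MonthlyCharges"])
print(encoded)
```

For simplicity the scaler here is fitted on the whole sample; when a train/test split is used, fitting it on the training portion only avoids leaking test statistics into training.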
- TotalServices: the number of services a customer has subscribed to.
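One way the `TotalServices` feature could be computed is sketched below. The column names and the rule for what counts as "subscribed" (any value other than `No`, `No internet service`, or `No phone service`) are assumptions made for this example; the actual logic in `feature_engineering.py` may differ.

```python
import pandas as pd

# Assumed column names for the services listed in the dataset overview.
SERVICE_COLUMNS = [
    "PhoneService", "MultipleLines", "InternetService", "OnlineSecurity",
    "OnlineBackup", "DeviceProtection", "TechSupport", "StreamingTV",
    "StreamingMovies",
]
# Assumed values that mean "not subscribed".
NOT_SUBSCRIBED = ["No", "No internet service", "No phone service"]


def add_total_services(df: pd.DataFrame) -> pd.DataFrame:
    """Add a TotalServices column counting subscribed services (hypothetical helper)."""
    out = df.copy()
    present = [col for col in SERVICE_COLUMNS if col in out.columns]
    subscribed = out[present].apply(lambda col: ~col.isin(NOT_SUBSCRIBED))
    out["TotalServices"] = subscribed.sum(axis=1)
    return out


# Made-up sample covering three of the service columns.
sample = pd.DataFrame({
    "PhoneService": ["Yes", "No"],
    "InternetService": ["Fiber optic", "No"],
    "StreamingTV": ["Yes", "No internet service"],
})
with_total = add_total_services(sample)
print(with_total["TotalServices"].tolist())
```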
We trained two models to predict customer churn.
Logistic Regression: a simple model to establish a baseline.
- Pros: Easy to understand.
- Cons: does not capture complex, nonlinear relationships.
Random Forest: an ensemble model to improve performance.
- Pros: captures nonlinear relationships.
- Cons: harder to interpret and less intuitive.
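Training both models side by side might look like the sketch below. It assumes scikit-learn's `LogisticRegression` and `RandomForestClassifier`, and it uses synthetic data from `make_classification` in place of the real dataset, so the scores it prints say nothing about the project's actual results. The hyperparameters are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the preprocessed churn features.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    weights=[0.73, 0.27], random_state=42,
)
# Stratify so both splits keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on the held-out split
    print(f"{name}: accuracy = {scores[name]:.3f}")
```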
We evaluated the models using classification reports, confusion matrices, and ROC curves.
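Logistic Regression:
- Accuracy: Moderate.
- Precision and Recall: show how well the model predicts churned customers.
- Confusion Matrix and ROC Curve: show true positives, false positives, and the remaining outcome counts, along with the trade-off between true and false positive rates.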
Random Forest:
- Accuracy: Improved compared to Logistic Regression.
- Feature Importance: helps find which features contribute most to predictions.
- Confusion Matrix and ROC Curve: Showed better performance in distinguishing churned customers.
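The evaluation outputs listed above can be produced with scikit-learn's metrics module. The sketch below again uses synthetic data and a `RandomForestClassifier` with arbitrary settings, so the printed numbers are illustrative only; the class labels `No churn` and `Churn` are names chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed churn data.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    weights=[0.73, 0.27], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)               # hard class predictions
y_prob = model.predict_proba(X_test)[:, 1]   # probability of the positive class

# Precision, recall, and F1 per class.
print(classification_report(y_test, y_pred, target_names=["No churn", "Churn"]))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Area under the ROC curve, computed from probabilities rather than labels.
auc = roc_auc_score(y_test, y_prob)
print(f"ROC AUC = {auc:.3f}")

# Impurity-based importances, one value per feature, summing to 1.
importances = model.feature_importances_
print(importances.round(3))
```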
- What do I think?
- The Random Forest model outperformed Logistic Regression in predicting customer churn.
- Features like contract type, tenure, and monthly charges are important predictors.
- Implications:
- Customer Retention: Focus on customers with month-to-month contracts and high monthly charges.
- Service Improvement: Improve services that are linked to higher churn rates.
- Dataset Source: IBM Cognos Analytics Dataset.