small ML model to predict whether a customer will stop doing business with the company based on their demographic, account, and service usage information

INAHIDC/ML-churn_project


CS4089 Project

Churn Prediction project

Professor: Dr. Schwartzman

![Churn Distribution](images/churn_distribution.png)


Contents

  1. Introduction
  2. Dataset Overview
  3. Project Structure
  4. Installation
  5. Data Analysis (EDA)
  6. Preprocessing
  7. Feature Engineering
  8. Modeling
  9. Evaluation
  10. Conclusion
  11. Acknowledgements

Introduction

Customer churn is a persistent problem in the telecommunications industry. Understanding why customers leave and predicting who is likely to churn can help companies develop strategies to retain them. This project builds a machine learning model that predicts customer churn from the customer data in data/telco_churn.csv.


Dataset Overview

The dataset contains information about the customers (CX) of a fictional telecommunications company. It includes customer demographic information, account details, and the services each customer has subscribed to.

  • Features include:

    • CustomerID
    • Demographic: Gender, Senior Citizen status, Partner, and Dependents.
    • Account Information: Tenure, Contract type, Payment method, Monthly charges, and Total charges.
    • Services: Phone service, Multiple lines, Internet service, Online security, Online backup, Device protection, Tech support, Streaming TV, and Streaming movies.
  • Target Variable:

    • Churn: whether the customer churned (Yes) or not (No).

Project Structure

ML-churn/
├── data/
│   └── telco_churn.csv
├── images/
│   ├── churn_distribution.png
│   ├── numerical_distributions.png
│   └── correlation_matrix.png
├── src/
│   ├── __init__.py
│   ├── data_loading.py
│   ├── eda.py
│   ├── preprocessing.py
│   ├── feature_engineering.py
│   ├── modeling.py
│   ├── evaluation.py
│   └── visualization.py
├── main.py
├── requirements.txt
└── README.md
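  • data/: the dataset (telco_churn.csv)
  • images/: the EDA plots shown in this README
  • src/: Python modules, one per pipeline stage
  • main.py: entry point that runs the project
  • requirements.txt: packages needed to run the project
  • README.md: this file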

Installation

  1. Clone the repo

    git clone https://github.com/INAHIDC/ML-churn.git
    cd ML-churn
  2. Install the dependencies (skip this step if you already have them). Use a virtual environment (venv) for this.

    pip install -r requirements.txt
  3. Run it

    python main.py

Data Analysis (EDA)

Understanding the data is important before moving on to modeling.

1. Churn Distribution

We start by examining the distribution of the target variable.

![Churn Distribution](images/churn_distribution.png)

  • Observation: The dataset is imbalanced, with a higher number of customers who did not churn.
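
A quick way to check for this kind of imbalance is to look at the share of each class. The snippet below is a minimal sketch on made-up labels, not the real dataset; it assumes the target column is named Churn with Yes/No values, as described in the dataset section.

```python
import pandas as pd

# Made-up labels for illustration only; the real values come from the CSV.
churn = pd.Series(["No", "No", "No", "Yes", "No", "Yes", "No", "No"], name="Churn")

# Fraction of rows in each class; a large gap signals class imbalance.
class_share = churn.value_counts(normalize=True)
print(class_share)
```

When the classes are this uneven, accuracy alone can look good even if few churned customers are caught, which is why precision and recall are also reported in the evaluation.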

2. Numerical Features Distribution

We plotted histograms for numerical features like tenure, MonthlyCharges, and TotalCharges.

![Numerical Distributions](images/numerical_distributions.png)

  • Observation:
    • tenure: many customers are either new or have been with the company for a long time.
    • MonthlyCharges: values are spread over a wide range.
    • TotalCharges: follows tenure, since it reflects the cumulative amount charged.

3. Correlation Matrix

We analyzed the correlations between the numerical variables.

![Correlation Matrix](images/correlation_matrix.png)

  • Observation: TotalCharges and tenure are positively correlated, which makes sense because a longer tenure typically results in higher total charges.
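
The relationship between tenure and TotalCharges can be illustrated with a small correlation matrix. This is only a sketch on invented numbers, where TotalCharges is built as tenure times MonthlyCharges; that construction is an assumption for the example, not how the real column was produced.

```python
import pandas as pd

# Invented values for illustration; not taken from telco_churn.csv.
df = pd.DataFrame({
    "tenure": [1, 12, 24, 72],
    "MonthlyCharges": [30.0, 60.0, 70.0, 100.0],
})
# Assumption for this example: total charges accumulate monthly charges over tenure.
df["TotalCharges"] = df["tenure"] * df["MonthlyCharges"]

# Pearson correlation between every pair of numerical columns.
corr = df.corr()
print(corr.round(2))
```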

Data Preprocessing

Preprocessing includes handling missing values, encoding categorical variables, and scaling numerical features.

1. Handling Missing Values

  • Replaced empty strings with NaN.
  • Dropped rows with missing values.
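
The two steps above can be sketched with pandas roughly as follows. The tiny frame is invented for illustration, and the regex also treats whitespace-only cells as empty, which is an assumption about how blanks appear in the CSV; the actual code in src/preprocessing.py may differ.

```python
import numpy as np
import pandas as pd

# Invented rows; the second one has a blank TotalCharges value.
df = pd.DataFrame({
    "tenure": [1, 0, 24],
    "TotalCharges": ["29.85", " ", "1889.5"],
    "Churn": ["No", "No", "Yes"],
})

# Replace empty (or whitespace-only) strings with NaN.
df = df.replace(r"^\s*$", np.nan, regex=True)

# Drop rows with missing values, then parse the text column as numbers.
df = df.dropna().reset_index(drop=True)
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"])
```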

2. Encoding Categorical Variables

  • Label encoding: applied to binary categorical variables.
  • One-hot encoding: applied to categorical variables with more than two categories.
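
As a rough illustration of the two encodings, here is a minimal pandas sketch. The column names Partner and Contract and their category values are assumptions chosen for the example; the project's own code may use different tools, such as scikit-learn encoders.

```python
import pandas as pd

# Invented rows with one binary column and one multi-category column.
df = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes"],
    "Contract": ["Month-to-month", "One year", "Two year"],
})

# Label encoding: map the two categories of a binary column to 0 and 1.
df["Partner"] = df["Partner"].map({"No": 0, "Yes": 1})

# One-hot encoding: one indicator column per category.
df = pd.get_dummies(df, columns=["Contract"], dtype=int)
print(df.columns.tolist())
```

One-hot encoding avoids implying an order between categories, which a single integer code would do.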

3. Feature Scaling

  • Standardization: applied to the numerical features so that they share a common scale.
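
Standardization rescales each feature to zero mean and unit variance. The sketch below does this by hand with pandas on invented values; it uses the population standard deviation (ddof=0), the same convention as scikit-learn's StandardScaler, though whether the project uses that class is an assumption.

```python
import pandas as pd

# Invented values for two numerical features.
df = pd.DataFrame({
    "tenure": [1.0, 12.0, 24.0, 72.0],
    "MonthlyCharges": [29.85, 56.95, 70.70, 105.65],
})
num_cols = ["tenure", "MonthlyCharges"]

# z-score: subtract the column mean, divide by the column standard deviation.
mean = df[num_cols].mean()
std = df[num_cols].std(ddof=0)
scaled = (df[num_cols] - mean) / std
```

In a real pipeline the mean and standard deviation would normally be computed on the training split only and then reused on the test split.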

Feature Engineering

  • TotalServices: the number of services a customer has subscribed to.
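
One way such a feature could be computed is by counting the service columns whose value is Yes. This is a sketch under assumptions: the column names and the "No internet service" value are illustrative, and only three of the service columns are shown.

```python
import pandas as pd

# Hypothetical subset of service columns, with invented rows.
service_cols = ["PhoneService", "OnlineSecurity", "StreamingTV"]
df = pd.DataFrame({
    "PhoneService": ["Yes", "Yes", "No"],
    "OnlineSecurity": ["No", "Yes", "No internet service"],
    "StreamingTV": ["Yes", "Yes", "No internet service"],
})

# Count, per customer, how many service columns are exactly "Yes".
df["TotalServices"] = (df[service_cols] == "Yes").sum(axis=1)
print(df["TotalServices"].tolist())
```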

Modeling

We trained two models to predict customer churn.

1. Logistic Regression

A simple model used to establish a baseline.

  • Pros: easy to interpret.
  • Cons: cannot capture complex, nonlinear relationships.

2. Random Forest Classifier

An ensemble model to improve performance.

  • Pros: captures nonlinear relationships.
  • Cons: harder to interpret and less intuitive.
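
The two-model setup can be sketched with scikit-learn as below. Everything here is an assumption for illustration: the data is synthetic, and the split ratio, random seeds, and hyperparameters are not taken from src/modeling.py.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic features and a binary label with a nonlinear component.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
noise = rng.normal(scale=0.5, size=400)
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 + noise > 0.5).astype(int)

# Stratified split keeps the class ratio similar in train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model and record its accuracy on the held-out test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
print(scores)
```

Training both models on the same split makes their scores directly comparable.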

Evaluation

We evaluated the models using classification reports, confusion matrices, and ROC curves.

1. Logistic Regression Results

  • Accuracy: moderate.
  • Precision and recall: measure the model's performance on predicting churned customers.
  • Confusion matrix and ROC curve: show the true positives, false positives, and related error rates.

2. Random Forest Results

  • Accuracy: improved compared to Logistic Regression.
  • Feature importance: helps identify which features contribute most to predictions.
  • Confusion matrix and ROC curve: showed better performance in distinguishing churned customers.
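
To make the metrics above concrete, here is a small worked example that builds the confusion-matrix counts by hand and derives accuracy, precision, and recall from them. The labels and predictions are invented and do not reflect either model's actual results; 1 stands for churned and 0 for not churned.

```python
# Invented ground truth and predictions (1 = churned, 0 = did not churn).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))

# The four cells of the confusion matrix.
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churners caught
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-churners cleared
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churners missed

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)  # of predicted churners, how many really churned
recall = tp / (tp + fn)     # of real churners, how many were predicted

print(tp, tn, fp, fn, accuracy, precision, recall)
```

In practice, scikit-learn's classification_report and confusion_matrix functions compute the same quantities from the true and predicted labels.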

Conclusion

  • Findings:

    • The Random Forest model outperformed Logistic Regression in predicting customer churn.
    • Features like contract type, tenure, and monthly charges are important predictors.
  • Implications:

    • Customer retention: focus on customers with month-to-month contracts and high monthly charges.
    • Service improvement: improve the services that are linked to higher churn rates.

source
