Identification of fraud using ML - Enron dataset

Brief :

The goal of this project is to use ML algorithms from the scikit-learn Python library to identify Persons of Interest (POIs) in the given Enron dataset.

The dataset has 146 data points with 21 features extracted from emails. To pass this particular project, I need both precision and recall to be at least 0.3.
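
As a small illustration of the pass criterion, precision and recall can be computed with scikit-learn's metric functions. The labels below are hypothetical (ten made-up people, 1 = POI) and are not taken from the Enron data.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels for ten people: 1 = POI, 0 = non-POI.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

# Precision = TP / (TP + FP): of the people flagged as POIs, how many really are.
precision = precision_score(y_true, y_pred)
# Recall = TP / (TP + FN): of the real POIs, how many were flagged.
recall = recall_score(y_true, y_pred)

# The project threshold: both metrics must be at least 0.3.
meets_threshold = precision >= 0.3 and recall >= 0.3
print(precision, recall, meets_threshold)
```

Here two of the three flagged people are real POIs and two of the three real POIs are flagged, so both metrics come out at 2/3.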

Steps Involved :

Understanding and cleaning the dataset
Removing outliers
Optimizing features and feature selection
Feature engineering
Scaling the features
Selecting an algorithm
Parameter tuning
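
The scaling, feature selection and classification steps above can be chained in a scikit-learn Pipeline. The sketch below is only an illustration: the data is synthetic (generated with make_classification as a stand-in for the Enron features), and the choice of MinMaxScaler, k=5 and LogisticRegression is an assumption, not the project's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption): 146 samples, imbalanced classes.
X, y = make_classification(n_samples=146, n_features=20, n_informative=5,
                           weights=[0.88], random_state=42)

pipeline = Pipeline([
    ("scale", MinMaxScaler()),                    # scale features to [0, 1]
    ("select", SelectKBest(f_classif, k=5)),      # keep the 5 best-scoring features
    ("clf", LogisticRegression(max_iter=1000)),   # final classifier
])
pipeline.fit(X, y)

# Number of features that survived the selection step.
n_selected = int(pipeline.named_steps["select"].get_support().sum())
print(n_selected)
```

Keeping the steps in one Pipeline means the scaler and selector are fitted only on training data when the pipeline is cross-validated.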

Algorithms used :

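Since this is a binary classification problem (is a given person a Person of Interest or not?), I decided to use classifiers suited to it. The ones I tuned are listed under "Parameters tuned" below: Extra Trees, Logistic Regression and Linear SVM.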

For feature selection I tried both tree-based feature selection and SelectKBest. I ended up using SelectKBest, since it gave the best precision and accuracy for a given number of features.
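
A minimal sketch of the two selection approaches side by side, on synthetic stand-in data. The score function f_classif, the value k=5 and the use of SelectFromModel for the tree-based variant are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_classif

# Synthetic stand-in data (assumption), not the Enron features.
X, y = make_classification(n_samples=146, n_features=20, n_informative=5,
                           weights=[0.88], random_state=42)

# Univariate selection: keep the 5 features with the highest ANOVA F-score.
k_best = SelectKBest(score_func=f_classif, k=5).fit(X, y)
k_best_idx = k_best.get_support(indices=True)

# Tree-based selection: keep the 5 features with the highest importances.
# threshold=-np.inf makes max_features the only selection criterion.
forest = ExtraTreesClassifier(n_estimators=50, random_state=42)
tree_selector = SelectFromModel(forest, max_features=5, threshold=-np.inf).fit(X, y)
tree_idx = tree_selector.get_support(indices=True)

print("SelectKBest:", k_best_idx)
print("Tree-based: ", tree_idx)
```

The two methods need not agree: the F-score looks at each feature on its own, while tree importances also reflect interactions between features.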

Parameters tuned :

Extra Trees classifier :

n_estimators = [5, 10, 15]
criterion = ['gini', 'entropy']
min_samples_split = [2, 3, 4, 5, 10]
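
A sketch of how the Extra Trees values above could be searched. The README does not say which search tool was used, so GridSearchCV, the F1 scoring and the synthetic data are all assumptions here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

# Synthetic stand-in data (assumption): 146 samples, imbalanced classes.
X, y = make_classification(n_samples=146, n_features=10, n_informative=4,
                           weights=[0.88], random_state=42)

# The grid mirrors the values listed above.
param_grid = {
    "n_estimators": [5, 10, 15],
    "criterion": ["gini", "entropy"],
    "min_samples_split": [2, 3, 4, 5, 10],
}

# Stratified splits keep the POI ratio similar in every train/test pair.
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.3, random_state=42)
search = GridSearchCV(ExtraTreesClassifier(random_state=42), param_grid,
                      scoring="f1", cv=cv)
search.fit(X, y)
print(search.best_params_)
```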

Logistic Regression :

max_iter = [100, 200, 300, 500, 1000, 10000]
penalty = ['l1', 'l2'] (liblinear supports both l1 and l2; liblinear with dual=True supports only l2; newton-cg, sag and lbfgs support only l2)
solver = ['liblinear', 'newton-cg', 'sag', 'lbfgs']
fit_intercept = [True, False]
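
Because not every solver accepts every penalty, the Logistic Regression grid is easiest to express as a list of sub-grids, one per valid combination. This is a sketch under assumptions: GridSearchCV, the StandardScaler step, the F1 scoring and the synthetic data are not taken from the project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 146 samples, imbalanced classes.
X, y = make_classification(n_samples=146, n_features=10, n_informative=4,
                           weights=[0.88], random_state=42)

pipeline = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

# Values shared by every sub-grid.
shared = {
    "clf__max_iter": [100, 200, 300, 500, 1000, 10000],
    "clf__fit_intercept": [True, False],
}
param_grid = [
    # liblinear (primal) accepts both l1 and l2.
    {**shared, "clf__solver": ["liblinear"], "clf__penalty": ["l1", "l2"]},
    # liblinear with the dual formulation accepts only l2.
    {**shared, "clf__solver": ["liblinear"], "clf__penalty": ["l2"], "clf__dual": [True]},
    # newton-cg, sag and lbfgs accept only l2.
    {**shared, "clf__solver": ["newton-cg", "sag", "lbfgs"], "clf__penalty": ["l2"]},
]

cv = StratifiedShuffleSplit(n_splits=3, test_size=0.3, random_state=42)
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=cv)
search.fit(X, y)
print(search.best_params_)
```

Splitting the grid this way avoids fitting invalid combinations such as lbfgs with l1, which scikit-learn rejects with an error. Small max_iter values may trigger convergence warnings, which is expected when the iteration budget is part of the search.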

Linear-SVM :

loss = ['hinge', 'squared_hinge']
max_iter = [100, 200, 500, 1000, 10000]
tol (tolerance) = [1e-2, 1e-4, 1e-6, 1e-8, 1e-10]
multi_class = ['ovr', 'crammer_singer']
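
A sketch of the Linear SVM search using scikit-learn's LinearSVC, where the tolerance parameter is named tol. GridSearchCV, the scaling step, dual=True and the synthetic data are assumptions. dual=True is set explicitly because the hinge loss with an l2 penalty is only available in the dual formulation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption): 146 samples, imbalanced classes.
X, y = make_classification(n_samples=146, n_features=10, n_informative=4,
                           weights=[0.88], random_state=42)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LinearSVC(dual=True, random_state=42)),
])

# The grid mirrors the values listed above.
# With multi_class='crammer_singer', LinearSVC ignores the loss setting.
param_grid = {
    "clf__loss": ["hinge", "squared_hinge"],
    "clf__max_iter": [100, 200, 500, 1000, 10000],
    "clf__tol": [1e-2, 1e-4, 1e-6, 1e-8, 1e-10],
    "clf__multi_class": ["ovr", "crammer_singer"],
}

cv = StratifiedShuffleSplit(n_splits=3, test_size=0.3, random_state=42)
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=cv)
search.fit(X, y)
print(search.best_params_)
```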

All of the above algorithms and parameters were tried both with and without Principal Component Analysis (PCA).
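
One way to compare runs with and without PCA is to make the PCA step itself a searchable part of a Pipeline, using the special value "passthrough" to skip it. The sketch below assumes synthetic data, n_components=5 and a Logistic Regression classifier purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 146 samples, imbalanced classes.
X, y = make_classification(n_samples=146, n_features=10, n_informative=4,
                           weights=[0.88], random_state=42)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Two candidates: keep the PCA step, or replace it with a no-op.
param_grid = {"pca": [PCA(n_components=5), "passthrough"]}

cv = StratifiedShuffleSplit(n_splits=5, test_size=0.3, random_state=42)
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=cv)
search.fit(X, y)

for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    print(params["pca"], round(float(score), 3))
```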

For parameter tuning I used :

Validating metrics :

To validate the results, I tried several cross-validation methods. In the end I chose a stratified shuffle split (30% of the data held out for testing), since POIs and non-POIs are unevenly distributed and I wanted both classes properly represented in the training and testing sets. The others I tried were shuffle split, k-1 split and a simple train-test split.
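
A short sketch of a stratified shuffle split with a 30% test share, checking that each test set keeps roughly the overall POI ratio. The data is synthetic, and the classifier and number of splits are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_validate

# Synthetic stand-in data (assumption): 146 samples, imbalanced classes.
X, y = make_classification(n_samples=146, n_features=10, n_informative=4,
                           weights=[0.88], random_state=42)

cv = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)

# Share of positives overall and in each test split.
overall_rate = y.mean()
test_rates = [y[test_idx].mean() for _, test_idx in cv.split(X, y)]

# Precision and recall averaged over all splits.
scores = cross_validate(LogisticRegression(max_iter=1000), X, y,
                        cv=cv, scoring=["precision", "recall"])
mean_precision = scores["test_precision"].mean()
mean_recall = scores["test_recall"].mean()
print(overall_rate, test_rates[:3], mean_precision, mean_recall)
```

With a plain shuffle split on so few positives, a test set can end up with almost no POIs, which makes precision and recall unstable. Stratification avoids that.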

Full report : report.pdf
Results of various tests : Enron Dataset - ML results.pdf

Additional mini-project : Text based mining.

I downloaded the bodies of the emails from the Enron dataset and performed text-based classification on them using CountVectorizer and TfidfTransformer. I got an accuracy of 50% when the dataset had equal numbers of POIs and non-POIs. When the ratio of POIs to non-POIs was at or close to 1:3, I got 25% accuracy. Anything beyond 30 data points failed to converge and produced 0% accuracy. This is because the sample of data I used has a very skewed distribution of POIs and non-POIs. Refer to get_poi_names.py.
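
A toy sketch of the CountVectorizer plus TfidfTransformer approach. The four "emails" are invented for illustration, and MultinomialNB is an assumed classifier, since the README does not name the one used.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical email bodies: label 1 = written by a POI, 0 = not.
emails = [
    "move the losses into the partnership before the audit",
    "keep the partnership debt off the balance sheet",
    "lunch meeting moved to noon on friday",
    "please send the friday schedule for the team meeting",
]
labels = [1, 1, 0, 0]

text_clf = Pipeline([
    ("counts", CountVectorizer()),   # raw word counts per email
    ("tfidf", TfidfTransformer()),   # reweight counts by inverse document frequency
    ("clf", MultinomialNB()),        # assumed classifier, not named in the README
])
text_clf.fit(emails, labels)

predictions = text_clf.predict(["partnership losses audit", "team lunch friday"])
print(list(predictions))
```

On real data the class balance matters a great deal, as the accuracy figures above show. A stratified split or resampling would be the first thing to try.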

anirudhr95/Fraud-identification-using-ML---Enron-email-corpus
