Specific to the Classification of Websites as Fishy or Not
DATASET UNDERSTANDING: 11055 websites have been classified as Fishy (1) or Not Fishy (0) based on links to 31 known Fishy websites.
The classifier has been implemented from scratch to demonstrate an understanding of the underlying concepts. The algorithm flow is as follows:
- The dataset (in CSV format) is loaded into a list of rows
- The dataset is split into a Training Set for training the naive Bayes classifier and a Test Set for validation
- Separation by Class groups the known instances (websites) of each class, so that classification can be based on them
- Summarization by Class calculates the mean and standard deviation of each attribute for each class
- Based on the summaries, the probability of each class for a given website is calculated using the Gaussian probability density function. The class with the highest probability is the Prediction.
- Accuracy on the Test Set is measured with the getAccuracy function
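The flow above can be sketched in a few short functions. This is a minimal illustration under stated assumptions, not the project's actual code: apart from getAccuracy, every function name here is hypothetical, the getAccuracy signature is a guess, the class label is assumed to be the last column of each row, and the sketch sums log-probabilities and includes the class prior for numerical stability, which may differ from how the original multiplies probabilities.

```python
import math
import random

def split_dataset(dataset, split_ratio, seed=0):
    """Shuffle a copy of the rows and split them into (train, test)."""
    rows = list(dataset)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * split_ratio)
    return rows[:cut], rows[cut:]

def separate_by_class(dataset):
    """Group rows by class label (assumed to be the last column)."""
    separated = {}
    for row in dataset:
        separated.setdefault(row[-1], []).append(row)
    return separated

def mean(values):
    return sum(values) / len(values)

def stdev(values):
    """Sample standard deviation; 0.0 when fewer than two values."""
    if len(values) < 2:
        return 0.0
    avg = mean(values)
    return math.sqrt(sum((v - avg) ** 2 for v in values) / (len(values) - 1))

def summarize_by_class(dataset):
    """Per class: row count plus a (mean, stdev) pair for each attribute."""
    summaries = {}
    for label, rows in separate_by_class(dataset).items():
        columns = list(zip(*rows))[:-1]  # drop the label column
        summaries[label] = (len(rows), [(mean(c), stdev(c)) for c in columns])
    return summaries

def gaussian_log_pdf(x, mu, sigma, eps=1e-9):
    """Log of the Gaussian density; eps guards against zero variance."""
    sigma = max(sigma, eps)
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def predict(summaries, row):
    """Return the class with the highest log prior plus log likelihood."""
    total = sum(count for count, _ in summaries.values())
    best_label, best_score = None, -math.inf
    for label, (count, stats) in summaries.items():
        score = math.log(count / total)  # class prior (an assumption)
        # zip stops at the last attribute, so a trailing label is ignored
        for x, (mu, sigma) in zip(row, stats):
            score += gaussian_log_pdf(x, mu, sigma)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def getAccuracy(test_set, predictions):
    """Percentage of rows whose last column matches the prediction."""
    correct = sum(1 for row, p in zip(test_set, predictions) if row[-1] == p)
    return correct / len(test_set) * 100.0
```

A typical run would call split_dataset on the loaded rows, pass the training rows to summarize_by_class, call predict on each test row, and hand the results to getAccuracy.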
Upcoming improvements
- Increasing the robustness of the model by training and testing on multiple folds of train and test sets (k-fold cross-validation)
- Introducing concepts of learning rates and epochs associated with Machine Learning to boost accuracy
- Evaluating Accuracy when Probability is calculated with Semi-supervised parameter estimation
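The first of these improvements, training with multiple folds, might look roughly like the sketch below. The function name k_fold_splits and its parameters are hypothetical, and the choice of shuffling once and dealing rows into k folds is one simple option among several.

```python
import random

def k_fold_splits(dataset, k, seed=0):
    """Yield k (train, test) pairs; each row lands in exactly one test fold."""
    rows = list(dataset)
    random.Random(seed).shuffle(rows)
    # Deal the shuffled rows round-robin into k folds of near-equal size
    folds = [rows[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, test
```

Averaging the accuracy over the k pairs gives an estimate that depends less on any single split.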