Website classification based on text content

This project classifies a website into one of 11 labels based on the text and metadata content of the website. The machine learning model used for classification is Logistic Regression. It was trained on the text and metadata content from 59K+ websites using TF-IDF features after basic text pre-processing. An accuracy of 77.9% was achieved. The labels are:

  • Arts
  • Business
  • Computers
  • Games
  • Health
  • Home
  • News
  • Reference
  • Science
  • Shopping
  • Society

Setup and use

Clone the repository

$ git clone https://github.com/rachhek/website_classifier_app.git
$ cd website_classifier_app

Create a virtual environment and install the dependencies

$ pipenv shell
$ (website_classifier_app) pipenv install

Once pipenv has finished installing all the dependencies, you can use classify_website.py to classify a website as follows:

$ (website_classifier_app) python classify_website.py https://etsy.com

Sample result

{
  "predicted_class": "Shopping",
  "prob_distribution": [
    ("Shopping", 89.578),
    ("Society", 2.086),
    ("Arts", 1.415),
    ("Computers", 1.15),
    ("Home", 1.11),
    ("Science", 1.068),
    ("Games", 0.932),
    ("Health", 0.711),
    ("Business", 0.69),
    ("Reference", 0.662),
    ("News", 0.598)
  ]
}

Result explanation

"predicted_class" : The class that was predicted.
"prob_distribution" : The probability distribution (in percentage) of the 11 classes for this website.

App Screenshot

(screenshot of the app)

Dependencies

The major dependencies are as follows.

  • pandas
  • beautifulsoup4
  • sklearn
  • requests
  • spacy

Check the pipfile for details.

Files and folders

- "./jupyter_notebooks" : contains Jupyter notebooks for code used for scraping the 59k+ websites, data preprocessing, data exploration, model creation and parameter tuning.
- "./model" : contains the tf-idf vectorizer file and trained logistic regression model
- "./classify_website.py" : the py file containing the main function
- "./lib.py": the py file containing the classes for scraper, text pre processing and prediction
- "./pipfile": file containing the dependencies for this pip environment.

Model Training Details

Dataset collection

The dataset used is: https://www.kaggle.com/patanjalichin1/dmoztoolsnet-categorical-data

  • Contains 800K+ labelled websites with major and sub-categories.
  • 11 major categories: Arts, Business, Computers, Games, Health, Home, News, Reference, Science, Shopping, Society.
  • For this project, only 5500 websites per category were collected (59K+ websites in total); see the sampling sketch below.
  • Check ./jupyter_notebooks/dmoz_dataset_exploration.ipynb and ./jupyter_notebooks/dataset_creation.ipynb for details.
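A rough sketch of how such a balanced sample can be drawn with pandas. The CSV file name and column names are assumptions for illustration; the actual code is in the notebooks above:

import pandas as pd

# Hypothetical file and column names
df = pd.read_csv("dmoz_categorical_data.csv")          # full 800K+ dmoz dump
sampled = (df.groupby("major_category", group_keys=False)
             .apply(lambda g: g.sample(n=min(len(g), 5500), random_state=42)))
# Roughly 11 * 5500 = 59K+ websites in total
sampled.to_csv("dmoz_sampled.csv", index=False)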

Website scraping

  • Requests was used to download the website content and BeautifulSoup was used to parse the HTML.
  • Title tags, meta tags, heading tags and the text content of each website were collected, along with counts of the HTML tags; see the sketch below.
  • Check ./jupyter_notebooks/dataset_creation.ipynb for details.
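A minimal sketch of that scraping step with requests and BeautifulSoup; the function and field names are illustrative and not the exact code in lib.py:

from collections import Counter
import requests
from bs4 import BeautifulSoup

def scrape(url):
    # Download the page and parse the HTML
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return {
        "title": soup.title.get_text(strip=True) if soup.title else "",
        "meta": " ".join(m.get("content", "") for m in soup.find_all("meta")),
        "headings": " ".join(h.get_text(" ", strip=True)
                             for h in soup.find_all(["h1", "h2", "h3"])),
        "text": soup.get_text(" ", strip=True),
        # Frequency of every HTML tag on the page
        "tag_counts": Counter(tag.name for tag in soup.find_all(True)),
    }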

Text preprocessing

  • The spaCy library was used to remove stopwords, lemmatize the words and strip other unnecessary tokens; see the sketch below.
  • Check ./jupyter_notebooks/data_preprocessing_and_exploration.ipynb for details
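A small sketch of that pre-processing with spaCy; the model name en_core_web_sm and the exact filtering rules are assumptions:

import spacy

# Assumed model; install with `python -m spacy download en_core_web_sm`
nlp = spacy.load("en_core_web_sm")

def preprocess(text):
    doc = nlp(text)
    # Keep lowercase lemmas; drop stopwords, punctuation and whitespace
    lemmas = [tok.lemma_.lower() for tok in doc
              if not (tok.is_stop or tok.is_punct or tok.is_space)]
    return " ".join(lemmas)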

Feature engineering and Model Training

  • TF-IDF features were created using sklearn's TfidfVectorizer.
  • Sklearn's logistic regression was used to train the model. Other models such as Naive Bayes and neural networks (LSTM RNN) were also tried, but the accuracy did not improve significantly and the resulting models were much larger (100 MB+). Logistic regression was therefore chosen because it gives a lightweight model with accuracy comparable to the others.
  • Sklearn's grid search was used to tune the hyperparameters; a condensed sketch of the pipeline follows this list.
  • The model has an accuracy of 77.9%.
  • Check ./jupyter_notebooks/classification_using_logistic_regression.ipynb for details.
  • The confusion matrix is as follows:
    (confusion matrix image)
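A condensed sketch of that training pipeline. The input file, column names and hyperparameter grid are illustrative assumptions; the actual training code is in the notebook above:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical output of the scraping and pre-processing steps
df = pd.read_csv("dmoz_sampled_preprocessed.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["clean_text"], df["major_category"], test_size=0.2, random_state=42)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_features=50000)),
    ("clf", LogisticRegression(max_iter=1000)),
])
# Illustrative grid; the notebook tunes the real parameters
grid = GridSearchCV(pipeline, {"clf__C": [0.1, 1, 10]}, cv=3, scoring="accuracy")
grid.fit(X_train, y_train)
print("test accuracy:", grid.score(X_test, y_test))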

Findings

The model is very good at classifying websites related to Games, Health and News. However, a high number of incorrect predictions fall into the following pairs:

  • Business websites being predicted as Shopping (161)
  • Reference websites being predicted as Science (113)
  • Shopping websites being predicted as Business (99)
  • Science websites being predicted as Business (80)

Future Ideas

  • The dmoz dataset has many sub-categories under every major category. These could be used to train a model that classifies even more classes.
  • Neural network models work better when there is a lot of training data. If all 800K+ websites from the dmoz dataset were scraped and used for training, the accuracy might improve.
  • During the scraping process, all HTML tags and their frequency of occurrence were also collected. This data could be used as additional features for training models.

