Breiner Carranza | Cristopher González | Derian Rodríguez
This program visits each of the URLs and extracts its text; if a URL is invalid, it is flagged as such. Text extraction is done with BeautifulSoup, which traverses the tags found in the body of the page and extracts the string data they contain. Finally, all of this is stored in a JSON file.
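As a reference, a minimal sketch of this extraction step could look like the following; the URL list, the timeout, and the output file name are illustrative assumptions, not the project's exact code.

```python
import json

import requests
from bs4 import BeautifulSoup

def extract_text(url):
    """Fetch a page and return the strings found inside its <body>,
    or None when the URL is invalid or unreachable."""
    try:
        response = requests.get(url, timeout=10)  # timeout is an assumption
        response.raise_for_status()
    except requests.RequestException:
        return None  # flagged as an invalid URL
    soup = BeautifulSoup(response.text, "html.parser")
    if soup.body is None:
        return None
    # stripped_strings walks the tags in the body and yields their text
    return list(soup.body.stripped_strings)

if __name__ == "__main__":
    urls = ["https://example.com"]  # placeholder URL list
    data = {url: extract_text(url) for url in urls}
    with open("datos.json", "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False)
```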
Multiprocessing is central to this app: the large amount of information that must be processed makes it necessary to distribute the workload across more than one process to get results in a reasonable time. Multiprocessing is applied in two parts of the app. First, in the web scraping step, where 4 processes apply web scraping to 4 URLs simultaneously. Second, in the analysis of the text previously extracted from the URLs, where 4 processes likewise count the words of each category per web page.
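A hedged sketch of that 4-process fan-out, using multiprocessing.Pool; `scrape_one` is a hypothetical stand-in for the real per-URL scraping function (for example, `extract_text` from the sketch above).

```python
from multiprocessing import Pool

def scrape_one(url):
    # Placeholder for the real per-URL work; it just echoes the URL
    # so this example stays runnable on its own.
    return url

def scrape_all(urls):
    # Four worker processes handle the URLs simultaneously; Pool.map
    # hands each free worker the next URL until the list is exhausted.
    with Pool(processes=4) as pool:
        return pool.map(scrape_one, urls)

if __name__ == "__main__":
    print(scrape_all(["https://example.com/a", "https://example.com/b",
                      "https://example.com/c", "https://example.com/d"]))
```

The same pattern serves the second use, text analysis: the mapped function counts category words per page instead of scraping.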
The analysis relies on four main figures: total games pages, total computing pages, total invalid pages, and total links. The prior probability of the computing and games categories is calculated by dividing each category's total by the total number of links. The incidence probability of each category is then calculated by dividing the number of words of that category by the category's total (for example, total game words / total games). Multiplying the prior probability by the incidence probability gives the probability of each category; the larger one determines the page's category. If the probabilities are equal, the page is categorized as invalid. A page is also categorized as invalid when the number of words in both categories is less than 7.
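The following sketch is one possible reading of that calculation; the function and variable names are illustrative assumptions, not the actual code in urlClassification.py.

```python
def classify(page_game_words, page_comp_words,
             total_games, total_computing, total_links):
    """Return 'games', 'computing', or 'invalid' for one page."""
    # A page with too little evidence in both categories is invalid.
    if page_game_words < 7 and page_comp_words < 7:
        return "invalid"
    # Prior probability: pages of the category over all links.
    prior_games = total_games / total_links
    prior_comp = total_computing / total_links
    # Incidence probability: category word count over the category's
    # total (e.g. total game words / total games).
    incidence_games = page_game_words / total_games
    incidence_comp = page_comp_words / total_computing
    # Score per category: prior times incidence; the larger one wins,
    # and a tie is categorized as invalid.
    p_games = prior_games * incidence_games
    p_comp = prior_comp * incidence_comp
    if p_games == p_comp:
        return "invalid"
    return "games" if p_games > p_comp else "computing"

if __name__ == "__main__":
    # e.g. a page with 20 game words and 5 computing words,
    # against illustrative totals; prints "games"
    print(classify(20, 5, total_games=40, total_computing=35, total_links=80))
```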
A bar chart shows how many URLs were identified in each of the three categories, which presents the results much more clearly; it is served as a web page, making it easy to access. Clicking one of the 3 bars shows the URLs belonging to that category, and clicking one of those URLs shows the list of words that identified it as belonging to that category.
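A minimal Dash sketch of such a chart with click drill-down might look as follows; the counts and URL lists are placeholder data, not the contents of dataCategories.json.

```python
import plotly.graph_objects as go
from dash import Dash, dcc, html, Input, Output

# Placeholder results; the real app loads these from its JSON output.
counts = {"games": 12, "computing": 9, "invalid": 3}
urls_by_category = {
    "games": ["https://example.com/game"],
    "computing": ["https://example.com/pc"],
    "invalid": [],
}

app = Dash(__name__)
app.layout = html.Div([
    dcc.Graph(
        id="bars",
        figure=go.Figure(go.Bar(x=list(counts), y=list(counts.values()))),
    ),
    html.Ul(id="url-list"),
])

@app.callback(Output("url-list", "children"), Input("bars", "clickData"))
def show_urls(click_data):
    # Clicking a bar reveals the URLs classified under that category.
    if not click_data:
        return []
    category = click_data["points"][0]["x"]
    return [html.Li(u) for u in urls_by_category[category]]

if __name__ == "__main__":
    app.run(debug=True)
```

The second drill-down (clicking a URL to see the words that identified it) would follow the same callback pattern, keyed on the list items instead of the bars.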
git clone https://github.com/cris-gs/url-classification.git
- Multiprocessing (part of the Python standard library, so no installation is needed)
- Pandas
pip install pandas
- Requests
pip install requests
- BeautifulSoup (bs4)
pip install beautifulsoup4
- Dash
pip install dash
- Plotly
pip install plotly
- First, run `web_scraping.py`, which applies web scraping to the URLs, extracting the text and saving it in the JSON file `datos.json`.
- Second, run `main.py`, which evaluates the words of each page and groups them by category in the JSON file `datos.json`.
- Third, run `urlClassification.py`, which applies the Bayesian analysis and saves the results to the JSON file `dataCategories.json`.
- Fourth and last, run `dashboard.py`, which renders the data as graphs and prints a URL to the console, which we must open in a browser.
The pipeline is split into these steps so that the time spent interacting with the data stays manageable despite the large number of URLs analyzed. However, to execute everything in a single step, we can call the functions directly from main.py and run it only once; a sketch of one way to do this follows.
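Rather than assuming the internal function names inside main.py, this hedged sketch simply executes the four scripts in order, which achieves the same one-step run:

```python
import subprocess
import sys

# Run each stage of the pipeline with the current Python interpreter;
# check=True stops the chain if any stage fails.
for script in ("web_scraping.py", "main.py",
               "urlClassification.py", "dashboard.py"):
    subprocess.run([sys.executable, script], check=True)
```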
- Web scraping, sequential time: 16126.5082452297 seconds.
- Web scraping, parallel time: 4278.2422530651 seconds.
- Text analysis, sequential time: 4.5470371246 seconds.
- Text analysis, parallel time: 1.9183907509 seconds.