pyBrokk

pyBrokk takes a list of URLs for webpages of interest and creates a dataframe with a Bag of Words representation that can then be fed into a machine learning model of the user's choice. Users can also produce a dataframe containing only the raw text of the target webpages and apply a text representation of their choice instead.

Why `pyBrokk`

There are some libraries and packages that can facilitate this job, from scraping text from a URL to converting it to a bag of words (BOW). However, to the best of our knowledge, there is no sufficiently handy and straightforward package for this purpose. This package is a tailored combination of BeautifulSoup and CountVectorizer. BeautifulSoup is widely used to pull data out of HTML and XML pages, and CountVectorizer is a well-known scikit-learn class that converts a collection of texts to a matrix of token counts.

NOTE:

Some websites do not let users collect their data with web scraping tools. Before using this package, make sure that your target websites permit this kind of data collection.
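
One common way a site signals its scraping policy is a robots.txt file. The following is a minimal sketch of such a check using Python's standard `urllib.robotparser`; it is not part of pyBrokk, and the helper name `allowed_urls`, the robots.txt content, and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. For a real site it would be downloaded
# from https://<site>/robots.txt (for example via set_url() and read()).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_urls(robots_txt, urls, user_agent="*"):
    """Return the URLs that the given robots.txt rules allow us to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


candidates = [
    "https://example.com/about",
    "https://example.com/private/report",
]
permitted = allowed_urls(ROBOTS_TXT, candidates)
print(permitted)
```

A robots.txt file is only one signal; a site's terms of service may impose further restrictions.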

Features

The pyBrokk package includes the following four functions:

create_id(): Takes a list of webpage URLs formatted as strings and returns a list of unique string identifiers, one per webpage, based on their URLs. Each identifier is composed of the main webpage name followed by a number.
text_from_url(): Takes a list of URLs, extracts the raw text from each using Beautiful Soup, and creates a dictionary. The keys are the original URLs and the values are the raw text output as parsed by Beautiful Soup.
duster(): Takes a list of URLs and uses the above two functions to create a dataframe with the webpage identifiers as an index, the raw URL, and the raw text from the webpage with extra line breaks removed.
bow(): Takes a string text as an input and returns the list of unique words it contains.

Installation

$ pip install pybrokk

Usage

Imports

import pybrokk 
import requests 
import pandas as pd 
from bs4 import BeautifulSoup 
from sklearn.feature_extraction.text import CountVectorizer

Input Format

urls = ['https://www.utoronto.ca/',
         'https://www.ubc.ca/',
         'https://www.mcgill.ca/',
         'https://www.queensu.ca/']

create_id()

Creates unique IDs for a list of URLs

url_ids = create_id(urls)
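
To illustrate the idea of an identifier built from the main webpage name followed by a number, here is a rough sketch using only the standard library. The function `make_ids` is hypothetical and is not pyBrokk's implementation; the exact ID format produced by `create_id()` may differ.

```python
from collections import Counter
from urllib.parse import urlparse


def make_ids(urls):
    """Return one ID per URL: the site's main name plus a running number."""
    seen = Counter()
    ids = []
    for url in urls:
        host = urlparse(url).hostname or ""
        # Drop a leading "www." and keep the first remaining label,
        # e.g. "www.ubc.ca" -> "ubc".
        if host.startswith("www."):
            host = host[4:]
        name = host.split(".")[0]
        # Number repeated names so every ID stays unique.
        seen[name] += 1
        ids.append(f"{name}{seen[name]}")
    return ids


example_ids = make_ids([
    "https://www.ubc.ca/",
    "https://www.ubc.ca/about",
    "https://www.mcgill.ca/",
])
print(example_ids)
```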

text_from_url()

Creates a dictionary with the original URLs as keys and the text parsed by `BeautifulSoup` as values

dictionary = text_from_url(urls)
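
The shape of such a dictionary can be sketched offline. The example below assumes the HTML has already been downloaded and only shows the Beautiful Soup parsing step; the helper `texts_from_html`, the `PAGES` data, and the example.com URLs are hypothetical, and pyBrokk's own parsing options may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical, already-downloaded pages keyed by URL. In practice the HTML
# would come from an HTTP request to each URL.
PAGES = {
    "https://example.com/a": "<html><body><h1>Hello</h1><p>First page.</p></body></html>",
    "https://example.com/b": "<html><body><p>Second page.</p></body></html>",
}


def texts_from_html(pages):
    """Map each URL to the visible text Beautiful Soup extracts from its HTML."""
    return {
        url: BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)
        for url, html in pages.items()
    }


texts = texts_from_html(PAGES)
for url, text in texts.items():
    print(url, "->", text)
```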

duster()

Creates a dataframe using the outputs of `create_id()` and `text_from_url()`

df = duster(urls)
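
As a rough picture of the resulting layout, the sketch below builds a small dataframe with identifiers as the index, the URL, and the text with extra line breaks collapsed. The helper `build_frame`, the column names `url` and `raw_text`, and the sample data are assumptions for illustration, not pyBrokk's actual output schema.

```python
import re

import pandas as pd


def build_frame(ids, texts):
    """Build a dataframe from a list of IDs and a dict mapping URL -> raw text."""
    rows = [
        # Collapse each run of line breaks (and surrounding whitespace) to one space.
        {"url": url, "raw_text": re.sub(r"\s*\n\s*", " ", text).strip()}
        for url, text in texts.items()
    ]
    return pd.DataFrame(rows, index=pd.Index(ids, name="id"))


frame = build_frame(
    ["a1", "b1"],
    {
        "https://example.com/a": "Hello\n\n\nFirst page.\n",
        "https://example.com/b": "Second\npage.",
    },
)
print(frame)
```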

bow()

Creates a dataframe of a bag of words appended to the input dataframe

df_bow = bow(df)

Contributing

Interested in contributing? Check out the contributing guidelines and the list of contributors who have contributed to the development of this project thus far. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by these terms.

License

pyBrokk was created by Elena Ganacheva, Mehdi Naji, Mike Guron, and Daniel Merigo. It is licensed under the terms of the MIT license.

Credits

pyBrokk was created with cookiecutter and the py-pkgs-cookiecutter template. pyBrokk uses Beautiful Soup.
