WikiExtractor

WikiExtractor is a simple, easy-to-use Python web scraping tool that extracts information from Wikipedia pages.

As an added feature, it also includes a simple PDF extractor that uses the Tesseract OCR engine to extract text from PDF files.

Installation

To contribute and work on the repository, you need Python installed on your system. If you do not have Python installed, you can install it from here.

Fork and clone the repository from GitHub.

git clone https://github.com/<your-username-here>/WikiExtractor.git

Change to the directory where the repository was cloned.

cd WikiExtractor

To run the scripts, you need to install the dependencies. Creating a virtual environment for them is recommended.

# Create a virtual environment (not necessary but recommended)
python3 -m venv <name-of-virtual-environment>
source <name-of-virtual-environment>/bin/activate

# Install the dependencies
pip install -r requirements.txt

Wikipedia Extractor

Use the following command to run the script.

python wiki_extractor.py --keyword=<your_keyword> --num_urls=<your_num_urls> --output=<your_output_JSON_file>

Replace each <...> placeholder with the appropriate value. Make sure the output file name ends in .json to prevent errors.
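The three flags above could be parsed along the following lines. This is a minimal sketch using the standard library's argparse, not the actual code in wiki_extractor.py; the function names build_parser and parse_args and the explicit .json check are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser for the three flags shown in the command above."""
    parser = argparse.ArgumentParser(description="Extract text from Wikipedia pages.")
    parser.add_argument("--keyword", required=True, help="search term")
    parser.add_argument("--num_urls", type=int, required=True,
                        help="number of Wikipedia URLs to process")
    parser.add_argument("--output", required=True,
                        help="output file name, ending in .json")
    return parser


def parse_args(argv=None):
    """Parse argv (defaults to sys.argv[1:]) and reject non-.json output names."""
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: the script expects a .json file name, as the note above suggests.
    if not args.output.endswith(".json"):
        parser.error("--output must end with .json")
    return args


# Example call with hypothetical values.
args = parse_args(["--keyword=python", "--num_urls=3", "--output=result.json"])
print(args.keyword, args.num_urls, args.output)
```

Failing early on a bad output name means the mistake is reported before any pages are fetched.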

PDF Extractor

To use the PDF Extractor, you will additionally have to install the Tesseract OCR Engine from here. You will also have to install Poppler from here and add the bin folder to the system PATH.

To run the script, use this command in the terminal:

python pdf_extractor.py

Implementation

WikiExtractor is implemented in Python. The code is modular so that it can be easily integrated into other projects.

The Wikipedia extractor relies on the Google search engine's ranking to give the user the best possible results. It first sends a GET request to Google with the keyword as the search term. The search engine returns a list of Wikipedia URLs that are relevant to the search term. The extractor then sends a GET request to each of these URLs and extracts the relevant information from the HTML page.
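The step that picks Wikipedia URLs out of a search results page could look roughly like this. It is a sketch using only the standard library, not the code in wiki_extractor.py. The function name wikipedia_urls is hypothetical, and the handling of /url?q=... redirect links is an assumption about how result links may be formatted.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse


class _LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def wikipedia_urls(html, num_urls):
    """Return up to num_urls distinct Wikipedia article URLs found in html."""
    collector = _LinkCollector()
    collector.feed(html)
    found = []
    for href in collector.hrefs:
        if len(found) >= num_urls:
            break
        parsed = urlparse(href)
        # Assumption: some result links are redirects of the form /url?q=<target>.
        if parsed.path == "/url":
            href = parse_qs(parsed.query).get("q", [""])[0]
            parsed = urlparse(href)
        host = parsed.hostname or ""
        is_article = host.endswith(".wikipedia.org") and parsed.path.startswith("/wiki/")
        if is_article and href not in found:
            found.append(href)
    return found
```

Each returned URL can then be fetched with a GET request and its HTML parsed for the text to keep.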

The PDF extractor uses the Tesseract OCR engine to extract text from PDF files. It first downloads the PDF file, then runs Tesseract over it to extract the text, and finally writes the extracted text to a JSON file.
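The OCR and JSON steps could be sketched as follows. This assumes the pdf2image and pytesseract packages, which would match the Poppler and Tesseract requirements above but are not confirmed to be what pdf_extractor.py uses. The function names and the JSON layout are hypothetical, and the download step is left out.

```python
import json


def ocr_pdf_pages(pdf_path):
    """Render each page to an image (pdf2image/Poppler) and OCR it (pytesseract)."""
    # Imported here so the rest of the sketch works without these packages installed.
    from pdf2image import convert_from_path
    import pytesseract

    return [pytesseract.image_to_string(page) for page in convert_from_path(pdf_path)]


def pdf_to_json(pdf_path, json_path, page_reader=ocr_pdf_pages):
    """Write the text of each page to json_path and return the written record."""
    pages = page_reader(pdf_path)
    # Assumed layout: the source path plus one text entry per page.
    record = {"source": pdf_path, "pages": [text.strip() for text in pages]}
    with open(json_path, "w", encoding="utf-8") as fh:
        json.dump(record, fh, ensure_ascii=False, indent=2)
    return record
```

Passing the OCR step in as page_reader keeps the JSON-writing logic usable, and testable, on machines where Tesseract and Poppler are not installed.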

Future Updates

The next version of this tool will use multiprocessing to speed up extraction.
