This project is a toolkit that wraps common Python utilities for web crawling. It includes a networking utility, a logging utility, a concurrency utility, and a database utility.
There are two stages in the run time of a web crawler program:
- An I/O-intensive networking stage.
- A CPU-intensive stage of parsing HTTP responses.
A web crawler that utilizes this project uses the Python multiprocessing module to distribute URLs across multiple processes. Each process then creates many coroutines to execute the I/O-intensive networking tasks asynchronously. Once any HTTP response returns, the process parses it, which is a CPU-intensive task. The web crawler alternates between these two stages until all URLs are scraped.
For example, suppose five hundred URLs need to be scraped and my computer can run five processes to handle these tasks simultaneously.
First, the web crawler splits the five hundred URLs into twenty-five chunks (twenty URLs per chunk).
Second, one chunk is assigned to each process, so five chunks are handled by five processes at a time, which means one hundred HTTP requests (5 * 20) are sent simultaneously. Once any HTTP response returns, the process parses it.
Finally, when all tasks (twenty URLs) of one chunk are done in a process, the web crawler assigns another chunk to that process.
Because of asynchronous execution, one process can wait for multiple HTTP requests simultaneously. Moreover, on a multi-core machine, multiple processes can parse responses in parallel as they return.
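The workflow above can be sketched roughly as follows. This is a minimal illustration, not the toolkit's actual code: `fetch` is a stand-in for a real aiohttp request, and the URL list and process count are hypothetical.

```python
import asyncio
from multiprocessing import Pool

def chunk_urls(urls, size):
    """Split a URL list into fixed-size chunks (the last one may be smaller)."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

async def fetch(url):
    # Stand-in for a real aiohttp request, simulated with a no-op await.
    await asyncio.sleep(0)
    return f"response for {url}"

async def crawl_chunk_async(chunk):
    # All requests in one chunk run concurrently inside a single process.
    return await asyncio.gather(*(fetch(url) for url in chunk))

def crawl_chunk(chunk):
    # Entry point executed inside each worker process.
    return asyncio.run(crawl_chunk_async(chunk))

if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(500)]
    chunks = chunk_urls(urls, 20)  # 25 chunks of 20 URLs each
    with Pool(processes=5) as pool:
        # Each process handles one chunk at a time and picks up the
        # next pending chunk as soon as it finishes.
        results = pool.map(crawl_chunk, chunks)
    print(len(results))  # 25
```

`pool.map` hands out chunks to idle workers, which gives the "finish a chunk, get the next one" behavior described above.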
Please note the following two points:
- The number of processes must coordinate with the number of CPU cores. More processes do not mean faster performance: too many processes running inside one program cause extra context switches between processes and slow the program down.
- Considering the network speed, do not assign too many tasks to one process at a time. Because of asynchronous execution, one process can perform multiple HTTP requests simultaneously, and multiple processes run inside a web crawler, so too many pending HTTP requests will wait together and lead to timeout problems.
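One common way to enforce the second point is a per-process `asyncio.Semaphore`. The cap of ten below is an arbitrary placeholder to be tuned to the actual network speed, and `fetch` again stands in for a real request:

```python
import asyncio

# Hypothetical per-process cap; tune it to the actual network speed.
MAX_CONCURRENT_REQUESTS = 10

async def fetch(url, semaphore):
    # Only MAX_CONCURRENT_REQUESTS coroutines pass this point at once;
    # the rest wait locally instead of piling up on the network.
    async with semaphore:
        await asyncio.sleep(0)  # stands in for the real aiohttp request
        return url

async def crawl(urls):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)
    return await asyncio.gather(*(fetch(url, semaphore) for url in urls))

print(len(asyncio.run(crawl([f"https://example.com/{i}" for i in range(50)]))))  # 50
```

All fifty coroutines are created at once, but at most ten requests are in flight at any moment, which keeps the rest from timing out while queued on the network.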
- Built on the Python aiohttp module.
- The main reasons that networking tasks fail during web crawling are:
- Too many requests cause the IP address or the user-agent to be blocked.
- Too many requests trigger the website's defense mechanism, which stops processing requests.
- Incorrect cookies inside the request.
- The HTTP Utility wraps the following techniques to solve these problems:
- HTTP request retry.
- Initial cookies.
- User-agent rotation.
- Proxy rotation.
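These techniques might be sketched roughly as below. The user-agent strings, proxy addresses, and the `send` callable are all hypothetical placeholders, not the toolkit's real API:

```python
import itertools

# Hypothetical pools; a real crawler would load these from configuration.
USER_AGENTS = ["Mozilla/5.0 (Windows NT 10.0)", "Mozilla/5.0 (Macintosh)"]
PROXIES = ["http://proxy-a:8080", "http://proxy-b:8080"]

_user_agent_cycle = itertools.cycle(USER_AGENTS)
_proxy_cycle = itertools.cycle(PROXIES)

def next_request_options(initial_cookies=None):
    """Rotate the user-agent and proxy for every outgoing request."""
    return {
        "headers": {"User-Agent": next(_user_agent_cycle)},
        "proxy": next(_proxy_cycle),
        "cookies": initial_cookies or {},
    }

def request_with_retry(send, url, max_retries=3):
    """Retry a failed request up to max_retries times.

    `send` is any callable that performs one HTTP request and raises on
    failure (blocked IP, triggered defense mechanism, bad cookies, ...).
    """
    last_error = None
    for _ in range(max_retries):
        try:
            return send(url, **next_request_options())
        except Exception as error:
            last_error = error
    raise last_error
```

Each retry picks up a fresh user-agent and proxy, so a request blocked under one identity can succeed under another.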
- There are three storage modes for the Database Utility:
- SQLite
- CSV
- JSON
- The Database Utility defines the same interface for all three storage modes, so we can store data directly as a list of Python dictionaries.
- SQLite
- Built on the Python SQLAlchemy module.
- CSV
- Built on the Python csv module.
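A minimal sketch of such a unified interface for the CSV and JSON modes, using only the standard library. `store` is a hypothetical name, not the toolkit's actual API, and the real SQLite mode would go through SQLAlchemy instead:

```python
import csv
import json

def store(records, path, mode):
    """Write a list of dictionaries to `path` in the chosen storage mode."""
    if not records:
        return  # nothing to write; CSV mode needs at least one record for headers
    if mode == "json":
        with open(path, "w", encoding="utf-8") as f:
            json.dump(records, f, ensure_ascii=False, indent=2)
    elif mode == "csv":
        with open(path, "w", encoding="utf-8", newline="") as f:
            # Column names come from the keys of the first record.
            writer = csv.DictWriter(f, fieldnames=records[0].keys())
            writer.writeheader()
            writer.writerows(records)
    else:
        raise ValueError(f"unsupported mode: {mode}")
```

The caller passes the same list of dictionaries regardless of mode, e.g. `store(movies, "movies.json", "json")`, which is the "same interface" idea described above.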
- A web crawler that utilizes this toolkit runs in a multi-process environment, which causes race conditions if multiple processes write log messages to one log file at the same time. The following design solves this problem:
- The web crawler creates a dedicated process and a multiprocessing queue. Inside that process, the web crawler receives all log messages from the queue and writes them to a log file.
- The multiprocessing queue is passed to every process to collect all log messages.
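This design matches the standard library's `QueueHandler`/`QueueListener` pair. The sketch below is an illustration under assumptions (the worker count, logger name, and message text are made up), not the toolkit's actual code:

```python
import logging
import logging.handlers
import multiprocessing
import os
import tempfile

def worker(queue, worker_id):
    # Worker processes never touch the log file; they only push log
    # records onto the shared multiprocessing queue.
    logger = logging.getLogger("crawler")
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.handlers.QueueHandler(queue))
    logger.info("worker %d finished a chunk", worker_id)

if __name__ == "__main__":
    queue = multiprocessing.Queue()
    log_path = os.path.join(tempfile.mkdtemp(), "crawler.log")
    # A single listener drains the queue and is the only writer to the
    # log file, so there is no race condition between processes.
    listener = logging.handlers.QueueListener(queue, logging.FileHandler(log_path))
    listener.start()
    workers = [multiprocessing.Process(target=worker, args=(queue, i)) for i in range(5)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
    listener.stop()
```

Because only the listener writes to the file, log lines from the five workers can never interleave mid-line.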
- The Crawler Utility wraps the APIs of the multiprocessing module and the Database Utility; we can simply pass a multiprocessing pool and a crawler function into the API and we are good to go.
- The Crawler Utility temporarily keeps all collected data in memory. Once the web crawler collects more than five hundred records, the Crawler Utility uses the Database Utility to move all data into the database (or the CSV/JSON file).
- The Crawler Utility saves all failed URLs into a retry_info.json file so they can be recrawled in the future.
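A rough sketch of this buffering behavior; `RecordBuffer` and its method names are hypothetical, not the Crawler Utility's real API:

```python
import json

FLUSH_THRESHOLD = 500  # matches the five-hundred-record buffer described above

class RecordBuffer:
    """Hypothetical sketch of the Crawler Utility's in-memory buffer."""

    def __init__(self, flush):
        # `flush` stands in for the Database Utility's bulk-store call.
        self._records = []
        self._flush = flush

    def add(self, records):
        self._records.extend(records)
        if len(self._records) > FLUSH_THRESHOLD:
            # Move everything to durable storage and clear the buffer.
            self._flush(self._records)
            self._records = []

    def save_failures(self, failed_urls, path="retry_info.json"):
        # Failed URLs are persisted so they can be recrawled later.
        with open(path, "w", encoding="utf-8") as f:
            json.dump(failed_urls, f)
```

Batching writes this way trades a little memory for far fewer database or file operations, which matters when many processes are producing records at once.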
Site Name | Site URL | Code | Description |
---|---|---|---|
Yahoo Movie | Link | Link | Crawl all movies. |
Under Armour | Link | Link | Crawl all products. |
- Add Selenium support.
- Add a proxies crawler.
- Add a user-agents crawler.