A spider for news on Twitter and Weibo.
Some basic knowledge and the development log
- Operating system: any common Linux distribution should work (development and testing were done on Ubuntu 20.04)
- Python >= 3.6.0
- MongoDB >= 4.2
- Docker: if possible, use the latest version
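The version requirements above can be checked from Python before installing anything. The following is only a minimal sketch: the helper names `parse_version` and `meets_minimum` are made up for illustration, and the MongoDB version string has to be supplied by you (for example from the output of your MongoDB installation).

```python
import sys

# Minimum versions taken from the requirements list above.
MIN_PYTHON = (3, 6, 0)
MIN_MONGODB = (4, 2)


def parse_version(text):
    """Turn a dotted version string such as '4.2.8' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split(".") if part.isdigit())


def meets_minimum(version, minimum):
    """Compare two version tuples after padding the shorter one with zeros."""
    width = max(len(version), len(minimum))

    def pad(value):
        return tuple(value) + (0,) * (width - len(value))

    return pad(version) >= pad(minimum)


if __name__ == "__main__":
    # sys.version_info[:3] is (major, minor, micro) of the running interpreter.
    print("Python version OK:", meets_minimum(sys.version_info[:3], MIN_PYTHON))
```

Padding matters because Python compares tuples element by element, so without it `(3, 6)` would count as smaller than `(3, 6, 0)`.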
Clone the repository and install the Python dependencies
git clone [email protected]:LonelVino/Spider.git
cd Spider
pip install -r requirements.txt
You can use the SCRAPY_PROJECT environment variable to select which of the projects defined in scrapy.cfg the scrapy tool should use. For example, if scrapy.cfg defines project2=tw_spider.settings, you can switch to the Twitter spider with:
export SCRAPY_PROJECT=project2
(See the Scrapy documentation on the command line tool.)
By default, the scrapy command-line tool uses the default settings, e.g. wb_spider.
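To make the selection rule concrete, here is a simplified sketch of how a settings module could be looked up from scrapy.cfg based on SCRAPY_PROJECT. It imitates the behaviour described above and is not Scrapy's own code. The function name `resolve_settings_module` is hypothetical, and the example file content assumes that the default entry points to `wb_spider.settings`, by analogy with `tw_spider.settings`.

```python
import configparser
import os

# Assumed content of scrapy.cfg; the real file in the repository may differ.
EXAMPLE_CFG = """
[settings]
default = wb_spider.settings
project2 = tw_spider.settings
"""


def resolve_settings_module(cfg_text, environ=None):
    """Return the settings module selected by SCRAPY_PROJECT.

    Falls back to the 'default' entry when the variable is not set.
    """
    environ = os.environ if environ is None else environ
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    project = environ.get("SCRAPY_PROJECT", "default")
    return parser.get("settings", project)


if __name__ == "__main__":
    print(resolve_settings_module(EXAMPLE_CFG))
```

Passing `{"SCRAPY_PROJECT": "project2"}` as `environ` selects the Twitter settings, which corresponds to running the `export` command above before calling scrapy.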
Usage:
  scrapy <command> [options] [args]

Available commands:
  crawl         Run a spider
  settings      Get settings values
  startproject  Create a new Scrapy project
There are several ways to explore the database, such as MongoDB Atlas, MongoDB Compass, and MongoDB Server, as described in the MongoDB documentation. Here, I use MongoDB Server together with MongoDB Compass.
You can refer to the official tutorial, which is very comprehensive.
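If you prefer a quick look from Python, the sketch below lists every non-system database with its collections and approximate document counts. It assumes that `pymongo` is installed and that MongoDB Server listens on the default URI `mongodb://localhost:27017`; both are assumptions, as are the function names `summarize_collections` and `main`. No database or collection names from this project are assumed.

```python
# Databases that MongoDB creates for its own bookkeeping.
SYSTEM_DATABASES = {"admin", "config", "local"}


def summarize_collections(client):
    """Map each non-system database to {collection name: estimated count}."""
    summary = {}
    for db_name in client.list_database_names():
        if db_name in SYSTEM_DATABASES:
            continue
        database = client[db_name]
        summary[db_name] = {
            name: database[name].estimated_document_count()
            for name in database.list_collection_names()
        }
    return summary


def main(uri="mongodb://localhost:27017"):
    """Connect to a MongoDB server (the URI is an assumption) and print the summary."""
    from pymongo import MongoClient  # imported here so the helper above has no hard dependency

    client = MongoClient(uri)
    for db_name, collections in summarize_collections(client).items():
        print(db_name, collections)
```

Call `main()` once the server is running; `estimated_document_count()` reads collection metadata, so it is fast but only approximate.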
In addition, I use MongoDB Compass to visualize the database. It is a powerful GUI for querying, aggregating, and analyzing MongoDB data in a visual environment. Examples of the database are as follows:
Start or stop a Docker container, or list the running containers
sudo docker container start [container name]
sudo docker container stop [container name]
sudo docker container ls
Check the IP address of a container
docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' [container_id or container_name]
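The same lookup can be scripted from Python by parsing the JSON that `docker inspect` prints. This is a sketch under the assumption that the `docker` CLI is on your PATH and that your user may call it without sudo; the function names `extract_ip_addresses` and `container_ips` are hypothetical.

```python
import json
import subprocess


def extract_ip_addresses(inspect_json):
    """Return {network name: IP address} from `docker inspect` JSON output."""
    containers = json.loads(inspect_json)
    # `docker inspect` prints a JSON array with one object per inspected target.
    networks = containers[0]["NetworkSettings"]["Networks"] or {}
    return {name: info["IPAddress"] for name, info in networks.items()}


def container_ips(container):
    """Run `docker inspect` for a container name or ID and extract its IPs."""
    result = subprocess.run(
        ["docker", "inspect", container],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,  # decode output as text; works on Python 3.6
    )
    return extract_ip_addresses(result.stdout)
```

For example, `container_ips("[container name]")` returns one entry per network the container is attached to, which is the same information the `-f` template above prints on a single line.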