Spider

A spider for news on Twitter and Weibo.

Pre-Knowledge

Some basic knowledge and the development log

How to start

Operating environment

Operating system: any common Linux distribution should work (developed and tested on Ubuntu 20.04)
Python >= 3.6.0
MongoDB >= 4.2
Docker, preferably the latest version

Clone and install dependencies

Clone the repository and install the Python dependencies:

git clone git@github.com:LonelVino/Spider.git
cd Spider
pip install -r requirements.txt

Select Spider

scrapy.cfg can define several projects, and the SCRAPY_PROJECT environment variable selects which one scrapy uses. For example, if scrapy.cfg defines project2=tw_spider.settings, you can switch to the Twitter spider with:

export SCRAPY_PROJECT=project2

(See the Scrapy documentation on the command line tool.)

When SCRAPY_PROJECT is not set, the scrapy command-line tool uses the default settings entry, e.g. wb_spider.
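The lookup described above can be sketched in plain Python. This is only an illustration of the selection rule, not Scrapy's own code: the `default = wb_spider.settings` entry and the helper name `resolve_settings_module` are assumptions, and only `project2 = tw_spider.settings` comes from the text above.

```python
import configparser
import os

# Hypothetical scrapy.cfg contents. Only the project2 entry is taken from
# this README; the default entry is an assumption for illustration.
SCRAPY_CFG = """
[settings]
default = wb_spider.settings
project2 = tw_spider.settings
"""


def resolve_settings_module(cfg_text, environ=None):
    """Return the settings module that SCRAPY_PROJECT selects in [settings]."""
    environ = os.environ if environ is None else environ
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    # Fall back to the "default" entry when SCRAPY_PROJECT is not set.
    project = environ.get("SCRAPY_PROJECT", "default")
    return parser.get("settings", project)


if __name__ == "__main__":
    print(resolve_settings_module(SCRAPY_CFG, {}))
    print(resolve_settings_module(SCRAPY_CFG, {"SCRAPY_PROJECT": "project2"}))
```

Passing the environment as a dictionary keeps the sketch easy to try without changing the real shell environment.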

Some tips for the `scrapy` command-line tool

Usage:
  scrapy <command> [options] [args]
Available commands:
  crawl:         Run a spider
  settings:      Get settings values
  startproject:  Create a new Scrapy project

Initialize and Start Spider

Database

According to the MongoDB documentation, there are several ways to explore the database, such as MongoDB Atlas, MongoDB Compass, and MongoDB Server. Here, I use MongoDB Server and MongoDB Compass together.

MongoDB Server

You can refer to the official tutorial, which is very comprehensive.

MongoDB Compass

I also use MongoDB Compass, a powerful GUI for querying, aggregating, and analyzing MongoDB data in a visual environment, to visualize the database. Examples of the database are as follows:

Appendix

Some Docker tips

Start or stop a Docker container, or list running containers

sudo docker container start [container name]
sudo docker container stop [container name]
sudo docker container ls

Check the IP address of a container

docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' [container_id or container_name]
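The template above walks `NetworkSettings.Networks` and prints each `IPAddress`. A minimal Python sketch of the same lookup, reading the JSON that `docker inspect` prints, might look like the following. The helper names `container_ips` and `inspect_container` and the sample data are hypothetical; `inspect_container` needs the Docker CLI and an existing container, so it is not called here.

```python
import json
import subprocess


def container_ips(inspect_json):
    """Map network name to IP address for the first container in the output."""
    containers = json.loads(inspect_json)  # docker inspect prints a JSON array
    networks = containers[0]["NetworkSettings"]["Networks"]
    return {name: net["IPAddress"] for name, net in networks.items()}


def inspect_container(name):
    """Run `docker inspect <name>` and return its network IPs."""
    result = subprocess.run(
        ["docker", "inspect", name],
        stdout=subprocess.PIPE,
        universal_newlines=True,  # decode output as text (works on Python 3.6)
        check=True,
    )
    return container_ips(result.stdout)


# Trimmed, made-up sample of `docker inspect` output, for illustration only.
SAMPLE = '[{"NetworkSettings": {"Networks": {"bridge": {"IPAddress": "172.17.0.2"}}}}]'

if __name__ == "__main__":
    print(container_ips(SAMPLE))
```

Returning a dictionary keeps the network name next to each address, which helps when a container is attached to more than one network.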
