GitHub - BEPb/web_crawler: fast and simple web crawler

Read this in other languages: Russian, हिन्दी, 中國人

Fast and simple crawler

How does it work?

It's very simple: your bot subscribes your account to others in bulk and, in response, people follow you.

Setting up and working with the crawler

Clone the repository, or download the archive from GitHub, using the following commands on the command line:
```
$ git clone https://github.com/BEPb/web_crawler
$ cd web_crawler
```
Create a Python virtual environment.
Install all necessary packages for our code to work using the following command:
```
pip install -r requirements.txt
```
Create a Scrapy project called nameproject:

    scrapy startproject nameproject

This creates a folder named after the project, containing the minimum set of files it needs:

    scrapy.cfg # deploy configuration file
    nameproject/ # project's Python module, you'll import your code from here
        __init__.py
        items.py # project items definition file
        middlewares.py # project middlewares file
        pipelines.py # project pipelines file
        settings.py # project settings file
        spiders/ # a directory where you'll later put your spiders
            __init__.py

Go to the project folder:

    cd nameproject

Create a quotes_spider.py file in the spiders/ folder and define in it what to crawl and how to parse it.
Launch the crawler:

    scrapy crawl quotes

As a result, two new files are created, quotes-1.html and quotes-2.html, containing the content of the corresponding URLs, as the parse method specifies.

To experiment with selectors, open the Scrapy shell:

    scrapy shell 'https://quotes.toscrape.com/page/1/'

View all title elements using CSS. response.css('title') returns a list-like object called SelectorList, a list of Selector objects that wrap XML/HTML elements and let you run further queries to refine the selection or extract data:

    response.css('title')

To get the matched text as a list of strings, call the getall() method:

    response.css('title::text').getall()

The same can be done with XPath; here get() returns only the first match instead of a list:

    response.xpath('//title/text()').get()

Now select the div elements with the class quote:

    response.css("div.quote")

Take only the first element in the list and store it in a variable:

    quote = response.css("div.quote")[0]

To extract the quote text and the author name from that element, use the following commands:

    quote.css("span.text::text").get()
    quote.css("small.author::text").get()

And this is how to list the text of every tag link found inside the div.quote elements:

    response.css("div.quote").css("div.tags a.tag::text").getall()

This is how to save the result in JSON format; the -O command-line switch overwrites any existing file:

    scrapy crawl quotes -O quotes.json

And this is how to save the result in CSV format:

    scrapy crawl quotes -O quotes.csv

The following command writes the items line by line in JSON Lines (.jl) format; the lowercase -o switch appends to an existing file instead of overwriting it:

    scrapy crawl quotes -o quotes.jl
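
To see why appending suits the .jl format, the sketch below serializes two invented items both ways using only the standard json module. It illustrates the two file layouts and is not Scrapy's own exporter code.

```python
import json

# Invented sample items standing in for scraped quotes.
items = [
    {"text": "First quote.", "author": "Author One"},
    {"text": "Second quote.", "author": "Author Two"},
]

# JSON: a single array holds every item, so appending a second array
# to the same file would no longer be valid JSON.
as_json = json.dumps(items)

# JSON Lines: one standalone JSON object per line, so new records can
# simply be appended to the end of the file.
as_jl = "\n".join(json.dumps(item) for item in items)

# Reading JSON Lines back is a line-by-line loop.
parsed = [json.loads(line) for line in as_jl.splitlines()]
print(as_json)
print(as_jl)
```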
