Scrapy Eagle (rafaelcapucho/scrapy-eagle)

Scrapy Eagle is a tool that lets you run any Scrapy-based project in a distributed fashion and monitor its progress and the resources it consumes on each server.

This project is under development; don't use it yet.

Requirements

Scrapy Eagle uses Redis as a distributed queue, so you will need a running Redis instance.

Installation

Install it by running the commands below:

$ virtualenv eagle_venv; cd eagle_venv; source bin/activate
$ pip install scrapy-eagle

Create a configparser-style configuration file (e.g. /etc/scrapy-eagle.ini) containing:

[redis]
host = 127.0.0.1
port = 6379
db = 0
;password = someverysecretpass

[server]
debug = True
cookie_secret_key = ha74h3hdh42a
host = 0.0.0.0
port = 5000

[scrapy]
binary = /project_venv/bin/scrapy
base_dir = /project_venv/project_scrapy/project

[commands]
binary = /project_venv/bin/python3
base_dir = /project_venv/project_scrapy/project/commands

You can then run the eagle_server command:

eagle_server --config-file=/etc/scrapy-eagle.ini
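
Before starting the server, it can help to confirm that the configuration file parses and contains the expected sections. The following is a minimal sketch using Python's standard configparser module. It only illustrates the file format and is not how eagle_server itself loads its settings; the load_config helper and the REQUIRED_SECTIONS list are hypothetical names introduced here.

```python
import configparser

# Sample contents mirroring the example file above.
SAMPLE_CONFIG = """
[redis]
host = 127.0.0.1
port = 6379
db = 0
;password = someverysecretpass

[server]
debug = True
cookie_secret_key = ha74h3hdh42a
host = 0.0.0.0
port = 5000

[scrapy]
binary = /project_venv/bin/scrapy
base_dir = /project_venv/project_scrapy/project

[commands]
binary = /project_venv/bin/python3
base_dir = /project_venv/project_scrapy/project/commands
"""

# Assumption: all four sections from the example are expected to be present.
REQUIRED_SECTIONS = ("redis", "server", "scrapy", "commands")


def load_config(text):
    """Parse INI text and fail early if a section is missing (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    missing = [name for name in REQUIRED_SECTIONS if not parser.has_section(name)]
    if missing:
        raise ValueError("missing sections: " + ", ".join(missing))
    return parser


config = load_config(SAMPLE_CONFIG)
redis_port = config.getint("redis", "port")
debug = config.getboolean("server", "debug")
# Lines starting with ';' are comments, so the password option is absent here.
password = config.get("redis", "password", fallback=None)
print(redis_port, debug, password)
```

To check a real file instead of the sample string, call parser.read("/etc/scrapy-eagle.ini") in place of read_string.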

Changes to your Scrapy project

Enable the components in the settings.py of your Scrapy project:

# Enable the scheduler that stores the request queue in Redis.
SCHEDULER = "scrapy_eagle.worker.scheduler.DistributedScheduler"

# Ensure all spiders share the same duplicates filter through Redis.
DUPEFILTER_CLASS = "scrapy_eagle.worker.dupefilter.RFPDupeFilter"

# Choose one queue class; if several assignments are left active, the last one wins.
# Schedule requests using a priority queue. (default)
SCHEDULER_QUEUE_CLASS = "scrapy_eagle.worker.queue.SpiderPriorityQueue"

# Alternative: schedule requests using a queue (FIFO).
# SCHEDULER_QUEUE_CLASS = "scrapy_eagle.worker.queue.SpiderQueue"

# Alternative: schedule requests using a stack (LIFO).
# SCHEDULER_QUEUE_CLASS = "scrapy_eagle.worker.queue.SpiderStack"

# Max idle time to prevent the spider from being closed during distributed crawling.
# This only works if the queue class is SpiderQueue or SpiderStack,
# and the spider may also block for the same time when it first starts (because the queue is empty).
SCHEDULER_IDLE_BEFORE_CLOSE = 0

# Specify the host and port to use when connecting to Redis (optional).
REDIS_HOST = 'localhost'
REDIS_PORT = 6379

# Specify the full Redis URL for connecting (optional).
# If set, this takes precedence over the REDIS_HOST and REDIS_PORT settings.
REDIS_URL = "redis://user:pass@hostname:6379"
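
The three SCHEDULER_QUEUE_CLASS options differ in the order in which queued requests are handed back to the spider. The sketch below illustrates priority, FIFO and LIFO ordering with in-memory standard-library structures. It does not use Redis or the scrapy_eagle queue classes; the sample requests and function names are hypothetical, and it assumes Scrapy's convention that a higher priority value is served first.

```python
import heapq
from collections import deque

# Hypothetical (name, priority) pairs, listed in the order they were enqueued.
requests = [("page-a", 0), ("page-b", 10), ("page-c", 5)]


def fifo_order(items):
    """Queue: the first request enqueued is the first one served."""
    queue = deque(name for name, _ in items)
    return [queue.popleft() for _ in range(len(queue))]


def lifo_order(items):
    """Stack: the most recently enqueued request is served first."""
    stack = [name for name, _ in items]
    return [stack.pop() for _ in range(len(stack))]


def priority_order(items):
    """Priority queue: the highest priority value is served first."""
    heap = []
    for index, (name, priority) in enumerate(items):
        # heapq pops the smallest entry, so the priority is negated;
        # the index keeps equal priorities in insertion order.
        heapq.heappush(heap, (-priority, index, name))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


print(fifo_order(requests))      # ['page-a', 'page-b', 'page-c']
print(lifo_order(requests))      # ['page-c', 'page-b', 'page-a']
print(priority_order(requests))  # ['page-b', 'page-c', 'page-a']
```

Roughly speaking, FIFO crawls in discovery order, LIFO follows the newest links first, and the priority queue honors each request's priority value.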

Once the configuration is finished, adapt each spider to use our mixin:

from scrapy.spiders import CrawlSpider, Rule
from scrapy_eagle.worker.spiders import DistributedMixin

class YourSpider(DistributedMixin, CrawlSpider):

    name = "domain.com"

    # start_urls = ['http://www.domain.com/']
    redis_key = 'domain.com:start_urls'

    rules = (
        Rule(...),
        Rule(...),
    )

    def _set_crawler(self, crawler):
        CrawlSpider._set_crawler(self, crawler)
        DistributedMixin.setup_redis(self)
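
The _set_crawler override calls the base spider's setup first and the mixin's Redis setup second. The sketch below shows only that call order, using stand-in classes so it runs without Scrapy or Redis. FakeCrawlSpider and FakeDistributedMixin are hypothetical placeholders, and what the real setup_redis does internally is not shown here.

```python
class FakeCrawlSpider:
    """Stand-in for scrapy.spiders.CrawlSpider (not the real class)."""

    def _set_crawler(self, crawler):
        self.crawler = crawler
        self.events = ["crawler set"]


class FakeDistributedMixin:
    """Stand-in for DistributedMixin; the real setup_redis behavior is assumed."""

    redis_key = None

    def setup_redis(self):
        # Relies on the crawler having been attached by the base class first.
        assert self.crawler is not None
        self.events.append("redis ready for " + self.redis_key)


class YourSpider(FakeDistributedMixin, FakeCrawlSpider):
    name = "domain.com"
    redis_key = "domain.com:start_urls"

    def _set_crawler(self, crawler):
        # Same order as in the spider above: base class first, then the mixin.
        FakeCrawlSpider._set_crawler(self, crawler)
        FakeDistributedMixin.setup_redis(self)


spider = YourSpider()
spider._set_crawler(object())
print(spider.events)
```

Calling the two setup steps explicitly keeps the order visible, which matters if the Redis setup needs the crawler to be attached already.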

Feeding a Spider from Redis

The class scrapy_eagle.worker.spiders.DistributedMixin enables a spider to read its URLs from Redis. The URLs in the Redis queue are processed one after another.

Then push URLs to Redis:

redis-cli lpush domain.com:start_urls http://domain.com/
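
The same push can be done from Python. The sketch below is a minimal example, assuming a client object with an lpush(key, value) method such as the one provided by the redis-py package. The push_start_urls helper and the InMemoryClient stand-in are hypothetical and exist only so the example runs without a Redis server.

```python
class InMemoryClient:
    """Stand-in for a Redis client; supports only lpush on in-memory lists."""

    def __init__(self):
        self.lists = {}

    def lpush(self, key, value):
        # LPUSH inserts at the head of the list and returns the new length.
        self.lists.setdefault(key, []).insert(0, value)
        return len(self.lists[key])


def push_start_urls(client, key, urls):
    """Push each URL onto the list at `key`; return the final list length."""
    length = 0
    for url in urls:
        length = client.lpush(key, url)
    return length


client = InMemoryClient()
print(push_start_urls(client, "domain.com:start_urls", ["http://domain.com/"]))
```

With redis-py installed, replacing InMemoryClient() with redis.Redis(host="127.0.0.1", port=6379, db=0) should push to the Redis instance from the configuration file; the key must match the spider's redis_key.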

Dashboard Development

To change the client side you'll need npm installed, because the interface is built with ReactJS. Install all dependencies locally:

cd scrapy-eagle/dashboard
npm install

Then run npm start to compile the dashboard and watch for changes, recompiling automatically.

To generate the production version, run npm run build.

To make the dashboard easier to test, you can use a simple HTTP server instead of running eagle_server:

sudo npm install -g http-server
cd scrapy-eagle/dashboard
http-server templates/

The dashboard will then be available at http://127.0.0.1:8080

Note: So far, Scrapy Eagle is mostly based on https://github.com/rolando/scrapy-redis.
