The ScrapeOps Scrapy SDK is an extension for your Scrapy spiders that gives you all the scraping monitoring, statistics, alerting, scheduling and data validation you will need straight out of the box.
Just enable it in your settings.py file and the SDK will automatically monitor your scrapers and send your logs to your scraping dashboard. When connected to a ScrapyD server, you can schedule and manage all your jobs from one easy-to-use interface.
Full documentation can be found here: ScrapeOps Documentation
Features:

- Scrapy Job Stats & Visualisation
  - 📈 Individual Job Progress Stats
  - 📊 Compare Jobs versus Historical Jobs
  - 💯 Job Stats Tracked
    - ✅ Pages Scraped & Missed
    - ✅ Items Parsed & Missed
    - ✅ Item Field Coverage
    - ✅ Runtimes
    - ✅ Response Status Codes
    - ✅ Success Rates & Average Latencies
    - ✅ Errors & Warnings
    - ✅ Bandwidth
- Health Checks & Alerts
  - 🕵️‍♂️ Custom Spider & Job Health Checks
  - 📦 Out of the Box Alerts - Slack (More coming soon!)
  - 📑 Daily Scraping Reports
- ScrapyD Cluster Management
  - 🔗 Integrate With ScrapyD Servers
  - ⏰ Schedule Periodic Jobs
  - 💯 Full Scrapyd JSON API Support
  - 🔐 Secure Your ScrapyD with BasicAuth, HTTPS or Whitelisted IPs
- Proxy Monitoring (Coming Soon)
  - 📈 Monitor Your Proxy Account Usage
  - 📉 Track Your Proxy Providers' Performance
  - 📊 Compare Proxy Performance Versus Other Providers
You can get the ScrapeOps monitoring suite up and running in 4 easy steps.

First, install the ScrapeOps SDK:

```
pip install scrapeops-scrapy
```
Create a free ScrapeOps account here and get your API key from the dashboard.
When you have your API key, open your Scrapy project's settings.py file and insert your API key into it:

```python
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'
```
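If you'd rather not commit the key to source control, you could read it from an environment variable instead. This is plain Python rather than anything ScrapeOps-specific, since settings.py is just a Python module; a minimal sketch, assuming the key is exported as SCRAPEOPS_API_KEY in your shell:

```python
import os

# Read the API key from the environment instead of hardcoding it.
# os.environ.get() returns None if the variable isn't set.
SCRAPEOPS_API_KEY = os.environ.get('SCRAPEOPS_API_KEY')
```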
In the same settings.py file, enable the ScrapeOps extension by adding it to the EXTENSIONS dictionary:

```python
EXTENSIONS = {
    'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500,
}
```
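A quick note on the 500: values in the EXTENSIONS dictionary are Scrapy load-order priorities. Unlike middlewares, extensions rarely depend on each other, so the exact number generally doesn't matter; 500 is just a conventional middle-of-the-range value.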
To get the most accurate stats, you also need to add the ScrapeOps retry middleware to the DOWNLOADER_MIDDLEWARES dictionary and disable the default Scrapy retry middleware in your Scrapy project's settings.py file. You can do this by setting the default Scrapy RetryMiddleware to None and enabling the ScrapeOps retry middleware in its place:

```python
DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}
```
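The 550 slot is where Scrapy's default RetryMiddleware sits in DOWNLOADER_MIDDLEWARES_BASE, so this swap keeps the middleware order intact. And since retries operate as before, the standard Scrapy retry settings should still apply as usual; these are plain Scrapy settings (shown with recent default values), not ScrapeOps-specific ones:

```python
# Standard Scrapy retry settings (defaults shown for reference).
RETRY_ENABLED = True
RETRY_TIMES = 2  # number of extra attempts per failed request
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]
```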
Retries will operate exactly as before; the only difference is that the ScrapeOps retry middleware will also log every request, response and exception your spiders generate.
By default, the ScrapeOps SDK logs the settings used for each scrape so you can keep track of them. To ensure it doesn't record sensitive information such as API keys, it won't log any setting whose name contains one of the following substrings:

- API_KEY
- APIKEY
- SECRET_KEY
- SECRETKEY
Settings whose names don't match these patterns will still be logged; for example, a setting named PROXY_API_KEY would be filtered automatically, while one named PROXY_PASSWORD would not. You can exclude any additional settings from logging by adding their names to the SCRAPEOPS_SETTINGS_EXCLUSION_LIST:

```python
SCRAPEOPS_SETTINGS_EXCLUSION_LIST = [
    'NAME_OF_SETTING_NOT_TO_LOG',
]
```
That's all. From here, the ScrapeOps SDK will automatically monitor your scraping jobs, collect their statistics and display them in your ScrapeOps dashboard.
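For reference, pulling the steps above together, a complete minimal settings.py integration looks something like this (YOUR_API_KEY is a placeholder, and the exclusion list is optional):

```python
# settings.py - minimal ScrapeOps SDK integration, combining the steps above.

SCRAPEOPS_API_KEY = 'YOUR_API_KEY'

# Enable the ScrapeOps monitoring extension.
EXTENSIONS = {
    'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500,
}

# Swap Scrapy's default retry middleware for the ScrapeOps one.
DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}

# Optional: keep additional settings out of the logged job settings.
SCRAPEOPS_SETTINGS_EXCLUSION_LIST = [
    'NAME_OF_SETTING_NOT_TO_LOG',
]
```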