Skip to content

trile127/affiliate_web_scraper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Generic Web Scraper for Affiliate Website Products

Installation

Requires Python 3.6.5 and up

pip install -r requirements.txt

Parameters:

-o <name_of_file> Output file to write to. EX: url_results.json. Will write it to <path-to>/affiliate_web_scraper/affiliate_web_scraper/spiders/<name_of_file>

-a domains=<domains> Scrapes specific list of domains, delimited by ',' (comma)
-a allowed_domains=<allowed_domains> Does not scrape any url that is outside these allowed domains, delimited by ','
-a search_term=<exact phrase to find> Checks text descriptions of html page for this term

Run from Pycharm:

# Tested on Windows
scrapy.cmdline runspider generic_scraper.py -o url_results.json -a domains=https://thewirecutter.com/,https://appliancebuyersguide.com -a allowed_domains=thewirecutter.com,appliancebuyersguide.com -a search_term=SHEM63W55N

Run from Command Line:

# From directory: <path-to>/affiliate_web_scraper/
scrapy runspider affiliate_web_scraper/spiders/generic_scraper.py -o url_results.json -a domains=https://thewirecutter.com/,https://appliancebuyersguide.com -a allowed_domains=thewirecutter.com,appliancebuyersguide.com -a search_term=SHEM63W55N

Notes

Do not use on digitaltrends.com or consumeraffairs.com

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published