Snatch-It grabs images from the Internet and saves them to a local drive.

- Uses Puppeteer (the Google headless Chrome Node API) to scrape a site (see the sketch after this list)
- Uses the `config` module to define Snatch-It settings, from "where to save images" to "grab images from this selector"
- Walks through all pages for as long as the next-page selector can be found on the current page
- Creates a folder for the site and one for each visited page, so the results stay easy to navigate
- Uses [Books to Scrape](http://books.toscrape.com/) as the default config (shown below)
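
To show how those pieces fit together, here is a minimal sketch of the scraping loop, not the repository's actual source: it reads the same keys found in `default.json` below, collects image URLs from the configured selector, and keeps clicking the next-page selector until it disappears. It assumes Node 18+ for the global `fetch`; file and variable names are illustrative.

```js
// sketch.js — illustrative only; the real entry point in this repo may differ.
const fs = require('fs');
const path = require('path');
const config = require('config');
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: config.get('browser.headless') });
  const page = await browser.newPage();
  await page.goto(config.get('urls.start'));

  const siteFolder = path.join(config.get('paths.storage'), config.get('paths.mainFolder'));
  const pagesLimit = config.get('extra.pagesLimit');

  for (let pageNo = 1; pageNo <= pagesLimit; pageNo++) {
    // one folder per visited page, e.g. ./data/books/page-1/
    const pageFolder = path.join(siteFolder, `${config.get('paths.prefixChapterFolder')}${pageNo}`);
    fs.mkdirSync(pageFolder, { recursive: true });

    // collect the src of every image matched by the configured selector
    const imageUrls = await page.$$eval(config.get('selectors.image'),
      imgs => imgs.map(img => img.src));

    // download each image into the page folder (global fetch needs Node 18+)
    for (const url of imageUrls) {
      const res = await fetch(url);
      const buffer = Buffer.from(await res.arrayBuffer());
      fs.writeFileSync(path.join(pageFolder, path.basename(new URL(url).pathname)), buffer);
    }

    // stop when the next-page selector can no longer be found
    const nextLink = await page.$(config.get('selectors.nextPage'));
    if (!nextLink) break;

    // follow the next-page link and continue
    await Promise.all([
      page.waitForNavigation(),
      page.click(config.get('selectors.nextPage')),
    ]);
  }

  await browser.close();
})();
```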
- Install dependencies:

  ```bash
  npm install
  ```
- Run the app (with default settings):

  ```bash
  npm start
  ```
- Create a custom config in the `config` folder (a `default.json` is provided as an example):
  ```json
  {
    "browser": {
      "headless": false
    },
    "paths": {
      "storage": "./data/",
      "mainFolder": "books/",
      "prefixChapterFolder": "page-"
    },
    "urls": {
      "start": "http://books.toscrape.com/"
    },
    "selectors": {
      "image": ".product_pod img",
      "nextPage": "ul.pager li.next a"
    },
    "extra": {
      "pagesLimit": 100
    }
  }
  ```
  then run the app:

  ```bash
  NODE_ENV=<your-config-name> npm start
  ```
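
For example, the `config` module merges `config/<NODE_ENV>.json` over `default.json`, so a custom config only needs the keys you want to change. The file name, URL, and selectors below are placeholders for illustration:

```json
{
  "urls": {
    "start": "https://example.com/gallery/"
  },
  "selectors": {
    "image": "article img",
    "nextPage": "a.next"
  },
  "paths": {
    "mainFolder": "gallery/"
  }
}
```

Saved as `config/gallery.json`, it would be picked up with:

```bash
NODE_ENV=gallery npm start
```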