DatasetScraper

Tool to create image datasets for machine learning problems by scraping search engines like Google, Bing and Baidu.

Features:

Search engine support: Google, Bing, Baidu. (in-production): Yahoo, Yandex, Duckduckgo
Image format support: jpg, png, svg, gif, jpeg
Fast multiprocessing enabled scraper
Very fast multithreaded downloader
Data verification after download for assertion of image files

Installation

pip install datasetscraper Alternatively, you can clone this repository:

About

Scraper.Scraper()
- creates a scraper object which can be used to scrape URLs from search engines and later download them.
- Scraper.Scraper(logLevel = logging.WARNING, headless=True) are the defaults.
  - headless: If False then you can see the actual pypeeter browser
  - logLevel: Simple python logger level
fetch_urls()
- To fetch image URLs
- fetch_urls(query, engine='google', maxlist=[200])
  - query: Search query
  - engine: search engine or list of search engines
  - maxlist: Maximum number of images to download. If multiple search engines exist then this needs to be a list specifying maximum number of images to fetch from each engine.
download
- To download the fetched URLs into a directory
- download(urls, directory='images/', formats=['jpg', 'png', 'jpeg'], default='jpg', nworkers=10, timeout=30)
  - urls: List of urls
  - directory: Place on disk where downloaded images are stored
  - formats: List of file formats to dwonload
  - nworkers: Number of threads for multithreaded download
  - timeout: maximum time before which the download is cancelled

Usage:

Import from datasetscraper import Scraper
Defaults

obj = Scraper.Scraper()
urls = obj.fetch_urls('kiniro mosaic')
obj.download(urls, directory='kiniro_mosaic/')

Specify a search engine

obj = Scraper.Scraper()
urls = obj.fetch_urls('kiniro mosaic', engine=['google'])
obj.download(urls, directory='kiniro_mosaic/')

Specify a list of search engines

obj = Scraper.Scraper()
urls = obj.fetch_urls('kiniro mosaic', engine=['google', 'bing'])
obj.download(urls, directory='kiniro_mosaic/')

Specify max images (default was 200)

obj = Scraper.Scraper()
urls = obj.fetch_urls('kiniro mosaic', engine=['google', 'bing'], maxlist=[500, 300])
obj.download(urls, directory='kiniro_mosaic/')

FAQs

Why aren't yandex, yahoo, duckduckgo and other search engines supported? They are hard to scrape, I am working on them and will update as soon as I can.
I set maxlist=[500] why are only (x<500) images downloaded? There can be several reasons for this:
- Search ran out: This happens very often, google/bing might not have enough images for your query
- Slow internet: Increase the timeout (default is 60 seconds) as follows: obj.download(urls, directory='kiniro_mosaic/', timeout=100)
How to debug? You can change the logging level while making the scraper object : obj = Scraper(logger.INFO)

TODO:

More search engines
Better debug
Write documentation
Text data? Audio data?

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
DatasetScraper		DatasetScraper
build/lib/datasetscraper		build/lib/datasetscraper
datasetscraper.egg-info		datasetscraper.egg-info
datasetscraper		datasetscraper
dist		dist
LICENSE		LICENSE
README.md		README.md
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DatasetScraper

Features:

Installation

About

Usage:

FAQs

TODO:

Black hole chan kawaii

About

Releases

Packages

Languages

License

vaibkumr/DatasetScraper

Folders and files

Latest commit

History

Repository files navigation

DatasetScraper

Features:

Installation

About

Usage:

FAQs

TODO:

Black hole chan kawaii

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages