GitHub - Libisch/dbs-bagnowka-scrape: Scraping images and image data from www.bagnowka.pl

Scraping "Bagnowka" archive using Scrapy

This script scrapes all photos and their attached data for use by the Beit-Hatfutsot Open Databases. Images are *uploaded to an AWS S3 bucket.

*To enable the scraping, valid access keys must be set in the "bphotos/settings.py" file.
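The fragment below is a minimal sketch of what the relevant part of "bphotos/settings.py" could look like. It assumes the project uses Scrapy's built-in images pipeline with S3 storage, which this README does not confirm. The setting names are standard Scrapy settings, while the bucket name, prefix and thumbnail size are hypothetical placeholders. Scrapy's S3 storage also needs botocore installed.

```python
import os

# Read the keys from the environment so they never end up in version control.
AWS_ACCESS_KEY_ID = os.environ.get("AWS_ACCESS_KEY_ID", "")
AWS_SECRET_ACCESS_KEY = os.environ.get("AWS_SECRET_ACCESS_KEY", "")

# Hypothetical bucket and prefix; replace with the real destination.
IMAGES_STORE = "s3://example-bucket/bagnowka/"

# Assumption: the project relies on Scrapy's stock images pipeline.
ITEM_PIPELINES = {"scrapy.pipelines.images.ImagesPipeline": 1}

# Hypothetical thumbnail size; the pipeline stores one thumbnail per entry.
IMAGES_THUMBS = {"small": (150, 150)}
```

Keeping the keys in environment variables rather than in the file itself is a design choice of this sketch, not something the repository prescribes.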

Usage

1. Scrape

Using the command line, run:

scrapy crawl bphotos -o bphotos.json

After the process is completed, a .json file will be added to the folder, containing all the data for each photo, including URLs for the stored original-size images and thumbnails.
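A quick way to see what the crawl produced is to count how often each field appears across the scraped items. The sketch below relies only on the fact that Scrapy's .json feed export writes a single JSON array of objects; the helper name `field_counts` is made up for illustration, and no particular field names are assumed.

```python
import json
from collections import Counter


def field_counts(path):
    """Return how many scraped items carry each field name."""
    with open(path, encoding="utf-8") as f:
        items = json.load(f)  # the .json feed export is one JSON array
    counts = Counter()
    for item in items:
        counts.update(item.keys())
    return counts
```

Calling `field_counts("bphotos.json")` after the crawl shows which fields are present on every item and which are only filled in for some photos.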

2. Convert scraped photos into valid BH DBS data

Run parsing.py (don't forget to change the name of the input file to match the one produced by the crawler, and make sure the output is valid).
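What "valid" means for BH DBS data is not spelled out here, so the following is only a basic structural check. It assumes that parsing.py writes a JSON array of objects, which this README does not state; `find_problems` is a hypothetical helper, not part of the repository.

```python
import json


def find_problems(path):
    """Return a list of problems found; an empty list means the file passed."""
    try:
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
    except (OSError, ValueError) as exc:
        # ValueError covers both malformed JSON and undecodable bytes.
        return [f"cannot read {path} as JSON: {exc}"]
    if not isinstance(data, list):
        return ["top-level value is not a list of records"]
    problems = []
    for i, record in enumerate(data):
        if not isinstance(record, dict):
            problems.append(f"record {i} is not an object")
        elif not record:
            problems.append(f"record {i} is empty")
    return problems
```

A check of required fields could be added once the target schema is known.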

3. Merge items with identical info

Using the output file from the previous step, run merg_galleries.py (note the spelling of the file name in the repository) and make sure that the output is valid.
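The idea of merging items with identical info can be sketched as grouping records whose fields all match, apart from the ones that are expected to differ. This is an illustration only and not the logic of the repository's merge script; the function name `merge_identical` and the `url` field are hypothetical.

```python
import json


def merge_identical(items, ignore=("url",)):
    """Group items whose fields match, apart from the ones named in `ignore`."""
    merged = {}
    for item in items:
        # Canonical JSON of the shared fields serves as the grouping key.
        shared = {k: v for k, v in item.items() if k not in ignore}
        key = json.dumps(shared, sort_keys=True, ensure_ascii=False)
        if key not in merged:
            # Start a merged record with an empty list per ignored field.
            merged[key] = dict(shared, **{name: [] for name in ignore})
        for name in ignore:
            if name in item:
                merged[key][name].append(item[name])
    return list(merged.values())
```

Each merged record keeps the shared fields once and collects the differing values into a list, so two photos with the same description end up as one record holding both URLs.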

Folders and files

bphotos
.DS_Store
.gitignore
README.md
bagnowka_all.json
merg_galleries.py
parsing.py
scrapy.cfg
