# Example #33 (merged Apr 8, 2024)

Changes shown are from 5 of the 7 merged commits.
**README.md** (177 changes: 69 additions & 108 deletions)
ArtScraper is a tool to download images and metadata for artworks available on
WikiArt (www.wikiart.org/) and Google Arts & Culture
(artsandculture.google.com/).

Functionality:
- `WikiArt` and `Google Arts & Culture`: download images and metadata from a list of artwork URLs
- `Google Arts & Culture`: download all images and metadata on the site, or from specific artists

## 1. Installation and setup

The ArtScraper package can be installed with pip, which automatically installs
the Python dependencies:

```
pip install artscraper
```


## 2. Downloading art from WikiArt

To download data from WikiArt it is necessary to obtain
[API](https://www.wikiart.org/en/App/GetApi) keys. After obtaining them, you
can store them in the file `.wiki_api`, with each key on a new line.
Alternatively, when ArtScraper doesn't detect the file `.wiki_api`, it will
ask for the API keys.
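For illustration only, a `.wiki_api` file could then look like the sketch below; the two values are placeholders for the access key and secret key you obtained from WikiArt:

```
your-wikiart-access-key
your-wikiart-secret-key
```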

An example of fetching data is shown below and in the [notebook](examples/example_artscraper.ipynb).

```python
from artscraper import WikiArtScraper

art_url = "https://www.wikiart.org/en/edvard-munch/anxiety-1894"

with WikiArtScraper(output_dir="data") as scraper:
    scraper.load_link(art_url)
    scraper.save_metadata()
    scraper.save_image()
```

This will store both the image itself and the metadata in separate folders. If
you use ArtScraper in this way, it will skip images and metadata that are
already present. Remove the directory to force a redownload.

Results:

[<img src="https://uploads5.wikiart.org/images/edvard-munch/anxiety-1894.jpg" width="20">](https://www.wikiart.org/en/edvard-munch/anxiety-1894)
## 3. Downloading art from Google Arts & Culture

To download data from GoogleArt it is necessary to install
[Firefox](https://www.mozilla.org/en-US/firefox/new/). The required `geckodriver` is downloaded automatically the first time the code runs.
ArtScraper will open a new Firefox window, navigate to the image, zoom in on it, and take a screenshot. This takes a few seconds. Do not minimize the browser window, and do not let the screensaver come on.

### 3.1 Downloading art from Google Arts & Culture using artwork URLs

An example of fetching data is shown below and in the [notebook](examples/example_artscraper.ipynb).

```python
from artscraper import GoogleArtScraper

art_url = "https://artsandculture.google.com/asset/anxiety-edvard-munch/JgE_nwHHS7wTPw"

with GoogleArtScraper() as scraper:
    scraper.load_link(art_url)
    metadata = scraper.get_metadata()  # or scraper.save_metadata()
    scraper.save_image("data/anxiety_munch.jpg")
    print(metadata)
```

### 3.2 Downloading all art from Google Arts & Culture

See the [example notebook](examples/example_collect_all_artworks.ipynb).
The final structure of the results will be:

- data
    - artist_links.txt (all artists, one URL per line)
    - Artist_1
        - description.txt (description of the artist, from Wikidata)
        - metadata.json (metadata of the artist, from Wikidata)
        - works.txt (all artworks, one URL per line)
        - works
            - work1
                - artwork.png (artwork image)
                - metadata.json (metadata of the artwork, from Google Arts & Culture)
            - work2
                - ...
    - Artist_2
        - ...

A full example (but please check the [example notebook](examples/example_collect_all_artworks.ipynb) for how to add retries):
```python
from artscraper import FindArtworks, GoogleArtScraper, get_artist_links

output_dir = "data"
min_wait_time = 1

# Get links for all artists, as a list. The links are also saved in a file.
artist_urls = get_artist_links(min_wait_time=min_wait_time,
                               output_file=f'{output_dir}/artist_links.txt')

# Find artworks for each artist
for artist_url in artist_urls:
    # Save list of artworks, the description, and metadata for an artist
    with FindArtworks(artist_link=artist_url, output_dir=output_dir,
                      min_wait_time=min_wait_time) as scraper:
        scraper.save_artist_information()
        # Find the artist directory
        artist_dir = output_dir + '/' + scraper.get_artist_name()

    # Scrape this artist's artworks
    with GoogleArtScraper(artist_dir + '/' + 'works', min_wait=min_wait_time) as subscraper:
        # Get list of links to this artist's works
        with open(artist_dir + '/' + 'works.txt', 'r') as file:
            artwork_links = [line.rstrip() for line in file]
        # Download all artworks (slow)
        for url in artwork_links:
            print(f'artwork URL: {url}')
            subscraper.load_link(url)
            subscraper.save_metadata()
            subscraper.save_image()
```

## Troubleshooting

Sometimes the `GoogleArtScraper` returns white images (tested on OS X), which
is most likely due to the screensaver kicking in. Apart from disabling the
screensaver, the following shell command can remove most of the white images
(assuming the data is in `data/output/google_arts`):

```
for F in $(find data/output/google_arts/ -iname painting.png -size -55k); do rm -r $(dirname $F); done
```

Be careful with bash scripts like these and make sure you are in the right
directory.

## Contributing

**artscraper/find_artists.py** (57 changes: 27 additions & 30 deletions)

The updated body of `get_artist_links` (signature elided in the diff) now uses the Firefox driver as a context manager, so the explicit `driver.close()` call is no longer needed:

```python
# Launch Firefox browser
with webdriver.Firefox(service=FirefoxService(GeckoDriverManager().install())) as driver:
    # Get Google Arts & Culture webpage listing all artists
    driver.get(webpage)

    # Get scroll height after first page load
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        # Scroll down to the bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait for the page to load
        time.sleep(random_wait_time(min_wait=min_wait_time))
        # Calculate the new scroll height and compare with the last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

    # Find elements whose href contains an artist link
    elements = driver.find_elements('xpath', '//*[contains(@href,"categoryId=artist")]')

    # List to store artist links
    list_links = []
    # Go through each element containing an artist link
    for element in elements:
        # Extract the link to the artist's webpage
        link = element.get_attribute('href')
        # Remove trailing text
        link = link.replace('?categoryId=artist', '')
        # Append to the list
        list_links.append(link)

if output_file:
    with open(output_file, 'w', encoding='utf-8') as file:
        ...
```
**examples/example_artscraper.ipynb** (86 changes: 38 additions & 48 deletions)

Large diffs are not rendered by default.