# Example #33 (merged Apr 8, 2024)

Changes shown are from 5 of the 7 merged commits.
**README.md** (177 changes: 69 additions & 108 deletions)
ArtScraper is a tool to download images and metadata for artworks available on
WikiArt (www.wikiart.org/) and Google Arts & Culture
(artsandculture.google.com/).

Functionality:
- `WikiArt` and `Google Arts & Culture`: download images and metadata from a list of artwork URLs
- `Google Arts & Culture`: download all images and metadata on the site, or from specific artists

## 1. Installation and setup

The ArtScraper package can be installed with pip, which automatically installs
the Python dependencies:

```
pip install artscraper
```


## 2. Downloading art from WikiArt

To download data from WikiArt it is necessary to obtain
[API](https://www.wikiart.org/en/App/GetApi) keys. After obtaining them, you
can store them in the file `.wiki_api`, with each key on a new line.
Alternatively, when ArtScraper doesn't detect the file `.wiki_api`, it will
ask for the API keys.
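For illustration only, a `.wiki_api` file could then look like the sketch below; the two values are placeholders for the access key and secret key you obtained from WikiArt:

```
your-wikiart-access-key
your-wikiart-secret-key
```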

An example of fetching data is shown below and in the [notebook](examples/example_artscraper.ipynb).

```python
from artscraper import WikiArtScraper

art_url = "https://www.wikiart.org/en/edvard-munch/anxiety-1894"

with WikiArtScraper(output_dir="data") as scraper:
    scraper.load_link(art_url)
    scraper.save_metadata()
    scraper.save_image()
```

This will store both the image itself and the metadata in separate folders. If
you use ArtScraper in this way, it will skip images and metadata that are
already present. Remove the directory to force a redownload.

Results:

[<img src="https://uploads5.wikiart.org/images/edvard-munch/anxiety-1894.jpg" width="20">](https://www.wikiart.org/en/edvard-munch/anxiety-1894)
## 3. Downloading art from Google Arts & Culture

To download data from GoogleArt it is necessary to install
[Firefox](https://www.mozilla.org/en-US/firefox/new/). The required `geckodriver` is downloaded automatically the first time the code runs.
ArtScraper will open a new Firefox window, navigate to the image, zoom in on it, and take a screenshot. This takes a few seconds. Do not minimize the browser window, and do not let the screensaver come on.

### 3.1 Downloading art from Google Arts & Culture using artwork URLs

An example of fetching data is shown below and in the [notebook](examples/example_artscraper.ipynb).

```python
from artscraper import GoogleArtScraper

art_url = "https://artsandculture.google.com/asset/anxiety-edvard-munch/JgE_nwHHS7wTPw"

with GoogleArtScraper() as scraper:
    scraper.load_link(art_url)
    metadata = scraper.get_metadata()  # or scraper.save_metadata()
    scraper.save_image("data/anxiety_munch.jpg")
    print(metadata)
```

### 3.2 Downloading all art from Google Arts & Culture

See the [example notebook](examples/example_collect_all_artworks.ipynb).
The final structure of the results will be:

- data
    - artist_links.txt (all artists, one URL per line)
    - Artist_1
        - description.txt (description of the artist, from Wikidata)
        - metadata.json (metadata of the artist, from Wikidata)
        - works.txt (all artworks, one URL per line)
        - works
            - work1
                - artwork.png (artwork image)
                - metadata.json (metadata of the artwork, from Google Arts & Culture)
            - work2
                - ...
    - Artist_2
        - ...

A full example (but please check the [example notebook](examples/example_collect_all_artworks.ipynb) for how to add retries):
```python
from artscraper import FindArtworks, GoogleArtScraper, get_artist_links

output_dir = "data"
min_wait_time = 1

# Get links for all artists, as a list. The links are also saved in a file.
artist_urls = get_artist_links(min_wait_time=min_wait_time,
                               output_file=f'{output_dir}/artist_links.txt')

# Find artworks for each artist
for artist_url in artist_urls:
    # Save list of artworks, the description, and metadata for an artist
    with FindArtworks(artist_link=artist_url, output_dir=output_dir,
                      min_wait_time=min_wait_time) as scraper:
        scraper.save_artist_information()
        # Find the artist directory
        artist_dir = output_dir + '/' + scraper.get_artist_name()

    # Scrape this artist's artworks
    with GoogleArtScraper(artist_dir + '/' + 'works', min_wait=min_wait_time) as subscraper:
        # Get list of links to this artist's works
        with open(artist_dir + '/' + 'works.txt', 'r') as file:
            artwork_links = [line.rstrip() for line in file]
        # Download all artworks (slow)
        for url in artwork_links:
            print(f'artwork URL: {url}')
            subscraper.load_link(url)
            subscraper.save_metadata()
            subscraper.save_image()
```

## Troubleshooting

Sometimes the `GoogleArtScraper` returns white images (tested on OS X), which
is most likely due to the screensaver kicking in. Apart from disabling the
screensaver, the following shell command can remove most of the white images
(assuming the data is in `data/output/google_arts`):

```
for F in $(find data/output/google_arts/ -iname painting.png -size -55k); do rm -r $(dirname $F); done
```

Be careful with bash scripts like these and make sure you are in the right
directory.

## Contributing

**artscraper/find_artists.py** (57 changes: 27 additions & 30 deletions)

The updated body of `get_artist_links` (signature elided in the diff) now uses the Firefox driver as a context manager, so the explicit `driver.close()` call is no longer needed:

```python
# Launch Firefox browser
with webdriver.Firefox(service=FirefoxService(GeckoDriverManager().install())) as driver:
    # Get Google Arts & Culture webpage listing all artists
    driver.get(webpage)

    # Get scroll height after first page load
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        # Scroll down to the bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait for the page to load
        time.sleep(random_wait_time(min_wait=min_wait_time))
        # Calculate the new scroll height and compare with the last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

    # Find elements whose href contains an artist link
    elements = driver.find_elements('xpath', '//*[contains(@href,"categoryId=artist")]')

    # List to store artist links
    list_links = []
    # Go through each element containing an artist link
    for element in elements:
        # Extract the link to the artist's webpage
        link = element.get_attribute('href')
        # Remove trailing text
        link = link.replace('?categoryId=artist', '')
        # Append to the list
        list_links.append(link)

if output_file:
    with open(output_file, 'w', encoding='utf-8') as file:
        ...
```
**examples/example_artscraper.ipynb** (86 changes: 38 additions & 48 deletions)

Large diffs are not rendered by default.