This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

Extract MediaStorage entity as parent to ImageStore #83

Merged
merged 10 commits into from
Jun 25, 2021

Conversation

obulat
Contributor

@obulat obulat commented Jun 3, 2021

Fixes WordPress/openverse#1739

This PR extracts the data and methods that are common to all media types from the ImageStore class and moves them into a new abstract base class called MediaStore.

MediaStore provides the methods for validating tags, metadata, license information, and the resulting TSV rows, and for writing the buffer to disk. It also declares an abstract method, add_item, which every child class must implement (currently only ImageStore, but AudioStore will be added shortly). This method handles the validation of a single item's metadata.
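To illustrate the shape of this refactoring, here is a minimal sketch of an abstract base class with one concrete child. It is not the code from this PR: the constructor arguments, the helper names (`_validate_tags`, `_append_row`, `commit`), the output file name, and the `\N` null marker are all assumptions made for the example. Only the class names MediaStore and ImageStore and the abstract add_item method come from the description above.

```python
import abc
import json
import os
import tempfile


class MediaStore(abc.ABC):
    """Hypothetical base class holding the logic shared by all media types."""

    def __init__(self, provider, output_dir=None, buffer_length=100):
        self.provider = provider
        self._output_dir = output_dir or tempfile.gettempdir()
        self._buffer_length = buffer_length
        self._buffer = []
        self.total_items = 0

    @abc.abstractmethod
    def add_item(self, **kwargs):
        """Validate one item's metadata and queue it; child classes implement this."""

    def _validate_tags(self, raw_tags):
        # Assumed rule for the sketch: drop empty tags, attach the provider name.
        if not raw_tags:
            return None
        return [{"name": t, "provider": self.provider} for t in raw_tags if t]

    def _append_row(self, fields):
        # Serialize one row; None becomes an assumed null marker.
        cells = []
        for value in fields:
            if value is None:
                cells.append("\\N")
            elif isinstance(value, (list, dict)):
                cells.append(json.dumps(value))
            else:
                cells.append(str(value))
        self._buffer.append("\t".join(cells) + "\n")
        self.total_items += 1
        if len(self._buffer) >= self._buffer_length:
            self.commit()

    def commit(self):
        # Flush buffered rows to a TSV file and return its path.
        path = os.path.join(self._output_dir, f"{self.provider}_sketch.tsv")
        with open(path, "a", encoding="utf-8") as tsv_file:
            tsv_file.writelines(self._buffer)
        self._buffer = []
        return path


class ImageStore(MediaStore):
    """Hypothetical child class; only the image-specific checks live here."""

    def add_item(self, foreign_identifier=None, image_url=None, tags=None):
        # Skip items that lack the fields this sketch treats as required.
        if foreign_identifier is None or image_url is None:
            return self.total_items
        self._append_row(
            [foreign_identifier, image_url, self._validate_tags(tags)]
        )
        return self.total_items
```

With this layout, instantiating MediaStore directly raises a TypeError, and a future AudioStore would only need to supply its own add_item.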

This is the third iteration of adding audio :) This time, in parallel with the API, this PR only handles the MediaStore extraction, so it can be merged before we have a final decision on what Audio metadata we want to save.

All the data fields we currently collect for images from providers can be found in the IMAGE_TSV_COLUMNS list. Here, they are listed in the order they are written to the TSV file, and the fields that are common to all media are in bold:

  • foreign_identifier
  • foreign_landing_url
  • image_url / audio_url (url in database)
  • thumbnail_url
  • width
  • height
  • filesize
  • license_
  • license_version
  • creator
  • creator_url
  • title
  • meta_data
  • tags
  • provider
  • source
  • ingestion_type
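As a rough illustration of why the column order matters, the sketch below serializes one item into a TSV row following the order listed above. The column names are taken from the list; the `COLUMNS_IN_ORDER` constant, the `to_tsv_row` helper, and the `\N` marker for missing values are assumptions for this example and not the actual IMAGE_TSV_COLUMNS definition.

```python
# Column names in the order given in the list above (image variant).
COLUMNS_IN_ORDER = [
    "foreign_identifier",
    "foreign_landing_url",
    "image_url",
    "thumbnail_url",
    "width",
    "height",
    "filesize",
    "license_",
    "license_version",
    "creator",
    "creator_url",
    "title",
    "meta_data",
    "tags",
    "provider",
    "source",
    "ingestion_type",
]

NULL_MARKER = "\\N"  # assumed placeholder for missing values


def to_tsv_row(item):
    """Return one tab-separated line with cells in COLUMNS_IN_ORDER."""
    cells = []
    for column in COLUMNS_IN_ORDER:
        value = item.get(column)
        cells.append(NULL_MARKER if value is None else str(value))
    return "\t".join(cells) + "\n"


row = to_tsv_row({"foreign_identifier": "abc", "provider": "demo"})
```

Every row has the same number of cells regardless of which fields a provider supplied, so a loader can map cells to columns by position.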

There are several ways this PR can be tested:

  1. Build the Docker containers and run all the tests in Docker:
    docker exec cc_catalog_airflow_webserver_1 /usr/local/airflow/.local/bin/pytest

or

  2. Choose one of the provider API scripts, preferably one that doesn't require authentication, such as Cleveland museum or Science museum, and run the script by itself. After it finishes running, you should see the result in a .tsv file inside your /tmp folder. All the collected fields should be written to the .tsv file.

or

  3. In PyCharm, I right-click on the src/cc_catalog_airflow/dags/common folder, and then select Run 'pytest in common' to run the tests only for the common module that was changed in this PR.

@obulat obulat requested review from dhruvkb, krysal and zackkrida June 3, 2021 11:29
@obulat obulat requested a review from a team June 7, 2021 15:08
@zackkrida
Member

FYI, I'm hoping to review this tomorrow 😸

@obulat obulat force-pushed the extract_media_storage branch from b9b2138 to 273ea6a Compare June 17, 2021 12:44
obulat added 9 commits June 21, 2021 17:06
Signed-off-by: Olga Bulat <[email protected]>
# Conflicts:
#	src/cc_catalog_airflow/dags/common/storage/image.py
#	src/cc_catalog_airflow/dags/common/storage/test_image.py
Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
@obulat obulat mentioned this pull request Jun 25, 2021
@obulat obulat merged commit 25e18fa into main Jun 25, 2021
@obulat obulat deleted the extract_media_storage branch June 25, 2021 12:23
@zackkrida zackkrida added the ✨ goal: improvement Improvement to an existing user-facing feature label Aug 12, 2021
Labels
✨ goal: improvement Improvement to an existing user-facing feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] Create a parent MediaStorage entity with common metadata
2 participants