Normalize data models #244

obulat · 2022-05-20T07:56:26Z

Problem

This is a meta issue to track all the data model normalization work across all the repositories.
See all open issues from this meta issue. You can also track the progress using the GitHub Project view.
Some of the data in the database was ingested a long time ago, when we had a different set of required fields. This makes consuming the data difficult because fields that are now marked as required can be missing in the database.
We need to make sure that we have up-to-date data models across the stack, and that the data in the database conforms to them.

Description

To establish trust in our data, we need to make sure that we clearly describe what data we have, and to check that the database actually has all the data outlined. Also, we should remove the duplication of data classification/data cleaning between the Catalog and the API layers.

Here are the specific fields we should normalize:

All media

These fields are common to all media types; however, some of them have NULL values only in images, not in audio.

Document the use of each column and guideline for selection from sources #1410

URL

Consider using url as field name in provider scripts #1409

License URL

Ensure that all media have license_url in meta_data field #1565
Add license_url to meta_data JSONB field in the database in the catalog.
This field can be computed based on the license and license_version fields. We can run a SQL query or a one-off Python script to backfill it.
Remove license_url computation #703
Remove any code in the API that computes license_url after the data has been backfilled in the Catalog database.
Make license_url a non-optional media property #552
Set license_url as a required field in the frontend types.
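As a rough illustration of how license_url could be derived from the license and license_version fields, here is a minimal sketch. It assumes the license slugs are lowercase Creative Commons identifiers such as `by`, `by-nc-sa`, `cc0` and `pdm`, and it ignores jurisdiction-specific ports. The function name `build_license_url` is hypothetical and is not taken from the Catalog or API code.

```python
from typing import Optional

CC_BASE = "https://creativecommons.org"


def build_license_url(license_: str, license_version: Optional[str]) -> Optional[str]:
    """Return a Creative Commons URL for a license slug and version.

    Assumption: slugs look like "by", "by-sa", "cc0" or "pdm".
    Jurisdiction ports (e.g. "by/2.0/de") are deliberately not handled.
    """
    slug = license_.strip().lower()
    # The two public domain tools live under /publicdomain/ and have a fixed version.
    if slug == "cc0":
        return f"{CC_BASE}/publicdomain/zero/1.0/"
    if slug == "pdm":
        return f"{CC_BASE}/publicdomain/mark/1.0/"
    # Regular licenses need a version to form a valid URL.
    if not license_version:
        return None
    return f"{CC_BASE}/licenses/{slug}/{license_version}/"
```

Rows for which the helper returns `None` would need manual review rather than an automatic backfill.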

Watermarked

Set the watermarked property to false in all images where it's NULL #1563
Set the watermarked property to false in all images where it's NULL (in images only)
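The watermarked backfill is a single UPDATE statement. The sketch below runs it against an in-memory SQLite table purely for demonstration; the production database is PostgreSQL, and the simplified two-column schema and the `backfill_watermarked` helper are assumptions made for this example.

```python
import sqlite3


def backfill_watermarked(conn: sqlite3.Connection) -> int:
    """Set watermarked to FALSE wherever it is NULL; return the number of rows changed."""
    cur = conn.execute(
        "UPDATE image SET watermarked = FALSE WHERE watermarked IS NULL"
    )
    conn.commit()
    return cur.rowcount


# Demonstration on a throwaway table (hypothetical, simplified schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE image (identifier TEXT PRIMARY KEY, watermarked BOOLEAN)")
conn.executemany(
    "INSERT INTO image VALUES (?, ?)",
    [("a", None), ("b", True), ("c", None)],
)
updated = backfill_watermarked(conn)
remaining = conn.execute(
    "SELECT COUNT(*) FROM image WHERE watermarked IS NULL"
).fetchone()[0]
```

On a table the size of the production image table, the update would likely need to run in batches to avoid holding a long transaction.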

Last synced with source

Backfill the last_synced_with_source field in the database #1562
Set last_synced_with_source to the value of updated_on, if available, or to created_on (in images only)
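The fallback rule above (updated_on if available, otherwise created_on) maps directly onto SQL's COALESCE. The following sketch shows it on an in-memory SQLite table with a simplified, assumed schema in which timestamps are stored as text; the real table is in PostgreSQL and has many more columns.

```python
import sqlite3

# Only rows that are missing last_synced_with_source are touched.
BACKFILL_SQL = """
UPDATE image
SET last_synced_with_source = COALESCE(updated_on, created_on)
WHERE last_synced_with_source IS NULL
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE image ("
    "identifier TEXT PRIMARY KEY, created_on TEXT, updated_on TEXT, "
    "last_synced_with_source TEXT)"
)
conn.executemany(
    "INSERT INTO image VALUES (?, ?, ?, ?)",
    [
        ("a", "2020-01-01", "2021-06-01", None),          # falls back to updated_on
        ("b", "2020-02-01", None, None),                  # falls back to created_on
        ("c", "2020-03-01", "2021-01-01", "2022-01-01"),  # already set, left alone
    ],
)
conn.execute(BACKFILL_SQL)
conn.commit()
result = dict(conn.execute("SELECT identifier, last_synced_with_source FROM image"))
```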

Mature (new column)

Collect sensitive content flag data from providers #1754
Save the mature (sensitive content) information supplied by the source provider.

Description (new column)

Not sure if we have descriptions for each media item #1656

Image

Thumbnail

Remove thumbnail field for images from the catalog #1561
Remove the image thumbnail field from the catalog and from the provider scripts because we do not use provider thumbnails (we use the imaginary proxy server for image thumbnails instead).

Filetype

563 004 660 images have no filetype value (NULL).

Add filetype to all images in the catalog DB #1560
Find a way to backfill the image filetype values. It might be possible to compute them from the filename or the URL extension.
Remove the code for computing filetype (extension) #702
Remove filetype computation from API and from ES index creation.
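One possible way to compute the filetype from the URL extension is sketched below. The set of accepted extensions and the alias mapping (for example `jpeg` to `jpg`) are assumptions for illustration, and `filetype_from_url` is a hypothetical name, not the function currently used in the API.

```python
import posixpath
from typing import Optional
from urllib.parse import urlparse

# Assumed list of extensions worth trusting; anything else is treated as unknown.
KNOWN_EXTENSIONS = {"jpg", "jpeg", "png", "gif", "svg", "webp", "tif", "tiff", "bmp"}
# Assumed normalization of equivalent spellings.
ALIASES = {"jpeg": "jpg", "tif": "tiff"}


def filetype_from_url(url: str) -> Optional[str]:
    """Guess the filetype from the extension in the URL path, or return None."""
    # Use only the path so query strings and fragments do not leak into the extension.
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lstrip(".").lower()
    if extension not in KNOWN_EXTENSIONS:
        return None
    return ALIASES.get(extension, extension)
```

URLs without a recognizable extension return `None`, so those records would still need another source for the filetype, such as the provider's metadata or the response's content type.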

Width & height

12 571 694 images have no width and height values (NULL).

Document the recommended image size to choose from providers #1551
Prefer the original or largest size available
Add width and height to all images in the catalog database #1559
Check which provider scripts do not set width and height and update them.
Backfill width and height values for images that don't have them (probably in the same process as the filesize update)
Re-index the images after width and height are added to all images in the catalog #701
Re-index all images to make sure that the width and height values are returned. This will also improve the size and aspect ratio filters.
Use the image size information provided by the API instead of making head requests #551
Remove width/height computation code from the frontend.
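For images whose providers do not report dimensions, the width and height can sometimes be read from the first bytes of the file without downloading all of it. The sketch below covers PNG files only, where the dimensions are stored in the IHDR chunk right after the 8-byte signature. Other formats need different parsing, and how the header bytes are fetched is left out. The helper name `png_dimensions` is hypothetical.

```python
import struct
from typing import Optional, Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(header: bytes) -> Optional[Tuple[int, int]]:
    """Return (width, height) from the first 24 bytes of a PNG file, or None."""
    # Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type ("IHDR"),
    # then width and height as big-endian unsigned 32-bit integers.
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        return None
    width, height = struct.unpack(">II", header[16:24])
    return width, height


# A hand-built header for a 640x480 image, used only to exercise the parser.
sample_header = PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR" + struct.pack(">II", 640, 480)
dimensions = png_dimensions(sample_header)
```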

Filesize

561 894 897 images have no filesize value (NULL).

https://github.com/WordPress/openverse-catalog/issues/522
Check which provider scripts do not set filesize and update them.
Backfill filesize values for images that don't have them (probably in the same process as the width and height update)

Improvements

More investigation needed

Implementation Plan: Investigate the use of alembic for openledger migrations #1836
Record how many upstream images are above 5 MB and would cause thumbnails to more likely time out #1451
Investigate why images don't have titles in the database #1556
There are 1 096 025 images that don't have a title. We should try to understand why those images don't have titles and add the titles if it is possible.
WordPress/openverse-catalog#782

Additional context

Updates that can be done with existing data:

tags
license_url
watermarked
last_synced_with_source

Updates that will require additional fetching from providers:

filetype
filesize
width
height

Message from @AetherUnbound with details from the database (from the Public Slack discussion):
Here is a count of NULL values for every field that doesn't have a NOT NULL constraint. Unfortunately, this doesn't give us information on license_url, if that's supposed to come from the meta_data field.

deploy@localhost:openledger> SELECT
 COUNT(*) as total,
 COUNT(*) FILTER (WHERE ingestion_type IS NULL) as ingestion_type,
 COUNT(*) FILTER (WHERE provider IS NULL) as provider,
 COUNT(*) FILTER (WHERE source IS NULL) as source,
 COUNT(*) FILTER (WHERE foreign_identifier IS NULL) as foreign_identifier,
 COUNT(*) FILTER (WHERE foreign_landing_url IS NULL) as foreign_landing_url,
 COUNT(*) FILTER (WHERE thumbnail IS NULL) as thumbnail,
 COUNT(*) FILTER (WHERE filetype IS NULL) as filetype,
 COUNT(*) FILTER (WHERE duration IS NULL) as duration,
 COUNT(*) FILTER (WHERE bit_rate IS NULL) as bit_rate,
 COUNT(*) FILTER (WHERE sample_rate IS NULL) as sample_rate,
 COUNT(*) FILTER (WHERE category IS NULL) as category,
 COUNT(*) FILTER (WHERE genres IS NULL) as genres,
 COUNT(*) FILTER (WHERE audio_set IS NULL) as audio_set,
 COUNT(*) FILTER (WHERE set_position IS NULL) as set_position,
 COUNT(*) FILTER (WHERE alt_files IS NULL) as alt_files,
 COUNT(*) FILTER (WHERE filesize IS NULL) as filesize,
 COUNT(*) FILTER (WHERE license_version IS NULL) as license_version,
 COUNT(*) FILTER (WHERE creator IS NULL) as creator,
 COUNT(*) FILTER (WHERE creator_url IS NULL) as creator_url,
 COUNT(*) FILTER (WHERE title IS NULL) as title,
 COUNT(*) FILTER (WHERE meta_data IS NULL) as meta_data,
 COUNT(*) FILTER (WHERE tags IS NULL) as tags,
 COUNT(*) FILTER (WHERE watermarked IS NULL) as watermarked,
 COUNT(*) FILTER (WHERE last_synced_with_source IS NULL) as last_synced_with_source
 FROM audio;
-[ RECORD 1 ]-------------------------
total                   | 175858
ingestion_type          | 0
provider                | 0
source                  | 0
foreign_identifier      | 0
foreign_landing_url     | 0
thumbnail               | 86720
filetype                | 0
duration                | 0
bit_rate                | 89223
sample_rate             | 149241
category                | 13844
genres                  | 86720
audio_set               | 34914
set_position            | 86720
alt_files               | 115789
filesize                | 89138
license_version         | 0
creator                 | 10
creator_url             | 118
title                   | 0
meta_data               | 0
tags                    | 30092
watermarked             | 0
last_synced_with_source | 0
SELECT 1
Time: 0.149s


deploy@localhost:openledger> SELECT
 COUNT(*) as total,
 COUNT(*) FILTER (WHERE ingestion_type IS NULL) as ingestion_type,
 COUNT(*) FILTER (WHERE provider IS NULL) as provider,
 COUNT(*) FILTER (WHERE source IS NULL) as source,
 COUNT(*) FILTER (WHERE foreign_identifier IS NULL) as foreign_identifier,
 COUNT(*) FILTER (WHERE foreign_landing_url IS NULL) as foreign_landing_url,
 COUNT(*) FILTER (WHERE thumbnail IS NULL) as thumbnail,
 COUNT(*) FILTER (WHERE width IS NULL) as width,
 COUNT(*) FILTER (WHERE height IS NULL) as height,
 COUNT(*) FILTER (WHERE filesize IS NULL) as filesize,
 COUNT(*) FILTER (WHERE license_version IS NULL) as license_version,
 COUNT(*) FILTER (WHERE creator IS NULL) as creator,
 COUNT(*) FILTER (WHERE creator_url IS NULL) as creator_url,
 COUNT(*) FILTER (WHERE title IS NULL) as title,
 COUNT(*) FILTER (WHERE meta_data IS NULL) as meta_data,
 COUNT(*) FILTER (WHERE tags IS NULL) as tags,
 COUNT(*) FILTER (WHERE watermarked IS NULL) as watermarked,
 COUNT(*) FILTER (WHERE last_synced_with_source IS NULL) as last_synced_with_source,
 COUNT(*) FILTER (WHERE filetype IS NULL) as filetype,
 COUNT(*) FILTER (WHERE category IS NULL) as category
 FROM image;
-[ RECORD 1 ]-------------------------
total                   | 563667181
ingestion_type          | 0
provider                | 0
source                  | 0
foreign_identifier      | 0
foreign_landing_url     | 1
thumbnail               | 57584529
width                   | 12571694
height                  | 12571694
filesize                | 561894897
license_version         | 0
creator                 | 4459805
creator_url             | 22751618
title                   | 1096025
meta_data               | 366974
tags                    | 243751835
watermarked             | 1105608
last_synced_with_source | 554237
filetype                | 563004660
category                | 563622992
SELECT 1
Time: 2480.183s (41 minutes 20 seconds), executed in: 2480.182s (41 minutes 20 seconds)
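The two queries above differ only in the table name and the column list, so they could be generated rather than written by hand when they need to be re-run. The helper below is a minimal sketch; `null_count_query` is a hypothetical name, and the generated SQL relies on the aggregate FILTER clause, which PostgreSQL supports from version 9.4.

```python
from typing import Iterable


def null_count_query(table: str, columns: Iterable[str]) -> str:
    """Build a query that counts all rows and the NULLs in each given column."""
    column_list = list(columns)
    # Identifiers are interpolated into the SQL text, so accept only plain names.
    for name in [table, *column_list]:
        if not name.isidentifier():
            raise ValueError(f"Unsafe identifier: {name!r}")
    filters = [
        f"  COUNT(*) FILTER (WHERE {column} IS NULL) AS {column}"
        for column in column_list
    ]
    select_list = ",\n".join(["  COUNT(*) AS total", *filters])
    return f"SELECT\n{select_list}\nFROM {table};"


query = null_count_query("image", ["width", "height", "filesize"])
```

Given the 41-minute runtime reported for the image table, it may be worth running such a query against a replica or a sample rather than the primary database.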

AetherUnbound · 2022-06-14T18:23:07Z

A note on data refreshes & normalization that @obulat brought up: We should continue performing full data refreshes in dev until we are confident in our data normalization. Until we get everything normalized, we may continue to find issues in production that can't be replicated in staging unless we refresh the catalog in its entirety.

I've also made https://github.com/WordPress/openverse-infrastructure/issues/120 to track this

obulat added 🟧 priority: high Stalls work on the project or its dependents ✨ goal: improvement Improvement to an existing user-facing feature labels May 20, 2022

obulat added the 💻 aspect: code Concerns the software code in the repository label May 20, 2022

This was referenced Apr 17, 2023

Set the watermarked property to false in all images where it's NULL #1563

Open

Backfill the last_synced_with_source field in the database #1562

Open

obulat added data normalization 🟨 priority: medium Not blocking but should be addressed soon and removed 🟧 priority: high Stalls work on the project or its dependents labels May 20, 2022

obulat added this to Openverse Data Normalization Jun 15, 2022

obulat moved this to In Progress in Openverse Data Normalization Jun 15, 2022

obulat self-assigned this Aug 24, 2022

rwidom mentioned this issue Oct 11, 2022

iNaturalist in-SQL loading WordPress/openverse-catalog#745

Merged

This was referenced Feb 16, 2023

Clearly document all media properties #412

Closed

Data normalization #430

Closed

dhruvkb pushed a commit that referenced this issue Feb 20, 2023

Merge pull request #244 from WordPress/namespaced_store

8e9b7cd

Make most of the Vuex store modules namespaced

dhruvkb added a commit that referenced this issue Feb 20, 2023

Merge pull request #244 from WordPress/improv_just

fbf3fbd

Organise and document `justfile`

obulat added the 🧱 stack: frontend Related to the Nuxt frontend label Feb 22, 2023

obulat added this to Openverse Backlog Feb 23, 2023

github-project-automation bot moved this to 📋 Backlog in Openverse Backlog Feb 23, 2023

obulat mentioned this issue May 29, 2023

Add a script to generate the media_properties.md #2205

Closed

8 tasks

dhruvkb added this to the Data normalization milestone Dec 2, 2023

dhruvkb removed the data normalization label Dec 2, 2023

obulat removed their assignment Mar 19, 2024
