Normalize data models #244
Labels
💻 aspect: code
Concerns the software code in the repository
✨ goal: improvement
Improvement to an existing user-facing feature
🟨 priority: medium
Not blocking but should be addressed soon
🧱 stack: frontend
Related to the Nuxt frontend
Problem
This is a meta issue to track all the data model normalization work across all the repositories.
All open issues are tracked from this meta issue. You can also follow the progress in the GitHub Project view.
Some of the data in the database was ingested a long time ago, when we had a different set of required fields. This makes the data difficult to consume, because fields that are now marked as required can be missing in the database.
We need to make sure that we have up-to-date data models across the stack, and that the data in the database conforms to them.
Description
To establish trust in our data, we need to make sure that we clearly describe what data we have, and to check that the database actually has all the data outlined. Also, we should remove the duplication of data classification/data cleaning between the Catalog and the API layers.
Here are the specific fields we should normalize:
All media
These fields are common to all media; however, some of them have `NULL` values only in images, not in audio.
URL
- `url` as field name in provider scripts #1409
License URL
- `license_url` in `meta_data` field #1565
  Add `license_url` to the `meta_data` JSONB field in the catalog database. This field can be computed from the `license` and `license_version` fields. We can run a SQL query or a one-off Python script to backfill it.
- `license_url` computation #703
  Remove any code in the API that computes `license_url` after the data has been backfilled in the Catalog database.
- `license_url` a non-optional media property #552
  Set `license_url` as a required field in the frontend types.
Watermarked
- `watermarked` property to false in all images where it's `NULL` #1563
  Set the `watermarked` property to `false` in all images where it's `NULL` (in images only).
Last synced with source
- `last_synced_with_source` field in the database #1562
  Set `last_synced_with_source` to the value of `updated_on`, if available, or to `created_on` (in images only).
Mature (new column)
Save mature info from the origin
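The `license_url`, `watermarked`, and `last_synced_with_source` backfills described in the sections above use only data that is already in each row, so they can be expressed as a pure function over a record. The following is a minimal sketch, not the catalog's actual code. It assumes that a row is available as a Python dict, that `license` holds a lowercase Creative Commons slug such as `by` or `by-nc-sa`, that `cc0` and `pdm` always map to their 1.0 deeds, and that jurisdiction-specific (ported) license URLs can be ignored. The helper names `build_license_url` and `backfill_defaults` are hypothetical.

```python
from datetime import datetime
from typing import Optional

CC_BASE = "https://creativecommons.org"


def build_license_url(license_slug: str, license_version: Optional[str]) -> str:
    """Build a Creative Commons deed URL from a license slug and version.

    Assumption: ported (jurisdiction-specific) licenses are not handled here.
    """
    slug = license_slug.lower()
    if slug == "cc0":
        return f"{CC_BASE}/publicdomain/zero/1.0/"
    if slug == "pdm":
        return f"{CC_BASE}/publicdomain/mark/1.0/"
    if not license_version:
        raise ValueError(f"license_version is required for license {slug!r}")
    return f"{CC_BASE}/licenses/{slug}/{license_version}/"


def backfill_defaults(row: dict) -> dict:
    """Return a copy of a row with the backfillable NULL fields filled in."""
    fixed = dict(row)

    # watermarked: NULL becomes false.
    if fixed.get("watermarked") is None:
        fixed["watermarked"] = False

    # last_synced_with_source: prefer updated_on, fall back to created_on.
    if fixed.get("last_synced_with_source") is None:
        fixed["last_synced_with_source"] = (
            fixed.get("updated_on") or fixed.get("created_on")
        )

    # license_url lives inside the meta_data JSONB field.
    meta = dict(fixed.get("meta_data") or {})
    if not meta.get("license_url"):
        meta["license_url"] = build_license_url(
            fixed["license"], fixed.get("license_version")
        )
    fixed["meta_data"] = meta
    return fixed


if __name__ == "__main__":
    example = {
        "license": "by",
        "license_version": "4.0",
        "watermarked": None,
        "last_synced_with_source": None,
        "updated_on": None,
        "created_on": datetime(2020, 1, 1),
        "meta_data": None,
    }
    print(backfill_defaults(example))
```

The same rules could instead be written as SQL `UPDATE` statements. A Python function is used here only because it is easy to test before running against the real table.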
Description (new column)
- `descriptions` for each media item #1656
Image
Thumbnail
- `thumbnail` field for images from the catalog #1561
  Remove the image thumbnail field from the catalog and from the provider scripts because we do not use provider thumbnails (we use the imaginary proxy server for image thumbnails instead).
Filetype
563 004 660 images
- `filetype` to all images in the catalog DB #1560
  Find a way of backfilling the image filetype values. It might be possible to compute them from the filename or the URL extension.
- Remove filetype computation from API and from ES index creation.
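As an illustration of the "compute it from the URL extension" idea, here is a minimal sketch. The set of accepted file types and the alias table (`jpeg` to `jpg`, `tiff` to `tif`) are assumptions made for the example, and `guess_filetype` is a hypothetical helper name. URLs with no recognizable extension return `None`, so those rows would still need another source, such as fetching from the provider.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import unquote, urlparse

# Assumption: the file types we are willing to store, and their aliases.
KNOWN_FILETYPES = {"jpg", "png", "gif", "svg", "tif", "webp", "bmp"}
ALIASES = {"jpeg": "jpg", "tiff": "tif"}


def guess_filetype(url: str) -> Optional[str]:
    """Guess an image file type from the extension in the URL path.

    The query string and fragment are ignored; unknown or missing
    extensions yield None rather than a guess.
    """
    path = unquote(urlparse(url).path)
    suffix = PurePosixPath(path).suffix.lower().lstrip(".")
    suffix = ALIASES.get(suffix, suffix)
    return suffix if suffix in KNOWN_FILETYPES else None


if __name__ == "__main__":
    for candidate in (
        "https://example.com/photos/cat.JPEG?size=large",
        "https://example.com/photos/12345",
        "https://example.com/image.php?id=3",
    ):
        print(candidate, "->", guess_filetype(candidate))
```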
Category
563 622 992 images
- Investigate if we should run a one-off categorization script for images in the Catalog to backfill.
- Remove the categorization code from the API.
Width & height
12 571 694 images
Prefer the original or largest size available.
- `width` and `height` to all images in the catalog database #1559
  Check which provider scripts do not set `width` and `height` and update them.
  Backfill `width` and `height` values for images that don't have them (probably in the same process as the `filesize` update).
- `width` and `height` are added to all images in the catalog #701
  Re-index all images to make sure that the `width` and `height` values are returned. This will also improve the size and aspect ratio filters.
- Remove width/height computation code from the frontend.
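To show why missing `width` and `height` values hurt the aspect ratio filter, here is a small sketch of a classifier that can only work when both dimensions are present. The category names `tall`, `wide`, and `square` and the function name `aspect_ratio` are assumptions for the example, not necessarily the values the index uses.

```python
from typing import Optional


def aspect_ratio(width: Optional[int], height: Optional[int]) -> Optional[str]:
    """Classify an image by aspect ratio, or return None if it cannot be done.

    Rows with NULL (or zero) dimensions cannot be classified, so they would
    be invisible to an aspect ratio filter until they are backfilled.
    """
    if not width or not height:
        return None
    if width == height:
        return "square"
    return "wide" if width > height else "tall"


if __name__ == "__main__":
    rows = [(1920, 1080), (1080, 1920), (500, 500), (None, 500)]
    for w, h in rows:
        print((w, h), "->", aspect_ratio(w, h))
```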
Filesize
561 894 897 images
- Check which provider scripts do not set `filesize` and update them.
- Backfill `filesize` values for images that don't have them (probably in the same process as the `width` and `height` update).
Tags
- `tags` field for images #1557
  Ensure that all the tags have been cleaned in the database (including denylisted tags, duplicate tags, and tags with accuracy lower than 90).
- URL cleanup process from the ingestion server #700
  Remove the tag cleaning step from the ingestion server.
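The three tag cleaning rules named above (denylist, duplicates, low accuracy) can be sketched as a single pass over a tag list. This is an illustration under stated assumptions, not the ingestion server's implementation: tags are assumed to be dicts with a `name` and an optional `accuracy`, accuracy is assumed to be on a 0 to 1 scale (so "lower than 90" becomes `< 0.9`), duplicates are matched case-insensitively by name, and the denylist entries are placeholders.

```python
from typing import Iterable, List

# Assumption: placeholder entries; the real denylist lives elsewhere.
TAG_DENYLIST = {"no person", "squareformat"}
# Assumption: accuracy is stored as a fraction, so 90 corresponds to 0.9.
MIN_ACCURACY = 0.9


def clean_tags(tags: Iterable[dict]) -> List[dict]:
    """Drop denylisted, low-accuracy, and duplicate tags, keeping order."""
    seen = set()
    cleaned = []
    for tag in tags:
        name = (tag.get("name") or "").strip()
        key = name.lower()
        if not name or key in TAG_DENYLIST:
            continue
        accuracy = tag.get("accuracy")
        # Tags without an accuracy value are kept.
        if accuracy is not None and accuracy < MIN_ACCURACY:
            continue
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({**tag, "name": name})
    return cleaned


if __name__ == "__main__":
    raw = [
        {"name": "Cat", "accuracy": 0.95},
        {"name": "cat"},
        {"name": "dog", "accuracy": 0.5},
        {"name": "no person"},
        {"name": " tree "},
    ]
    print(clean_tags(raw))
```

Running the cleaning once in the catalog, as the issue proposes, means that neither the ingestion server nor the API has to repeat these checks.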
- `tags_list` field #704
Improvements
More investigation needed
There are 1 096 025 images that don't have a title. We should try to understand why those images don't have titles and add the titles if it is possible.
Additional context
Updates that can be done with existing data:
Updates that will require additional fetching from providers:
Message from @AetherUnbound with details from the database (from the Public Slack discussion):
Here is a count of `NULL` values for all fields that don't have a `NOT NULL` constraint. Unfortunately, this doesn't give us information on `license_url`, if that's supposed to come from the `meta_data` field.
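A per-column `NULL` count like the one described in this message can be produced with a single query, because `COUNT(column)` skips `NULL` values while `COUNT(*)` does not. The sketch below is not the query that was actually run. It uses an in-memory SQLite database as a stand-in for the catalog database, and the table name, column names, and the helpers `build_null_count_query` and `count_nulls` are all hypothetical. Table and column names are interpolated into the SQL, so they must come from trusted code, never from user input.

```python
import sqlite3
from typing import Dict, Iterable


def build_null_count_query(table: str, columns: Iterable[str]) -> str:
    """Build one SELECT that returns the number of NULLs in each column."""
    selects = ", ".join(
        f'COUNT(*) - COUNT("{col}") AS "{col}"' for col in columns
    )
    return f'SELECT {selects} FROM "{table}"'


def count_nulls(conn, table: str, columns: Iterable[str]) -> Dict[str, int]:
    """Run the NULL-count query and return a {column: null_count} mapping."""
    columns = list(columns)
    row = conn.execute(build_null_count_query(table, columns)).fetchone()
    return dict(zip(columns, row))


if __name__ == "__main__":
    # Stand-in data: three image rows with some missing values.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE image (identifier TEXT NOT NULL, width INTEGER, filetype TEXT)"
    )
    conn.executemany(
        "INSERT INTO image VALUES (?, ?, ?)",
        [("a", 100, None), ("b", None, None), ("c", 300, "jpg")],
    )
    print(count_nulls(conn, "image", ["width", "filetype"]))
```

This approach does not cover values nested inside `meta_data`, which is the `license_url` limitation noted above. Counting those would need a JSON-aware expression instead.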