Add filetype to all images in the catalog DB #1560

Open
4 tasks
obulat opened this issue May 20, 2022 · 0 comments
Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs

Comments


obulat commented May 20, 2022

Current Situation

There are currently 563,004,660 images in the database without a filetype.
We need to add filetype information to all of them.

Suggested Improvement

There are several things that need to be done here:

  • Audit the provider scripts for filetype extraction, and add any missing extraction steps.
  • Identify the providers that don't have filetype data in the database.
  • Create a DAG for re-downloading the data from each of the providers from the previous step, and run it to refresh the data. This will probably need to be coordinated with other data collection (filesize, width, height).
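
As a starting point for the second step above, a per-provider count of rows with no filetype would show which providers need a refresh. This is only a minimal sketch: it assumes the rows are available as `(provider, filetype)` pairs, and the function name is hypothetical. In practice this would more likely be an aggregate query against the catalog's image table.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple


def providers_missing_filetype(
    rows: Iterable[Tuple[str, Optional[str]]],
) -> Dict[str, int]:
    """Count, per provider, the rows whose filetype is missing.

    A filetype counts as missing when it is None or an empty string.
    The (provider, filetype) row shape is an assumption for illustration.
    """
    missing: Counter = Counter()
    for provider, filetype in rows:
        if not filetype:
            missing[provider] += 1
    return dict(missing)
```

Providers that never appear in the result already have a filetype on every row and can be skipped when scheduling the refresh DAG.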

Benefit

Data consistency.

Additional context

We are currently also extracting extensions during the Elasticsearch indexing:
https://github.com/WordPress/openverse-api/blob/2e85caf7aede8aaf9d77cd5cb050f50b860ee58e/ingestion_server/ingestion_server/elasticsearch_models.py#L135-L144
This extraction should instead be done entirely in the catalog.
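
For illustration, extension extraction from an image URL on the catalog side might look something like the sketch below. It is not the linked ingestion server code; the function name is hypothetical, and it assumes the extension can be read from the URL path alone, which will not hold for every provider.

```python
import posixpath
from typing import Optional
from urllib.parse import urlparse


def extension_from_url(url: str) -> Optional[str]:
    """Return the lowercased file extension from a URL path, or None.

    Query strings and fragments are ignored because only the parsed
    path component is inspected.
    """
    path = urlparse(url).path
    _, ext = posixpath.splitext(path)
    ext = ext.lstrip(".").lower()
    return ext or None
```

URLs without an extension in the path return `None`, so those rows would still need the filetype from the provider API or from the downloaded file.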

When extracting filetypes in the catalog, we should also make sure that we don't create duplicate filetypes for equivalent extensions such as jpg/jpeg.
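
One possible way to avoid such duplicates is to map known aliases to a single canonical value before saving. A rough sketch, where the alias table, the choice of `jpg` as the canonical form, and the function name are all assumptions rather than existing catalog code:

```python
from typing import Optional

# Hypothetical alias table: maps equivalent extensions to one canonical
# value. Which spelling is canonical is an arbitrary choice here.
FILETYPE_ALIASES = {
    "jpeg": "jpg",
    "tif": "tiff",
}


def normalize_filetype(filetype: Optional[str]) -> Optional[str]:
    """Lowercase a filetype, drop a leading dot, and collapse aliases."""
    if not filetype:
        return None
    cleaned = filetype.strip().lstrip(".").lower()
    if not cleaned:
        return None
    return FILETYPE_ALIASES.get(cleaned, cleaned)
```

Applying the same function in every provider script and in the refresh DAG would keep `jpg` and `jpeg` from ending up as two separate filetypes.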

Implementation

  • 🙋 I would be interested in implementing this feature.
@obulat obulat added 🟨 priority: medium Not blocking but should be addressed soon 💻 aspect: code Concerns the software code in the repository 🧰 goal: internal improvement Improvement that benefits maintainers, not users data normalization labels May 20, 2022
@obulat obulat changed the title Add filetype to all images in the catalog DB Add filetype to all images in the catalog DB May 20, 2022
@obulat obulat added ✨ goal: improvement Improvement to an existing user-facing feature and removed 🧰 goal: internal improvement Improvement that benefits maintainers, not users labels May 20, 2022
@obulat obulat mentioned this issue Apr 17, 2023
6 tasks
@obulat obulat added the 🧱 stack: catalog Related to the catalog and Airflow DAGs label Feb 24, 2023
@github-project-automation github-project-automation bot moved this to 📋 Backlog in Openverse Backlog Apr 17, 2023
@obulat obulat transferred this issue from WordPress/openverse-catalog Apr 17, 2023
@dhruvkb dhruvkb added this to the Data normalization milestone Dec 2, 2023
Projects
Status: 📋 Backlog
Development

No branches or pull requests

2 participants