Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set the watermarked property to false in all images where it's NULL #1563

Open
1 task
obulat opened this issue May 20, 2022 · 2 comments
Open
1 task

Set the watermarked property to false in all images where it's NULL #1563

obulat opened this issue May 20, 2022 · 2 comments
Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature 🟩 priority: low Low priority and doesn't need to be rushed 🧱 stack: catalog Related to the catalog and Airflow DAGs 🧹 status: ticket work required Needs more details before it can be worked on

Comments

@obulat
Copy link
Contributor

obulat commented May 20, 2022

Current Situation

Currently, there are 1 105 608 images that have watermarked property set to NULL.

Suggested Improvement

It should be safe to assume that these images are not watermarked, and set the value to false.

Benefit

Data consistency.

Additional context

Part of #244

Implementation

  • 🙋 I would be interested in implementing this feature.
@obulat obulat added 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work data normalization 🟩 priority: low Low priority and doesn't need to be rushed 💻 aspect: code Concerns the software code in the repository 🧰 goal: internal improvement Improvement that benefits maintainers, not users and removed 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work labels May 20, 2022
@obulat obulat mentioned this issue May 20, 2022
29 tasks
@obulat obulat changed the title Set the watermarked property to false in all images where it's NULL Set the watermarked property to false in all images where it's NULL May 20, 2022
@obulat obulat added ✨ goal: improvement Improvement to an existing user-facing feature and removed 🧰 goal: internal improvement Improvement that benefits maintainers, not users labels May 20, 2022
@krysal
Copy link
Member

krysal commented Sep 30, 2022

I wonder what is the meaning of this column and why it should be considered false for all those images 🤔
Can you elaborate on the reasoning, Olga? Do we have providers that return watermarked images?

Also, I think we would have to modify this for each provider script instead of doing a one-time UPDATE in the database, or the refresh process will convert the column to null again.

@krysal krysal added the 🧹 status: ticket work required Needs more details before it can be worked on label Sep 30, 2022
@obulat obulat added the 🧱 stack: catalog Related to the catalog and Airflow DAGs label Feb 24, 2023
@obulat
Copy link
Contributor Author

obulat commented Mar 1, 2023

I wonder what is the meaning of this column and why it should be considered false for all those images 🤔 Can you elaborate on the reasoning, Olga? Do we have providers that return watermarked images?

Sorry, I haven't noticed your reply until today, @krysal.
The reasoning for this issue was that by default, we should probably consider images to be not watermarked and only set watermarked to true if we know that the provider returns watermarked images.

There were some providers who set the watermarked to true in the CommonCrawl days: IHA, McCord Museum, FloraOn, but now no providers set watermarked value, and so it falls back to false: https://github.com/WordPress/openverse-catalog/blob/9cd0b97b844c7442a00e8c056fc7246e83db06d7/openverse_catalog/dags/common/storage/image.py#L61

Because of this fallback, currently, we cannot be sure what false or null in the watermarked column means: not watermarked or not known. So, I think it makes sense to simplify and only use false here.

Also, I think we would have to modify this for each provider script instead of doing a one-time UPDATE in the database, or the refresh process will convert the column to null again.

I was thinking of running the UPDATE in the catalog (upstream, the source-of-truth) database, which is not touched by the refresh process, so the values will not be re-set to null. The data would be updated during the re-ingestion, but some providers are not re-ingested, so the data would never be updated.

@github-project-automation github-project-automation bot moved this to 📋 Backlog in Openverse Backlog Apr 17, 2023
@obulat obulat transferred this issue from WordPress/openverse-catalog Apr 17, 2023
@dhruvkb dhruvkb added this to the Data normalization milestone Dec 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature 🟩 priority: low Low priority and doesn't need to be rushed 🧱 stack: catalog Related to the catalog and Airflow DAGs 🧹 status: ticket work required Needs more details before it can be worked on
Projects
Status: 📋 Backlog
Development

No branches or pull requests

3 participants