
Unify data refresh/provider cleaning #1663

Closed
4 tasks
AetherUnbound opened this issue Feb 3, 2022 · 10 comments
Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs 🐍 tech: python Involves Python

Comments

@AetherUnbound (Collaborator) commented Feb 3, 2022

Problem

Some cleaning steps are replicated across the provider API scripts and the ingestion server's data refresh cleaning.

Description

Where possible, we should pull these data cleaning steps back into the provider scripts. These operations are repeated so often on the data refresh end that performing the cleaning before records are inserted into the catalog could save a significant amount of time during each data refresh.

This will involve work in both the catalog and API repos. It may also require performing a single-pass cleaning operation to the existing records within the catalog.
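As a rough illustration of the idea (hypothetical function and field names, not the actual Openverse provider script or MediaStore code), cleaning applied once at ingestion time might look like:

```python
# Hypothetical sketch: clean a record inside the provider script,
# before it is inserted into the catalog, so the data refresh does
# not have to repeat the same work. Names are illustrative only.

def clean_record(record: dict) -> dict:
    """Apply cleaning steps that the data refresh would otherwise repeat."""
    cleaned = dict(record)

    # Normalize protocol-relative URLs by adding a scheme.
    url = cleaned.get("foreign_landing_url", "")
    if url.startswith("//"):
        cleaned["foreign_landing_url"] = "https:" + url

    # Drop low-confidence machine-generated tags (0.9 threshold assumed).
    cleaned["tags"] = [
        t
        for t in cleaned.get("tags", [])
        if t.get("provider") != "clarifai"
        or float(t.get("accuracy", 1.0)) >= 0.9
    ]
    return cleaned


record = {
    "foreign_landing_url": "//example.com/photo/1",
    "tags": [
        {"name": "bar", "accuracy": "0.81", "provider": "clarifai"},
        {"name": "beer", "accuracy": "0.94", "provider": "clarifai"},
    ],
}
print(clean_record(record))
```

With this in place, records would already be normalized when the data refresh copies them, so the ingestion server's cleaning pass becomes a no-op for new data.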

Additional context

Depends on

Implementation

  • 🙋 I would be interested in implementing this feature.
@AetherUnbound AetherUnbound added ✨ goal: improvement Improvement to an existing user-facing feature 🐍 tech: python Involves Python 💻 aspect: code Concerns the software code in the repository 🟨 priority: medium Not blocking but should be addressed soon labels Feb 3, 2022
@obulat obulat self-assigned this Jul 29, 2022
@obulat (Contributor) commented Aug 1, 2022

All the code for cleaning the data currently works for the new data we ingest using the MediaStore class.

It appears that the cc-archive/cccatalog#517 data cleaning workflow was created to clean up all the upstream catalog data using the ImageStore class. If that DAG was run successfully, then all the cleaning steps from the API have already been applied to the upstream catalog database, and all we need to do is remove the cleaning steps from the API.

@zackkrida, do you know if the linked DAG was indeed run?

@AetherUnbound (Collaborator, Author):

If we're uncertain, perhaps we can run a quick query to check that we don't have any records matching the criteria used in the cleaning! I suspect it wasn't run, given how much cleaning is logged during that part of the data refresh, but I'd love to be wrong!

@zackkrida (Member):

As I mentioned to Olga, I've asked @mathemancer whether he remembers if we ran this DAG against the full dataset or not.

@mathemancer (Contributor):

I unfortunately can't remember. However, given the timing of that work, it's probable that it wasn't run (due to other, higher priorities at that moment).

@obulat (Contributor) commented Aug 2, 2022

Thank you for your reply, @mathemancer!

@obulat (Contributor) commented Aug 2, 2022

@AetherUnbound, how can we look at the data refresh logs? If the logs are not clear enough, we could add more logging to check whether many items do indeed get cleaned up. I think that would be easier than running a query against the upstream database. I tried `select * from image where foreign_landing_url LIKE '/%' limit 1;`, but it's still running...

@obulat (Contributor) commented Aug 2, 2022

Unfortunately, I managed to confirm that the cleanup DAG did not run over the upstream database.

The cleanup step removes all the Clarifai-generated tags with an accuracy value lower than 0.9. I selected some old Flickr images:

```sql
select * from image where provider='flickr' and tags is not null and created_on < now() - interval '5 years' limit 3;
```

One of those images has tags with accuracy values of 0.75–0.87:

```sql
SELECT tags FROM image WHERE identifier='cf3a97a0-0162-4c8c-b9a8-ac1412c96986';
```

```json
[{"name": "bar", "accuracy": "0.81673", "provider": "clarifai"}, {"name": "beer", "accuracy": "0.94402", "provider": "clarifai"}, {"name": "blur", "accuracy": "0.93241", "provider": "clarifai"}, {"name": "luxury", "accuracy": "0.7906", "provider": "clarifai"}, {"name": "rum", "accuracy": "0.82097", "provider": "clarifai"}, {"name": "still life", "accuracy": "0.75311", "provider": "clarifai"}, {"name": "wood", "accuracy": "0.87821", "provider": "clarifai"}, {"name": "cold", "accuracy": "0.7625...
```

If you look at the same image in the API (https://api.openverse.engineering/v1/images/cf3a97a0-0162-4c8c-b9a8-ac1412c96986/), you'll see that the data refresh process removed those tags.
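The filtering the data refresh applies to these tags can be sketched as follows (a minimal illustration using the tags from the query above, not the ingestion server's actual code):

```python
# Minimal sketch of the tag cleanup described above: drop
# machine-generated tags whose accuracy falls below the 0.9 threshold.
# (Illustrative only; not the ingestion server's implementation.)

MIN_ACCURACY = 0.9

tags = [
    {"name": "bar", "accuracy": "0.81673", "provider": "clarifai"},
    {"name": "beer", "accuracy": "0.94402", "provider": "clarifai"},
    {"name": "blur", "accuracy": "0.93241", "provider": "clarifai"},
    {"name": "luxury", "accuracy": "0.7906", "provider": "clarifai"},
]

kept = [t for t in tags if float(t.get("accuracy", 1.0)) >= MIN_ACCURACY]
print([t["name"] for t in kept])  # ['beer', 'blur']
```

Since the catalog row above still contains the low-accuracy tags while the API record does not, the filter must be running on every data refresh rather than having been applied once upstream.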

@obulat (Contributor) commented Aug 2, 2022

All of the steps of the ingestion server's data refresh cleaning are actually done by the catalog's MediaStore class for the newly ingested media.

@AetherUnbound, @stacimc So, do you think we should close this issue as it is currently non-actionable, or move it to v1.4.0?

The problem described in this issue should be solved by v1.4.0 milestone in the catalog, and on the API side by WordPress/openverse-api#839.

@AetherUnbound (Collaborator, Author):

Let's move this to 1.4.0! Thanks for all the investigating you did on this Olga 🙂

@krysal krysal mentioned this issue Dec 14, 2022
2 tasks
@obulat obulat added the 🧱 stack: catalog Related to the catalog and Airflow DAGs label Feb 24, 2023
@obulat obulat transferred this issue from WordPress/openverse-catalog Apr 17, 2023
@krysal krysal added this to the Data normalization milestone Feb 20, 2024
@krysal (Member) commented Aug 16, 2024

The cleanup steps were removed from the ingestion server (except for the tags step, which we want to keep for now), given that the catalog data was corrected within #3415. This is done.

5 participants