Backfill the last_synced_with_source field in the database #1562

Open
1 task
obulat opened this issue May 20, 2022 · 4 comments
Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs

Comments

@obulat
Contributor

obulat commented May 20, 2022

Current Situation

Currently, 554 237 images have NULL as their last_synced_with_source value.

Suggested Improvement

Wherever last_synced_with_source is currently NULL in the database, set it to the value of updated_on if available, and otherwise to created_on.
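
The backfill rule could be sketched roughly as follows. This is a minimal illustration only, not the catalog's actual migration: it uses an in-memory SQLite database so that it runs anywhere, and the table name `image`, the demo rows, and the helper names `make_demo_db` and `backfill_last_synced` are all assumptions made for the example. The production database and its schema may differ.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory table (hypothetical schema) to try the backfill on."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE image (
            id INTEGER PRIMARY KEY,
            created_on TEXT NOT NULL,
            updated_on TEXT,
            last_synced_with_source TEXT
        )
        """
    )
    conn.executemany(
        "INSERT INTO image VALUES (?, ?, ?, ?)",
        [
            (1, "2020-01-01", "2021-06-01", None),  # should take updated_on
            (2, "2020-02-01", None, None),  # should fall back to created_on
            (3, "2020-03-01", "2021-01-01", "2022-01-01"),  # already set, untouched
        ],
    )
    conn.commit()
    return conn


def backfill_last_synced(conn: sqlite3.Connection) -> int:
    """Fill NULL last_synced_with_source values and return the number of rows changed."""
    cur = conn.execute(
        """
        UPDATE image
        SET last_synced_with_source = COALESCE(updated_on, created_on)
        WHERE last_synced_with_source IS NULL
        """
    )
    conn.commit()
    return cur.rowcount


if __name__ == "__main__":
    demo = make_demo_db()
    print("rows backfilled:", backfill_last_synced(demo))
```

`COALESCE` returns its first non-NULL argument, which matches the "updated_on if available, otherwise created_on" rule, and the `WHERE` clause leaves rows that already have a value alone, so running the statement a second time changes nothing.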

Benefit

Data consistency.

Additional context

Part of #244

Implementation

  • 🙋 I would be interested in implementing this feature.
@obulat obulat added 🟨 priority: medium Not blocking but should be addressed soon 💻 aspect: code Concerns the software code in the repository 🧰 goal: internal improvement Improvement that benefits maintainers, not users labels May 20, 2022
@obulat obulat mentioned this issue May 20, 2022
29 tasks
@obulat obulat added data normalization ✨ goal: improvement Improvement to an existing user-facing feature and removed 🧰 goal: internal improvement Improvement that benefits maintainers, not users labels May 20, 2022
@krysal
Member

krysal commented Sep 30, 2022

I believe this should be done naturally during the data ingestion/refresh process. @stacimc @AetherUnbound what do you think?

@stacimc
Collaborator

stacimc commented Sep 30, 2022

That should be possible! With perhaps a bit of lag for dated DAGs, as it will require the reingestion process to cover old data over time. Reingestion flows are now supported but we need to hook up workflows for a few of our dated DAGs, and turn them on in production.

@AetherUnbound
Collaborator

I think there may still be some records which will never receive a last_synced_with_source, even if all dated DAGs had a reingestion workflow and all non-dated DAGs were certain to re-consume every record they've ever touched. IIRC we have some records from the common crawl dataset which did not come from a provider, and so I would expect those to remain NULL since we're not likely to touch them again.
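
One way to see which records would be left behind is to count the rows that are still NULL, grouped by where they came from. The sketch below is only an illustration under assumptions: the `image` table name, the `provider` column, and the function name `count_unsynced_by_provider` are hypothetical here, and SQLite stands in for the real database.

```python
import sqlite3


def count_unsynced_by_provider(conn: sqlite3.Connection) -> dict:
    """Map each provider value to its number of rows still lacking last_synced_with_source."""
    rows = conn.execute(
        """
        SELECT provider, COUNT(*)
        FROM image
        WHERE last_synced_with_source IS NULL
        GROUP BY provider
        ORDER BY COUNT(*) DESC, provider
        """
    ).fetchall()
    return {provider: count for provider, count in rows}
```

Running such a count after reingestion has covered the dated DAGs would show whether the remaining NULL rows are concentrated in sources that are no longer ingested.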

@stacimc
Collaborator

stacimc commented Sep 30, 2022

IIRC we have some records from the common crawl dataset which did not come from a provider

! I did not know this -- I had it in my head that the crawler only identified potential providers -- but of course it must be true. Very good point 😮

@obulat obulat added the 🧱 stack: catalog Related to the catalog and Airflow DAGs label Feb 24, 2023
@github-project-automation github-project-automation bot moved this to 📋 Backlog in Openverse Backlog Apr 17, 2023
@obulat obulat transferred this issue from WordPress/openverse-catalog Apr 17, 2023
@dhruvkb dhruvkb added this to the Data normalization milestone Dec 2, 2023