
Use the batched_update DAG with stored CSVs to update Catalog URLs #3415

Closed
obulat opened this issue Nov 29, 2023 · 2 comments · Fixed by #4610
Assignees
Labels
💻 aspect: code Concerns the software code in the repository 🌟 goal: addition Addition of new feature 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: catalog Related to the catalog and Airflow DAGs

Comments

@obulat

obulat commented Nov 29, 2023

Problem

We have generated CSVs containing an identifier column and another column whose values need to be applied to the Catalog media table, but we don't have a way to run these media table updates efficiently.

Description

The batched update DAG is a reusable DAG that can perform an arbitrary batched update on a Catalog media table while handling deadlocking and timeout concerns.
During the cleanup process in the data refresh, we generate CSVs that contain the item identifier and the cleaned-up version of another column (title, url, foreign_landing_url, creator_url and tags). We need a DAG that is similar to the batched update DAG, but uses a CSV table to select the items that need to be updated.

It is important that this work does not delete any tags. The tag column, while present in the CSVs, should not be used.
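The CSV-driven batched update described above can be sketched as follows. This is an illustrative pattern only, not the actual `batched_update` DAG code: the table names, column names, and batch size are assumptions, and SQLite stands in for the Catalog's Postgres so the example is self-contained. Note that the `tags` column in the CSV is read but deliberately ignored.

```python
import csv
import io
import sqlite3

# Hypothetical batch size; the real DAG would take this as a parameter.
BATCH_SIZE = 2

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Stand-in for the Catalog media table (illustrative schema).
    CREATE TABLE image (identifier TEXT PRIMARY KEY, url TEXT);
    -- Temp table holding the (identifier, cleaned url) pairs from the CSV.
    CREATE TEMP TABLE url_fixes (identifier TEXT PRIMARY KEY, url TEXT);
    """
)
conn.executemany(
    "INSERT INTO image VALUES (?, ?)",
    [("a", "http://old/a"), ("b", "http://old/b"), ("c", "http://old/c")],
)

# Simulate reading a stored cleanup CSV; the tags column is present
# in the file but is intentionally not used for any update.
csv_text = "identifier,url,tags\na,https://new/a,x\nc,https://new/c,y\n"
for row in csv.DictReader(io.StringIO(csv_text)):
    conn.execute(
        "INSERT INTO url_fixes VALUES (?, ?)", (row["identifier"], row["url"])
    )

# Apply the updates batch by batch, one short transaction per batch,
# so long-running locks (and the deadlock/timeout issues the batched
# update DAG guards against) are avoided.
fixes = conn.execute("SELECT identifier, url FROM url_fixes").fetchall()
for start in range(0, len(fixes), BATCH_SIZE):
    batch = fixes[start : start + BATCH_SIZE]
    with conn:  # one transaction per batch
        conn.executemany(
            "UPDATE image SET url = ? WHERE identifier = ?",
            [(url, ident) for ident, url in batch],
        )

print(conn.execute("SELECT url FROM image ORDER BY identifier").fetchall())
# → [('https://new/a',), ('http://old/b',), ('https://new/c',)]
```

Only rows whose identifier appears in the CSV are touched; row `b` keeps its old URL, and no tags are modified or deleted.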

Additional context

The CSV files are saved in the docker container of the ingestion server when we run data refresh.
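Since the CSVs live inside the container, they would need to be copied out (e.g. to S3 or local storage) before a DAG can consume them. A minimal sketch using `docker cp`; the container name and the file path here are hypothetical placeholders, not the project's actual values:

```shell
# Copy a cleanup CSV out of the ingestion server container after a
# data refresh. Replace the container name and path with the real ones.
docker cp ingestion_server:/tmp/url.csv ./url.csv
```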

@obulat obulat added 🟧 priority: high Stalls work on the project or its dependents 🌟 goal: addition Addition of new feature 💻 aspect: code Concerns the software code in the repository 🧱 stack: catalog Related to the catalog and Airflow DAGs labels Nov 29, 2023
@krysal krysal self-assigned this Feb 6, 2024
@krysal krysal added this to the Data normalization milestone Feb 22, 2024
@sarayourfriend
Collaborator

Blocking this based on the discussion in #4417 (comment), on the grounds that we should not delete data from the catalog; this work would do so by deleting any information that is filtered out during the data refresh cleanup (namely, tags).

We could unblock this by excluding tags from this work (provided that is the only place data would be deleted) and handling tags a different way. That is complicated, however, by one of the big questions in that IP that is yet to be resolved: where and when to filter tags not meant for search. Any discussion about that should go into that IP.

@zackkrida zackkrida changed the title Add a batched_update DAG for using with CSV files Use the batched_update DAG with stored CSVs to update Catalog URLs Jun 5, 2024
@zackkrida
Member

zackkrida commented Jun 5, 2024

This was discussed in today's priorities meeting. We decided to move forward with this work, but only to apply updates to URLs. We do not want to delete any data (tags, in this case).

The accuracy threshold filtering of existing tags will now be preserved and iterated on as part of the Removal of the ingestion server #3925 project.
