Remove deny-listed tags in the catalog with the batched_update DAG
#4453
Labels
🗄️ aspect: data
Concerns the data in our catalog and/or databases
🛠 goal: fix
Bug fix
🟧 priority: high
Stalls work on the project or its dependents
🧱 stack: catalog
Related to the catalog and Airflow DAGs
⛔ status: blocked
Blocked & therefore, not ready for work
Problem
The creation of the MediaStore class in the catalog introduced the removal of certain specific tags. Records ingested before that are still cleaned during each data refresh in the Ingestion Server. We want to apply this cleaning step in the catalog to the old rows, so that the cleaned data is saved and each refresh no longer spends extra time on it.
You can find the tags to remove in the following file:
openverse/catalog/dags/common/storage/media.py
Lines 14 to 37 in d1f6c88
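For illustration, the cleaning step could look roughly like the sketch below. It assumes each tag is a dict with a "name" key and that tags are denied either by exact match or by containing a denied fragment. The names DENIED_TAGS, DENIED_SUBSTRINGS, is_denied and clean_tags, and the placeholder values, are hypothetical; the real list lives in the file referenced above.

```python
# Hypothetical placeholders; the real values are defined in
# catalog/dags/common/storage/media.py and are not reproduced here.
DENIED_TAGS = {"example-denied-tag"}
DENIED_SUBSTRINGS = {"example-fragment"}


def is_denied(name: str) -> bool:
    """Return True if the tag name matches the (assumed) deny rules."""
    lowered = name.lower()
    # Exact match against the deny list, or any denied fragment inside the name.
    return lowered in DENIED_TAGS or any(s in lowered for s in DENIED_SUBSTRINGS)


def clean_tags(tags: list[dict]) -> list[dict]:
    """Keep only the tags whose name is not denied; tags without a name are kept."""
    return [tag for tag in tags if not is_denied(tag.get("name") or "")]
```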
Description
Write the instructions for removing these tags and apply them with a batched_update DAG run. In theory, this should only be required for the image table, since audio was incorporated after the MediaStore class was created. However, the list of denied tags wasn't unified until much later, and the audio table is much smaller, so it would not hurt to apply them there too.
See an example of the parameters required by this DAG here: #1566 (comment).
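One possible way to draft the SQL fragments for such a run is sketched below. It assumes the tags column is a PostgreSQL jsonb array of objects with a "name" key, and that the DAG accepts a row-selection fragment and an update fragment. The key names select_query and update_query, and the helper build_tag_removal_sql, are assumptions; check them against the parameters in the linked comment before use.

```python
def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_tag_removal_sql(denied: list[str]) -> dict[str, str]:
    """Build hypothetical SQL fragments that strip denied tags from a jsonb column."""
    names = ", ".join(sql_literal(name.lower()) for name in sorted(denied))
    # coalesce keeps tags that have no name from being dropped by NULL comparisons.
    tag_name = "coalesce(lower(t->>'name'), '')"
    return {
        # Select only rows that contain at least one denied tag.
        "select_query": (
            "WHERE EXISTS (SELECT 1 FROM jsonb_array_elements(tags) AS t "
            f"WHERE {tag_name} IN ({names}))"
        ),
        # Rebuild the array without denied tags; jsonb_agg yields NULL
        # when every tag on a row is removed.
        "update_query": (
            "SET tags = (SELECT jsonb_agg(t) FROM jsonb_array_elements(tags) AS t "
            f"WHERE {tag_name} NOT IN ({names}))"
        ),
    }
```

Under these assumptions, the same fragments could be reused for both the image and audio tables, with only the table name changing between runs.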