Remove deny-listed tags in the catalog with the batched_update DAG #4453

Closed
krysal opened this issue Jun 6, 2024 · 1 comment

Labels
🗄️ aspect: data (Concerns the data in our catalog and/or databases)
🛠 goal: fix (Bug fix)
🟧 priority: high (Stalls work on the project or its dependents)
🧱 stack: catalog (Related to the catalog and Airflow DAGs)
⛔ status: blocked (Blocked & therefore, not ready for work)

krysal commented Jun 6, 2024

Problem

When the MediaStore class was introduced in the catalog, it began removing certain tags at ingestion time. Records ingested before that still carry those tags and are cleaned again on every data refresh in the Ingestion Server. We want to apply this cleaning step once, in the catalog, to those older rows, so that the cleaned data is persisted and each data refresh no longer spends extra time on it.

You can find the tags to remove in the following file:

# Filter out tags that exactly match these terms. All terms should be lowercase.
TAG_DENYLIST = {
    "no person",
    "squareformat",
    "undefined",
}
# Filter out tags that contain the following terms. All terms should be lowercase.
TAG_CONTAINS_DENYLIST = {
    ":",
    "=",
    "by",
    "by-nc",
    "by-nc-nd",
    "by-nc-sa",
    "by-nd",
    "by-sa",
    "cc0",
    "creative commons",
    "flickriosapp",
    "pdm",
    "public domain",
    "uploaded",
}
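
To make the two matching rules concrete, here is a minimal sketch of how the sets above could be applied to a record's tags. The helper names `is_denylisted` and `filter_tags`, and the assumption that each tag is a dict with a `name` key, are illustrative and not taken from the catalog code.

```python
# Denylists copied from the snippet above.
TAG_DENYLIST = {"no person", "squareformat", "undefined"}
TAG_CONTAINS_DENYLIST = {
    ":", "=", "by", "by-nc", "by-nc-nd", "by-nc-sa", "by-nd", "by-sa",
    "cc0", "creative commons", "flickriosapp", "pdm", "public domain",
    "uploaded",
}


def is_denylisted(name: str) -> bool:
    """Return True if the tag name matches either denylist (hypothetical helper)."""
    lowered = name.lower()
    # Exact match against the first set.
    if lowered in TAG_DENYLIST:
        return True
    # Substring match against the second set.
    return any(term in lowered for term in TAG_CONTAINS_DENYLIST)


def filter_tags(tags: list[dict]) -> list[dict]:
    """Keep only tags whose name is not denylisted.

    Assumes each tag is a dict with a "name" key; a missing name is kept.
    """
    return [tag for tag in tags if not is_denylisted(tag.get("name", ""))]
```

Because the second set is matched by substring, any tag that merely contains one of its terms is dropped as well; for example, `is_denylisted("nearby")` is true because the name contains `by`.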

Description

Write the instructions for removing these tags and apply them with a batched_update DAG run. In theory, this should only be needed for the image table, since audio was added after the MediaStore class was created. However, the lists of denied tags were not unified until much later, and the audio table is much smaller, so it would not hurt to run the update there too.

See an example of the parameters required by this DAG here: #1566 (comment).
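
As a rough illustration of what such a run could look like, the sketch below builds a configuration dict with a row selector and an update statement. Everything in it is an assumption to verify against the linked example: the key names (`query_id`, `table_name`, `select_query`, `update_query`, `batch_size`, `dry_run`), the convention that the queries start with `WHERE` and `SET`, the `query_id` value, and the shape of the `tags` column (a JSONB array of objects with a `name` key). The SQL uses PostgreSQL's `jsonb_array_elements`, `jsonb_agg`, and `strpos`.

```python
import json

# Denylists copied from the issue text.
TAG_DENYLIST = {"no person", "squareformat", "undefined"}
TAG_CONTAINS_DENYLIST = {
    ":", "=", "by", "by-nc", "by-nc-nd", "by-nc-sa", "by-nd", "by-sa",
    "cc0", "creative commons", "flickriosapp", "pdm", "public domain",
    "uploaded",
}


def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def denied_predicate(name_expr: str) -> str:
    """Build a SQL boolean expression that is true for a denylisted tag name."""
    exact = ", ".join(sql_literal(t) for t in sorted(TAG_DENYLIST))
    contains = " OR ".join(
        f"strpos({name_expr}, {sql_literal(t)}) > 0"
        for t in sorted(TAG_CONTAINS_DENYLIST)
    )
    return f"({name_expr} IN ({exact}) OR {contains})"


# Assumes each element of the tags array is an object with a "name" key.
PREDICATE = denied_predicate("lower(tag->>'name')")

# Select only rows that have at least one denylisted tag.
select_query = (
    "WHERE EXISTS (SELECT 1 FROM jsonb_array_elements(tags) AS t(tag) "
    f"WHERE {PREDICATE})"
)

# Rebuild the array without denylisted tags. IS NOT TRUE keeps tags whose
# name is NULL; rows left with no tags get an empty array (an assumption).
# jsonb_agg without ORDER BY does not guarantee the original element order.
update_query = (
    "SET tags = (SELECT COALESCE(jsonb_agg(tag), '[]'::jsonb) "
    "FROM jsonb_array_elements(tags) AS t(tag) "
    f"WHERE {PREDICATE} IS NOT TRUE)"
)

# Hypothetical configuration; key names and values are assumptions.
conf = {
    "query_id": "remove_denylisted_tags_image",
    "table_name": "image",
    "select_query": select_query,
    "update_query": update_query,
    "batch_size": 10_000,
    "dry_run": True,
}

print(json.dumps(conf, indent=2))
```

Starting with a dry run and repeating the same configuration with `table_name` set to `audio` would cover both tables mentioned above.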

@krysal krysal added 🟧 priority: high Stalls work on the project or its dependents 🛠 goal: fix Bug fix 🧱 stack: catalog Related to the catalog and Airflow DAGs 🗄️ aspect: data Concerns the data in our catalog and/or databases labels Jun 6, 2024
@krysal krysal added this to the Data normalization milestone Jun 6, 2024
@openverse-bot openverse-bot moved this to 📋 Backlog in Openverse Backlog Jun 6, 2024
@krysal krysal moved this from 📋 Backlog to 📅 To Do in Openverse Backlog Jun 6, 2024
@krysal krysal added the ⛔ status: blocked Blocked & therefore, not ready for work label Jun 10, 2024
@openverse-bot openverse-bot moved this from 📅 To Do to ⛔ Blocked in Openverse Backlog Jun 10, 2024

krysal commented Jun 11, 2024

Not needed, since we actually want to preserve all the tags, as discussed in #4417.

@krysal krysal closed this as not planned Jun 11, 2024
@openverse-bot openverse-bot moved this from ⛔ Blocked to 🗑 Discarded in Openverse Backlog Jun 11, 2024