Filter machine-generated tags using an inclusion-based check, with a comparison against all reviewed labels #4813

Open
AetherUnbound opened this issue Aug 26, 2024 · 0 comments
Labels
💻 aspect: code (Concerns the software code in the repository)
✨ goal: improvement (Improvement to an existing user-facing feature)
🟨 priority: medium (Not blocking but should be addressed soon)
🧱 stack: catalog (Related to the catalog and Airflow DAGs)

Comments

@AetherUnbound
Collaborator

Description

Once the new Rekognition tags have been inserted into the catalog, we will want to remove the blanket filter for all Rekognition tags that was put in place in #4644. We will also need to add the logic for filtering out the tags that #4795 determined should be excluded.

For all machine-generated labels, we will employ an inclusion-based filtering process. This means that we will only keep labels that match the list of approved labels, which prevents unreviewed labels from appearing in the downstream dataset. This can be added to the alter_data step of the data refresh (see #4684) and would only be applied to tags whose provider differs from the record's provider. We will not be adding this to the legacy ingestion server: the Rekognition labels are currently filtered wholesale there and can remain that way until we move to the new data refresh process.

The comparison between labels on the record and labels in the list should be case-insensitive, given that the semantic content of the labels is generally case-insensitive too. Similar to the sensitive terms list, both the inclusion and reviewed lists will be applied to all tag sources (that is, we will not maintain provider-specific lists for now).
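As a rough sketch of the filtering described above (not the actual alter_data implementation), the check could look something like the following. It assumes each tag is a dict with `name` and `provider` keys; the function name `filter_machine_tags` and its parameters are hypothetical.

```python
def filter_machine_tags(tags, record_provider, inclusion_list):
    """Keep a machine-generated tag only if its label is on the inclusion list.

    Assumptions (not confirmed by this issue): each tag is a dict with
    "name" and "provider" keys, and a tag counts as machine-generated
    when its provider differs from the record's provider.
    """
    # Normalize the inclusion list once so the comparison is case-insensitive.
    allowed = {label.casefold() for label in inclusion_list}

    kept = []
    for tag in tags:
        is_machine_generated = tag.get("provider") != record_provider
        if is_machine_generated and tag["name"].casefold() not in allowed:
            # Not on the inclusion list: drop it from the downstream dataset.
            continue
        kept.append(tag)
    return kept
```

Tags that come from the record's own provider pass through untouched, and a single shared inclusion list is used for every tag source.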

For any orthographic corrections we've made to the labels, we will have the corrected label present in the inclusion list and the original label in the reviewed list. This will ensure that the corrected label is surfaced in the API, but the original label gets blocked in the cases where it may be added by another provider.
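To illustrate how a corrected label and its original would sit in the two lists, here is a small sketch. The label `buiding` is a made-up misspelling used only for illustration, and `label_status` is a hypothetical helper, not existing code.

```python
# Hypothetical lists: "buiding" stands in for a misspelled original label
# and "building" for its corrected form.
INCLUSION_LIST = {"building"}
REVIEWED_LIST = {"buiding"}


def label_status(label, inclusion=INCLUSION_LIST, reviewed=REVIEWED_LIST):
    """Report how a label would be treated, comparing case-insensitively."""
    key = label.casefold()
    if key in {item.casefold() for item in inclusion}:
        return "surfaced"  # corrected/approved label, shown in the API
    if key in {item.casefold() for item in reviewed}:
        return "blocked"  # reviewed but not approved, e.g. the original spelling
    return "unreviewed"  # appears in neither list
```

With these lists, the corrected spelling is surfaced, while the original spelling is blocked even if another provider supplies it.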

We will also add a step for recording any label that is neither in the inclusion list nor in the full list of all reviewed labels from the provider. These "unreviewed" labels should be surfaced as part of the data refresh, so maintainers can review them and decide whether they should be added to the inclusion list.
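One possible shape for the recording step, again only as a sketch: it assumes records are dicts with a `provider` key and a `tags` list of dicts with `name` and `provider` keys. `collect_unreviewed` is a hypothetical name, and how the result gets reported during the data refresh is left open.

```python
from collections import Counter


def collect_unreviewed(records, inclusion_list, reviewed_list):
    """Count machine-generated labels that appear in neither list.

    Assumptions (not confirmed by this issue): records are dicts with a
    "provider" key and a "tags" list of dicts with "name" and "provider" keys.
    """
    # A label is "known" if it is in either list, compared case-insensitively.
    known = {label.casefold() for label in inclusion_list}
    known |= {label.casefold() for label in reviewed_list}

    unreviewed = Counter()
    for record in records:
        for tag in record.get("tags") or []:
            # Only consider tags whose provider differs from the record's.
            if tag.get("provider") == record.get("provider"):
                continue
            key = tag["name"].casefold()
            if key not in known:
                unreviewed[key] += 1
    return unreviewed
```

The returned counter could then be logged or reported at the end of the refresh so maintainers can see which labels still need review and how often each occurs.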

Additional context

See this section of the related IP.

@AetherUnbound AetherUnbound added ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs labels Aug 26, 2024
@openverse-bot openverse-bot moved this to 📋 Backlog in Openverse Backlog Aug 26, 2024
@AetherUnbound AetherUnbound moved this from 📋 Backlog to 📅 To Do in Openverse Backlog Aug 26, 2024
Projects
Status: 📅 To Do
Development

No branches or pull requests

4 participants
@AetherUnbound and others