Filter machine-generated tags using an inclusion-based check, with a comparison against all reviewed labels #4813

AetherUnbound · 2024-08-26T19:04:27Z

Description

Once the new Rekognition tags have been inserted into the catalog, we will want to remove the blanket filter for all Rekognition tags that was put in place in #4644. We will also need to add logic for filtering out the tags that #4795 determined should be excluded.

For all machine-generated labels, we will employ an inclusion-based filtering process. This means that we will only keep labels that match the list of approved labels, which prevents unreviewed labels from appearing in the downstream dataset. This can be added to the alter_data step of the data refresh (see #4684) and would only be applied to tags whose provider is not the record's provider. We will not be adding this to the legacy ingestion server: the Rekognition labels are currently filtered wholesale there and can remain that way until we move to the new data refresh process.
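A minimal sketch of the inclusion-based check described above, not the actual alter_data implementation. It assumes each tag is a dict with "name" and "provider" keys; the function name `filter_tags`, the label values, and the provider names are made up for illustration.

```python
from typing import Iterable

# Hypothetical inclusion list; the real one would come out of the review in #4795.
INCLUDED_LABELS = frozenset({"cat", "tree"})


def filter_tags(
    record_provider: str,
    tags: Iterable[dict],
    included: frozenset = INCLUDED_LABELS,
) -> list[dict]:
    """Keep the record provider's own tags; keep other tags only if approved."""
    kept = []
    for tag in tags:
        if tag["provider"] == record_provider:
            # Supplied by the record's own provider: the filter does not apply.
            kept.append(tag)
        elif tag["name"].casefold() in included:
            # Machine-generated and explicitly approved.
            kept.append(tag)
        # Anything else is machine-generated and unapproved, so it is dropped.
    return kept


tags = [
    {"name": "my cat", "provider": "flickr"},
    {"name": "Cat", "provider": "rekognition"},
    {"name": "Gizmo", "provider": "rekognition"},
]
print([t["name"] for t in filter_tags("flickr", tags)])  # ['my cat', 'Cat']
```

Because the check is an allow-list, a label nobody has looked at yet ("Gizmo" here) is dropped by default rather than leaking through.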

The comparison between labels on the record and labels in the list should be case-insensitive, since a label's meaning generally does not depend on its case. Similar to the sensitive terms list, both the inclusion and reviewed lists will be applied to all tag sources (that is, we will not maintain provider-specific lists for now).
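One way to get a case-insensitive comparison is to normalize both sides to the same key before checking membership. The sketch below uses `str.casefold()`; the helper names and sample labels are assumptions for illustration, and a single shared set is used regardless of provider.

```python
def normalize(label: str) -> str:
    """Return a case-insensitive comparison key for a label."""
    return label.casefold()


def build_label_set(labels) -> frozenset:
    """Normalize a raw list of labels once, up front."""
    return frozenset(normalize(label) for label in labels)


# Hypothetical list contents; one set serves every tag source.
inclusion = build_label_set(["Cat", "Palm Tree"])

print(normalize("PALM TREE") in inclusion)  # True
print(normalize("palm") in inclusion)       # False
```

Normalizing the lists once keeps the per-tag work down to a single set lookup.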

For any orthographic corrections we've made to the labels, we will have the corrected label present in the inclusion list and the original label in the reviewed list. This will ensure that the corrected label is surfaced in the API, but the original label gets blocked in the cases where it may be added by another provider.
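To illustrate how a correction is split across the two lists, here is a small sketch with a made-up misspelling. The labels, list names, and the `status` helper are all hypothetical; only the placement (corrected label in the inclusion list, original in the reviewed list) comes from the description above.

```python
# Hypothetical correction: the original label is misspelled.
ORIGINAL = "Accesories"
CORRECTED = "Accessories"

INCLUSION = frozenset({CORRECTED.casefold()})
REVIEWED = frozenset({ORIGINAL.casefold()})


def status(label: str) -> str:
    """Classify a label against the inclusion and reviewed lists."""
    key = label.casefold()
    if key in INCLUSION:
        return "included"
    if key in REVIEWED:
        # Seen during review but not approved, e.g. a superseded spelling.
        return "blocked"
    return "unreviewed"


print(status("Accessories"))  # included
print(status("accesories"))   # blocked
```

The original spelling is blocked without being reported as new, even if a different provider emits it later.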

We will also add a step that records any label that is neither in the inclusion list nor in the full list of all reviewed labels from the provider. These "unreviewed" labels should be surfaced as part of the data refresh, so maintainers can review them and decide whether to add them to the inclusion list.
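The recording step could be sketched as a counter over labels that appear in neither list. This is an assumption-laden illustration: the function name, the tag shape ("name" and "provider" keys), and the sample data are hypothetical, and how the result is surfaced (log output, a report, a notification) is left open here.

```python
from collections import Counter


def find_unreviewed(tags, record_provider, included, reviewed) -> Counter:
    """Count machine-generated labels found in neither list."""
    unreviewed = Counter()
    for tag in tags:
        if tag["provider"] == record_provider:
            continue  # the record's own tags are not subject to the filter
        key = tag["name"].casefold()
        if key not in included and key not in reviewed:
            unreviewed[key] += 1
    return unreviewed


included = frozenset({"cat"})
reviewed = frozenset({"cat", "gizmo"})
tags = [
    {"name": "Cat", "provider": "rekognition"},
    {"name": "Gizmo", "provider": "rekognition"},
    {"name": "Widget", "provider": "rekognition"},
    {"name": "widget", "provider": "rekognition"},
]
print(find_unreviewed(tags, "flickr", included, reviewed))  # Counter({'widget': 2})
```

Counting occurrences, rather than only collecting names, would let maintainers prioritize the most common unreviewed labels.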

Additional context

See this section of the related IP.


AetherUnbound added the ✨ goal: improvement, 💻 aspect: code, 🟨 priority: medium, and 🧱 stack: catalog labels Aug 26, 2024

AetherUnbound added this to the Incorporate Rekognition Data milestone Aug 26, 2024

openverse-bot added this to Openverse Backlog Aug 26, 2024

openverse-bot moved this to 📋 Backlog in Openverse Backlog Aug 26, 2024

AetherUnbound moved this from 📋 Backlog to 📅 To Do in Openverse Backlog Aug 26, 2024

AetherUnbound mentioned this issue Sep 6, 2024

Incorporate Rekognition data into the catalog #431

Open

4 tasks
