Filter machine-generated tags using an inclusion-based check, with a comparison against all reviewed labels #4813
Labels
💻 aspect: code
Concerns the software code in the repository
✨ goal: improvement
Improvement to an existing user-facing feature
🟨 priority: medium
Not blocking but should be addressed soon
🧱 stack: catalog
Related to the catalog and Airflow DAGs
Description
Once the new Rekognition tags have been inserted into the catalog, we will want to remove the blanket filter for all Rekognition tags that was put in place in #4644. We will also need to add the logic for filtering out the tags that, as determined in #4795, should be excluded.
For all machine-generated labels, we will employ an inclusion-based filtering process. This means that we will only keep labels that match the list of approved labels, which prevents unreviewed labels from appearing in the downstream dataset. This can be added to the `alter_data` step of the data refresh (see #4684) and would only be applied to tags where the `provider` was not the record's `provider`. We will not be adding this to the legacy ingestion server; the Rekognition labels are currently filtered wholesale there and can remain that way until we move to the new data refresh process.
The comparison between labels on the record and labels in the list should be case-insensitive, given that the semantic content of the labels is generally case-insensitive too. Similar to the sensitive terms list, both the inclusion and reviewed lists will be applied to all tag sources (that is, we will not maintain provider-specific lists for now).
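To make the intended behaviour concrete, here is a minimal sketch of the inclusion-based check. It assumes each tag is a dict with `name` and `provider` keys; the function name `filter_machine_generated_tags` and its parameters are hypothetical and do not come from the existing `alter_data` code.

```python
def filter_machine_generated_tags(tags, record_provider, inclusion_list):
    """Keep the record provider's own tags untouched; keep tags from any
    other (machine-generated) source only if they are on the inclusion list.

    Assumption: each tag looks like {"name": "...", "provider": "..."}.
    """
    # Normalise the inclusion list once so lookups are case-insensitive.
    approved = {label.casefold() for label in inclusion_list}

    kept = []
    for tag in tags:
        if tag.get("provider") == record_provider:
            # Tags supplied by the record's own provider are not filtered here.
            kept.append(tag)
        elif (tag.get("name") or "").casefold() in approved:
            # Machine-generated tag that matches an approved label.
            kept.append(tag)
    return kept


example_tags = [
    {"name": "Sunset", "provider": "flickr"},
    {"name": "CAT", "provider": "rekognition"},
    {"name": "Person", "provider": "rekognition"},
]
filtered = filter_machine_generated_tags(example_tags, "flickr", ["cat"])
```

In this example the `flickr` tag is kept because it comes from the record's own provider, `CAT` is kept because it matches `cat` case-insensitively, and `Person` is dropped because it is not on the inclusion list.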
For any orthographic corrections we've made to the labels, we will have the corrected label present in the inclusion list and the original label in the reviewed list. This will ensure that the corrected label is surfaced in the API, but the original label gets blocked in the cases where it may be added by another provider.
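As a small illustration of how a corrected label and its original would sit in the two lists, consider the sketch below. The labels `bicycle` and `bycicle` are made-up examples, and the set and function names are hypothetical.

```python
# Hypothetical lists: the corrected spelling is approved, while the original
# misspelling is only recorded as reviewed.
INCLUSION_LIST = {"bicycle"}
REVIEWED_LIST = {"bicycle", "bycicle"}


def is_surfaced(label):
    """True if the label would be kept by the inclusion-based filter."""
    return label.casefold() in INCLUSION_LIST


def is_reviewed(label):
    """True if the label has already been reviewed (approved or rejected)."""
    return label.casefold() in REVIEWED_LIST
```

Under these assumptions, `is_surfaced("Bicycle")` is true, whereas `bycicle` is reviewed but not surfaced, so it stays blocked if another machine-generated source adds it and it is not reported again as unreviewed.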
We will also add a step that records any label that is neither in the inclusion list nor in the full list of all reviewed labels from the provider. These "unreviewed" labels should be surfaced as part of the data refresh, so maintainers can review them and decide whether to add them to the inclusion list.
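The recording step could be sketched roughly as follows. This assumes records are dicts with `provider` and `tags` keys and that tags carry `name` and `provider`; `collect_unreviewed_labels` is a hypothetical helper, and how the result is surfaced during the data refresh is left open.

```python
from collections import Counter


def collect_unreviewed_labels(records, inclusion_list, reviewed_list):
    """Count machine-generated labels that appear in neither list.

    Assumption: records look like
    {"provider": "...", "tags": [{"name": "...", "provider": "..."}]}.
    """
    # Anything in either list has already been seen by a maintainer.
    known = {label.casefold() for label in inclusion_list}
    known |= {label.casefold() for label in reviewed_list}

    unreviewed = Counter()
    for record in records:
        for tag in record.get("tags") or []:
            if tag.get("provider") == record.get("provider"):
                # Only tags from other sources are checked.
                continue
            name = (tag.get("name") or "").casefold()
            if name and name not in known:
                unreviewed[name] += 1
    return unreviewed


sample_records = [
    {
        "provider": "flickr",
        "tags": [
            {"name": "holiday", "provider": "flickr"},
            {"name": "Cat", "provider": "rekognition"},
            {"name": "Gizmo", "provider": "rekognition"},
        ],
    },
    {
        "provider": "flickr",
        "tags": [{"name": "gizmo", "provider": "rekognition"}],
    },
]
report = collect_unreviewed_labels(sample_records, ["cat"], ["cat", "person"])
```

Counting occurrences, rather than only collecting a set, is one possible design choice that would let maintainers prioritise the most frequent unreviewed labels.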
Additional context
See this section of the related IP.