-
Notifications
You must be signed in to change notification settings - Fork 212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Preemptively filter out Rekognition tags #4667
Conversation
@@ -133,12 +136,14 @@ def cleanup_tags(tags): | |||
if not tags: | |||
return None | |||
for tag in tags: | |||
below_threshold = False | |||
alt_filtered = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code has been bothering me for a long time because it's difficult to understand all branches. What if we convert it to something like this:
if not isinstance(tag.get("name"), str) or "accuracy" in tag and float(tag["accuracy"]) < TAG_MIN_CONFIDENCE or
tag.get("provider") in TAG_SKIP_PROVIDERS:
should_filter = True
else:
lower_tag = tag["name"].lower()
should_filter = _tag_denylisted(lower_tag)
This will also be a bit better for performance because if the tag is {}
, we won't check for it's provider and accuracy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good question - hopefully this code is going away long-term, but I'll try and see if I can simplify/optimize it without spending too much time on it :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at this again, we log specifically if the tag is malformed based on the current set of conditions, but then don't log anything if it just gets filtered. So I think, given this will eventually be revised, I'll just leave it as-is for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need to log that? Malformed in this case means that a tag is either not a dictionary or doesn't have a string name value, right?
I think we've already removed those from the catalog db, haven't we?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right that I think we have - that said, I'm already working on #4541 and have altered the logic in these steps anyway for that change. If you have a blocking objection here than I can change it, otherwise I'd rather wait until we have to rework these cleanup steps entirely as part of the data refresh removal!
@@ -55,6 +55,9 @@ | |||
# Filter out low-confidence tags, which indicate that the machine-generated tag | |||
# may be inaccurate. | |||
TAG_MIN_CONFIDENCE = 0.90 | |||
# Filter out tags that match the following providers (either because they haven't | |||
# been vetted or because they are known to be low-quality). | |||
TAG_SKIP_PROVIDERS = {"rekognition"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name is kind of confusing to me. Could we change it to something like
EXCLUDED_TAG_PROVIDERS
or FILTERED_TAG_PROVIDERS
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had the hardest time naming this 😆 I wanted TAG
at the front to match the previous constant, but this fits much better!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested faking some recognition tags in the catalog, then ran a data refresh, and they were successfully removed on the API DB 👍 Nice!
Fixes
Fixes #4644 by @AetherUnbound
Description
This PR adds an additional layer of filtering during the "tags cleanup" step of the (current) ingestion server which filters out tags that correspond to a specific provider. For the time being, only
rekognition
tags are filtered.Testing Instructions
I have been having some trouble getting
ov
to run these commands, but it might just be my desktop.Running
just ingestion_server/test-local
should run the tests, which all should pass!Checklist
Update index.md
).main
) or a parent feature branch.ov just catalog/generate-docs
for catalogPRs) or the media properties generator (
ov just catalog/generate-docs media-props
for the catalog or
ov just api/generate-docs
for the API) where applicable.Developer Certificate of Origin
Developer Certificate of Origin