Add specifics of Rekognition tag filtering to implementation plan #4784
Conversation
Full-stack documentation preview: https://docs.openverse.org/_preview/4784 — please note that GitHub Pages takes a little time to deploy newly pushed code. If the link above doesn't work or you see old versions, wait 5 minutes and try again. You can check the GitHub Pages deployment action list to see the current status of the deployments.
> For all machine-generated labels, we will employ an inclusion-based filtering
> process. This means that we will only filter out labels that match the list of
> approved labels, which prevents labels that are unreviewed from appearing in the
> downstream dataset. This can be added to the `alter_data` step of the data
> refresh (see #4684) and would only be applied to tags where the `provider` was
> not the record's `provider`.
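To make the proposed behavior concrete, here is a minimal sketch of what inclusion-based filtering during the `alter_data` step could look like. The names (`INCLUDED_LABELS`, `filter_tags`) and the example labels are hypothetical, not the actual Openverse implementation; the key points are that tags from the record's own provider pass through untouched, and machine-generated tags are kept only when they appear on the approved list.

```python
# Hypothetical inclusion-based tag filtering sketch. INCLUDED_LABELS stands in
# for the reviewed-and-approved label list; the real list would be much larger.
INCLUDED_LABELS = {"cat", "dog", "bicycle"}


def filter_tags(tags: list[dict], record_provider: str) -> list[dict]:
    """Keep a record's own provider tags as-is; keep machine-generated tags
    (those whose provider differs from the record's provider) only if they
    appear on the inclusion list. Comparison is case insensitive."""
    filtered = []
    for tag in tags:
        if tag.get("provider") == record_provider:
            # Tags supplied by the record's provider are never filtered here.
            filtered.append(tag)
        elif tag["name"].lower() in INCLUDED_LABELS:
            filtered.append(tag)
    return filtered
```

Everything not matched is silently dropped from the downstream dataset, which is what keeps unreviewed labels out of search.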
I think it would be generally helpful to clarify that the filtering of the tags is in relation to their use in search?
Suggested change:

```diff
 For all machine-generated labels, we will employ an inclusion-based filtering
-process. This means that we will only filter out labels that match the list of
+process. This means that we will only include labels in search that match the list of
 approved labels, which prevents labels that are unreviewed from appearing in the
-downstream dataset. This can be added to the `alter_data` step of the data
-refresh (see #4684) and would only be applied to tags where the `provider` was
-not the record's `provider`.
+downstream dataset. We will add this to the `alter_data` step of the data
+refresh (see #4684) and it will only be applied to labels with the provider
+corresponding to the include list's provider.
```
The inclusion lists should be per-provider, right? The methodology behind these labels is not the same for every provider, and we may not find the same value or benefit from one provider to another. Especially considering we know nothing about how or where the Clarifai dataset comes from (which really gives me pause about our continued use of it at all... but that's a separate question).

We should also clarify that the comparisons used for filtering must be case insensitive, because some labels receive orthographic changes upon ingestion into the catalogue. This is more to clarify the reason than the particular implementation detail. But also, we've made changes to the contents of the labels. Does that mean, for example, that we actually need to modify the inclusion list from the provider anyway, like changing "fireman" to "firefighter" and "RAM memory" to "RAM"? Those won't match the provider list even if compared case insensitively.

Does that mean we need to maintain those orthographic translations in multiple places, in the inclusion list and in the code itself? Or should we keep them in a mapping that can be used to retrieve our "fixed" version of the label as well as the provider's original label from the fixed version? Then the ingestion transformation can use the "provider -> fixed" mapping, and the data refresh filtering can use the "fixed -> provider" mapping to retrieve the original provider label from any fixed ones (falling back to using the label as-is when no mapping exists, of course).
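The single-mapping idea raised above could be kept in one place and both directions derived from it. This is only an illustrative sketch under that assumption; the dictionary contents and function names (`fix_label`, `original_label`) are hypothetical, not part of the actual proposal.

```python
# Hypothetical single source of truth for orthographic changes. The reverse
# mapping is derived rather than maintained by hand, so the two can't drift.
PROVIDER_TO_FIXED = {
    "fireman": "firefighter",
    "ram memory": "ram",
}
FIXED_TO_PROVIDER = {fixed: original for original, fixed in PROVIDER_TO_FIXED.items()}


def fix_label(provider_label: str) -> str:
    """At ingestion: translate a provider label to our fixed form,
    normalizing case first; pass unmapped labels through unchanged."""
    normalized = provider_label.lower()
    return PROVIDER_TO_FIXED.get(normalized, normalized)


def original_label(fixed_label: str) -> str:
    """At filtering/provenance time: recover the provider's original label,
    falling back to the label itself when no mapping exists."""
    return FIXED_TO_PROVIDER.get(fixed_label, fixed_label)
```

The ingestion DAG would use `fix_label`, while anything needing the provider's original term would use `original_label`.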
The inclusion list should cover all providers, IMO; I'm not sure what benefit we would get from maintaining provider-specific inclusion lists. We don't do this for sensitive terms: we've simply applied a blanket policy across all of our data regarding term sensitivity. Given the paradigm of "included" + "all reviewed" label lists, this would actually prevent us from having to exclude the Clarifai labels wholesale in the same way we're currently doing for Rekognition, since the only Clarifai labels that would pass through the alter data steps are labels which we've already reviewed and approved. We could then go in and capture all the "unreviewed" labels from Clarifai and add them to either the reviewed list or the inclusion list. Having per-provider lists would allow us finer-grained control, but re-reviewing label sets that may have significant crossover seems like unnecessary extra work on our part. I think we can treat this in the same way as the sensitive terms list.

Yes, I can add more information about the case-insensitive comparison. As for the cases of "fireman" or "RAM memory", those two would go into the "all reviewed labels" list, but only "firefighter" and "ram" would be in the "included labels" list. I don't think we need to maintain a mapping in perpetuity within the code.
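The relationship between the two lists described here can be sketched briefly. This is illustrative only, with made-up label contents: the "all reviewed" list is a superset containing every label we have looked at, approved or not, while the inclusion list holds only the approved subset; anything outside the reviewed list is what still needs review.

```python
# Illustrative two-list scheme: everything reviewed (approved or rejected)
# versus the subset approved for inclusion in search.
ALL_REVIEWED_LABELS = {"fireman", "firefighter", "ram memory", "ram", "man", "woman"}
INCLUDED_LABELS = {"firefighter", "ram"}


def unreviewed(labels: set[str]) -> set[str]:
    """Labels that have not yet been through review at all."""
    return {label for label in labels if label.lower() not in ALL_REVIEWED_LABELS}
```

By construction the inclusion list should always be a subset of the reviewed list.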
Sensitive terms are different, though: the dataset they are compared against is by definition "all English-language words", because we don't know the extent of all terms present in titles, descriptions, and provider tags. Tags from a machine vision model are not like that; they are bounded in their scope of terms and have some particular methodology (or probably, at best, "pattern") for deriving those terms.

I agree there will be significant overlap, and for some classes of terms (indeed, maybe for all the terms we have now) we can say we will not include them from any provider (like gender terms such as "man" and "woman"). But one provider may have labels we think are okay based on its methodology that are not okay coming from others. One model might confidently mislabel people as animals whereas another may not; distinguishing between those would be useful. That's especially important because we are specifically using an inclusion list rather than an exclusion list. We're in some way saying "we definitely want these labels"; applying that as a blanket to all models doesn't seem wise.

For example, we are including the label "Barbie" based on it being about the toy, not a descriptor of a person. What if the Clarifai model also has that label but it isn't part of the "toys" category of labels? It's a fine distinction in this case and one we would want control over; otherwise we accept that we will always be working from the lowest common denominator of all relevant models' capabilities.
I don't feel too strongly here, though. We don't know the contents of the Clarifai labels so it's all theoretical. If we are trying to be careful (I believe we are), I don't think it's a big ask to review each label from each source independently of each other. We may even come to new conclusions about labels generally by doing so. I mean, especially because we do not know the provenance of the Clarifai labels, literally at all. We have no information about where they came from, how they ended up in the catalogue, or what model was used to generate them. Why wouldn't we want to review all of them, even if we just end up including the same set? What does it cost us other than a bit of time? It's even a good opportunity to have different people than you, me, and Krystle involved in reviewing those labels, perhaps even someone from outside the core maintainers.
But, if you feel strongly there is no issue here, I won't push for separate lists beyond what I've said here.
> And for the cases of "fireman" or "ram memory", those two would go into the "all reviewed labels" list, but only "firefighter" and "ram" would be in the "included labels" list. I don't think we need to maintain a mapping in perpetuity within the code.
Shouldn't we maintain the mapping in code at least for provenance? I would hate to be the person who, in two years, tries to figure out how Openverse ended up with labels ostensibly from Rekognition which Rekognition does not itself say it generates. And how would we even do the mapping without it being in code? Wouldn't it be in the label ingestion DAG, which we would theoretically use as at least the starting point, if not exactly as-is, for ingesting future labels from Rekognition?

Is the idea that we would transform the labels in both the inclusion list and the all-labels list? We'll need to document the transformations there too, in that case, for the same reason. Per-provider lists seem especially important if we're doing this kind of mapping and then not explicitly documenting the intention.
Maybe I am being too cautious. To be fully transparent, I am terrified by the use of these labels and despite the apparent benefits still maintain extreme reservations about all of this. Perhaps that is motivating an overly cautious approach from me on these. Input from someone else would be good to have here.
I see the process described in the proposal as a good incremental approach. If we weren't previously filtering Clarifai tags by any criteria other than accuracy (and the general excluded terms), then we can continue with that, mainly to limit the scope of the PR and of the project itself, which centres on Rekognition. If we see specific problems with the Clarifai tags, then we can prioritize the work you're proposing, @sarayourfriend. This is not to say that your suggestion of keeping a list per provider would not be helpful; I agree it would give us more control, but I think it's beneficial to start with Rekognition and keep making progress.
We have already added issues to evaluate Clarifai tags to the project's scope, for what it's worth. Past history of not doing due diligence with these machine-generated labels (or of not documenting it if it was done; we have no way of knowing) isn't a reason to continue ignoring the need for it.

> If we see specific problems with the Clarifai tags then we can prioritize that work you're proposing

Sure. We've already committed to looking at the tags anyway. It isn't a big thing to add later, unless I'm missing something that would make it more complex than a for loop over a list of provider lists 🤷
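For illustration of how small that change would be, here is a hypothetical sketch of per-provider inclusion lists; the dictionary name, providers, and labels are made up for the example, and this is not code from the PR.

```python
# Hypothetical per-provider inclusion lists: the lookup differs from the
# single-list approach only in that the list is selected by label provider.
INCLUSION_LISTS = {
    "rekognition": {"firefighter", "bicycle"},
    "clarifai": {"bicycle"},
}


def is_included(label: str, label_provider: str) -> bool:
    """Case-insensitive membership check against the provider's own list;
    an unknown provider has an empty list, so nothing passes."""
    included = INCLUSION_LISTS.get(label_provider, set())
    return label.lower() in included
```

Collapsing `INCLUSION_LISTS` back to a single shared set would restore the blanket-list behaviour, which is part of why the decision is reversible.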
I do think it incurs a greater maintenance burden, both by way of review and code, to be maintaining separate lists. If we find there is significant enough semantic or cultural distinction between the tags that different providers use, then we can always move to maintaining separate lists down the line. Right now we are only concerned with Rekognition and Clarifai labels, since we don't have any other clear plans for incorporating other sources of machine-generated labels. As such, I feel comfortable taking this approach knowing it's simpler on the maintenance end, and it's a reversible decision should we choose to change it down the line.
> Shouldn't we maintain the mapping in code at least for provenance?
The mapping has been codified, along with the other labels we're excluding and the full list of reviewed labels, in #4795. It will then also be present in the DAG code which is ingesting the labels from the Rekognition dataset and inserting them to the catalog.
The transformed label would be in the inclusion list, and both it and the original label would be present in the all reviewed labels list.
> It will then also be present in the DAG code which is ingesting the labels from the Rekognition dataset and inserting them to the catalog.
Okay. I didn't understand that would be the case. You also said:
> I don't think we need to maintain a mapping in perpetuity within the code.

> and both it and the original label would be present in the all reviewed labels list

Okay, I see. Including it in the all-reviewed list alongside the original means it won't get flagged as unreviewed. That bit, including both in the reviewed list, was the part I was missing to understand how it would be consistent.
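The consistency property being agreed on here can be stated as a small check. This is a sketch with hypothetical names and contents: for every orthographic translation, the fixed label must be on the inclusion list, and both the fixed and the original provider labels must be on the all-reviewed list so that neither is ever flagged as unreviewed.

```python
# Illustrative invariant check for the mapping + two-list scheme.
PROVIDER_TO_FIXED = {"fireman": "firefighter"}
INCLUDED_LABELS = {"firefighter"}
ALL_REVIEWED_LABELS = {"fireman", "firefighter"}

for original, fixed in PROVIDER_TO_FIXED.items():
    # The fixed form is what actually appears in search.
    assert fixed in INCLUDED_LABELS
    # Both forms count as reviewed, so neither shows up as unreviewed.
    assert original in ALL_REVIEWED_LABELS and fixed in ALL_REVIEWED_LABELS
```

A check like this could live in the tests for whichever module codifies the lists, so the invariant holds as labels are added.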
Ahh, apologies for communicating that in a confusing way! Including alongside the original was the point I was trying to convey 🙂
...osals/rekognition_data/20240530-implementation_plan_augment_catalog_with_rekognition_tags.md
Co-authored-by: sarayourfriend <[email protected]>
Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR: @krysal. Excluding weekend days, this PR was ready for review 2 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s). @AetherUnbound, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.
LGTM! The ideas emerging from the conversation are interesting and worth continuing to explore; we can do that in parallel.
I didn't realise I was still set as requesting changes; nothing to add beyond the discussion, and no changes to the text from me 👍
This LGTM; I agree strongly with the initial approach laid out here. Any additional per-provider considerations discovered in the future should be easy to iterate on without introducing complexity prematurely. Interesting discussion, thanks for the clarifications.
Description
This was prompted by a discussion in #4643.
This PR adds some clarity around the Rekognition tag filtering step, including adding an explicit step to implement the filtering using an inclusion-based list after the tags are inserted. See the discussion linked above for the rationale, and the changed text for the details.
Checklist

- My pull request has a descriptive title (not a vague title like "Update index.md").
- My pull request targets the default branch of the repository (`main`) or a parent feature branch.
- I ran the DAG documentation generator (`ov just catalog/generate-docs` for catalog PRs) or the media properties generator (`ov just catalog/generate-docs media-props` for the catalog or `ov just api/generate-docs` for the API) where applicable.

Developer Certificate of Origin