
Add specifics of Rekognition tag filtering to implementation plan #4784

Merged
merged 5 commits into main from docs/rekognition-impl-clarify-action
Aug 23, 2024

Conversation

AetherUnbound
Collaborator

Description

This was prompted by a discussion in #4643.

This PR adds some clarity around the Rekognition tag filtering step, including an explicit step to implement the filtering using an inclusion-based list after the tags are inserted. See the discussion linked above for the rationale, and the changed text for the details of the approach.

Checklist

  • My pull request has a descriptive title (not a vague title like `Update index.md`).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (`ov just catalog/generate-docs` for catalog
    PRs) or the media properties generator (`ov just catalog/generate-docs media-props`
    for the catalog or `ov just api/generate-docs` for the API) where applicable.

Developer Certificate of Origin

Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@AetherUnbound AetherUnbound requested a review from a team as a code owner August 19, 2024 22:49
@AetherUnbound AetherUnbound added 🟧 priority: high Stalls work on the project or its dependents ✨ goal: improvement Improvement to an existing user-facing feature 📄 aspect: text Concerns the textual material in the repository 🧱 stack: catalog Related to the catalog and Airflow DAGs labels Aug 19, 2024
@AetherUnbound AetherUnbound removed the request for review from a team August 19, 2024 22:49
@AetherUnbound AetherUnbound added the 🧭 project: implementation plan An implementation plan for a project label Aug 19, 2024
@openverse-bot openverse-bot added the 🧱 stack: documentation Related to Sphinx documentation label Aug 19, 2024

Full-stack documentation: https://docs.openverse.org/_preview/4784

Please note that GitHub Pages takes a little time to deploy newly pushed code. If the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.


Comment on lines +525 to +530
For all machine-generated labels, we will employ an inclusion-based filtering
process. This means that we will only filter out labels that match the list of
approved labels, which prevents labels that are unreviewed from appearing in the
downstream dataset. This can be added to the `alter_data` step of the data
refresh (see #4684) and would only be applied to tags where the `provider` was
not the record's `provider`.
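The quoted plan is concrete enough to sketch. Below is a minimal, hypothetical illustration of what an inclusion-based filter inside an `alter_data`-style step might look like; the names (`APPROVED_LABELS`, `filter_machine_tags`) and the example entries are assumptions for illustration, not Openverse's actual code.

```python
# Hypothetical sketch of the inclusion-based filtering described above.
# The approved (inclusion) list: only machine-generated labels on this
# list survive the data refresh.
APPROVED_LABELS = {"cat", "dog", "bicycle"}  # example entries only


def filter_machine_tags(record: dict) -> dict:
    """Drop machine-generated tags whose name is not on the approved list.

    Tags whose `provider` matches the record's own provider are
    creator-supplied and pass through untouched; only third-party
    (machine-generated) tags are subject to the inclusion check.
    """
    record["tags"] = [
        tag
        for tag in record.get("tags", [])
        if tag.get("provider") == record["provider"]  # creator tags: keep
        or tag.get("name") in APPROVED_LABELS  # machine tags: must be approved
    ]
    return record
```

In this sketch, a Flickr record's own "sunset" tag is always kept, while a Rekognition-provided tag is kept only if it appears in the approved list.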
Collaborator

I think it would be generally helpful to clarify that the filtering of the tags is in relation to their use in search?

Suggested change
For all machine-generated labels, we will employ an inclusion-based filtering
process. This means that we will only filter out labels that match the list of
approved labels, which prevents labels that are unreviewed from appearing in the
downstream dataset. This can be added to the `alter_data` step of the data
refresh (see #4684) and would only be applied to tags where the `provider` was
not the record's `provider`.
For all machine-generated labels, we will employ an inclusion-based filtering
process. This means that we will only include labels in search that match the list of
approved labels, which prevents labels that are unreviewed from appearing in the
downstream dataset. We will add this to the `alter_data` step of the data
refresh (see #4684) and will only be applied to labels with the provider corresponding to the include list's provider.

The inclusion lists should be per-provider, right? The methodologies behind these labels are not all equal, and we may not find the same value or benefits from one provider to another. Especially considering we know nothing about how or where the Clarifai dataset comes from (which really gives me pause vis-à-vis our continued use of them at all... but that's a separate question).

Should also clarify that the comparisons used for filtering must be case insensitive, due to some labels receiving orthographic changes upon ingestion into the catalogue. This is more to clarify the reason than the particular implementation detail. But also, we've made changes to the contents of the labels. That means, for example, that we actually need to modify the inclusion list from the provider anyway, like changing "fireman" to "firefighter" or "RAM memory" to "RAM"? Those won't match against the provider list even if the comparison is case insensitive.

Does that mean we need to maintain those orthographic translations in multiple places, in the inclusion list and in the code itself? Or should we keep them in a mapping that can be used to retrieve our "fixed" version of the label as well as the provider's original label from the fixed version? Then the ingestion transformation can use the "provider -> fixed" mapping, and the data refresh filtering can use the "fixed -> provider" mapping to retrieve the original provider label from any fixed ones (and just using the label as is if a mapping doesn't exist, of course).
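The two concerns above (case-insensitive comparison, and recovering the provider's original label after an orthographic change) could be handled by a single mapping maintained in code, used in one direction at ingestion and in the other during the data refresh. Everything here, including the mapping entries and function names, is a hypothetical sketch:

```python
# Provider label -> our "fixed" label, as applied at ingestion time.
# Entries are illustrative examples from the discussion, not a real list.
PROVIDER_TO_FIXED = {
    "fireman": "firefighter",
    "ram memory": "ram",
}
# Reverse mapping for the data refresh side: fixed label -> provider original.
FIXED_TO_PROVIDER = {fixed: original for original, fixed in PROVIDER_TO_FIXED.items()}


def normalize(label: str) -> str:
    """Lowercase for comparison, since labels receive orthographic changes."""
    return label.strip().lower()


def fix_label(provider_label: str) -> str:
    """Ingestion-side transformation: apply the fixed spelling if one exists."""
    key = normalize(provider_label)
    return PROVIDER_TO_FIXED.get(key, key)


def original_label(fixed_label: str) -> str:
    """Data-refresh-side lookup: recover the provider's original label.

    Falls back to the label as-is when no mapping exists.
    """
    key = normalize(fixed_label)
    return FIXED_TO_PROVIDER.get(key, key)
```

Deriving the reverse mapping from the forward one means the translations live in exactly one place, which addresses the "maintain in multiple places" worry.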

Collaborator Author

The inclusion list should cover all providers IMO; I'm not sure what benefit we would get from maintaining provider-specific inclusion lists. We don't do this for sensitive terms: we've simply applied a blanket across all of our data regarding term sensitivity. Given the paradigm of "include + all reviewed" labels, this would actually prevent us from having to exclude the Clarifai labels wholesale in the same way we're currently doing for Rekognition, since the only Clarifai labels that would pass through the alter data steps are labels which we've already reviewed and approved. We could then go in and capture all the "unreviewed" labels from Clarifai, and either add them to the reviewed list or the inclusion list. Having per-provider lists would allow us finer-grained control, but re-reviewing label sets that may have significant crossover seems like unnecessary extra work on our part. I think we can treat this in the same way as the sensitive terms list.

Yes, I can add more information about the case insensitive comparison. And for the cases of "fireman" or "ram memory", those two would go into the "all reviewed labels" list, but only "firefighter" and "ram" would be in the "included labels" list. I don't think we need to maintain a mapping in perpetuity within the code.
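The two-list scheme described here can be sketched concretely (all entries hypothetical): the "all reviewed" list records every label we've looked at, including rejected spellings like "fireman", while the inclusion list is the subset that actually passes the filter; anything in neither list is flagged as unreviewed.

```python
# Hypothetical entries illustrating the "all reviewed" vs. "included" split.
ALL_REVIEWED_LABELS = {"fireman", "firefighter", "ram memory", "ram", "man", "woman"}
INCLUDED_LABELS = {"firefighter", "ram"}  # must be a subset of the reviewed list


def classify(label: str) -> str:
    """Classify a machine-generated label for filtering purposes."""
    key = label.lower()  # comparisons are case insensitive
    if key in INCLUDED_LABELS:
        return "included"  # passes the filter into search
    if key in ALL_REVIEWED_LABELS:
        return "reviewed-excluded"  # seen and deliberately left out
    return "unreviewed"  # candidate for a future review pass
```

Under this split, "fireman" is reviewed but excluded, "firefighter" is included, and a never-seen label surfaces as unreviewed rather than silently passing through.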

Collaborator

Sensitive terms are different, though: the dataset they are compared against is by definition "all English language words", because we don't know the extent of all terms present in title, description, and provider tags. Tags from a machine vision model are not like that; they are bounded in their scope of terms and have some particular methodology (or probably at best "pattern") for deriving those terms.

I agree there will be significant overlap, and for some classes of terms (indeed, maybe for all the terms we have now) we can say we will not include them from any provider (such as gender terms like man and woman). But one provider may have labels we think are okay based on their methodology and not okay based on others'. One model might confidently mislabel people as animals whereas another may not; distinguishing between those would be useful. That's especially important because we are specifically using an inclusion list rather than an exclusion list. We're in some way saying "we definitely want these labels", and applying that as a blanket to all models doesn't seem wise.

For example, we are including the label "Barbie" based on it being about the toy, not a descriptor of a person. What if the Clarifai model also has that label but it isn't part of the "toys" category of labels? It's a fine distinction in this case and one we would want control over, or else accept that we will always be working from the lowest common denominator of all relevant models' capabilities.

I don't feel too strongly here, though. We don't know the contents of the Clarifai labels, so it's all theoretical. If we are trying to be careful (I believe we are), I don't think it's a big ask to review each label from each source independently. We may even come to new conclusions about labels generally by doing so. Especially because we do not know the provenance of the Clarifai labels, literally at all: we have no information about where they came from, how they ended up in the catalogue, or what model was used to generate them. Why wouldn't we want to review all of them, even if we just end up including the same set? What does it cost us other than a bit of time? It's even a good opportunity to involve different people than you, me, and Krystle in reviewing those labels, perhaps even someone from outside the core maintainers.

But, if you feel strongly there is no issue here, I won't push for separate lists beyond what I've said here.

And for the cases of "fireman" or "ram memory", those two would go into the "all reviewed labels" list, but only "firefighter" and "ram" would be in the "included labels" list. I don't think we need to maintain a mapping in perpetuity within the code.

Shouldn't we maintain the mapping in code at least for provenance? I would hate to be the person in 2 years who tries to figure out how Openverse ended up with labels ostensibly from Rekognition which Rekognition does not itself say it generates. And how would we even do the mapping without it being in code? Wouldn't it be in the label ingestion DAG, which we would theoretically use as at least the building point if not exactly for ingesting future labels from Rekognition?

Is the idea that we would transform the labels in the inclusion list and in the all labels list? We'll need to document the transformations there too, in that case, for the same reason. Per-provider lists seems especially important if we're doing this kind of mapping and then not explicitly documenting the intention.

Maybe I am being too cautious. To be fully transparent, I am terrified by the use of these labels and despite the apparent benefits still maintain extreme reservations about all of this. Perhaps that is motivating an overly cautious approach from me on these. Input from someone else would be good to have here.

Member

I see the process described in the proposal as a good incremental approach. If we weren't previously filtering Clarifai tags by criteria other than accuracy (and the general excluded terms), then we can continue with it, mainly to limit the scope of the PR and the project itself, which centers on Rekognition. If we see specific problems with the Clarifai tags, then we can prioritize the work you're proposing, @sarayourfriend. This is not to say that your suggestion of keeping a list per provider would not be helpful; I agree that it would give us more control, but I think it's beneficial to start with Rekognition and continue to see some progress.

Collaborator

We have already added issues to evaluate Clarifai tags to the project's scope, for what it's worth. Past history of not doing due diligence with these machine-generated labels (or not documenting it if it was done; we have no way of knowing) isn't a reason to continue ignoring the need for it.

https://docs.openverse.org/projects/proposals/rekognition_data/20240530-implementation_plan_augment_catalog_with_rekognition_tags.html#filter-clarifai-tags

If we see specific problems with the Clarifai tags, then we can prioritize the work you're proposing

Sure. We've already committed to looking at the tags anyway. It isn't a big thing to add later, unless I'm missing something about this that would make it more complex than a for loop over a list of provider lists 🤷
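If per-provider lists were adopted later, the change is roughly the "for loop over a list of provider lists" mentioned here. A hypothetical sketch, with made-up list contents:

```python
# Hypothetical per-provider inclusion lists. A provider with no entry
# contributes no approved machine labels at all.
PER_PROVIDER_INCLUSION = {
    "rekognition": {"cat", "dog", "barbie"},
    "clarifai": {"cat", "dog"},  # e.g. reviewed independently, smaller set
}


def tag_is_included(tag: dict) -> bool:
    """Check a machine tag against the inclusion list for its own provider."""
    inclusion_list = PER_PROVIDER_INCLUSION.get(tag["provider"], set())
    return tag["name"].lower() in inclusion_list
```

This illustrates the control being argued for: "Barbie" could pass for one model's taxonomy while being excluded for another's, at the cost of maintaining and reviewing one list per provider.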

Collaborator Author

I do think it incurs a greater maintenance burden, both by way of review and code, to be maintaining separate lists. If we find there is significant enough semantic or cultural distinction between the tags that different providers use, then we can always move to maintaining separate lists down the line. Right now we are only concerned with Rekognition and Clarifai labels, since we don't have any other clear plans for incorporating other sources of machine-generated labels. As such, I feel comfortable taking this approach knowing it's simpler on the maintenance end, and it's a reversible decision should we choose to change it down the line.

Shouldn't we maintain the mapping in code at least for provenance?

The mapping has been codified, along with the other labels we're excluding and the full list of reviewed labels, in #4795. It will then also be present in the DAG code which ingests the labels from the Rekognition dataset and inserts them into the catalog.

The transformed label would be in the inclusion list, and both it and the original label would be present in the all reviewed labels list.

Collaborator

It will then also be present in the DAG code which is ingesting the labels from the Rekognition dataset and inserting them to the catalog.

Okay. I didn't understand that would be the case. You also said:

I don't think we need to maintain a mapping in perpetuity within the code.

and both it and the original label would be present in the all reviewed labels list

Okay, I see. Including it in the all-reviewed list alongside the original means it won't get flagged as unreviewed. That bit, including both in the reviewed list, was the part I was missing to understand how it would be consistent.

Collaborator Author

Ahh, apologies for communicating that in a confusing way! Including alongside the original was the point I was trying to convey 🙂

@openverse-bot

Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR:

@krysal
@stacimc
@sarayourfriend
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend¹ days, this PR was ready for review 2 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s)².

@AetherUnbound, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

  1. Specifically, Saturday and Sunday.

  2. For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.

Member

@krysal krysal left a comment

LGTM! The ideas emerging from the conversation are interesting and worth continuing to explore; we can do so in parallel.


Collaborator

@sarayourfriend sarayourfriend left a comment

I didn't realise I was still set to "request changes". Nothing to add beyond the discussion, and no changes to the text from me 👍

Collaborator

@stacimc stacimc left a comment

This LGTM -- I agree strongly with the initial approach laid out here; any additional per-provider considerations discovered in the future should be easily iterated upon, without introducing complexity prematurely. Interesting discussion, thanks for the clarifications.

@AetherUnbound AetherUnbound merged commit 1a86e40 into main Aug 23, 2024
56 checks passed
@AetherUnbound AetherUnbound deleted the docs/rekognition-impl-clarify-action branch August 23, 2024 16:45