IP: Undo split indices for sensitive text detection #4904

Merged
krysal merged 6 commits into main from add/undo-split-filtered-index
Oct 17, 2024

Conversation

sarayourfriend
Collaborator

@sarayourfriend sarayourfriend commented Sep 10, 2024

Fixes

Part of #3336 by @AetherUnbound

Description

This discussion is following the Openverse decision-making process. Information about this process can be found on the Openverse documentation site. Requested reviewers or participants will be following this process. If you are being asked to give input on a specific detail, you do not need to familiarise yourself with the process and follow it.

Current round

This discussion is currently in the Decision round.

The deadline for review of this round is 2024-09-25.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • [N/A] I added or updated tests for the changes I made (if applicable).
  • [N/A] I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • [N/A] I ran the DAG documentation generator (ov just catalog/generate-docs for catalog
    PRs) or the media properties generator (ov just catalog/generate-docs media-props
    for the catalog or ov just api/generate-docs for the API) where applicable.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@sarayourfriend sarayourfriend added 🟨 priority: medium Not blocking but should be addressed soon 🌟 goal: addition Addition of new feature 📄 aspect: text Concerns the textual material in the repository 🧱 stack: documentation Related to Sphinx documentation 🧭 project: implementation plan An implementation plan for a project labels Sep 10, 2024
@sarayourfriend sarayourfriend requested a review from a team as a code owner September 10, 2024 06:07
@sarayourfriend sarayourfriend requested review from krysal, stacimc and dhruvkb and removed request for a team and krysal September 10, 2024 06:07
@sarayourfriend sarayourfriend force-pushed the add/undo-split-filtered-index branch from b5f73fd to d616bf8 Compare September 10, 2024 06:19

Full-stack documentation: https://docs.openverse.org/_preview/4904

Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

New files ➕:

Member

@dhruvkb dhruvkb left a comment

The plan looks good to me. The steps are logical, the changes to the API look correct and the approximate analysis of the performance impact also makes sense.

Member

@zackkrida zackkrida left a comment

@sarayourfriend this looks excellent but I'd like to suggest one addition: Could you define specific prerequisites for the "cleanup" steps? My thinking is that in the past on some projects, Nuxt 3 being a recent example, we have jumped into cleanup work somewhat hastily and perhaps without sufficient assurance that our changes were stable.

@sarayourfriend
Collaborator Author

Could you define specific prerequisites for the "cleanup" steps?

Sure thing, good call out. When I get to revision (after Staci reviews for clarification round), I'll add something like the following:

Clean-up should occur only after 2 weeks of running the new approach in production, including two full production data refreshes. This is to ensure we sufficiently exercise the new approach, both during the data refresh and at query time, before taking actions that would make rolling back much more cumbersome.

Does that sound alright?

Collaborator

@stacimc stacimc left a comment

This looks great to me, @sarayourfriend -- I had a question about the indexer worker in the local dev environment, but that should be easily handled. I'm curious about your thoughts on the ingestion approach, but I think this approach will work well and I see the tradeoffs.

@sarayourfriend
Collaborator Author

I've also been thinking about this IP for the last week and regretting my recommendation of the sensitivity list. I think instead, an object of boolean properties like sensitivity: { text: boolean, user_reported: boolean } would be better. It could also have an any: boolean field as a normalised version of all the booleans in the object, which we could query against for simpler non-sensitive queries, which are the predominant kind of queries we make. Regardless of the version we go with, I want to change the IP to go with this approach. It has the following advantages:

  1. It does not require using a Painless script to update the document in the index (way simpler, less fiddly, easier to test and maintain).
  2. It produces an identical type of must_not term query on the boolean property/ies, matching our current query. Therefore, we can be confident it will have identical performance characteristics to our current query.
  3. It is potentially more flexible long term, as an object can much more naturally grow in properties to include things like which fields had sensitive text, which sensitive terms were detected, and so forth. Things that may be valuable or even essential information to improving how our sensitive text detection works in the future.

The last advantage particularly applies in the context of a catalogue-based approach like the one Staci asked about in this comment.
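
A minimal sketch of the object shape described above, for illustration only. The function name `build_sensitivity` and its parameters are hypothetical and do not come from the IP; only the field names `text`, `user_reported`, and `any` are taken from this comment.

```python
def build_sensitivity(has_sensitive_text: bool, is_user_reported: bool) -> dict[str, bool]:
    """Return a hypothetical ``sensitivity`` object for an indexed document."""
    flags = {
        "text": has_sensitive_text,
        "user_reported": is_user_reported,
    }
    # ``any`` is denormalised: it is derived from the other flags when the
    # document is built, so queries can check a single boolean field.
    flags["any"] = any(flags.values())
    return flags
```

Because `any` is computed from whatever flags are present, adding a new boolean property later would not require changing the derivation, which is one way the object could grow as described in the third advantage.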

@sarayourfriend
Collaborator Author

@zackkrida I've added details for a cool-off period in the plan in this commit: 13c2beb

@stacimc I've added details about the discussion we had yesterday re: moving the check into Airflow in this commit: 27b3781

That second commit also includes the update to use a sensitivity object with boolean properties and a denormalised any.

I am waiting on an answer from @stacimc to one last clarifying question I sent in Slack about how the indexer workers are used. Once I have it, I will make a small change (it will be small either way) to address the clarification Staci mentioned regarding the ephemerality of the indexer workers.

@sarayourfriend sarayourfriend force-pushed the add/undo-split-filtered-index branch from 27b3781 to e68381a Compare September 20, 2024 00:35
@sarayourfriend
Collaborator Author

@dhruvkb and @stacimc this is ready for y'all to take another look and make a decision or raise blockers. @dhruvkb I know you left an approval before but just wanted to wait until the decision round to lock it in so feel free to change your mind, of course! 🙂

Collaborator

@stacimc stacimc left a comment

Fantastic! I love the addition of the denormalized any as well 👍

Thanks for indulging my questions about the approach -- and for so clearly documenting what ended up being a very complex discussion! 😄 Especially as we talk about the many different places data transformation is happening in our pipelines, it's so nice to really nail down our priorities and consider the options thoroughly. I feel very confident with the approach you've outlined here, cheers!

Co-authored-by: Staci Mullins <[email protected]>
@sarayourfriend
Collaborator Author

This is past the deadline, and I've pinged in Slack with no response, so I'm going to merge based on Dhruv's previous PR review with the approval.

Thanks for the reviews, y'all.

Member

@krysal krysal left a comment

I finally got to reply to this. Fantastic write-up, @sarayourfriend! Everything seems evaluated and well explained. Compared to the alternatives, the selected approach sounds wonderfully simple (not necessarily easy). I'm eager to try it, but I want to refrain from interfering since you have expressed the intention of continuing with it.

I left minor comments that don't block anything. Would you like me to merge it as is? Do you have the list of issues for a milestone?

Comment on lines +61 to +63
Additionally, a denormalised field `sensitivity.any` will be added to simplify
our current most-common query case, where we query for works that have no known
sensitivity designations.
Member

Well thought out, this simplifies the case a lot!
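
To make the simplification concrete, here is a hedged sketch of the two query shapes as Elasticsearch query DSL built in Python. The function names are hypothetical, and the exact queries Openverse issues may differ; the sketch only assumes the `sensitivity.*` field names discussed in this thread.

```python
def non_sensitive_filter() -> dict:
    """Exclude documents with any known sensitivity, using the denormalised field."""
    return {"bool": {"must_not": [{"term": {"sensitivity.any": True}}]}}


def non_sensitive_filter_without_any(flags: list[str]) -> dict:
    """Equivalent filter without ``any``: one term clause per sensitivity flag."""
    return {
        "bool": {
            "must_not": [{"term": {f"sensitivity.{flag}": True}} for flag in flags]
        }
    }
```

With `sensitivity.any`, the common case stays a single `must_not` term clause no matter how many individual flags are added later.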

Comment on lines +302 to +304
`Path(__file__).parent / f"sensitive_terms-{target_index}.txt"`. This
function should check an environment variable for the network location of
the sensitive terms list. If that variable is undefined, it should simply
Member

Is this environment variable SENSITIVE_TERMS_LOC? It's mentioned below, but where it comes from needs to be clarified.
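
For readers following along, a rough sketch of the lookup being discussed. It assumes the variable is indeed named `SENSITIVE_TERMS_LOC`, which is exactly what this comment asks and is not confirmed here. The helper names are hypothetical. The quoted excerpt is cut off before it says what happens when the variable is undefined, so the sketch returns `None` in that case and leaves the fallback to the caller rather than guessing.

```python
import os
from pathlib import Path
from typing import Optional


def local_terms_path(target_index: str, base_dir: Path) -> Path:
    """Build the per-index file path quoted in the excerpt (hypothetical helper)."""
    return base_dir / f"sensitive_terms-{target_index}.txt"


def network_terms_location() -> Optional[str]:
    """Return the configured network location, or None if it is not set.

    Assumption: the variable name SENSITIVE_TERMS_LOC is unconfirmed.
    """
    # Treat an empty string the same as an unset variable.
    return os.environ.get("SENSITIVE_TERMS_LOC") or None
```

A caller would pass `Path(__file__).parent` as `base_dir` to reproduce the path expression from the plan.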

Comment on lines +73 to +79
[^provider-supplied-sensitivity]:
[Please see the note in the linked code above regarding provider supplied sensitivity](https://github.com/WordPress/openverse/blob/46a42f7e2c2409d7a8377ce188f4fafb96d5fdec/api/api/constants/sensitivity.py#L4-L7).
This plan makes no explicit consideration for provider supplied sensitivity.
However, I believe the approach described in this plan increases, or at
least maintains, our flexibility in the event it becomes relevant (i.e., we
start intentionally ingesting works explicitly designated as sensitive or
mature by the source).
Member

I wondered about this when reading the rendered version in the preview link. I wrote a comment, but I removed it after getting to the footnotes, which answered my question :) I think it's worth leaving this part as a paragraph rather than a footnote.

@sarayourfriend
Collaborator Author

@krysal please feel free to make any edits you'd like directly to the IP and merge it as you'd prefer. I can assist in the work if you'd like, otherwise, please go ahead and implement it however you see fit 👍

@krysal krysal merged commit cd0ee7c into main Oct 17, 2024
44 checks passed
@krysal krysal deleted the add/undo-split-filtered-index branch October 17, 2024 13:59
Danil49 pushed a commit to Danil49/openverse that referenced this pull request Oct 29, 2024
Labels
📄 aspect: text Concerns the textual material in the repository 🌟 goal: addition Addition of new feature 🟨 priority: medium Not blocking but should be addressed soon 🧭 project: implementation plan An implementation plan for a project 🧱 stack: documentation Related to Sphinx documentation
Projects
Status: Accepted
Archived in project

5 participants