-
Notifications
You must be signed in to change notification settings - Fork 213
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
IP: Undo split indices for sensitive text detection #4904
Conversation
b5f73fd
to
d616bf8
Compare
Full-stack documentation: https://docs.openverse.org/_preview/4904 Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again. You can check the GitHub pages deployment action list to see the current status of the deployments. New files ➕: |
...ecting_sensitive_textual_content/20240903-implementation_plan_undo_split_filtered_indices.md
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The plan looks good to me. The steps are logical, the changes to the API look correct and the approximate analysis of the performance impact also makes sense.
...ecting_sensitive_textual_content/20240903-implementation_plan_undo_split_filtered_indices.md
Show resolved
Hide resolved
...ecting_sensitive_textual_content/20240903-implementation_plan_undo_split_filtered_indices.md
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sarayourfriend this looks excellent but I'd like to suggest one addition: Could you define specific prerequisites for the "cleanup" steps? My thinking is that in the past on some projects, Nuxt 3 being a recent example, we have jumped into cleanup work somewhat hastily and perhaps without sufficient assurance that our changes were stable.
Sure thing, good call out. When I get to revision (after Staci reviews for clarification round), I'll add something like the following:
Does that sound alright? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great to me, @sarayourfriend -- I had a question about the indexer worker in the local dev environment, but that should be easily handled. I'm curious about your thoughts on the ingestion approach, but I think this approach will work well and I see the tradeoffs.
...ecting_sensitive_textual_content/20240903-implementation_plan_undo_split_filtered_indices.md
Outdated
Show resolved
Hide resolved
...ecting_sensitive_textual_content/20240903-implementation_plan_undo_split_filtered_indices.md
Show resolved
Hide resolved
I've also been thinking about this IP for the last week and regretting my recommendation of the
The last advantage particularly applies in the context of a catalogue-based approach like the one Staci asked about in this comment. |
@zackkrida I've added details for a cool-off period in the plan in this commit: 13c2beb @stacimc I've added details about the discussion we had yesterday re: moving the check into Airflow in this commit: 27b3781 That second commit also includes the update to use a sensitivity object with boolean properties and a denormalised I am waiting on one last question to clarify from @stacimc that I sent in Slack regarding how the indexer workers are used, and then I will be able to make a small change (it will be small either way) to address this clarification Staci mentioned regarding the ephemerality of the indexer workers. |
27b3781
to
e68381a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fantastic! I love the addition of the denormalized any
as well 👍
Thanks for indulging my questions about the approach -- and for so clearly documenting what ended up being a very complex discussion! 😄 Especially as we talk about the many different places data transformation is happening in our pipelines, it's so nice to really nail down our priorities and consider the options thoroughly. I feel very confident with the approach you've outlined here, cheers!
...ecting_sensitive_textual_content/20240903-implementation_plan_undo_split_filtered_indices.md
Outdated
Show resolved
Hide resolved
Co-authored-by: Staci Mullins <[email protected]>
...ecting_sensitive_textual_content/20240903-implementation_plan_undo_split_filtered_indices.md
Outdated
Show resolved
Hide resolved
This is past the deadline, and I've pinged in Slack with no response, so I'm going to merge based on Dhruv's previous PR review with the approval. Thanks for the reviews, y'all. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I finally got to reply to this. Fantastic write-up, @sarayourfriend! Everything seems evaluated and well explained. Compared to the alternatives, the selected approach sounds wonderfully simple (not necessarily easy). I'm eager to try it, but I want to refrain from interfering since you have expressed the intention of continuing with it.
I left minor comments that don't block anything. Would you like me to merge it as is? Do you have the list of issues for a milestone?
Additionally, a denormalised field `sensitivity.any` will be added to simplify | ||
our current most-common query case, where we query for works that have no known | ||
sensitivity designations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well thought out, this simplifies the case a lot!
`Path(__file__).parent / f"sensitive_terms-{target_index}.txt"`. This | ||
function should check an environment variable for the network location of | ||
the sensitive terms list. If that variable is undefined, it should simply |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this environment variable SENSITIVE_TERMS_LOC
? It's mentioned below, but where it comes from needs to be clarified.
[^provider-supplied-sensitivity]: | ||
[Please see the note in the linked code above regarding provider supplied sensitivity](https://github.com/WordPress/openverse/blob/46a42f7e2c2409d7a8377ce188f4fafb96d5fdec/api/api/constants/sensitivity.py#L4-L7). | ||
This plan makes no explicit consideration for provider supplied sensitivity. | ||
However, I believe the approach described in this plan increases, or at | ||
least maintains, our flexibility in the event it becomes relevant (i.e., we | ||
start intentionally ingesting works explicitly designated as sensitive or | ||
mature by the source). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wondered about this when reading the rendered version in the preview link. I wrote a comment, but I removed it after getting to the footnotes, which answered my question :) I think it's worth leaving this part as a paragraph rather than a footnote.
@krysal please feel free to make any edits you'd like directly to the IP and merge it as you'd prefer. I can assist in the work if you'd like, otherwise, please go ahead and implement it however you see fit 👍 |
Co-authored-by: Staci Mullins <[email protected]>
Fixes
Part of #3336 by @AetherUnbound
Description
This discussion is following the Openverse decision-making process. Information about this process can be found on the Openverse documentation site. Requested reviewers or participants will be following this process. If you are being asked to give input on a specific detail, you do not need to familiarise yourself with the process and follow it.
Current round
This discussion is currently in the Decision round.
The deadline for review of this round is 2024-09-25.
Checklist
Update index.md
).main
) or a parent feature branch.ov just catalog/generate-docs
for catalogPRs) or the media properties generator (
ov just catalog/generate-docs media-props
for the catalog or
ov just api/generate-docs
for the API) where applicable.Developer Certificate of Origin
Developer Certificate of Origin