[Feature] Use metadata keywords to help detect if something is NSFW (original #482) #750
Comments
@WordPress/openverse-catalog This seems like an issue that might make more sense to put into the catalogue. Detecting these keywords might fit nicely into initial ingestion, and getting it for historical records could follow the same pattern as the proposed data normalisation RFC (#345). Should we move this issue? Side note: this is one of the projects mentioned as part of the content safety lighthouse goal in the project planning spreadsheet (#343).
This seems reasonable to me, although as you mention it would require some effort to establish for historical data. I would support this. It seems like one of the original comments here supported this approach as well.
I came across a potentially interesting approach for doing this, with its own tradeoffs, naturally. ES index aliases support filtering: https://www.elastic.co/guide/en/elasticsearch/reference/6.2/indices-aliases.html#filtered What if we put sensitive media into their own indexes (sensitive-images, sensitive-audio, and so on), filtered by the sensitive term list? This would mean that updates to our term list require reindexing, which we currently do weekly as part of the data refresh anyway. I think the idea there is that the sensitive term list can change at all, not that those changes need to be instantaneous.
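For reference, a filtered alias along those lines could be set up roughly like this (a minimal sketch using the elasticsearch-py client; the index name, alias name, `tags.name` field, and term list are illustrative assumptions, not the actual Openverse schema):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Hypothetical sensitive term list; the real list would be much longer.
SENSITIVE_TERMS = ["term1", "term2"]

# Create an alias over the image index that only exposes documents whose
# tags match a sensitive term. Note the filter is applied at query time
# to every request made against the alias; the underlying index itself
# is unchanged.
es.indices.put_alias(
    index="image",
    name="sensitive-image",
    body={"filter": {"terms": {"tags.name": SENSITIVE_TERMS}}},
)
```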
I assume that approach would perform better overall relative to the number of terms. It sounds like a good approach to me!
BTW, the correct documentation for our current version of Elasticsearch for the index aliases is this: https://www.elastic.co/guide/en/elasticsearch/reference/7.17/aliases.html#filter-alias

Upon further investigation of the feature, I don't think it is worth spending time trying it, at least not for performance reasons. The filtered alias does not "pre-filter" the documents; it just applies the filter to every query against that index (as far as I can tell based on filtered alias performance questions raised online). However, it got me digging around because it seemed like something ES should be able to do (create a new index "type thing" from an existing index based on a filter). Indeed, it can: https://www.elastic.co/guide/en/elasticsearch/reference/7.17/docs-reindex.html

Using the reindex API, we can tell ES to "reindex" an existing index into a differently named index and use a filter to select the documents from the origin index. So essentially we'd be copying the origin index into a new index that excludes documents matching sensitive terms.

I'm going to move this issue back to the API repository as it now seems to be squarely in the domain of the API/ingestion server rather than the catalogue, at least under our currently agreed-upon approach for sensitive term filtering. That said, while I think it is a good idea (for query performance) to create these secondary indexes that exclude documents matching sensitive terms, there are a couple of questions/complications I want to raise early so that we can consider them.
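For concreteness, the reindex step described above might look something like this (a minimal sketch; the index names, queried fields, and term list are assumptions for illustration, not the actual data refresh code):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Hypothetical sensitive term list.
SENSITIVE_TERMS = ["term1", "term2"]

# Copy every document that does NOT match a sensitive term into a new,
# physically separate index. Queries against the new index then need no
# per-request filter, unlike the filtered-alias approach.
es.reindex(
    body={
        "source": {
            "index": "image",
            "query": {
                "bool": {
                    "must_not": [
                        {"terms": {"tags.name": SENSITIVE_TERMS}},
                        {"terms": {"title": SENSITIVE_TERMS}},
                    ]
                }
            },
        },
        "dest": {"index": "image-filtered"},
    },
    wait_for_completion=False,  # run as a background task for large indexes
)
```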
Just +1'ing here because I don't want to let my kids use Openverse for their school projects yet when results include NSFW images (I even hit this when searching for something for my own demo site).
Thinking more about this, it's far more complicated to rename the origin index at this point, so we could just use the word "filtered" for the new one (which has the same benefits of
I've updated the PR linked to this issue to also enable the creation of filtered indexes. We can remove the Django API behaviour depending on whether we decide it's worth keeping around. The PR creates a new action for the reindexing rather than trying to add it to an existing step. This means we'll need to update the data refresh DAG to call the new action, as well as the "POINT_ALIAS" action afterwards, mirroring the changes made to
@sarayourfriend should we close this issue in favor of some of the other plans/RFCs currently ongoing? Or will we just end up using this issue for that work?
The project thread (#377) references it. We could close this issue once the project thread is closed, or close it now as a duplicate; I have no preference.
I'll go ahead and close the issue, it's linked for context as you mention anyway 🙂
This issue has been migrated from the CC Search API repository
Problem Description
We are trying to make NSFW content in CC Search "opt-in". We can catch a lot of NSFW content by using API-specific filters and relying on moderation "upstream" at the source, but sometimes things slip through.
Solution Description
One way we can help prevent this is scanning for NSFW profanity and slurs in the title/tags/artist name and setting
nsfw = True
in the metadata field if a record fails the check. There are third-party lists of dirty words that can help us achieve this. In my experience moderating content on CC Search, this will help prevent a lot of embarrassment and indignant emails from teachers. We can do a one-time scan-and-filter relatively easily, but we will also need a way to filter new content as it is ingested.
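A minimal sketch of such a check might look like the following (the field names, the `meta_data` structure, and the term list are hypothetical, not the actual catalog schema). Matching on word boundaries rather than raw substrings avoids the worst of the Scunthorpe problem noted in the additional context below:

```python
import re

# Hypothetical denylist; a real third-party list would be much longer.
NSFW_TERMS = {"badword1", "badword2"}

# \b word boundaries mean "Scunthorpe" is not flagged merely because it
# happens to contain a banned substring (the Scunthorpe problem).
NSFW_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in NSFW_TERMS) + r")\b",
    re.IGNORECASE,
)

def is_nsfw(record: dict) -> bool:
    """Check the title, tags, and creator name against the term list."""
    fields = [
        record.get("title") or "",
        record.get("creator") or "",
        " ".join(tag.get("name", "") for tag in record.get("tags") or []),
    ]
    return any(NSFW_PATTERN.search(field) for field in fields)

# At ingestion time, a failing record would then be marked, e.g.:
# record["meta_data"]["nsfw"] = True
```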
Additional Context
The Scunthorpe Problem
Original Comments:
aldenstpage commented on Tue Apr 21 2020:
Issue author brenoferreira commented on Tue Apr 21 2020:
kss682 commented on Wed Apr 22 2020: