
[Feature] Use metadata keywords to help detect if something is NSFW (original #482) #750

Closed
obulat opened this issue Apr 21, 2021 · 11 comments
Assignees
Labels
🕹 aspect: interface Concerns end-users' experience with the software ✨ goal: improvement Improvement to an existing user-facing feature 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: api Related to the Django API

Comments

obulat (Contributor) commented Apr 21, 2021

This issue has been migrated from the CC Search API repository

Author: aldenstpage
Date: Tue Apr 21 2020
Labels: 🏷 status: label work required, 🙅 status: discontinued

Problem Description

We are trying to make NSFW content in CC Search "opt-in". We can catch a lot of NSFW content by using API specific filters and relying on moderation "upstream" at the source, but sometimes things slip through.

Solution Description

One way we can help prevent this is by scanning the title, tags, and artist name for profanity and slurs, and setting nsfw = True in the metadata field when a record fails the check. There are third-party lists of dirty words that can help us achieve this. In my experience moderating content on CC Search, this will prevent a lot of embarrassment and indignant emails from teachers.

We can do a one-time scan-and-filter relatively easily, but we will also need a way to filter new content as it is ingested.
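As a rough illustration of the proposed check (the function names and term list here are hypothetical placeholders, not Openverse's actual implementation), a one-time scan-and-flag pass could look like this. Whole-word matching rather than substring matching helps avoid the Scunthorpe problem noted below:

```python
import re

# Placeholder list; a real deployment would load a vetted third-party word list.
SENSITIVE_TERMS = {"badword", "slur"}


def contains_sensitive_term(text: str) -> bool:
    # Match whole words only, to avoid Scunthorpe-style false positives
    # from sensitive substrings inside innocent words.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return any(token in SENSITIVE_TERMS for token in tokens)


def flag_if_sensitive(record: dict) -> dict:
    # Check the fields mentioned above: title, tags, and artist/creator name.
    fields = [record.get("title", ""), record.get("creator", ""), *record.get("tags", [])]
    if any(contains_sensitive_term(field) for field in fields):
        record.setdefault("meta_data", {})["nsfw"] = True
    return record
```

The same helper could then be reused at ingestion time to cover new content as it arrives.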

Additional Context

The Scunthorpe Problem


Original Comments:

aldenstpage commented on Tue Apr 21 2020:

Also: we're going to need to review the list of words carefully, because the lists that I linked to are too broad in what they consider NSFW and could have some unwanted inadvertent censorship effects.
source

Issue author brenoferreira commented on Tue Apr 21 2020:

One thing to watch out for in this word list is the potential for false positives that can end up filtering out a lot of content with words that aren't necessarily NSFW.
Edit: when I commented I had the tab open for a while so @aldenstpage comment hadn't loaded yet :D
source

kss682 commented on Wed Apr 22 2020:

For the new content, we could add a validator method to the ImageStore class that checks the title, author, and other relevant attributes before inserting into the TSV, so that NSFW content could be flagged and segregated at an early stage. @aldenstpage
source
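A minimal sketch of that validator hook, assuming a simplified stand-in for the ImageStore class (the real ImageStore API differs; names here are illustrative):

```python
import re


class ImageStore:
    """Hypothetical sketch of an ingestion store with a pre-insert validator."""

    def __init__(self, sensitive_terms, tsv_path="images.tsv"):
        self.sensitive_terms = {t.lower() for t in sensitive_terms}
        self.tsv_path = tsv_path

    def _is_sensitive(self, *values):
        # Whole-word check across any number of text attributes.
        for value in values:
            tokens = re.findall(r"[a-z0-9']+", (value or "").lower())
            if any(t in self.sensitive_terms for t in tokens):
                return True
        return False

    def add_item(self, title=None, creator=None, tags=()):
        # Validate before the row ever reaches the TSV.
        mature = self._is_sensitive(title, creator, *tags)
        row = {"title": title, "creator": creator, "tags": list(tags), "mature": mature}
        # A real implementation would write `row` to the TSV here,
        # possibly diverting flagged rows for manual review instead.
        return row
```

This keeps the flagging decision at the earliest point in the pipeline, before the data is persisted.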

@sarayourfriend sarayourfriend added 🟨 priority: medium Not blocking but should be addressed soon ✨ goal: improvement Improvement to an existing user-facing feature 🕹 aspect: interface Concerns end-users' experience with the software labels Dec 16, 2022
sarayourfriend (Collaborator) commented:

@WordPress/openverse-catalog This seems like an issue that might make more sense to put into the catalogue. Detecting these keywords might fit nicely into initial ingestion, and getting it for historical records could follow the same pattern as the proposed data normalisation RFC (#345). Should we move this issue? Side note: this is one of the projects mentioned as part of the content safety lighthouse goal in the project planning spreadsheet (#343).

stacimc (Contributor) commented Dec 16, 2022

This seems like an issue that might make more sense to put into the catalogue. Detecting these keywords might fit nicely into initial ingestion.

This seems reasonable to me, although as you mention would require some effort to establish for historical data. I would support this. It seems like one of the original comments here supported this approach as well:

For the new content we could have a validator method in ImageStore class that checks against title,author and relevant attributes before inserting into tsv

zackkrida (Member) commented Jan 25, 2023

I came across a potentially interesting approach for doing this, with its own tradeoffs, naturally. ES index aliases support filtering: https://www.elastic.co/guide/en/elasticsearch/reference/6.2/indices-aliases.html#filtered

What if we put sensitive media into their own indexes (sensitive-images, sensitive-audio, and so on), filtered by the mature flag and also by matching against the sensitive term list at index creation time? Then, at search time, we route to the sensitive/non-sensitive alias based on the sensitive filter.

This would mean that updates to our term list require reindexing, which we currently do weekly as part of the data refresh anyway. I think the idea there is that the sensitive term list can change at all, not that those changes need to be instantaneous.
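For reference, a filtered alias along the lines sketched above would use the `_aliases` API. The index, alias, and field names here are illustrative, not Openverse's actual schema:

```
POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "image",
        "alias": "image-sensitive",
        "filter": { "term": { "mature": true } }
      }
    }
  ]
}
```

Queries sent to the alias then automatically carry the filter, which is what drives the performance caveat discussed below.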

sarayourfriend (Collaborator) commented:

I assume that approach would perform better overall relative to the number of terms. It sounds like a good approach to me!

sarayourfriend (Collaborator) commented:

BTW, the correct documentation for our current version of Elasticsearch for the index aliases is this: https://www.elastic.co/guide/en/elasticsearch/reference/7.17/aliases.html#filter-alias

Upon further investigation of the feature, I don't think it is worth spending time trying it, at least not for performance reasons. The filtered alias does not "pre-filter" the documents, it just applies the filter to every query against that index (as far as I can tell based on filtered alias performance questions raised online).

However, it got me digging around because it seemed like something ES should be able to do (create a new index "type thing" from an existing index based on a filter). Indeed, it can: https://www.elastic.co/guide/en/elasticsearch/reference/7.17/docs-reindex.html

Using the reindex API, we can tell ES to "reindex" an existing index into a differently named index, using a filter to select documents from the origin index. So essentially we'd be copying index X with the same MultiMatch filter I wrote in WordPress/openverse-api#1108. We'd move the logic to the index creation site and call reindex to create the filtered index immediately after an index is created. We'd also need to do that when an index is updated via the update_index action.
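As a rough illustration, the reindex call described above could take this shape. Index names and fields are placeholders; the actual filter would mirror the MultiMatch query from openverse-api#1108:

```
POST /_reindex
{
  "source": {
    "index": "image",
    "query": {
      "bool": {
        "must_not": {
          "multi_match": {
            "query": "term1 term2 term3",
            "fields": ["title", "description", "tags.name"]
          }
        }
      }
    }
  },
  "dest": { "index": "image-filtered" }
}
```

The `must_not` clause copies only documents that do not match any sensitive term, producing the complementary, pre-filtered index.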

I'm going to move this issue back to the API repository as it now seems to be squarely in the domain of the API/ingestion server rather than the catalogue, at least under our currently agreed-upon approach for sensitive term filtering.

That said, while I think it is a good idea (for query performance) to create these secondary indexes that exclude documents matching sensitive terms, there are a couple of questions/complications I want to raise early so that we can consider them:

  1. How will this affect the size of our ES cluster, specifically with disk usage?
  2. Will it also affect memory, now that twice as many indexes will suddenly be queried?
  3. Will this effectively double index creation time? Will it be more because we'll be filtering documents as well (sending a query) to derive the documents for the complementary query?
  4. What should we call this secondary index that excludes the sensitive terms? "{model_name}-safe" comes to mind, but then I thought, what if we actually give the modified name to the unfiltered query, then we don't have to ponder whether "safe" is a term we want to use at all (given what it might incorrectly imply). If we named the unfiltered (original) index (the one that exists now) "{model_name}-unfiltered" and the filtered index just "{model_name}", we'd get around all of these complications with the added bonus of making it somewhat clearer what is technically different about the two indexes (rather than what may or may not be qualitatively different about them, depending on how you look at it).
  5. Do we want to keep the ability to apply additional sensitive term filters at query time, following the pattern in openverse-api#1108 ("Create pre-filtered secondary indexes and add ability to automatically filter sensitive terms at query time")? If we did, I am assuming the list would only be used in an "emergency" type situation, where we realise a critical term was left out of the list used to create the filtered index and we want to filter it out of the default searches ASAP. The invention of a new slur is the only scenario I can really think of where this would apply. I am mildly sceptical that it would be useful to keep, as it seems unlikely we'd need it and "no code is the best code"/YAGNI.

@sarayourfriend sarayourfriend transferred this issue from WordPress/openverse-catalog Jan 30, 2023
sc0ttkclark commented:
Just +1'ing here because I don't want to let my kids use Openverse for their school projects yet while results include NSFW images (I even ran into some when searching for something for my own demo site).

@sarayourfriend sarayourfriend self-assigned this Jan 30, 2023
sarayourfriend (Collaborator) commented:

What should we call this secondary index that excludes the sensitive terms? "{model_name}-safe" comes to mind, but then I thought, what if we actually give the modified name to the unfiltered query, then we don't have to ponder whether "safe" is a term we want to use at all (given what it might incorrectly imply). If we named the unfiltered (original) index (the one that exists now) "{model_name}-unfiltered" and the filtered index just "{model_name}", we'd get around all of these complications with the added bonus of making it somewhat clearer what is technically different about the two indexes (rather than what may or may not be qualitatively different about them, depending on how you look at it).

Thinking more about this, it's far more complicated to rename the origin index at this point, so we could just use the word "filtered" for the new one (which has the same benefits of unfiltered I suggested before).

sarayourfriend (Collaborator) commented Jan 31, 2023

I've updated the PR linked to this issue to also enable the creation of filtered indexes. We can remove the Django API behaviour if we decide it isn't worth keeping around.

The PR creates a new action for the reindexing rather than trying to add it to an existing step. This means we'll need to update the data refresh DAG to call the new action as well as the "POINT_ALIAS" action afterwards, mirroring the changes made to load_sample_data.sh in the PR.

@obulat obulat transferred this issue from WordPress/openverse-api Feb 22, 2023
@obulat obulat added 🧱 stack: api Related to the Django API and removed 🧱 stack: backend labels Mar 20, 2023
AetherUnbound (Collaborator) commented:

@sarayourfriend should we close this issue in favor of some of the other plans/RFCs currently ongoing? Or will we just end up using this issue for that work?

sarayourfriend (Collaborator) commented:

The project thread references it: #377

We could close this issue once the project thread is closed or close it now as a duplicate, I have no preference.

AetherUnbound (Collaborator) commented:

I'll go ahead and close the issue; it's linked for context, as you mention, anyway 🙂

dhruvkb pushed a commit that referenced this issue Apr 14, 2023