
[Feature] Use metadata keywords to help detect if something is NSFW (original #482) #750

Closed
obulat opened this issue Apr 21, 2021 · 11 comments
Assignees
Labels
🕹 aspect: interface Concerns end-users' experience with the software ✨ goal: improvement Improvement to an existing user-facing feature 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: api Related to the Django API

Comments

obulat (Contributor) commented Apr 21, 2021

This issue has been migrated from the CC Search API repository

Author: aldenstpage
Date: Tue Apr 21 2020
Labels: 🏷 status: label work required, 🙅 status: discontinued

Problem Description

We are trying to make NSFW content in CC Search "opt-in". We can catch a lot of NSFW content by using API specific filters and relying on moderation "upstream" at the source, but sometimes things slip through.

Solution Description

One way we can help prevent this is by scanning the title, tags, and artist name for profanity and slurs, and setting nsfw = True in the metadata field when a record fails the check. There are third-party lists of dirty words that can help us achieve this. In my experience moderating content on CC Search, this will prevent a lot of embarrassment and indignant emails from teachers.

We can do a one-time scan-and-filter relatively easily, but we will also need a way to filter new content as it is ingested.
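As a rough illustration of the proposed check (the function names and term list here are hypothetical placeholders, not Openverse's actual implementation), a one-time scan-and-flag pass could look like this. Whole-word matching rather than substring matching helps avoid the Scunthorpe problem noted below:

```python
import re

# Placeholder list; a real deployment would load a vetted third-party word list.
SENSITIVE_TERMS = {"badword", "slur"}


def contains_sensitive_term(text: str) -> bool:
    # Match whole words only, to avoid Scunthorpe-style false positives
    # from sensitive substrings inside innocent words.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return any(token in SENSITIVE_TERMS for token in tokens)


def flag_if_sensitive(record: dict) -> dict:
    # Check the fields mentioned above: title, tags, and artist/creator name.
    fields = [record.get("title", ""), record.get("creator", ""), *record.get("tags", [])]
    if any(contains_sensitive_term(field) for field in fields):
        record.setdefault("meta_data", {})["nsfw"] = True
    return record
```

The same helper could then be reused at ingestion time to cover new content as it arrives.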

Additional Context

The Scunthorpe Problem


Original Comments:

aldenstpage commented on Tue Apr 21 2020:

Also: we're going to need to review the list of words carefully, because the lists that I linked to are too broad in what they consider NSFW and could have some unwanted inadvertent censorship effects.
source

Issue author brenoferreira commented on Tue Apr 21 2020:

One thing to watch out for in this word list is the potential for false positives that can end up filtering out a lot of content with words that aren't necessarily NSFW.
Edit: when I commented I had the tab open for a while so @aldenstpage comment hadn't loaded yet :D
source

kss682 commented on Wed Apr 22 2020:

For the new content, we could add a validator method to the ImageStore class that checks the title, author, and other relevant attributes before inserting into the TSV, so that NSFW content could be flagged and segregated at an early stage. @aldenstpage
source
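A minimal sketch of that validator hook, assuming a simplified stand-in for the ImageStore class (the real ImageStore API differs; names here are illustrative):

```python
import re


class ImageStore:
    """Hypothetical sketch of an ingestion store with a pre-insert validator."""

    def __init__(self, sensitive_terms, tsv_path="images.tsv"):
        self.sensitive_terms = {t.lower() for t in sensitive_terms}
        self.tsv_path = tsv_path

    def _is_sensitive(self, *values):
        # Whole-word check across any number of text attributes.
        for value in values:
            tokens = re.findall(r"[a-z0-9']+", (value or "").lower())
            if any(t in self.sensitive_terms for t in tokens):
                return True
        return False

    def add_item(self, title=None, creator=None, tags=()):
        # Validate before the row ever reaches the TSV.
        mature = self._is_sensitive(title, creator, *tags)
        row = {"title": title, "creator": creator, "tags": list(tags), "mature": mature}
        # A real implementation would write `row` to the TSV here,
        # possibly diverting flagged rows for manual review instead.
        return row
```

This keeps the flagging decision at the earliest point in the pipeline, before the data is persisted.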

@sarayourfriend sarayourfriend added 🟨 priority: medium Not blocking but should be addressed soon ✨ goal: improvement Improvement to an existing user-facing feature 🕹 aspect: interface Concerns end-users' experience with the software labels Dec 16, 2022
sarayourfriend (Collaborator) commented:

@WordPress/openverse-catalog This seems like an issue that might make more sense to put into the catalogue. Detecting these keywords might fit nicely into initial ingestion, and getting it for historical records could follow the same pattern as the proposed data normalisation RFC (#345). Should we move this issue? Side note: this is one of the projects mentioned as part of the content safety lighthouse goal in the project planning spreadsheet (#343).

stacimc (Contributor) commented Dec 16, 2022

This seems like an issue that might make more sense to put into the catalogue. Detecting these keywords might fit nicely into initial ingestion.

This seems reasonable to me, although as you mention would require some effort to establish for historical data. I would support this. It seems like one of the original comments here supported this approach as well:

For the new content we could have a validator method in ImageStore class that checks against title,author and relevant attributes before inserting into tsv

zackkrida (Member) commented Jan 25, 2023

I came across a potentially interesting approach for doing this, with its own tradeoffs, naturally. ES index aliases support filtering: https://www.elastic.co/guide/en/elasticsearch/reference/6.2/indices-aliases.html#filtered

What if we put sensitive media into their own indexes (sensitive-images, sensitive-audio, and so on), filtered by the mature flag and also by matching against the sensitive term list at index creation time? Then, at search time, we route to the sensitive/non-sensitive alias based on the sensitive filter.

This would mean that updates to our term list require reindexing, which we currently do weekly as part of the data refresh anyway. I think the idea there is that the sensitive term list can change at all, not that those changes need to be instantaneous.
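For reference, a filtered alias along the lines sketched above would use the `_aliases` API. The index, alias, and field names here are illustrative, not Openverse's actual schema:

```
POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "image",
        "alias": "image-sensitive",
        "filter": { "term": { "mature": true } }
      }
    }
  ]
}
```

Queries sent to the alias then automatically carry the filter, which is what drives the performance caveat discussed below.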

sarayourfriend (Collaborator) commented:

I assume that approach would perform better overall relative to the number of terms. It sounds like a good approach to me!

sarayourfriend (Collaborator) commented:

BTW, the correct documentation for our current version of Elasticsearch for the index aliases is this: https://www.elastic.co/guide/en/elasticsearch/reference/7.17/aliases.html#filter-alias

Upon further investigation of the feature, I don't think it is worth spending time trying it, at least not for performance reasons. The filtered alias does not "pre-filter" the documents, it just applies the filter to every query against that index (as far as I can tell based on filtered alias performance questions raised online).

However, it got me digging around because it seemed like something ES should be able to do (create a new index "type thing" from an existing index based on a filter). Indeed, it can: https://www.elastic.co/guide/en/elasticsearch/reference/7.17/docs-reindex.html

Using the reindex API, we can tell ES to "reindex" an existing index into a differently named index, using a filter to select documents from the origin index. So essentially we'd be copying index X with the same MultiMatch filter I wrote in WordPress/openverse-api#1108. We'd move the logic to the index creation site and call reindex to create the filtered index immediately after an index is created. We'd also need to do that when an index is updated via the update_index action.
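As a rough illustration, the reindex call described above could take this shape. Index names and fields are placeholders; the actual filter would mirror the MultiMatch query from openverse-api#1108:

```
POST /_reindex
{
  "source": {
    "index": "image",
    "query": {
      "bool": {
        "must_not": {
          "multi_match": {
            "query": "term1 term2 term3",
            "fields": ["title", "description", "tags.name"]
          }
        }
      }
    }
  },
  "dest": { "index": "image-filtered" }
}
```

The `must_not` clause copies only documents that do not match any sensitive term, producing the complementary, pre-filtered index.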

I'm going to move this issue back to the API repository as it now seems to be squarely in the domain of the API/ingestion server rather than the catalogue, at least under our currently agreed-upon approach for sensitive term filtering.

That said, while I think it is a good idea (for query performance) to create these secondary indexes that exclude documents matching sensitive terms, there are a couple of questions/complications I want to raise early so that we can consider them:

  1. How will this affect the size of our ES cluster, specifically with disk usage?
  2. Will it also affect memory, now that twice as many indexes will suddenly be queried?
  3. Will this effectively double index creation time? Will it be more because we'll be filtering documents as well (sending a query) to derive the documents for the complementary query?
  4. What should we call this secondary index that excludes the sensitive terms? "{model_name}-safe" comes to mind, but then I thought, what if we actually give the modified name to the unfiltered query, then we don't have to ponder whether "safe" is a term we want to use at all (given what it might incorrectly imply). If we named the unfiltered (original) index (the one that exists now) "{model_name}-unfiltered" and the filtered index just "{model_name}", we'd get around all of these complications with the added bonus of making it somewhat clearer what is technically different about the two indexes (rather than what may or may not be qualitatively different about them, depending on how you look at it).
  5. Do we want to keep the ability to apply additional sensitive term filters at query time, following the pattern in openverse-api#1108 ("Create pre-filtered secondary indexes and add ability to automatically filter sensitive terms at query time")? If we did, I am assuming the list would only be used in an "emergency" type situation, where we realise a critical term was left out of the list used to create the filtered index and we want to filter it out of the default searches ASAP. The invention of a new slur is the only scenario I can really think of where this would apply. I am mildly sceptical that it would be useful to keep, as it seems unlikely we'd need it and "no code is the best code"/YAGNI.

@sarayourfriend sarayourfriend transferred this issue from WordPress/openverse-catalog Jan 30, 2023
sc0ttkclark commented:
Just +1'ing here because I don't want to let my kids use Openverse for their school projects yet while results include NSFW images (I even ran into some when searching for something for my own demo site).

@sarayourfriend sarayourfriend self-assigned this Jan 30, 2023
sarayourfriend (Collaborator) commented:

What should we call this secondary index that excludes the sensitive terms? "{model_name}-safe" comes to mind, but then I thought, what if we actually give the modified name to the unfiltered query, then we don't have to ponder whether "safe" is a term we want to use at all (given what it might incorrectly imply). If we named the unfiltered (original) index (the one that exists now) "{model_name}-unfiltered" and the filtered index just "{model_name}", we'd get around all of these complications with the added bonus of making it somewhat clearer what is technically different about the two indexes (rather than what may or may not be qualitatively different about them, depending on how you look at it).

Thinking more about this, it's far more complicated to rename the origin index at this point, so we could just use the word "filtered" for the new one (which has the same benefits of unfiltered I suggested before).

sarayourfriend (Collaborator) commented Jan 31, 2023

I've updated the PR linked to this issue to also enable the creation of filtered indexes. We can remove the Django API behaviour if we decide it isn't worth keeping around.

The PR creates a new action for the reindexing rather than trying to add it to an existing step. This means we'll need to update the data refresh DAG to call the new action as well as the "POINT_ALIAS" action afterwards, mirroring the changes made to load_sample_data.sh in the PR.

@obulat obulat transferred this issue from WordPress/openverse-api Feb 22, 2023
@obulat obulat added 🧱 stack: api Related to the Django API and removed 🧱 stack: backend labels Mar 20, 2023
AetherUnbound (Collaborator) commented:

@sarayourfriend should we close this issue in favor of some of the other plans/RFCs currently ongoing? Or will we just end up using this issue for that work?

sarayourfriend (Collaborator) commented:

The project thread references it: #377

We could close this issue once the project thread is closed or close it now as a duplicate, I have no preference.

AetherUnbound (Collaborator) commented:

I'll go ahead and close the issue; it's linked for context, as you mention, anyway 🙂

dhruvkb pushed a commit that referenced this issue Apr 14, 2023