This repository has been archived by the owner on Jan 13, 2022. It is now read-only.

[Feature] Use metadata keywords to help detect if something is NSFW #482

Closed
aldenstpage opened this issue Apr 21, 2020 · 3 comments
Assignees
Labels
🙅 status: discontinued (Not suitable for work as repo is in maintenance)
🏷 status: label work required (Needs proper labelling before it can be worked on)

Comments

@aldenstpage
Contributor

Problem Description

We are trying to make NSFW content in CC Search "opt-in". We can catch a lot of NSFW content by using API specific filters and relying on moderation "upstream" at the source, but sometimes things slip through.

Solution Description

One way we can help prevent this is to scan the title, tags, and artist name for profanity and slurs, and to set nsfw = True in the metadata field if any of them fails the check. There are third-party lists of dirty words that can help us achieve this. In my experience moderating content on CC Search, this will help prevent a lot of embarrassment and indignant emails from teachers.

We can do a one-time scan-and-filter relatively easily, but we will also need a way to filter new content as it is ingested.
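As a rough illustration of the scan described above, here is a minimal sketch. The record layout (`title`, `creator`, `tags`, `meta_data`), the function names, and the placeholder blocklist entries are all assumptions made for the example, not the actual CC Search schema or word list.

```python
import re

# Placeholder entries only; a real list would come from a reviewed
# third-party source and be curated before use.
BLOCKLIST = {"examplebadword", "anotherbadword"}


def tokens(text):
    """Lowercase the text and split it into a set of whole-word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def flag_nsfw(record, blocklist=BLOCKLIST):
    """Return a copy of `record` with meta_data['nsfw'] = True on a match.

    `record` is a hypothetical dict with 'title', 'creator', 'tags',
    and 'meta_data' keys; missing or None values are treated as empty.
    """
    fields = [record.get("title") or "", record.get("creator") or ""]
    fields.extend(record.get("tags") or [])

    matched = set()
    for field in fields:
        matched |= tokens(field) & blocklist

    meta = dict(record.get("meta_data") or {})
    if matched:
        meta["nsfw"] = True
    return {**record, "meta_data": meta}
```

The same function could in principle serve both the one-time scan over existing records and the check on newly ingested ones, since it only depends on the record itself.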

Additional Context

The Scunthorpe Problem
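To make the Scunthorpe problem concrete: naive substring matching flags innocent text that merely contains a listed word. The sketch below contrasts it with whole-word matching. The function names and the sample word are illustrative assumptions, and whole-word matching reduces, but does not eliminate, false positives.

```python
import re


def substring_match(text, words):
    """Naive check: True if any listed word appears anywhere in the text."""
    lowered = text.lower()
    return any(word in lowered for word in words)


def whole_word_match(text, words):
    """Stricter check: True only if a listed word appears as a whole word."""
    if not words:
        return False
    pattern = r"\b(?:" + "|".join(re.escape(word) for word in words) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


words = ["sex"]
title = "Sunset over Essex"
print(substring_match(title, words))   # True: "Essex" contains the word
print(whole_word_match(title, words))  # False: no standalone match
```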

@aldenstpage
Contributor Author

aldenstpage commented Apr 21, 2020

Also: we're going to need to review the list of words carefully, because the lists that I linked to are too broad in what they consider NSFW and could have some unwanted inadvertent censorship effects.

@brenoferreira
Contributor

brenoferreira commented Apr 21, 2020

One thing to watch out for in this word list is the potential for false positives that can end up filtering out a lot of content with words that aren't necessarily NSFW.

Edit: when I commented, I had had the tab open for a while, so @aldenstpage's comment hadn't loaded yet :D

@kss682

kss682 commented Apr 22, 2020

For new content, we could have a validator method in the ImageStore class that checks the title, author, and other relevant attributes before inserting into the TSV, so that NSFW content can be flagged and segregated at an early stage. @aldenstpage
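A rough sketch of what such a validation hook might look like follows. `FlaggingStore` is a hypothetical stand-in and does not reproduce the real ImageStore class or its method signatures; the `is_nsfw` callable, the field names, and the TSV column order are likewise assumptions.

```python
import csv
import io
import json


class FlaggingStore:
    """Hypothetical store that flags rows before they are written to a TSV."""

    def __init__(self, is_nsfw):
        # `is_nsfw` is any callable taking a string and returning a bool,
        # for example a word-list check.
        self._is_nsfw = is_nsfw
        self.rows = []

    def add_item(self, title, creator, tags=None, meta_data=None):
        """Validate the text fields, set the flag if needed, buffer the row."""
        tags = list(tags or [])
        meta = dict(meta_data or {})
        texts = [title or "", creator or "", *tags]
        if any(self._is_nsfw(text) for text in texts):
            meta["nsfw"] = True
        self.rows.append(
            {"title": title, "creator": creator, "tags": tags, "meta_data": meta}
        )

    def to_tsv(self):
        """Serialize buffered rows as tab-separated text, one row per line."""
        buffer = io.StringIO()
        writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
        for row in self.rows:
            writer.writerow(
                [
                    row["title"],
                    row["creator"],
                    ",".join(row["tags"]),
                    json.dumps(row["meta_data"]),
                ]
            )
        return buffer.getvalue()
```

Running the check inside `add_item` means every row carries its flag before anything reaches the TSV, which is the early-stage segregation suggested above.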

@aldenstpage aldenstpage self-assigned this Apr 23, 2020
@aldenstpage aldenstpage transferred this issue from cc-archive/cccatalog Apr 23, 2020
@cc-archive cc-archive deleted a comment from Mr-burme Jul 23, 2020
@kgodey kgodey added 🚧 status: blocked Blocked & therefore, not ready for work 🧹 status: ticket work required Needs more details before it can be worked on and removed not ready for work labels Sep 24, 2020
@cc-open-source-bot cc-open-source-bot added the 🏷 status: label work required Needs proper labelling before it can be worked on label Dec 2, 2020
@kgodey kgodey added 🙅 status: discontinued Not suitable for work as repo is in maintenance and removed 🚧 status: blocked Blocked & therefore, not ready for work 🧹 status: ticket work required Needs more details before it can be worked on labels Dec 16, 2020
@kgodey kgodey closed this as completed Dec 16, 2020

5 participants