WIP: Switch to a less overzealous family filter #2537
Conversation
Can one of the admins verify this patch?
ok to test
Good stuff! Thanks for that much needed contribution.
I guess it would be hard to 'verify' the performance of this list, but I fully agree that the current list targets terms that are too broad. For now, I guess this will be good 👍
I agree that a more sophisticated approach would be sweet to have here. I actually attempted a scheme where I would poll random YouTube videos, note down the frequency of different words in the titles, and do the same for 'adult' videos. One could then compare which words are significantly more common in adult videos and use those words as indicators of adult content. I have some code for this, but the problem with this approach is that most adult video sites have reasonably strict rate limiting. I guess until some student decides a porn filter would be a good topic for their thesis, it will remain a simple list :) You can evaluate the prevalence of false negatives by searching for 'dubious' terms such as "girl" or "blonde" (both of which used to be blocked entirely before). Results seem pretty clean.
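(For the record, the frequency-comparison idea could look roughly like this in Python; the function names and thresholds below are illustrative assumptions, not the actual code mentioned in the comment.)

```python
from collections import Counter

def word_frequencies(titles):
    """Count how often each word occurs across a list of video titles."""
    counts = Counter()
    for title in titles:
        counts.update(title.lower().split())
    return counts

def adult_indicator_words(adult_titles, normal_titles, min_count=5, min_ratio=10.0):
    """Return words that are markedly more frequent in adult titles than in normal ones."""
    adult = word_frequencies(adult_titles)
    normal = word_frequencies(normal_titles)
    adult_total = float(sum(adult.values())) or 1.0
    normal_total = float(sum(normal.values())) or 1.0
    indicators = set()
    for word, count in adult.items():
        if count < min_count:
            continue  # ignore rare words, they are unreliable indicators
        adult_rate = count / adult_total
        normal_rate = normal.get(word, 0) / normal_total
        if normal_rate == 0 or adult_rate / normal_rate >= min_ratio:
            indicators.add(word)
    return indicators
```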
retest this please
Hi, thanks for your contribution :). I took some time to investigate the effects of this PR on the family filter we already implemented. To do so, I extracted files from The Pirate Bay in the categories porn and video and saved the file names to separate lists. In total, this dataset contains 1721 files, of which 892 are classified as xxx and 829 as non-xxx. Next, I took the list of keywords we currently have and your list, and calculated the number of false positives (a non-xxx item gets classified as xxx) and false negatives. The results are presented below. Current keyword list:
With your proposed keyword list:
While your list significantly reduces the number of false positives, the number of false negatives also increases, which means that your proposed list is not restrictive enough, i.e. there would be much inappropriate content in the GUI with your list. Personally, I would rather have a filter that wrongly filters out some non-xxx content but hides much of the xxx content in the GUI than the reverse situation. While I agree that our current filter is sub-optimal, I argue it's the best choice we have for now. If you have any other suggestions for the filter that reduce the number of false positives/negatives, please let me know in the appropriate issues or on this PR. Any help on this would be greatly appreciated!
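(Aside: the false-positive/false-negative comparison described above could be scripted roughly as follows; the whole-word matching rule and variable names are assumptions for illustration, not the evaluation code actually used.)

```python
import re

def matches_filter(title, keywords):
    """Naive family-filter check: does the title contain any keyword as a whole word?"""
    words = set(re.findall(r"[a-z0-9]+", title.lower()))
    return any(keyword in words for keyword in keywords)

def evaluate(keywords, xxx_titles, non_xxx_titles):
    """Return (false_positives, false_negatives) for a keyword list on labelled titles."""
    false_positives = sum(1 for t in non_xxx_titles if matches_filter(t, keywords))
    false_negatives = sum(1 for t in xxx_titles if not matches_filter(t, keywords))
    return false_positives, false_negatives

# Hypothetical usage, comparing the two lists on the same Pirate Bay sample:
# print(evaluate(current_keywords, xxx_titles, non_xxx_titles))
# print(evaluate(proposed_keywords, xxx_titles, non_xxx_titles))
```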
Thank you both for spending time on this. It is clearly a difficult task these days to filter spam and explicit material. We would need somebody to allocate 2-4 months to this and dive into machine learning matters. But perhaps my estimate is too pessimistic.
Not difficult at all, but most solutions are not open source or written in Python. One can easily be built with NLTK (Natural Language Toolkit): http://www.slideshare.net/shanbady/nltk-natural-language-processing-in-python (see slide 58 and onwards).
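(For illustration, a minimal NLTK-based classifier along those lines could look like the sketch below; the feature extraction, labels, and training data are placeholder assumptions, not an existing Tribler component.)

```python
from nltk.classify import NaiveBayesClassifier

def title_features(title):
    """Bag-of-words features: every word in the title becomes a boolean feature."""
    return dict((word, True) for word in title.lower().split())

def train_family_filter(xxx_titles, non_xxx_titles):
    """Train a Naive Bayes classifier on labelled torrent titles."""
    training_set = ([(title_features(t), 'xxx') for t in xxx_titles] +
                    [(title_features(t), 'ok') for t in non_xxx_titles])
    return NaiveBayesClassifier.train(training_set)

# Hypothetical usage:
# classifier = train_family_filter(xxx_titles, non_xxx_titles)
# label = classifier.classify(title_features("some search result"))  # 'xxx' or 'ok'
```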
This resolves #1052 and resolves #827.
A long-standing issue with Tribler is that the family filter returns a lot of false positives and thus filters out many perfectly valid results. This is reflected in issues #1052 and #827 (which are pretty much duplicates).
This PR switches to a more targeted keyword list, namely this one: https://gist.github.com/ryanlewis/a37739d710ccdb4b406d
This is a list originally made for a now-abandoned Google project.
The list is comparatively narrow and should reduce false positives while hardly creating new false negatives. The effect can be evaluated by running TestRemoteChannelSearch in test_remote_search.py and grepping the output. Compare:
Before:
After: