WIP: Switch to a less overzealous family filter #2537
Conversation
Can one of the admins verify this patch?
ok to test
Good stuff! Thanks for that much needed contribution.
I guess it would be hard to 'verify' the performance of this list, but I fully agree that the current list targets terms that are too broad. For now, I guess this will be good 👍
I agree that a more sophisticated approach would be sweet to have here. I actually attempted a scheme where I would poll random YouTube videos, note down the frequency of different words in the titles, and do the same for 'adult' videos. One could then compare which words are significantly more common in adult videos and use those words as indicators of adult content. I have some code for this, but the problem with this approach is that most adult video sites have reasonably strict rate limiting. I guess until some student decides a porn filter would be a good topic for their thesis, it will remain a simple list :) You can evaluate the prevalence of false negatives by searching for 'dubious' terms such as "girl" or "blonde" (both of which used to be blocked entirely before). Results seem pretty clean.
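(For the record, the frequency-comparison idea could look roughly like this in Python; the function names and thresholds below are illustrative assumptions, not the actual code mentioned in the comment.)

```python
from collections import Counter

def word_frequencies(titles):
    """Count how often each word occurs across a list of video titles."""
    counts = Counter()
    for title in titles:
        counts.update(title.lower().split())
    return counts

def adult_indicator_words(adult_titles, normal_titles, min_count=5, min_ratio=10.0):
    """Return words that are markedly more frequent in adult titles than in normal ones."""
    adult = word_frequencies(adult_titles)
    normal = word_frequencies(normal_titles)
    adult_total = float(sum(adult.values())) or 1.0
    normal_total = float(sum(normal.values())) or 1.0
    indicators = set()
    for word, count in adult.items():
        if count < min_count:
            continue  # ignore rare words, they are unreliable indicators
        adult_rate = count / adult_total
        normal_rate = normal.get(word, 0) / normal_total
        if normal_rate == 0 or adult_rate / normal_rate >= min_ratio:
            indicators.add(word)
    return indicators
```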
retest this please
Hi, thanks for your contribution :). I took some time to investigate the effects of this PR on the family filter we already implemented. To do so, I extracted files from The Pirate Bay in the categories porn and video and saved the file names to separate lists. In total, this dataset contains 1721 files, of which 892 are classified as xxx and 829 as non-xxx. Next, I took the list of keywords we currently have and your list, and calculated the number of false positives (a non-xxx item gets classified as xxx) and false negatives. The results are presented below. Current keyword list:
With your proposed keyword list:
While your list significantly reduces the number of false positives, the number of false negatives also increases, which means that your proposed list is not restrictive enough, i.e. there would be much inappropriate content in the GUI with your list. Personally, I would rather have a filter that wrongly filters out some non-xxx content but hides much of the xxx content in the GUI than the reverse situation. While I agree that our current filter is sub-optimal, I argue it's the best choice we have for now. If you have any other suggestions for the filter that reduce the number of false positives/negatives, please let me know in the appropriate issues or on this PR. Any help on this would be greatly appreciated!
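(Aside: the false-positive/false-negative comparison described above could be scripted roughly as follows; the whole-word matching rule and variable names are assumptions for illustration, not the evaluation code actually used.)

```python
import re

def matches_filter(title, keywords):
    """Naive family-filter check: does the title contain any keyword as a whole word?"""
    words = set(re.findall(r"[a-z0-9]+", title.lower()))
    return any(keyword in words for keyword in keywords)

def evaluate(keywords, xxx_titles, non_xxx_titles):
    """Return (false_positives, false_negatives) for a keyword list on labelled titles."""
    false_positives = sum(1 for t in non_xxx_titles if matches_filter(t, keywords))
    false_negatives = sum(1 for t in xxx_titles if not matches_filter(t, keywords))
    return false_positives, false_negatives

# Hypothetical usage, comparing the two lists on the same Pirate Bay sample:
# print(evaluate(current_keywords, xxx_titles, non_xxx_titles))
# print(evaluate(proposed_keywords, xxx_titles, non_xxx_titles))
```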
Thank you both for spending time on this. It is clearly a difficult task these days to filter spam and explicit material. We would need somebody to allocate 2-4 months to this and dive into machine learning matters. But perhaps my estimate is too pessimistic.
Not difficult at all, but most solutions are not open source or written in Python. One can easily be built with NLTK (Natural Language Toolkit): http://www.slideshare.net/shanbady/nltk-natural-language-processing-in-python (see slide 58 and onwards).
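(For illustration, a minimal NLTK-based classifier along those lines could look like the sketch below; the feature extraction, labels, and training data are placeholder assumptions, not an existing Tribler component.)

```python
from nltk.classify import NaiveBayesClassifier

def title_features(title):
    """Bag-of-words features: every word in the title becomes a boolean feature."""
    return dict((word, True) for word in title.lower().split())

def train_family_filter(xxx_titles, non_xxx_titles):
    """Train a Naive Bayes classifier on labelled torrent titles."""
    training_set = ([(title_features(t), 'xxx') for t in xxx_titles] +
                    [(title_features(t), 'ok') for t in non_xxx_titles])
    return NaiveBayesClassifier.train(training_set)

# Hypothetical usage:
# classifier = train_family_filter(xxx_titles, non_xxx_titles)
# label = classifier.classify(title_features("some search result"))  # 'xxx' or 'ok'
```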
This resolves #1052 and resolves #827.
A long-standing issue with Tribler is that the family filter returns a lot of false positives and thus filters out many perfectly valid results. This is reflected in issues #1052 and #827 (which are pretty much duplicates).
This PR switches to a more targeted keyword list, namely this one: https://gist.github.com/ryanlewis/a37739d710ccdb4b406d
This is a list originally made for a now-abandoned Google project.
The list is comparatively narrow and should reduce false positives while hardly creating new false negatives. The effect can be evaluated by running TestRemoteChannelSearch in test_remote_search.py and grepping the output. Compare:
Before:
After: