Better way to prevent the creation of spam users #1613
Comments
Looking at recently added members today, I noticed the same problem. https://tatoeba.org/eng/user/profile/smvec Access to this profile is PUBLIC. All the information can be seen by everyone. Perhaps making it impossible to set profiles to "Access to this profile is PUBLIC." until a member has contributed sentences would help solve this problem. This is likely more of an SEO tactic, hoping Google will count the profile links as backlinks, than an attempt to spam our members. If Google can't find the links, then perhaps a lot of this would be eliminated. Seeing how similar the formatting of the profiles is, likely only one or just a few SEO companies are doing this. |
The easiest way to make those links useless for SEO would be to include |
Somewhat related, if we could somehow make it clearer that people don't need to create an account to browse or download the data, then we would probably not have so many accounts being created. I know that the numbers look impressive when looking at the number of "members," but that is a deceptively high number when compared to the number of contributors. It would likely help to state very clearly on the main page that you only need to create an account to write to the website (comments, wall, sentences, etc.) and to view non-public members' profiles. This might also encourage those who incorrectly think registration is necessary to stay and use the website longer. |
Maybe I’m wrong, but looking at the profiles you and CK mentioned, I think they have been created by a human rather than a machine. The country in the profile even matches the company the link is pointing to. This means captchas and the like are probably useless. If we make the creation of new accounts more difficult, it will be a barrier for legitimate users too. I think we need to tackle this problem differently. |
Related: #349. |
Preventing the creation of spam users seems to be a suggested solution for a problem that is not clearly defined. Let's talk first about the negative impacts of registration of spam users, then we can figure out what to do. |
what about filtering machine-created users according to their given description? Here's an example of what I would personally filter according to the provided description: https://tatoeba.org/ita/user/profile/aaditisharma |
@Guybrush88 What makes you think this profile was not created by a human? |
well, maybe the fact that the profile is spamming an escort website |
Many of them do seem to be by the same SEO company, perhaps doing this as one of so many links they promise customers. You'll notice similarities....
Perhaps, if you are trying to write a script to find such contributors, this is something you could start with. Here's a quick look at new users. See how easy it is to spot the likely spammers. http://tatoeba.byethost3.com/new_users-2019-04-10.html |
@Guybrush88 I'm still unsure what the problem is on your side. On my side, if a user registers a "spam" account and doesn't do anything else, I actually don't care at all and it doesn't bother me at all. I don't use Tatoeba very frequently though, so I'm probably less likely to feel a bad user experience from those "spam" accounts. You mentioned filtering machine-created users, but I realize that I'm not sure what you mean by "filtering". Do you mean to detect upon creation that a profile is seemingly created by a machine, and prevent the registration of this account? Or do you mean not showing the profile in the list of members? Or do you mean something else? At the end of the day, we'd just need to know what the actual impact is on your side: how do you find out about these accounts and how does it affect you? Do they bother you because you keep receiving spam private messages from these accounts, or is it something else? |
yeah, I suppose so, if it's created to spam (since bots like horus are created by the dev team)
generally when lurking on this page: https://tatoeba.org/ita/users/all?sort=since&direction=desc , since sometimes I'm curious to see the growth of the user base within Tatoeba, and I also agree with the first two posts of CK. Another reason is that I find it slightly annoying to see automatically generated users that link to external sites about services (I guess related to what CK said, that is, for ranking-related things) that are completely unrelated to the general purpose of Tatoeba, like some kinds of online shopping, medical services (I recall having seen something like this too), escort websites, and so on, which other people might also consider spam, in my opinion |
Besides the feeling of "dirtiness" of having spam around on a website I care about, the presence of fake users is tampering with stats. The profile of aaditisharma has English listed as native-level, which means the number of native speakers of English is false. |
While there is no claim that all "members" are contributors, all those stats are a little deceptive since many people register and never do anything on the website. I would suspect that a number of them never come back a second time, but that's only a suspicion and perhaps it isn't true. |
today I noticed also some spam posts on the Wall. One has been already deleted, the other one is still here: https://tatoeba.org/ita/wall/show_message/33309#message_33309 The drawback of this might be that such wall posts might hide non-spam posts (i.e. people asking about some specific Tatoeba's features, asking language-related things, and so on) Edit: the linked post has been taken care of |
This is a related answer to a community admins email.....
Agreed. Part of the problem is that tatoeba.org is easy to sign up for and My guess is that most, if not all of these recent spam accounts, are The profile formats are often very similar, each with a URL and each Take a look at the new members and you'll notice that many of them https://tatoeba.org/eng/users/all?sort=since&direction=desc It would be interesting to see a list of members, sorted with the
I'm not so sure about this, but it is something we should consider. Note that some members may have only contributed audio and no |
I think that preventing the creation of spam accounts using captcha or similar methods is quite hard, as these spammers seem to have the resources to get around that. I can envision an endless cat and mouse game we'd always eventually lose, effectively making it harder for legitimate users to register.
@ck From your message, it seems that you are using the users list page to check who recently registered to Tatoeba (maybe to better introduce them to the project, or some other reason). Having the list cluttered by fake accounts is preventing you from doing it, therefore spam accounts are a problem to you. Is it correct?
We may be able to work around such problems by making spam accounts less visible from both users browsing Tatoeba and search engines crawling it. For example, we could hide links to members who didn't do any kind of contribution, such as sentences, audio, non-hidden wall posts and sentence comments, etc.
In other words, I think it should be easier to check if members are legitimate by analyzing their activity after registration.
I have the feeling that these spammers, for SEO reasons, are trying to keep their accounts active as long as possible without drawing attention. In particular, they are not sending spam messages. If this is true, this may be a weakness we can exploit to tell them apart from legitimate members.
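The activity-based check described above could be sketched roughly as follows. This is only an illustration in Python, not code from tatoeba2: the `MemberActivity` record, its field names, and the usernames are all hypothetical, and the real counts would have to come from the database.

```python
from dataclasses import dataclass


@dataclass
class MemberActivity:
    """Hypothetical per-member activity counts (not the real tatoeba2 schema)."""
    username: str
    sentences: int = 0
    audio: int = 0
    visible_wall_posts: int = 0   # wall posts that were not hidden
    visible_comments: int = 0     # sentence comments that were not hidden


def has_contributed(member: MemberActivity) -> bool:
    """True if the member made at least one contribution of any kind."""
    return any((
        member.sentences,
        member.audio,
        member.visible_wall_posts,
        member.visible_comments,
    ))


def linkable_members(members):
    """Usernames whose profiles may be linked from public pages."""
    return [m.username for m in members if has_contributed(m)]


members = [
    MemberActivity("alice", sentences=3),
    MemberActivity("seo_shop"),            # registered, never contributed
    MemberActivity("bob", audio=1),
]
print(linkable_members(members))
```

Because the check only looks at what a member did after registering, it adds no friction to the registration form itself.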
|
Perhaps changing the list to only display contributors of sentences (and possibly contributors of audio, comments and wall posts, and perhaps even members who have created lists but made no other contributions) would make this feed less cluttered. Also, if non-contributors weren't counted in native-speaker counts, those numbers would become more accurate. Perhaps all member accounts without contributions could be hidden until contributions were made. Perhaps you could even rename this from "members" to "contributors." From the point of view of other members of the project, it's really the contributors that matter. A non-contributing registered member isn't really any different from a non-registered visitor, I think. You could still possibly have the usernames of registered non-contributors show up in searches for usernames. I'm not sure if this would be needed, but it might be.

I, too, don't think captcha would help much. I also sort of suspect that verifying email addresses wouldn't help very much, though it would slow them down a little, but likely only just a little.

One advantage to constantly deleting obvious spammers, especially if most of these accounts were from the same 2 or 3 SEO companies, would be that they might possibly notice that they were just wasting their time and stop doing it. However, they may not notice. In that case, we would just be wasting our time if this couldn't be automated. On the other hand, other people who were considering doing the same thing might be discouraged from doing so. |
I suggest the following:
This should be enough to address the issue that we currently have with spammers. We can also implement email verification, but there's a separate issue for that: #1703. |
I've noticed at least some spammers advertising Chinese companies set their native language to "Chinese (Jin)". They're probably rushing through the process; otherwise they should have noticed that "Mandarin Chinese" is another option. Examples of such users: xinlijie, tgcasting, textilecn, szzzxcl, spicaqwewer, sleevelining, ... based on the alphabetically last 10 self-declared Jin native speakers, about 60% appear to be spam accounts. While creating separate lists for each language is likely to help with the original problem (making the list of users noisier) for most languages, those languages that spammers list in their profile will still be affected. On a related note, I think search engines use links to "disreputable" sites as a signal of website quality. Is anyone monitoring tatoeba.org's search engine ranking? |
I think this would be unfair to our "normal contributors." We have a number of contributors who have contributed a lot of sentences, but have not felt the need for tagging and linking rights. In many cases, "advanced contributors" are not really any more advanced than our "normal contributors." I think that minimally all members who have actually contributed sentences should be displayed, at least those who have contributed 5 or more sentences. A 5 or more sentence limit would help prevent spammer sentence owners from being listed, since so far, I think, most have only added 1 sentence before one of us deletes it and sets the account to "spammer." This is an old set of stats, but you can see the number of sentences contributed by "normal contributors" here. Just sort by status. http://tatoeba.byethost3.com/stats-170218.html [Addition] Perhaps the "members" page could be generated once a week to look somewhat like the "stats-170218.html" page. That page is fairly lightweight and doesn't take so long to load. View the source to see how it's done. |
Thanks for the info. We should definitely set those as "spammer". We're not going to avoid doing some manual work when it comes to keeping the list of members clean from spammers anyway. We could think of trying to automate it somehow (implementing some spam detection and deactivating the accounts automatically), but there are other problems involved there and there's no need to go there until we really have a huge number of spammers and it becomes humanly unmanageable. I would say that cleaning up the full list of registered users would be unmanageable at this point. But cleaning up the list of users who have added a language to their profile would be manageable. And it would contribute to having more accurate stats.
I don't think so. At least I don't. I just was reminded that you made an early comment about
We can add an info text explaining that not all members are displayed (just like there's a text on "Browse by language" pages that says only the last 1000 sentences are displayed). We display normal contributors in the other pages of the Community section anyway (as long as they have a language in their profile), and I think this is fair enough. Again, there are too many registered members by now. It doesn't make sense to have a huge list with 2000+ pages... If there were any other way to find admins, corpus maintainers or advanced contributors, I would have suggested removing the whole list. But currently, the "Members" page is the only page where you can go if you want to contact someone who has a special status.
Well for me it's one step further: all members who added a language in their profile should be displayed (which is done on https://tatoeba.org/eng/stats/users_languages), because it's not possible to contribute sentences without adding the language in your profile. Not everyone who has added a language to their profile will have contributed sentences, but that's okay as long as they are not spammers, I think. |
Remember that we have people who contributed sentences to the corpus before you added the "languages" part of the profile pages. |
BTW, here is a list of all the contributors. http://tatoeba.byethost3.com/tatoeba_contributors-2019-10-26.zip Here is an online page with all the same data. |
@ckjpn This is going a bit off-topic, I think. I'm not saying it wouldn't be nice to have a page that lists the contributors who have added at least one sentence, but in itself, having such a page doesn't solve the issue with spammers. Here's my understanding of the situation.

The problem

The problem with spammers manifests itself primarily through the Members page:
Once in a while spammers will create a spam sentence, or post a spam comment, or post a spam message on the Wall, but that is very rare as far as I know and can be handled quite easily by admins. The major issue is spam profiles. There are too many of them.

Possible solutions

Solution 1
Solution 2
Solution 3
Solution 4
Solution 5

My opinion

I'm in favor of solutions 2, 3, and a bit of solution 4 (with admins only checking profiles that have a native language instead of every profile).

For solution 3, I understand that it removes the possibility to get a picture of who has been registering lately (that is the use case that I described in "The problem"). The simplest way to compensate for it is to create sub-pages in the "Native speakers" page and add the option to sort by "member since". Or we could actually skip that and just add the sort option in the sub-pages of "Languages of members". That's even simpler. If we consider that hiding all the contributors on the Members page would still have other negative effects that are too important to be ignored, then I would be okay with just solution 2 alone.

I'm not too fond of solution 1 for the same reason as @jiru.

I'm not too fond of solution 4 if it applies to every profile. I think human intervention, to clean up spam, is always going to be necessary. But it should be channeled into situations that really matter. If a spammer creates a profile and does absolutely nothing else other than add a description and add a homepage, I see no real harm. It's only a minor annoyance. We don't need to be obsessive about it.

I'm not too fond of solution 5 because it's too high effort (that is, if we want to do it properly). |
I think that at this time, this should be avoided.
This should be easy to do, so I'd suggest doing it. One additional possible step would be to make it so that any member with fewer than 1 sentence had their profile automatically set (or reverted) to not be publicly viewable. This way, even if a spammer contributed a sentence, if an admin or corpus maintainer deleted the sentence, the profile would go back to non-public.
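One way to get the "automatically reverted" behaviour described above is to derive the visibility at display time instead of storing a changed setting. The following is a minimal Python sketch under that assumption; the function name and the `"public"` / `"members_only"` values are made up for illustration and do not come from the tatoeba2 code base.

```python
def effective_visibility(requested: str, sentence_count: int) -> str:
    """Visibility actually applied to a profile.

    `requested` is what the member chose in their settings (assumed values:
    "public" or "members_only"). A member who currently owns no sentences is
    always treated as "members_only", whatever they requested.
    """
    if sentence_count < 1:
        return "members_only"
    return requested


# A spammer requests a public profile but has no sentences yet.
print(effective_visibility("public", 0))
# The spammer adds one sentence.
print(effective_visibility("public", 1))
# An admin deletes it: the profile is non-public again without any extra step.
print(effective_visibility("public", 0))
```

Since nothing is written back to the member's settings, a legitimate member's original choice takes effect as soon as they own a sentence.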
If we did this, I would suggest making it possible to search for these members with partial usernames, since it's often difficult to remember exact spellings for usernames. Related issue: #1994
I think this probably isn't a very good solution.
This combined with the idea presented in Solution 4 might work. Create an algorithm that detects possible spammers and puts them in a queue for admins to manually check and set to "spammer." Make it possible for admins to easily remove non-spammers from the queue. |
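The detect-then-review idea above could look something like the sketch below. It is Python for illustration only: the `Profile` fields, the `ReviewQueue` class, and the heuristic itself (a profile that carries a link but owns no sentences) are assumptions, not an agreed-upon rule or existing tatoeba2 code.

```python
import re
from collections import deque
from dataclasses import dataclass

URL_RE = re.compile(r"https?://", re.IGNORECASE)


@dataclass
class Profile:
    """Hypothetical profile record (not the real tatoeba2 schema)."""
    username: str
    description: str = ""
    homepage: str = ""
    sentences: int = 0


def looks_suspicious(profile: Profile) -> bool:
    """Assumed heuristic: the profile has a link but no sentences."""
    has_link = bool(profile.homepage) or bool(URL_RE.search(profile.description))
    return has_link and profile.sentences == 0


class ReviewQueue:
    """Queue of usernames waiting for a manual check by an admin."""

    def __init__(self):
        self._pending = deque()
        self._cleared = set()  # usernames an admin marked as "not a spammer"

    def scan(self, profiles):
        for p in profiles:
            if p.username in self._cleared or p.username in self._pending:
                continue
            if looks_suspicious(p):
                self._pending.append(p.username)

    def dismiss(self, username):
        """Admin action: remove a false positive and never queue it again."""
        if username in self._pending:
            self._pending.remove(username)
        self._cleared.add(username)

    def pending(self):
        return list(self._pending)


queue = ReviewQueue()
queue.scan([
    Profile("shop01", homepage="http://example.com"),
    Profile("alice", homepage="http://example.org", sentences=12),
    Profile("quiet", description="My blog: https://example.net"),
])
queue.dismiss("quiet")   # an admin decides this one is legitimate
print(queue.pending())
```

Nothing here changes an account's status; setting a member to "spammer" stays a human decision, and dismissed usernames are remembered so a later scan does not queue them again.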
Here is an example of one way this could be done. |
today there has been someone who's been constantly polluting the corpus with spam sentences in a Chinese-like language. This is the latest account: https://tatoeba.org/ita/user/profile/wowo203, and there have already been previous accounts today that have already been blocked by admins, but, apparently, this didn't stop the spam wave |
Do these user creation requests usually all come from the same IP range? Maybe there are other signs/patterns which make them suspicious (e.g. headers sent, sequence of pages requested). |
I had a look at the server logs. The following users registered from the same IP address: soso201 soso202 wowo201 wowo202 wowo203 wowo204. Because the username and email fields of the register form produce a request on every keystroke (another problem described in #2072; actually useful here), I can tell that a human was sitting there filling in the form and making occasional typos. No bots were used during the registration phase. Based on this, we could try to stop this kind of troll by preventing contributions from users who registered a new account using the same pattern as a recently blocked user. |
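As a rough illustration of "same pattern as a recently blocked user", the Python sketch below compares a new registration against recently blocked accounts by IP address and by username stem (the username with trailing digits removed). The function names are hypothetical, the IP addresses are documentation-range placeholders rather than values from the logs, and how far back "recently" reaches is left open.

```python
import re


def username_stem(username: str) -> str:
    """Lower-cased username with any trailing digits removed."""
    return re.sub(r"\d+$", "", username.lower())


def resembles_blocked(username: str, ip: str, recently_blocked) -> bool:
    """True if a new registration shares an IP or a username stem
    with one of the recently blocked accounts.

    `recently_blocked` is an iterable of (username, ip) pairs.
    """
    stem = username_stem(username)
    for blocked_name, blocked_ip in recently_blocked:
        if ip == blocked_ip:
            return True
        # Ignore empty stems (all-digit usernames) to avoid matching everything.
        if stem and stem == username_stem(blocked_name):
            return True
    return False


blocked = [("soso201", "203.0.113.7"), ("wowo203", "203.0.113.7")]
print(resembles_blocked("wowo204", "198.51.100.20", blocked))   # same stem
print(resembles_blocked("newname", "203.0.113.7", blocked))     # same IP
print(resembles_blocked("alice", "192.0.2.15", blocked))        # no match
```

Both signals can produce false positives (shared networks, common name stems), so a match is probably better used to hold contributions for review than to reject the account outright.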
At least in this case, email verification would've slowed him down. |
Many of the usernames of the spam accounts created lately were kind of random. There are some recurring patterns (3 letters followed by 5 digits, for instance) but I suppose the spammers (if smart enough) would quickly identify the patterns we blocked and create accounts with no identifiable patterns. The username can be a criterion but cannot be the only one. Just now, I blocked an account whose username is chushoushebei. Checking patterns in the sentences would be more efficient, I think. I asked admins to stop deleting spam sentences. Instead, I marked the remaining ones as unapproved. To do that, I simply had to search for sentences with lang |
I wonder about the wisdom of discussing how you can potentially block spam bots in a publicly-accessible forum like GitHub. Anyone who creates a bot, can easily see what you plan to do and adjust their bot accordingly. I would suggest discussing this by email between people working on this problem. |
Why is that a problem? We are only discussing general ways to prevent these spam sentences here, and those have probably been known for as long as spam has existed.
I can assure you that not all is discussed publicly. |
I thought that perhaps if you mentioned specific things you were looking for to help spot a specific spammer, then if he/she read that that was what you were looking for here at GitHub, he/she could easily avoid those things. That's why I thought that publicly discussing secrets on how you spot spammers would be like shooting yourself in the foot, and might make your job harder in the long run. I think that perhaps this particular recent bot is less of a spammer and more of an attacker who is trying to hurt the project rather than what I would think of as a spammer who is doing this for possible financial gain. |
These spams aren't specific to Tatoeba. If you search for some patterns (e.g. from sentence 8717809) you'll notice that other sites are also hit. |
Many of the sentences had tatoeba.org urls, though. |
Yes, I've noticed. But I'm pretty sure that these links are just side-effects caused by the way they add their spam. They use Tatoeba just as a billboard for the services they want to sell (like flyposting in the real world). |
I noticed that the URL I have above for easily scanning to see possible spammers is now dead. You can use this. Here is the code that you may need to update when this issue is addressed.
|
A few days ago, I scraped the profiles of Tatoeba users. Then I tried to detect spammy accounts by filtering those that:
I quickly reviewed the list of more than 8000 profiles detected and removed only about 20 false positives. I also noticed that this spammy profile phenomenon is on the rise. The final results are available online in this Excel file. I imagine that this data could be used to take action against these accounts in bulk. |
I think these could all be set to "spammer."

Looking through this Excel file, I didn't see any that couldn't safely be deleted. To be reversible, instead of deleting, perhaps we could set all of these to "spammer" status. That's what we've done to those who have spammed in places other than their profiles.

Additional Filtering Possibilities to Find More

If someone were willing to go through and remove false positives, my guess is that by removing the following filter, we could find a lot more.
Since many of the spammers take time to upload a profile image, you could try the following filters for a next level of filtering. This would likely result in fewer false positives
|
I agree, and more importantly, so does Google.
More than 2,000 profiles in the list I shared do not have an image. So I don't think this is a relevant filtering criterion.
There are currently more than 10,000 profiles with outbound links on tatoeba.org, which means that about 2,000 of these profiles have been classified as "not spam" by my very basic spam detection algorithm. There is no doubt that some of these profiles are in fact spammy. A real improvement would be to use the Akismet spam detection API provided by Automattic. Thanks to the logs of our production server, we could teach their model how to detect spammy accounts. According to their documentation, the following parameters can be used for training:
Once the training is completed, we can also think about integrating automatic spam filtering into tatoeba2. The Akismet documentation gives the following example for PHP implementation:

// Call to comment check
$data = array('blog' => 'http://yourblogdomainname.com',
              'user_ip' => '127.0.0.1',
              'user_agent' => 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6',
              'referrer' => 'http://www.google.com',
              'permalink' => 'http://yourblogdomainname.com/blog/post=1',
              'comment_type' => 'comment',
              'comment_author' => 'admin',
              'comment_author_email' => '[email protected]',
              'comment_author_url' => 'http://www.CheckOutMyCoolSite.com',
              'comment_content' => 'It means a lot that you would take the time to review our software. Thanks again.');

akismet_comment_check( '123YourAPIKey', $data );

// Passes back true (it's spam) or false (it's ham)
function akismet_comment_check( $key, $data ) {
    $request = 'blog='. urlencode($data['blog']) .
               '&user_ip='. urlencode($data['user_ip']) .
               '&user_agent='. urlencode($data['user_agent']) .
               '&referrer='. urlencode($data['referrer']) .
               '&permalink='. urlencode($data['permalink']) .
               '&comment_type='. urlencode($data['comment_type']) .
               '&comment_author='. urlencode($data['comment_author']) .
               '&comment_author_email='. urlencode($data['comment_author_email']) .
               '&comment_author_url='. urlencode($data['comment_author_url']) .
               '&comment_content='. urlencode($data['comment_content']);
    $host = $http_host = $key.'.rest.akismet.com';
    $path = '/1.1/comment-check';
    $port = 443;
    $akismet_ua = "WordPress/4.4.1 | Akismet/3.1.7";
    $content_length = strlen( $request );
    $http_request = "POST $path HTTP/1.0\r\n";
    $http_request .= "Host: $host\r\n";
    $http_request .= "Content-Type: application/x-www-form-urlencoded\r\n";
    $http_request .= "Content-Length: {$content_length}\r\n";
    $http_request .= "User-Agent: {$akismet_ua}\r\n";
    $http_request .= "\r\n";
    $http_request .= $request;
    $response = '';
    if( false != ( $fs = @fsockopen( 'ssl://' . $http_host, $port, $errno, $errstr, 10 ) ) ) {
        fwrite( $fs, $http_request );
        while ( !feof( $fs ) )
            $response .= fgets( $fs, 1160 ); // One TCP-IP packet
        fclose( $fs );
        $response = explode( "\r\n\r\n", $response, 2 );
    }
    if ( 'true' == $response[1] )
        return true;
    else
        return false;
} |
Unfortunately, this wouldn't solve the problem of someone who is offering his/her service to add links to various websites, since whoever has been provided the service can click to the link and see the "spam" is still there. Minimally, changing the profiles from "public" to "must be logged in to see" would help with this and also prevent Google and other search engines from finding the pages. |
I still think it would be a good idea to solve this problem. It seems fairly obvious that at least a number of these are created by the same person or SEO company for clients. Just in the last hour, all these accounts were created. https://tatoeba.org/en/user/profile/direcctdssp27 ... and perhaps all the ones created in the previous hour, too. I didn't check them all but at least some were. Screenshot: Members (total 69,929) https://tatoeba.org/en/users/all?sort=since&direction=desc |
While browsing the users list, I noticed that there are some users that were created just for spam, like this one: https://tatoeba.org/ita/user/profile/mingletrain
Maybe the code should be refined to have better and stricter rules for the creation of new users.