Better way to prevent the creation of spam users #1613

Open
Guybrush88 opened this issue Jul 19, 2018 · 46 comments
Labels
enhancement: Issue that describes a problem that requires a change in the current functionalities of Tatoeba.
unclear: The issue, its scope or the goal are not clearly identified.

Comments

@Guybrush88

While browsing the users list, I noticed that there are some users that were created just for spam, like this one: https://tatoeba.org/ita/user/profile/mingletrain
Maybe the code should be refined to have better and stricter rules for the creation of new users.

@jiru added the enhancement label Jul 19, 2018

@ckjpn

ckjpn commented Jan 9, 2019

Looking at recently added members today, I noticed the same problem.

https://tatoeba.org/eng/user/profile/smvec
https://tatoeba.org/eng/user/profile/tradingexact
https://tatoeba.org/eng/user/profile/slikbeauty
https://tatoeba.org/eng/user/profile/jawlineme

Access to this profile is PUBLIC. All the information can be seen by everyone.

Perhaps making it impossible to set profiles to "Access to this profile is PUBLIC." until a member has contributed sentences would help solve this problem. Likely it's more of an SEO tactic, hoping Google will count them as links, than an attempt to spam our members.

If Google can't find the links, then perhaps a lot of this would be eliminated.

Seeing how similar the formatting of the profiles is, likely only one or just a few SEO companies are doing this.

@Yorwba
Contributor

Yorwba commented Jan 9, 2019

The easiest way to make those links useless for SEO would be to include rel="nofollow" in all user-provided links (also in sentence comments etc.) to make search engines stop treating the links as an endorsement. That isn't necessarily guaranteed to stop the creation of spam accounts, because the people doing that kind of thing don't always notice when it stops working, but eventually they are likely to move on.
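
As a rough illustration of the rel="nofollow" idea, here is a minimal JavaScript sketch. Tatoeba renders links on the server, so a real change would belong in the server-side code that outputs user-provided links. `addNofollow` is a hypothetical helper name, and the regex-based approach is an assumption that only suits already-sanitized HTML with double-quoted attributes.

```javascript
// Hypothetical helper (not part of Tatoeba's code base): adds rel="nofollow"
// to every <a> tag in a snippet of already-sanitized HTML.
function addNofollow(html) {
  return html.replace(/<a\b([^>]*)>/gi, (tag, attrs) => {
    const relMatch = attrs.match(/\brel\s*=\s*"([^"]*)"/i);
    if (!relMatch) {
      // No rel attribute yet: append one.
      return `<a${attrs} rel="nofollow">`;
    }
    // Keep existing rel values and add "nofollow" if it is missing.
    const values = relMatch[1].split(/\s+/).filter(Boolean);
    if (!values.some((v) => v.toLowerCase() === "nofollow")) {
      values.push("nofollow");
    }
    const newAttrs = attrs.replace(relMatch[0], () => `rel="${values.join(" ")}"`);
    return `<a${newAttrs}>`;
  });
}

console.log(addNofollow('<a href="https://example.com">site</a>'));
// prints: <a href="https://example.com" rel="nofollow">site</a>
```

A proper implementation would more likely set the attribute wherever the link markup is generated rather than patching HTML strings afterwards.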

@ckjpn

ckjpn commented Jan 10, 2019

Somewhat related, if we could somehow make it clearer that people didn't need to create an account to browse or download the data, then we would probably not have so many accounts being created.

I know that the numbers look impressive when looking at the number of "members," but that is a deceptively high number when compared to the number of contributors.

It would likely help to very clearly state on the main page that you only need to create an account to write to the website (comments, wall, sentences, etc.) and view non-public members' profiles. This might also encourage those who may incorrectly think registration is necessary to stay and use the website longer.

@jiru
Member

jiru commented Mar 4, 2019

Maybe the code should be refined to have better and stricter rules for the creation of new users.

Maybe I’m wrong, but looking at the profiles you and CK mentioned, I think they have been created by a human rather than a machine. The country in the profile even matches the company the link is pointing to. Which means captcha and the like are probably useless. If we make the creation of new accounts more difficult, it will be a barrier for legitimate users too. I think we need to tackle this problem differently.

@jiru
Member

jiru commented Mar 4, 2019

Related: #349.

@trang
Member

trang commented Mar 5, 2019

Preventing the creation of spam users seems to be a suggested solution for a problem that is not clearly defined.

Let's talk first about the negative impacts of registration of spam users, then we can figure out what to do.

@Guybrush88
Author

what about filtering machine-created users according to their given description? Here's an example of what I would personally filter according to the provided description: https://tatoeba.org/ita/user/profile/aaditisharma

@trang
Member

trang commented Apr 9, 2019

@Guybrush88 What makes you think this profile was not created by a human?

@Guybrush88
Author

well, maybe the fact that the profile is spamming an escort website

@ckjpn

ckjpn commented Apr 10, 2019

Many of them do seem to be by the same SEO company, perhaps doing this as one of so many links they promise customers.

You'll notice similarities....

  1. All seem to have a logo-type profile image that seems to have been uploaded at the same time the username was created.
  2. All profile text seems to be similar length.
  3. All seem to have URLs in the website field.
  4. All have no sentence contributions. (maybe some of the non-SEO ones have also added spam sentences, though.)

Perhaps, if you are trying to write a script to find such contributors, this is something you could start with.
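
A script along those lines could start from something like the sketch below. It covers three of the four observations in the list (the "similar text length" one is left out because no concrete length is given). The profile object and its field names (`hasProfileImage`, `website`, `sentenceCount`) are hypothetical and not Tatoeba's actual schema.

```javascript
// Hypothetical profile shape: { hasProfileImage, website, sentenceCount }.
// Returns the list of SEO-style signals that a profile shows.
function seoSignals(profile) {
  const signals = [];
  if (profile.hasProfileImage) signals.push("profile image");
  if (typeof profile.website === "string" && profile.website.trim() !== "") {
    signals.push("website URL");
  }
  if (profile.sentenceCount === 0) signals.push("no sentences");
  return signals;
}

// Flag only profiles that show all three signals, to keep false positives low.
function shouldReview(profile) {
  return seoSignals(profile).length === 3;
}

console.log(shouldReview({ hasProfileImage: true, website: "https://shop.example", sentenceCount: 0 }));
// prints: true
```

Requiring all three signals is a deliberately cautious choice; the result is meant as a list for a human to look at, not as an automatic verdict.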

Here's a quick look at new users. See how easy it is to spot the likely spammers.

http://tatoeba.byethost3.com/new_users-2019-04-10.html
(Now that user images are based on the user ID number, this kind of page is very easy to produce.)

@trang
Member

trang commented Apr 17, 2019

@Guybrush88 I'm still unsure what the problem is on your side. On my side, if a user registers a "spam" account and doesn't do anything else, I actually don't care at all and it doesn't bother me at all. I don't use Tatoeba very frequently though so I'm probably less likely to feel a bad user experience from those "spam" accounts.

You mentioned filtering machine-created users, but I realize that I'm not sure what you mean by "filtering". Do you mean to detect upon creation that a profile is seemingly created by a machine, and prevent the registration of this account? Or do you mean not showing the profile in the list of members? Or do you mean something else?

At the end of the day, we'd just need to know what is actually the impact on your side: how do you find out about these accounts and how does it affect you? Do they bother you because you keep receiving spam private messages from these accounts, or is it something else?

@Guybrush88
Author

Do you mean to detect upon creation that a profile is seemingly created by a machine, and prevent the registration of this account?

yeah, I suppose so, if it's created to spam (since bots like horus are created by the dev team)

how do you find out about these accounts

generally when lurking this page: https://tatoeba.org/ita/users/all?sort=since&direction=desc , since sometimes I'm curious to see the growth of the user base within Tatoeba, and I also agree with the first two posts of CK.

Another reason is that I find it slightly annoying to see automatically generated users that link to external sites about services (I guess related to what CK said, that is for ranking-related things) that are completely unrelated to the general purpose of Tatoeba, like some kinds of online shopping, medical services (I recall having seen also something like this), escorts websites, and so on, which might be considered spam also by other people, in my opinion

@jiru
Member

jiru commented Apr 18, 2019

Besides the feeling of "dirtiness" of having spam around on a website I care about, the presence of fake users is tampering with stats. The profile of aaditisharma has English listed as native-level, which means the number of native speakers of English is false.

@ckjpn

ckjpn commented Apr 18, 2019

While there is no claim that all "members" are contributors, all those stats are a little deceptive since many people register and never do anything on the website. I would suspect that a number of them never come back a second time, but that's only a suspicion and perhaps it isn't true.

@trang added the unclear label Jul 27, 2019

@Guybrush88
Author

Guybrush88 commented Oct 22, 2019

today I noticed also some spam posts on the Wall. One has been already deleted, the other one is still here: https://tatoeba.org/ita/wall/show_message/33309#message_33309

The drawback of this might be that such wall posts might hide non-spam posts (i.e. people asking about specific Tatoeba features, asking language-related things, and so on)

Edit: the linked post has been taken care of

@ckjpn

ckjpn commented Oct 23, 2019

This is a related answer to a community admins email.....

Spammers are getting increasingly annoying. ....

Agreed.

Part of the problem is that tatoeba.org is easy to sign up for and
easy to write to the website.

My guess is that most, if not all of these recent spam accounts, are
by the same 1 or 2 SEO companies that sell their services, saying that
they will add many links around the web to help their customers get
higher rankings on search engines.

The profile formats are often very similar, each with a URL and each
with a logo of some kind for the profile photo. Perhaps this would be
enough of a pattern for a clever computer programmer to figure out a
way to create a filter for this and also a quick way to go through and
eliminate a lot of these accounts.

Take a look at the new members and you'll notice that many of them
with profile photos are the spammers.

https://tatoeba.org/eng/users/all?sort=since&direction=desc

It would be interesting to see a list of members, sorted with the
newest first, along with "home page" URLs.
My guess is that you could easily spot many of the accounts that
should likely be deleted this way.
Not all of the spammer accounts have profile images, but for a
"step 1", you could just look at new accounts with both a URL and a
profile photo. And then as a second step, look at the others with
homepage URLs.

If an account has no contributions and hasn't been used for six months, it should be OK to delete it.

I'm not so sure about this, but it is something we should consider.

Note that some members may have only contributed audio and no
sentences. Some may have only contributed comments. We would need to
be careful not to delete these kinds of members.

@jiru
Member

jiru commented Oct 24, 2019 via email

@ckjpn

ckjpn commented Oct 25, 2019

Perhaps changing the list to only display contributors of sentences (and possibly contributors of audio, comments and wall posts, and perhaps even members who have created lists, but no other contributions) would make this feed less cluttered.

Also, if non-contributors weren't counted in native-speaker counts, those numbers would become more accurate. Perhaps all member accounts without contributions could be hidden until contributions were made.

Perhaps you could even rename this from "members" to "contributors." From the point of view of other members of the project, it's really the contributors that matter. A non-contributing registered member isn't really any different from a non-registered visitor, I think.

You could still possibly have non-contributor but registered members' usernames show up in searches for usernames. I'm not sure if this would be needed, but it might be.

I, too, don't think captcha would help much. I also sort of suspect that verifying email addresses wouldn't help very much, though it would slow them down a little, but likely only just a little.

One advantage to constantly deleting obvious spammers, especially if most of these accounts were from the same 2 or 3 SEO companies, would be that they might possibly notice that they were just wasting their time and stop doing it. However, they may not notice. In that case, we would just be wasting our time if this couldn't be automated. On the other hand, other people who were considering doing the same thing might not be attracted to doing so.

@trang
Member

trang commented Oct 29, 2019

I suggest the following:

  1. We remove normal contributors from the Members page and display only admins, corpus maintainers and advanced contributors. Normal contributors can still be found from the search function, they are just no longer listed on that page. We have too many members anyway, it no longer makes sense to have a paginated list of all of them. Along the way we remove the total number of registered users because this stat is irrelevant.
  2. We create sub-pages for each language in the Native speakers page. These sub-pages will list the native speakers of a specific language, with the option to sort by registration date so that we can see who are the newest members in that language. This is meant to shift the lurking activity to the Native speakers page instead of the Members page. Most spammers don't seem to bother adding a native language so the list of native speakers is more relevant to lurk.

This should be enough to address the issue that we currently have with spammers.

We can also implement email verification, but there's a separate issue for that: #1703.

@Yorwba
Contributor

Yorwba commented Oct 29, 2019

I've noticed at least some spammers advertising Chinese companies set their native language to "Chinese (Jin)". They're probably rushing through the process; otherwise they should have noticed that "Mandarin Chinese" is another option. Examples of such users: xinlijie, tgcasting, textilecn, szzzxcl, spicaqwewer, sleevelining, ... based on the alphabetically last 10 self-declared Jin native speakers, about 60% appear to be spam accounts.

While creating separate lists for each language is likely to help with the original problem (making the list of users noisier) for most languages, those languages that spammers list in their profile will still be affected.

On a related note, I think search engines use links to "disreputable" sites as a signal of website quality. Is anyone monitoring tatoeba.org's search engine ranking?

@ckjpn

ckjpn commented Oct 30, 2019

We remove normal contributors from the Members page and display only admins, corpus maintainers and advanced contributors. Normal contributors can still be found from the search ...

I think this would be unfair to our "normal contributors." We have a number of contributors who have contributed a lot of sentences, but have not felt the need for tagging and linking rights. In many cases, "advanced contributors" are not really any more advanced than our "normal contributors."

I think that minimally all members who have actually contributed sentences should be displayed, at least those who have contributed 5 or more sentences. A 5 or more sentence limit would help prevent spammer sentence owners from being listed, since so far, I think, most have only added 1 sentence before one of us deletes it and sets the account to "spammer."

This is an old set of stats, but you can see the number of sentences contributed by "normal contributors" here. Just sort by status.

http://tatoeba.byethost3.com/stats-170218.html

[Addition]

Perhaps the "members" page could be generated once a week to look somewhat like the "stats-170218.html" page. That page is fairly lightweight and doesn't take so long to load. List the source to see how it's done.

@trang
Member

trang commented Oct 30, 2019

@Yorwba

I've noticed at least some spammers advertising Chinese companies set their native language to "Chinese (Jin)"

Thanks for the info. We should definitely set those as "spammer". We're not going to avoid doing some manual work when it comes to keeping the list of members clean from spammers anyway.

We could think of trying to automate it somehow (implementing some spam detection and deactivating the accounts automatically), but there are other problems involved there and there's no need to go there until we really have a huge number of spammers and it becomes humanly unmanageable.

I would say that cleaning up the full list of registered users would be unmanageable at this point. But cleaning up the list of users who have added a language to their profile would be manageable. And it would contribute to having more accurate stats.

Is anyone monitoring tatoeba.org's search engine ranking?

I don't think so. At least I don't.

I just was reminded that you made an early comment about rel="nofollow". If it has the effect that you're saying in your comment above, then it would be wise to implement it.

@ckjpn

I think this would be unfair to our "normal contributors"

We can add an info text explaining that not all members are displayed (just like there's a text on "Browse by language" pages that says only the last 1000 sentences are displayed).

We anyway display normal contributors in the other pages of the Community section (as long as they have a language in their profile), I think this is fair enough. Again, there are too many registered members by now. It doesn't make sense to have a huge list with 2000+ pages...

If there was any other way to find admins, corpus maintainers or advanced contributors, I would have suggested to remove the whole list. But currently, the "Members" page is the only page where you can go if you want to contact someone who has a special status.

I think that minimally all members who have actually contributed sentences should be displayed

Well for me it's one step further: all members who added a language in their profile should be displayed (which is done on https://tatoeba.org/eng/stats/users_languages), because it's not possible to contribute sentences without adding the language in your profile. Not everyone who has added a language to their profile will have contributed sentences, but that's okay as long as they are not spammers, I think.

@ckjpn

ckjpn commented Oct 31, 2019

.... all members who added a language in their profile should be displayed ... because it's not possible to contribute sentences without adding the language in your profile.

Remember that we have people who contributed sentences to the corpus before you added the "languages" part of the profile pages.

@ckjpn

ckjpn commented Oct 31, 2019

BTW, here is a list of all the contributors.

http://tatoeba.byethost3.com/tatoeba_contributors-2019-10-26.zip
You can easily load this into a spreadsheet and sort on any of the following fields.
Username
ID
status
sentence count
native language
native language sentences

Here is an online page with all the same data.
It will likely take about 5 seconds to load and maybe about 4 seconds to do any sort.

http://tatoeba.byethost3.com/stats-191026-all.html

@trang
Member

trang commented Oct 31, 2019

@ckjpn This is going a bit off-topic, I think. I'm not saying it wouldn't be nice to have a page that lists the contributors who have added at least one sentence, but in itself, having such a page doesn't solve the issue with spammers.

Here's my understanding of the situation.

The problem

The problem about spammers manifests itself primarily through the Members page:

  • Let's say I want to have an idea of what kind of users have registered lately, specifically what languages do they speak and perhaps other interesting info they might have added to their profile.
  • I go to the Members page.
  • I sort by "member since" with most recent members first.
  • I see that half of the profiles are spammers.
  • It's annoying.

Once in a while spammers will create a spam sentence, or post a spam comment, or post a spam message on the Wall, but that is very rare as far as I know and can be handled quite easily by admins. The major issue is spam profiles. There are too many of them.

Possible solutions

Solution 1
We make registration more strict.

Solution 2
We remove SEO incentives by adding rel="nofollow" in links.

Solution 3
We stop displaying regular contributors on the Members page.

Solution 4
We have admins check every new profile and mark the ones that are spammers as "spammer". We recruit some new admins if needed.

Solution 5
We create an algorithm to detect spammers and mark them automatically as spam.

My opinion

I'm in favor of solution 2, 3, and a bit of solution 4 (with admins only checking profiles that have a native language instead of every profile).

For solution 3, I understand that it removes the possibility to get a picture of who has been registering lately (that is the use case that I described in "The problem"). The simplest way to compensate for it is to create sub-pages in the "Native speakers" page and add the option to sort by "member since". Or we could actually skip that and just add the sort option in the sub-pages of "Languages of members". That's even simpler.

If we consider that hiding all the contributors on the Members page would still have other negative effects that are too important to be ignored, then I would be okay with just solution 2 alone.

I'm not too fond of solution 1 for the same reason as @jiru.

I'm not too fond of solution 4 if it applies to every profile. I think human intervention, to clean up spam, is always going to be necessary. But it should be channeled on situations that really matter. If a spammer creates a profile and does absolutely nothing else other than add a description and add a homepage, I see no real harm. It's only a minor annoyance. We don't need to be obsessive about it.

I'm not too fond of solution 5 because it's too high effort (that is, if we want to do it properly).

@ckjpn

ckjpn commented Nov 1, 2019

Solution 1
We make registration more strict.

I think that at this time, this should be avoided.

Solution 2
We remove SEO incentives by adding rel="nofollow" in links.

This should be easy to do, so I'd suggest doing it.

One additional possible step would be to make it so that any username with less than 1 sentence had their profiles automatically set (or reverted) to not be publicly viewable. This way, even if a spammer contributed a sentence, if an admin or corpus maintainer deleted the sentence, the profile would go back to non-public.

Solution 3
We stop displaying regular contributors on the Members page.

If we did this, I would suggest making it possible to search for these members with partial usernames, since it's often difficult to remember exact spellings for usernames.

Related issue: #1994

Solution 4
We have admins check every new profile and mark the ones that are spammers as "spammer". We recruit some new admins if needed.

I think this probably isn't a very good solution.

Solution 5
We create an algorithm to detect spammers and mark them automatically as spam.

This combined with the idea presented in Solution 4 might work.

Create an algorithm that detects possible spammers and put them in a queue for admins to manually check and set to "spammer." Make it possible for admins to easily remove non-spammers from the queue.
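
To make the queue idea a little more concrete, here is a minimal in-memory sketch. The class name `SpamReviewQueue` and its methods are hypothetical; a real version would be stored in the database and tied to Tatoeba's existing "spammer" status, neither of which is modelled here.

```javascript
// Hypothetical in-memory review queue: a detector flags accounts,
// and an admin either confirms them as spammers or dismisses them.
class SpamReviewQueue {
  constructor() {
    this.pending = new Map(); // userId -> reasons given by the detector
    this.cleared = new Set(); // userIds an admin has dismissed as non-spammers
  }

  // Called by the detection step. Dismissed users are not queued again.
  flag(userId, reasons) {
    if (this.cleared.has(userId) || this.pending.has(userId)) return false;
    this.pending.set(userId, reasons);
    return true;
  }

  // Admin decision: the account is spam.
  confirmSpammer(userId) {
    if (!this.pending.delete(userId)) return null;
    return { userId, status: "spammer" };
  }

  // Admin decision: false positive, remove from the queue for good.
  dismiss(userId) {
    if (!this.pending.delete(userId)) return false;
    this.cleared.add(userId);
    return true;
  }
}
```

Remembering dismissed users is one way to keep the same false positive from reappearing in the queue every time the detector runs.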

@ckjpn

ckjpn commented Dec 23, 2019

Solution 3
We stop displaying regular contributors on the Members page.

If we did this, I would suggest making it possible to search for these members with partial usernames, since it's often difficult to remember exact spellings for usernames.

Here is an example of one way this could be done.

http://tatoeba.byethost3.com/contributors/

@Guybrush88
Author

today there has been someone who's been constantly polluting the corpus with spam sentences in a Chinese-like language. This is the latest account: https://tatoeba.org/ita/user/profile/wowo203, and there have already been previous accounts today that have already been blocked by admins, but, apparently, this didn't stop the spam wave

@Guybrush88
Author

@AndiPersti
Contributor

Do these user creation requests usually all come from the same IP range?

Maybe there are other signs/patterns which make them suspicious (e.g. headers sent, sequence of pages requested).

@jiru
Member

jiru commented Apr 25, 2020

I had a look at the server logs.

The following users registered from the same IP address: soso201 soso202 wowo201 wowo202 wowo203 wowo204.
These users registered from another IP address: gymeimei1 gymeimei2 gymeimei3.
All the users showed the same user agent and both IPs are located in China.

Because the username and email fields of the register form produce a request on every keystroke (another problem described in #2072; actually useful here), I can tell that a human was sitting there filling the form and making occasional typos. No bots used during the registration phase.

Based on this, we could try to prevent this kind of trolls by preventing contributions of users who registered a new account using the same pattern as a previously recently blocked user.

@ckjpn

ckjpn commented Apr 26, 2020

No bots used during the registration phase.

At least in this case, email verification would've slowed him down.

@trang
Member

trang commented Apr 26, 2020

we could try to prevent this kind of trolls by preventing contributions of users who registered a new account using the same pattern as a previously recently blocked user.

Many of the usernames of the spam accounts created lately were kind of random. There's some recurring pattern (3 letters followed by 5 digits, for instance) but I suppose the spammers (if smart enough) would quickly identify the patterns we blocked and create accounts with no identifiable patterns. The username can be a criterion but cannot be the only one.
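
One possible way to compare usernames by pattern, shown only as a sketch: reduce each username to a "shape" made of letter runs and digit runs, then check whether a new username has the same shape as a recently blocked one. The function names are hypothetical, and the shape encoding is an assumption chosen for illustration.

```javascript
// Reduce a username to its "shape": L<n> for a run of n ASCII letters,
// D<n> for a run of n digits, X<n> for a run of anything else.
function usernameShape(username) {
  return username.replace(/[a-z]+|[0-9]+|[^a-z0-9]+/gi, (run) => {
    if (/^[a-z]/i.test(run)) return `L${run.length}`;
    if (/^[0-9]/.test(run)) return `D${run.length}`;
    return `X${run.length}`;
  });
}

// True when the new username has the same shape as any recently blocked one.
function matchesBlockedShape(username, recentlyBlocked) {
  const shapes = new Set(recentlyBlocked.map(usernameShape));
  return shapes.has(usernameShape(username));
}

console.log(usernameShape("wowo203")); // prints: L4D3
console.log(matchesBlockedShape("wowo204", ["soso201", "wowo203"])); // prints: true
console.log(matchesBlockedShape("chushoushebei", ["soso201", "wowo203"])); // prints: false
```

The last call shows the limitation: a username with no digits does not match the blocked shapes at all, so this check could only ever be one signal among several.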

Just right now, I blocked an account whose username is chushoushebei.

Checking patterns in the sentences would be more efficient, I think. I asked admins to stop deleting spam sentences. Instead, I marked the remaining ones as unapproved. To do that, I simply had to search for sentences with lang null and containing either wfgz or . The sentences from chushoushebei fit into the wfgz category.
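
The search described above can be pictured as a simple filter. This sketch assumes sentences are available as objects with hypothetical `id`, `lang` and `text` fields, and it only uses the `wfgz` marker, since the second marker is missing from the text above. The sample sentences are made up.

```javascript
// Return sentences with no language set whose text contains any marker.
function findSuspectSentences(sentences, markers) {
  return sentences.filter(
    (s) => s.lang === null && markers.some((m) => s.text.includes(m))
  );
}

// Made-up sample data for illustration only.
const sample = [
  { id: 1, lang: null, text: "contact wfgz123 for offers" },
  { id: 2, lang: "eng", text: "This sentence mentions wfgz but has a language." },
  { id: 3, lang: null, text: "An ordinary sentence without a language flag." },
];

console.log(findSuspectSentences(sample, ["wfgz"]).map((s) => s.id));
// prints: [ 1 ]
```

Marking the matches as unapproved rather than deleting them, as described above, keeps them available as examples for later pattern checks.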

@ckjpn

ckjpn commented Apr 27, 2020

I wonder about the wisdom of discussing how you can potentially block spam bots in a publicly-accessible forum like GitHub.

Anyone who creates a bot, can easily see what you plan to do and adjust their bot accordingly.

I would suggest discussing this by email between people working on this problem.

@AndiPersti
Contributor

I wonder about the wisdom of discussing how you can potentially block spam bots in a publicly-accessible forum like GitHub.

Why is that a problem? We are only discussing here general ways to prevent these spam sentences which are probably known since spam exists.
Furthermore this repository is open source so every countermeasure we'll implement in the code is visible anyways.
IMHO security by obscurity is never a good solution.

I would suggest discussing this by email between people working on this problem.

I can assure you that not all is discussed publicly.

@ckjpn

ckjpn commented Apr 27, 2020

I thought that perhaps if you mentioned specific things you were looking for to help spot a specific spammer, then if he/she read that that was what you were looking for here at GitHub, he/she could easily avoid those things.

That's why I thought that publicly discussing secrets on how you spot spammers would be like shooting yourself in the foot, and might make your job harder in the long run.

I think that perhaps this particular recent bot is less of a spammer and more of an attacker who is trying to hurt the project rather than what I would think of as a spammer who is doing this for possible financial gain.

@AndiPersti
Contributor

I think that perhaps this particular recent bot is less of a spammer and more of an attacker who is trying to hurt the project rather than what I would think of as a spammer who is doing this for possible financial gain.

These spams aren't specific to Tatoeba. If you search for some patterns (e.g. from sentence 8717809) you'll notice that other sites are also hit.
I'm pretty sure the spammer hopes to get some financial gains and it is very unlikely they will read this issue. (I'll bet they are just using a bot software like XRumer and won't be capable of writing their own Tatoeba-specific scripts)

@ckjpn

ckjpn commented Apr 27, 2020

Many of the sentences had tatoeba.org urls, though.

@AndiPersti
Contributor

Yes, I've noticed. But I'm pretty sure that these links are just side-effects caused by the way they add their spam. They use Tatoeba just as a billboard for the services they want to sell (like flyposting in the real world).

@Guybrush88
Author

@ckjpn

ckjpn commented Feb 18, 2022

I noticed that the URL I have above for easily scanning to see possible spammers is now dead.

You can use this.
http://tatoeba.ueuo.com/newmembers.html

Here is the code that you may need to update when this issue is addressed.

<!DOCTYPE html><html lang="en">
<head><meta charset="utf-8"><title>Scan for Spammers</title>
<style>
img{border:1px solid #aaa}
</style>
</head><body>
<h1>Scan for Possible Spam Accounts</h1>
The 1,000 most-recently registered members.
<br />Most SEO-type new member accounts have profile images, often a commercial-looking logo.
<br /><i>Find the most-recently added members on <a href="https://tatoeba.org/en/users/all?sort=since&amp;direction=desc">https://tatoeba.org/en/users/all?sort=since&amp;direction=desc</a></i>.
<script>
// Fall back to the default avatar when a user has no profile image.
function imgError(image) {
  image.onerror = null; // clear the handler so a failing fallback cannot loop
  image.src = "https://tatoeba.org/img/profiles_36/unknown-avatar.png";
  return true;
}
const mostRecent = 103053; // Change this number to the most-recently added User ID number.
const lowerNumber = mostRecent - 1000;
document.write('<br />Checking using <b>' + mostRecent + '</b> as the most-recent member ID.<ol>');
// One list item per user ID, newest first.
for (let i = mostRecent; i > lowerNumber; i--) {
  document.write('<li>');
  document.write('<a href="https://tatoeba.org/en/users/show/' + i + '">Latest Activity of User ' + i + '</a>');
  document.write(' - <a href="https://tatoeba.org/en/users/edit/' + i + '">Edit (Admins Only)</a>');
  document.write('<br /><img src="https://tatoeba.org/img/profiles_128/' + i + '.png" width="96" height="96" onerror="imgError(this);"/>');
  document.write('</li>');
}
document.write('</ol>'); // close the list in the same place it was opened
</script>
</body></html>

@LBeaudoux
Contributor

LBeaudoux commented Jul 10, 2022

A few days ago, I scraped the profiles of Tatoeba users.

Then I tried to detect spammy accounts by filtering those that:

  • don't own any sentences
  • contain outbound links to suspicious sites
  • have a description text that does not contain some keywords related to language learning
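
The three filters above could be combined roughly as follows. This is only a sketch under stated assumptions: how "suspicious sites" were actually identified is not described here, so the sketch uses a hypothetical domain blocklist, a hypothetical keyword list, and a hypothetical profile shape (`sentenceCount`, `description`, `links`).

```javascript
// Hypothetical lists, for illustration only.
const LANGUAGE_KEYWORDS = ["language", "learn", "translat", "sentence", "native"];
const SUSPICIOUS_DOMAINS = ["spam-shop.example"];

// Extract the lowercase hostname, or null if the link is not a valid URL.
function hostOf(link) {
  try {
    return new URL(link).hostname.toLowerCase();
  } catch {
    return null;
  }
}

function isLikelySpamProfile(profile, domains = SUSPICIOUS_DOMAINS, keywords = LANGUAGE_KEYWORDS) {
  // Filter 1: the user owns no sentences.
  if (profile.sentenceCount > 0) return false;
  // Filter 2: at least one outbound link points to a listed domain or subdomain.
  const hasSuspiciousLink = profile.links.some((link) => {
    const host = hostOf(link);
    return host !== null && domains.some((d) => host === d || host.endsWith("." + d));
  });
  if (!hasSuspiciousLink) return false;
  // Filter 3: the description mentions nothing related to language learning.
  const text = profile.description.toLowerCase();
  return !keywords.some((k) => text.includes(k));
}

console.log(isLikelySpamProfile({
  sentenceCount: 0,
  description: "Best deals on watches",
  links: ["https://www.spam-shop.example/"],
}));
// prints: true
```

All three conditions must hold, which matches the conservative spirit of the filters but also means profiles with unlisted domains slip through.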

I quickly reviewed the list of more than 8000 profiles detected and removed only about 20 false positives. I also noticed that this spammy profile phenomenon is on the rise.

The final results are available online in this Excel file. I imagine that this data could be used to take action against these accounts in bulk.

@ckjpn

ckjpn commented Jul 10, 2022

I think these could all be set to "spammer."

Looking through this Excel file, I didn't see any that couldn't safely be deleted.

To be reversible, instead of deleting, perhaps we could set all of these to "spammer" status. That's what we've done to those who have spammed in places other than their profiles.

Additional Filtering Possibilities to Find More

If someone were willing to go through and remove false positives, my guess is that by removing the following filter, we could find a lot more.

  • have a description text that does not contain some keywords related to language learning

Since many of the spammers take time to upload a profile image, you could try the following filters for a next level of filtering. This would likely result in fewer false positives

  • don't own any sentences
  • contain outbound links to suspicious sites
  • has a profile photo

@LBeaudoux
Contributor

perhaps we could set all of these to "spammer" status

I agree, and more importantly, so does Google.

many of the spammers take time to upload a profile image

More than 2,000 profiles in the list I shared do not have an image. So I don't think this is a relevant filtering criterion.

we could find a lot more

There are currently more than 10,000 profiles with outbound links on tatoeba.org, which means that about 2,000 of these profiles have been classified as "not spam" by my very basic spam detection algorithm. There is no doubt that some of these profiles are in fact spammy.

A real improvement would be to use the Akismet spam detection API provided by Automattic. Thanks to the logs of our production server, we could teach their model how to detect spammy accounts.

According to their documentation, the following parameters can be used for training:

  • user_ip (required). IP address of the comment submitter.
  • user_agent. User agent string of the web browser submitting the comment - typically the HTTP_USER_AGENT cgi variable.
  • referrer. The content of the HTTP_REFERER header should be sent here.
  • permalink. The full permanent URL of the entry the comment was submitted to.
  • comment_author. Name submitted with the comment.
  • comment_author_email. Email address submitted with the comment.
  • comment_author_url. URL submitted with comment. Only send a URL that was manually entered by the user, not an automatically generated URL like the user’s profile URL on your site.
  • comment_content. The content that was submitted.
  • comment_date_gmt. The UTC timestamp of the creation of the comment, in ISO 8601 format.
  • comment_post_modified_gmt. The UTC timestamp of the publication time for the post, page or thread on which the comment was posted.
  • user_role. The user role of the user who submitted the comment. This is an optional parameter.
  • Other server environmental variables. In PHP, there is an array of environmental variables called $_SERVER that contains information about the Web server itself as well as a key/value for every HTTP header sent with the request. This data is highly useful to Akismet. How the submitted content interacts with the server can be very telling, so please include as much of it as possible.
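
To make the mapping concrete, here is a minimal sketch of how one profile from the logs might be turned into an Akismet request body. The profile field names (`ip`, `username`, `description`, `userAgent`, `email`, `createdAt`) are assumptions about what the logs contain, not actual tatoeba2 column names. The `submit-spam` and `submit-ham` endpoints are the ones Akismet provides for reporting misclassified content, and the key-as-subdomain host follows the form used in their PHP example.

```javascript
// Sketch only: maps a hypothetical Tatoeba profile record to Akismet parameters.
const AKISMET_ENDPOINTS = ["comment-check", "submit-spam", "submit-ham"];

function buildAkismetRequest(apiKey, endpoint, profile) {
  if (!AKISMET_ENDPOINTS.includes(endpoint)) {
    throw new Error("Unknown Akismet endpoint: " + endpoint);
  }
  const params = new URLSearchParams();
  params.set("blog", "https://tatoeba.org");
  params.set("user_ip", profile.ip); // required by Akismet
  params.set("comment_author", profile.username);
  params.set("comment_content", profile.description);
  // Optional fields are only sent when the logs actually have them.
  if (profile.userAgent) params.set("user_agent", profile.userAgent);
  if (profile.email) params.set("comment_author_email", profile.email);
  if (profile.createdAt) params.set("comment_date_gmt", profile.createdAt);
  return {
    url: "https://" + apiKey + ".rest.akismet.com/1.1/" + endpoint,
    body: params.toString(), // application/x-www-form-urlencoded
  };
}
```

The function only builds the URL and the form-encoded body; sending it would be a plain POST with `Content-Type: application/x-www-form-urlencoded`. Keeping the network call separate makes it easy to dry-run the mapping over the whole list before reporting anything.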

Once the training is completed, we can also think about integrating automatic spam filtering into tatoeba2. The Akismet documentation gives the following example for PHP implementation:

// Call to comment check
$data = array('blog' => 'http://yourblogdomainname.com',
          'user_ip' => '127.0.0.1',
          'user_agent' => 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6',
          'referrer' => 'http://www.google.com',
          'permalink' => 'http://yourblogdomainname.com/blog/post=1',
          'comment_type' => 'comment',
          'comment_author' => 'admin',
          'comment_author_email' => '[email protected]',
          'comment_author_url' => 'http://www.CheckOutMyCoolSite.com',
          'comment_content' => 'It means a lot that you would take the time to review our software.  Thanks again.');
akismet_comment_check( '123YourAPIKey', $data );
// Passes back true (it's spam) or false (it's ham)
function akismet_comment_check( $key, $data ) {
    // Build the URL-encoded request body from the comment data
    $request = 'blog='. urlencode($data['blog']) .
               '&user_ip='. urlencode($data['user_ip']) .
               '&user_agent='. urlencode($data['user_agent']) .
               '&referrer='. urlencode($data['referrer']) .
               '&permalink='. urlencode($data['permalink']) .
               '&comment_type='. urlencode($data['comment_type']) .
               '&comment_author='. urlencode($data['comment_author']) .
               '&comment_author_email='. urlencode($data['comment_author_email']) .
               '&comment_author_url='. urlencode($data['comment_author_url']) .
               '&comment_content='. urlencode($data['comment_content']);
    $host = $http_host = $key.'.rest.akismet.com';
    $path = '/1.1/comment-check';
    $port = 443;
    $akismet_ua = "WordPress/4.4.1 | Akismet/3.1.7";
    $content_length = strlen( $request );
    // Assemble the raw HTTP POST request by hand
    $http_request  = "POST $path HTTP/1.0\r\n";
    $http_request .= "Host: $host\r\n";
    $http_request .= "Content-Type: application/x-www-form-urlencoded\r\n";
    $http_request .= "Content-Length: {$content_length}\r\n";
    $http_request .= "User-Agent: {$akismet_ua}\r\n";
    $http_request .= "\r\n";
    $http_request .= $request;
    $response = '';
    if( false != ( $fs = @fsockopen( 'ssl://' . $http_host, $port, $errno, $errstr, 10 ) ) ) {

        fwrite( $fs, $http_request );

        while ( !feof( $fs ) )
            $response .= fgets( $fs, 1160 ); // One TCP-IP packet
        fclose( $fs );

        // Split the raw response into headers [0] and body [1]
        $response = explode( "\r\n\r\n", $response, 2 );
    }

    // $response is still an empty string if the connection failed,
    // so make sure a body exists before reading it
    if ( is_array( $response ) && isset( $response[1] ) && 'true' == $response[1] )
        return true;
    else
        return false;
}

@ckjpn

ckjpn commented Oct 10, 2022

Solution 2
We remove SEO incentives by adding rel="nofollow" in links.

Unfortunately, this wouldn't solve the problem of someone who is offering a service to add links to various websites, since the client who paid for the service can still follow the link and see that the "spam" is there.

Minimally, changing the profiles from "public" to "must be logged in to see" would help with this and also prevent Google and other search engines from finding the pages.
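
For reference, the rel="nofollow" idea from Solution 2 could look roughly like the sketch below. This is only an illustration of the transformation: `addNoFollow` is a made-up helper, it only handles double-quoted `rel` attributes, and a real change in tatoeba2 would more likely be made server-side where the profile description is rendered.

```javascript
// Add rel="nofollow" to every <a> tag in an HTML fragment (sketch only).
function addNoFollow(html) {
  return html.replace(/<a\b([^>]*)>/gi, (tag, attrs) => {
    // Look for an existing rel="..." attribute (double-quoted values only).
    const rel = attrs.match(/(\s)rel\s*=\s*"([^"]*)"/i);
    if (!rel) {
      return "<a" + attrs + ' rel="nofollow">';
    }
    const values = rel[2].split(/\s+/).filter(Boolean);
    if (values.some((v) => v.toLowerCase() === "nofollow")) {
      return tag; // already marked, leave untouched
    }
    values.push("nofollow");
    const newRel = rel[1] + 'rel="' + values.join(" ") + '"';
    return "<a" + attrs.replace(rel[0], () => newRel) + ">";
  });
}
```

As noted above, this only removes the search-engine benefit; the link itself stays visible to anyone who opens the profile.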

@ckjpn

ckjpn commented Jun 28, 2023

I still think it would be a good idea to solve this problem.

It seems fairly obvious that at least a number of these are created by the same person or SEO company for clients.

Just in the last hour, all these accounts were created.

https://tatoeba.org/en/user/profile/direcctdssp27
https://tatoeba.org/en/user/profile/raymond0213
https://tatoeba.org/en/user/profile/e81stdD
https://tatoeba.org/en/user/profile/radiant12
https://tatoeba.org/en/user/profile/AnExxcellentJaN
https://tatoeba.org/en/user/profile/DIPLOMENT
https://tatoeba.org/en/user/profile/jun3all
https://tatoeba.org/en/user/profile/Drlindacarr
https://tatoeba.org/en/user/profile/Worksman
https://tatoeba.org/en/user/profile/lenalenabobena125
https://tatoeba.org/en/user/profile/laquinta

... and perhaps all the ones created in the previous hour, too. I didn't check them all but at least some were.

Screenshot

Members (total 69,929)

https://tatoeba.org/en/users/all?sort=since&direction=desc

[Screenshot of the members list, taken 2023-06-29]
