Internet.UserName ru locale return numeric characters only #225

Orygeunik opened this issue Jun 22, 2019 · 9 comments · Fixed by #233

@Orygeunik

Orygeunik commented Jun 22, 2019

Situation
I wrote the following code (in C#):

    public class UserInfoFaker : Faker<UserInfo>
    {
        public UserInfoFaker() : base("ru")
        {
            RuleFor(ui => ui.UserLogin, f => f.Internet.UserName());
            RuleFor(ui => ui.UserPassword, f => f.Internet.Password());
        }
    }

Elsewhere, I call it like this:

UserInfoFaker userInfoFaker = new UserInfoFaker();
List<UserInfo> tempCollection = userInfoFaker.Generate(5);

The resulting collection contains:

Field name    Field value     Field type
UserLogin     "_"             string
UserPassword  "DgGEGqL4DC"    string
UserLogin     "75"            string
UserPassword  "mL_gZWYimn"    string
UserLogin     "7"             string
UserPassword  "gcR4Hh0CBK"    string
UserLogin     "94"            string
UserPassword  "LGkLSIXOLr"    string
UserLogin     "62"            string
UserPassword  "TLF5FB7NXx"    string

If I change the code to:

    public class UserInfoFaker : Faker<UserInfo>
    {
        public UserInfoFaker()// : base("ru")
        {
            Locale = "ru";

            RuleFor(ui => ui.UserLogin, f => f.Internet.UserName());
            RuleFor(ui => ui.UserPassword, f => f.Internet.Password());
        }
    }

The resulting collection contains:

Field name    Field value            Field type
UserLogin     "Santa.Turcotte66"     string
UserPassword  "TjQR84wh1m"           string
UserLogin     "Samantha_Fadel86"     string
UserPassword  "JXzBKtqDtx"           string
UserLogin     "Justyn43"             string
UserPassword  "z8NFDRdsLO"           string
UserLogin     "Madisen50"            string
UserPassword  "YfcwBus8Ld"           string
UserLogin     "Novella.Schiller37"   string
UserPassword  "Ma4GV7rAec"           string

Another case. With this code:

Faker faker = new Faker("ru");
Person person = faker.Person;

The Person object contains:

Field name  Field value          Field type
Email       "[email protected]"  string
FirstName   "Анна"               string
FullName    "Анна Фомина"        string
Gender      Female               Bogus.DataSets.Name.Gender
LastName    "Фомина"             string
UserName    "64"                 string
Website     "дарья.com"          string

Btw Russian fullname is correct :)

Why are UserName and Email not generated correctly with the Russian locale ("ru")?

@bchavez
Owner

bchavez commented Jun 22, 2019

Hi @Orygeunik,

Thanks for your question. I think this is due to issue #86.

The reason is, people complained that Bogus generated emails with diacritics. So, Bogus would have generated an email address like Анна.Фомина[email protected]; which in some cases, might be correct or not correct; depending on your method of email validation. I'm guessing in most cases using an email with a username like: Анна.Фомина[email protected] would fail typical email validation in most systems.

IIRC, technically, non-ASCII characters are valid in email addresses, but most systems' validation rejects addresses that contain them.

The issue here is really because .Email() is calling .Slugify() as shown below:

public string Email(string firstName = null, string lastName = null, string provider = null, string uniqueSuffix = null)
{
   provider = provider ?? GetRandomArrayItem("free_email");
   return Utils.Slugify(UserName(firstName, lastName)) + uniqueSuffix + "@" + provider;
}

public static string Slugify(string txt)
{
   var str = txt.Replace(" ", "-").RemoveDiacritics();
   return Regex.Replace(str, @"[^a-zA-Z0-9\.\-_]+", "");
}

/// <summary>
/// A string extension method that removes the diacritics character from the strings.
/// </summary>
/// <param name="this">The @this to act on.</param>
/// <returns>The string without diacritics character.</returns>
public static string RemoveDiacritics(this string @this)
{
   string normalizedString = @this.Normalize(NormalizationForm.FormD);
   var sb = new StringBuilder();
   foreach( char t in normalizedString )
   {
      UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(t);
      if( uc != UnicodeCategory.NonSpacingMark )
      {
         sb.Append(t);
      }
   }
   return sb.ToString().Normalize(NormalizationForm.FormC);
}

As you can see, the call stack ultimately goes through .RemoveDiacritics() and then the Regex.Replace() in .Slugify(), which strips every non-ASCII glyph from a string like Анна Фомина, so you sometimes get [email protected] email addresses.
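To make the effect concrete, here is a minimal standalone sketch that copies the Slugify() and RemoveDiacritics() bodies quoted above into a hypothetical SlugifyDemo class (the class name is made up for illustration; it is not part of Bogus). It assumes the generated user name goes through the same Slugify() step, which is consistent with the "_" and "75" values reported at the top of this issue.

```csharp
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical demo class; the two method bodies are copied from the snippet quoted above.
public static class SlugifyDemo
{
    public static string Slugify(string txt)
    {
        var str = RemoveDiacritics(txt.Replace(" ", "-"));
        // Everything outside a-z, A-Z, 0-9, '.', '-' and '_' is dropped here,
        // which includes all Cyrillic letters.
        return Regex.Replace(str, @"[^a-zA-Z0-9\.\-_]+", "");
    }

    public static string RemoveDiacritics(string input)
    {
        // Decompose, drop combining marks, recompose.
        string normalized = input.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder();
        foreach (char t in normalized)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(t) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(t);
            }
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

With this sketch, SlugifyDemo.Slugify("Анна_Фомина") returns "_" and SlugifyDemo.Slugify("Анна75") returns "75", while an ASCII name such as "Santa.Turcotte66" passes through unchanged.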

I guess this boils down to the following question: is Анна.Фомина98 a valid user name? or is Anna.Fomina98? I don't know.

For now, there are a few workarounds, which basically extend Bogus with your own custom extension methods:

void Main()
{
   var faker = new UserInfoFaker();
   faker.Generate(10).Dump();
}

public class UserInfoFaker : Faker<UserInfo>
{
   public UserInfoFaker() : base("ru")
   {
      RuleFor(ui => ui.FirstName, f => f.Person.FirstName);
      RuleFor(ui => ui.LastName, f => f.Person.LastName);
      RuleFor(ui => ui.UserLogin, f => f.UserName2() );
      RuleFor(ui => ui.UserLogin3, (f, ui) => f.UserName3(ui.FirstName, ui.LastName) );
      RuleFor(ui => ui.UserPassword, f => f.Internet.Password());
   }
}
public class UserInfo{
   public string FirstName{get;set;}
   public string LastName{get;set;}
   public string UserLogin{get;set;}
   public string UserLogin3{get;set;}
   public string UserPassword{get;set;}
}

public static class CustomExtensions{
   public static string UserName2(this Faker f){
      var en = f.Name["en"];
      return f.Internet.UserName(en.FirstName(), en.LastName());
   }
   public static string UserName3(this Faker f, string firstName, string lastName)
   {
      var val = f.Random.Number(2);

      string result;

      if (val == 0)
      {
         result = firstName + f.Random.Number(99);
      }
      else if (val == 1)
      {
         result = firstName + f.Random.ArrayElement(new[] { ".", "_" }) + lastName;
      }
      else
      {
         result = firstName + f.Random.ArrayElement(new[] { ".", "_" }) + lastName + f.Random.Number(99);
      }

      result = result.Replace(" ", string.Empty);
      return result;
   }
}

[Screenshot: LINQPad output]

I think there's possibly some work we could do here to make it a little easier if you want to keep the diacritics. Perhaps a parameter like .UserName(string firstName, string lastName, bool removeDiacritics = true) that will allow you to keep the characters intact.

Let me know what you think.

Thanks,
Brian Chavez

💨 🚶 "Bubbles of gas in my brain... Send me off balance, it's not enough"

@Orygeunik
Author

Orygeunik commented Jun 22, 2019

It's good.

I think a second solution would be to transliterate the non-Latin characters into Latin. (Maybe an enum option: remove, translit, other?)

For example
Анна Фомина -> Anna Fomina

It's a simple and modular way.

And why is it that when you set the locale in the .ctor body (as in the example above), logins/emails are generated as if the locale were "en" (the default)?

@Orygeunik
Author

Additional question.
With the "ru" locale, passwords are still generated with Latin letters only (which is not quite realistic: KeePass, for example, accepts a master password in Russian, and for the same number of characters that gives greater bit strength than Latin alone).
Maybe it makes sense to apply the same logic to logins and email?
Well, or separately specify what characters to use for generation...

@bchavez
Owner

bchavez commented Jun 23, 2019

Hi @Orygeunik,

I just learned the process of translating Unicode characters to US-ASCII Latin/Roman characters is called "transliteration". Knowing is half the battle. Lol. 😃

I'll see what I can do to make Bogus better in this respect. To be honest, the issue described here has been an issue I never really put to rest. I had a feeling this issue was going to come up again. So I think it's finally time to put this issue at rest once and for all.

If the community has more input or anyone can offer more insight, please let me know. As far as I can tell from a quick Google search, these are some projects that specifically 'solve' transliteration:

https://github.com/pid/speakingurl
https://github.com/dzcpy/transliteration
https://archive.codeplex.com/?p=unidecode
https://www.nuget.org/packages/UnidecodeSharpFork/

If anyone has experience using them (or with transliteration in general), please let me know.

As for password generation with Cyrillic letters, I don't think Bogus will change the password generation algorithm at the moment. But I do get what you're saying, it would be nice if Bogus switched algorithms when ru is selected, but I don't think we're ready for that yet. I'll keep it in mind. However, if this is important to you right now, you can generate Cyrillic passwords using the same C# extension method approach too:

void Main()
{
   
   var faker = new UserInfoFaker();
   faker.Generate(10).Dump();
}

public class UserInfoFaker : Faker<UserInfo>
{
   public UserInfoFaker() : base("ru")
   {
      RuleFor(ui => ui.FirstName, f => f.Person.FirstName);
      RuleFor(ui => ui.LastName, f => f.Person.LastName);
      RuleFor(ui => ui.UserPassword, f => f.Internet.RuPassword());
   }
}
public class UserInfo
{
   public string FirstName { get; set; }
   public string LastName { get; set; }
   public string UserPassword { get; set; }
}

public static class CustomExtensions
{
   private static readonly char[] RuChars = "АаБбВвГгДдЕеЁёЖжЗзИиЙйКкЛлМмНнОоПпРрСсТтУуФфХхЦцЧчШшЩщЪъЫыЬьЭэЮюЯя".ToArray();

   public static string RuPassword(this Bogus.DataSets.Internet i, int? len = null)
   {
      var length = len ?? i.Random.Number(8, 10);
      
      var picked = i.Random.ArrayElements(RuChars, length);
      
      return new string(picked);
   }
}

[Screenshot: LINQPad output]

☁️ ☀️ Bassnectar - Chasing Heaven - Into the Sun

@bchavez
Owner

bchavez commented Jun 23, 2019

Also, to answer your question:

And why is it that when you set the locale in the .ctor body (as in the example above), logins/emails are generated as if the locale were "en" (the default)?

I don't think you can set the .Locale in the constructor body in Faker<T> because the FakerHub facade is created in the constructor of Faker<T>. The reason there's a set property is that the ILocaleAware interface defines Locale {get; set} which does change the locale if you are using an individual DataSet like new Internet().Locale = "ru". However, if you want to change the locale for whatever reason dynamically in Faker<T>, you'll need to set the FakerHub in Faker<T>.ctor instead. Like this:

public class UserInfoFaker : Faker<UserInfo>
{
   public UserInfoFaker()// : base("ru")
   {
      // Locale = "ru"; doesn't work here.
      this.FakerHub = new Faker("ru");

      RuleFor(ui => ui.UserLogin, f => f.Internet.UserName());
      RuleFor(ui => ui.UserPassword, f => f.Internet.Password());
   }
}

But this still won't solve your original problem.

@Orygeunik
Author

Orygeunik commented Jun 23, 2019

Each country has its own rules of transliteration (транслит). For Russian letters, you can quickly google simple tables (which will be more than enough at first).

Typically, the official transliteration rules are the ones used when issuing a foreign-travel passport.
Below is a picture of a table that can be used for airline ticketing:
pic


Additionally.
Transliterating names/surnames exactly is not necessary. In Russia, for example, several transliterated spellings of a first name, surname (and, by the way, patronymic) are acceptable.
I.e. the Russian name Василий can be written as Vasilii, as Vasiliy, or as Vasily.
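A table-driven approach like the one described could be sketched as follows. The RuTranslitSketch class and its character table are assumptions made up for illustration (a passport-style scheme that happens to produce the Vasilii spelling); it is not the table from the picture and not what Bogus ships.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical sketch: one fixed Russian-to-Latin table, one letter at a time.
public static class RuTranslitSketch
{
    // Assumed mapping; other valid schemes would use "y" for й, "ya" for я, and so on.
    private static readonly Dictionary<char, string> Map = new Dictionary<char, string>
    {
        ['а'] = "a", ['б'] = "b", ['в'] = "v", ['г'] = "g", ['д'] = "d", ['е'] = "e",
        ['ё'] = "e", ['ж'] = "zh", ['з'] = "z", ['и'] = "i", ['й'] = "i", ['к'] = "k",
        ['л'] = "l", ['м'] = "m", ['н'] = "n", ['о'] = "o", ['п'] = "p", ['р'] = "r",
        ['с'] = "s", ['т'] = "t", ['у'] = "u", ['ф'] = "f", ['х'] = "kh", ['ц'] = "ts",
        ['ч'] = "ch", ['ш'] = "sh", ['щ'] = "shch", ['ъ'] = "ie", ['ы'] = "y",
        ['ь'] = "", ['э'] = "e", ['ю'] = "iu", ['я'] = "ia"
    };

    public static string Transliterate(string input)
    {
        var sb = new StringBuilder(input.Length * 2);
        foreach (char c in input)
        {
            // Look up by lowercase letter; anything not in the table is kept as-is.
            if (!Map.TryGetValue(char.ToLowerInvariant(c), out string latin))
            {
                sb.Append(c);
                continue;
            }

            if (char.IsUpper(c) && latin.Length > 0)
            {
                // Preserve capitalization: "Щ" becomes "Shch", not "shch".
                sb.Append(char.ToUpperInvariant(latin[0]));
                sb.Append(latin, 1, latin.Length - 1);
            }
            else
            {
                sb.Append(latin);
            }
        }
        return sb.ToString();
    }
}
```

With this table, RuTranslitSketch.Transliterate("Анна Фомина") returns "Anna Fomina" and "Василий" becomes "Vasilii". Supporting the Vasiliy or Vasily spellings would mean swapping in a different table, which is one reason to keep the table as data and not hard-code it.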

@rynkevich

Hi @Orygeunik, @bchavez. Accidentally found this conversation.

The idea of having transliteration in Bogus is very cool! However, I don't think it is possible for every language.

Actually, Russian could be one of the easiest cases. While it requires a simple table (a dictionary) to look up letters/syllables, some languages are not so easily transliterated. For example, as far as I know, kanji (hieroglyphs) in Japanese could be quite polyphonic, changing their sound depending on the context they are used in. This particular problem is also mentioned in this issue of Slugify Elixir library.
Additionally, all of the projects mentioned by @bchavez support only a limited list of languages.

What I want to say is if you decide to implement this feature, you will face the need to assemble multiple transliteration libraries into one, which will definitely require considerable effort and may not be a complete and desirable solution for the problem.

@bchavez
Owner

bchavez commented Jun 25, 2019

Hi Arseni,

Thank you for the feedback and insights. I really appreciate it. I don't have much experience in this area so, anything helps!

I think you are right about not being able to solve the problem 100% for all locales. My hope is we can cover a good majority of them; like ru. I think it's okay if we're not 100% complete since this is the approach we are taking with the current data ports from faker.js. Not all our locales are 100% complete. 😞 At the very least, whatever we come up with should be extensible for those users that don't have complete data.

My first implementation attempt used several massive dictionary-like character replacements to perform transliteration. It was a straight-up port of speakingurl's algorithm to C#. The C# implementation was very ugly and brittle, but it worked. Ultimately, though, I wasn't happy with it.

My second implementation attempt uses a Trie data-structure for character replacements. I just got the basic algorithm working last night successfully. IMHO, it is a big improvement over speakingurl and is more aligned with what I had in mind. The implementation is quite elegant too. Since we're using a Trie, you can "probe ahead" and replace chunks of the input string. For example,

  • 1 input character can be replaced with 10 characters or,
  • 5 input characters can be replaced with 0 or more characters.

It should also work with locale-specific translations, where the same character can have two different translations depending on the locale you're using, e.g.:

  • Locale en: '♥' -> 'love'
  • Locale es: '♥' -> 'amor'.
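For readers curious what a trie with "probe ahead" replacement might look like, here is a minimal sketch. The ReplacementTrie class, its Add/Apply methods, and the longest-match policy are all assumptions for illustration; this is not the code that went into Bogus.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical sketch of trie-based replacement with longest-match lookahead.
public sealed class ReplacementTrie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public string Replacement; // stays null unless a key ends at this node
    }

    private readonly Node root = new Node();

    // Registers a key (one or more input characters) and its replacement (zero or more characters).
    public void Add(string key, string replacement)
    {
        var node = root;
        foreach (char c in key)
        {
            if (!node.Children.TryGetValue(c, out var next))
            {
                next = new Node();
                node.Children[c] = next;
            }
            node = next;
        }
        node.Replacement = replacement;
    }

    public string Apply(string input)
    {
        var sb = new StringBuilder();
        int i = 0;
        while (i < input.Length)
        {
            // Probe ahead from position i, remembering the longest key seen so far.
            var node = root;
            string best = null;
            int bestLength = 0;
            for (int j = i; j < input.Length; j++)
            {
                if (!node.Children.TryGetValue(input[j], out var next)) break;
                node = next;
                if (node.Replacement != null)
                {
                    best = node.Replacement;
                    bestLength = j - i + 1;
                }
            }

            if (best != null)
            {
                sb.Append(best);   // replace the whole matched chunk
                i += bestLength;
            }
            else
            {
                sb.Append(input[i]); // no key starts here; copy the character
                i++;
            }
        }
        return sb.ToString();
    }
}
```

One way to get locale-specific behavior under this design is to keep one trie per locale: a trie for en with Add("♥", "love") turns "I ♥ C#" into "I love C#", while a separate trie for es with Add("♥", "amor") would give "I amor C#".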

A lot of these libraries are "slug"ifying text in the middle of processing characters, which tends to make algorithms like speakingurl's overly complex and difficult to understand and port.

In Bogus, I want transliteration and slugifying to be two separate and distinct operations. When both operations are separate, these algorithms tend to be more elegant and easier to understand, and maintain.

I still have more work and experiments to do. Also, I still need to do more work understanding how other libraries (like the Elixir library you pointed out) solve the same problem. Hopefully, at the end of this, we'll have a half-way decent implementation for Bogus. :)

Again, I want to thank everyone for their input and feedback. It is immensely helpful when others with more experience look ahead and give feedback.

🏖️ 🎺 Beach Boys - Good Vibrations (Nick Warren bootleg)

@bchavez
Owner

bchavez commented Jul 2, 2019

Hi @Orygeunik , @rynkevich ,

Basic transliteration support is now available in Bogus v28.0.1.

The Bogus.Extensions namespace has a string.Transliterate() method that transliterates Unicode characters to US-ASCII/Roman/Latin character sets.

Additionally, Person and Faker should work as expected.

void Main()
{
   Enumerable.Range(1, 10)
   .Select(_ => new Person("ru"))
   .Select(p => new {FullName = p.FullName, Email = p.Email, UserName = p.UserName})
   .Dump();
}

[Screenshot: LINQPad output]

Additionally, there is a new method, Internet.UserNameUnicode(), that preserves Unicode characters when creating user names. You can use this method to compose email addresses with Unicode characters if you choose.

Hope it works out well.

Thanks,
Brian
