Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faker.Person.FirstName seems really slow compared to Faker.Name.FirstName #248

Closed
ThaDaVos opened this issue Aug 27, 2019 · 6 comments
Closed
Labels

Comments

@ThaDaVos
Copy link

Version Information

Software Version(s)
Bogus NuGet Package 28.0.2
.NET Core? 3.0 Preview 8
.NET Full Framework?
Windows OS? Windows 10
Linux OS?
Visual Studio? 16.3.0 Preview 2.0

What locale are you using with Bogus?

default so I guess english locale

What's the problem?

I'm generating 1.000.000 users but when I'm using anything related to the Faker.Person it really slows down -> 1.000.000 in 43secs to 2+ min -> if I use Faker.Name.FirstName instead of Faker.Person.FirstName it is just quick as expected -> got to say, I'm using multithreading using Parallel.For and came across this issue: #234

What possible solutions have you considered?

Using Faker.Name.FirstName instead of Faker.Person.Name but how would I replace Faker.Person.Email?

Do you have sample code to show what you're trying to do?

I can share the repo if needed (have to push it online first)
(Please be complete. Include any class models necessary to run your code.)

@bchavez
Copy link
Owner

bchavez commented Aug 27, 2019

Hi @dvdbot,

To replace f.Person.Email use the .RuleFor( t => t.prop, (f, t) =>...) overload, then use the context parameters of f.Internet.Email(t.FirstName, t.LastName). The code below shows how to remove accessing f.Person.Email:

void Main()
{
   var userFaker = new Faker<User>()
      .RuleFor(u => u.FirstName, f => f.Name.FirstName())
      .RuleFor(u => u.LastName, f => f.Name.LastName())
      .RuleFor(u => u.Email, (f, u) => f.Internet.Email(u.FirstName, u.LastName));
      
   userFaker.Generate(5).Dump();
}

public class User
{
   public string FirstName {get;set;}
   public string LastName {get;set;}
   public string Email{get;set;}
}
FirstName LastName Email
Eunice Feest [email protected]
Flo Gleichner [email protected]
Hattie Lockman [email protected]
Mohamed Muller [email protected]
Natasha Erdman [email protected]

If you're strictly concerned about performance, other performance ideas you might want to consider are described below:

  • Don't use Faker<T> and try using the Faker facade pattern (or individual datasets). See here. Strictly from a performance perspective, the problem with Faker<T> is that reflection is being used under the hood in Faker<T>. This is just one of those trade-offs of convenience vs performance.

  • Try testing your 1M generation without Parallel.For especially when using Faker<T>. Faker<T> has an internal lock to keep things somewhat sane when multiple threads are partying on Faker<T>.Generate() methods. There might be a possibility of Parallel.For causing more harm than good through lock contention on a single instance of Faker<T> and .Generate(). You'll have to investigate if that's the case with your project. If you must use Parallel.For, make sure each parallel has its own dedicated new Faker<T> instance using local seeds.

  • If f.Internet.Email() is too slow, try simplifying email generation with the following rule:

    .RuleFor(u => u.Email, (f, u) => $"{u.FirstName}{f.IndexGlobal}@test.com");

    Generating email addresses is a somewhat involved process because transliteration is required to (again) keep things somewhat sane. Each character generated in an email address goes through a trie data structure to replace diacritic characters in names to their closest Roman/Latin character equivalents. IE: Анна Фомина -> Anna Fomina. So, simplifying your email generation might help. See Internet.UserName ru locale return numeric characters only #225 and Support for Transliteration #233 for more info. Again, just one of those trade-offs; but at least you have an escape hatch.

  • Maybe take a step back and look at the overall data ETL process. How frequently do you need to generate 1M? Perhaps, you can generate the expensive Bogus data one time (or less frequently) and save the Bogus generated results in a JSON file. Then use that JSON file as part of your ETL process. Pre-rendering data beforehand for multiple repeat uses would be better than generating expensive data on the fly on each use.

This is the best I can do with the information you've provided. If you want this investigated further, you'll have to post some kind of code that can reproduce the issue, so we can further investigate. With issues like this, I prefer to have concrete code that reproduces an issue that you're seeing otherwise, it's easy to chase phantoms. Also, if we do find that something in the email code is too slow like traversing the trie for transliteration, current code like that won't change unless we can come up with an equivalent solution that offers better code maintainability without increasing complexity.

Feel free to re-open the issue if you can provide more information or pinpoint any performance issues in the source.

Let me know if that helps.

Thanks,
Brian

⚡ 🏃 "I was struck by lighting. Walkin' down the street..."

@ThaDaVos
Copy link
Author

It's not only the email - The thing that stumped me the most is that the Faker.Person.FirstName is a lot slower than Faker.Name.FirstName -> maybe because it generates a whole person?

@bchavez
Copy link
Owner

bchavez commented Aug 27, 2019

Yes, you're right. Accessing f => f.Person by the first .RuleFor rule will create a new Person() under the hood. f => f.Person is lazy loaded in Faker<T>.

Before each T in Faker<T> is generated, the person context instance is cleared for the next generation of T. So, any rules accessing .RuleFor(.., f => f.Person...) will cause a new Person() to spawn for some instance of T.

The code for that should be here:

private Person person;
/// <summary>
/// A contextually relevant fields of a person.
/// </summary>
public Person Person => person ?? (person = new Person(this.Random, this.Locale));

Each new person will have a startup cost of Populate:

protected internal virtual void Populate()
{
this.Gender = this.Random.Enum<Name.Gender>();
this.FirstName = this.DsName.FirstName(this.Gender);
this.LastName = this.DsName.LastName(this.Gender);
this.FullName = $"{this.FirstName} {this.LastName}";
this.UserName = this.DsInternet.UserName(this.FirstName, this.LastName);
this.Email = this.DsInternet.Email(this.FirstName, this.LastName);
this.Website = this.DsInternet.DomainName();
this.Avatar = this.DsInternet.Avatar();
this.DateOfBirth = this.DsDate.Past(50, Date.SystemClock().AddYears(-20));
this.Phone = this.DsPhoneNumbers.PhoneNumber();
this.Address = new CardAddress
{
Street = this.DsAddress.StreetAddress(),
Suite = this.DsAddress.SecondaryAddress(),
City = this.DsAddress.City(),
State = this.DsAddress.State(),
ZipCode = this.DsAddress.ZipCode(),
Geo = new CardAddress.CardGeo
{
Lat = this.DsAddress.Latitude(),
Lng = this.DsAddress.Longitude()
}
};
this.Company = new CardCompany
{
Name = this.DsCompany.CompanyName(),
CatchPhrase = this.DsCompany.CatchPhrase(),
Bs = this.DsCompany.Bs()
};
}

Which is probably why you're seeing such a high, first-time, initialization cost accessing f.Person.

@bchavez
Copy link
Owner

bchavez commented Aug 28, 2019

Hey @dvdbot , yesterday, I did a quick perf test and found similar results as you've described. Without using any f.Person, I can generate 1M in about 8 seconds. Using f.Person increases wall time to about 2m22s for 1M generations.

I wrote a lazy-loaded property Person class that when used from a Faker<T> context, generates f.Person values on an as-needed basis for .RuleFor(..., f => f.Person...) on the person class. This brought down the wall time from 2m22s to 12 seconds.

I'm still thinking about the consequences of changing Person and if this is something we should go forward with.

Pros

  • This is a common use case and the common use case should be fast. I share your sentiment that w.r.t semantics, something this common and straight forward shouldn't kill your performance.
  • Current behavior of Person will remain the same. new Person will still populate all properties at the same time. However, an internal new Person(faker) constructor will setup Person properties for lazy loaded usage by use from Faker<T>.

Cons

  • Fields to Properties: We'd have to change Person.FirstName, .LastName. etc, fields to properties. Need to investigate more about the consequences this change will have in F# as it relates to Field vs Property access patterns.

  • Serialization Side Effects: From what I've seen Person is used a lot in serialization to databases. This might still be okay; if not better.

  • Reflection: People reflecting over Person might break expecting fields instead of properties. However, I suspect the usage pool is very small.

  • Maintainability: Makes Person a little more complex, not too bad, but perhaps it's time to consider separating out standalone Person vs Faker<T> usage.

  • Determinism: When Person is in lazy mode, deterministic f.Person.Prop values are now dependent on the order in which properties are accessed. In lazy mode, if the calls into f.Person aren't deterministic then neither will view of a Person in a sequentially ordered sequence.
    For example, calling:

    // Some Nth generation of T in Faker<T>
    f.Person.FirstName
    f.Person.LastName    // Accessing any of these person properties, in any order,
    f.Person.DateOfBirth // has no effect on the other fields.
    vs
    f.Person.FirstName
    f.Person.DateOfBirth // different value because it was accessed before LastName
    f.Person.LastName    // completely different value because DOB was accessed first.
    

    This would impose a requirement for developers that all usages of f.Person in .RuleFor rules to source all randomization from a deterministic source. Any if or else statement blocks that access f.Person would need to be deterministic. Otherwise, it would not be a consistent view of a person for every re-run of some N in the sequence generation of T.

    This is the biggest downside that could make or break going forward. However, this point might be moot because it is already the case that f.Person is currently lazy-loaded as a whole on the first access of some RuleFor at N in the sequence generation of T.

    private Person person;
    /// <summary>
    /// A contextually relevant fields of a person.
    /// </summary>
    public Person Person => person ?? (person = new Person(this.Random, this.Locale));
    The pertinent question here is, does it matter? Lazy generation as a whole, or in part? Requires a bit more thinking. Hm.

  • Consistency and Semantics: Fundamentally, new Person and f.Person would have different underlying behavior depending on where they are used. 1) alone new Person() and 2) in some Faker<T>.RuleFor(_, f.Person.Prop). Another make or break decision. I would probably prefer API consistency over performance here; because the performance workaround is easy, just avoid using f.Person and use your own generation method.

Todo

  • Search public repos for Person usage and see any other impacts. Especially for downstream extensions like AutoBogus.

Also, a change like this would have to be put into a major version bump.

Then, all this weighed against adding a performance section in the README under Advanced Topics, Guidance, and Best Practices.

@bchavez bchavez removed the wontfix label Aug 28, 2019
@ThaDaVos
Copy link
Author

When using .RuleFor(_, f.Person.Prop) I thought I was selecting a category to then select a thing to generate, I didn't really think about it generating a whole person - how are the others like f.Name etc handled? Do these construct a complete object or not? Maybe the decision has to be based on this? Make the API consistent all across?

@bchavez
Copy link
Owner

bchavez commented Aug 28, 2019

When using .RuleFor(_, f.Person.Field) we're generating a new Person to use for each instance of T. f.Person creates a whole new person with context relevant fields. So, multiple accesses to the same f.Person.FirstName field by multiple rules, will produce the same value only for the lifetime during T is being built up. That is, a new T = new Person if any rule accesses f.Person.Prop.

Other than f.Person, everything else is based on datasets:

When using .RuleFor(_, f.Name.FirstName() ) we're using the Bogus.DataSets.Name dataset object on the Faker facade. No new datasets are created during the generation of T. Repeat calls to f.Name.FirstName() will generate a new first name every time FirstName() method is called. Datasets are not context-aware in Faker<T>; which is why f.Person can be helpful because f.Person is context-aware.

Here's a good example:

void Main()
{
   var easyFaker = new Faker<User>()
      .RuleFor(u => u.FirstName, f => f.Person.FirstName)
      .RuleFor(u => u.LastName, f => f.Person.LastName)
      .RuleFor(u => u.FullName, f => f.Person.FullName)
      .RuleFor(u => u.Email, f => f.Person.Email)
      .RuleFor(u => u.Gender, f => f.Person.Gender);
      
   easyFaker.Generate().Dump();
   
   var complexFaker = new Faker<User>()
      .RuleFor(u => u.Gender, f => f.PickRandom<Bogus.DataSets.Name.Gender>())
      .RuleFor(u => u.FirstName, (f, u) => f.Name.FirstName(u.Gender) )
      .RuleFor(u => u.LastName, (f, u) => f.Name.LastName(u.Gender))
      .RuleFor(u => u.FullName, (f, u) => $"{u.FirstName} {u.LastName}")
      .RuleFor(u => u.Email, (f, u) => f.Internet.Email(u.FirstName, u.LastName));
      
   complexFaker.Generate().Dump();
}

public class User{
   public string FirstName { get;set; }
   public string LastName { get; set; }
   public string FullName { get; set; }
   public string Email { get; set;}
   public Bogus.DataSets.Name.Gender Gender {get;set;}
}

Notice, with complexFaker you have to be slightly more mindful of the rule dependency graph because many rules are dependent on that first Gender rule. The complexFaker.RuleFor code can get unwieldy trying to carry state around with (faker, user) => lambda. The crux of the issue is that .RuleFor rules do not share state between invocations. The f.Person is a helper property that pushes context down into f.Person (by generating all the fields at once), so that .RuleFor rules can be simplified for common use cases (see easyFaker above).

I'm kinda leaning toward keeping the code unchanged and adding a new section in the readme that provides some performance guidelines. But I did want to give my thoughts on the subject as to why this is a little bit of a complicated situation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants