How is the score in v2 calculated? #13

Closed
kphowell opened this issue Jun 3, 2021 · 7 comments

@kphowell

kphowell commented Jun 3, 2021

The readme describes a score between 0 and 100 that can be used as a threshold to control precision and recall. Can you provide more information on how this score is calculated, please?

Great dataset, by the way. Thank you.

@philipperemy
Owner

philipperemy commented Jun 29, 2021

@kphowell hey thank you! The scores are calculated based on the frequencies of the names for a given country. For example, the most popular first name in Morocco is Mohamed so Mohamed will have a score of 100.
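
The exact formula is not spelled out in this thread. One reading consistent with the Mohamed example is a per-country frequency scaled so that the most common name gets 100. The sketch below assumes that reading; `popularity_scores` and the counts are made up for illustration and are not taken from the package.

```python
from collections import Counter


def popularity_scores(counts):
    """Scale raw counts so the most frequent name scores 100 (assumed formula)."""
    top = max(counts.values())
    return {name: 100.0 * n / top for name, n in counts.items()}


# Made-up counts for one country, for illustration only.
counts = Counter({"mohamed": 500, "youssef": 250, "omar": 125})
scores = popularity_scores(counts)
print(scores["mohamed"])  # 100.0
print(scores["youssef"])  # 50.0
```

Under this assumption the scores within one country do not sum to 100, so a score is a relative popularity rather than a share of the population.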

@KOLANICH

I don't consider this solved. I need probabilities, not "scores". How can I get them?

@philipperemy
Owner

@KOLANICH Divide by 100 to have a probability.

@KOLANICH

KOLANICH commented Nov 27, 2021

@philipperemy, do you really think Martin has a probability of 1 of being a surname and just 0.8 of being a first name, so that 1 + 0.8 = 1.8 > 1? Can I get P("martin" | name) and P("martin" | surname) to use in naïve Bayes inference? I want to infer the probabilities that a name-surname pair is in the right order (the first component is a name, the second a surname), that it is in the wrong order (they are swapped), and that it is not a name-surname pair at all (i.e. both components are names, both are surnames, or they are not in the dataset at all).
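
The inference described here could look roughly like the sketch below. It is only an illustration under stated assumptions: `order_posterior`, the uniform priors, the `eps` floor for tokens missing from the data, and the toy probability tables are all hypothetical, and the third hypothesis is modelled as an even mix of "both first names" and "both surnames".

```python
def order_posterior(a, b, p_first, p_last, priors=(1 / 3, 1 / 3, 1 / 3), eps=1e-6):
    """Posterior over (ordered, swapped, other) for the token pair (a, b)."""
    # Tokens missing from a table get a small floor probability (assumption).
    fa, la = p_first.get(a, eps), p_last.get(a, eps)
    fb, lb = p_first.get(b, eps), p_last.get(b, eps)
    likelihoods = (
        fa * lb,                    # a is the first name, b the surname
        la * fb,                    # swapped: a is the surname, b the first name
        0.5 * (fa * fb + la * lb),  # both first names or both surnames
    )
    joint = [prior * lik for prior, lik in zip(priors, likelihoods)]
    total = sum(joint)
    return tuple(j / total for j in joint)


# Toy tables for P(token | first name) and P(token | surname), made up here.
p_first = {"martin": 0.01, "marie": 0.05}
p_last = {"martin": 0.04, "dupont": 0.02}

ordered, swapped, other = order_posterior("marie", "martin", p_first, p_last)
print(ordered > swapped)  # True
```

The two tables must each be a proper distribution over tokens, which is why a per-name score divided by 100 is not enough on its own.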

@philipperemy
Owner

@KOLANICH okay I understand what you want to achieve here.
I dumped the full dataset here (curated of course): #17 (comment).
It contains one CSV per country, each with rows of the form <first>,<last>,<gender>,<country>.
You should be able to derive the probabilities you need with this data.
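
As a rough sketch of that derivation, the snippet below counts how often each token appears in the first-name and surname columns and normalizes the counts. The rows are made up, `conditional_probs` is a hypothetical helper, and it assumes the files have no header row and exactly the four columns listed above.

```python
import csv
import io
from collections import Counter


def conditional_probs(rows):
    """Return (P(token | first name), P(token | surname)) as two dicts."""
    first_counts, last_counts = Counter(), Counter()
    for first, last, _gender, _country in rows:
        first_counts[first.strip().lower()] += 1
        last_counts[last.strip().lower()] += 1
    n_first = sum(first_counts.values())
    n_last = sum(last_counts.values())
    p_first = {t: c / n_first for t, c in first_counts.items()}
    p_last = {t: c / n_last for t, c in last_counts.items()}
    return p_first, p_last


# Made-up rows standing in for one country's CSV (no header row assumed).
sample = io.StringIO(
    "Martin,Dupont,M,FR\n"
    "Marie,Martin,F,FR\n"
    "Paul,Martin,M,FR\n"
    "Marie,Durand,F,FR\n"
)
p_first, p_last = conditional_probs(csv.reader(sample))
print(p_first["martin"])  # 0.25
print(p_last["martin"])   # 0.5
```

Each table sums to 1 over its tokens, so the values can be used directly as P(token | name) and P(token | surname).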

@KOLANICH

KOLANICH commented Nov 28, 2021

Thanks for the info. In fact, I had already seen that issue and that dataset; I just didn't want to use it and tried to limit myself to the data already present in the package. BTW, should I send any PRs here, or should I leave everything as it is and make my own package?

@philipperemy
Owner

@KOLANICH any PRs are more than welcome! If you think what you build can be helpful to others, you can open PRs here! Thanks a lot.
