Names grouped by country #17

Closed
djuxy opened this issue Oct 22, 2021 · 12 comments

Comments

@djuxy

djuxy commented Oct 22, 2021

Hi @philipperemy ,
Great work!

Is it possible to get a list of names (first/last) and % of appearance in a given country?

Cheers

@philipperemy
Owner

philipperemy commented Oct 22, 2021

@djuxy thank you :)

#14

Have a look at this one.

@djuxy
Author

djuxy commented Oct 24, 2021

I need a slightly different output, which I don't think I can get from this one. :(

Something like this would work:

"France": {
  "Aimée": 0.000323,
  "Blanche": 0.000276,
  "Chanel": 0.000179,
  ...
  "Yves": 0.000123
},
"Mexico": {
  "Alejandro": 0.000422,
  "Francisco": 0.000376,
  "Juan": 0.000221,
  ...
  "Yolanda": 0.000087
},

@philipperemy
Owner

@djuxy I'm generating that for you!

@philipperemy
Owner

philipperemy commented Oct 25, 2021

@djuxy Here you go: https://drive.google.com/file/d/1wmVNXcfOYOcqhVesilI7LE0Lcm9-ivnw/view?usp=sharing.

It's a ZIP with this folder structure:

by_country
├── Afghanistan
│   ├── first_names.json
│   └── last_names.json
├── Albania
│   ├── first_names.json
│   └── last_names.json
├── Algeria
│   ├── first_names.json
│   └── last_names.json
├── Angola
│   ├── first_names.json
│   └── last_names.json
├── Argentina
│   ├── first_names.json
│   └── last_names.json

For a given JSON file, I provide pairs <name, count> where count is the number of occurrences of this name for this country. It's just a count without any normalization, except for a very simple filtering that I describe here:

NOTE: To save a bit of space and to reduce the noise, I kept the records with count>5

> cat France/first_names.json | head -n 20
{
  "A Marie": 8,
  "A-B": 6,
  "A-C": 17,
  "A-Charlotte": 7,
  "A-Claire": 9,
  "A-D": 8,
  "A-F": 7,
  "A-G": 6,
  "A-H": 8,
  "A-L": 28,
  "A-Laure": 19,
  "A-Lex": 7,
  "A-Line": 12,
  "A-Lise": 10,
  "A-Lys": 6,
  "A-Marie": 33,
  "A-S": 7,
  "A-So": 8,
  "A-Sophie": 30,
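
The raw counts can be turned into the per-country shares requested earlier in the thread by dividing each count by the file total. The following is a minimal sketch, assuming the ZIP was extracted to a `by_country` directory laid out as shown above; the function names `normalize_counts` and `load_frequencies` are hypothetical. Since records with count ≤ 5 were dropped, the shares are relative to the retained records only, not to the full population.

```python
import json
from pathlib import Path


def normalize_counts(counts):
    """Convert {name: count} into {name: share of the total count}."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in counts.items()}


def load_frequencies(root, filename="first_names.json"):
    """Build {country: {name: share}} from a <root>/<Country>/<filename> tree."""
    result = {}
    for country_dir in sorted(Path(root).iterdir()):
        json_path = country_dir / filename
        if json_path.is_file():
            with json_path.open(encoding="utf-8") as fh:
                result[country_dir.name] = normalize_counts(json.load(fh))
    return result


if __name__ == "__main__":
    # Counts copied from the France/first_names.json excerpt above.
    sample = {"A-Marie": 33, "A-Sophie": 30, "A-L": 28, "A-Laure": 19}
    for name, share in normalize_counts(sample).items():
        print(f"{name}: {share:.4f}")
```

Calling `load_frequencies("by_country", "last_names.json")` would give the same structure for last names, under the same assumptions.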

@djuxy
Author

djuxy commented Oct 25, 2021

@philipperemy Amazing! Thank You!

P.S. I don't need this myself, but somebody will probably want only male or only female names. It would be great to separate male and female names if possible.

@philipperemy
Owner

I'm happy I could help here.

Yeah def a great idea here. We have male/female in the FB dump so we could add a gender attribute ;)

I'll think of some nice way to integrate it

@philipperemy
Owner

@djuxy I'm going to generate a giant CSV containing first_name,last_name,gender,country.

Then with pandas it will be easy for anyone to manipulate it and derive some stats or whatever metric they want.

@djuxy
Author

djuxy commented Oct 29, 2021

@philipperemy 👏 👏 👏 That would be great! Thank you!

@djuxy
Author

djuxy commented Nov 1, 2021

> @djuxy I'm going to generate a giant CSV containing first_name,last_name,gender,country.
>
> Then with pandas it will be easy for anyone to manipulate it and derive some stats or whatever metric they want.

I gave it some thought. It would be better to have two CSVs:

  • the first one with: first_name, gender, country and count of appearances,
  • the second one with: last_name, country and count of appearances.

Why?
Gender is associated with first_name only.
We can then easily see how first names and last names are used separately in a given country.
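
The two-table split proposed here can be derived from rows in the first_name,last_name,gender,country layout with a pair of counters. This is only a sketch using the standard library; the helper names `aggregate` and `write_counts` and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def aggregate(rows):
    """Split (first_name, last_name, gender, country) rows into two count tables."""
    first_counts = Counter()  # keyed by (first_name, gender, country)
    last_counts = Counter()   # keyed by (last_name, country)
    for row in rows:
        if len(row) != 4:
            continue  # skip blank or malformed lines
        first_name, last_name, gender, country = row
        first_counts[(first_name, gender, country)] += 1
        last_counts[(last_name, country)] += 1
    return first_counts, last_counts


def write_counts(counter, header, out):
    """Write a counter as CSV: the key fields followed by the count."""
    writer = csv.writer(out)
    writer.writerow(header)
    for key, count in sorted(counter.items()):
        writer.writerow([*key, count])


if __name__ == "__main__":
    # Made-up rows in the first_name,last_name,gender,country layout.
    sample = io.StringIO("Anne,Martin,F,FR\nAnne,Petit,F,FR\nLouis,Martin,M,FR\n")
    first_counts, last_counts = aggregate(csv.reader(sample))
    out = io.StringIO()
    write_counts(first_counts, ["first_name", "gender", "country", "count"], out)
    print(out.getvalue())
```

Keeping gender only in the first-name table matches the reasoning above, and both tables stay far smaller than the row-level data.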

@philipperemy
Owner

philipperemy commented Nov 4, 2021

@djuxy I have just seen your message after I generated the large CSV. Have a look:

It is a folder containing one CSV per country, each named after the country's ISO 3166-1 alpha-2 code (hint: pycountry). I also added the country code as the last column in case you want to concatenate all of them. The uncompressed version takes around 10 GB on disk.

data
├── AE.csv
├── AF.csv
├── AL.csv
├── AO.csv
├── AR.csv
├── AT.csv

$ head FR.csv
Laure,Canet,F,FR
Louis,Givran,M,FR
Timothy,Dovin,M,FR
Anne Marie,Petiton,F,FR
Claudine,Solignac,F,FR
Florian,Burnat,,FR
Bendjy,Gobbo,M,FR
Danyel,Cambon,M,FR

It should contain all the information that you need.

The gender is either M, F or empty string (missing).
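
For anyone loading these files with pandas, the sketch below shows one way to read a header-less per-country CSV while keeping the empty gender as an empty string. It assumes the four-column layout shown above; the column names and the helper names `read_country_csv` and `gender_share` are my own choices, not part of the dataset. `keep_default_na=False` matters because pandas would otherwise parse both the empty gender and a literal `NA` value (the alpha-2 code for Namibia, if such a file is present) as NaN.

```python
import io

import pandas as pd

COLUMNS = ["first_name", "last_name", "gender", "country"]  # assumed names


def read_country_csv(source):
    """Read one header-less per-country CSV into a DataFrame of strings."""
    return pd.read_csv(
        source,
        header=None,            # the files shown above have no header row
        names=COLUMNS,
        dtype=str,
        keep_default_na=False,  # keep "" for missing gender, keep "NA" as text
    )


def gender_share(df):
    """Share of F / M / missing gender values within each country."""
    labelled = df.assign(gender=df["gender"].replace("", "missing"))
    return labelled.groupby("country")["gender"].value_counts(normalize=True)


if __name__ == "__main__":
    # Rows copied from the FR.csv excerpt above.
    sample = io.StringIO(
        "Laure,Canet,F,FR\n"
        "Louis,Givran,M,FR\n"
        "Florian,Burnat,,FR\n"
        "Bendjy,Gobbo,M,FR\n"
    )
    print(gender_share(read_country_csv(sample)))
```

Several countries can be combined with `pd.concat(..., ignore_index=True)`, though at around 10 GB uncompressed the full set may not fit in memory, so processing one file at a time is probably safer.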

I agree that we can optimize it. Your suggestions make sense and I'll update all of that once I can find enough time.

Have a look in the meantime and let me know!

@SHEFKEVIN

@philipperemy I just sent an email to your gmail :)

@philipperemy
Owner

@SHEFKEVIN I replied to you.
