Names grouped by country #17

djuxy · 2021-10-22T13:56:17Z

Hi @philipperemy ,
Great work!

Is it possible to get a list of names (first/last) and % of appearance in a given country?

Cheers

philipperemy · 2021-10-22T22:57:50Z

@djuxy thank you :)

#14

Have a look at this one.

djuxy · 2021-10-24T21:01:01Z

I need slightly different output, which I don't think I can get from this one. :(

Something like this would work:

"France": {
  "Aimée": 0.000323,
  "Blanche": 0.000276,
  "Chanel": 0.000179,
  ...
  "Yves": 0.000123
},
"Mexico": {
  "Alejandro": 0.000422,
  "Francisco": 0.000376,
  "Juan": 0.000221,
  ...
  "Yolanda": 0.000087
},

philipperemy · 2021-10-25T01:46:37Z

@djuxy I'm generating that for you!

philipperemy · 2021-10-25T03:45:20Z

@djuxy Here you go: https://drive.google.com/file/d/1wmVNXcfOYOcqhVesilI7LE0Lcm9-ivnw/view?usp=sharing.

It's a ZIP with this folder structure:

by_country
├── Afghanistan
│   ├── first_names.json
│   └── last_names.json
├── Albania
│   ├── first_names.json
│   └── last_names.json
├── Algeria
│   ├── first_names.json
│   └── last_names.json
├── Angola
│   ├── first_names.json
│   └── last_names.json
├── Argentina
│   ├── first_names.json
│   └── last_names.json

For a given JSON file, I provide pairs <name, count> where count is the number of occurrences of this name for this country. It's just a count without any normalization, except for a very simple filtering that I describe here:

NOTE: To save a bit of space and to reduce the noise, I kept the records with count>5
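Since the files hold raw counts rather than the percentages asked for at the top of the thread, here is a minimal sketch of one way to turn a file into per-name shares. The helper names `name_shares` and `load_country_shares` are hypothetical, not part of the library, and the shares are relative to the records that survived the count>5 filter, not to the full population.

```python
import json
from pathlib import Path


def name_shares(counts: dict[str, int]) -> dict[str, float]:
    """Convert raw name counts into shares of the total retained count."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in counts.items()}


def load_country_shares(root: Path, country: str, kind: str = "first_names") -> dict[str, float]:
    """Read <root>/<country>/<kind>.json and normalize its counts.

    Assumes the by_country layout described in this comment.
    """
    path = root / country / f"{kind}.json"
    with path.open(encoding="utf-8") as fh:
        return name_shares(json.load(fh))


# A few counts copied from the France/first_names.json sample in this comment.
sample = {"A-L": 28, "A-Laure": 19, "A-Marie": 33, "A-Sophie": 30}
shares = name_shares(sample)
```

With the unzipped folder at hand, `load_country_shares(Path("by_country"), "France")` would give the same kind of mapping for the whole file.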

> cat France/first_names.json | head -n 20
{
  "A Marie": 8,
  "A-B": 6,
  "A-C": 17,
  "A-Charlotte": 7,
  "A-Claire": 9,
  "A-D": 8,
  "A-F": 7,
  "A-G": 6,
  "A-H": 8,
  "A-L": 28,
  "A-Laure": 19,
  "A-Lex": 7,
  "A-Line": 12,
  "A-Lise": 10,
  "A-Lys": 6,
  "A-Marie": 33,
  "A-S": 7,
  "A-So": 8,
  "A-Sophie": 30,

djuxy · 2021-10-25T10:31:45Z

@philipperemy Amazing! Thank You!

P.S. I don't need this, but probably somebody needs only male/female names. It would be great to separate male and female names if it's possible.

philipperemy · 2021-10-25T11:44:40Z

I'm happy I could help here.

Yeah def a great idea here. We have male/female in the FB dump so we could add a gender attribute ;)

I'll think of some nice way to integrate it

philipperemy · 2021-10-28T02:21:19Z

@djuxy I'm going to generate a giant CSV containing first_name,last_name,gender,country.

Then with pandas it will be easy for anyone to manipulate it and derive some stats or whatever metric they want.

djuxy · 2021-10-29T08:30:14Z

@philipperemy 👏 👏 👏 That would be great! Thank you!

djuxy · 2021-11-01T10:37:52Z

> @djuxy I'm going to generate a giant CSV containing first_name,last_name,gender,country.
>
> Then with pandas it will be easy for anyone to manipulate it and derive some stats or whatever metric they want.

I gave it some thought. It would be better to have two CSVs:

the first with first_name, gender, country and count of appearances,
the second with last_name, country and count of appearances.

Why?
Gender is associated with first_name only.
We can easily see how first names and last names are used separately in a given country.

philipperemy · 2021-11-04T05:08:21Z

@djuxy I have just seen your message after I generated the large CSV. Have a look:

https://drive.google.com/file/d/1wRQfw5EYpzulvRfHCGIUWB2am5JUYVGk/view?usp=sharing (~2GB tar bz2)

It is a folder containing one CSV per country, named with the country's ISO 3166-1 alpha-2 code (hint: pycountry). I also added the country code as the last column in case you want to concatenate all of them. The uncompressed version takes around 10GB on disk.

data
├── AE.csv
├── AF.csv
├── AL.csv
├── AO.csv
├── AR.csv
├── AT.csv

$ head FR.csv
Laure,Canet,F,FR
Louis,Givran,M,FR
Timothy,Dovin,M,FR
Anne Marie,Petiton,F,FR
Claudine,Solignac,F,FR
Florian,Burnat,,FR
Bendjy,Gobbo,M,FR
Danyel,Cambon,M,FR

It should contain all the information that you need.

The gender is either M, F or empty string (missing).
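For anyone loading these files with pandas, a minimal sketch follows. The column names are an assumption based on the first_name,last_name,gender,country order mentioned earlier in the thread (the files have no header row), and `load_country_csv` is a hypothetical helper. `keep_default_na=False` matters for two reasons: it keeps a missing gender as an empty string, and it stops pandas from reading the country code `NA` (Namibia, if it is in the dump) as a missing value.

```python
import io

import pandas as pd

# Assumed column order; the CSV files themselves carry no header row.
COLUMNS = ["first_name", "last_name", "gender", "country"]

# Rows copied from the `head FR.csv` output above.
SAMPLE = """Laure,Canet,F,FR
Louis,Givran,M,FR
Timothy,Dovin,M,FR
Anne Marie,Petiton,F,FR
Claudine,Solignac,F,FR
Florian,Burnat,,FR
Bendjy,Gobbo,M,FR
Danyel,Cambon,M,FR
"""


def load_country_csv(source) -> pd.DataFrame:
    """Load one per-country CSV (path or file-like) as all-string columns."""
    return pd.read_csv(
        source,
        header=None,
        names=COLUMNS,
        dtype=str,
        keep_default_na=False,  # keep "" and "NA" as literal strings
    )


df = load_country_csv(io.StringIO(SAMPLE))

# Count records per gender, labelling the empty string as "unknown".
gender_counts = df["gender"].replace("", "unknown").value_counts()
```

On the real data, `load_country_csv("data/FR.csv")` should work the same way, memory permitting.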

I agree that we can optimize it. Your suggestions make sense and I'll update all of that once I can find enough time.
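Until that update lands, the two tables suggested above can be derived from the per-country CSVs. This is only a sketch using the standard library; `split_counts` is a hypothetical helper, and the column order is assumed from the sample rows.

```python
import csv
import io
from collections import Counter

# Rows copied from the `head FR.csv` output above.
SAMPLE = """Laure,Canet,F,FR
Louis,Givran,M,FR
Timothy,Dovin,M,FR
Anne Marie,Petiton,F,FR
Claudine,Solignac,F,FR
Florian,Burnat,,FR
Bendjy,Gobbo,M,FR
Danyel,Cambon,M,FR
"""


def split_counts(rows):
    """Aggregate rows into (first_name, gender, country) and (last_name, country) counts."""
    first_counts = Counter()
    last_counts = Counter()
    for first_name, last_name, gender, country in rows:
        first_counts[(first_name, gender, country)] += 1
        last_counts[(last_name, country)] += 1
    return first_counts, last_counts


# csv.reader yields one list of four strings per record.
first_counts, last_counts = split_counts(csv.reader(io.StringIO(SAMPLE)))
```

Because the function consumes rows one at a time, a real file can be streamed through it with `csv.reader(open(path, encoding="utf-8", newline=""))` without loading the whole 10GB into memory.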

Have a look in the meantime and let me know!

SHEFKEVIN · 2021-11-06T19:45:37Z

@philipperemy I just sent an email to your gmail :)

philipperemy · 2021-11-07T02:16:43Z

@SHEFKEVIN I replied to you.

philipperemy closed this as completed Oct 25, 2021

philipperemy mentioned this issue Nov 28, 2021

How is the score in v2 calculated? #13

Closed

philipperemy mentioned this issue Feb 16, 2022

Original data source? #22

Closed

philipperemy mentioned this issue Jul 15, 2022

Can we get all the first names and last names irrespective of country or any other filter #27

Closed
