Namibia is missing in some coding schemes #261

dieghernan · 2021-02-10T11:30:30Z

Namibia again. I have realised that Namibia is missing in four coding schemes (eurostat, genc2c, wb_api2c, ecb), since in all of them its 2-letter code is NA. See sources:

https://github.com/vincentarelbundock/countrycode/blob/main/dictionary/data_ecb.csv#L163
https://github.com/vincentarelbundock/countrycode/blob/main/dictionary/data_genc.csv#L154
https://github.com/vincentarelbundock/countrycode/blob/main/dictionary/data_world_bank_api.csv#L141
http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=dic%2Fen%2Fgeo.dic (source for get_eurostat):

countrycode/dictionary/get_eurostat.R

Line 4 in 75e3263

url = 'http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=dic%2Fen%2Fgeo.dic'

Reprex with the latest CRAN release

library(countrycode)

# Test
countrycode("NAM", "iso3c", "iso2c")
#> [1] "NA"
countrycode("NAM", "iso3c", "eurostat")
#> Warning in countrycode("NAM", "iso3c", "eurostat"): Some values were not matched unambiguously: NAM
#> [1] NA

# Analyze
df <- codelist

# Filter Namibia
check <- df[df$country.name.en == "Namibia",]

# Check NA cols
NAscol <- colnames(check)[is.na(check[1, ])]

# Exclude cldr fields (grepl is safe even when nothing matches)
NAscol <- NAscol[!grepl("cldr", NAscol)]

NAscol
#> [1] "ecb"      "eu28"     "eurostat" "genc2c"   "wb_api2c"


sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18363)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=Spanish_Spain.1252  LC_CTYPE=Spanish_Spain.1252   
#> [3] LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C                  
#> [5] LC_TIME=Spanish_Spain.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] countrycode_1.2.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] compiler_3.6.1  magrittr_1.5    tools_3.6.1     htmltools_0.4.0
#>  [5] yaml_2.2.1      Rcpp_1.0.4.6    stringi_1.4.6   rmarkdown_2.6  
#>  [9] highr_0.8       knitr_1.31      stringr_1.4.0   xfun_0.19      
#> [13] digest_0.6.25   rlang_0.4.10    evaluate_0.14

Created on 2021-02-10 by the reprex package (v0.3.0)

Reprex after PR

library(countrycode)

# Test
countrycode("NAM", "iso3c", "iso2c")
#> [1] "NA"
countrycode("NAM", "iso3c", "eurostat")
#> [1] "NA"

# Analyze
df <- codelist

# Filter Namibia
check <- df[df$country.name.en == "Namibia",]

# Check NA cols
NAscol <- colnames(check)[is.na(check[1, ])]

# Exclude cldr fields (grepl is safe even when nothing matches)
NAscol <- NAscol[!grepl("cldr", NAscol)]

NAscol
#> [1] "eu28"


sessionInfo()
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 16.04.7 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/atlas-base/atlas/libblas.so.3.0
#> LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] countrycode_1.2.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] digest_0.6.27     assertthat_0.2.1  magrittr_2.0.1    reprex_1.0.0     
#>  [5] evaluate_0.14     highr_0.8         stringi_1.5.3     rlang_0.4.10     
#>  [9] cli_2.3.0         rstudioapi_0.13   fs_1.5.0          rmarkdown_2.6    
#> [13] tools_4.0.3       stringr_1.4.0     glue_1.4.2        xfun_0.20        
#> [17] yaml_2.2.1        compiler_4.0.3    htmltools_0.5.1.1 knitr_1.31

Created on 2021-02-10 by the reprex package (v1.0.0)

Now only eu28 is missing, which is fine (I left the cldr* fields out of the exercise for clarity).

I have prepared a PR that hopefully fixes this issue.

Regards


cjyetman · 2021-02-10T19:26:23Z

Thanks! and, confirmed...

library(countrycode)
is.na(countrycode("NAM", "iso3c", "eurostat"))
#> Warning in countrycode("NAM", "iso3c", "eurostat"): Some values were not matched unambiguously: NAM
#> [1] TRUE
is.na(countrycode("NAM", "iso3c", "ecb"))
#> Warning in countrycode("NAM", "iso3c", "ecb"): Some values were not matched unambiguously: NAM
#> [1] TRUE
is.na(countrycode("NAM", "iso3c", "eu28"))
#> Warning in countrycode("NAM", "iso3c", "eu28"): Some values were not matched unambiguously: NAM
#> [1] TRUE
is.na(countrycode("NAM", "iso3c", "genc2c"))
#> Warning in countrycode("NAM", "iso3c", "genc2c"): Some values were not matched unambiguously: NAM
#> [1] TRUE
is.na(countrycode("NAM", "iso3c", "wb_api2c"))
#> Warning in countrycode("NAM", "iso3c", "wb_api2c"): Some values were not matched unambiguously: NAM
#> [1] TRUE

cjyetman · 2021-02-10T19:30:09Z

I would say that this issue is not fully "fixed" until the scrapers for each of these codes have been fixed. Maybe each should be split into its own issue so that they can be addressed separately?

cjyetman · 2021-02-10T19:37:04Z

Also of note... now that this tidyverse/rvest/issues/107 has finally been resolved, we can probably make dictionary/get_ecb.R work better.

cjyetman · 2021-02-10T19:39:20Z

On the other hand, a similar issue in jsonlite is still unresolved, so it still requires workarounds:
jeroen/jsonlite/issues/98
jeroen/jsonlite/issues/314

vincentarelbundock · 2021-02-10T20:15:52Z

I'm not sure the problem is (entirely) related to our scrapers. It seems reader-related to me. For instance, the "NA" string in data_genc.csv is correctly double-quoted in the raw CSV:

https://github.com/vincentarelbundock/countrycode/blob/main/dictionary/data_genc.csv

The data.table::fread function reads this correctly, but neither read.csv nor read_csv does:

setwd("~/repos/countrycode")
library(readr)
library(data.table)

# Base R
x = read.csv("dictionary/data_genc.csv")
"NA" %in% x$genc2c
#> [1] FALSE

# tidyverse
y = read_csv("dictionary/data_genc.csv")
"NA" %in% y$genc2c
#> [1] FALSE

# data.table
z = fread("dictionary/data_genc.csv")
"NA" %in% z$genc2c
#> [1] TRUE

vincentarelbundock · 2021-02-10T20:18:38Z

I made a minor commit with:

New tests for Namibia
A fix to the genc scraper with a Namibia-specific assertion

Obviously, if the saved data is not properly double-quoted, we should fix the scraper, but I'd like to get to the bottom of the read_csv issue instead because that feels like the more general solution.

8f0ff1e
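
A Namibia-specific assertion of this kind could look roughly like the sketch below. It is not the code from the commit: the helper name `assert_namibia_code`, the `country` column, and the toy data are all hypothetical, and the sketch only assumes a data frame with one row per country.

```r
# Hypothetical helper (not from the countrycode repository): fail loudly if
# the Namibia row lost its literal "NA" code somewhere between scrape and save.
assert_namibia_code <- function(dat, code_col, name_col = "country") {
  code <- dat[[code_col]][dat[[name_col]] == "Namibia"]
  ok <- length(code) == 1 && !is.na(code) && code == "NA"
  if (!ok) {
    stop("Namibia should have the 2-letter code \"NA\" in column ", code_col)
  }
  invisible(TRUE)
}

# Toy data standing in for a scraped table.
good <- data.frame(country = c("Namibia", "France"),
                   genc2c = c("NA", "FR"),
                   stringsAsFactors = FALSE)
bad <- good
bad$genc2c[1] <- NA  # what a reader that treats "NA" as missing would produce

assert_namibia_code(good, "genc2c")
bad_result <- tryCatch(assert_namibia_code(bad, "genc2c"),
                       error = function(e) "failed")
```

Placing a check like this at the end of a scraper would catch the problem at write time rather than when the dictionary is rebuilt.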

vincentarelbundock · 2021-02-10T20:22:19Z

An even more minimal example:

library(readr)
library(data.table)

csv <- 'x,y
"1","NA"
"NA","2"'

str(read_csv(csv))
#> spec_tbl_df [2 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#>  $ x: num [1:2] 1 NA
#>  $ y: num [1:2] NA 2
#>  - attr(*, "spec")=
#>   .. cols(
#>   ..   x = col_double(),
#>   ..   y = col_double()
#>   .. )

str(fread(csv))
#> Classes 'data.table' and 'data.frame':   2 obs. of  2 variables:
#>  $ x: chr  "1" "NA"
#>  $ y: chr  "NA" "2"
#>  - attr(*, ".internal.selfref")=<externalptr>

vincentarelbundock · 2021-02-10T20:23:00Z

Maybe we just set na = "" in read_csv (the readr counterpart of na.strings in read.csv).

vincentarelbundock · 2021-02-10T20:56:16Z

Sorry for the multiple comments, but I pushed a change to add a bunch of na="" everywhere. This seems to fix everything, and my new tests now pass.

I think we're good to close, but it would be great if either of you could make sure the github version works locally.

dieghernan · 2021-02-10T21:25:36Z

Hi! So now it seems that some real NA are treated as "NA" (see https://github.com/vincentarelbundock/countrycode/blob/main/dictionary/codelist_without_cldr.csv col eu28).

I have not been able to check it locally yet but, from my limited knowledge of the package, I guess the checks pass because those new "NA" values are in destination-only fields (I'm not sure about this).

I wonder if it would be possible to add an extra sanity check in dictionary/build.R that converts "NA" back to NA in destination-only fields, to avoid confusion.

Does it make any sense?
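
A minimal sketch of such a sanity check, in base R. The helper name `na_string_to_missing` and the toy table are hypothetical, and which columns count as destination-only is an assumption the caller has to supply; this is not code from dictionary/build.R.

```r
# Hypothetical helper: convert the literal string "NA" back to a missing
# value, but only in the columns named in `cols` (assumed destination-only).
na_string_to_missing <- function(dat, cols) {
  for (col in cols) {
    values <- dat[[col]]
    values[!is.na(values) & values == "NA"] <- NA_character_
    dat[[col]] <- values
  }
  dat
}

# Toy table: "NA" is a real code in genc2c but a misread missing value in eu28.
toy <- data.frame(iso3c = c("NAM", "FRA"),
                  genc2c = c("NA", "FR"),
                  eu28 = c("NA", "EU"),
                  stringsAsFactors = FALSE)

cleaned <- na_string_to_missing(toy, cols = "eu28")
```

Columns not listed in `cols` are left untouched, so Namibia keeps its "NA" code in genc2c.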

vincentarelbundock · 2021-02-10T21:28:23Z

That would be more explicit, but countrycode uses a strict one-to-one mapping between codes. So in principle if it works in one direction it will work in the other. Here, for example, we have:

library(countrycode)

countrycode("NA", "genc2c", "country.name")
#> [1] "Namibia"

countrycode(NA, "genc2c", "country.name")
#> Error in countrycode(NA, "genc2c", "country.name"): sourcevar must be a character or numeric vector. This error often arises when users pass a tibble (e.g., from dplyr) instead of a column vector from a data.frame (i.e., my_tbl[, 2] vs. my_df[, 2] vs. my_tbl[[2]])

The error is not super informative, I'll admit that ;)
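
The one-to-one argument can be illustrated with a toy lookup table. This is only a sketch with made-up variable names, not how countrycode is implemented internally: if no code is duplicated on either side, a forward lookup followed by a backward lookup returns the original values, including for the literal string "NA".

```r
# Toy dictionary; the values are real codes, but the table is illustrative.
codes <- data.frame(iso3c = c("NAM", "FRA", "DEU"),
                    genc2c = c("NA", "FR", "DE"),
                    stringsAsFactors = FALSE)

# A one-to-one mapping has no duplicated codes on either side.
is_one_to_one <- !anyDuplicated(codes$iso3c) && !anyDuplicated(codes$genc2c)

# Named vectors act as lookup tables in each direction.
iso3c_to_genc2c <- setNames(codes$genc2c, codes$iso3c)
genc2c_to_iso3c <- setNames(codes$iso3c, codes$genc2c)

# Forward then backward: the string "NA" is matched by name like any other.
forward <- unname(iso3c_to_genc2c[codes$iso3c])
round_trip <- unname(genc2c_to_iso3c[forward])
```

The round trip only breaks if "NA" has already been turned into a missing value, because a missing index never matches a name.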

cjyetman · 2021-02-10T21:28:49Z

The "proper" way to deal with this in readr is to set the na argument, which by default is na = c("", "NA")

readr::read_csv('x,y\n"US","NA"\n"NA","DE"')
#> # A tibble: 2 x 2
#>   x     y    
#>   <chr> <chr>
#> 1 US    <NA> 
#> 2 <NA>  DE
readr::read_csv('x,y\n"US","NA"\n"NA","DE"', na = "")
#> # A tibble: 2 x 2
#>   x     y    
#>   <chr> <chr>
#> 1 US    NA   
#> 2 NA    DE

vincentarelbundock · 2021-02-10T21:29:53Z

@cjyetman this is exactly what I did everywhere in my new commit.

dieghernan · 2021-02-10T21:32:52Z

One thing I may not have explained well is that the only scraper that was not working properly was eurostat, at least for the four coding schemes I mentioned. The other three displayed the value in the CSV as "NA".

cjyetman · 2021-02-10T21:43:10Z

Also pay attention to where the CSVs are being written (readr::format_csv is equivalent to readr::write_csv except that it returns the string rather than writing it to a file)...

readr::read_csv('x,y\n"NA","NA"\nNA,NA\n,')
#> # A tibble: 3 x 2
#>   x     y    
#>   <lgl> <lgl>
#> 1 NA    NA   
#> 2 NA    NA   
#> 3 NA    NA
# all are converted to <NA>s

readr::read_csv('x,y\n"NA","NA"\nNA,NA\n,', na = "")
#> # A tibble: 3 x 2
#>   x     y    
#>   <chr> <chr>
#> 1 NA    NA   
#> 2 NA    NA   
#> 3 <NA>  <NA>
# only the last row is converted to <NA>s

data <- readr::read_csv('x,y\n"NA","NA"\nNA,NA\n,', na = "")

readr::format_csv(data)
#> [1] "x,y\n\"NA\",\"NA\"\n\"NA\",\"NA\"\nNA,NA\n"

readr::format_csv(data, na = "")
#> [1] "x,y\nNA,NA\nNA,NA\n,\n"

technically, a string should not be quoted unless it's necessary

string <- 'x,y\n"NA","NA"\nNA,NA\n,'

readr::format_csv(readr::read_csv(string))
#> [1] "x,y\nNA,NA\nNA,NA\nNA,NA\n"

readr::format_csv(readr::read_csv(string, na = ""))
#> [1] "x,y\n\"NA\",\"NA\"\n\"NA\",\"NA\"\nNA,NA\n"

readr::format_csv(readr::read_csv(string, na = ""), na = "")
#> [1] "x,y\nNA,NA\nNA,NA\n,\n"

cjyetman · 2021-02-10T21:46:48Z

> @cjyetman this is exactly what I did everywhere in my new commit.

I think that's the best thing to do... but again, just be careful that any CSVs that are written don't write <NA> as NA (without the quotes), which some CSV writers do by default. For instance, readr::format_csv(data.frame(x = NA)) returns "x\nNA\n".

vincentarelbundock · 2021-02-10T21:54:16Z

Yes, I added na = "" to write calls too.

I think this is fixed. Feel free to reopen or comment if it still fails after reinstall from GH

cjyetman · 2021-02-10T21:57:55Z

better example of why you need to be careful of both ends of the round trip...

my_csv <- readr::format_csv(data.frame(x = c("A", NA), y = c(NA, "B")))
readr::read_csv(my_csv, na = "")[[1]]
#> [1] "A"  "NA"
# BAD

my_csv <- readr::format_csv(data.frame(x = c("A", NA), y = c(NA, "B")), na = "")
readr::read_csv(my_csv, na = "")[[1]]
#> [1] "A" NA
# GOOD

if na = "" is used everywhere for readr::read_csv, the same has to be used everywhere for readr::write_csv
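
The same round-trip rule can be sketched in base R, where the corresponding arguments are `na` for `write.csv` and `na.strings` for `read.csv`. The toy table and its column names are made up for illustration; a second column is included because `read.csv` skips fully blank lines by default.

```r
# Toy table: a literal "NA" code for Namibia and a genuinely missing code.
codes <- data.frame(iso3c = c("NAM", "USA", "FRA"),
                    code = c("NA", NA, "FR"),
                    stringsAsFactors = FALSE)

path <- tempfile(fileext = ".csv")

# Write missing values as empty fields; strings are quoted by default.
write.csv(codes, path, row.names = FALSE, na = "")

# Read back treating only empty fields as missing, so "NA" stays a string.
back <- read.csv(path, na.strings = "", stringsAsFactors = FALSE)

unlink(path)
```

Here `back$code` should again hold the string "NA" for Namibia and a real missing value for the second row, mirroring the GOOD case above.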

dieghernan mentioned this issue Feb 10, 2021

Fix Namibia when code is NA #262

Closed

vincentarelbundock closed this as completed Feb 10, 2021
