Namibia is missing in some coding schemes #261

Closed
dieghernan opened this issue Feb 10, 2021 · 18 comments

@dieghernan

dieghernan commented Feb 10, 2021

Namibia again. I have realised that Namibia is missing in four coding schemes (eurostat, genc2c, wb_api2c, ecb), since in all of them its 2-letter code is NA. See sources:

Reprex with the latest CRAN release

library(countrycode)

# Test
countrycode("NAM", "iso3c", "iso2c")
#> [1] "NA"
countrycode("NAM", "iso3c", "eurostat")
#> Warning in countrycode("NAM", "iso3c", "eurostat"): Some values were not matched unambiguously: NAM
#> [1] NA

# Analyze
df <- codelist

# Filter Namibia
check <- df[df$country.name.en == "Namibia",]

# Check NA cols
NAscol <- colnames(check)[is.na(check[1, ])]

# Exclude the cldr fields
NAscol <- NAscol[-grep("cldr", NAscol)]

NAscol
#> [1] "ecb"      "eu28"     "eurostat" "genc2c"   "wb_api2c"


sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18363)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=Spanish_Spain.1252  LC_CTYPE=Spanish_Spain.1252   
#> [3] LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C                  
#> [5] LC_TIME=Spanish_Spain.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] countrycode_1.2.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] compiler_3.6.1  magrittr_1.5    tools_3.6.1     htmltools_0.4.0
#>  [5] yaml_2.2.1      Rcpp_1.0.4.6    stringi_1.4.6   rmarkdown_2.6  
#>  [9] highr_0.8       knitr_1.31      stringr_1.4.0   xfun_0.19      
#> [13] digest_0.6.25   rlang_0.4.10    evaluate_0.14

Created on 2021-02-10 by the reprex package (v0.3.0)

Reprex after PR

library(countrycode)

# Test
countrycode("NAM", "iso3c", "iso2c")
#> [1] "NA"
countrycode("NAM", "iso3c", "eurostat")
#> [1] "NA"

# Analyze
df <- codelist

# Filter Namibia
check <- df[df$country.name.en == "Namibia",]

# Check NA cols
NAscol <- colnames(check)[is.na(check[1, ])]

# Exclude the cldr fields
NAscol <- NAscol[-grep("cldr", NAscol)]

NAscol
#> [1] "eu28"


sessionInfo()
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 16.04.7 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/atlas-base/atlas/libblas.so.3.0
#> LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] countrycode_1.2.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] digest_0.6.27     assertthat_0.2.1  magrittr_2.0.1    reprex_1.0.0     
#>  [5] evaluate_0.14     highr_0.8         stringi_1.5.3     rlang_0.4.10     
#>  [9] cli_2.3.0         rstudioapi_0.13   fs_1.5.0          rmarkdown_2.6    
#> [13] tools_4.0.3       stringr_1.4.0     glue_1.4.2        xfun_0.20        
#> [17] yaml_2.2.1        compiler_4.0.3    htmltools_0.5.1.1 knitr_1.31

Created on 2021-02-10 by the reprex package (v1.0.0)

Now only eu28 is missing, which is fine (I left the cldr* fields out of the exercise for clarity).

I have prepared a PR that hopefully fixes this issue.

Regards

@cjyetman
Collaborator

Thanks! and, confirmed...

library(countrycode)
is.na(countrycode("NAM", "iso3c", "eurostat"))
#> Warning in countrycode("NAM", "iso3c", "eurostat"): Some values were not matched unambiguously: NAM
#> [1] TRUE
is.na(countrycode("NAM", "iso3c", "ecb"))
#> Warning in countrycode("NAM", "iso3c", "ecb"): Some values were not matched unambiguously: NAM
#> [1] TRUE
is.na(countrycode("NAM", "iso3c", "eu28"))
#> Warning in countrycode("NAM", "iso3c", "eu28"): Some values were not matched unambiguously: NAM
#> [1] TRUE
is.na(countrycode("NAM", "iso3c", "genc2c"))
#> Warning in countrycode("NAM", "iso3c", "genc2c"): Some values were not matched unambiguously: NAM
#> [1] TRUE
is.na(countrycode("NAM", "iso3c", "wb_api2c"))
#> Warning in countrycode("NAM", "iso3c", "wb_api2c"): Some values were not matched unambiguously: NAM
#> [1] TRUE

@cjyetman
Collaborator

I would say that this issue is not fully "fixed" until the scrapers for each of these codes have been fixed. Maybe each should be split into its own issue so that they can be addressed separately?

@cjyetman
Collaborator

Also of note: now that tidyverse/rvest/issues/107 has finally been resolved, we can probably make dictionary/get_ecb.R work better.

@cjyetman
Collaborator

On the other hand, a similar issue in jsonlite is still unresolved, so it still requires workarounds:
jeroen/jsonlite/issues/98
jeroen/jsonlite/issues/314

@vincentarelbundock
Owner

I'm not sure the problem is (entirely) related to our scrapers. It seems reader-related to me. For instance, the "NA" string in data_genc.csv is correctly double-quoted in the raw CSV:

https://github.com/vincentarelbundock/countrycode/blob/main/dictionary/data_genc.csv

The data.table::fread function reads this correctly, but neither read.csv nor read_csv does:

setwd("~/repos/countrycode")
library(readr)
library(data.table)

# Base R
x = read.csv("dictionary/data_genc.csv")
"NA" %in% x$genc2c
#> [1] FALSE

# tidyverse
y = read_csv("dictionary/data_genc.csv")
"NA" %in% y$genc2c
#> [1] FALSE

# data.table
z = fread("dictionary/data_genc.csv")
"NA" %in% z$genc2c
#> [1] TRUE

@vincentarelbundock
Owner

vincentarelbundock commented Feb 10, 2021

I made a minor commit with:

  1. New tests for Namibia
  2. A fix to the genc scraper with a Namibia-specific assertion

Obviously, if the saved data is not properly double-quoted, we should fix the scraper, but I'd like to get to the bottom of the read_csv issue instead because that feels like the more general solution.
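
A Namibia-specific assertion of this kind could look roughly like the sketch below. The helper name `assert_namibia_present` and the toy data frames are hypothetical and are not the code from the commit; the idea is only that a build step fails loudly when Namibia's code has been read as a missing value.

```r
# Hypothetical helper: stop if Namibia's code is missing or is not the string "NA".
assert_namibia_present <- function(dat, code_col, name_col = "country.name.en") {
  code <- dat[[code_col]][which(dat[[name_col]] == "Namibia")]
  stopifnot(length(code) == 1, !is.na(code), identical(code, "NA"))
  invisible(TRUE)
}

# Toy data: `good` keeps the string "NA", `bad` has lost it to a real NA.
good <- data.frame(country.name.en = c("Namibia", "Germany"),
                   genc2c = c("NA", "DE"),
                   stringsAsFactors = FALSE)
bad <- good
bad$genc2c[1] <- NA

# Passes silently on the good data.
assert_namibia_present(good, "genc2c")

# Raises an error on the bad data; try() captures it so the script continues.
failed <- inherits(try(assert_namibia_present(bad, "genc2c"), silent = TRUE),
                   "try-error")
```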

8f0ff1e

@vincentarelbundock
Owner

An even more minimal example:

library(readr)
library(data.table)

csv <- 'x,y
"1","NA"
"NA","2"'

str(read_csv(csv))
#> spec_tbl_df [2 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#>  $ x: num [1:2] 1 NA
#>  $ y: num [1:2] NA 2
#>  - attr(*, "spec")=
#>   .. cols(
#>   ..   x = col_double(),
#>   ..   y = col_double()
#>   .. )

str(fread(csv))
#> Classes 'data.table' and 'data.frame':   2 obs. of  2 variables:
#>  $ x: chr  "1" "NA"
#>  $ y: chr  "NA" "2"
#>  - attr(*, ".internal.selfref")=<externalptr>

@vincentarelbundock
Owner

Maybe we just set na = "" in read_csv.
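
In base R's read.csv the corresponding argument is called na.strings (readr's read_csv calls it na). A minimal sketch of the effect, using a toy CSV whose values are illustrative and not taken from the dictionary files: when only empty fields count as missing, Namibia's code survives as a string.

```r
# Toy CSV: Namibia's 2-letter code is the literal string "NA";
# the empty eu28 field on the same row is a genuinely missing value.
csv <- 'iso3c,genc2c,eu28\n"NAM","NA",\n"DEU","DE","EU"\n'

# Default behaviour (na.strings = "NA"): Namibia's code is read as missing.
default_read <- read.csv(text = csv, stringsAsFactors = FALSE)

# Only empty fields are treated as missing; "NA" is kept as a string.
strict_read <- read.csv(text = csv, na.strings = "", stringsAsFactors = FALSE)

is.na(default_read$genc2c[1])
#> [1] TRUE
strict_read$genc2c[1]
#> [1] "NA"
is.na(strict_read$eu28[1])
#> [1] TRUE
```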

@vincentarelbundock
Owner

Sorry for the multiple comments, but I pushed a change to add a bunch of na="" everywhere. This seems to fix everything, and my new tests now pass.

I think we're good to close, but it would be great if either of you could make sure the github version works locally.

@dieghernan
Author

dieghernan commented Feb 10, 2021

Hi! It now seems that some genuine NA values are being treated as the string "NA" (see https://github.com/vincentarelbundock/countrycode/blob/main/dictionary/codelist_without_cldr.csv, column eu28).

I have not been able to check it locally yet but, from my limited knowledge of the package, I guess that if the checks pass it is because those new "NA" values are in destination-only fields (not really sure about this...).

I wonder whether it would be possible to add an extra sanity check in dictionary/build.R that converts "NA" back to NA in destination-only fields, to avoid confusion.

Does it make any sense?
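
A sanity check of this kind could be sketched as follows. The function name `na_strings_to_missing`, the toy data and the choice of eu28 as the destination-only column are assumptions for illustration; this is not code from dictionary/build.R.

```r
# Hypothetical helper: in the listed columns only, turn the string "NA"
# back into a real missing value. Other columns are left untouched.
na_strings_to_missing <- function(dat, cols) {
  for (col in cols) {
    is_na_string <- !is.na(dat[[col]]) & dat[[col]] == "NA"
    dat[[col]][is_na_string] <- NA_character_
  }
  dat
}

# Toy data: row 1 is Namibia. Its iso2c code really is "NA",
# while the "NA" in eu28 should have been a missing value.
codes <- data.frame(iso2c = c("NA", "DE"),
                    eu28 = c("NA", "EU"),
                    stringsAsFactors = FALSE)

fixed <- na_strings_to_missing(codes, cols = "eu28")

fixed$iso2c
#> [1] "NA" "DE"
fixed$eu28
#> [1] NA   "EU"
```

Passing the column list explicitly keeps source fields such as iso2c out of reach, so Namibia's legitimate code cannot be blanked by accident.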

@vincentarelbundock
Owner

vincentarelbundock commented Feb 10, 2021

That would be more explicit, but countrycode uses a strict one-to-one mapping between codes. So in principle if it works in one direction it will work in the other. Here, for example, we have:

library(countrycode)

countrycode("NA", "genc2c", "country.name")
#> [1] "Namibia"

countrycode(NA, "genc2c", "country.name")
#> Error in countrycode(NA, "genc2c", "country.name"):
#>   sourcevar must be a character or numeric vector. This error often
#>   arises when users pass a tibble (e.g., from dplyr) instead of a
#>   column vector from a data.frame (i.e., my_tbl[, 2] vs. my_df[, 2]
#>   vs. my_tbl[[2]])

The error is not super informative, I'll admit that ;)

@cjyetman
Collaborator

The "proper" way to deal with this in readr is to set the na argument, which by default is na = c("", "NA")

readr::read_csv('x,y\n"US","NA"\n"NA","DE"')
#> # A tibble: 2 x 2
#>   x     y    
#>   <chr> <chr>
#> 1 US    <NA> 
#> 2 <NA>  DE
readr::read_csv('x,y\n"US","NA"\n"NA","DE"', na = "")
#> # A tibble: 2 x 2
#>   x     y    
#>   <chr> <chr>
#> 1 US    NA   
#> 2 NA    DE

@vincentarelbundock
Owner

@cjyetman this is exactly what I did everywhere in my new commit.

@dieghernan
Author

One thing I may not have explained well is that the only scraper that was not working properly was eurostat, at least for the four coding schemes I mentioned. The other three displayed the value in the CSV as "NA".

@cjyetman
Collaborator

Also pay attention to where the CSVs are being written (readr::format_csv is equivalent to readr::write_csv except that it returns the string rather than writing it to a file)...

readr::read_csv('x,y\n"NA","NA"\nNA,NA\n,')
#> # A tibble: 3 x 2
#>   x     y    
#>   <lgl> <lgl>
#> 1 NA    NA   
#> 2 NA    NA   
#> 3 NA    NA
# all are converted to <NA>s

readr::read_csv('x,y\n"NA","NA"\nNA,NA\n,', na = "")
#> # A tibble: 3 x 2
#>   x     y    
#>   <chr> <chr>
#> 1 NA    NA   
#> 2 NA    NA   
#> 3 <NA>  <NA>
# only the last row is converted to <NA>s

data <- readr::read_csv('x,y\n"NA","NA"\nNA,NA\n,', na = "")

readr::format_csv(data)
#> [1] "x,y\n\"NA\",\"NA\"\n\"NA\",\"NA\"\nNA,NA\n"

readr::format_csv(data, na = "")
#> [1] "x,y\nNA,NA\nNA,NA\n,\n"

technically, a string should not be quoted unless it's necessary

string <- 'x,y\n"NA","NA"\nNA,NA\n,'

readr::format_csv(readr::read_csv(string))
#> [1] "x,y\nNA,NA\nNA,NA\nNA,NA\n"

readr::format_csv(readr::read_csv(string, na = ""))
#> [1] "x,y\n\"NA\",\"NA\"\n\"NA\",\"NA\"\nNA,NA\n"

readr::format_csv(readr::read_csv(string, na = ""), na = "")
#> [1] "x,y\nNA,NA\nNA,NA\n,\n"

@cjyetman
Collaborator

> @cjyetman this is exactly what I did everywhere in my new commit.

I think that's the best thing to do... but again, just be careful that any CSVs that are written don't write <NA> as NA (without the quotes), which some CSV writers do by default (readr, for instance: readr::format_csv(data.frame(x = NA)) # [1] "x\nNA\n").

@vincentarelbundock
Owner

Yes, I added na = "" to write calls too.

I think this is fixed. Feel free to reopen or comment if it still fails after reinstalling from GitHub.

@cjyetman
Collaborator

better example of why you need to be careful of both ends of the round trip...

my_csv <- readr::format_csv(data.frame(x = c("A", NA), y = c(NA, "B")))
readr::read_csv(my_csv, na = "")[[1]]
#> [1] "A"  "NA"
# BAD

my_csv <- readr::format_csv(data.frame(x = c("A", NA), y = c(NA, "B")), na = "")
readr::read_csv(my_csv, na = "")[[1]]
#> [1] "A" NA
# GOOD

if na = "" is used everywhere for readr::read_csv, the same has to be used everywhere for readr::write_csv
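
The same round-trip rule can be checked with base R alone. This is a minimal sketch on toy data, assuming write.csv and read.csv stand in for the readr calls used in the package; the point is that the writer and the reader must agree that an empty field, and only an empty field, means missing.

```r
# Toy data: Namibia's iso2c is the string "NA"; its eu28 value is truly missing.
codes <- data.frame(iso3c = c("NAM", "DEU"),
                    iso2c = c("NA", "DE"),
                    eu28 = c(NA, "EU"),
                    stringsAsFactors = FALSE)

path <- tempfile(fileext = ".csv")

# Write missing values as empty fields rather than as NA.
write.csv(codes, path, row.names = FALSE, na = "")

# Read back with the matching rule: only empty fields are missing.
back <- read.csv(path, na.strings = "", stringsAsFactors = FALSE)
unlink(path)

back$iso2c
#> [1] "NA" "DE"
is.na(back$eu28)
#> [1]  TRUE FALSE
```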
