Sort WORDLIST in a locale-independent way #48

Closed
Bisaloo opened this issue Mar 25, 2020 · 4 comments

Comments

@Bisaloo
Member

Bisaloo commented Mar 25, 2020

Currently, the word order in WORDLIST is locale-dependent, which can create large spurious diffs when multiple people contribute to the package but use different locales.

I see two solutions:

  • use method = "radix" in sort(). To my knowledge, it is the only locale-independent sorting method.
  • temporarily set a collating locale:
orig_locale <- Sys.getlocale("LC_COLLATE")
on.exit(Sys.setlocale("LC_COLLATE", orig_locale))
Sys.setlocale("LC_COLLATE", "C")

The nice thing about the second option is that you can set the locale to the one specified in DESCRIPTION.
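As a rough illustration of the second option, the snippet above can be wrapped in a function so that on.exit() has a calling frame to attach to. This is only a sketch: `sort_with_collate` is a hypothetical name, not something the package provides, and it assumes the requested locale exists on the system ("C" should always be available).

```r
# Hypothetical helper (not part of the package): sort `x` under a temporary
# collation locale, restoring the previous locale when the function exits.
sort_with_collate <- function(x, locale = "C") {
  orig_locale <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", orig_locale), add = TRUE)
  Sys.setlocale("LC_COLLATE", locale)
  sort(x)
}

words <- c("change", "Hugo's words", "b", "B")
sorted <- sort_with_collate(words)
```

With the "C" locale, uppercase letters sort before lowercase ones, so the result does not depend on the contributor's own settings.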

Please let me know if you'd like me to submit a PR for this.

@jeroen
Member

jeroen commented Apr 3, 2020

Hi, thanks for the suggestion. I'm not super familiar with locales; can you give an example of how the sorting is different?

I'd rather not change locales because that has funky side effects on some systems (though C is usually safe).

@Bisaloo
Member Author

Bisaloo commented Apr 3, 2020

Here is a reprex:

test <- c(
  letters,
  LETTERS,
  "Hugo's words",
  "change"
)

library(withr)

with_locale(c("LC_COLLATE" = "fr_FR.UTF-8"), sort(test))
#>  [1] "a"            "A"            "b"            "B"            "c"           
#>  [6] "C"            "change"       "d"            "D"            "e"           
#> [11] "E"            "f"            "F"            "g"            "G"           
#> [16] "h"            "H"            "Hugo's words" "i"            "I"           
#> [21] "j"            "J"            "k"            "K"            "l"           
#> [26] "L"            "m"            "M"            "n"            "N"           
#> [31] "o"            "O"            "p"            "P"            "q"           
#> [36] "Q"            "r"            "R"            "s"            "S"           
#> [41] "t"            "T"            "u"            "U"            "v"           
#> [46] "V"            "w"            "W"            "x"            "X"           
#> [51] "y"            "Y"            "z"            "Z"

with_locale(c("LC_COLLATE" = "C"), sort(test))
#>  [1] "A"            "B"            "C"            "D"            "E"           
#>  [6] "F"            "G"            "H"            "Hugo's words" "I"           
#> [11] "J"            "K"            "L"            "M"            "N"           
#> [16] "O"            "P"            "Q"            "R"            "S"           
#> [21] "T"            "U"            "V"            "W"            "X"           
#> [26] "Y"            "Z"            "a"            "b"            "c"           
#> [31] "change"       "d"            "e"            "f"            "g"           
#> [36] "h"            "i"            "j"            "k"            "l"           
#> [41] "m"            "n"            "o"            "p"            "q"           
#> [46] "r"            "s"            "t"            "u"            "v"           
#> [51] "w"            "x"            "y"            "z"

with_locale(c("LC_COLLATE" = "sk_SK.UTF-8"), sort(test))
#>  [1] "a"            "A"            "b"            "B"            "c"           
#>  [6] "C"            "d"            "D"            "e"            "E"           
#> [11] "f"            "F"            "g"            "G"            "h"           
#> [16] "H"            "Hugo's words" "change"       "i"            "I"           
#> [21] "j"            "J"            "k"            "K"            "l"           
#> [26] "L"            "m"            "M"            "n"            "N"           
#> [31] "o"            "O"            "p"            "P"            "q"           
#> [36] "Q"            "r"            "R"            "s"            "S"           
#> [41] "t"            "T"            "u"            "U"            "v"           
#> [46] "V"            "w"            "W"            "x"            "X"           
#> [51] "y"            "Y"            "z"            "Z"

Created on 2020-04-03 by the reprex package (v0.3.0)

There are also plenty of other cases where it can go wrong because of diacritics in names, but I can't find a good reprex for this right now.

For a simple real-life example, see ropensci/lightr@515d193#diff-89da0e7dae7c72fd9541f184b5112343L13-L15 where OceanOptics and O'Hanlon swapped positions. That is a small package; for larger ones with long vignettes, the spurious diffs are more annoying.

EDIT: from ?Comparison:

Comparison of strings in character vectors is lexicographic within the strings using the collating sequence of the locale in use: see locales. The collating sequence of locales such as en_US is normally different from C (which should use ASCII) and can be surprising. Beware of making any assumptions about the collation order: e.g. in Estonian Z comes between S and T, and collation is not necessarily character-by-character – in Danish aa sorts as a single letter, after z. In Welsh ng may or may not be a single sorting unit: if it is it follows g. Some platforms may not respect the locale and always sort in numerical order of the bytes in an 8-bit locale, or in Unicode code-point order for a UTF-8 locale (and may not sort in the same order for the same language in different character sets). Collation of non-letters (spaces, punctuation signs, hyphens, fractions and so on) is even more problematic.

@Bisaloo
Member Author

Bisaloo commented Apr 3, 2020

And if you think changing locales is a bad idea, method = "radix" is not as bad as I thought it would be. It is pretty good, even. I was expecting a somewhat random order, but it gives the same result as the C locale regardless of the locale in use.

test <- c(
  letters,
  LETTERS,
  "Hugo's words",
  "change"
)

library(withr)

with_locale(c("LC_COLLATE" = "fr_FR.UTF-8"), sort(test, method = "radix"))
#>  [1] "A"            "B"            "C"            "D"            "E"           
#>  [6] "F"            "G"            "H"            "Hugo's words" "I"           
#> [11] "J"            "K"            "L"            "M"            "N"           
#> [16] "O"            "P"            "Q"            "R"            "S"           
#> [21] "T"            "U"            "V"            "W"            "X"           
#> [26] "Y"            "Z"            "a"            "b"            "c"           
#> [31] "change"       "d"            "e"            "f"            "g"           
#> [36] "h"            "i"            "j"            "k"            "l"           
#> [41] "m"            "n"            "o"            "p"            "q"           
#> [46] "r"            "s"            "t"            "u"            "v"           
#> [51] "w"            "x"            "y"            "z"

with_locale(c("LC_COLLATE" = "C"), sort(test, method = "radix"))
#>  [1] "A"            "B"            "C"            "D"            "E"           
#>  [6] "F"            "G"            "H"            "Hugo's words" "I"           
#> [11] "J"            "K"            "L"            "M"            "N"           
#> [16] "O"            "P"            "Q"            "R"            "S"           
#> [21] "T"            "U"            "V"            "W"            "X"           
#> [26] "Y"            "Z"            "a"            "b"            "c"           
#> [31] "change"       "d"            "e"            "f"            "g"           
#> [36] "h"            "i"            "j"            "k"            "l"           
#> [41] "m"            "n"            "o"            "p"            "q"           
#> [46] "r"            "s"            "t"            "u"            "v"           
#> [51] "w"            "x"            "y"            "z"

with_locale(c("LC_COLLATE" = "sk_SK.UTF-8"), sort(test, method = "radix"))
#>  [1] "A"            "B"            "C"            "D"            "E"           
#>  [6] "F"            "G"            "H"            "Hugo's words" "I"           
#> [11] "J"            "K"            "L"            "M"            "N"           
#> [16] "O"            "P"            "Q"            "R"            "S"           
#> [21] "T"            "U"            "V"            "W"            "X"           
#> [26] "Y"            "Z"            "a"            "b"            "c"           
#> [31] "change"       "d"            "e"            "f"            "g"           
#> [36] "h"            "i"            "j"            "k"            "l"           
#> [41] "m"            "n"            "o"            "p"            "q"           
#> [46] "r"            "s"            "t"            "u"            "v"           
#> [51] "w"            "x"            "y"            "z"

Created on 2020-04-03 by the reprex package (v0.3.0)
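
To connect the reprex back to the original problem, here is a minimal sketch of writing a word list with method = "radix". The word vector and the temporary file are made up for illustration and do not reflect how the package actually writes WORDLIST.

```r
# Hypothetical word list; in practice this would come from the spell check.
words <- c("OceanOptics", "O'Hanlon", "change", "Hugo's words")

# method = "radix" always sorts character vectors in C-locale order,
# whatever LC_COLLATE is set to.
sorted <- sort(words, method = "radix")

# Write to a temporary file standing in for WORDLIST, then read it back.
path <- tempfile("WORDLIST")
writeLines(sorted, path)
roundtrip <- readLines(path)
```

Here O'Hanlon lands before OceanOptics on every machine, because the apostrophe has a lower byte value than "c", so the swap seen in the lightr diff should not recur.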

@jeroen
Member

jeroen commented Apr 3, 2020

OK, sounds good. Can you send a PR?

@jeroen jeroen closed this as completed in 593e477 Apr 4, 2020