Rename languages to match ISO 639-3 names #1670
Comments
I wrote a script that compares CLDR’s language names against Tatoeba’s and prints the differences. Note that CLDR has alternate names in addition to the "normal" name.
Hope this helps. |
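For readers following along, a comparison like the one described above could be sketched as follows. This is not the actual script: the function name `diff_language_names` and both sample dictionaries are hypothetical, and the sample values are illustrative placeholders rather than real CLDR data.

```python
def diff_language_names(cldr_names, tatoeba_names):
    """Return (code, tatoeba_name, cldr_options) for every code whose
    Tatoeba name matches none of the names CLDR lists for it."""
    differences = []
    for code, tatoeba_name in sorted(tatoeba_names.items()):
        cldr_options = cldr_names.get(code)
        if cldr_options is None:
            # Code is absent from the CLDR sample, so there is nothing to compare.
            continue
        if tatoeba_name not in cldr_options:
            differences.append((code, tatoeba_name, cldr_options))
    return differences


# Hypothetical sample data: the "normal" name first, then alternates.
cldr_sample = {
    "cmn": ["Mandarin Chinese", "Chinese, Mandarin"],
    "ell": ["Greek"],
}
tatoeba_sample = {
    "cmn": "Chinese (Mandarin)",
    "ell": "Greek",
    "wuu": "Shanghainese",
}

for code, tatoeba_name, cldr_options in diff_language_names(cldr_sample, tatoeba_sample):
    print(f"{code}: Tatoeba={tatoeba_name!r} CLDR={cldr_options!r}")
```

Checking against the whole list of CLDR names, rather than only the first one, keeps a language from being reported when Tatoeba happens to use one of the alternates.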
@cueyayotl I'll let you check and confirm the suggested renamings. I imagine we won't have a clear answer for all the languages, so it would be nice to at least start with a list of languages we're confident about renaming. We don't have to rename everything at once. For the more problematic ones, we can go step by step. |
@cueyayotl any workflow you'd suggest? |
@sabretou You're now the one in charge of validating language requests on Tatoeba. Perhaps you might want to have a look at this :) |
Let's go ahead with the first batch of renamings. I have cleared the following for renaming. cmn -> Chinese (Mandarin) -> Mandarin Chinese |
Sure |
@sabretou I'm wondering if renaming Cantonese to Yue Chinese will not confuse our users. Looking at some of the comments of nickyeow, our main contributor in I have similar concerns for Shanghainese and to a certain extent Central Dusun, as we are introducing new words as replacement of the initial words. Actually for Shanghainese, I know there is a comment in our code saying:
Meaning that we used I suggest to try and contact members of Tatoeba who are contributing in |
According to Wikipedia, the "yue" iso code stands for Yue Chinese, which encompasses Cantonese as well as other varieties.
It’s a complex matter. If we want to make it easy to understand, we should use the word "Cantonese", but then we won’t ever have contributors of other Yue dialects such as Taishanese. If we want to follow the ISO standard, we should use "Yue Chinese" and include other dialects, like Taishanese, under that code. However, these dialects are mutually unintelligible, so it makes little sense for contributors to group them under the same language on Tatoeba. Note that since we’ve been using the name Cantonese on Tatoeba, it’s likely that we only have contributors of Cantonese, and not of other Yue dialects. |
Quoting Wikipedia about Wu Chinese:
However, looking at the Shanghainese article:
So we should figure out whether the sentences currently in our Shanghainese corpus are all in the Shanghainese dialect of Taihu Wu, or also include other Wu languages. It is worth noting that this year, there has been a proposal to split Wu Chinese, which is still under review by the SIL. If that proposal is accepted, it would result in the creation of Taihu Wu Chinese (among others). That would certainly help sort out our wuu corpus and solve the naming issue. |
As for Central Dusun, the SIL changed that name to Kadazan Dusun in 2016 as part of a merge. According to the proposal, the new name better matches what the speakers call their own language, and it encompasses more dialects, so it’s probably safe to rename. |
The proposal has been rejected. |
Should I work on Phase 2? |
Here is an updated list of Tatoeba language names that differ from their standard ISO 639-3 names.
|
As we are introducing the new language selector, perhaps we should have languages match their ISO 639-3 names. Some of the current names use parentheses or alternate names, which were added earlier for easier discovery.
Here are my suggestions:
Language Code -> Current Name -> Proposed Name
cmn -> Chinese (Mandarin) -> Mandarin Chinese
nob -> Norwegian (Bokmål) -> Norwegian Bokmål
nno -> Norwegian (Nynorsk) -> Norwegian Nynorsk
nst -> Naga (Tangshang) -> Tase Naga
pan -> Punjabi (Eastern) -> Punjabi (Punjabi is by far the more popular spelling variant, so I recommend going with that. Alternatively, we could add 'Panjabi' in parentheses).
zsm -> Malay -> Standard Malay
mww -> Hmong Daw (White) -> Hmong Daw
afb -> Arabic (Gulf) -> Gulf Arabic
pnb -> Punjabi (Western) -> Western Punjabi (I propose 'Punjabi' over 'Panjabi' for the same reason as above)
aln -> Albanian (Gheg) -> Gheg Albanian
jdt -> Juhuri (Judeo-Tat) -> Judeo-Tat
cjy -> Chinese (Jin) -> Jinyu Chinese
hnj -> Hmong Njua (Green) -> Hmong Njua
bcl -> Bikol (Central) -> Central Bikol
pfl -> Palatine German -> Pfaelzisch
orv -> Old East Slavic -> Old Russian
prg -> Old Prussian -> Prussian
cmo -> Mnong, Central -> Central Mnong
acm -> Iraqi Arabic -> Mesopotamian Arabic
jam -> Jamaican Patois -> Jamaican Creole English
mhr -> Meadow Mari -> Eastern Mari
mrj -> Hill Mari -> Western Mari
dtp -> Central Dusun -> Kadazan Dusun
wuu -> Shanghainese -> Wu Chinese
yue -> Cantonese -> Yue Chinese
pes -> Persian -> Iranian Persian
ell -> Greek -> Modern Greek
pms -> Piedmontese -> Piemontese
tpw -> Old Tupi -> Tupí
I propose zlm -> Malay (Vernacular) stay as it is. In ISO 639-3, it is listed as "Malay (individual language)", which could be confusing.
Similarly, I think kek -> Kekchi (Q'eqchi') should remain as-is for visibility.
ori -> Odia (Oriya) is another special case that I think should stay.
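If it helps whoever applies the changes, lines in the "Language Code -> Current Name -> Proposed Name" format above could be read into a lookup table along these lines. This is only a sketch: `parse_rename` and `build_rename_table` are hypothetical helpers, not part of the Tatoeba codebase, and entries that carry a trailing remark (such as pan and pnb) would need the remark removed by hand first.

```python
def parse_rename(line):
    """Split 'code -> current name -> proposed name' into a 3-tuple."""
    parts = [part.strip() for part in line.split("->")]
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected 'code -> current -> proposed', got: {line!r}")
    code, current, proposed = parts
    return code, current, proposed


def build_rename_table(lines):
    """Map each language code to its (current, proposed) pair."""
    table = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        code, current, proposed = parse_rename(line)
        table[code] = (current, proposed)
    return table


# A few entries copied from the list above.
sample_lines = [
    "cmn -> Chinese (Mandarin) -> Mandarin Chinese",
    "dtp -> Central Dusun -> Kadazan Dusun",
    "ell -> Greek -> Modern Greek",
]

renames = build_rename_table(sample_lines)
for code, (current, proposed) in renames.items():
    print(f"{code}: {current} => {proposed}")
```

Keeping the current name next to the proposed one makes it easy to review a batch before applying it, and to leave out the entries still under discussion.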