language tag handling needs more attention #11
I'll add some additional color here as a personal comment. Note that there is a tension between source and target language tags. Most translation systems can consume a variety of orthographic variations of a language to produce a given target language. Script and macrolanguage differences remain important here, even when the language tags don't always specify the script.
Thanks very much for your comments here. I have learned many new things. Let me try to get more concrete and propose a solution, to see if I've understood correctly. First, we have to recognize that the ground truth of what is supported is a per-user-agent set of machine learning translation models. These models could have more specific, or less specific capabilities. It depends on how they were trained. Some semi-realistic examples:
(Apologies for my lack of knowledge of Chinese... I hope it doesn't sidetrack the examples too badly.) Given this sort of ground truth, we need an algorithm that takes in arbitrary language tags. Here is one guess at such an algorithm:
I think this algorithm works pretty well, although I'm still fuzzy on the best way to set up the list of supported language pairs.
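For concreteness, here is a minimal sketch of what a per-user-agent capability table plus a matching algorithm might look like. The model entries, the list contents, and the `lookupMatches` helper are all hypothetical, and the fallback policy shown (BCP 47 Lookup-style truncation) is just one candidate, not a settled design:

```js
// Hypothetical per-user-agent table of translation models; each entry
// lists the BCP 47 tags a model was trained to consume and produce.
const models = [
  { sources: ["ja"], targets: ["en"] },
  { sources: ["zh-Hans", "zh-Hant"], targets: ["en"] },
];

// Lookup-style matching (RFC 4647 section 3.4): canonicalize, then
// truncate subtags from the right until we hit a supported tag.
function lookupMatches(tag, supported) {
  let t;
  try {
    t = Intl.getCanonicalLocales(tag)[0];
  } catch {
    return false; // structurally invalid tag
  }
  const pool = supported.map((s) => s.toLowerCase());
  while (t) {
    if (pool.includes(t.toLowerCase())) return true;
    const i = t.lastIndexOf("-");
    t = i === -1 ? "" : t.slice(0, i);
    // Don't strand a singleton like "-u" at the end (RFC 4647).
    if (t.length >= 2 && t[t.length - 2] === "-") t = t.slice(0, -2);
  }
  return false;
}

function canTranslate(source, target) {
  return models.some(
    (m) => lookupMatches(source, m.sources) && lookupMatches(target, m.targets)
  );
}
```

One consequence worth noticing: with pure Lookup, a bare `zh` request does not match a model that only lists `zh-Hans` and `zh-Hant`, because Lookup only truncates the request, never the supported tag. Whether that is the desired behavior is part of the open question here.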
Lots to unpack here. I think it would help to get more involvement from the translation/localization community, who deal with these issues on a daily basis. General notes to help the conversation along:
There are two problems that you have here: selection and description. Selection refers to the (sometimes human-involved) process of choosing which language arcs can be applied to a given input text and then employing the best one for the task. Description involves making clear the internal settings/limitations of a given language arc. You might support this by using lists of tags on either side of the arc description or by using language ranges.
There are two major motivations for this change: * Splitting translation and language detection into separate APIs, to reflect what we've learned from prototyping. * Aligning better with other built-in API proposals, including future ones, by using shared patterns. This notably removes translation from an unknown source language, closing #1. It also adds AbortSignals and destroy() methods. This also removes the tentative proposal for language tag handling, instead pointing to discussions in #11.
Thanks again for your help. I appreciate your general notes and corrections. I used the expanded 7-segment format for the extended language tags because I otherwise found it confusing, but I appreciate that people who have more experience in the field don't need that. I agree with your framing of selection vs. description. In terms of the API I think that comes down to:
Anyway, I think I got too ambitious trying to give hypothetical examples and a full algorithm. Let me try to be more concrete. I'll focus just on description for now to scope it down further. Let's say I was going to ship a translation API representing the capabilities of Google Translate's Japanese to English mode. Here are some representative inputs and outputs:
What should the answers be to the following, in your opinion?

```js
canTranslate("ja", "en");          // Presumably this should work
canTranslate("ja", "en-US");       // "color" (like 色)
canTranslate("ja", "en-GB");       // "colour" (like いろ); "mobile phone" instead of "cell phone"
canTranslate("ja", "en-SG");       // "2 dollar" instead of "2 dollars"
canTranslate("ja", "en-150");      // "mobile" instead of "cell phone"
canTranslate("ja", "en-GB-oed");   // I think this would require 結びつき => "connexion"
canTranslate("ja", "en-Latn");     // Should this work?
canTranslate("ja", "en-Brai");     // Presumably should not work
canTranslate("ja", "en-Dsrt");     // Presumably should not work
canTranslate("ja", "en-x-pirate"); // Presumably should not work, unless we blanket grant x-?
canTranslate("ja", "en-x-lolcat"); // Presumably should not work, unless we blanket grant x-?

// Various unknown-subtag cases; how should these work?
canTranslate("ja", "en-asdf");
canTranslate("ja", "en-x-asdf");
canTranslate("ja", "en-US-asdf");
canTranslate("ja", "en-US-x-asdf");
canTranslate("ja", "en-asdf-asdf");

canTranslate("ja-JP", "en");       // Presumably this should work
canTranslate("ja-Jpan-JP", "en");  // Should this work, or is it bad because of the Suppress-Script?
canTranslate("ja-Hrkt-JP", "en");  // Should this work? It seems to.
canTranslate("ja-Kana", "en");     // Should this work? It seems to.
canTranslate("ja-Latn", "en");     // Should this work? It did for "genkidesuka"/"irohaaoudesu" but not for "iro".
canTranslate("ja-Brai", "en");     // Presumably shouldn't work ("⠛⠑⠝⠊⠅⠊⠙⠑⠎⠥⠅⠁" example)
canTranslate("ja-Bopo", "en");     // Presumably shouldn't work ("ㄍㄣㄎㄧ ㄉㄜㄙㄨ ㄎㄚ?" example)
canTranslate("ja-Dsrt", "en");     // Presumably shouldn't work ("𐐘𐐇𐐤𐐆𐐗𐐆𐐔𐐇𐐝𐐊𐐗𐐀" example)

// Using the rarely-used jpx "collection" tag; should it work?
canTranslate("jpx-ja", "en");
canTranslate("jpx-Jpan", "en");

// Unusual/unknown subtag cases; how should they work?
canTranslate("ja-KR", "en");
canTranslate("ja-US", "en");
canTranslate("ja-asdf", "en");
canTranslate("ja-Jpan-JP-x-osaka", "en");
canTranslate("ja-JP-u-ca-japanese", "en");
canTranslate("ja-x-kansai", "en");
canTranslate("ja-JP-u-sd-jpjp", "en");
```

If you think there's a clear algorithm that resolves a lot of these cases, feel free to suggest that instead of answering each one.
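Since a clear algorithm was invited: here is one hedged sketch of how many of these cases could be mechanized. It leans on the real `Intl.Locale.prototype.minimize()` (which applies CLDR likely-subtags data, subsuming Suppress-Script) plus Lookup-style truncation. The `resolve` name and the overall policy are hypothetical, not something this thread has agreed on:

```js
// Sketch of one candidate resolution policy for tags like those above.
function resolve(tag, supported) {
  let loc;
  try {
    loc = new Intl.Locale(tag);
  } catch {
    return null; // structurally invalid tag
  }
  // minimize() removes subtags implied by the language via CLDR
  // likely-subtags data, e.g. "ja-Jpan-JP" -> "ja"; then drop any
  // extension/private-use sequences ("-u-...", "-x-...").
  let t = loc.minimize().toString().split("-u-")[0].split("-x-")[0];
  const pool = supported.map((s) => s.toLowerCase());
  // BCP 47 Lookup: truncate from the right until something matches.
  while (t) {
    if (pool.includes(t.toLowerCase())) return t.toLowerCase();
    const i = t.lastIndexOf("-");
    t = i === -1 ? "" : t.slice(0, i);
  }
  return null;
}
```

Note that this policy deliberately falls back on every subtag, so `ja-Brai` resolves to plain `ja` rather than being rejected; whether a non-default script should instead block the match is exactly the open question in this thread.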
The longer tags present some questions.

If the user specifies a regional variation on the source side, they might want the tag to fall back when matching (that is, use BCP47 Lookup), because the source language is not visible in the output and because translation engines are usually less sensitive to linguistic variations. If the text is written in a non-default script, the translation engine might prefer that the text be transliterated, or might (as in the Deseret example) not know what to do with it and pass it through. In either case, there is no harm in "losing" such a distinction.

Suppress-Script tags can interfere with matching when matching is done by strict string comparison of the tags.

On the other hand, as your examples point out, the additional subtags on the target side represent variations that the user might want: US vs. UK spelling variation, or UK vs. OED spelling variation. This suggests that script or region subtags (and maybe variants) in the user's specified range should not be ignored.

From a standards perspective, could we say that it is implementation defined how the matching takes place (implementation here meaning "of the translation engine", not the API)? Google Translate can decide whether it can "readily" handle a given tag as output or not, and the answer might vary depending on the specific language arc.
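The Suppress-Script interference described above can be neutralized with machinery the web platform already ships: `Intl.Locale` exposes CLDR's likely-subtags operations, which generalize BCP 47's Suppress-Script. A small illustration (the tags are just examples):

```js
// maximize() fills in the default script and region implied by the
// language; minimize() strips subtags that are already implied.
console.log(new Intl.Locale("ja").maximize().toString());         // "ja-Jpan-JP"
console.log(new Intl.Locale("ja-Jpan-JP").minimize().toString()); // "ja"

// A non-default script is NOT implied, so it survives minimization;
// the distinction that matters to a translation engine is preserved.
console.log(new Intl.Locale("ja-Latn").minimize().toString());    // "ja-Latn"
```

Comparing minimized forms makes `ja` and `ja-Jpan-JP` compare equal while keeping `ja-Latn` distinct, which matches the "no harm in losing the default-script distinction" point above.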
There will definitely have to be some implementation-definedness in the standard, simply because we can't pin down what the capabilities of each implementation will be. But I'd like to give some guidance, probably in the spec, because even just as a Chromium engineer, I need to know what we should make our API return! Our default course of action was the one originally mentioned in the explainer (~exact matching), which you said doesn't make sense. So summarizing your answers above, I'm getting the following:
I'm unsure whether BCP 47 Lookup or Filtering plays into any of the above suggestions. I'd appreciate any help in fleshing this out. In particular, it might be helpful to stay focused on just the specific example I gave. Which of the (source, target) pairs that I gave should work, given the demonstrated capabilities of the Google Translate Japanese-to-English model? Which should not? Eventually we'll try to extract those out into wider guidance for implementers. But at the risk of side-tracking us, let me just illustrate how other non-web APIs seem to work, which is similar to the model you said doesn't make sense. They have static lists of "supported languages", which are specific strings. E.g.: Azure, Google Cloud, DeepL. Sometimes (as is the case with DeepL) they have different source and target lists. In most cases the languages are simple two- or three-letter codes, but there are often some subtags used.
I get that on the web we're holding ourselves to higher standards for API design. But dang, this is just so simple. And a lot of developers are using such APIs today. If we want something more complicated, I and all the other implementers need some help figuring it out... |
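To make the "static list" model concrete, here is roughly what those cloud APIs amount to, sketched with hypothetical (shortened) source and target lists; the shape, with a few region-qualified targets mixed into plain codes, loosely mirrors the separate source/target lists mentioned above:

```js
// The "static list" model used by several cloud translation APIs:
// the lists below are illustrative, not any vendor's real lists.
const SOURCE_LANGS = ["ja", "en", "zh"];
const TARGET_LANGS = ["ja", "en-US", "en-GB", "zh"];

function canTranslate(source, target) {
  // Exact, case-insensitive string match: no Lookup, no Filtering.
  const s = source.toLowerCase();
  const t = target.toLowerCase();
  return (
    SOURCE_LANGS.some((l) => l.toLowerCase() === s) &&
    TARGET_LANGS.some((l) => l.toLowerCase() === t)
  );
}

canTranslate("ja", "en-US"); // true
canTranslate("ja", "en");    // false: "en" is not literally in the target list
```

The second call is the sharp edge of this model: a perfectly reasonable request fails because the caller did not massage the tag into the exact supported string.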
Language tag handling
The proposed mechanisms don't make sense. They require absolute tag matches in order to work, when the normal way for translation and locale-based mechanisms to work is either BCP47 Lookup or BCP47 Filtering.
Generally, for this type of API, Lookup is the preferred mechanism, usually with some additional tailoring (the insertion of missing subtags: `Intl` already provides this). For example, if a system supports `ja` and `en`, then `canTranslate()` should match requests for `en-US`, `en-GB`, `ja-JP`, or `ja-u-ca-japanese`, but not requests for `ena`, `fr`, or `zh-Hans`.

Failing to provide this sort of support would mean that implementations would have to provide dozens or hundreds of tags that they "support" and/or would require the caller to massage the tag (instead of passing it blindly). This is especially the case in the "download" case, in which a site might generate dozens of spurious downloads due to mutations of the language tag.
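For reference, BCP 47 Lookup itself (RFC 4647 section 3.4) is only a few lines. This standalone sketch shows the fallback behavior described here, with a hypothetical `lookup` helper that returns the supported tag a request resolves to:

```js
// RFC 4647 section 3.4 "Lookup": progressively truncate the requested
// tag until it matches a supported tag, also stripping any stranded
// single-letter subtag (e.g. a dangling "-u").
function lookup(requested, supported) {
  let tag = requested.toLowerCase();
  const pool = supported.map((t) => t.toLowerCase());
  while (tag) {
    if (pool.includes(tag)) return tag;
    let i = tag.lastIndexOf("-");
    if (i === -1) return null;
    // Don't leave a singleton subtag dangling at the end.
    if (i >= 2 && tag[i - 2] === "-") i -= 2;
    tag = tag.slice(0, i);
  }
  return null;
}

lookup("ja-u-ca-japanese", ["ja", "en"]); // "ja"
lookup("en-US", ["ja", "en"]);            // "en"
lookup("zh-Hans", ["ja", "en"]);          // null
```

The last call shows the "but not `zh-Hans`" behavior above: Lookup never broadens a request, it only falls back toward less specific forms of what was asked for.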
Note: a deeper discussion, possibly in a joint teleconference, might be useful here.