Add support for BCP 47 and output IANA language subtags #30
Comments
To further clarify: the rationale for not supporting ISO 639-1 output (as discussed in issue #10) does not apply to this request to return IANA primary language subtags, since they (certainly when further specified with BCP 47’s additional qualifiers) have, AFAIK, the largest coverage of the world’s languages and hence of what Franc is capable of recognizing. |
Thanks for the detailed question, great to see! Franc uses ISO 639-3 because that specification can represent every language used in Franc. Yes, BCP 47 can do that too, through the primary language subtag. The reason BCP 47 uses both 2-character and 3-character codes is that it draws on ISO 639-1, ISO 639-2, ISO 639-3, and ISO 639-5 (whichever comes first), and registers those in the IANA registry. Now, 639-5 is dead, and to my knowledge all 639-1 and 639-2T codes are also in 639-3, so there wouldn’t be more (possibly supported) codes if Franc switched to BCP 47 language subtags. The reason to use 639-3 is that it’s a single list of codes, each of three characters, large enough to contain all languages used in franc and small enough to include nothing else. BCP 47, on the other hand, is huge. If you pull down the IANA registry, that’s a lot more data than needed, because multiple specs are involved. What are you planning to use Franc for? I know quite a bit about BCP 47 and ISO 639. Maybe I can help in some other way. P.S. thanks for offering to PR! |
Thanks for your swift response and clarification! I can see why, from a development/design perspective, ISO 639-3 (which has both complete and concise coverage, with uniform tags) is to be preferred over BCP 47 / IANA (which have more-than-necessary coverage, with tags of unpredictable length and form). From a practical viewpoint, however, it is still desirable to have Franc (optionally) return IANA primary language subtags, since the W3C recommends those as the preferred value for the lang attribute. This is not to say that Franc should pull down the entire IANA registry, let alone reckon with BCP 47’s complex syntax. If I’m not mistaken, it would suffice to “just” add (and maintain…) a simple mapping from the ISO 639-3 codes to their corresponding IANA language tags for the 75 ≤ n ≤ 335 languages that Franc supports? (After all, and as far as we are concerned, both sets are just strings.) We are developing a typesetting service (Textus) which converts Markdown files into (html5 compliant) responsive webpages and (ISO 19005 compliant) PDF documents. We would like to use Franc to do automatic language detection, after which proper hyphenation can be applied. (BTW, we’re big fans of your remark.js too!) |
OK, thanks! First off, I do understand why you want BCP-47 tags. That’s a good use case. And, I agree that the solution would be pretty light, as it would not need the complete IANA registry. But, I do think the solution would be better placed in another module, instead of in the core of Franc.

```js
var franc = require('franc');
var toBCP47 = require('iso-639-3-to-bcp-47');
var lang = toBCP47(franc('An English language document with words.'));
console.log(lang);
```

Yields: 'en'

Would that work? |
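A minimal sketch of what such a separate module could look like, assuming a small hand-maintained table covering only a few of the codes franc can return. The function name `toBcp47` and the table are hypothetical, not an existing package's API. Codes without a listed two-letter equivalent are passed through unchanged, on the assumption that the three-letter code then serves as the primary subtag itself.

```javascript
// Hypothetical table from ISO 639-3 codes to the shorter IANA primary
// language subtag. Only a handful of entries are shown for illustration.
const shortSubtags = new Map([
  ['eng', 'en'],
  ['fra', 'fr'],
  ['deu', 'de'],
  ['nld', 'nl'],
  ['spa', 'es']
])

// Return the two-letter subtag when one is listed; otherwise return the
// input unchanged (assumption: the three-letter code is then the subtag).
function toBcp47(iso6393) {
  return shortSubtags.has(iso6393) ? shortSubtags.get(iso6393) : iso6393
}

console.log(toBcp47('eng')) // 'en'
console.log(toBcp47('ceb')) // 'ceb' (no two-letter code in the table)
```

In use, the result of `franc(text)` would be passed straight to `toBcp47`.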
That would do great for our use case[^†], thanks! How would you plan to implement (m.m. like to see implemented) such a module? I’d be happy to do some grunt work to make this happen. Just let me know how you’d like me to be of assistance! [^†]: FYI: Pandoc too (and LaTeX) default to BCP 47 instead of ISO 639-3. An |
Sorry for the late response. I’d say to either create a module specifically for franc (all theoretically possible codes are in the An alternative would be to do this for all 639-3 codes, mapping them to ISO 639-1. More useful for non-franc users, but maybe the IANA registry has different values. Great that you’re willing to investigate. Thanks! |
@rhythmus Ping! |
@rhythmus I’m closing this due to no response. Let me know if I can help you further or if I should re-open this! |
@wooorm is there no programmatic converter between iso-639-3 and bcp 47? I presume most of the work of this PR would actually be creating this (separate) converter?
As a side note, I feel like mentioning the iso-639-3 code format in the README (maybe with a link here?) would be helpful (I wasn't sure whether it was iso-639-3 or iso-639-2 and had to work it out) -- have drafted a PR, feel free to ditch that and word it yourself. PS thanks for this great library (and CLI is particularly handy!) |
Semantically, ISO 639-3 is a valid BCP 47: just not the suggested shortest canonical version. But yes, you’re right on the work going into that!
Yes, or with my own https://github.com/wooorm/iso-639-3. I’d suggest a new project though,
Yes, I’d like that! Awesome! 👍
Thank you :) |
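For reference:

```js
const franc = require('franc')
const iso639 = require('iso-639-3')

// Map each ISO 639-3 code to its ISO 639-1 code, where one exists.
const shortLang = {}
for (const {iso6391, iso6393} of iso639) shortLang[iso6393] = iso6391

// `md` holds the Markdown text whose language should be detected.
let lang = franc(md)
if (shortLang[lang]) lang = shortLang[lang]
```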
I created a new slim package to convert from iso-639-3 to iso-639-1, including for languages without an iso-639-1 code that have a "macro language". |
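The macrolanguage fallback mentioned here can be sketched roughly as follows. This is not the package's actual code: the tables are tiny illustrative samples and the name `toIso6391` is hypothetical. The idea is to try a direct two-letter code first and, failing that, look up the macrolanguage and use its two-letter code.

```javascript
// Illustrative samples only; a real package would derive these tables
// from the ISO 639-3 data files.
const direct = new Map([
  ['eng', 'en'],
  ['zho', 'zh'],
  ['ara', 'ar']
])

// Individual language -> macrolanguage (sample entries).
const macrolanguage = new Map([
  ['cmn', 'zho'], // Mandarin Chinese -> Chinese
  ['arb', 'ara'] // Standard Arabic -> Arabic
])

// Return an ISO 639-1 code, falling back to the macrolanguage's code,
// or undefined when neither exists in the tables.
function toIso6391(code) {
  if (direct.has(code)) return direct.get(code)
  const macro = macrolanguage.get(code)
  return macro === undefined ? undefined : direct.get(macro)
}

console.log(toIso6391('eng')) // 'en'
console.log(toIso6391('cmn')) // 'zh'
console.log(toIso6391('ceb')) // undefined
```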
By default, Franc returns ISO-639-3 three-letter language tags, as listed in the Supported Languages table.
We would like Franc to alternatively support outputting IANA language subtags as an option, in compliance with the W3C recommendation for specifying the value of the lang attribute in HTML (and the xml:lang attribute in XML) documents.
(Two- and three-letter) IANA language codes are used as the primary language subtags in the language tag syntax as defined by the IETF’s BCP 47, which may be further specified by adding subtags for “extended language”, script, region, dialect variants, etc. (RFC 5646 describes the syntax in full). The addition of such more fine-grained secondary qualifiers is, I guess, out of Franc’s scope, but it would nevertheless be very helpful if Franc were able to at least return the IANA primary language subtags, which suffice, if used stand-alone, to still comply with the spec.
On the Web, as the IETF and W3C agree, IANA language subtags and BCP 47 seem to be the de facto industry standard (at least more so than ISO 639-3). Moreover, the naming convention for TeX hyphenation pattern files (such as used by, among others, OpenOffice) uses ISO 639-1 codes, which overlap better with IANA language subtags, too.
If Franc output IANA language subtags, the return values could be used as-is, without any further post-processing or re-mapping, in, for example, CSS rules specifying hyphenation:
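A minimal sketch of that idea in JavaScript, assuming the detected subtag is already available as a string. The helper names `hyphenationRule` and `paragraph` are hypothetical, and the text is inserted without HTML escaping to keep the example short.

```javascript
// Build a CSS rule that enables automatic hyphenation for content
// marked with the given language subtag.
function hyphenationRule(subtag) {
  return `:lang(${subtag}) { hyphens: auto; }`
}

// Wrap text in a paragraph carrying the matching lang attribute.
// Note: no HTML escaping is done here; real code would need it.
function paragraph(subtag, text) {
  return `<p lang="${subtag}">${text}</p>`
}

console.log(hyphenationRule('en'))
// :lang(en) { hyphens: auto; }
console.log(paragraph('en', 'An English language document with words.'))
// <p lang="en">An English language document with words.</p>
```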