CLDR to ICU4X-JSON Converter #177
Comments
Regarding step 1, @hagbard wrote on email:
An alternative to step 1 is to leverage the Java API for CLDR via Rust JNI. |
I'm deeply concerned about the notion that CLDR data cannot be reliably read by anything but some underdocumented pre-existing tooling. So, my position short term is that sure, we can go with some java-cldr-to-icux crate to get us off the ground. But I'd like to strongly suggest that long term we aim to make CLDR JSON/XML data be available for any consumer to use without concerns. |
I think you might misunderstand where the intended boundary for "consumable" CLDR data lies. CLDR (the collection of XML files) is largely an internal detail for the CLDR project. There's a published definition of the schema and so on, but in its "raw" XML form, it's just not that consumable by client code. ICU data is the client consumable output, and the CLDR libraries (and all the collected "history" that goes along with them) are absolutely necessary for processing CLDR data into ICU data, and these libraries will be changed when the rules about CLDR data change. Personally I don't think you want to even consider consuming CLDR data directly and should interpose something that you control (e.g. a data schema) in between. Make something that does what you need, and write tools on top of the new API (which is the only way to access CLDR data that can be assumed to be supported by CLDR) to emit your data structures. |
Oh, yes, I definitely wasn't aware of that. That's... unfortunate. Would Unicode be open to making some normalized output (be it JSON or XML or some other canonical data format) a public output? Otherwise it seems like CLDR is functionally gated on a Java API for CLDR. |
That "normalized" output is ICU. |
If I'm not mistaken ICU output is an API (or in fact implementation of an API in C and Java), not a data model. I'm asking if we could have CLDR data output in formats such as XML or JSON. |
There are "IcuData" and "resource bundle" files you can get from running the various tools. None of it is promised to be long-term stable though (and some of it is pre-processed for easy consumption by the libraries). You'd have to talk to the ICU committee if you want things to be promised in some new published data format. ICU is an API for data, not a published data spec. CLDR is a published spec aimed at managing locale data, not at efficient use by client code. What you're asking for simply doesn't exist at the moment. The CLDR data API is the closest you've got, and it would be easy to adapt it to dump subsets of CLDR data (after locale resolution, filtering etc.) into a JSON format. Not sure about transliterator rules though; they're another sub-DSL inside CLDR data. I never had to understand that bit of it. |
UTS 35 does define the XML file structure. It covers locale fallbacks as well as the XML schema itself. What is the extent of the internal structure of the XML files that's not covered by UTS 35?
I agree. That's the intent behind step 2 in the OP.
ICU makes absolutely no claims over the stability of the ICU resource bundle files. ICU sees it as CLDR's responsibility to provide some semblance of stability. |
Oh, I'm sure you can write new code that handles CLDR in exactly the same way as the code that's there, it's just that it's (a) a huge amount of work and (b) guaranteed to go stale as things like locale fallback behaviour changes (that's documented but not long term stable). Living on the "outside" of the CLDR data API buys you an enormous reduction in technical debt. |
I am aware of two specific issues which are of particular interest when it comes to reading CLDR data:
I think (1), locale fallback resolution, is not too hard to handle. Supplemental data has a pretty clear DAG of parent fallbacks. I implemented it in not too much Python code in the ICU4C data slicing utility. For (2), if the cldr-staging production XML document order conforms to DTD order, then we could blindly read the elements in document order. We only need to read, not write or validate. What other "gotchas" are there which make it intractable to write a tool to read CLDR XML? |
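To make the parent-fallback idea in (1) concrete, here is a minimal Rust sketch. It is not the Python code from the ICU4C data slicing utility. The function name `fallback_chain` and the shape of `explicit_parents` (a child-to-parent map standing in for the parent table in supplemental data) are assumptions made for illustration, and real CLDR resolution has more rules than this.

```rust
use std::collections::HashMap;

/// Builds the fallback chain for `locale`, ending at "root".
///
/// `explicit_parents` is a hypothetical child -> parent map standing in for
/// the parent overrides in CLDR supplemental data. When a locale has no
/// explicit parent, the last subtag is truncated instead.
fn fallback_chain(locale: &str, explicit_parents: &HashMap<&str, &str>) -> Vec<String> {
    let mut chain = vec![locale.to_string()];
    let mut current = locale.to_string();
    while current != "root" {
        let parent = match explicit_parents.get(current.as_str()) {
            // An explicit parent wins over truncation.
            Some(p) => p.to_string(),
            // Otherwise drop the last subtag, or fall back to root.
            None => match current.rfind(|c: char| c == '-' || c == '_') {
                Some(idx) => current[..idx].to_string(),
                None => "root".to_string(),
            },
        };
        // Guard against a cyclic parent table in malformed input.
        if chain.contains(&parent) {
            break;
        }
        chain.push(parent.clone());
        current = parent;
    }
    chain
}
```

With a table containing only the pair `es-MX` -> `es-419`, the chain for `es-MX` is `es-MX`, `es-419`, `es`, `root`. A reader that resolves data would then look up each item along this chain.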
I'm not here to stop you deciding to do something you've set your mind on. Yes, it's "doable" by some definition. If you want to know exactly what the issues are, don't ask me; go look at the CLDRFile code and see if you can decipher it. I was originally going to do the locale fallback stuff in the API classes I wrote and was firmly steered towards using CLDRFile, because nobody wanted to support a second implementation due to the risk of bugs etc. At that point I no longer needed to dig into all the details, so I just wrapped CLDRFile with the new API classes. |
I'm fairly certain it does not (and it still would need filtering for things like "draft" status).
At least one other, which is "grouping". If you override the name for Monday, you inherit the rest of the day names from parent locales. There are some icky rules around this though. |
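As an illustration of the simple case described above (override one day name, inherit the rest), the following Rust sketch merges per-locale maps along a fallback chain. The names `DayNames` and `resolve_day_names`, the locale `en-XX`, and the sample strings are all hypothetical. The sketch deliberately ignores the "icky rules" mentioned above, as well as aliases and draft-status filtering.

```rust
use std::collections::HashMap;

/// Hypothetical per-locale table: day key -> display name.
type DayNames = HashMap<&'static str, &'static str>;

/// Resolves day names by walking `chain` from most specific to least
/// specific locale and taking each key from the first locale that defines it.
fn resolve_day_names(
    chain: &[&str],
    data: &HashMap<&str, DayNames>,
) -> HashMap<String, String> {
    let mut resolved = HashMap::new();
    for locale in chain {
        if let Some(names) = data.get(*locale) {
            for (day, name) in names {
                // Keep the value from the more specific locale if present.
                resolved
                    .entry(day.to_string())
                    .or_insert_with(|| name.to_string());
            }
        }
    }
    resolved
}
```

If a child locale defines only `mon`, the result takes `mon` from the child and every other day from the parent, which is the behaviour a reader of unresolved CLDR data would need to reproduce.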
I pinged some more folks. @srl295 says that we should either use the Java tool, or use the CLDR JSON, which is pre-resolved. However, CLDR JSON does not have comprehensive coverage. @macchiati says that it would be fairly easy to make CLDR publish fully resolved XML files. However, JSON would be good, and CLDR could help by making more guarantees about its completeness. However, CLDR JSON may not be good for RBNF or Translit Transforms. However, since ICU4X has a smaller set of features, CLDR JSON might be sufficient. |
Actually, the JSON format was designed for consumability by libraries without needing to deal with fallback resolution, etc.
I wouldn't say that it's an internal detail. The XML format IS the R in CLDR. That doesn't mean it's easy to consume. As it says in https://www.unicode.org/reports/tr35/#Introduction (I think I added this text):
So it's designed to be consumed, but it's optimized for maintenance over consumption. Edit to add: one area where CLDR has not done enough is support for consumers of CLDR. My own effort is (was?) here, https://github.com/unicode-org/cldr-implementers-guide including an outline from 5 and 8 years ago. But much more help is needed here. I think there's more work to do than just improving tr35. |
Fixed; thanks for clarifying! |
Ok, so we want to consume CLDR. How can we consume it, in line with what @srl295 says, and convert it into a more compact format? By the way, I know it's a small subset and it's anecdotal, but I've been using CLDR-JSON in https://github.com/zbraniecki/pluralrules and https://github.com/zbraniecki/unic-locale for a couple of years now. I have never encountered a problem with that. |
I would like to propose how to generate a cohesive ICU4X data schema, generated from CLDR.
Why not use CLDR JSON at runtime?
(1) CLDR JSON is optimized for maintainability and compatibility with the CLDR Survey Tool for data intake from linguists. It is not optimized for runtime use by internationalization libraries.
(2) The CLDR data schema can change over time. Having a separate ICU4X data schema means that we can be more stable and use the same code across multiple CLDR versions.
(3) Pre-processing the data allows us to optimize it for size and speed.

Here's how I want to transform CLDR into ICU4X.

1. Create `Cldr37DataProvider` with a filesystem path to a CLDR data directory, either JSON or XML. (I haven't decided yet which to consume.)
2. `Cldr37DataProvider` should implement `DataProvider` and return ICU4X data hunks based on data read from CLDR. This is where any transformations will take place.
3. `Cldr37DataProvider` should implement a new trait, `IterableDataProvider`. That trait should allow the client to loop over all supported locales and families.
4. A new tool, `DataExporter`, loops over the data from the data provider according to `IterableDataProvider` and persists it in your choice of data format, such as JSON or Bincode. It serializes it using Serde, since ICU4X data hunks implement Serde.
5. A directory, `/data`, that contains the output of `DataExporter` given `Cldr37DataProvider`.
6. For CLDR 38, add `Cldr38DataProvider`, which can use most of the same code as `Cldr37DataProvider`, with customizations for CLDR 38. I haven't yet figured out the object-oriented / polymorphism for this. I'll start by implementing `Cldr37DataProvider` and figure out the polymorphism as I go along.

Advantages of this approach:

- `Cldr37DataProvider` can use the rest of the ICU4X library to perform logic (like serializing UnicodeSets). Furthermore, clients can choose to use `Cldr37DataProvider` at runtime if they so choose.
- Because `Cldr37DataProvider` is compatible with the ICU4X runtime, we can make it thoroughly tested, and it won't get stale and become a tool that no one knows how to use.
- The `DataExporter` tool will be able to serialize data from any source, not just CLDR, as long as it implements `IterableDataProvider`. This gives a very clear path if a non-CLDR data source were to be added.

I previously considered tools that would perform JSON-to-JSON transformations, but quickly realized that such an approach lacks the advantages listed above.
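To show how the pieces of the proposal could fit together, here is a rough, std-only Rust sketch. Everything in it is an assumption made for illustration: the method names `load` and `supported_entries`, the `DataHunk` fields, the toy `InMemoryProvider`, and the `export_all` function standing in for `DataExporter`. The actual design would serialize with Serde and read CLDR files, which this sketch does not do.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for an ICU4X data hunk.
#[derive(Debug, Clone, PartialEq)]
pub struct DataHunk {
    pub family: String,
    pub locale: String,
    pub payload: String,
}

/// Assumed shape of the provider trait: load one hunk by family and locale.
pub trait DataProvider {
    fn load(&self, family: &str, locale: &str) -> Option<DataHunk>;
}

/// Assumed shape of the new trait: enumerate everything the provider supports.
pub trait IterableDataProvider: DataProvider {
    /// Returns (family, locale) pairs.
    fn supported_entries(&self) -> Vec<(String, String)>;
}

/// Toy provider backed by an in-memory map rather than CLDR files.
pub struct InMemoryProvider {
    entries: BTreeMap<(String, String), String>,
}

impl InMemoryProvider {
    pub fn new(items: &[(&str, &str, &str)]) -> Self {
        let entries = items
            .iter()
            .map(|(f, l, p)| ((f.to_string(), l.to_string()), p.to_string()))
            .collect();
        InMemoryProvider { entries }
    }
}

impl DataProvider for InMemoryProvider {
    fn load(&self, family: &str, locale: &str) -> Option<DataHunk> {
        self.entries
            .get(&(family.to_string(), locale.to_string()))
            .map(|payload| DataHunk {
                family: family.to_string(),
                locale: locale.to_string(),
                payload: payload.clone(),
            })
    }
}

impl IterableDataProvider for InMemoryProvider {
    fn supported_entries(&self) -> Vec<(String, String)> {
        self.entries.keys().cloned().collect()
    }
}

/// Exporter sketch: loops over every supported entry and renders one line
/// per hunk. A real exporter would serialize each hunk instead.
pub fn export_all<P: IterableDataProvider>(provider: &P) -> Vec<String> {
    provider
        .supported_entries()
        .iter()
        .filter_map(|(family, locale)| provider.load(family, locale))
        .map(|h| format!("{}/{}={}", h.family, h.locale, h.payload))
        .collect()
}
```

Because `export_all` depends only on `IterableDataProvider`, any other data source that implements the trait could be exported the same way, which is the point of the third advantage above.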
@zbraniecki @nciric