CLDR to ICU4X-JSON Converter #177

Closed
sffc opened this issue Jul 17, 2020 · 16 comments · Fixed by #258 or #199
Labels: A-design (Area: Architecture or design), C-data-infra (Component: provider, datagen, fallback, adapters), T-core (Type: Required functionality)

Comments

@sffc
Member

sffc commented Jul 17, 2020

I would like to propose how to generate a cohesive ICU4X data schema, generated from CLDR.

Why not use CLDR JSON at runtime? (1) CLDR JSON is optimized for maintainability and compatibility with the CLDR Survey Tool for data intake from linguists. It is not optimized for runtime use by internationalization libraries. (2) The CLDR data schema can change over time. Having a separate ICU4X data schema means that we can be more stable and use the same code across multiple CLDR versions. (3) Pre-processing the data allows us to optimize it for size and speed.

Here's how I want to transform CLDR into ICU4X.

  1. Create a Cldr37DataProvider with a filesystem path to a CLDR data directory, either JSON or XML. (I haven't decided yet which to consume.)
  2. Cldr37DataProvider should implement DataProvider and return ICU4X data hunks based on data read from CLDR. This is where any transformations will take place.
  3. Cldr37DataProvider should implement a new trait, IterableDataProvider. That trait should allow the client to loop over all supported locales and families.
  4. A new tool, DataExporter, loops over the data from the data provider according to IterableDataProvider and persists it in your choice of data format, such as JSON or Bincode. It serializes it using Serde, since ICU4X data hunks implement Serde.
  5. We add a new top-level directory, /data, that contains the output of DataExporter given Cldr37DataProvider.
  6. When a new CLDR version is released, we should create Cldr38DataProvider, which can use most of the same code as Cldr37DataProvider, with customizations for CLDR 38. I haven't yet figured out the object-oriented / polymorphism for this. I'll start by implementing Cldr37DataProvider and figure out the polymorphism as I go along.
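
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual ICU4X API: `DataHunk`, `DataKey`, `ToyCldrProvider`, and `export_all` are hypothetical names, the provider reads from an in-memory map rather than a CLDR directory, and the exporter hands each hunk to a caller-supplied sink where a real tool would presumably serialize with Serde.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for an ICU4X data hunk.
/// A real hunk would be a typed struct implementing Serde traits.
#[derive(Debug, Clone, PartialEq)]
pub struct DataHunk {
    pub payload: String,
}

/// Hypothetical key: a data category (e.g. "plurals") plus a locale.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct DataKey {
    pub category: String,
    pub locale: String,
}

/// Step 2: anything that can return a hunk for a key.
pub trait DataProvider {
    fn load(&self, key: &DataKey) -> Option<DataHunk>;
}

/// Step 3: a provider that can also enumerate everything it supports.
pub trait IterableDataProvider: DataProvider {
    fn supported_keys(&self) -> Vec<DataKey>;
}

/// Toy provider backed by an in-memory map (assumption: stands in for
/// a provider that reads a CLDR data directory).
pub struct ToyCldrProvider {
    entries: BTreeMap<DataKey, String>,
}

impl ToyCldrProvider {
    pub fn new(raw: &[(&str, &str, &str)]) -> Self {
        let entries = raw
            .iter()
            .map(|&(category, locale, value)| {
                let key = DataKey {
                    category: category.to_string(),
                    locale: locale.to_string(),
                };
                (key, value.to_string())
            })
            .collect();
        Self { entries }
    }
}

impl DataProvider for ToyCldrProvider {
    fn load(&self, key: &DataKey) -> Option<DataHunk> {
        // Any CLDR-to-ICU4X transformation would happen here;
        // trimming whitespace is just a placeholder for that work.
        self.entries.get(key).map(|raw| DataHunk {
            payload: raw.trim().to_string(),
        })
    }
}

impl IterableDataProvider for ToyCldrProvider {
    fn supported_keys(&self) -> Vec<DataKey> {
        self.entries.keys().cloned().collect()
    }
}

/// Step 4: walk every supported key and pass each hunk to a sink.
/// Returns the number of hunks exported.
pub fn export_all<P: IterableDataProvider>(
    provider: &P,
    mut sink: impl FnMut(&DataKey, &DataHunk),
) -> usize {
    let mut count = 0;
    for key in provider.supported_keys() {
        if let Some(hunk) = provider.load(&key) {
            sink(&key, &hunk);
            count += 1;
        }
    }
    count
}
```

Because `export_all` depends only on the `IterableDataProvider` trait, the same loop would work for a non-CLDR source, which is the extensibility point made in the list of advantages.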

Advantages of this approach:

  • All Rust. Cldr37DataProvider can use the rest of the ICU4X library to perform logic (like serializing UnicodeSets). Furthermore, clients can choose to use Cldr37DataProvider at runtime if they so choose.
  • Serde Serialization. We are able to populate our final structs, and let Serde serialize them into our choice of format. The format might include Rust-specific serialization formats, like the static data format suggested in How should we build static data? #78.
  • No Separate Tooling. Since Cldr37DataProvider is compatible with the ICU4X runtime, we can make it thoroughly tested, and it won't get stale and become a tool that no one knows how to use.
  • Extendable. The DataExporter tool will be able to serialize data from any source, not just CLDR, as long as it implements IterableDataProvider. This gives a very clear path if a non-CLDR data source were to be added.

I previously considered tools that would perform JSON-to-JSON transformations, but quickly realized that such an approach lacks the advantages listed above.

@zbraniecki @nciric

@sffc sffc added T-core Type: Required functionality C-data-infra Component: provider, datagen, fallback, adapters discuss Discuss at a future ICU4X-SC meeting labels Jul 17, 2020
@sffc sffc self-assigned this Jul 17, 2020
@sffc sffc added this to the 2020 Q3 milestone Jul 17, 2020
@sffc
Member Author

sffc commented Jul 21, 2020

Regarding step 1, @hagbard wrote on email:

You would have to reimplement all of the CLDR utils which my API uses (CLDRFile etc.). These are ... non trivial and ... let's say ... less than ideally documented. For example, the DTD contains metadata in a special format inside XML comments that must be parsed and used for correctness (e.g. correct ordering).

It would be by my estimate 4-6 man-months work to reimplement everything you need to support the locale resolution, aliasing etc. and then you'd need to keep up to date with CLDR changes (since as you know there are plans to change the structure and maybe the aliasing rules). My API will get that for free when CLDRFile is updated, you would be on your own.

An alternative to step 1 is to leverage the Java API for CLDR via Rust JNI.

@zbraniecki
Member

I'm deeply concerned about the notion that CLDR data cannot be reliably read by anything but some underdocumented pre-existing tooling.
It seems to me like @hagbard says that in order to use CLDR you need to go through a specific implementation of Java API, and I don't believe we should lock ourselves into that model long term.

So, my position short term is that sure, we can go with some java-cldr-to-icux crate to get us off the ground. But I'd like to strongly suggest that long term we aim to make CLDR JSON/XML data be available for any consumer to use without concerns.

@hagbard
Contributor

hagbard commented Jul 21, 2020

I think you might misunderstand where the intended boundary for "consumable" CLDR data lies.

CLDR (the collection of XML files) is largely an internal detail for the CLDR project. There's a published definition of the schema and so on, but in its "raw" XML form, it's just not that consumable by client code. ICU data is the client consumable output, and the CLDR libraries (and all the collected "history" that goes along with them) are absolutely necessary for processing CLDR data into ICU data, and these libraries will be changed when the rules about CLDR data change.

Personally I don't think you want to even consider consuming CLDR data directly and should interpose something that you control (e.g. a data schema) in-between. Make that do what you need, and write tools on top of the new API (which is the only way to access CLDR data that can be assumed to be supported by CLDR) to emit your data structures.

@zbraniecki
Member

CLDR (the collection of XML files) is largely an internal detail for the CLDR project.

Oh, yes, I definitely wasn't aware of that. That's... unfortunate.

Would Unicode be open to making some normalized output (be it JSON or XML or some other canonical data format) a public output?

Otherwise it seems like CLDR is functionally gated on a Java API for CLDR.

@hagbard
Contributor

hagbard commented Jul 21, 2020

That "normalized" output is ICU.

@zbraniecki
Member

If I'm not mistaken ICU output is an API (or in fact implementation of an API in C and Java), not a data model. I'm asking if we could have CLDR data output in formats such as XML or JSON.

@hagbard
Contributor

hagbard commented Jul 21, 2020

There are "IcuData" and "resource bundle" files you can get from running the various tools. None of it is promised to be long-term stable though (and some of it is pre-processed for easy consumption by the libraries). You'd have to talk to the ICU committee if you want things to be promised in some new published data format.

ICU is an API for data, not a published data spec. CLDR is a published spec aimed at managing locale data, not at efficient client code use. What you're asking for simply doesn't exist at the moment. The CLDR data API is the closest you've got and it would be easy to adapt it to dump subsets of CLDR data (after locale resolution, filtering etc.) into a JSON format.

Not sure about transliterator rules though, they're another sub-DSL inside CLDR data. I never had to understand that bit of it.

@sffc
Member Author

sffc commented Jul 21, 2020

CLDR (the collection of XML files) is largely an internal detail for the CLDR project.

UTS 35 does define the XML file structure. It covers locale fallbacks as well as the XML schema itself. What is the extent of the internal structure of the XML files that's not covered by UTS 35?

Make that do what you need, and write tools on top of the new API (which is the only way to access CLDR data that can be assumed to be supported by CLDR) to emit your data structures.

I agree. That's the intent behind step 2 in the OP.

There are "IcuData" and "resource bundle" files you can get from running the various tools. None of it is promised to be long-term stable though (and some of it is pre-processed for easy consumption by the libraries). You'd have to talk to the ICU committee if you want things to be promised in some new published data format.

ICU makes absolutely no claims over the stability of the ICU resource bundle files. ICU sees it as CLDR's responsibility to provide some semblance of stability.

@hagbard
Contributor

hagbard commented Jul 21, 2020

Oh, I'm sure you can write new code that handles CLDR in exactly the same way as the code that's there, it's just that it's (a) a huge amount of work and (b) guaranteed to go stale as things like locale fallback behaviour change (that's documented but not long term stable). Living on the "outside" of the CLDR data API buys you an enormous reduction in technical debt.

@sffc
Member Author

sffc commented Jul 21, 2020

I am aware of two specific issues which are of particular interest when it comes to reading CLDR data:

  1. Locale fallback resolution
  2. Element ordering

I think (1), locale fallback resolution, is not too hard to handle. Supplemental data has a pretty clear DAG of parent fallbacks. I implemented it in not too much Python code in the ICU4C data slicing utility.
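
As a rough illustration of what resolving a fallback chain might look like, here is a minimal sketch. It assumes a parent map in the spirit of the explicit parent entries in supplemental data, with truncation of the last subtag as the default; `fallback_chain` and the sample map entries used with it are hypothetical and are not taken from the ICU4C data slicing utility.

```rust
use std::collections::HashMap;

/// Compute the fallback chain for `locale`, ending at "root".
/// `explicit_parents` models explicit parent overrides (assumption:
/// keyed by underscore-separated locale IDs); any locale without an
/// entry falls back by dropping its last subtag.
pub fn fallback_chain(
    locale: &str,
    explicit_parents: &HashMap<&str, &str>,
) -> Vec<String> {
    let mut chain = vec![locale.to_string()];
    let mut current = locale.to_string();
    while current != "root" {
        let next = match explicit_parents.get(current.as_str()) {
            // An explicit parent takes priority over truncation.
            Some(parent) => (*parent).to_string(),
            None => match current.rfind('_') {
                Some(idx) => current[..idx].to_string(),
                None => "root".to_string(),
            },
        };
        // Guard against a malformed, cyclic parent map.
        if chain.contains(&next) {
            break;
        }
        chain.push(next.clone());
        current = next;
    }
    chain
}
```

With a map containing `en_AU -> en_001`, the chain for `en_AU` would be `en_AU`, `en_001`, `en`, `root`. This covers only the parent lookup itself; it says nothing about how individual data items are inherited along the chain.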

For (2), if the cldr-staging production XML document order conforms to DTD order, then we could blindly read the elements in document order. We only need to read, not write or validate.

What other "gotchas" are there which make it intractable to write a tool to read CLDR XML?

@hagbard
Contributor

hagbard commented Jul 21, 2020

I'm not here to stop you deciding to do something you've set your mind on.

Yes, it's "doable" by some definition. If you want to know exactly what the issues are, don't ask me, go look at the CLDRFile code and see if you can decipher it. I was originally going to do the locale fallback stuff in the API classes I wrote and was firmly steered towards using CLDRFile, because of nobody wanting to support a 2nd implementation due to the risk of bugs etc... At that point I no longer needed to dig into all the details, so just wrapped CLDRFile with the new API classes.

@hagbard
Contributor

hagbard commented Jul 21, 2020

I am aware of two specific issues which are of particular interest when it comes to reading CLDR data:

  1. Locale fallback resolution
  2. Element ordering

I think (1), locale fallback resolution, is not too hard to handle. Supplemental data has a pretty clear DAG of parent fallbacks. I implemented it in not too much Python code in the ICU4C data slicing utility.

For (2), if the cldr-staging production XML document order conforms to DTD order, then we could blindly read the elements in document order. We only need to read, not write or validate.

I'm fairly certain it does not (and it still would need filtering for things like "draft" status).

What other "gotchas" are there which make it intractable to write a tool to read CLDR XML?

At least one other, which is "grouping". If you override the name for Monday, you inherit the rest of the day names from parent locales. There are some icky rules around this though.
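
To make the Monday example concrete, here is a minimal sketch of item-by-item inheritance across a fallback chain. It deliberately models only the simple case described above and none of the special rules; `resolve_day_names`, the `DayNames` alias, and the sample data used with it are hypothetical.

```rust
use std::collections::BTreeMap;

/// Day key (e.g. "mon") mapped to a display name, for one locale.
type DayNames = BTreeMap<&'static str, &'static str>;

/// Resolve day names for the first locale in `chain`, which is ordered
/// from most specific to least specific (e.g. ["xx", "root"]).
/// Each day is taken from the most specific locale that defines it.
pub fn resolve_day_names(
    chain: &[&str],
    data: &BTreeMap<&str, DayNames>,
) -> BTreeMap<String, String> {
    let mut resolved = BTreeMap::new();
    // Walk from least to most specific so that later inserts override.
    for locale in chain.iter().rev() {
        if let Some(names) = data.get(*locale) {
            for (day, name) in names {
                resolved.insert((*day).to_string(), (*name).to_string());
            }
        }
    }
    resolved
}
```

If locale `xx` defines only `mon` and `root` defines `mon` and `tue`, the resolved map for `xx` takes `mon` from `xx` and `tue` from `root`.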

@sffc
Member Author

sffc commented Jul 21, 2020

I pinged some more folks.

@srl295 says that we should either use the Java tool, or use the CLDR JSON, which is pre-resolved. However, CLDR JSON does not have comprehensive coverage.

@macchiati says that it would be fairly easy to make CLDR publish fully resolved XML files. JSON would also be good, and CLDR could help by making more guarantees about its completeness. CLDR JSON may not be good for RBNF or Translit Transforms, but since ICU4X has a smaller set of features, CLDR JSON might be sufficient.

@srl295
Member

srl295 commented Jul 21, 2020

@sffc

(1) CLDR JSON is optimized for maintainability and compatibility with the CLDR Survey Tool for data intake from linguists.

Actually, the JSON format was designed for consumability by libraries without needing to deal with fallback resolution, etc.

@hagbard

CLDR (the collection of XML files) is largely an internal detail for the CLDR project. There's a published definition of the schema and so on, but in its "raw" XML form, it's just not that consumable by client code

I wouldn't say that it's an internal detail. The XML format IS the R in CLDR. That doesn't mean it's easy to consume.

As it says in https://www.unicode.org/reports/tr35/#Introduction (I think I added this text):

As LDML is an interchange format, it was designed for ease of maintenance and simplicity of transformation into other formats, above efficiency of run-time lookup and use. Implementations should consider converting LDML data into a more compact format prior to use.

So it's designed to be consumed, but it's optimized for maintenance over consumption.

Edit to add: one area where CLDR has not done enough is support for consumers of CLDR. My own effort is (was?) here: https://github.com/unicode-org/cldr-implementers-guide, including an outline from 5 and 8 years ago. But much more help is needed here. I think there's more work to do than just improving tr35.

@sffc
Member Author

sffc commented Jul 22, 2020

Actually, the JSON format was designed for consumability by libraries without needing to deal with fallback resolution, etc.

Fixed; thanks for clarifying!

@zbraniecki
Member

Ok, so we want to consume CLDR. How can we consume it, in line with what @srl295 says - to convert it into a more compact format?

Btw. I know it's a small subset and it's anecdotal, but I've been using CLDR-JSON in https://github.com/zbraniecki/pluralrules and https://github.com/zbraniecki/unic-locale for a couple of years now.
I know that many ECMA402 polyfills use CLDR-JSON as the source for their ECMA402 API data.

I have never encountered a problem with that.

@sffc sffc removed the discuss Discuss at a future ICU4X-SC meeting label Jul 23, 2020
@sffc sffc added the A-design Area: Architecture or design label Aug 19, 2020
@sffc sffc modified the milestones: 2020 Q3, ICU4X 0.1 Sep 11, 2020
@sffc sffc linked a pull request Oct 3, 2020 that will close this issue
@sffc sffc closed this as completed Oct 3, 2020
@sffc sffc linked a pull request Oct 3, 2020 that will close this issue
4 participants