Make sets for enumerated properties consistent with Script_Extensions #1577

echeran · 2022-02-03T01:56:49Z

For Script_Extensions work, we have exported a CodePointTrie with a companion array from ICU that contains the data for both Script and Script_Extensions properties.

In order to return a UnicodeSet for code points whose Script_Extensions contains a particular Script code, we use the newly-added CodePointTrie::get_range(). This approach allows us to realize the space savings that motivated the design of the data from ICU for Script / Script_Extensions.
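To make the range-walking idea concrete, here is a minimal sketch using only the standard library. `ToyTrie`, `Script`, and `set_for_script_ext` are hypothetical stand-ins invented for illustration; they are not the ICU4X `CodePointTrie` or `UnicodeSet` types, and the real `get_range()` signature may differ. The sketch assumes the trie value is an index into the companion array of script lists.

```rust
/// Largest valid Unicode code point.
const MAX_CP: u32 = 0x10FFFF;

/// Hypothetical script codes, only for this sketch.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Script {
    Common,
    Latin,
    Devanagari,
    Bengali,
}

/// Hypothetical stand-in for a code point trie: sorted runs of code points
/// that share one value. Each run ends just before the next one begins.
struct ToyTrie {
    /// (first code point of the run, value); the first run must start at 0.
    runs: Vec<(u32, u16)>,
}

impl ToyTrie {
    /// Returns (start, end, value) for the run containing `start`,
    /// beginning at `start` itself. Loosely modeled on a `get_range()` call.
    fn get_range(&self, start: u32) -> Option<(u32, u32, u16)> {
        if start > MAX_CP {
            return None;
        }
        // Index of the last run whose first code point is <= start.
        let idx = self
            .runs
            .partition_point(|&(first, _)| first <= start)
            .checked_sub(1)?;
        let end = match self.runs.get(idx + 1) {
            Some(&(next_first, _)) => next_first - 1,
            None => MAX_CP,
        };
        Some((start, end, self.runs[idx].1))
    }
}

/// Collects inclusive ranges of code points whose script list (looked up in
/// the companion array `ext` via the trie value) contains `script`.
fn set_for_script_ext(trie: &ToyTrie, ext: &[Vec<Script>], script: Script) -> Vec<(u32, u32)> {
    let mut out: Vec<(u32, u32)> = Vec::new();
    let mut start = 0u32;
    while let Some((lo, hi, idx)) = trie.get_range(start) {
        start = hi + 1; // jump to the first code point after this run
        if !ext[idx as usize].contains(&script) {
            continue;
        }
        // Merge with the previous range when the two are adjacent.
        if let Some(last) = out.last_mut() {
            if last.1 + 1 == lo {
                last.1 = hi;
                continue;
            }
        }
        out.push((lo, hi));
    }
    out
}
```

The loop touches one entry per run rather than one per code point, and no per-script set has to be stored alongside the trie and its companion array.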

As always, there are space-time tradeoffs / alternatives available here, as discussed previously (1, 2, 3).

For now, it might make sense to update the API implementations for sets-for-enumerated-properties (ex: pub fn get_for_script<D>(provider: &D, enum_val: Script) -> UnisetResult) to depend on data in a consistent manner with the set-for-Script_Extensions API. In other words, we would only need one data key for the entire enumerated property and only depend on the data for the code point trie. (Currently, we serialize each prop=val UnicodeSet and have a corresponding data key for each.)

In the future, we can think about optimizations. For example, some algorithms that depend on General_Category are only interested in a certain subset of gc values (ex: Nd, Nl, and No for numbers), so keeping the option of the current style of key=val serialized sets data for General_Category might be useful for data slicing.
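As a rough illustration of the "subset of gc values" case, the sketch below builds one set for several General_Category values in a single pass over range data. The `Gc` enum and `union_for_values` function are hypothetical names for this example only, and the input runs are assumed to be sorted and non-overlapping, as a walk over a code point trie would produce them.

```rust
/// Hypothetical subset of General_Category values, only for this sketch.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Gc {
    Nd,
    Nl,
    No,
    Lu,
}

/// Builds the union of inclusive code point ranges whose value is in `wanted`.
/// `runs` holds (start, end, value) triples in ascending order.
fn union_for_values(runs: &[(u32, u32, Gc)], wanted: &[Gc]) -> Vec<(u32, u32)> {
    let mut out: Vec<(u32, u32)> = Vec::new();
    for &(lo, hi, gc) in runs {
        if !wanted.contains(&gc) {
            continue;
        }
        // Adjacent runs with different wanted values collapse into one range.
        if let Some(last) = out.last_mut() {
            if last.1 + 1 == lo {
                last.1 = hi;
                continue;
            }
        }
        out.push((lo, hi));
    }
    out
}
```

A caller that only ever needs "numbers" would pass `&[Gc::Nd, Gc::Nl, Gc::No]`. With the current key=val style, the same result would instead come from loading and unioning three pre-serialized sets, which is the data-slicing tradeoff described above.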

sffc · 2022-02-03T03:42:34Z

For enumerated properties, there are (and have always been) two general mechanisms:

1. Pre-compiled UnicodeSet stored directly in the data provider
   - Best if you need a set for a specific value of an enumerated property known at compile time, such as "all letters" or "all code points in the Cyrillic script"
2. Code point tries with the UnicodeSet computed at runtime
   - Best if you need sets for most or all values of an enumerated property
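The difference between the two mechanisms can be sketched with plain standard-library types. Everything here is hypothetical: `precomputed_set`, `sets_for_all_values`, the `"Script=Cyrl"` key format, and the use of a `BTreeMap` as a stand-in for a data provider are assumptions made for illustration, not ICU4X APIs.

```rust
use std::collections::BTreeMap;

/// Inclusive code point ranges standing in for a UnicodeSet.
type RangeList = Vec<(u32, u32)>;

/// Mechanism 1 (sketch): each `prop=val` pair has its own stored set, so
/// fetching one known value is a single keyed lookup.
fn precomputed_set<'a>(
    store: &'a BTreeMap<&'static str, RangeList>,
    key: &str,
) -> Option<&'a RangeList> {
    store.get(key)
}

/// Mechanism 2 (sketch): a single pass over trie-like range data yields the
/// sets for every value at once. `runs` holds (start, end, value) triples.
fn sets_for_all_values(
    runs: &[(u32, u32, &'static str)],
) -> BTreeMap<&'static str, RangeList> {
    let mut sets: BTreeMap<&'static str, RangeList> = BTreeMap::new();
    for &(lo, hi, value) in runs {
        sets.entry(value).or_default().push((lo, hi));
    }
    sets
}
```

In this model, mechanism 1 pays storage for every stored set but does no work at runtime, while mechanism 2 stores only the range data and pays for a scan whenever a set is requested.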

Up until Script_Extensions, we used mechanism 1 for all of the enumerated properties we support (which is just General_Category and Script right now). Elango's PR for Script_Extensions implements mechanism 2 without support for mechanism 1.

Moving forward, I see a few options:

1. Support all with mechanism 1 and support a subset with mechanism 2
2. Support all with mechanism 2 and support a subset with mechanism 1
3. Support all with both mechanisms 1 and 2
4. Support some with mechanism 1 and others with mechanism 2

sffc · 2022-02-04T19:36:10Z

Decision 2022-02-04:

1. Add the set and range iteration functionality for all other enumerated properties
2. Remove the standalone pre-computed sets for script and general category, as well as the functions, unblocking Migrate icu_properties to ResourceProvider #1560

echeran added the following labels Feb 3, 2022: C-unicode (Component: Props, sets, tries); discuss (Discuss at a future ICU4X-SC meeting); S-medium (Size: Less than a week, larger bug fix or enhancement); T-techdebt (Type: ICU4X code health and tech debt)

sffc assigned echeran Feb 4, 2022

sffc added this to the 2022 Q1 0.6 Sprint B milestone Feb 4, 2022

echeran mentioned this issue Feb 15, 2022

Return sets for enumerated property value data using CPT data #1608

Merged

echeran modified the milestones: 2022 Q1 0.6 Sprint B, 2022 Q1 0.6 Sprint C Feb 17, 2022

sffc removed the discuss Discuss at a future ICU4X-SC meeting label Mar 3, 2022

sffc mentioned this issue Mar 3, 2022

Migrate icu_properties to ResourceProvider #1560

Closed

echeran closed this as completed in #1608 Mar 8, 2022
