Make sets for enumerated properties consistent with Script_Extensions #1577
Labels
C-unicode
Component: Props, sets, tries
S-medium
Size: Less than a week (larger bug fix or enhancement)
T-techdebt
Type: ICU4X code health and tech debt
For the Script_Extensions work, we have exported from ICU a `CodePointTrie` with a companion array, which together contain the data for both the Script and Script_Extensions properties.

In order to return a `UnicodeSet` for code points whose Script_Extensions contains a particular Script code, we use the newly added `CodePointTrie::get_range()`. This approach allows us to realize the space savings that motivated ICU's design of the data for Script / Script_Extensions.

As always, there are space-time tradeoffs / alternatives available here, as discussed previously (1, 2, 3).
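
To make the range-walking idea concrete, here is a minimal, self-contained sketch. Everything in it is a hypothetical stand-in: `ToyTrie`, `TrieValue`, `toy_data`, and the simplified `get_range` signature are invented for illustration and are not the real ICU4X `CodePointTrie` API, and the value encoding is not ICU's actual Script / Script_Extensions encoding. It only shows the shape of the algorithm: step through the trie one same-valued range at a time, check whether that value's Script_Extensions contains the requested script, and collect the matching ranges.

```rust
/// Toy trie value: either a plain Script code (Script_Extensions is just that
/// script) or an index into the companion array of script lists.
#[derive(Clone, Copy, PartialEq, Debug)]
enum TrieValue {
    Script(u16),
    ExtIndex(usize),
}

/// Hypothetical stand-in for a code point trie: sorted range starts, each range
/// running up to the next start (or to `limit`, which is exclusive).
struct ToyTrie {
    starts: Vec<(u32, TrieValue)>,
    limit: u32,
}

impl ToyTrie {
    /// Returns the inclusive end and the value of the maximal run of
    /// same-valued code points beginning at `start`.
    fn get_range(&self, start: u32) -> Option<(u32, TrieValue)> {
        if start >= self.limit {
            return None;
        }
        // Last entry whose range start is <= `start`.
        let idx = self.starts.partition_point(|&(s, _)| s <= start).checked_sub(1)?;
        let value = self.starts[idx].1;
        // Extend across following entries that carry the same value.
        let mut next = idx + 1;
        while next < self.starts.len() && self.starts[next].1 == value {
            next += 1;
        }
        let end = if next < self.starts.len() {
            self.starts[next].0 - 1
        } else {
            self.limit - 1
        };
        Some((end, value))
    }
}

/// Collects, as sorted inclusive ranges, every code point whose
/// Script_Extensions contains `script`.
fn set_for_script_extensions(trie: &ToyTrie, ext: &[Vec<u16>], script: u16) -> Vec<(u32, u32)> {
    let mut ranges: Vec<(u32, u32)> = Vec::new();
    let mut start = 0u32;
    while let Some((end, value)) = trie.get_range(start) {
        let contains = match value {
            TrieValue::Script(s) => s == script,
            TrieValue::ExtIndex(i) => ext[i].contains(&script),
        };
        if contains {
            // Merge with the previous range when the two are adjacent.
            let extends_last = matches!(ranges.last(), Some(&(_, prev_end)) if prev_end + 1 == start);
            if extends_last {
                ranges.last_mut().unwrap().1 = end;
            } else {
                ranges.push((start, end));
            }
        }
        start = end + 1;
    }
    ranges
}

/// Made-up data with arbitrary script codes 0, 1, 2 (not real Unicode data).
fn toy_data() -> (ToyTrie, Vec<Vec<u16>>) {
    let trie = ToyTrie {
        starts: vec![
            (0, TrieValue::Script(0)),
            (10, TrieValue::Script(1)),
            (20, TrieValue::ExtIndex(0)), // extensions = {1, 2}
            (30, TrieValue::Script(2)),
            (40, TrieValue::Script(0)),
        ],
        limit: 50,
    };
    (trie, vec![vec![1, 2]])
}
```

With the toy data, asking for script `1` yields the single merged range `(10, 29)`, because the plain-script range 10..=19 and the extensions range 20..=29 are adjacent. No per-script set has to be stored; the cost is a walk over the trie's ranges at query time.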
For now, it might make sense to update the API implementations for sets-for-enumerated-properties (ex: `pub fn get_for_script<D>(provider: &D, enum_val: Script) -> UnisetResult`) to depend on data in a manner consistent with the set-for-Script_Extensions API. In other words, we would need only one data key for the entire enumerated property and would depend only on the data for the code point trie. (Currently, we serialize each prop=val UnicodeSet and have a corresponding data key for each.)

In the future, we can think about optimizations. For example, some algorithms that depend on General_Category are interested only in a certain subset of gc values (ex: `Nd`, `Nl`, and `No` for numbers), so keeping the option of the current style of key=val serialized sets data for General_Category might be useful for data slicing.
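
As a rough illustration of the "one data key per enumerated property" idea, the sketch below derives both single-code-point lookups and value sets from one shared map. `ToyPropertyMap`, `set_for`, and `toy_general_category` are hypothetical names invented here, not ICU4X APIs, and the map covers only a small excerpt of General_Category; code points outside the listed ranges are simply absent from the toy.

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum GeneralCategory {
    UppercaseLetter, // Lu
    DecimalNumber,   // Nd
    LetterNumber,    // Nl
    OtherNumber,     // No
}

/// Hypothetical stand-in for the single code point trie of an enumerated
/// property: sorted, non-overlapping inclusive ranges with their values.
struct ToyPropertyMap {
    ranges: Vec<(u32, u32, GeneralCategory)>,
}

impl ToyPropertyMap {
    /// Point lookup, served from the same data as the sets below.
    fn get(&self, cp: u32) -> Option<GeneralCategory> {
        self.ranges
            .iter()
            .find(|&&(start, end, _)| start <= cp && cp <= end)
            .map(|&(_, _, gc)| gc)
    }

    /// Builds the set (as sorted inclusive ranges) of code points whose value
    /// is any of `wanted`, in a single pass over the ranges.
    fn set_for(&self, wanted: &[GeneralCategory]) -> Vec<(u32, u32)> {
        let mut out: Vec<(u32, u32)> = Vec::new();
        for &(start, end, gc) in &self.ranges {
            if !wanted.contains(&gc) {
                continue;
            }
            // Merge with the previous range when the two are adjacent.
            let extends_last = matches!(out.last(), Some(&(_, prev_end)) if prev_end + 1 == start);
            if extends_last {
                out.last_mut().unwrap().1 = end;
            } else {
                out.push((start, end));
            }
        }
        out
    }
}

/// A small excerpt of General_Category, for illustration only.
fn toy_general_category() -> ToyPropertyMap {
    ToyPropertyMap {
        ranges: vec![
            (0x30, 0x39, GeneralCategory::DecimalNumber),     // ASCII digits
            (0x41, 0x5A, GeneralCategory::UppercaseLetter),   // A-Z
            (0xB2, 0xB3, GeneralCategory::OtherNumber),       // superscript two, three
            (0x2160, 0x2182, GeneralCategory::LetterNumber),  // Roman numerals
        ],
    }
}
```

Calling `set_for` with `[DecimalNumber, LetterNumber, OtherNumber]` on the toy map returns the three number ranges in one pass, and `set_for` with `[UppercaseLetter]` returns `(0x41, 0x5A)` from the same data. The tradeoff mentioned above still applies: a client that only ever needs the number categories would carry the whole property's data unless pre-serialized per-value sets remain available for slicing.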