PluralRuleStringsV1 should be parsed at data load time #615

Manishearth · 2021-04-06T02:50:46Z

Currently they need to be further parsed into a PluralRuleList. It would be nice if that were handled by the data provider itself. We would have to potentially provide utility functions for converting between a PluralRulesStrings and a final parsed PluralRuleList, or perhaps make it so that data providers that can provide PluralRulesStrings can automatically provide a PluralRuleList.

It would also be good if PluralRuleList and RulesSelector could use Cows and borrow from the data provider such that

cc @sffc

The lower level API in #575 will also need to be updated to handle this.

The text was updated successfully, but these errors were encountered:

sffc · 2021-04-13T15:07:35Z

I'd suggest taking a similar approach as Pattern and Skeleton, where the JSON representation is human-readable, and the Bincode representation is pre-parsed. When JSON gets read in, the strings should be parsed, and when Bincode gets read in, you just need to point to the data.

Manishearth · 2021-04-19T23:50:14Z

We'll work on #663 first so that we know what FFI footguns are in the offing with this. It's possible to work on this now (and folks should feel free to pick it up!), but FFI may break.

sffc · 2021-09-17T03:30:28Z

Here's a model for how to represent plural rules as a zero-copy data structure: VarZeroVec<AndOrRelation>, which is a VarZeroVec<ZeroVec> with some metadata at each level.

This covers all cases, is compatible with UTS 35, and does not require infinite nesting.

Rust Structures

Here is AndOrRelation:

enum AndOr { And, Or };
enum Polarity { Positive, Negative };
struct AndOrRelation {
  and_or: AndOr,  # first entry is Or
  operand: PluralOperand,  # i, u, v, f, ...
  modulo: u32,
  polarity: Polarity,
  range_list: ZeroVec<RangeOrValue>,
};

And RangeOrValue:

enum RangeOrValue {
  Range(u32, u32),
  Value(u32),
}

Algorithm

I claim that the algorithm is just as fast as a more highly nested AST structure. Pseudo-code:

Let result = False.
For each relation in relation list:
  If relation.and_or == "and" and result == False:
    Next iteration.
  If relation.and_or == "or" and result == True:
    Return True.
  result = relation.compute(n).
Return result.

Byte Representation

Relation can be represented in 5 bytes plus the ZeroVec:

and_or: 1 bit
polarity: 1 bit
operand: say 6 bits (2^6 = 64 choices, far more than we currently need to support)
modulo*: 32 bits

* The modulo could likely be compacted further, given that virtually all modulos are on powers of 10.

RangeOrValue can be represented in 8 bytes:

Start of range: u32
End of range or scalar: u32
Discriminant: Whether the start of range and end of range are equal

Example

Rule string: "n % 10 = 3..4,9 and n % 100 != 10..19,70..79,90..99 or n = 0"

This rule string contains 3 operations. A JSON-like expansion into the above schema would be:

[
  {
    and_or: "or",  // first entry is "or"
    operand: "n",
    modulo: 10,
    polarity: "positive",
    range_list: [
      [3, 4],
      9
    ]
  },
  {
    and_or: "and",
    operand: "n",
    modulo: 100,
    polarity: "negative",
    range_list: [
      [10, 19],
      [70, 79],
      [90, 99]
    ]
  },
  {
    and_or: "or",
    operand: "n",
    modulo: 1,
    polarity: "positive",
    range_list: [
      0
    ]
  }
]

The bytes:

First Relation: 25 bytes
- 5 bytes for the Relation header
- 4 bytes for the ZeroVec length (I would really like to be able to configure this)
- 8*2 bytes for the content of the ZeroVec
Second Relation: 33 bytes
- 5 bytes for the Relation header
- 4 bytes for the ZeroVec length
- 8*3 bytes for the content of the ZeroVec
Third Relation: 17 bytes
- 5 bytes for the Relation header
- 4 bytes for the ZeroVec length
- 8 bytes for the content of the ZeroVec

Total: 75 bytes.

For comparison, the string is 60 bytes. So we are a bit bigger, but not too much bigger, and there are opportunities to optimize the byte length:

The modulo can probably be stored in a single byte, since almost all instances are 1 to a power of 10
If the ZeroVec has a length of 1, we can probably avoid storing the length by sticking another bit discriminant somewhere in the Relation header
I highly suspect that the we only need u16 instead of u32 for the RangeList pair

With these optimizations, the byte length would become:

First Relation: 14 bytes
- 2 bytes for the Relation header
- 4 bytes for the ZeroVec length (I would really like to be able to configure this)
- 4*2 bytes for the content of the ZeroVec
Second Relation: 18 bytes
- 2 bytes for the Relation header
- 4 bytes for the ZeroVec length
- 4*3 bytes for the content of the ZeroVec
Third Relation: 6 bytes
- 2 bytes for the Relation header
- 4 bytes for the singleton entry of the ZeroVec

Total: 38 bytes! Smaller than even the string representation.

Note: The above size does not include the VarZeroVec's own header, which will likely incur another 16ish bytes.

Manishearth · 2021-09-17T07:46:21Z

This sounds like it could benefit from the custom derive, though a couple issues are that it's harder to achieve bitpacking with a custom derive, and also I don't think a custom derive for AsULE can handle enums.

So might be better to write some custom packed ULE types.

Manishearth · 2021-10-04T21:18:50Z

@zbraniecki #1078 is resolved, hopefully that unblocks this

zbraniecki · 2021-11-15T18:21:17Z

Fixed in #1240

Manishearth mentioned this issue Apr 6, 2021

Add lower-level data API to PluralRules #575

Merged

sffc added A-ffi Area: FFI, WebAssembly, Transpilation A-performance Area: Performance (CPU, Memory) C-pluralrules Component: Plural rules S-medium Size: Less than a week (larger bug fix or enhancement) T-core Type: Required functionality help wanted Issue needs an assignee labels Apr 13, 2021

sffc added this to the ICU4X 0.4 milestone Apr 19, 2021

sffc mentioned this issue Sep 16, 2021

Change PluralRules to not parse strings at runtime #854

Closed

sffc assigned zbraniecki Sep 16, 2021

sffc removed the help wanted Issue needs an assignee label Sep 16, 2021

sffc mentioned this issue Sep 17, 2021

Migrate remaining data structs to fully borrowed #856

Closed

This was referenced Sep 20, 2021

Add borrowed-only versions of data structs in ZeroVec #676

Closed

Restructure AsVarULE or get rid of it #1078

Closed

[tracking] ZeroVec improvements #1082

Closed

zbraniecki modified the milestones: ICU4X 0.4, ICU4X 0.5 Sep 30, 2021

zbraniecki mentioned this issue Oct 28, 2021

Switch PluralRules Data to ZeroVec #1240

Merged

zbraniecki modified the milestones: ICU4X 0.5, 2021 Q4 0.5 Sprint A, 2021 Q4 0.5 Sprint B Oct 29, 2021

zbraniecki closed this as completed Nov 15, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PluralRuleStringsV1 should be parsed at data load time #615

PluralRuleStringsV1 should be parsed at data load time #615

Manishearth commented Apr 6, 2021

sffc commented Apr 13, 2021

Manishearth commented Apr 19, 2021

sffc commented Sep 17, 2021 •

edited

Loading

Manishearth commented Sep 17, 2021

Manishearth commented Oct 4, 2021

zbraniecki commented Nov 15, 2021

PluralRuleStringsV1 should be parsed at data load time #615

PluralRuleStringsV1 should be parsed at data load time #615

Comments

Manishearth commented Apr 6, 2021

sffc commented Apr 13, 2021

Manishearth commented Apr 19, 2021

sffc commented Sep 17, 2021 • edited Loading

Rust Structures

Algorithm

Byte Representation

Example

Manishearth commented Sep 17, 2021

Manishearth commented Oct 4, 2021

zbraniecki commented Nov 15, 2021

sffc commented Sep 17, 2021 •

edited

Loading