Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PluralRuleStringsV1 should be parsed at data load time #615

Closed
Manishearth opened this issue Apr 6, 2021 · 6 comments
Closed

PluralRuleStringsV1 should be parsed at data load time #615

Manishearth opened this issue Apr 6, 2021 · 6 comments
Assignees
Labels
A-ffi Area: FFI, WebAssembly, Transpilation A-performance Area: Performance (CPU, Memory) C-pluralrules Component: Plural rules S-medium Size: Less than a week (larger bug fix or enhancement) T-core Type: Required functionality

Comments

@Manishearth
Copy link
Member

Currently they need to be further parsed into a PluralRuleList. It would be nice if that were handled by the data provider itself. We would have to potentially provide utility functions for converting between a PluralRulesStrings and a final parsed PluralRuleList, or perhaps make it so that data providers that can provide PluralRulesStrings can automatically provide a PluralRuleList.

It would also be good if PluralRuleList and RulesSelector could use Cows and borrow from the data provider such that

cc @sffc

The lower level API in #575 will also need to be updated to handle this.

@sffc
Copy link
Member

sffc commented Apr 13, 2021

I'd suggest taking a similar approach as Pattern and Skeleton, where the JSON representation is human-readable, and the Bincode representation is pre-parsed. When JSON gets read in, the strings should be parsed, and when Bincode gets read in, you just need to point to the data.

@sffc sffc added A-ffi Area: FFI, WebAssembly, Transpilation A-performance Area: Performance (CPU, Memory) C-pluralrules Component: Plural rules S-medium Size: Less than a week (larger bug fix or enhancement) T-core Type: Required functionality help wanted Issue needs an assignee labels Apr 13, 2021
@sffc sffc added this to the ICU4X 0.4 milestone Apr 19, 2021
@Manishearth
Copy link
Member Author

We'll work on #663 first so that we know what FFI footguns are in the offing with this. It's possible to work on this now (and folks should feel free to pick it up!), but FFI may break.

@sffc
Copy link
Member

sffc commented Sep 17, 2021

Here's a model for how to represent plural rules as a zero-copy data structure: VarZeroVec<AndOrRelation>, which is a VarZeroVec<ZeroVec> with some metadata at each level.

This covers all cases, is compatible with UTS 35, and does not require infinite nesting.

Rust Structures

Here is AndOrRelation:

enum AndOr { And, Or };
enum Polarity { Positive, Negative };
struct AndOrRelation {
  and_or: AndOr,  # first entry is Or
  operand: PluralOperand,  # i, u, v, f, ...
  modulo: u32,
  polarity: Polarity,
  range_list: ZeroVec<RangeOrValue>,
};

And RangeOrValue:

enum RangeOrValue {
  Range(u32, u32),
  Value(u32),
}

Algorithm

I claim that the algorithm is just as fast as a more highly nested AST structure. Pseudo-code:

Let result = False.
For each relation in relation list:
  If relation.and_or == "and" and result == False:
    Next iteration.
  If relation.and_or == "or" and result == True:
    Return True.
  result = relation.compute(n).
Return result.

Byte Representation

Relation can be represented in 5 bytes plus the ZeroVec:

  • and_or: 1 bit
  • polarity: 1 bit
  • operand: say 6 bits (2^6 = 64 choices, far more than we currently need to support)
  • modulo*: 32 bits

* The modulo could likely be compacted further, given that virtually all modulos are on powers of 10.

RangeOrValue can be represented in 8 bytes:

  • Start of range: u32
  • End of range or scalar: u32
  • Discriminant: Whether the start of range and end of range are equal

Example

Rule string: "n % 10 = 3..4,9 and n % 100 != 10..19,70..79,90..99 or n = 0"

This rule string contains 3 operations. A JSON-like expansion into the above schema would be:

[
  {
    and_or: "or",  // first entry is "or"
    operand: "n",
    modulo: 10,
    polarity: "positive",
    range_list: [
      [3, 4],
      9
    ]
  },
  {
    and_or: "and",
    operand: "n",
    modulo: 100,
    polarity: "negative",
    range_list: [
      [10, 19],
      [70, 79],
      [90, 99]
    ]
  },
  {
    and_or: "or",
    operand: "n",
    modulo: 1,
    polarity: "positive",
    range_list: [
      0
    ]
  }
]     

The bytes:

  1. First Relation: 25 bytes
    • 5 bytes for the Relation header
    • 4 bytes for the ZeroVec length (I would really like to be able to configure this)
    • 8*2 bytes for the content of the ZeroVec
  2. Second Relation: 33 bytes
    • 5 bytes for the Relation header
    • 4 bytes for the ZeroVec length
    • 8*3 bytes for the content of the ZeroVec
  3. Third Relation: 17 bytes
    • 5 bytes for the Relation header
    • 4 bytes for the ZeroVec length
    • 8 bytes for the content of the ZeroVec

Total: 75 bytes.

For comparison, the string is 60 bytes. So we are a bit bigger, but not too much bigger, and there are opportunities to optimize the byte length:

  • The modulo can probably be stored in a single byte, since almost all instances are 1 to a power of 10
  • If the ZeroVec has a length of 1, we can probably avoid storing the length by sticking another bit discriminant somewhere in the Relation header
  • I highly suspect that the we only need u16 instead of u32 for the RangeList pair

With these optimizations, the byte length would become:

  1. First Relation: 14 bytes
    • 2 bytes for the Relation header
    • 4 bytes for the ZeroVec length (I would really like to be able to configure this)
    • 4*2 bytes for the content of the ZeroVec
  2. Second Relation: 18 bytes
    • 2 bytes for the Relation header
    • 4 bytes for the ZeroVec length
    • 4*3 bytes for the content of the ZeroVec
  3. Third Relation: 6 bytes
    • 2 bytes for the Relation header
    • 4 bytes for the singleton entry of the ZeroVec

Total: 38 bytes! Smaller than even the string representation.

Note: The above size does not include the VarZeroVec's own header, which will likely incur another 16ish bytes.

@Manishearth
Copy link
Member Author

This sounds like it could benefit from the custom derive, though a couple issues are that it's harder to achieve bitpacking with a custom derive, and also I don't think a custom derive for AsULE can handle enums.

So might be better to write some custom packed ULE types.

@Manishearth
Copy link
Member Author

@zbraniecki #1078 is resolved, hopefully that unblocks this

@zbraniecki
Copy link
Member

Fixed in #1240

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-ffi Area: FFI, WebAssembly, Transpilation A-performance Area: Performance (CPU, Memory) C-pluralrules Component: Plural rules S-medium Size: Less than a week (larger bug fix or enhancement) T-core Type: Required functionality
Projects
None yet
Development

No branches or pull requests

3 participants