Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make DateTimeFormat Patterns use ZeroVec #519

Closed
zbraniecki opened this issue Mar 3, 2021 · 5 comments
Closed

Make DateTimeFormat Patterns use ZeroVec #519

zbraniecki opened this issue Mar 3, 2021 · 5 comments
Assignees
Labels
A-design Area: Architecture or design C-datetime Component: datetime, calendars, time zones T-core Type: Required functionality
Milestone

Comments

@zbraniecki
Copy link
Member

While wanting to test my PR for #277 I started looking at the DTF pattern/parser.

It currently is pretty inefficient as it allocates new Strings for each literal and grows them and then allocates a vector of items just to populate a buffer.

I'd like to remodel it in the following steps:

  • Separate provider traits between Patterns and Symbols. (Separate Patterns and Symbols trait methods in DateTimeFormat. #517).
  • Separate Patterns and Symbols data and only store Symbols on the struct. This is somewhat necessary because Rust can't reason about partial borrows, but it also means DTF is lighter. (Store only symbols on DateTimeFormat #518).
  • Rewrite the parser to produce iterator over pattern items.
  • Turn PatternItem::Literal to be a Cow
  • Enable parser to produce owned Pattern from String, borrowed from &strand Cow from Cow.
  • Plug SimpleFormatter for iterator composition, which can be allocation free if the iterator ends up populating a buffer, or can be collected into a Pattern/Vec.
@zbraniecki zbraniecki added A-design Area: Architecture or design C-datetime Component: datetime, calendars, time zones labels Mar 3, 2021
@zbraniecki zbraniecki added this to the ICU4X 0.2 milestone Mar 3, 2021
@zbraniecki zbraniecki self-assigned this Mar 3, 2021
@sffc sffc added the T-core Type: Required functionality label Mar 4, 2021
@zbraniecki zbraniecki modified the milestones: ICU4X 0.2, ICU4X 0.3 Apr 14, 2021
@sffc
Copy link
Member

sffc commented Jul 22, 2021

Since 0.3 is already 3 weeks late, we suggest moving all incomplete issues to 0.4.

@sffc
Copy link
Member

sffc commented Aug 31, 2021

Here is one way we could express Pattern as a ZeroVec:

Screenshot from 2021-08-31 17-45-33

(Oops: In the Jamboard, I expressed my code points in big-endian instead of little-endian.)

In code, this would be something like

#[derive(Copy)]
struct EncodedPatternItem([u8; 3]);

struct Pattern<'data> {
    items: ZeroVec<'data, EncodedPatternItem>,
};

impl EncodedPatternItem {
    #[inline]
    fn get_discriminant(&self) { /* extract the first bit */ }
    #[inline]
    fn get_code_point(&self) { /* extract the code point bits */ }
    #[inline]
    fn get_key(&self) { /* extract the key bits */ }
    #[inline]
    fn get_value(&self) { /* extract the value bits */ }
}

Or, perhaps,

// same definition of EncodedPatternItem

#[derive(Copy)]
enum PatternItem<K, V>
where
    K: Copy + /* encodable to a single byte */,
    V: Copy + /* encodable to a single byte */,
{
    Field(K, V),
    Char(char),
}

impl<K, V> AsULE for PatternItem<K, V> {
    type ULE = EncodedPatternItem;
    // ...
}

struct Pattern<'data, K, V> {
    items: ZeroVec<'data, PatternItem<K, V>>,
};

In other words, in Pattern, you store a ZeroVec of EncodedPatternItem, which can be expanded to a PatternItem<K,V>. In DateTimeFormat, K is the field type enum (like Days, Hours, Years), and V is the field length enum (like numeric, 2-digit, abbreviated).

This is generalizable to other types of patterns, such as list patterns and currency/unit affix patterns.

Advantages:

  1. Compact storage
  2. Cheap during data loading
  3. Low memory use
  4. Generalizable to multiple types of patterns
  5. No additional lifetimes

Disadvantages:

  1. Runtime performance degradation (likely small but measurable)
  2. Requires a bit of extra work to implement (amortized if we can re-use it across multiple formatters)

Considerations (neither advantages nor disadvantages):

  1. Works best for patterns with many placeholders (like date patterns) and less well for patterns with strings of text literals (like MessageFormat), since each code point is encoded as its own pattern item
  2. Requires a fast mapping from byte to enum for the key and value (field type and field length)

CC @Manishearth

@zbraniecki zbraniecki changed the title Turn DateTimeFormat Pattern Literals into Cow Make DateTimeFormat Patterns use ZeroVec Sep 16, 2021
@sffc sffc modified the milestones: 2021 Q3 0.4 Sprint B, ICU4X 0.4 Sep 16, 2021
@zbraniecki
Copy link
Member Author

Screenshot from 2021-09-16 11-32-59

First benchmarks! Before any serious optimizations but fully operational!

@zbraniecki
Copy link
Member Author

The final piece of the puzzle is the cleanup in skeleton code.

In particular I believe we should:

  • Clean up the module a bit, there's a bit of code that doesn't seem necessary
  • Move the selection functions to operate on provider::PatternV1 in preparation to make it work as projections
  • Actually move the functions to provider::helper alongside pattern selections
  • Consider extracting algorithmic logic back to skeleton (the current logic is bound to provider SkeletonV1, but in theory it could work at runtime taking any map of skeleton/pattern and options)
  • Consider separating runtime Skeleton backed by ZeroMap from reference one

The latter two I'd prefer not to do now, so I'm going to spin off separate issues for it, but the first three I think are worth doing now.

@zbraniecki
Copy link
Member Author

DONE !!! :D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-design Area: Architecture or design C-datetime Component: datetime, calendars, time zones T-core Type: Required functionality
Projects
None yet
Development

No branches or pull requests

2 participants