Split date time data into smaller data keys? #257

zbraniecki · 2020-09-20T07:28:35Z

As I'm implementing Dates in DataProvider and testing them using DateTimeFormat, I have some questions about how should we structure that.

Generally, the data in question looks like this: https://github.com/unicode-cldr/cldr-dates-modern/tree/master/main/en

It has (per locale):

display names for Months, Weekdays, DayPeriods, Quarters, Eras
patterns for time, date, and date_time
list of best patterns for skeletons
interval patterns
time zone names
relative display names

For now, we need:

display names for months, weekdays, day periods
patterns for time, date and date_time
list of best patterns for skeletons

Display names come in different:

contexts ("format" and "stand_alone")
widths ("abbreviated", "narrow", "short", "wide")

but they also can be for different calendar systems (I see at least "generic" and "gregorian").

As far as I understand DataProvider we have some flexibility in what do we request and what we get in response.

We could, for example, put "months/format/narrow" as a variant in DataEntry and gregory in DataKey and get just a list of month names in "format" and "narrow" for "gregory" calendar.

Or, we can just ask for "gregory" and set no variant in DataEntry and get all display names for all contexts and all widths.

@sffc - what are your thoughts on that? How should a request/response look like?

The text was updated successfully, but these errors were encountered:

zbraniecki · 2020-09-20T08:31:59Z

Based on @sffc comments in that PR, it also seems like he'd suggest we split the dates data into multiple keys:

datesym ( month display names, weekday display names)
timesym ( day period display names)
patterns (for style, and for skeleton)

I assume that for things like time zone names, interval patterns and relative display names he's suggestion we also create separate keys.

sffc · 2020-09-20T20:14:00Z

First, please leave skeletons (including available_formats) aside. Those need their own discussion.

Responses to the several sub-questions:

DataProvider Flexibility

As far as I understand DataProvider we have some flexibility in what do we request and what we get in response.

Right. The intended goal of the data provider keys is not to mimic CLDR. It's to return data that is easy and efficient to consume at runtime. Mapping from CLDR format to ICU4X format happens in CldrJsonDataProvider. I want to make sure we are aligned on that goal.

Number of Data Keys

Regarding the breakdown of the data keys. I feel strongly that there should be a minimum of three distinct data keys for DateTimeFormat, which Zibi listed in #257 (comment). Here are my reasons:

A component can request the data it needs and nothing more.
The same key can be shared by multiple code paths; for example, datesym can be used by both the datetimestyle path and the skeleton path.
Smaller data hunks mean that more locales will be able to point to shared hunks and make data smaller on disk.

Format Widths for Display Names

I've been thinking about the format widths (long/short/narrow) for display names and whether they belong in the data requests (key/entry) or not. I'm thinking that no, they don't belong in the data request; long/short/narrow should be together in the same data key and entry, and they should be a leaf of the struct. My reasoning:

The width trio (long/short/narrow) tends to be strongly correlated; if one changes, so do the others. Therefore, considering the trio as a single key does not decrease the ability to create shared data hunks.
It is not uncommon to fall back between widths; for example, if narrow is not present, then we fall back to short.
In order for a split data key to work as a method to reduce code size, we would need to track the requested width through much of the call stack in a way suitable for code slicing. I think this is too fine-grained to track.

Format Widths for Patterns

The situation is a bit different for pattern widths (dateStyle/timeStyle). None of the three aforementioned conditions apply here: the width patterns are not strongly correlated; they do not fall back; and we should be able to slice the data very early in the call stack.

Therefore, I think it makes sense to put format widths into either the data key or the data entry.

I would like to make this judgement after we have the code written and we can sit down and look at the concrete implications to the data bundles.

Calendar Systems

I'm arriving at the conclusion that calendar systems should be in the data key. Reasons:

The data for each calendar system is unique. Not all systems are represented by 12 month names. When we need a new struct (data schema), we need a new data key.
The code needed to run calendar systems needs to be written for each individual calendar, so the data keys will correspond to those modules of code.

We may at some point want to add an all-in-one calendar data key, but this is not relevant right now. We should cross that bridge later when we add calendar math to ICU4X.

zbraniecki · 2020-09-21T01:01:24Z

Thank you! This is so helpful!

sffc · 2020-10-15T06:22:54Z

Because of reasons I discussed in the new doc datetime-input.md, I now believe that we should not have different data keys for different calendar systems. We should pool the essential symbols for calendar systems into the same data key. We could consider filtering the data in the data entry or the offline build tool.

However, I do still feel that we should have different data keys for patterns, date symbols, and time symbols. We should probably go further and split the date symbols down into eras, months, day periods, and time zone names.

sffc · 2020-10-31T07:49:29Z

Concretely, I envision DateTimeFormat using the following separate, orthogonal keys, which covers all formatting except for time zone (which I want to leave to a separate discussion):

Display Names

datetime/era@1 collects era display names covering all desired calendar systems
datetime/cycyear@ covers cyclic year names
datetime/quarter@1 covers quarter names
datetime/month@1 covers month names
datetime/weekday@1 covers weekday names
datetime/dayperiod@1 covers the a and b day periods (am, noon, pm, and midnight)
datetime/flexperiod@1 covers the B day periods (in the morning, in the afternoon, …)

Format Patterns

datetime/patterns@1 covers long/medium/short date patterns, time patterns, and glue patterns
datetime/skeletons@1 covers availableFormats: the mapping from skeletons to patterns

Why more keys instead of fewer keys? I listed reasons in #257 (comment), but to reiterate:

A component can request the data it needs and nothing more.
The same key can be shared by multiple code paths; for example, month names can be used by both the datetimestyle path and the skeleton path.
Smaller data hunks mean that more locales will be able to point to shared hunks and make data smaller on disk.

zbraniecki · 2020-10-31T14:49:53Z

I'm convinced. This looks like a great design. One additional benefit of it is that version changes will be less common and more isolated in a more chunked model.

sffc · 2020-11-04T13:52:18Z

There are still some open questions in my mind about how exactly to provision data across calendar systems, but that question is being tracked in #355.

sffc · 2020-11-06T19:21:42Z

Shane to implement this along with #355.

sffc · 2020-12-09T00:22:20Z

Blocked on #409 like #355

sffc · 2022-01-27T18:53:56Z

This should be one of the last things to do in 1.0 after DTF has stabilized.

We should do this before 1.0 because it impacts data file stability.

gregtatum · 2022-03-29T21:10:15Z

I think this has two parts, one part is the data keys for the ECMA-402 compatible components bag, and the other is for the ideal components bag. Blocking for 1.0 will be ensuring we have the best split for the ECMA-402 compatible components bag.

sffc · 2022-05-27T18:20:49Z

Make sure to look at the data representation of the glue pattern and make changes if necessary for future-proofing. See #1131

sffc · 2022-07-31T05:59:48Z

We have split symbols from patterns and date from time; this is sufficient for the first release. I would still like to explore even more-granular splitting, but there's no time in 1.0 and we should coordinate this with the Ideal Components Bag work.

Manishearth · 2024-02-23T23:49:23Z

The neo date time format stuff does this

sffc · 2024-08-31T06:24:11Z

Posting this here because it seems like as good of a place as any:

I'l looking into the minimal set of patterns required for year formatting. A year can take three forms, which we could make dynamically selectable at runtime based on the value of the year:

Partial precision, such as: '00
Full precision, such as: 2000
With era, such as: 2000 AD

I looked into whether the patterns used for case 1 and 2 differ other than the length of the year field. I found that, at least according to the CLDR algorithm and data, the patterns for one are mostly identical to the patterns for the other with the year width swapped out. There are a few exceptions:

Legend: pattern on the left is the possibly-reduced-precision year with a full-precision year substituted. Pattern on the right is the pattern resulting from a full-precision year used during skeleton selection.

DataLocale{ug}/"chinese": "r-M-d" != "r-MM-dd"
DataLocale{qu-BO}/"gregorian": "d-M-y" != "dd-MM-y"
DataLocale{ky}/"gregorian": "y-d-M" != "y-dd-MM"
DataLocale{sat}/"gregorian": "y-M-d" != "y-MM-dd"
DataLocale{bs-Cyrl}/"gregorian": "d.M.y." != "dd.MM.y."
DataLocale{en-SG}/"gregorian": "d/M/y" != "dd/MM/y"
DataLocale{ne}/"gregorian": "y-M-d" != "y-MM-dd"
DataLocale{hr-BA}/"gregorian": "d. M. y." != "dd. MM. y."
DataLocale{zu}/"gregorian": "y-M-d" != "y-MM-dd"
DataLocale{en-MV}/"gregorian": "d/M/y" != "dd/MM/y"
DataLocale{qu}/"gregorian": "d-M-y" != "dd-MM-y"
DataLocale{ha-GH}/"gregorian": "y-M-d" != "y-MM-dd"
DataLocale{en-AU}/"gregorian": "d/M/y" != "dd/MM/y"
DataLocale{qu-EC}/"gregorian": "d-M-y" != "dd-MM-y"
DataLocale{sah}/"gregorian": "y-M-d" != "y-MM-dd"
DataLocale{ha}/"gregorian": "y-M-d" != "y-MM-dd"
DataLocale{lkt}/"gregorian": "y-M-d" != "y-MM-dd"
DataLocale{ne-IN}/"gregorian": "y-M-d" != "y-MM-dd"
DataLocale{sat}/"gregorian": "y-M-d" != "y-MM-dd"
DataLocale{ha-NE}/"gregorian": "y-M-d" != "y-MM-dd"
DataLocale{ug}/"dangi": "r-M-d" != "r-MM-dd"

Spot-checking, in most of these cases, the pattern is inherited from the root locale, which makes me question the quality of the data.

If I were to store the patterns separately, I could use another flag in the data struct. We currently have a single byte reserved for flags in the packed data structure, and 4 bits are used, so this would mean using one more bit. However, we could keep the bit unset if the locale results in equivalent data as observed above.

https://github.com/unicode-org/icu4x/blob/main/components/datetime/src/provider/neo.rs#L560

This could be done as a follow-up so long as the bit remains available.

Part of #257

sffc · 2024-11-19T00:39:09Z

I believe this issue is fully resolved.

zbraniecki added A-design Area: Architecture or design discuss Discuss at a future ICU4X-SC meeting C-datetime Component: datetime, calendars, time zones labels Sep 20, 2020

zbraniecki mentioned this issue Sep 20, 2020

Add subset of dates data to DataProvider #256

Merged

zbraniecki mentioned this issue Sep 20, 2020

Adds support for CldrJSONDataProvider for Dates (based on #256) #258

Merged

zbraniecki self-assigned this Sep 24, 2020

sffc added this to the 2020 Q4 milestone Oct 22, 2020

sffc assigned sffc and unassigned zbraniecki Nov 6, 2020

sffc added T-core Type: Required functionality and removed discuss Discuss at a future ICU4X-SC meeting labels Nov 6, 2020

sffc added the blocked A dependency must be resolved before this is actionable label Dec 9, 2020

sffc mentioned this issue Jan 6, 2021

[datetime] Add support for width-aware noon and midnight DayPeriods #435

Closed

sffc added the discuss Discuss at a future ICU4X-SC meeting label Jan 6, 2021

sffc mentioned this issue Jan 7, 2021

Complete the set of DateTimeFormat options #272

Open

3 tasks

sffc modified the milestones: 2020 Q4, 2021-Q1-m1 Jan 7, 2021

sffc removed the blocked A dependency must be resolved before this is actionable label Jan 11, 2021

sffc mentioned this issue Jan 12, 2021

Add support for width-aware noon and midnight options for DayPeriod #444

Merged

sffc mentioned this issue Jan 21, 2021

Moving data provider structs into their own crates #451

Merged

sffc removed this from the 2021-Q1-m1 milestone Feb 3, 2021

zbraniecki mentioned this issue Oct 1, 2021

Separate SkeletonPatterns into its own key #1139

Merged

sffc modified the milestones: ICU4X 0.5, ICU4X 1.0 Jan 27, 2022

sffc unassigned zbraniecki Jan 27, 2022

sffc added the help wanted Issue needs an assignee label Jan 27, 2022

sffc removed the help wanted Issue needs an assignee label Jan 27, 2022

sffc assigned zbraniecki and sffc Jan 27, 2022

sffc modified the milestones: ICU4X 1.0 Untriaged, ICU4X 1.0 (Polish) May 25, 2022

sffc mentioned this issue May 27, 2022

Introduce CombinedPattern to DTF #1131

Open

This comment was marked as outdated.

Sign in to view

sffc modified the milestones: ICU4X 1.0 (Polish), ICU4X 2.0 Jul 31, 2022

Manishearth added this to icu4x 2.0 Feb 23, 2024

Manishearth moved this to Being worked on in icu4x 2.0 Feb 23, 2024

sffc mentioned this issue Sep 6, 2024

Replace PluralElementsV1 with PluralElementsPackedULE #5510

Merged

sffc modified the milestones: ICU4X 2.0 ⟨P1⟩, ICU4X 2.0 Beta ⟨P1⟩ Sep 17, 2024

sffc mentioned this issue Sep 23, 2024

Implement PackedPatternsV1 with packing and unpacking #5580

Merged

sffc added a commit that referenced this issue Sep 25, 2024

Implement PackedPatternsV1 with packing and unpacking (#5580)

1141c8e

Part of #257

sffc closed this as completed Nov 19, 2024

github-project-automation bot moved this from Being worked on to Done in icu4x 2.0 Nov 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Split date time data into smaller data keys? #257

Split date time data into smaller data keys? #257

zbraniecki commented Sep 20, 2020

zbraniecki commented Sep 20, 2020

sffc commented Sep 20, 2020

zbraniecki commented Sep 21, 2020

sffc commented Oct 15, 2020

sffc commented Oct 31, 2020

zbraniecki commented Oct 31, 2020

sffc commented Nov 4, 2020 •

edited

Loading

sffc commented Nov 6, 2020

sffc commented Dec 9, 2020

sffc commented Jan 27, 2022 •

edited

Loading

gregtatum commented Mar 29, 2022

sffc commented May 27, 2022

This comment was marked as outdated.

sffc commented Jul 31, 2022

Manishearth commented Feb 23, 2024

sffc commented Aug 31, 2024

sffc commented Nov 19, 2024

Split date time data into smaller data keys? #257

Split date time data into smaller data keys? #257

Comments

zbraniecki commented Sep 20, 2020

zbraniecki commented Sep 20, 2020

sffc commented Sep 20, 2020

DataProvider Flexibility

Number of Data Keys

Format Widths for Display Names

Format Widths for Patterns

Calendar Systems

zbraniecki commented Sep 21, 2020

sffc commented Oct 15, 2020

sffc commented Oct 31, 2020

zbraniecki commented Oct 31, 2020

sffc commented Nov 4, 2020 • edited Loading

sffc commented Nov 6, 2020

sffc commented Dec 9, 2020

sffc commented Jan 27, 2022 • edited Loading

gregtatum commented Mar 29, 2022

sffc commented May 27, 2022

This comment was marked as outdated.

sffc commented Jul 31, 2022

Manishearth commented Feb 23, 2024

sffc commented Aug 31, 2024

sffc commented Nov 19, 2024

sffc commented Nov 4, 2020 •

edited

Loading

sffc commented Jan 27, 2022 •

edited

Loading