Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify, document, and enforce string types in data structs #571

Closed
sffc opened this issue Mar 24, 2021 · 4 comments · Fixed by #1417
Closed

Clarify, document, and enforce string types in data structs #571

sffc opened this issue Mar 24, 2021 · 4 comments · Fixed by #1417
Assignees
Labels
C-data-infra Component: provider, datagen, fallback, adapters S-small Size: One afternoon (small bug fix or enhancement) T-docs-tests Type: Code change outside core library

Comments

@sffc
Copy link
Member

sffc commented Mar 24, 2021

@gregtatum said in https://github.com/unicode-org/icu4x/pull/541/files#r597812794:

We're only testing a subset of locale data available in the CLDR. This leads to some observations:

  1. The CLDR could currently contain data that breaks the assumption of what can be represented by a certain size. I think this is somewhat unlikely, but for the available formats work I wrote a script to programmatically verify assumptions.

  2. Future CLDR data could contain data that breaks this assumption. I'm less concerned about this as we could bump the version and fix the problem. I don't think we have an automated solution to categorically detect any case of this happening. However, it would happen when some user manually generates the data. This risk seems somewhat low, and something we could fix on the fly.

We should consider having a standard way to recommend whether a field in a data struct is best as a TinyStr, SmallStr, or Cow<str>, in accordance with the style guide.

We should also at least document why we reached a decision on which string type to use. @gregtatum also said:

Suggestion: It would be nice to document the decision here on representing this and others with 8 bytes.

@sffc sffc added C-data-infra Component: provider, datagen, fallback, adapters discuss Discuss at a future ICU4X-SC meeting labels Mar 24, 2021
@sffc sffc changed the title Create tool for automating whether Cow<str> or SmallStr should be used in a data struct Clarify, document, and enforce string choices in data structs Mar 25, 2021
@sffc sffc changed the title Clarify, document, and enforce string choices in data structs Clarify, document, and enforce string types in data structs Mar 25, 2021
@gregtatum
Copy link
Member

Another question I don't know the answer to, is what happens when TinyStr is too long in data providers?

@sffc
Copy link
Member Author

sffc commented Mar 26, 2021

Another question I don't know the answer to, is what happens when TinyStr is too long in data providers?

If the ASCII strings are fixed length, choose between TinyStr4, TinyStr8, and TinyStr16. If they are variable length, then you get into an interesting discussion of whether to use TinyStrAuto or fall back on Cow or SmallString.

@sffc sffc added S-small Size: One afternoon (small bug fix or enhancement) T-docs-tests Type: Code change outside core library and removed discuss Discuss at a future ICU4X-SC meeting labels Apr 29, 2021
@sffc sffc self-assigned this Apr 29, 2021
@sffc sffc added this to the 2021 Q2-m3 milestone Apr 29, 2021
@sffc sffc added the blocked A dependency must be resolved before this is actionable label Apr 29, 2021
@sffc
Copy link
Member Author

sffc commented Apr 29, 2021

Blocked on #667

@sffc sffc removed the blocked A dependency must be resolved before this is actionable label Aug 11, 2021
@sffc
Copy link
Member Author

sffc commented Aug 11, 2021

The enforcement should be handled by the new #[icu_provider::data_struct] annotation.

Action items remaining on this issue:

  • Ensure the style guide is up to date
  • Scrub ICU4X of remaining references to SmallStr and other obsolete mechanisms

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-data-infra Component: provider, datagen, fallback, adapters S-small Size: One afternoon (small bug fix or enhancement) T-docs-tests Type: Code change outside core library
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants