Intl.DateTimeFormat needs a parser #342

Closed
anilanar opened this issue Apr 25, 2019 · 30 comments
Labels
c: datetime Component: dates, times, timezones enhancement s: comment Status: more info is needed to move forward

Comments

@anilanar

anilanar commented Apr 25, 2019

An instance of DateTimeFormat can do Date -> string but cannot do the reverse string -> Date for formatted strings it created.

It's very hard to implement that in user-land because different browsers might handle different languages in different ways.

My proposal is a parse method that is reverse of the format method:

const x = new Intl.DateTimeFormat(/* format options */);

// true, assuming format options are lossless
x.parse(x.format(aDate)).getTime() === aDate.getTime()
@littledan
Member

I believe this has been raised on other threads. Although ICU and many other libraries support localized date parsing, it's a bit brittle and not quite recommended. If someone has a free-form text input field, what they write might not be parseable by just trying to match what Intl.DateTimeFormat would output. For that reason, I'd encourage developers to make a structured input field, and to develop solutions to this harder problem in a higher-level library.

@zbraniecki
Member

If all you want is to get a field out of a formatted date, you could use formatToParts for that. I agree with Daniel that parsing localized i18n input is very brittle.
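
To make the formatToParts suggestion concrete, here is a minimal sketch that pulls individual fields out of a formatted date. The locale, options, and sample date are arbitrary choices made for illustration only.

```javascript
// Format a fixed instant in UTC so the result does not depend on the host time zone.
const partsFormatter = new Intl.DateTimeFormat('en-US', {
  year: 'numeric',
  month: 'long',
  day: 'numeric',
  timeZone: 'UTC',
});

const sampleParts = partsFormatter.formatToParts(new Date(Date.UTC(2019, 3, 25)));

// Drop the literal separators and index the remaining parts by their type.
const fieldsByType = Object.fromEntries(
  sampleParts
    .filter((part) => part.type !== 'literal')
    .map((part) => [part.type, part.value]),
);

console.log(fieldsByType); // keys: month, day, year (values are localized strings)
```

Each field arrives as a localized string, so this helps with display and with building structured inputs, not with turning arbitrary text back into a Date.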

@leobalter
Member

Any work on a parser should be coordinated with the Temporal proposal. Otherwise, I'm not in favor of creating a new parser, nor of jumping ahead with Date-like parsing on such a modern API.

@anilanar
Author

anilanar commented Apr 25, 2019

> If someone has a free-form text input field, what they write might not be parseable by just trying to match what Intl.DateTimeFormat would output

A parser may not be able to parse a given string. That's the case for all parsers. If it fails to parse, it can throw an error with the failure reason, return null, or do whatever the ES committee favors nowadays.

I'm not sure what is brittle here. I think DateTimeFormat has all the information necessary to parse what format produces. I'm not proposing a parser that tries to handle all possible formatting options. I'm asking for a parser that works with very specific formatting options plus a locale.

> I'd encourage developers to make a structured input field

I find it very hard to learn all the possible ways of formatting dates across the globe (e.g. 12-31-2018 in the US, 31.12.2018 in the EU, etc.) and to implement a structured input element that can handle them all. In addition, the locale is usually not enough to decide date formatting options. For example, my locale is en-US, but my date format (at the OS level) is configured as dd.MM instead of US style. Perhaps Intl.DateTimeFormat can tell us more about the format itself so that user-land can create a parser based on it.

For example:

  • C# has DateTime.Parse that takes a CultureInfo object, which is equivalent to locale + formatting options.
  • Joda library for Java has DateTimeFormatter that is a printer and a parser at the same time, similar to what I propose.
  • Apple's Foundation API (their std lib) has NSDateFormatter that is configured with either a locale identifier or a pattern and is able to parse strings matching its configuration.
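
As a point of reference for how much Intl.DateTimeFormat reveals about itself today, resolvedOptions() reports the negotiated locale, calendar, numbering system, and field widths, though not a pattern string such as dd.MM. A minimal sketch, with an arbitrary locale and options chosen for illustration:

```javascript
// resolvedOptions() reports what the formatter settled on after locale negotiation.
const resolved = new Intl.DateTimeFormat('en-US', {
  year: 'numeric',
  month: '2-digit',
  day: '2-digit',
}).resolvedOptions();

// Locale, calendar, and numbering system are exposed as identifiers.
console.log(resolved.locale, resolved.calendar, resolved.numberingSystem);

// The requested field widths are echoed back, but field order and separators are not.
console.log(resolved.year, resolved.month, resolved.day);
```

Field order and separators have to be recovered indirectly, for example by inspecting formatToParts output for a sample date.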

@littledan
Member

Note, Temporal does have a parser for ISO 8601. What I'm skeptical of is parsing human-input date-times, which seems intractable for a library as low-level/deterministic/data-driven as Intl.

I'm aware that many other date libraries have such a parser, and I think it's a mistake. Actually, V8 had an extension to ECMA-402 which included parsing in Intl.DateTimeFormat, and I removed it.

@leobalter
Member

> Note, Temporal does have a parser for ISO 8601. What I'm skeptical of is parsing human-input date-times, which seems intractable for a library as low-level/deterministic/data-driven as Intl.

This is exactly why I said this work should be coordinated. I'm against us creating something with the same goal in two different places using different implementations. Date.parse is already the epitome of confusion in JS.

@rxaviers
Member

I echo @littledan's comment above and, as @zbraniecki said, formatToParts can be used to build a good date picker.

@aphillips

While I agree with all the arguments against parsing human-entered date strings, there may be a tiny amount of value in parsing patterned date values (e.g. using picture strings or [less likely] skeletons). ISO 8601 is a good example handled elsewhere, but there are non-standard but machine-generated formats, sometimes localized, for which having calendar-aware machinery for parsing is occasionally useful. For me this has mainly been reading text-based flat file formats.

I could probably count on one hand the number of times this has been useful in my career: I've exerted way more effort avoiding this sort of parsing. But to me that would be the use case.

@leobalter
Member

I'm not against a new Date parser. I'm only asking that it follow up on similar work that has already been done, to avoid more inconsistency in a field where we have already had enough of it (from Date.parse). We might end up with more than one method here and there, but consistency is ultimately required.

@littledan
Member

@aphillips, how have you parsed these in the past? Can you give more detail on "sometimes localized"?

@sffc
Contributor

sffc commented Apr 25, 2019

Having implemented code that does things like this in ICU, I can attest that parsing localized strings is indeed very brittle.

There are two main use cases for the parsing of localized input:

  1. Strict: for example, to validate that text conforms to a given date format.
  2. Lenient: for example, to make a best attempt at turning a user-supplied string into a date.

A strict-only parser is not too hard to implement, because you have a limited space of strings that could be considered valid. However, when users think of parsing, they are usually thinking in terms of the second use case. That is a much harder problem to solve correctly. For example, if someone writes "10-12-2019", is that October 12 or December 10? If you know the user's locale, you can make a guess, and that's what ICU does. However, I wouldn't trust that result without having the user verify the output. This is why if the goal is user input, in general it is still safer to just use a good off-the-shelf date picker.
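
The "10-12-2019" ambiguity can be shown in a few lines. This is only a sketch: `parseNumericDate` and its `order` parameter are hypothetical names invented for this illustration, and the function accepts one fixed shape of input.

```javascript
// Interpret three numeric fields under an explicit, caller-supplied field order.
// `order` is a hypothetical parameter for this sketch: 'MDY' or 'DMY'.
function parseNumericDate(text, order) {
  const match = /^(\d{1,2})-(\d{1,2})-(\d{4})$/.exec(text);
  if (!match) return null;
  const [first, second, year] = match.slice(1).map(Number);
  const [month, day] = order === 'MDY' ? [first, second] : [second, first];
  return new Date(Date.UTC(year, month - 1, day));
}

const monthFirst = parseNumericDate('10-12-2019', 'MDY'); // read as October 12
const dayFirst = parseNumericDate('10-12-2019', 'DMY'); // read as December 10

console.log(monthFirst.toISOString().slice(0, 10), dayFirst.toISOString().slice(0, 10));
```

Both readings are valid dates, so nothing in the string itself tells a parser which one the user meant.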

@sffc sffc added c: datetime Component: dates, times, timezones s: comment Status: more info is needed to move forward labels Apr 25, 2019
@anilanar
Author

anilanar commented Apr 26, 2019

I’m not sure I’m on the same page as some of the other participants in this discussion.

I propose a strict-only parser anyway. DateTimeFormat defines a strict format, and I propose that it have a parse function that is also strict. To reiterate: parse is the mathematical inverse of format when the format options are lossless. It’s trivial to define an isomorphic relation between Date and string when the format options are lossy. So every adjective you use to characterize parse is also valid for format.

Why are we talking about Date.parse or other non-strict parsers anyways? They have nothing to do with what I’m proposing.
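
A strict, formatter-derived parser of the kind described in this comment could be sketched in user-land roughly as follows. This is not a proposed API: `makeStrictParser` is a hypothetical helper, and it assumes numeric year, month, and day fields, Latin digits, and the Gregorian calendar. It accepts a string only if re-formatting the parsed date reproduces that string exactly.

```javascript
// Build a strict parser from a formatter's own output shape.
// Assumes numeric year/month/day, Latin digits, and the Gregorian calendar.
function makeStrictParser(locale, options) {
  // Pin UTC so that a calendar date maps to exactly one instant.
  const formatter = new Intl.DateTimeFormat(locale, { ...options, timeZone: 'UTC' });
  const fieldOrder = [];
  const source = formatter
    .formatToParts(new Date(Date.UTC(2001, 1, 3)))
    .map((part) => {
      if (part.type === 'literal') {
        // Escape separators so they match literally.
        return part.value.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
      }
      fieldOrder.push(part.type);
      return '(\\d+)';
    })
    .join('');
  const pattern = new RegExp(`^${source}$`);

  return function parse(text) {
    const match = pattern.exec(text);
    if (!match) return null;
    const fields = {};
    fieldOrder.forEach((type, index) => {
      fields[type] = Number(match[index + 1]);
    });
    const candidate = new Date(Date.UTC(fields.year, fields.month - 1, fields.day));
    if (Number.isNaN(candidate.getTime())) return null;
    // Accept only input that the formatter itself would have produced.
    return formatter.format(candidate) === text ? candidate : null;
  };
}

const strictOptions = { year: 'numeric', month: '2-digit', day: '2-digit' };
const parseUS = makeStrictParser('en-US', strictOptions);
const original = new Date(Date.UTC(2019, 3, 25));
const formatted = new Intl.DateTimeFormat('en-US', { ...strictOptions, timeZone: 'UTC' }).format(original);

console.log(parseUS(formatted).getTime() === original.getTime());
console.log(parseUS('4/25/2019')); // rejected: not the formatter's own output shape
```

The round-trip check is what makes the parser strict, and it is also why the result depends on the engine's locale data: if the formatter's output changes, previously accepted strings stop matching.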

@littledan
Member

Could you say more about your use case where a strict parser is useful in applications?

@aphillips

I agree with what @sffc mentions above: strict is easy to code, but it is hugely intolerant of any variation, and it makes implementations sensitive to changes in CLDR data (what used to work suddenly stops working). A strict parser is useful when you are both the generator and the consumer of the resulting localized strings (in which case ISO 8601 is right there and you ought to use it).

@littledan when I have done lenient parsing, it was to parse custom date patterns, generally using the Java-based ICU DateFormat. For example, a long time ago, when I worked at webMethods (so at least 15-16 years ago), we needed code to parse various flat file formats defined by some obscure industry standard. I've also seen people attempting to parse date strings that were machine-generated (yet with localized tokens, mostly month names or abbreviations). This code was always super fiddly because the parser choked on even trivial things. I recall writing custom error handling over the lenient parser.

@anilanar I realize that you are searching for round trip capability, but the fact is that you're always better off passing date values as values and only using display strings for display purposes. It's a poorly internationalized application that relies on being able to interpolate a display string back into a date. It would be nice to have a "mathematical reverse", but stuff like time zone IDs gets in the way. Ultimately, the question is: what application do you have for this, vs. mere "completeness" of the API?
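
The "pass values as values" advice can be sketched as follows: exchange the date in a machine format and derive the display string only at the edge. The sample instant, locale, and options are arbitrary choices for illustration.

```javascript
// Exchange the value in a machine format; localize only at the display edge.
const instant = new Date(Date.UTC(2019, 3, 25, 12, 30));

const wire = instant.toISOString(); // what gets stored or sent
const restored = new Date(wire); // lossless round trip

// The display string is derived from the value and never parsed back.
const display = new Intl.DateTimeFormat('de-DE', {
  year: 'numeric',
  month: 'long',
  day: 'numeric',
  timeZone: 'UTC',
}).format(restored);

console.log(wire, restored.getTime() === instant.getTime(), display);
```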

@anilanar anilanar closed this as completed May 8, 2019
@rxaviers
Member

rxaviers commented May 8, 2019

In Globalize, we have a parser whose job is to perform the inverse operation of the formatter (it's strict). Its application is to parse user-entered input in a controlled UI (generated by formatToParts), e.g., https://github.com/rxaviers/react-date-input

@rxaviers
Member

rxaviers commented May 8, 2019

If ECMA-402 doesn't provide a parser (at least a number parser), how would a user parse non "latin" numerals (e.g., Eastern Arabic ٠١٢٣٤٥٦٧٨٩, fullwidth digits ０１２３４５６７８９)?

@rxaviers
Member

rxaviers commented May 8, 2019

I am reopening for feedback about the above

@rxaviers rxaviers reopened this May 8, 2019
@zbraniecki
Member

> how would a user parse non "latin" numerals

Why would we need to? If you need an Eastern Arabic to Western Arabic numeral parser, then you should use a library for that, but I struggle to see it as a common use case. And if you need it, then you likely need other numeral systems as well.

@zbraniecki
Member

My other concern is that a formatter can be lenient in its output and fall back on other numeral systems, but a parser can't. You can't rely on a parser that may lack data for some numeral system.

@sffc
Contributor

sffc commented May 8, 2019

> how would a user parse non "latin" numerals

I could see us providing an API to expose character properties, exposing a subset of uchar.h. There are methods like u_isdigit and u_digit. That would be a different issue, though.

@rxaviers
Member

rxaviers commented May 9, 2019

At PayPal, there are cases where Japanese users enter numerals using fullwidth characters (which caused bugs in some products). Product developers weren't even aware of the regional differences in numerals. The goal (in such cases) was simply to parse user-entered numerals.

Let me repeat what you're suggesting to make sure I understood it right. Product developers should handle the numerical mapping themselves (using a specific library for that). I can picture if-elses in that code to handle user-entered numerals. That should be preferred over simply relying on Intl, an internationalization library whose purpose is to abstract away regional differences in the implementation.

Is that right?

@rxaviers
Member

rxaviers commented May 9, 2019

My impression was that a parser method would just expose whatever is already present in the engine to handle localization aspects of https://www.w3.org/TR/html/sec-forms.html#number-state-typenumber

@sffc
Contributor

sffc commented May 9, 2019

> I can picture if-elses in that code to handle user-entered numerals.

Number parsing (and date parsing) requires heuristics. UTS 35 does not have a well-defined algorithm for parsing numbers. Given that situation, it seems safer to put number parsing heuristics in user land. The alternative would be to essentially rely on "if-elses" in the ICU library, which is undesirable because (1) it is not well-specified and (2) the heuristics can change from release to release.

@sffc
Contributor

sffc commented Sep 29, 2019

Unicode properties (e.g., whether a character is a digit) are being discussed in #90. This should expose data about Arabic numerals so that a parser can be written in user land.

Closing the issue again because it was re-opened a few posts earlier with a question specifically about Arabic numerals.
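
For readers looking for the user-land route mentioned here, a minimal sketch of digit normalization follows. It relies on the `\p{Nd}` Unicode property escape available in JavaScript regular expressions with the `u` flag. The `ZERO_CODE_POINTS` table and the `toAsciiDigits` name are assumptions made for this example, and the table covers only a small subset of numbering systems.

```javascript
// Code points of the digit zero in a few decimal numbering systems.
// This is a small illustrative subset, not a complete table.
const ZERO_CODE_POINTS = [
  0x0660, // Arabic-Indic
  0x06f0, // Extended Arabic-Indic
  0x0966, // Devanagari
  0xff10, // Fullwidth
];

// Replace digits from the systems above with ASCII digits; leave everything else alone.
function toAsciiDigits(text) {
  return text.replace(/\p{Nd}/gu, (char) => {
    const codePoint = char.codePointAt(0);
    const zero = ZERO_CODE_POINTS.find((z) => codePoint >= z && codePoint <= z + 9);
    return zero === undefined ? char : String(codePoint - zero);
  });
}

console.log(toAsciiDigits('٠١٢٣٤٥٦٧٨٩'));
console.log(toAsciiDigits('２０２１'));
```

This handles digits only. Decimal and grouping separators, signs, and bidirectional marks are separate problems, which is where the heuristics discussed above come in.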

@pixelbandito

pixelbandito commented Mar 1, 2021

I have some counterpoints and questions.
Please take these as a good-faith attempt to solve problems, I'd be more than happy to have guidance on better approaches.

Our use case:
Users select a language and region in the webapp - we don't use the browser setting. It's not ideal, but something we can't get away from easily.

We use native JS Intl functionality to display dates with the user's configured language/region. That works well.
Our text inputs for dates always appear alongside a calendar picker. They're a nice user-experience addition that allows for quick copy/paste and date entry.

We'd be completely willing to require strict-ish string formats on text inputs for dates, but we'd still need to handle cases where 1 March 2021 is represented differently, e.g. France-French "01/03/2021" vs. US English "3/1/2021".

Counterpoints (reasons we think a parser would be invaluable):

  • We'd rather not include a third-party date parsing library.
  • We don't want to re-invent the wheel by maintaining our own list of which region/language combos put their months, days, and years in different orders.
  • Ideally, what the user sees (formatted) would align with what the user can enter into a text field (to be parsed), which couples the formatter to any parser we implement.
  • HTML's input type="date" would be a reasonable choice, but it's still not supported across some major browsers.
  • As far as I understand, we can't tell an HTML date input to use a locale and language other than the browser setting. (Please tell me I'm wrong!)

Handling of messy user input:
We could handle some common formatting issues, like trimming and collapsing whitespace, before running date strings through the parser. We could even split on non-numeric characters and normalize the separators in case a user entered a separator other than "/".
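
A rough user-land sketch of this idea, without a hand-maintained table of field orders: ask the formatter for the order a locale uses, then split the input on any non-digit separator. `numericFieldOrder` and `parseNumericInput` are hypothetical helper names, and the sketch assumes Latin digits, the Gregorian calendar, and four-digit years.

```javascript
// Ask the formatter which order a locale uses for numeric dates.
function numericFieldOrder(locale) {
  return new Intl.DateTimeFormat(locale, { year: 'numeric', month: 'numeric', day: 'numeric' })
    .formatToParts(new Date(Date.UTC(2001, 1, 3)))
    .filter((part) => part.type !== 'literal')
    .map((part) => part.type);
}

// Tolerate stray whitespace and any non-digit separator, then validate the result.
function parseNumericInput(text, locale) {
  const numbers = text.trim().split(/\D+/).filter(Boolean).map(Number);
  if (numbers.length !== 3) return null;
  const fields = {};
  numericFieldOrder(locale).forEach((type, index) => {
    fields[type] = numbers[index];
  });
  const { year, month, day } = fields;
  const date = new Date(Date.UTC(year, month - 1, day));
  // Reject overflow such as month 13 or day 31 in a 30-day month.
  const valid =
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day;
  return valid ? date : null;
}

const fromFrench = parseNumericInput(' 01/03/2021 ', 'fr-FR');
const fromUS = parseNumericInput('3-1-2021', 'en-US');
console.log(fromFrench.getTime() === fromUS.getTime());
```

The field order still comes from the engine's locale data, so this inherits the data-dependence concerns raised elsewhere in the thread.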

Questions:
Given that, I don't understand why it's such a bad idea to provide a parser that's the inverse of the current formatter.
Is it about a strict parser being a bad idea for UX reasons?
Is it about a non-strict parser being impossible to implement effectively?
Is it about tricky cases like numeric character variations that make any strategy difficult to implement and maintain?

Forgive me if I'm ranting, I'd really like to hear others' thoughts on this, and any workarounds other folks have come up with.

@zbraniecki
Member

> We'd rather not include a third-party date parsing library.

That seems like an a-priori preference that is not justified by anything in your comment. "We'd prefer not to include a userland library, so extend the standard" is not a strong position to take.

> HTML's input type="date" would be a reasonable choice, but it's still not supported across some major browsers.

That's solvable with time and issues filed against browsers. Extending a spec in such a massive way would take years and I don't think you'd find browser versions that support the new functionality but not input type="date".

> Given that, I don't understand why it's such a bad idea to provide a parser that's the inverse of the current formatter.

Because parsers are very, very hard and very flaky, and they add disproportionately high maintenance, compatibility, and security overhead for maintainers. In most cases that results in a small number of people happy about the solution, a long tail of people unhappy that their case is not supported, and accruing bugs and problems that are perceived as lowering the quality of the standard library.

> Forgive me if I'm ranting, I'd really like to hear others' thoughts on this, and any workarounds other folks have come up with.

Your comment is well phrased and expresses a genuine intent for your needs. I don't think there's anything wrong with that, and I appreciate your explicit description of intent and your care not to come across as righteous :)
It's a subtle and complicated space, and a date/time parser is a great example of something that intuitively feels like it should be relatively easy, but once you start digging into it you realize that it's an iceberg of problems for the standard library maintainers, with many "tips" of that iceberg, and for everyone "the right thing" is a different thing.

Finally, since data changes, if we support any internationalized date parsing, we will, by necessity, create a situation where input to your website works one day, but the same input to the same website breaks after a future update of the same browser because the data has been updated and the parsing patterns have changed.
This is super hairy and very, very risky. We'd need to work a lot on maximizing the odds that web authors know how to work with that future incompatibility risk, and that's on top of all the other issues I listed.
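
One way to see how data-dependent formatted output is: inspect the literal separators a formatter emits, as code points. As an illustration, and stated here as background rather than something from this thread, the whitespace before AM/PM in English time formats has differed between locale-data releases (a regular space versus U+202F NARROW NO-BREAK SPACE). A minimal sketch with an arbitrary locale and sample time:

```javascript
// Inspect the literal separators a formatter emits, as code points.
// Which whitespace character appears depends on the engine's locale data version.
const timeFormatter = new Intl.DateTimeFormat('en-US', {
  hour: 'numeric',
  minute: '2-digit',
  timeZone: 'UTC',
});
const sampleTime = new Date(Date.UTC(2021, 2, 1, 15, 4));

const literalCodePoints = timeFormatter
  .formatToParts(sampleTime)
  .filter((part) => part.type === 'literal')
  .map((part) =>
    [...part.value].map(
      (ch) => 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'),
    ),
  );

console.log(literalCodePoints);

// An exact string comparison is therefore fragile; normalizing whitespace first is more tolerant.
const normalizedTime = timeFormatter.format(sampleTime).replace(/\s/g, ' ');
console.log(normalizedTime);
```

A parser that matches formatter output exactly would have to track every such change, which is the compatibility risk described above.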

I hope my message conveys that this is "orders of magnitude more complicated than it looks on the surface" and would likely sink more resources and brain time from our group than everything else we work on combined, while still producing something that wouldn't satisfy the majority of people who would like to see such an API.

I believe a userland library is a great solution. And if one becomes dominant and gets years of in-field experience, we could revisit this topic. But I think there's a reason that hasn't happened yet.

@pixelbandito

@zbraniecki Thank you for your response!

> That's solvable with time and issues filed against browsers. Extending a spec in such a massive way would take years and I don't think you'd find browser versions that support the new functionality but not input type="date".

That's a really good point, and partially changes my view.

@sffc
Contributor

sffc commented Mar 2, 2021

+1 on everything @zbraniecki said. Also, see my blog post on the subject:

https://blog.sffc.xyz/post/190943794505/why-you-should-not-parse-localized-strings

@tounsoo

tounsoo commented Jan 11, 2023

I'd really like this feature. We have a date input that is used internationally, and we currently rely on a third-party library. I would love to see it in Intl.

@ryzokuken
Member

@tounsoo please read the backlog. Parsing is out of scope for ECMA-402 and won't be included. A third-party library is indeed the right way to go.

gibson042 added a commit to gibson042/ecma402 that referenced this issue Mar 6, 2024
tc39gh-424 is more comprehensive than tc39gh-342, which is specific to DateTimeFormat
ryzokuken pushed a commit that referenced this issue Mar 7, 2024
gh-424 is more comprehensive than gh-342, which is specific to DateTimeFormat