[Discussion] bidi control characters when formatting dates #28

caridy · 2015-09-12T00:09:05Z

Notes:

Edge currently includes "a bunch" of bidi control characters when formatting dates.
CLDR has the bidi data for structured text (STT is still at the proposal stage and needs more exploration).
CLDR has direction marks in the date patterns for locales that need it.
Spec says nothing about bidi (probably assumes the previous bullet).
Other browsers are not using bidi structured text for dates explicitly.

Problems:

Formatting dates as regular text does not preserve the structure or direction and as a result the text on the screen becomes incomprehensible.
Users in different geographies are accustomed to different rules for structured text display. (e.g.: Arabic vs. Hebrew).
Isomorphic/Universal apps that will format dates on the server using node/v8 will produce different results than Chakra/Edge. (e.g.: React checksum will fail, which implies a full re-rendering of the initial payload).
Chakra/Edge doing something different causes interop problems when users try to parse the localized output.

Proposals:

Spec the details about STT and direction marks and the use of bidi for dates, and align all implementations with Chakra/Edge.
Spec the optional use of STT and direction marks for dates, get Chakra/Edge to align, get others to add the new option.
Make localized output opaque by ignoring STT rules on dates, and get Chakra/Edge to drop the feature (add a strong statement about how the localized output should be considered opaque).
Spec the use of direction marks for dates, get Chakra/Edge to align (others are already using CLDR which includes the direction marks when needed).

Links:

STT (Structured Text): http://cldr.unicode.org/development/development-process/design-proposals/bidi-handling-of-structured-text

caridy · 2015-09-12T00:12:09Z

/cc @bterlson

srl295 · 2015-09-12T01:05:37Z

~~I don't think CLDR currently uses LRMs/RLMs in dates.~~ Edit: CLDR definitely uses RLM in dates.

STT is still in the proposal status.
Forwarding to CLDR TC..

srl295 · 2015-09-12T01:08:19Z

CLDR TR35 5.3.2 says:

The content of certain elements, such as date or number formats, may consist of several sub-elements with an inherent order (for example, the year, month, and day for dates). In some cases, the order of these sub-elements may be changed depending on the bidirectional context in which the element is embedded.

For example, short date formats in languages such as Arabic may contain neutral or weak characters at the beginning or end of the element content. In such a case, the overall order of the sub-elements may change depending on the surrounding text.

Element content whose display may be affected in this way should include an explicit direction mark, such as U+200E LEFT-TO-RIGHT MARK or U+200F RIGHT-TO-LEFT MARK, at the beginning or end of the element content, or both.

srl295 · 2015-09-12T01:25:09Z

other browsers..

new Date(0).toLocaleString("ar",{month:"numeric",day:"numeric",year:"numeric"}).indexOf('\u200f')
returns !== -1 (RLM present) for:

Chrome 45.0.2454.85
Firefox developer 42.0a2
node 3.3 (with full ICU data)

caridy · 2015-09-12T02:45:09Z

@srl295 yes, that's part of the date patterns from CLDR. I can confirm that we are getting the same results when using the Intl.js polyfill; here is the data with a bunch of \u200f in the patterns: https://raw.githubusercontent.com/andyearnshaw/Intl.js/master/locale-data/json/ar-001.json

The question is:

is this enough?

I wonder what Chakra/Edge is doing differently, since it doesn't use CLDR. @bterlson can you clarify?

srl295 · 2015-09-12T18:11:02Z

Not sure I follow - you said:

"Edge currently includes bidi control characters when formatting dates"

How is this != what CLDR/node/v8/ICU does? Microsoft is now part of CLDR, though (afaik) still deploying it. Your points about consistency are exactly why we started CLDR...

bterlson · 2015-09-12T19:55:46Z

We don't use CLDR for Intl in Edge, fwiw.

Here is one difference, though I'm not sure how to characterize it as I'm no expert :)

new Date(0).toLocaleString("ar",{month:"numeric",day:"numeric",year:"numeric"})
Edge: '\u200F\u0662\u0662\u200F\x2F\u200F\u0661\u0660\u200F\x2F\u200F\u0661\u0663\u0668\u0669'
Chrome: '\u0663\u0661\u200F\x2F\u0661\u0662\u200F\x2F\u0661\u0669\u0666\u0669'

So you're right that Chrome does add bidi control characters, which I didn't notice before, but they are in different locations. Also Chrome does not include them when formatting en dates as Edge does. Example:

new Date(0).toLocaleString("en",{month:"numeric",day:"numeric",year:"numeric"})
Edge: '\u200E12\u200E/\u200E31\u200E/\u200E1969'
Chrome: '12/31/1969'
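
For anyone who wants to reproduce comparisons like the ones above, here is a minimal sketch of a helper that prints every character outside printable ASCII as an escape, so invisible marks show up in a console. `escapeNonAscii` is a hypothetical name for illustration, not part of any Intl API, and what `toLocaleString` returns will depend on the engine and its locale data.

```javascript
// Hypothetical helper: make invisible characters visible by escaping
// everything outside printable ASCII (U+0020..U+007E).
function escapeNonAscii(text) {
  return Array.from(text, (ch) => {
    const cp = ch.codePointAt(0);
    if (cp >= 0x20 && cp <= 0x7e) return ch;
    const hex = cp.toString(16).toUpperCase().padStart(4, '0');
    // Code points beyond the BMP need the braced escape form.
    return cp <= 0xffff ? '\\u' + hex : '\\u{' + hex + '}';
  }).join('');
}

// Output differs between engines, so no expected value is shown here.
const formatted = new Date(0).toLocaleString('ar', {
  month: 'numeric', day: 'numeric', year: 'numeric',
});
console.log(escapeNonAscii(formatted));
```

Running this in two engines and diffing the escaped strings shows both which marks are present and where they sit.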

srl295 · 2015-09-14T19:58:52Z

I don't think the codes are "added", they are just part of the CLDR data.

Do you happen to be in touch with the Microsoft cldr people?

Thanks for putting the code points here, I will take a look a little bit later.

bterlson · 2015-09-14T20:01:56Z

Yeah I guess "added" wasn't the verb I wanted. "Included" more like. I believe you that they're part of the CLDR data.

I am in touch with the CLDR folks here. I can ask any questions we might have if they don't chime in themselves. Let me know!

caridy · 2015-09-14T20:38:40Z

Ok @bterlson let's try to gather all the info for next week, so we can discuss it in person, and try to get to a resolution. I will update the description of this issue now that we have more information.

srl295 · 2015-09-22T19:34:56Z

Edge: '‏٢٢‏/‏١٠‏/‏١٣٨٩'
Chrome: '٣١‏/١٢‏/١٩٦٩'

So they are different dates in your example. As to formatting codes, it may be excessive but not harmful.

I'm not sure why this is an ecma402 discussion actually. I'd rather leave LRM/RLM out of the ecma402 discussion. If it's just a matter of content consistency, as I mentioned that's the whole point of CLDR, it seems akin to discussing whether "modifier letter turned comma" or "apostrophe" or curly quote should be used in certain languages.

zbraniecki · 2015-09-23T00:19:28Z

I agree that it feels slightly out of scope for ecma402.

In our code, we wrap all variables in strings in FSI/PDI, but that's more of a mixed-content problem.
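
A rough sketch of what wrapping variables in FSI/PDI can look like. The `isolate` and `interpolate` names and the `{name}` placeholder syntax are assumptions made for illustration, not the actual code referred to above.

```javascript
const FSI = '\u2068'; // FIRST STRONG ISOLATE
const PDI = '\u2069'; // POP DIRECTIONAL ISOLATE

// Wrap a value so its direction is resolved from its own first strong
// character and does not affect, or get affected by, surrounding text.
function isolate(value) {
  return FSI + String(value) + PDI;
}

// Hypothetical interpolation: replace {name} placeholders with isolated
// values; unknown placeholders are left untouched.
function interpolate(template, values) {
  return template.replace(/\{(\w+)\}/g, (placeholder, name) =>
    Object.prototype.hasOwnProperty.call(values, name)
      ? isolate(values[name])
      : placeholder);
}

const date = new Date(0).toLocaleString('ar', {
  month: 'numeric', day: 'numeric', year: 'numeric',
});
console.log(interpolate('Last updated: {date}', { date }));
```

The isolation happens at the point where the value is embedded, so the formatter's output itself stays unchanged.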

bterlson · 2015-09-23T02:34:04Z

"As to formatting codes, it may be excessive but not harmful."

In theory, but in practice I have gotten numerous bug reports on Edge's behavior, as people expect to be able to parse some localized date in Chrome and have that same code work in Edge. This isn't too much of a stretch for people to make, because Intl lets them specify exactly what components they want in the date. Why wouldn't it be safe to parse?

I'm not saying this has to be fixed/unified. If it isn't then there should be a statement in the spec I can point to that explicitly says that not treating formatted dates as opaque is a very bad idea and not guaranteed to work.
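
As a diagnostic for bug reports of this kind, a small check can tell whether two outputs differ only by direction marks. This is a sketch under assumptions: the function names are hypothetical, and the set of marks (LRM, RLM, ALM) is chosen for illustration. It is meant for comparing outputs, not as a way to make localized strings safe to parse.

```javascript
// Assumption: only these direction marks are considered (LRM, RLM, ALM).
const DIRECTION_MARKS = /[\u200E\u200F\u061C]/g;

function stripDirectionMarks(text) {
  return text.replace(DIRECTION_MARKS, '');
}

// True when the strings are unequal, but become equal once the
// direction marks are removed from both.
function differOnlyByDirectionMarks(a, b) {
  return a !== b && stripDirectionMarks(a) === stripDirectionMarks(b);
}

// Outputs quoted earlier in this thread for the "en" locale.
const edge = '\u200E12\u200E/\u200E31\u200E/\u200E1969';
const chrome = '12/31/1969';
console.log(differOnlyByDirectionMarks(edge, chrome)); // true
```

The "ar" outputs quoted earlier would not pass this check, since they also differ in the digits themselves.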

shervinafshar · 2015-09-23T04:32:27Z

"Also Chrome does not include them when formatting en dates as Edge does."

Because these bidi marks are not needed for a date string requested for en. The needed RLMs are there when a date string for any RTL locale is requested. Since, according to UAX#9, a date string requested for en is not even considered bidirectional text, what is the rationale behind what Edge does here? Are these marks in the locale-specific data Edge consumes, or are they just added on the fly?

srl295 · 2015-09-23T04:34:46Z

I'll check, but it should state something to the effect that results depend on other data, user prefs, etc. Probably many users doing parsing really want some other issue in this repo fixed (filed or unfiled). I'd be surprised if LRMs/RLMs were implicated in most of the bug reports you see. Although maybe I shouldn't be surprised if even numeric dates are different due to LRMs. Parsing is a whole other issue in itself. I wouldn't expect users to type RLMs into an input field around date items, any more than I would expect them to type a THIN SPACE before a percent sign, or an NBSP, in the locales that expect such on format.

caridy · 2015-09-23T05:41:47Z

@bterlson has a theory that we should validate. Here is what we discussed: what happens when you have system preferences in ar, with a page in ar, and you render a date in en? Should the annotations be in place? What do FF and Chrome do today?

shervinafshar · 2015-09-23T14:52:03Z

Both Chrome and FF are implementing UBA. You can check the behavior in the bidi demo tool. Note that the date string remains as one single run of L2.

One might argue that European numbers are directionally weak and might end up being resolved according to directional context (W2 to AN) and therefore some bidi marks are required for their correct display. Trying that, it's observed that the date string still remains as a single run of L2 and displays just fine without any need for additional bidi marks.

I would be more than happy to discuss any edge cases folks have encountered before, but even if the aforementioned theory is validated for some edge case, adding invisible control marks to strings which are not requested for a bidi language is not a solution, as it introduces control characters where they are not supposed to appear. Libraries with more peculiar requirements to tailor the directional behaviour of strings in diverse directional contexts can implement means to pass the context if need be and adapt the generated strings accordingly, but I strongly agree with others who voiced their concern about whether this topic actually falls within the scope of the spec.

tomerm · 2015-10-22T08:43:39Z

Joining the party late. Let us be clear on the reason why UCC are injected into date / time patterns (e.g. 05 August 1934) in CLDR. They are injected to ensure a certain display (e.g. we don't want to see 05 1934 August). This means the assumption is that the rendering engine using those date / time patterns is UBA (Unicode Bidi Algorithm) compliant and fully supports UCC (Unicode control characters such as LRE, RLE, PDF, LRM etc...). This is unfortunately not true in all cases.
Thus I believe UCC should be injected not by the data provider (CLDR is a data provider having very little to do with the display of the data it provides), but rather by the code responsible for rendering the data (in many cases various Java or JS based toolkits).

The proposal to CLDR mentioned at the top of this thread was not meant to resolve the display problem. It was not about injecting UCC during formatting. It was about defining the rules for displaying text with inherent structure (a date / time stamp is just one of many cases). For example, we want breadcrumbs to flow from left to right (1 >> 2 >> 3) for an English / French / Russian ... UI, while for an Arabic / Hebrew / Urdu ... UI we want them to flow from right to left (3 <<< 2 <<< 1). Before we approach a solution to the problem, we would like to have a clear understanding of the expected display. The expected display for the same pattern may differ between cultures (think of mathematical formulas).

It is only because the standard UBA is not capable of automatically identifying structure and enforcing it for display purposes that such a proposal was created. Maybe in the distant future (or maybe not so distant) Siri, Cortana, Watson and similar technology will be able to cope with it. But at the moment it needs to be done manually. What I hope to achieve is some level of automation by:

Defining types of structured text patterns (which can be specified during authoring time - via markup for example). This can include such types as file path, bread crumbs, date / time stamp etc.
Defining a standard way of handling the display problem for structured text pattern and provide tools / API for concrete implementation.
More elaborate description / specification of this proposal is available from:
https://docs.google.com/document/d/1y9LhT7rbGGVHjh2uqTAYHzN5PfbAkPxO5sMJygOPc3I/edit#heading=h.b43z973xff51

caridy · 2015-11-06T00:55:26Z

Update: react-intl is reporting issues with the checksum due to the invisible characters used by IE11; @ericf has more information about this. We should try to reach an agreement at the next meeting (in two weeks).

bterlson · 2015-11-06T19:35:55Z

@caridy / @ericf, wouldn't they have similar problems with Chrome when the server and client locales are different and one is RTL and one is LTR?

ericf · 2015-11-06T20:03:05Z

@bterlson we resolve the user's locale on the server via a combination of HTTP content negotiation and the user's settings. We then render the React app on both the server and the client using this resolved locale value.

bterlson · 2015-11-06T20:05:28Z

@ericf alright makes sense, thanks for the clarification.

sffc · 2019-03-19T09:48:58Z

It looks like this discussion is resolved. Please reopen if necessary.

caridy added the question label Sep 12, 2015

caridy mentioned this issue Sep 12, 2015

resolution on the usage of bidi for dates andyearnshaw/Intl.js#134

Open

rxaviers mentioned this issue Oct 21, 2015

Message formatting dependant on base text direction (Bidi support) globalizejs/globalize#539

Open

This was referenced Nov 3, 2015

Extend Formatters to allow for token formatting #30

Closed

Date format has invisible characters in IE11 formatjs/formatjs#201

Closed

goyakin mentioned this issue Mar 29, 2016

Intl: new Date().toLocaleString('de') puts unicode (BiDi) markers around punctuation chakra-core/ChakraCore#599

Closed

jianchun mentioned this issue Apr 9, 2016

address a list of failing tests for chakra nodejs/node-chakracore#54

Merged

sffc closed this as completed Mar 19, 2019
