Message formatting dependant on base text direction (Bidi support) #539

Open
ashensis opened this issue Oct 18, 2015 · 28 comments

Comments

@ashensis

Message formatting poses quite a challenge from a Bidi perspective.
The primary consideration is the text direction in which the message is supposed to be displayed. (In the DOM, an HTML element's text direction is controlled either by the 'dir' attribute or by the 'direction' CSS property.)

For a detailed explanation of what the term base text direction means, please see, for instance, pages 5-9 of the following reference document.
https://docs.google.com/document/d/1dDrSwimrQbpbXybhMYDEiJXLeeqvelnY4cc8HeCP1Nk/edit?usp=sharing

  1. Let us consider for illustration the parametrized message:
    Globalize.loadMessages({he: {
    error: "THE FILE {0} HASN'T BEEN FOUND",
    }});
    where upper case stands for Arabic/Hebrew.
    Now if the file path that substitutes parameter {0} is in English (like c:\file.txt) and the whole formatted message is supposed to be displayed in left-to-right text direction, the display will be senseless and the reading unintelligible.
    Globalize("he").messageFormatter("error")("c:\\file.txt");
  2. In addition to the aforementioned use case, there is another typical usage of message formatting (let us call it the breadcrumb pattern) that may constitute a problem from a base text direction perspective, although it may be viewed as a particular sub-case of the parametric message.
    Let us consider the following message (as it is expected to be displayed):
    Globalize.loadMessages({he: {
    breadcrumb: "{0} >> {1} >> {2}",
    }});

Even if the formatted message with substitutions is supposed to be displayed in left-to-right text direction, the outcome will be rather unpredictable, depending on the substituted text, as follows (upper case stands for Arabic/Hebrew characters, lower case for English):

Globalize("he").messageFormatter("breadcrumb")([ "first", "second", "third" ])
will produce on left-to-right display: "first >> second >> third"

Globalize("he").messageFormatter("breadcrumb")([ "FIRST", "SECOND", "THIRD" ])
will produce on left-to-right display: "THIRD << SECOND << FIRST"

Globalize("he").messageFormatter("breadcrumb")([ "first", "SECOND", "THIRD" ])
will produce on left-to-right display: "first >> SECOND << THIRD"
(please note the order of substitutions on display as well as the erratic directionality of << and >>)
In a case like 'breadcrumb', proper segment isolation is required, which can only be achieved by using Unicode control characters (UCC).
UCC are also to be used to cope with the first kind of problem described above.

The proposal is to augment the 'messageFormatter' (and 'formatMessage') API with an additional parameter, the base text direction, and to insert the appropriate UCC into the resulting formatted string in order to resolve the 2 kinds of problems mentioned above.
The base text direction parameter (either 'ltr' or 'rtl') will match the text direction in which the returned string is supposed to be displayed.
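
As a rough illustration of the proposed parameter, here is a minimal sketch that only covers the first kind of problem (the overall direction of the formatted message). `applyBaseDirection` is a hypothetical helper, not an existing Globalize API, and wrapping the result in an embedding (LRE/RLE ... PDF) is an assumption about how the UCC would be inserted.

```javascript
// Unicode control characters (UCC) used for directional embedding.
const LRE = "\u202A"; // LEFT-TO-RIGHT EMBEDDING
const RLE = "\u202B"; // RIGHT-TO-LEFT EMBEDDING
const PDF = "\u202C"; // POP DIRECTIONAL FORMATTING

// Hypothetical helper (not part of Globalize): wraps an already formatted
// message so that it is laid out in the requested base text direction.
function applyBaseDirection(formatted, baseDir) {
  if (baseDir !== "ltr" && baseDir !== "rtl") {
    throw new RangeError("baseDir must be 'ltr' or 'rtl'");
  }
  return (baseDir === "rtl" ? RLE : LRE) + formatted + PDF;
}
```

The string returned by the existing formatter would be passed through such a helper together with the direction of the element that displays it.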

@rxaviers
Member

Hi @ashensis, thanks for opening this issue.

Unicode has explicit direction marks, such as U+200E LEFT-TO-RIGHT MARK or U+200F RIGHT-TO-LEFT MARK (more info at http://www.unicode.org/reports/tr35/#Text_Directionality). CLDR data for date or number formats already uses these characters for setting the correct text direction. So, I suggest using the same approach in the messages.

Please let me know if you have any questions or if you believe there's any case for which this approach isn't sufficient. Thanks

@tomerm tomerm mentioned this issue Oct 19, 2015
2 tasks
@ashensis
Author

Yes Rafael.
This has been my intention, however there is more to it than just using directional UCC marks.
From our experience with Bidi support for various IBM and other products, the cases mentioned in this feature make it necessary to use UCC separator marks as well (200E,200F) for proper isolation of pertinent segments of structured text. So I used rather general term UCC marks.

@rxaviers
Member

> This has been my intention, however there is more to it than just using directional UCC marks.

Ok, so I'm reopening this issue.

@rxaviers rxaviers reopened this Oct 19, 2015
@tomerm

tomerm commented Oct 19, 2015

Just a couple of words on correlation between this ticket and #423.
Taking the same example from original description: "{0} >> {1} >> {2}"
For English we would like to have: a >> b >> c
While for Arabic we would like to have: c << b << a

For this to happen we need to know the current language of the UI (English vs. Arabic) and, based on that, decide which UCC to use. For example:
If RTL = "LTR" // namely GUI is in English or any other LTR language
[LRE] + {0} + [LRM] + >> + [LRM] + {1} + [LRM] + >> + [LRM] + {2} + [PDF]
if RTL = "RTL" // namely GUI is in Arabic or any other RTL language
[RLE] + {0} + [RLM] + >> + [RLM] + {1} + [RLM] + >> + [RLM] + {2} + [PDF]
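
The two patterns above can be sketched as a small function. `breadcrumbWithUcc` and its `guiDir` parameter are hypothetical names used only for illustration; in the proposal the direction would be derived from the UI language.

```javascript
const LRM = "\u200E"; // LEFT-TO-RIGHT MARK
const RLM = "\u200F"; // RIGHT-TO-LEFT MARK
const LRE = "\u202A"; // LEFT-TO-RIGHT EMBEDDING
const RLE = "\u202B"; // RIGHT-TO-LEFT EMBEDDING
const PDF = "\u202C"; // POP DIRECTIONAL FORMATTING

// Hypothetical helper: joins breadcrumb parts with " >> " and puts a
// directional mark on each side of the separator, so that the parts flow
// in the GUI direction. The whole string is enclosed in an embedding.
function breadcrumbWithUcc(parts, guiDir) {
  const rtl = guiDir === "rtl";
  const open = rtl ? RLE : LRE;
  const mark = rtl ? RLM : LRM;
  return open + parts.join(mark + " >> " + mark) + PDF;
}
```

For three parts and `guiDir = "ltr"` this yields the same character sequence as the first pattern: LRE, part, LRM, " >> ", LRM, part, LRM, " >> ", LRM, part, PDF.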

@tomerm

tomerm commented Oct 19, 2015

@rxaviers I must apologize for my ignorance. We would like to move from the use case / problem discussion to the implementation discussion, namely to submit a patch and discuss the implementation details. According to my understanding, for this to happen the discussion in this specific ticket (which is considered to be a use case / problem discussion) must be concluded. What is the indication that the discussion is over? Should any of the fields below (Milestone, Assignee) get a particular value? I appreciate your clarifications in advance.

@rxaviers
Member

I'm a newbie on this topic, so forgive me if my question is dumb. But it's still not clear to me why MessageFormatter has to deal with the UCC marks instead of relying on the user to insert them appropriately.

Note user has full control of the messages content and could use something like the below, right?

Globalize.loadMessages({
    en: {
        breadcrumb: "[LRE]{0}[LRM] >> [LRM]{1}[LRM] >> [LRM]{2}[PDF]"
    },
    he: {
        breadcrumb: "[RLE]{0}[RLM] >> [RLM]{1}[RLM] >> [RLM]{2}[PDF]"
    }
})
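
The `[LRE]`, `[LRM]` etc. tokens above are notation only; the actual messages would contain the corresponding Unicode characters. A minimal sketch of that substitution follows, where `expandUccNotation` is a hypothetical helper and not a Globalize function.

```javascript
// Code points of the control characters referred to in this thread.
const UCC = {
  LRM: "\u200E", // LEFT-TO-RIGHT MARK
  RLM: "\u200F", // RIGHT-TO-LEFT MARK
  LRE: "\u202A", // LEFT-TO-RIGHT EMBEDDING
  RLE: "\u202B", // RIGHT-TO-LEFT EMBEDDING
  PDF: "\u202C", // POP DIRECTIONAL FORMATTING
};

// Hypothetical helper: replaces each bracketed name with the real character.
function expandUccNotation(message) {
  return message.replace(/\[(LRM|RLM|LRE|RLE|PDF)\]/g, (match, name) => UCC[name]);
}

const messages = {
  en: { breadcrumb: expandUccNotation("[LRE]{0}[LRM] >> [LRM]{1}[LRM] >> [LRM]{2}[PDF]") },
  he: { breadcrumb: expandUccNotation("[RLE]{0}[RLM] >> [RLM]{1}[RLM] >> [RLM]{2}[PDF]") },
};
```

The resulting `messages` object has the same shape as the argument of `Globalize.loadMessages` shown above.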

@tomerm

tomerm commented Oct 21, 2015

If we continue this line of thought, we can say that the user can rely on just plain HTML 5 / JS and develop their own web app without using any JS toolkit like JQuery, Dojo, Aurelia and many others. I believe we want to lift from the user as much of the burden associated with G11N (globalization) support as possible. Usually a regular web app has hundreds if not thousands of messages. Even if only 50 % of those messages require additional formatting work, expecting the programmer to deal with it is obviously not very efficient. It is much more effective to have a very simple piece of code which handles it automatically across all languages.
By the way, doing it as part of MessageFormatter is one of the options (I would say one of the most natural ones). This functionality can also be provided as a standalone function and used by programmers when needed (sometimes in conjunction with MessageFormatter but not necessarily).
Please observe that there are cases in which text is being concatenated (in a very similar way to breadcrumb) but not necessarily as part of any particular message. Namely, the content of a label shown in a web app is constructed programmatically from different pieces of text.
label.value = "some text" + "-" + "some other text"
This is yet another use case in which the suggested solution can help.

@rxaviers
Member

I've interpreted the answer as "yes, it's possible but not convenient". Don't take this the wrong way, I'm trying to understand the problem.

@rxaviers
Member

Technically, Globalize messageFormat support is inherited from SlexAxton/messageformat.js. So, cc'ing @SlexAxton, @eemeli for their input.

PS: Alex, Eemeli, RTL information should be available on JSON CLDR when http://unicode.org/cldr/trac/ticket/9038 is addressed.

@rxaviers
Member

Note to self: possibly related tc39/ecma402#28.

@eemeli

eemeli commented Oct 21, 2015

Just read through the discussion above & #423. Do I understand right that the desired state would be for a MessageFormat string to be able to have its source as well as its component fields define their variable LTR/RTL-ness, and for this information to be encoded as Unicode control characters in the resulting output string? If that is the case, it sounds like a pre-processing stage to just take those strings and their directionality flags, and add the required control characters before passing the string on to the actual formatter. Or am I missing something?

It would help if an example could be provided with actual LTR & RTL characters, showing what the output currently is, and what it ought to be.

@tomerm

tomerm commented Oct 22, 2015

We are definitely talking about a kind of pre-processing or combined processing, as a result of which not only are the placeholders (i.e. {0}) in the message template (i.e. "{0} - {1} : {2}") replaced by text, but the entire layout of the message is also preserved and ready for display. The major assumption behind the UCC-based solution is that rendering is done by a UBA (Unicode Bidi Algorithm) compliant engine with full support for UCC. This is a safe assumption for HTML / JS technology and modern browsers.

"define their variable LTR/RTL-ness" - assuming this relates to entire message layout flow (which is derived from current UI language), this information will be automatically retrieved from CLDR (specifically RTL property discussed in #423 ).

" showing what the output currently is, and what it ought to be" - are you referring to text content or the way it is displayed on the screen ? For text content the example above (i.e. "[LRE]{0}[LRM] >> [LRM]{1}[LRM] >> [LRM]{2}[PDF]") should be good enough (just replace all [..] with appropriate Unicode codes). If you refer to display I will have to provide images (or MS Word file with embedded images) to assure there is no interference with any additional rendering engine rule.

@tomerm

tomerm commented Oct 22, 2015

[image: formattedmessagewithucc]

@eemeli

eemeli commented Oct 22, 2015

Ok, I think I get it. But I also think that using LRM/RLM is in this case going to prove more complex and more incorrect than using LRE/RLE+PDF, once you start to consider neutral characters at the beginning or end of a variable value.

Consider, for instance, your MessageFormat string {0} - {1} : {2} when used in an LTR context, with inputs [ 'ALEPH 1', 'BETH 2', 'gamma 3' ] each represented here in logical order but with the first two in an RTL script. The desired eventual visualization for this string would be:

1 HPELA - 2 HTEB : gamma 3

Assuming I've understood the Unicode logic right, this is what the string would currently end up looking like, without any inserted control codes (please, do correct me if I'm mistaken):

HTEB - 1 HPELA 2 : gamma 3

If we insert LRM marks as suggested above, we get the logical string [LRE]ALEPH 1[LRM] - [LRM]BETH 2[LRM] : [LRM]gamma 3[PDF], which will result in this different, but still incorrect visualization:

HPELA 1 - HTEB 2 : gamma 3

What I think we'd really like to use is the Unicode 6.3 FSI/PDI pair around each input string, a <bdi> tag, or a <span dir="auto">, but none of those options are supported by all the browsers.

So in order to actually get a correct implementation out of this, I think we currently would need to get info about the directionality of each string value, and use that to construct something like [RLE]ALEPH 1[PDF] - [RLE]BETH 2[PDF] : gamma 3, which would get us the desired representation:

1 HPELA - 2 HTEB : gamma 3

To get to that, we'd need that info from the user via a new parameter for each input variable, when that variable's directionality differs from that of the surrounding text. Any custom JS directionality detector won't always work, given the possibility of input strings with a different directionality than the surrounding context but consisting only of weak characters.
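
To illustrate the limitation just mentioned, here is a rough sketch of a first-strong directionality detector. `firstStrongDirection` is a hypothetical helper, and treating only Hebrew and Arabic letters as RTL is a simplifying assumption; a real detector would consult the full Unicode Bidi_Class data.

```javascript
// Simplified character classes (assumption: only Hebrew and Arabic letters
// are treated as strong RTL, every other letter as strong LTR).
const LETTER = /^\p{L}$/u;
const RTL_LETTER = /^[\p{Script=Hebrew}\p{Script=Arabic}]$/u;

// Returns "rtl" or "ltr" based on the first letter found, or null when the
// string contains no letters at all (digits, spaces, punctuation only).
function firstStrongDirection(text) {
  for (const ch of text) {
    if (!LETTER.test(ch)) continue; // skip characters without strong direction
    return RTL_LETTER.test(ch) ? "rtl" : "ltr";
  }
  return null;
}
```

For an input such as "(123) - 45" the function finds no strong character and returns null, so the caller still has to be told which direction the value is meant to have.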

Are there really other options than this for the character-level implementation?

@tomerm

tomerm commented Oct 22, 2015

If the message can include HTML markup, obviously we can use the markup instead of UCC.
You are correct that FSI/PDI or dir=auto are not supported by all browsers. However, for strict LTR/RTL directions we can use an inline block element and set its direction to LTR/RTL. Bottom line, there is equivalent markup which we can use to achieve the exact same display.

The expected display is correct. However, using RLE - PDF alone won't have any effect. LRE - PDF and RLE - PDF enforce the LTR or RTL direction of the text enclosed between those markers. However, they don't enforce the relative order between tokens. This is what is achieved with LRM / RLM. For example:

For LTR flow of tokens: LRE token1 LRM token2 LRM .... tokenX PDF

For RTL flow of tokens: RLE token1 RLM token2 RLM ... tokenX PDF

For each token we can decide on specific direction based on parameter passed into function:

For LTR direction: token = LRE + token + PDF
For RTL direction: token = RLE + token + PDF

I believe we have started to discuss the implementation options. Does it mean that this ticket can be closed and we can submit a draft implementation via a PR? We can provide both UCC and markup based solutions.

@tomerm

tomerm commented Oct 22, 2015

[image: messageformatting]

@tomerm

tomerm commented Oct 26, 2015

@rxaviers & @eemeli any verdict / comments ?

@rxaviers
Member

I will be able to dig into it in a week from now...

@eemeli

eemeli commented Oct 29, 2015

Sorry @tomerm, I got sidetracked by other work. You're right about the control codes; my wrong interpretation was based on this example, when connected with this correspondence. Am I misunderstanding something, or is there something wrong in one of the linked W3C explanations?

Tbh, I've not dealt with any mixed ltr/rtl cases before this, so this is a learning experience for me. I'll need to return to this a bit later.

@tomerm

tomerm commented Oct 29, 2015

W3C articles authored by Richard Ishida and reviewed by the Bidi GCoC from IBM (I am the head of this organization inside IBM) are a very good source of information on the Bidi subject (especially if you are new to it). However, they don't explain how to handle every Bidi related issue. Specifically, in many of the use cases they provide (relevant to the discussion in this ticket), the assumption is that the top level direction (mostly on the sentence / paragraph level) is LTR and that this text (sentence / paragraph) is embedded into an environment with a predominantly LTR flow direction of the UI (think of dir=ltr on the top level of the HTML page). This is a valid scenario but it is obviously not a general one. Moreover, there are differences between browsers, which are better covered not via if ... else ... switch but by a more commonly supported approach. Anyway, we are talking about the difference between the following 2 solutions:
<span dir="rtl">EGYPT </span>, <span dir="rtl">BAHRAIN </span>
<span dir="rtl">EGYPT </span> &lrm;, <span dir="rtl">BAHRAIN </span>
For this specific text pattern discussed in W3C, both will work.
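
For the markup-based variant, a minimal sketch of how the second form could be generated is shown below. `markupList` and `escapeHtml` are hypothetical helpers, and using the same direction for every item is a simplification made for this example.

```javascript
// Escapes the characters that are significant in HTML text and attributes.
function escapeHtml(text) {
  const entities = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;" };
  return text.replace(/[&<>"]/g, (c) => entities[c]);
}

// Hypothetical helper: wraps every item in a span with an explicit dir and
// puts a directional mark entity before each comma, so that the commas
// follow the base direction of the surrounding text.
function markupList(items, itemDir, baseDir) {
  const mark = baseDir === "rtl" ? "&rlm;" : "&lrm;";
  return items
    .map((item) => `<span dir="${itemDir}">${escapeHtml(item)}</span>`)
    .join(`${mark}, `);
}
```

With the items "EGYPT" and "BAHRAIN", an item direction of "rtl" and a base direction of "ltr", the output has the same structure as the second solution above.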

@rxaviers
Member

rxaviers commented Nov 4, 2015

Hi @tomerm,

Technically, your PR would have to be made against https://github.com/SlexAxton/messageformat.js/, which is ultimately the realm of @eemeli and @SlexAxton. Given Globalize imports such code, I'm obviously interested that your issue is sorted out.

Like I said I'm a newbie on this topic and I am still trying to understand the motivation. Your full explanations and links obviously helped, but is the following summary true?

If a document is written in English and the default rendering orientation is LTR (somewhat equivalent to <html dir="ltr">...</html>), Hebrew (or any other RTL script) phrases must use explicit UCC or HTML markup (<span dir="rtl">...</span>).

If a document is written in Hebrew, exactly the opposite of the above is required for English (or any other LTR script) phrases.

Today (using currently available code), it's possible to solve this problem by having the user deal with the UCC marks whenever scripts are mixed in messages. Although not convenient, I assume this happens in a limited number of well-defined places in an application. Please correct my assumption if I'm wrong.

The goal of the issue, by handling UCC at the library level, is that users are going to be able to mix scripts in their messages and these are always going to be displayed in the correct rendering orientation.

Is that correct, please? Thanks

@tomerm

tomerm commented Nov 4, 2015

Hi Rafael - @rxaviers ,

You are absolutely correct on the goal: users are going to be able to mix scripts in their messages and these are always going to be displayed in the correct rendering orientation.

You are also correct in saying that it is possible to solve the problem already today. However, this solution will need to be implemented by product developers since it won't be provided out of the box by JQuery. As a side comment, I think this assertion is mostly true for almost all functionality suggested in Globalize. With an appropriate amount of effort, each JQuery based product can address globalization issues externally to JQuery.

I appreciate very much your support and help :-))

@eemeli

eemeli commented Nov 4, 2015

Continuing slightly the character-level implementation part of this, is there a specific set of character codes that needs to be inserted before and/or after each part where the directionality changes? As in, if we're in an RTL context and want to include a variable with LTR content, what are the codes that need to go before and after the content? And do these codes depend only on the outer directionality, or will they produce bad output if e.g. we include RTL within RTL, rather than LTR within RTL?

If the answer to the latter question can be "no", we could add a flag to messageformat.js that would then always insert the specified codes. To start, that flag-setter would probably need as input the directionality of the script, but we should be able to eventually get that from the CLDR as well.

However, if we need to alter the codes depending on the directionality of each variable, it gets a bit trickier.

@tomerm

tomerm commented Nov 4, 2015

Allow me to use UCC like notation in this discussion since otherwise if I use markup the examples will be less and less readable. For example I will use
<LRM> <RLM> <RLE> etc. to refer to different UCC characters.

There are several parameters which should be taken into account. I will try to avoid unrealistic scenarios and will make the following assumption: for any given language, the message bundle(s) include messages formulated in only one specific language. In other words, having Japanese messages in a Russian message bundle is obviously technically possible but is not a supported use case. In this context I refer to a message as a sentence formulated in natural language (i.e. My name is {0}).

I won't presume that the environment in which the message is displayed has the same flow direction as would be natural for the language in which the message is specified. Namely, I do allow cases in which an Arabic message is displayed in a GUI with predominantly LTR flow. This is very applicable to "messages" which constitute a concatenation of data (rather than sentences formulated in natural language). For example: {0} - {1} : {2}. Obviously such messages can include data in any language, which is completely independent of the GUI language / flow.

Parameters we need to take into account:

  1. GUI language - it is usually a well defined parameter (mostly based on locale). We already know that there is a way (via CLDR) to automatically identify natural direction for any given language.
  2. Direction of text for each of the placeholders replaced by actual data at runtime

GUI language usually will define the overall flow of the message:
English, French, Spanish etc.: <LRE> .... <PDF>
Arabic, Hebrew, Farsi etc.: <RLE> .... <PDF>
We enclose entire message to assure its flow is completely independent / isolated from outside environment.

To assure that tokens which might be part of the message flow in the same direction as GUI direction we will enclose each placeholder as follows:
GUI language =[English, French, Spanish etc.]: <LRM>{0}<LRM>
GUI language =[Arabic, Hebrew, Farsi etc.]: <RLM>{0}<RLM>

To assure that the direction of text inside each token is according to what was specified by the caller, we will act similarly to the way we did on the level of the entire message:
GUI language =[English, French, Spanish etc.]:
for LTR resolved / actual direction: <LRM><LRE>{0}<PDF><LRM>
for RTL resolved / actual direction: <LRM><RLE>{0}<PDF><LRM>
GUI language =[Arabic, Hebrew, Farsi etc.]:
for LTR resolved / actual direction: <RLM><LRE>{0}<PDF><RLM>
for RTL resolved / actual direction: <RLM><RLE>{0}<PDF><RLM>

Q:...if we're in an RTL context and want to include a variable with LTR content, what are the codes that need to go before and after the content? ...
A: <RLM><LRE><PDF><RLM>

Q: And do these codes depend only on the outer directionality, or will they produce bad output if e.g. we include RTL within RTL, rather than LTR within RTL?
A: Obviously what is good for LTR content is not good for RTL content. Some codes (LRM / RLM) depend on the outer directionality while others (LRE / RLE) depend on the token directionality.

Q: ...To start, that flag-setter would probably need as input the directionality of the script, but we should be able to eventually get that from the CLDR as well....
A: Indeed. This is what we can programmatically get from CLDR based on the current locale. I called it "GUI language" above.

Q: ... if we need to alter the codes depending on the directionality of each variable, it gets a bit trickier....
A: It is a question of granularity of control. If the text direction for each placeholder can be explicitly specified, then it is pretty straightforward what to do (see my description above). For those placeholders for which the direction can't be specified, we can assume "auto" (aka contextual or first strong) as a default value. Please observe, by the way, that for some placeholders (i.e. numbers) there is no need to enforce any direction at all. So, I believe the possible values for the text direction of a placeholder can be: NONE (do nothing), LTR (left-to-right), RTL (right-to-left), AUTO (contextual).
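
The scheme described in this comment can be sketched as follows. All names (`formatBidiMessage`, `wrapValue`, `firstStrong`) are hypothetical, the first-strong check is a rough approximation that only treats Hebrew and Arabic letters as RTL, and leaving an AUTO value unwrapped when it has no strong character is an assumption not stated in the thread.

```javascript
const LRM = "\u200E", RLM = "\u200F";                 // directional marks
const LRE = "\u202A", RLE = "\u202B", PDF = "\u202C"; // embeddings and pop

// Rough first-strong check used to resolve AUTO (simplified assumption).
function firstStrong(text) {
  for (const ch of text) {
    if (/^\p{L}$/u.test(ch)) {
      return /^[\p{Script=Hebrew}\p{Script=Arabic}]$/u.test(ch) ? "RTL" : "LTR";
    }
  }
  return "NONE"; // no strong character found: leave the value untouched
}

// Encloses one substituted value: marks depend on the GUI direction,
// the embedding depends on the direction of the value itself.
function wrapValue(value, dir, guiDir) {
  const text = String(value);
  const resolved = dir === "AUTO" ? firstStrong(text) : dir;
  if (resolved === "NONE") return text;
  const mark = guiDir === "rtl" ? RLM : LRM;
  const embed = resolved === "RTL" ? RLE : LRE;
  return mark + embed + text + PDF + mark;
}

// Replaces {n} placeholders and encloses the whole message in an embedding
// that matches the GUI direction. Missing directions default to AUTO.
function formatBidiMessage(template, values, dirs = [], guiDir = "ltr") {
  const body = template.replace(/\{(\d+)\}/g, (match, index) => {
    const i = Number(index);
    if (!(i in values)) return match; // unknown placeholder stays as-is
    return wrapValue(values[i], dirs[i] || "AUTO", guiDir);
  });
  return (guiDir === "rtl" ? RLE : LRE) + body + PDF;
}
```

In a real integration the `guiDir` argument would come from the locale (the "GUI language" above) rather than from the caller.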

@tomerm

tomerm commented Nov 4, 2015

Sample call:
Globalize.loadMessages({he: {breadcrumb: "{0} >> {1} >> {2}"}});
Globalize("he").messageFormatter("breadcrumb")([ "first", "SECOND", "THIRD" ],["LTR", "RTL", "RTL"])

First array provides the data, second provides information which helps to format it.
If second array is NOT provided, we assume AUTO value for each one of the placeholders.

@eemeli

eemeli commented Nov 6, 2015

@tomerm, I think some of your "UCC like notation" got dropped from your message when you posted it. This is how it appears to me, in case this is a browser issue:

[image: screen shot 2015-11-06 at 15 12 37]

@tomerm

tomerm commented Nov 6, 2015

Apologies. Fixed the same comment above.

@ashensis
Author

ashensis commented Jan 3, 2016

As a first phase of contributing the corresponding support, I filed a new pull request under:
messageformat/messageformat#128
messageformat.js: Provide Bidi Text Direction and Structured Text support to MessageFormat #128

rxaviers pushed a commit that referenced this issue Jul 18, 2016
rxaviers added a commit that referenced this issue Jul 18, 2016
rxaviers added a commit that referenced this issue Jul 18, 2016
rxaviers added a commit that referenced this issue Jul 18, 2016
nkovacs pushed a commit to nkovacs/globalize that referenced this issue May 15, 2017
nkovacs pushed a commit to nkovacs/globalize that referenced this issue May 15, 2017