Support for BiDi in placeables #28

Closed
zbraniecki opened this issue Feb 2, 2020 · 54 comments · Fixed by #315
Labels
blocker-candidate (the submitter thinks this might be a blocker for the Technology Preview), requirements (issues related to the MF requirements list)

Comments

@zbraniecki
Member

Since placeables can be of mixed directionality, I'd like to suggest that Fluent's FSI/PDI insertion for string placeholders is added to requirements.

This allows a variable like userName to be inserted in a string with different directionality and inform the layout of the possible direction change.

W3C backlog: https://www.w3.org/International/articles/inline-bidi-markup/
Fluent wiki: https://github.com/projectfluent/fluent/wiki/BiDi-in-Fluent

@mihnita
Collaborator

mihnita commented Feb 15, 2020

+100

What I've seen being useful was marking a placeholder / area as RTL / LTR.
But that is not very often.

A lot more often was "smart detection", basically inspect the value of the parameter and "guess" what the best direction would be.
See for example: https://developer.android.com/reference/android/support/v4/text/BidiFormatter

The Android solution is not ideal: the developer has to explicitly "wrap" the parameter.
Pseudocode:
...loadString(id).format( bidiWrapper(userName))
I think it should be more like
...loadString(id).format(arg)

And the string would be something like ...{userName}...
It would be ALWAYS wrapped, by default, unless explicitly disabled:

  • ...{userName, bidi, rtl}...
  • ...{userName, bidi, ltr}...
  • ...{userName, bidi, none}...
  • ...{userName, bidi, auto}... => same as ...{userName}...

The wrapper is smart, does not add bidi control characters if not needed.
So you are not going to see Hello {LRM}John{PDF}! in an English string :-)

@nbouvrette
Collaborator

I don't know what you think but to me, this type of feature seems related to text transformation (see related thread).

If I'm not mistaken Fluent handles this with function-like wrappers. I'm trying to picture a scenario where you might want to capitalize and change text direction. If we could have a standard way to transform text, it could keep this simple:

# This software is made by {brand}
يتكون هذا البرنامج بواسطة {brand, transform, {rtl, titlecase}}

@zbraniecki
Member Author

If I'm not mistaken Fluent handles this with function-like wrappers.

No, Fluent implicitly wraps all placeables of type String in FSI/PDI to reset directionality.
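
A minimal sketch of that default, not Fluent's actual implementation: every string-typed argument is wrapped in FSI (U+2068) and PDI (U+2069) before substitution. The helper name `isolateStringArgs` and the simplified `{name}` placeholder syntax are made up for illustration.

```javascript
// FIRST STRONG ISOLATE and POP DIRECTIONAL ISOLATE.
const FSI = "\u2068";
const PDI = "\u2069";

// Hypothetical helper: substitutes {name} placeholders, isolating string values.
function isolateStringArgs(pattern, args) {
  return pattern.replace(/\{(\w+)\}/g, (match, name) => {
    if (!(name in args)) return match; // leave unknown placeholders untouched
    const value = args[name];
    // Only strings are wrapped here; other types are stringified as-is.
    return typeof value === "string" ? FSI + value + PDI : String(value);
  });
}
```

With this approach the caller never wraps anything, but the output always contains the two control characters around a string argument, even when the directions happen to match.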

@mihnita
Collaborator

mihnita commented Feb 17, 2020

+100 doing it by default.

Not 100% sure FSI/PDI is the right thing, I would have to spend some time experimenting.
But yes to do the right thing by default, with the ability to turn it off for false positives.

@aphillips
Member

+1 to providing this by default. Note that when direction metadata is available the FSI should be replaced with the appropriate base-direction isolating control.

Note too that @zbraniecki only mentions placeables of type string, but non-string placeables can have spillover effects (for example currency values).

@zbraniecki
Member Author

@mihnita yes! I imagine that we would be able to do something like:

bundle.formatPattern(pattern, {
  userName: FluentString(user.name, {dir: "rtl"}),
});

as an option, and then, if that's provided, we can specify the directionality if it differs from the direction of the translation. If it matches, then we can skip directionality signs. If it is unknown, then we can use FSI/PDI.

@aphillips yes! In particular, we can know whether the formatter provided its result in the same language as the translation, and wrap in marks or not accordingly. For the most common scenario, where the currency formatter provides formatted text in the same directionality as the translation, we could skip the marks, but if we had to fall back and the currency is in a different directionality, we would wrap.
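
The three cases (direction known and different, known and matching, unknown) can be sketched as a small decision function. `isolateByDir` is a hypothetical name, and its `valueDir` argument stands in for whatever metadata a wrapper such as `FluentString` might carry; none of this is an existing Fluent API.

```javascript
const LRI = "\u2066"; // LEFT-TO-RIGHT ISOLATE
const RLI = "\u2067"; // RIGHT-TO-LEFT ISOLATE
const FSI = "\u2068"; // FIRST STRONG ISOLATE
const PDI = "\u2069"; // POP DIRECTIONAL ISOLATE

// messageDir: "ltr" | "rtl"; valueDir: "ltr" | "rtl" | undefined (unknown).
function isolateByDir(value, valueDir, messageDir) {
  // Unknown direction: let the renderer apply first-strong detection.
  if (valueDir === undefined) return FSI + value + PDI;
  // Same direction as the message: no marks needed.
  if (valueDir === messageDir) return value;
  // Known and different: use the explicit isolate for the value's direction.
  return (valueDir === "rtl" ? RLI : LRI) + value + PDI;
}
```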

@aharon-lanin

A few comments:

  • FSI / PDI may not be supported in all platforms (browsers and operating systems). It seems to work on Chrome, Firefox, Android, and Linux. I have not tried it on Safari or iOS, and when I last tried it on Edge and Windows, it did not work. If it is not sufficiently supported, one needs to estimate directionality of the placeable in code, and then wrap the placeable with LRMs or RLMs on the outside (depending on message locale) and an LRE or RLE and a PDF on the inside (depending on placeable directionality), as BidiFormatter does, skipping as many of these as can be safely done.
  • For estimating directionality, I would recommend the first-strong algorithm, as used by FSI/PDI, for forward compatibility.
  • Being able to specify the directionality of the placeable is a great option. However, from experience, 95% of the time the caller has nowhere to get it.
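
A rough sketch of the fallback described in the first two bullets, for platforms without isolate support. It is not BidiFormatter's actual logic: the direction estimate only approximates first-strong by checking each letter's script (real first-strong uses the Bidi_Class property), and the only marks it skips are those for a value whose direction matches the message. The names `estimateFirstStrong` and `legacyWrap` are made up.

```javascript
const LRM = "\u200E"; // LEFT-TO-RIGHT MARK
const RLM = "\u200F"; // RIGHT-TO-LEFT MARK
const LRE = "\u202A"; // LEFT-TO-RIGHT EMBEDDING
const RLE = "\u202B"; // RIGHT-TO-LEFT EMBEDDING
const PDF = "\u202C"; // POP DIRECTIONAL FORMATTING

// Assumption: a short, incomplete list of right-to-left scripts.
const RTL_SCRIPT = /[\p{Script=Hebrew}\p{Script=Arabic}\p{Script=Syriac}\p{Script=Thaana}]/u;
const ANY_LETTER = /\p{L}/u;

// Returns "ltr", "rtl", or undefined when the text has no letters.
function estimateFirstStrong(text) {
  for (const ch of text) {
    if (!ANY_LETTER.test(ch)) continue; // digits, spaces, punctuation are skipped
    return RTL_SCRIPT.test(ch) ? "rtl" : "ltr";
  }
  return undefined;
}

// messageDir: "ltr" | "rtl", taken from the message locale.
function legacyWrap(value, messageDir) {
  const valueDir = estimateFirstStrong(value);
  if (valueDir === undefined || valueDir === messageDir) return value;
  const mark = messageDir === "rtl" ? RLM : LRM;  // outside: message direction
  const embed = valueDir === "rtl" ? RLE : LRE;   // inside: placeable direction
  return mark + embed + value + PDF + mark;
}
```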

@zbraniecki
Member Author

Safari and iOS support FSI/PDI. Edge supports it as well. Windows modern APIs do support it, win32 does not.

@rxaviers
Contributor

rxaviers commented Aug 17, 2020

mihnita added the requirements label Sep 24, 2020
@mihnita
Collaborator

mihnita commented Oct 26, 2020

We can also take a look at what Android does.

The BidiFormatter::unicodeWrap takes a TextDirectionHeuristic, with several supported out of the box: android.text.TextDirectionHeuristics

At runtime that method looks at the value of the parameter, and adds the proper BiDi control characters, a bit smarter than just FSI / PDI, or first strong, or any "fixed" approach.


Dart has two methods, one "wraps" using BiDi control characters, the other one uses HTML tags:
https://api.flutter.dev/flutter/intl/BidiFormatter-class.html


I'm not advocating for any of these "as is", just submitting them as "prior art" and source of inspiration.

But if we go in this direction then I would call the wrappers inside of MessageFormat, not force the developers to wrap parameters by explicitly calling these kinds of helper methods.

@zbraniecki
Member Author

Can we revisit it now?

It seems like we still didn't add it to the spec. I suggest that we by default wrap any placeable in FSI/PDI marks, just like Fluent does it, in line with W3C recommendation for placeables - https://www.w3.org/International/articles/inline-bidi-markup/

We can introduce evasion logic as an option that explicitly turns off FSI/PDI for a given message format, so that a caller can request a message formatted without any inserted FSI/PDI.

Finally, we could start building evading logic for scenarios where the directionality of the surrounding text and the placeable is known to match. For example number/date inserted in the same locale as a surrounding message does not need FSI/PDI. Similarly, a string inserted could be marked with explicit directionality:

let mf = new MessageFormat("en");
mf.format("Hello, { $user }", { user: MFString("John", { dir: "ltr" }) });

or as matching:

let mf = new MessageFormat("en");
mf.format("Hello, { $user }", { user: MFString("John", { dir: "matching" }) });

In the former case the algorithm will detect the directionality of "en" and, if the directionality of the MFString matches it, it'll evade FSI/PDI. In the latter it will evade it automatically.

@mihnita @stasm @eemeli

@zbraniecki
Member Author

I'd like to suggest making a decision on it very soon. In my experience a lot of API users are not familiar with the problem space of directionality and the body of code starts growing where people expect to be able to match the output to a particular string and are surprised when FSI/PDI shows up in the output.

With Fluent we had to do quite a bit of evangelism - it was always well received, but definitely a paper cut.

I'm concerned that if we wait too long the argument of "too late" will pop up.

@aphillips
Member

I tend to agree with @zbraniecki in general: to the degree possible this wants to be hidden in the "magick I18N stuff" and not be something regular developers have to think about all the time. Educating on bidi handling is hard and doesn't appear to add value until a company decides to do an RTL language.

However, I don't agree that inserting FSI/PDI is what W3C recommends. In markup contexts, we prefer that markup be used and include both language and direction metadata (i.e. both lang and dir attributes). We also prefer that the actual direction (e.g. LRI or RLI) be used whenever it is available. This both prevents spillover (due to isolation) and avoids problems with strings that have misleading strong directional characters at/near the start. We are spending significant effort in the W3C stack and possibly with ECMA-262 to try to get "localizable strings" to be first class citizens so that metadata can be scraped automagically for placeable values.

For formatted values (that is, where the placeable is a number, date, time, percent, currency value, usw that is generated by the message formatter) the base direction can be known from the locale. For unknown values (mainly strings), provision of metadata is required and FSI/PDI can be a fallback.

Note that some users may want to tailor the behavior because of their runtime environment, such as a few frameworks that don't yet support the isolating controls and show them as tofu. In these cases, RLM/LRM and embedding controls can be inserted as a shim. Others may want to turn off control generation because they are using a templating language or system that does the work for them.

@zbraniecki
Member Author

Ah, good point on lang+dir, rather than just dir.

I think you're bringing two separate dimensions, which I'd categorize as:

  1. What information we provide about placeables
  2. How we annotate

I'll use the following example: "On January 15th 2022 at 5:45pm, Addison added 5 photos" which in MF2 will look something like this:

let $dateTime = {$timestamp :datetime date=medium time=medium}
let $personName = {$person :person firstName=long}
let $count = {$photoCount :number}

match {$count}

when 1 {On {$dateTime}, {$personName} added { $count } photo.}
when * {On {$dateTime}, {$personName} added { $count } photos.}

There are three placeables in this message and we may know the locale of the message itself (or not - is it possible for the lang/dir of the message to be undetermined via new MessageFormat("und") ?).

If dateTime is resolved into the same dir/lang as surrounding message we don't want to annotate, but if the message is in arabic, but DateTimeFormat doesn't have arabic data and resolves to English, we should annotate at least with directionality:

On {\uLRI}January 15 2022 at 5:45pm{\uPDI}, Addison added 5 photos.

(we use LRI because we know that datetime is in English, and we either know that the whole message is in Arabic or it is unknown)

For the user name, we may have an API that informs in what lang/dir is the name provided and then compare it to the message lang/dir, or we may not know.
If we do, and it differs, we can do the same as with date - LRI/RLI and PDI to pop. If we don't we can use FSI/PDI. If it doesn't differ we don't inject any.

For $count we repeat the same logic as we did for datetime.

Now, as mentioned in my previous message, the tricky question is how the developer annotates the lang/dir of the variable. I suggested that MF2 provide typed variable types, much like Fluent does with FluentDateTime, FluentNumber, etc. This would allow for MF2String("Addison", {lang: "en"}) as optional (if omitted we'll use FSI/PDI).

Second question is how to control what we inject. My initial proposal is something like this:

let mf = new Intl.MessageFormat("en", {
  isolates: {
    lri: "\u2066", // LRI, or MF2MarkupElement("bdo", {dir: "ltr"})
    rli: "\u2067", // RLI, or MF2MarkupElement("bdo", {dir: "rtl"})
    fsi: "\u2068", // FSI, or MF2MarkupElement("bdo", {dir: "auto"})
    pdi: "\u2069", // PDI, or MF2MarkupElementClose("bdo")
  }
});

This way HTML bindings can provide MarkupElements for the same feature, and plain text can use the Unicode isolate characters. If LRI/RLI is set to null then FSI is used. If FSI/PDI is set to null, then nothing is ever injected.

This means that by default (if isolates is not explicitly provided) the API will inject unicode marks and frameworks can override them.
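
The null fallback rules can be sketched as follows. This is only an illustration under simplifying assumptions: the markers are plain strings (a markup binding is represented by tag strings rather than `MF2MarkupElement` objects), `applyIsolates` is a hypothetical helper, and the check for matching directions is left out.

```javascript
// Default plain-text markers; a binding could pass markup strings instead.
const DEFAULT_ISOLATES = {
  lri: "\u2066",
  rli: "\u2067",
  fsi: "\u2068",
  pdi: "\u2069",
};

// dir: "ltr" | "rtl" | undefined (unknown).
function applyIsolates(value, dir, isolates = DEFAULT_ISOLATES) {
  const { lri, rli, fsi, pdi } = isolates;
  // FSI/PDI set to null: nothing is ever injected.
  if (fsi == null || pdi == null) return value;
  // LRI/RLI set to null, or direction unknown: fall back to FSI.
  const open = dir === "ltr" ? lri : dir === "rtl" ? rli : null;
  return (open ?? fsi) + value + pdi;
}
```

An HTML binding could then call `applyIsolates(name, undefined, { lri: '<bdo dir="ltr">', rli: '<bdo dir="rtl">', fsi: '<bdo dir="auto">', pdi: "</bdo>" })` and get markup instead of control characters.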

Attributes

What this doesn't resolve is that in ideal world a message like: Hello {strong}{$name}{/strong} would resolve to Hello <strong dir="auto">Addison</strong> rather than to Hello <strong><bdo dir="auto">Addison</bdo></strong>.

We may later evolve the logic to allow for population of attributes in cases where markup element is perfectly surrounding a placeable and we want to set dir/lang.

@mihnita
Collaborator

mihnita commented Oct 28, 2022

Same as before, +100 :-)

But now, with a lot more things already "settled", I think we can dig deeper on what can / can't be done.

I've been thinking about it, and we probably need to answer some sub-questions.


What to add, exactly?

What can a low level library use to wrap placeholders?
The result might be used as plain text, or html, or something completely different.

Unicode control characters? HTML recommends using tags, not control characters.

HTML tags? We don't know if the consumer of the result understands HTML.
And we don't even know what kind of tags to insert.
A block kind of tag (div), or inline one (span)?
And should be <span dir="...">, or a <bdi>, or something else?
Even if HTML, should these be "events" (open tag, content, end tag) or DOM subtree (tag + content as child)

So I think the only thing that the spec can really say is put this info (somehow) in the "format to parts" (this chunk from here to there is RTL).
And leave it to a different layer to adapt the result for final consumption (control chars, html tags, something else).


And what part of the "chain" can do it.
Is it the custom function?
Or is it the engine?
Or a post-processing step, after .format (or .formatToParts) is invoked?

If the engine does it, all it can do when it sees ... pre ... {$ph :func} ... post ... is something like this:

  1. append ... pre ...
  2. Invoke :func
  3. take the string format of that
  4. analyze the string to guess the direction, and append the string "wrapped" in directional "markers" (see previous section about what that means)
  5. append ... post ...

I don't think that is a good model.
It still leaves some "guessing"
And only deals with "the outside" of things.
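
For reference, the engine-level model in the numbered steps could look roughly like this. The pattern representation and all names (`engineFormat`, `guessDir`, `wrap`) are hypothetical; the point is that the engine only ever sees the final string of each placeholder.

```javascript
// pattern: array of literal strings and { name, func } placeholders.
// functions: map from function name to a formatter returning a value.
// guessDir and wrap are supplied by the caller.
function engineFormat(pattern, args, functions, guessDir, wrap) {
  let out = "";
  for (const piece of pattern) {
    if (typeof piece === "string") {
      out += piece; // steps 1 and 5: append literal text
      continue;
    }
    const formatted = functions[piece.func](args[piece.name]); // step 2: invoke :func
    const text = String(formatted);                            // step 3: string form
    out += wrap(text, guessDir(text));                         // step 4: guess and wrap
  }
  return out;
}
```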

I think we want to allow for functions that in fact generate multiple components.

Let's think HTML...

And have a matrix formatter, that produces a table. Or a list formatter that produces a drop-box. Or even a regular <p> with <span> in it.
The elements inside the result should also be wrapped.
"You have emails from {$people :listformat}..." would probably have to result in
<p dir="ltr">You have emails from <bdi>person 1</bdi>, <bdi>person 2</bdi>, and <bdi>person 3</bdi>...</p>

Maybe <bdi>, maybe <span>, that's not the issue here. The issue is, each item needs to be wrapped.
Which the engine can't really do reliably.

So I think this can only be done properly by the functions.


Do translators need to be able to change this, or not?

I would argue that yes, they need to.
If I have image tags in a string "To register with <img src="company_logo.jpg"> see <img src="next.jpg">"
You need a human to say "it's OK to flip the second image (next), but not the first one (company logo)".

The developer might know "ok, don't mirror the company logo", but you need the translator to tell you about the second one.


My proposals after this round of thinking:

  • we need a "direction" option that can be changed by translators, or fixed by the developer
  • allow translators to add that option, always, to any placeholder, if it was not already fixed by the developer.
  • that "direction" option should be applicable to all functions, standard or custom. Probably with values like "rtl", "ltr", "auto", "isolated", maybe more (TBD). That is a bit similar to HTML. There are attributes that are specific to certain elements (src, alt), but there are some universal ones (dir, lang, etc).
  • info about "this thing is rtl" should be in "parts", and not tech stack specific.
  • the info should be added by the functions, not the engine. The function gets the "direction" in the options bag and decides how to act on it.
  • have a post-formatToParts step that generates html / control characters / something else from the parts?
    And have that smart enough to also merge things like <strong><bdo dir="auto"> => <strong dir="auto">? (Addison's point)

Of course, if an implementation is not in a generic library like ICU, but very specific to produce HTML (in a browser), then some of the steps might be short-circuited (produce HTML tags / DOM directly, without format to parts + post-process).
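
A minimal sketch of such a post-processing step for HTML output. The part shape (`{ value, dir }`) is an assumption, since the result of formatToParts is not defined in this thread, and the sketch does not attempt the `<strong><bdo dir="auto">` merging mentioned in the last bullet.

```javascript
// Assumed part shape: { value: string, dir?: "ltr" | "rtl" | "auto" }.
function escapeHtml(text) {
  return text.replace(/[&<>"]/g, (c) =>
    ({ "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;" }[c]));
}

// One possible post-formatToParts step: turn direction metadata into markup.
function partsToHtml(parts) {
  return parts
    .map((part) => {
      const text = escapeHtml(part.value);
      // Parts without direction metadata are emitted as plain text.
      return part.dir ? `<bdi dir="${part.dir}">${text}</bdi>` : text;
    })
    .join("");
}
```

A different consumer could walk the same parts and emit control characters, or nothing at all, without the formatting functions knowing about it.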

@aphillips
Member

Each string/substring should have a language and direction attribute (note that this is what W3C I18N is asking TC39 for with the maybe-terribly-named Localizable proposal). A formatToParts can produce a sequence of Localizable that the consumer can use to generate controls or HTML markup as needed.

I suspect that MF's format (i.e. formatToString rather than parts) should probably have a couple of modes, one of which is "do nothing" (just make a string and do not generate controls) and one of which is "plain-text" (i.e. generate isolate controls as needed).

Note that dir only has three potential values: ltr, rtl, and auto (first-strong/don't-know). Isolation should be the default vs. embedding. I'll have more detailed thoughts in a bit.

@mihnita
Collaborator

mihnita commented Oct 28, 2022

three potential values: ltr, rtl, and auto

Ack, thanks.

@mihnita
Collaborator

mihnita commented Oct 28, 2022

About Localizable

After a very-very superficial scan (can't call it read) of the Localizable proposal, and with the disclaimer that I don't "grok" the relation between W3C, WebIDL, and ECMAScript, or what WebIDL is really trying to do :-)

These are my quick impressions:


Can't direction be derived from locale?


WebIDL seems to be (mostly) "Unicode unaware / agnostic"

  • DOMString => "commonly interpreted as UTF-16 encoded strings ... although this is not required."
  • ByteString => "might be interpreted as UTF-8 encoded strings ... although this is not required."
  • USVString => The only one that seems to be guaranteed Unicode, but its use seems to be discouraged (see the Warning)

Which makes these strings kind of useless for l10n / i18n.
Should the Localizable be explicit that it uses some kind of Unicode encoding? Which is even more important than the locale and direction (maybe it is saying that and I've missed it)


If there is resistance to Localizable, would it be an option to use Annotated types to express locale and direction. And (even more important in my opinion) the fact that the string annotated is Unicode?


Let me know if you think these points help in any way, and where should I cut / paste them (because it is clear they don't belong here :-)

@zbraniecki
Member Author

If I correctly interpret what @mihnita @aphillips wrote below my last response we agree on the value and considerations.

The only item I'd like to clarify is if @mihnita believes that formatToString should return the isolation marks or not (you say that the bidi/lang system should annotate parts, but I don't see your position on the string output).

The question is - what are the next steps? As I mentioned above, I'm concerned about Tech Preview being released without this and I'd like to make sure we don't have any more releases (even if they remain TP) that make testers work with MF output without this feature.

@aphillips
Member

Can't direction be derived from locale?

Not entirely. Language information can be used as a fallback when no direction information is available, but we don't think it is a good general solution.

WebIDL seems to be (mostly) "Unicode unaware / agnostic"

It seems that way because of JavaScript's historical (and misguided) ambivalence about saying that strings consist of Unicode code points. In reality, the three types @mihnita cites have a clear relationship to their respective representations.

The point of Localizable would be to create a type, class, or commonly shared data structure (via a "dictionary" definition) that specifications could just use. The "value" portion of a Localizable would be the text bearing string and each string would also have a lang and dir attribute. That way one could write:

<!-- for some variable value "myVar" -->
<p lang="$myVar.lang" dir="$myVar.dir">$myVar.value</p>

There already exist mappings for RDF and as-a-string serialization schemes in JSON-LD and a number of specifications use what amounts to Localizable as a JSON representation. A proposed definition for Localizable exists in our document String-Meta at this location
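
As an illustration only (this is not the String-Meta definition), a Localizable-like record with the three fields named above could be built and validated like this. `makeLocalizable` is a hypothetical helper; the three allowed `dir` values follow the earlier comment in this thread.

```javascript
const VALID_DIRS = new Set(["ltr", "rtl", "auto"]);

// Hypothetical constructor for a Localizable-like record.
function makeLocalizable(value, lang, dir = "auto") {
  if (!VALID_DIRS.has(dir)) throw new RangeError(`invalid dir: ${dir}`);
  // Intl.getCanonicalLocales throws a RangeError on malformed language tags.
  const [canonicalLang] = Intl.getCanonicalLocales(lang);
  return Object.freeze({ value, lang: canonicalLang, dir });
}
```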

If there is resistance to Localizable, would it be an option to use Annotated types to express locale and direction.

Yes! This is entirely an option that is on the table. We would need some group to publish a normative spec (in W3C terms, a "Recommendation" or REC-track document) with the "dictionary" in it which specs could refer to normatively. This is what we asked WebIDL to do, but they "only model things that exist in JavaScript", hence my detour to ask TC39 to make a Localizable type. If we think that a Localizable or "natural language string" type in JavaScript proper would be useful for I18N generally (and it certainly would make it easy for developers to use it vs. writing a data structure), then we should push for it. I suspect, though, that the headwinds are going to be strong.

@zbraniecki noted:

The only item I'd like to clarify is if @mihnita believes that formatToString should return the isolation marks or not (you say that the bidi/lang system should annotate parts, but I don't see your position on the string output).

As I mentioned, it could be optional and I suspect it should be optional. Control characters insertion could also be added later, since most consumers probably don't introspect inside strings to find directional boundaries. That is, it might not be a blocker for the preview, but would be Very Nice To Have (compare to current MF, which does nothing). Current formatters, such as NumberFormat, only handle bidi issues internal to the formatted string value (cf. the thread with Peter Edberg about currency formats which various Amazon folk have commented on), but bidi isolation of placeables, including in MessageFormat is up to the pattern string and implementer. (For an example, look at Amazon's internal I18N utilities library for BidiFormat and friends)

I think another interesting question is: does formatToParts provide controls or does it provide metadata (and you insert your own controls or markup)? Notice that if formatToString provides controls and formatToParts does not, then that would mean that the two do not produce equivalent code point sequences when concatenating the parts together.

@mihnita
Collaborator

mihnita commented Oct 29, 2022

I think that formatToParts would produce a (standardized) meta that can be converted by a processing step to controls, html tags, something else, or nothing.

And formatToString can be implemented by just iterating the parts from formatToParts and appending to a string buffer, ignoring some parts.

So if there is a part saying "from here to there we have a bidi isolate", formatToString can choose to ignore that info, or produce control characters.
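
A sketch of that idea, assuming a part shape of `{ value, dir }` and a hypothetical `bidi` option whose two values mirror the "do nothing" and "plain-text" modes suggested earlier in the thread.

```javascript
const OPEN = { ltr: "\u2066", rtl: "\u2067", auto: "\u2068" }; // LRI, RLI, FSI
const PDI = "\u2069"; // POP DIRECTIONAL ISOLATE

// Assumed part shape: { value: string, dir?: "ltr" | "rtl" | "auto" }.
// bidi: "none" ignores direction metadata; "plain-text" emits isolate controls.
function partsToString(parts, bidi = "plain-text") {
  let out = "";
  for (const part of parts) {
    const isolate = bidi === "plain-text" && part.dir !== undefined;
    out += isolate ? OPEN[part.dir] + part.value + PDI : part.value;
  }
  return out;
}
```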

For a low level library like ICU that should probably be an option and decided by the developer calling it (or the layers built on top of it). Probably would be good to do the same for ICU4X.


In recent years it looks like ICU is going in that direction.

For example LocalizedNumberFormatter.format returns a FormattedNumber, and there is no API that returns a string directly (similar to formatToString). You need to explicitly call toString on the result to get a string result.

And FormattedNumber has methods like getGender(), getNounClass(), getOutputUnit() and ways to iterate the "parts" (nextPosition(ConstrainedFieldPosition) and AttributedCharacterIterator toCharacterIterator()).
It looks very much like an "unpolished form" of formatToParts.

I hope we can improve things a bit with MF2.


And I think that defining the result of formatToParts is some other issue we need to revive :-)

@mihnita
Collaborator

mihnita commented Oct 29, 2022

My take on this, less verbose, and maybe more clear:

does formatToParts provide controls or does it provide metadata

I think my answer would be metadata.

then that would mean that the two do not produce equivalent code point sequences when concatenating the parts together

I think we should not concatenate parts and strings.
Ideally each formatter function would return parts. MessageFormat would concatenate plain text (wrapped in a part) & parts returned by formatters.
And the final conversion from parts result to string would iterate the parts to generate string result.

The question is: how do we invoke older formatters which already return strings with controls?
Without thinking too much about it (so it might not be a good idea), my answer is that we need to wrap those functions in something that looks like the MF2 function signature, so that it returns parts. That "wrapper" would take the legacy string and convert it to parts, with meta for bidi info.
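
One way to picture such a wrapper, with `Intl.NumberFormat` standing in for the older formatter. The function signature and the part shape are assumptions, since the MF2 function signature is not spelled out here, and the direction is supplied by whoever registers the adapter rather than detected.

```javascript
// Adapts a string-returning formatter to one that returns parts.
function adaptLegacyFormatter(legacyFormat, dir) {
  return (value, options) => {
    const text = legacyFormat(value, options);
    // Any controls already inside `text` are kept; they handle ordering
    // within the value, while `dir` describes the part as a whole.
    return [{ type: "legacy", value: text, dir }];
  };
}

// Example: Intl.NumberFormat as the "older" formatter.
const formatNumberParts = adaptLegacyFormatter(
  (value, options) => new Intl.NumberFormat("en", options).format(value),
  "ltr"
);
```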

@macchiati
Member

macchiati commented Oct 30, 2022 via email

@macchiati
Member

macchiati commented Oct 30, 2022 via email

@aphillips
Member

@mihnita The older formatters return controls to elicit proper ordering of ambiguous sequences within a formatted string, such as a number (especially currency values) or date. The formatters do not provide exterior wrapping/isolation to prevent spillover effects (which is what we're talking about here).

@macchiati I don't agree that:

We can't, however, know the base direction of the message, because that would depend on the context in which it is being used.

We need to know the base direction of the string, since the string itself is a placeable into its rendering context. When messages don't have a base direction, they are subject to spillover effects or wrong base direction detection, particularly if they start with a misleading strong character. Worst-case, we can use first-strong. I suppose that this might be the realm of a higher-level protocol, such as a resource language. But if strings don't have a base direction, we won't know how to decorate them automagically to get the right results. Inferring the base from the language is possible if that's all we have.

The following examples can be test driven on this demo page. The Arabic pattern means roughly "price {x} + {y} shipping!"

First, placeables need isolation to avoid string-internal spillover effects. If you paste this string into the text box (this is also one of the examples in the list box at the top of the page):

<span style='color:blue'>\u0627\u0644\u0633\u0639\u0631</span> <span class=magenta>1,234.56 AED</span> + 12.99 USD \u0627\u0644\u0634\u062d\u0646!

You get:

image

Adding a dir attribute to the price values (the placeables that message format might generate) produces the proper isolation (you can use Unicode controls instead of a span with a dir attribute):

<span style='color:blue'>\u0627\u0644\u0633\u0639\u0631</span> <span class=magenta dir=auto>1,234.56 AED</span> + 12.99 USD \u0627\u0644\u0634\u062d\u0646!

image

If we don't know the base direction of the whole string, though, then when we insert it into a page we can get spillover effects that are unwanted. Let's simulate that by putting an opposite direction (English) wrapper around the string:

We promised: "<span style='color:blue'>\u0627\u0644\u0633\u0639\u0631</span> <span class=magenta>1,234.56 AED</span> + 12.99 USD \u0627\u0644\u0634\u062d\u0646!"

... which produces the thoroughly broken:

image

Fixing the interior placeables helps:

We promised: "<span style='color:blue'>\u0627\u0644\u0633\u0639\u0631</span> <span class=magenta dir=auto>1,234.56 AED</span> + <span dir=auto>12.99 USD</span> \u0627\u0644\u0634\u062d\u0646!"

... but still leaves the exclamation point on the wrong side (other effects can be produced with other strings):

image

@aphillips
Member

I think that a given component should take care of its interior needs and then expose its own base paragraph direction. That way, if all the directions align you don't get extra characters providing unnecessary levels of isolation and the component doesn't need to know or be told its context--it just needs to report the base paragraph direction of its output (which it already knows).

Your example doesn't make that much sense to me: a date format or compact decimal format in a given locale will be assembling a string with a single base direction and the tokens it emits will be in a specific language. There can be local considerations (my example with RLM on the date 1/11/2022 above), which the formatter should take care of by emitting a string that is "display ready" in its base direction. Isolation is not a panacea here: isolating the subformat tokens (month, day, year, etc.) in ١‏/١١‏/٢٠٢٢, ١٠:٢٣ ص does not result in a correct bidi string unless the whole thing is also wrapped and the RLMs are more effective.

If you take your example and turn it into an MF pattern string:

{name,name,full} purchased stock in {stock} for {price,compact-short, currency} on {date,date,::EEEEMMMMd} and {date,date,::jm}.

... and its (Google translated) Arabic friend:

اشترى {name} مخزونًا في {stock} مقابل {price} في {date} و {date}.

If each formatter function reports the base direction of its output string (e.g. John H. Smith is ltr) then the parent formatter (message format in this case) can use that to decide to wrap the string with controls (or markup). PersonalNameFormat takes care of the insides of "John H. Smith", NumberFormat takes care of "$3.21M", and DateFormat takes care of "Tuesday, March 3" and "11:57 am". This makes implementation fairly simple: you only have to worry about whether/how to isolate whole strings that you are given.

@eemeli
Collaborator

eemeli commented Nov 6, 2022

It might be useful to approach this by figuring out what the to-string formatted output of MF2 should be.

We may have some parts of the output for which we can know the directionality (e.g. literal text in the parent locale or {$count :number}) and others for which we might not be sure (e.g. {$name}). Should the inclusion of isolating marks between such parts be something required by default? And if not, what about cases where we know that the directionality of adjacent parts is different?

@zbraniecki
Member Author

Should the inclusion of isolating marks between such parts be something required by default?

Yes.
And I believe we're converging on this consensus among all stakeholders in this thread.

MF2 should make it an extra step to produce multi-directional string output without isolation marks. By default it should use the information it has about placeholder positions to isolate at boundaries.

@eemeli
Collaborator

eemeli commented Nov 8, 2022

@zbraniecki How about cases where we know the directionality matches?

For example, in an en-US context, we can presume that literal text and the string representation of a placeholder like {$count :number} are both LTR. Should we require isolation even in this case, or could we allow an implementation to leave it out?

@zbraniecki
Member Author

How about cases where we know the directionality matches?

Those should be exempted from marks.

For example, in an en-US context, we can presume that both literal text and the string representation of a placeholder like {$count :number} are both LTR.

It's a bit more tricky actually. We should evaluate whether the number formatter used to format $count has the same directionality as the main text. If so, we can skip.

Also, as Addison pointed out, we may want to evaluate language information alongside direction. I'm a bit less clear on what exactly this meta information should look like, but I imagine that we could have an en-CA text with a relative time format placeholder using en-US and may want to mark it as lang=en-US. @aphillips - is that something you'd like to suggest, or just that if the placeholder is a variable from the developer (say, user name, or proper name) and is marked as lang=fr we should mark lang of that placeholder to be fr, but if it's about an I18n formatter, we don't need to separate out lang information?

@eemeli
Collaborator

eemeli commented Nov 9, 2022

Could we first figure out the absolute minimum that's required in the MF2 spec for formatted string output? That we're all agreed on as being a part of the base layer, while e.g. the shape of the formatted parts might well end up getting defined by specifications building on top of it.

Maybe something like this?

Where appropriate, the formatted string representation of a message MUST isolate message parts that may have different directionality than the message as a whole. Such a part MUST be prefixed with an explicit isolate character:

  • LEFT-TO-RIGHT ISOLATE U+2066 if the part is known to have LTR directionality,
  • RIGHT-TO-LEFT ISOLATE U+2067 if the part is known to have RTL directionality, or
  • FIRST STRONG ISOLATE U+2068 if the part's directionality is not certain.

In all cases, the part MUST be postfixed with a corresponding POP DIRECTIONAL ISOLATE U+2069 character.

Such wording would require a part sequence like LTR/RTL/RTL to include an unnecessary PDI + RLI character pair between the RTL parts if the message as a whole is LTR. Should that be optimised out?
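The proposed wording can be sketched as a single helper. The code points are the ones named in the wording above; the `isolate` function name and the `"unknown"` label for an uncertain direction are illustrative assumptions, not spec terms.

```typescript
type PartDir = "ltr" | "rtl" | "unknown";

const LRI = "\u2066"; // LEFT-TO-RIGHT ISOLATE
const RLI = "\u2067"; // RIGHT-TO-LEFT ISOLATE
const FSI = "\u2068"; // FIRST STRONG ISOLATE
const PDI = "\u2069"; // POP DIRECTIONAL ISOLATE

// Prefix the part with the isolate matching what is known about its
// direction, and always close it with PDI.
function isolate(text: string, dir: PartDir): string {
  const open = dir === "ltr" ? LRI : dir === "rtl" ? RLI : FSI;
  return open + text + PDI;
}

// A formatted number in an LTR locale vs. a user-supplied name of
// unknown direction.
const count = isolate("1,234", "ltr");
const userName = isolate("\u05D3\u05E0\u05D9", "unknown");
```

Written with escapes, `count` is `"\u20661,234\u2069"` and `userName` starts with `\u2068`, so the renderer picks the direction from the first strong character of the name.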

@aphillips
Member

Such wording would require a part sequence like LTR/RTL/RTL to include an unnecessary PDI + RLI character pair between the RTL parts if the message as a whole is LTR. Should that be optimised out?

It doesn't work that way. If you have a base paragraph direction string that is LTR and you have two consecutive RTL insertions, you want isolation in between them to prevent spillover effects. Consider this example:

السعر 1,234.56 AED 12.99 USD الشحن

This has two placeable strings ("1,234.56 AED" and "12.99 USD") with only a space between them. Without isolation they draw like:

[image: the string rendered without isolating controls]

With isolating controls they draw correctly without spillover effects:

[image: the string rendered with isolating controls]

The only time that isolating markup or controls can be omitted safely is when:

(i) the placeable and the host string have the same base direction
(ii) and either all characters in the placeable have the same base direction or the first and last characters are strong "same direction" as the "base direction".

This is why unknown strings need FSI/PDI around them.
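A rough sketch of that two-part test follows. Everything here is an assumption made for illustration: `strongDir` only recognises Basic Latin letters as strong LTR and Hebrew letters (U+05D0..U+05EA) and Arabic letters (U+0620..U+064A) as strong RTL, where a real implementation would consult the Unicode Bidi_Class property. Anything it does not recognise, including digits, spaces, and punctuation, counts as "not strong", which errs on the side of isolating.

```typescript
type Dir = "ltr" | "rtl";

// Deliberately tiny classifier; see the assumptions in the lead-in.
function strongDir(ch: string): Dir | null {
  const c = ch.charCodeAt(0);
  if ((c >= 0x05d0 && c <= 0x05ea) || (c >= 0x0620 && c <= 0x064a)) return "rtl";
  if ((c >= 0x41 && c <= 0x5a) || (c >= 0x61 && c <= 0x7a)) return "ltr";
  return null; // neutral, weak, or simply unknown to this sketch
}

function canOmitIsolation(placeable: string, placeableDir: Dir, hostDir: Dir): boolean {
  // (i) the placeable and the host string share a base direction
  if (placeableDir !== hostDir) return false;
  // (ii) the first and last characters are strong in that direction.
  // A placeable whose characters all have the base direction passes
  // this check as well, so the two alternatives collapse into one.
  const first = strongDir(placeable.charAt(0));
  const last = strongDir(placeable.charAt(placeable.length - 1));
  return first === hostDir && last === hostDir;
}

const plainName = canOmitIsolation("Smith", "ltr", "ltr");        // true
const currency = canOmitIsolation("12.99 USD", "rtl", "rtl");     // false: starts with a digit
const otherDir = canOmitIsolation("Smith", "ltr", "rtl");         // false: directions differ
```

An empty or unrecognised placeable falls through to `false`, so the caller keeps the isolates whenever the sketch cannot prove they are unnecessary.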

@eemeli
Collaborator

eemeli commented Nov 10, 2022

@aphillips:
If you have a base paragraph direction string that is LTR and you have two consecutive RTL insertions, you want isolation in between them to prevent spillover effects. Consider this example:

Ah, had not played around with that example; thank you, that was useful. I wasn't able to observe spillover when omitting inner isolates between parts with the same directionality, but their overall order is indeed affected. So if we're in an LTR context, and the logical order of our message is L1, R1, R2, L2 then if we isolate each part, the displayed order is as expected: L1, R1, R2, L2. However, if we leave out the inner isolation between the RTL parts, then we'd observe L1, R2, R1, L2.

The only time that isolating markup or controls can be omitted safely is when:

(i) the placeable and the host string have the same base direction
(ii) and either all characters in the placeable have the same base direction or the first and last characters are strong "same direction" as the "base direction".

This is why unknown strings need FSI/PDI around them.

Is it FSI specifically that we should be using, or should we use LRI and/or RLI if we do know the directionality of the inner part?

@aphillips
Member

@eemeli

Is it FSI specifically that we should be using, or should we use LRI and/or RLI if we do know the directionality of the inner part?

It is FSI if the direction of the inserted string is unknown. It is LRI or RLI if the direction is known (matching the direction of the string).

@mihnita
Collaborator

mihnita commented Nov 13, 2022

I also think that a big part of the discussion is about who is responsible for adding those control characters, or special-bidi-control parts when we format to parts.

It is pretty clear that the "function" should do it, because of situations like this:

Expires on {exp :date}...
Stuff to buy {lst :listformat}...

Where the formatted date needs internal directional control characters.
And in the list case you probably want each item in the list isolated.

But should the result of the whole placeholder be wrapped?
And if yes, who should do it, the function, or the "engine"

Here is what I mean: Expires on {exp :date}...
And let's say we want the format to parts result to be:

parts = [
  "Expires on ",
  ISOLATE_START,
  "Nov 11, 2022",
  ISOLATE_END,
  "..."
]

Should that be done by "the engine" (the part of the MessageFormat implementation that is function agnostic; it only invokes functions and "glues" the results together)?
Or is that again the responsibility of the function?

The engine:

for (each part in ast.parts) {
    if (part is text) {
        result.append(part)
    } else if (part is placeholder) {
        result.append(ISOLATE_START)
        result.append(invoke placeholder.function with options and whatever else we need)
        result.append(ISOLATE_END)
    }
}

or the function:

for (each part in ast.parts) {
    if (part is text) {
        result.append(part)
    } else if (part is placeholder) {
        result.append(invoke placeholder.function with options and whatever else we need)
    }
}

I am inclined to say the function is also responsible for that part.
The function would know best if its own result needs wrapping or not.

@zbraniecki
Member Author

I think there's an alternative to:

parts = [
  "Expires on ",
  ISOLATE_START,
  "Nov 11, 2022",
  ISOLATE_END,
  "..."
]

We could do:

parts = [
  {type: LITERAL, value: "Expires on ", dir: LTR},
  {type: DATE, value: 293131221, dir: RTL},
  {type: LITERAL, value: ".", dir: LTR}
]

and allow the consumer to decide on injecting marks.
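One possible consumer, sketched under a few assumptions: the `Part` interface below is a hypothetical shape (no parts format has been agreed in this thread), each part's `value` has already been formatted to a string, and the consumer's policy is to pass literals through and isolate everything else according to its `dir`.

```typescript
type Dir = "ltr" | "rtl" | "auto";

// Hypothetical part shape, loosely following the example above.
interface Part {
  type: "literal" | "date" | "number" | "string";
  value: string; // assumed to be formatted to a string already
  dir: Dir;
}

// A to-string reducer that injects isolates around non-literal parts.
function partsToString(parts: Part[]): string {
  return parts
    .map((p) => {
      if (p.type === "literal") return p.value;
      const open = p.dir === "ltr" ? "\u2066" : p.dir === "rtl" ? "\u2067" : "\u2068";
      return open + p.value + "\u2069";
    })
    .join("");
}

const parts: Part[] = [
  { type: "literal", value: "Expires on ", dir: "ltr" },
  { type: "date", value: "Nov 11, 2022", dir: "ltr" },
  { type: "literal", value: "...", dir: "ltr" },
];
const text = partsToString(parts);
```

A different consumer, for example one producing DOM nodes, could read the same `dir` values and emit markup without the parts array having to change.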

@zbraniecki
Member Author

@mihnita @eemeli @stasm @aphillips @echeran - thoughts?

@aphillips
Member

I agree that the isolates want to be included in specific parts, not separate elements in the "parts" array. For cases where the direction and language are the same all the way through, it allows implementations to omit isolating controls (or markup or such). For cases where the parts are separately rendered, it allows the caller to extract language and direction metadata for a given span.

If we had an LString type, the representation would be more like:

parts = [
   {type: LITERAL, value: { value: "Expires on ", lang: "en-US", dir: "LTR" }},
   {type: DATE, value: someDateValue},
   {type: LITERAL, value: { value: ".", lang: "en-US", dir: "LTR" }}
]

The DATE object would get language and base paragraph direction information from the formatter. The default for lang would be und and the default for dir would be auto (first-strong).

To @eemeli's point earlier, we could resolve this separately (and potentially later), provided we can agree on the "format-to-string" output. I agree that the code point sequences don't have to be identical to the concatenated toString output of "format-to-parts", but it would be good if they were at least somewhat consistent :-).

Finally, note that parts needs to have language and base paragraph direction metadata of its own. The language presumably is the locale of the formatter. The base direction might be provided by the resource provider. (In the case of ICU, we provide a guess at the base direction from the locale, although this is not as holistically provisioned as it might be.)

@zbraniecki
Member Author

So, in fact you argue that parts should be:

parts = {
  elements: [
    {type: LITERAL, value: { value: "Expires on ", lang: "en-US", dir: "LTR" }},
    {type: DATE, value: someDateValue},
    {type: LITERAL, value: { value: ".", lang: "en-US", dir: "LTR" }}
  ],
  lang: "en-US",
  dir: "LTR",
}

right? That's a pretty challenging alteration and incompatible with ECMA-402 FormatToParts, but maybe necessary?

Or we could assume that people can derive lang/dir from resolvedOptions() the way they would for getting lang/dir out of DateTimeFormat::formatToParts?

@aphillips
Member

Coming from resolvedOptions() sounds right.

@macchiati
Member

macchiati commented Dec 8, 2022 via email

@aphillips
Member

When a message gets constructed, you really want all the pieces of the message to be in the same language wherever possible. I don't want a Czech date in the parts above, but one that is really for en-US.

In general, I agree. However:

  • resource fallback can lead to a message template in a different language than the runtime (formatter) locale
  • data inserted into a string can be in a different language, e.g. You purchased the book "HTML و CSS: تصميم و إنشاء مواقع الويب"

My example was somewhat pedantic about lang/dir metadata because I'm thinking in terms of "attributed strings" or "attributed values". There can be (and should be) an inheritance model so that data does not need to be replicated on every level. But we need the ability to tag data as appropriate.

The implementation we made when I was at Amazon tied the resource format and formatter together. The template structure used selectors (just as we've moved to selectors in MFv2) which resolved to a pattern string (by evaluating plurals, selects, and such) and the resulting pattern string was in a single language and had a single base direction. When we look at parts (as above in this thread), the language and direction on literals that come from the pattern string itself are entirely redundant--the parts only exist because placeables appear inside the template (inside can be at either end, please note), causing us to have "parts" of the template expressed as separate literals. It's the placeables that need bidi isolation and language markup, not the literals (which can only ever be in one language with one base paragraph direction unless one is being stoopidly cute).

Does that make sense?

@aphillips
Member

For BIDI as well, it is only necessary to convey the status of a piece that differs from the enclosing parts; so those can also be optional in the cited case.

This is not correct. Even if the base direction is the same, there are cases where isolation of placeables is desirable to prevent spillover effects. Consider the example The price is ${price} + ${shipping} in shipping in Arabic:

السعر 1,234.56 AED + 12.99 USD الشحن

This should render:

السعر ⁧1,234.56 AED⁩ + ⁧12.99 USD⁩ الشحن

Note that the second string has RLI/PDI around the placeables--but all of the "parts" are RTL!! The presence of LTR characters and numbers in the currency values does not mean that their locale is not ar-AE or that their base direction is not RTL. Also enclosing and ending punctuation positioning depends on direction.

@macchiati
Member

macchiati commented Dec 9, 2022 via email

@macchiati
Member

My example was somewhat pedantic about lang/dir metadata because I'm thinking in terms of "attributed strings" or "attributed values". There can be (and should be) an inheritance model so that data does not need to be replicated on every level. But we need the ability to tag data as appropriate.

I'm not saying that we don't need the ability to tag BIDI; if the dir on an element isn't equal to a dir on the parts, it needs to be present. That is, I agree with your statement "There can be (and should be) an inheritance model so that data does not need to be replicated on every level. But we need the ability to tag data as appropriate."

data inserted into a string can be in a different language, e.g. You purchased the book "HTML و CSS: تصميم و إنشاء مواقع الويب"

On the other hand, I still doubt that the lang attribute is particularly useful. I'm not against having it be an optional attribute. I just have yet to see a convincing case where it is required (as I noted earlier). And in the case you give here, I don't see that it is. An example would help: especially given that the data sources will often not have that information, what would the process do in the presence of the lang attribute that it wouldn't do otherwise?

It's the placeables that need bidi isolation and language markup, not the literals

I'm a bit confused. In the examples you had, the someDateValue didn't have the extra attributes while the literals did. I'm guessing that the someDateValue was a stand-in for a tuple that did have the attributes. Is that the case?

@zbraniecki
Member Author

I just have yet to see a convincing case where it is required (as I noted earlier).

We know, and you listed it yourself, that we'll want it for TTS.

I'm ok with it being optional, as it won't be used by toString reducer.

@macchiati
Member

macchiati commented Dec 9, 2022 via email

@eemeli
Collaborator

eemeli commented Dec 9, 2022

I continue to think that the shape of an MF2 formatted parts result should not be defined by the core MF2 spec, but by the spec/implementation layers building on top of it. I believe that @mihnita is working on a PR stating something like that so that we'll be able to close #41 and #272.

However, I do think that the formatted string result should be explicitly defined for MF2; on that, I rather like the current shape of #315. There it would be valuable to get some more input on whether the isolation should happen by default either

  1. when we know that we need it, or
  2. when we don't know that we don't need it.

The current proposal is to go with option 2, but e.g. applying the change from #315 (comment) would switch to the first option.

Once we have a definition of how this should work for string output, it'll make it easier for implementations (i.e. the ICU4J tech preview & the JS polyfill) to experiment with formatted-parts and verify that the intended goal is achievable.

@zbraniecki
Member Author

I continue to think that the shape of an MF2 formatted parts result should not be defined by the core MF2 spec, but by the spec/implementation layers building on top of it.

How do you envision that affecting binding layers built on top of MF2 for various frameworks? If ICU4C, ICU4J, ICU4X, ECMA-402, and maybe even SpiderMonkey vs. V8 each have a different shape of parts and encode information differently, then inevitably some implementations will give bindings enough information to do things that other implementations cannot support.

@eemeli
Collaborator

eemeli commented Dec 9, 2022

@zbraniecki I've replied to you here: #272 (comment).

@aphillips
Member

@eemeli I agree with you. I think #315 is close. I added a comment just now about tweaking the wording there to isolate by default but permit implementations to "optimize" their output (by omitting isolation when it is not necessary). Such optimization is harder than it looks.

It is important that isolation is not only when the placeable's direction "does not match" the host string's direction. I think the requirement is that isolation is required if one of these conditions is met:

  • the placeable has a base direction different from the template string
  • the placeable has more than one directional run and neutral, weak, or anti-base direction runs occur on either end of the placeable (they can occur mid-string, as long as they are entirely enclosed in strongly directional characters that match the base direction of the string as a whole) or the embedding level of each end of the placeable does not match (not sure this can occur, but it might)

@macchiati

I'm not saying that we don't need the ability to tag BIDI; if the dir on an element isn't equal to a dir on the parts, it needs to be present. That is, I agree with your statement "There can be (and should be) an inheritance model so that data does not need to be replicated on every level. But we need the ability to tag data as appropriate."

Another way to say this is that the pattern (your element?) string and each of its parts each has a direction attribute that can be queried. When a given part does not have its own direction or language value, it is inherited from the "level above" (generally the pattern string as a whole) or computed from the locale (e.g. ULocale.isRightToLeft()).
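That lookup order could be sketched as follows. All names are hypothetical: `Attributed`, `MessageMeta`, and `resolveMeta` are invented for this illustration, and `dirFromLocale` is a hard-coded stand-in for a real query such as ULocale.isRightToLeft(), covering only a handful of languages. The order used is the one described above: the part's own value, then the level above, then a value computed from the locale.

```typescript
type Dir = "ltr" | "rtl";

interface Attributed {
  lang?: string; // set only when the part differs from the message
  dir?: Dir;
}

interface MessageMeta {
  lang: string;  // locale of the pattern string as a whole
  dir?: Dir;     // may be supplied by the resource provider
}

// Stand-in for a real locale query; the language list is an assumption.
function dirFromLocale(lang: string): Dir {
  const primary = lang.split("-")[0].toLowerCase();
  return ["ar", "fa", "he", "ur"].indexOf(primary) >= 0 ? "rtl" : "ltr";
}

function resolveMeta(part: Attributed, message: MessageMeta): { lang: string; dir: Dir } {
  const lang = part.lang !== undefined ? part.lang : message.lang;
  const dir =
    part.dir !== undefined ? part.dir            // the part's own value
    : message.dir !== undefined ? message.dir    // inherited from the level above
    : dirFromLocale(message.lang);               // computed from the locale
  return { lang: lang, dir: dir };
}

const inherited = resolveMeta({}, { lang: "en-US" });
const tagged = resolveMeta({ lang: "ar", dir: "rtl" }, { lang: "en-US", dir: "ltr" });
```

Here `inherited` comes out as en-US and ltr without either value being stored on the part, while `tagged` keeps its own ar and rtl values.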

On the other hand, I still doubt that the lang attribute is particularly useful. I'm not against having it be an optional attribute. I just have yet to see a convincing case where it is required (as I noted earlier). And in the case you give here, I don't see that it is. An example would help: especially given that the data sources will often not have that information, what would the process do in the presence of the lang attribute that it wouldn't do otherwise?

My example wasn't affected by lack of language metadata to be sure. @zbraniecki calls out voice selection. Any kind of language-specific processing (such as font selection in CJK, for example) would also benefit from accurate language metadata. This is more of a corner case for MFv2, since generally we're trying to make a string in a given locale, but data is data and can be multilingual. When it is not in the same language, the ability to query the metadata allows the user to e.g. decorate the text with a language appropriate <span>. Our W3C group has written several docs, such as String-Meta and use cases, about this.

In this case, every "literal" part of the message that is part of the template string will have the same language as the template string as a whole. It's only when a literal piece of data (such as the book's title, as in the example) has a different language that the metadata appears on a literal. MessageFormat won't generally use the value, but consumers might when consuming the parts. (If we permitted nested patterns, then the language would be necessary for features such as quote generation or to feed nested formatters as the locale)

It's the placeables that need bidi isolation and language markup, not the literals

I'm a bit confused. In the examples you had, the someDateValue didn't have the extra attributes while the literals did. I'm guessing that the someDateValue was a stand-in for a tuple that did have the attributes. Is that the case?

The someDateValue in my example looks sort of like this:

 {type: DATE, value: someDateValue},

In this case, it's a date value, such as a java.util.Date, a Java Temporal, a JavaScript Date, etc. Presumably there is an associated DateFormatter whose locale (and skeleton/pattern/options) determine the literal string. The locale of that formatter is the language. The direction of that locale is the base direction for that part. So I could have written it as:

 {type: DATE, value: $someDateValue, lang: $someDateValue.locale, dir: $someDateValue.dir},

Or perhaps:

{ type: DATE, value: someDateValue, 
        lang: myDateFormat.resolvedOptions().locale, // implied
        dir: myDateFormat.resolvedOptions().dir      // implied
},
