[core] Use ICU for bidirectional text layout and Arabic text shaping #6984

ChrisLoer · 2016-11-09T18:25:07Z

No description provided.

mention-bot · 2016-11-09T18:25:10Z

@ChrisLoer, thanks for your PR! By analyzing the history of the files in this pull request, we identified @jfirebaugh, @kkaefer and @brunoabinader to be potential reviewers.

brunoabinader · 2016-11-09T18:56:40Z

src/mbgl/text/bidi.cpp

+    UErrorCode errorCode = U_ZERO_ERROR;
+
+    std::u16string input16 = util::utf32_to_utf16::convert( input32 );
+    std::unique_ptr<UChar*> outputText = std::make_unique<UChar*>( new UChar[input16.size() * 2] ); // Maximum output of ubidi_transform is twice the size of input according to ubidi_transform.h


This is what valgrind is complaining about on the Qt5 Travis bot.

I suppose you mean std::make_unique<UChar[]>(input16.size() * 2); instead?

jfirebaugh · 2016-11-09T18:49:13Z

src/mbgl/util/utf.hpp

@@ -15,7 +15,28 @@ class utf8_to_utf32 {
        boost::u8_to_u32_iterator<std::string::const_iterator> end(utf8.end());
        return std::u32string(begin,end);
    }
+


We should rewrite this whole file in terms of std::codecvt, dropping the boost dependency.

jfirebaugh · 2016-11-09T18:52:51Z

src/mbgl/layout/symbol_layout.cpp

@@ -90,7 +92,7 @@ SymbolLayout::SymbolLayout(std::string bucketName_,
                u8string = platform::lowercase(u8string);
            }

-            ft.text = util::utf8_to_utf32::convert(u8string);
+            ft.text = bidi.bidiTransform( util::utf8_to_utf32::convert( u8string ) );


Since getGlyphRange doesn't support anything beyond the Basic Multilingual Plane, are we better off using utf16 throughout this code, rather than going utf8 ⇢ utf32 ⇢ utf16 ⇢ utf32?

jfirebaugh · 2016-11-09T18:56:22Z

src/mbgl/text/bidi.cpp

+    UErrorCode errorCode = U_ZERO_ERROR;
+
+    std::u16string input16 = util::utf32_to_utf16::convert( input32 );
+    std::unique_ptr<UChar*> outputText = std::make_unique<UChar*>( new UChar[input16.size() * 2] ); // Maximum output of ubidi_transform is twice the size of input according to ubidi_transform.h


auto outputText = std::make_unique<UChar[]>(input16.size() * 2);

brunoabinader · 2016-11-09T18:59:13Z

Awesome stuff @ChrisLoer! Btw, we usually add tags e.g. [core] in commit messages in GL native.

jfirebaugh · 2016-11-09T19:04:10Z

We'll want some rendering tests to go along with these changes. I suggest keeping them very simple, modeled on test cases such as this one which uses a single symbol layer plus a GeoJSON source to provide simple data.

1ec5 · 2016-11-10T09:13:50Z

src/mbgl/layout/symbol_layout.cpp

@@ -90,10 +92,10 @@ SymbolLayout::SymbolLayout(std::string bucketName_,
                u8string = platform::lowercase(u8string);
            }

-            ft.text = util::utf8_to_utf32::convert(u8string);
+            ft.text = bidi.bidiTransform( util::utf8_to_utf16::convert( u8string ) );


I’m curious about how visual mirroring at this stage interacts with line wrapping. Does this PR address the concern raised in #6057 (comment)?

In #6057, I intentionally performed the mirroring in GlyphSet::lineWrap(), which seemed to work fine for point-placed, wrapped, and line-placed labels: #6057 (comment).

No this approach has the same problem, I was punting on the line wrap issue, but we should probably do at least as much as you did and apply line wrapping before the transform. Having RTL also be bottom-to-top is obviously no good!

For fully correct line breaking support, we'd need to work with the ICU bidirectional algorithm at a lower level -- you start with ubidi_setParagraph, then you find line breaks, then you use ubidi_setLine on the previously set paragraph. It's necessary in order to correctly calculate the embedding levels. My intuition is that we'll be able to get away without doing that, at least at first.

On the other hand, the line wrapping algorithm doesn't really have correct shaping input if you don't do the transform ahead of time. One partial fix we could try is to detect the "paragraph level" direction and pass it into GlyphSet::lineWrap() -- if the paragraph is RTL, then line break up instead of down. It wouldn't do the right thing if you had, say, 3 lines of LTR text embedded in an RTL paragraph, but that might be rare enough to ignore for now?

Bidi POI labels could get pretty long, for instance “مكتبة الإسكندرية‎‎ Maktabat al-Iskandarīyah” for the Library of Alexandria. In this case, the LTR text is a third longer than the RTL text. I don’t know how many cases there are where the two are roughly even in length.

Hrmm... that string renders pretty poorly using my base-direction hack:

I pushed my changes so you can take a look at how I did the line breaking, but the results for the Library of Alexandria are pretty disheartening.

jfirebaugh · 2016-11-11T19:06:22Z

src/mbgl/layout/symbol_layout.cpp

@@ -90,10 +92,12 @@ SymbolLayout::SymbolLayout(std::string bucketName_,
                u8string = platform::lowercase(u8string);
            }

-            ft.text = util::utf8_to_utf32::convert(u8string);
+            std::u16string u16string = util::utf8_to_utf16::convert( u8string );


No spaces after ( or before ) in our house style. You can use clang-format to fix formatting up.

@kkaefer Is there a way to run clang-format on all modified files in a PR?

@jfirebaugh uncomment this line and run make check. This will run both clang-tidy and clang-format on modified files.

I ran clang-tidy and clang-format on all the files, and it looks like they hadn't been tidied in a while, so it introduced a lot of diffs unrelated to my changes. Should I still go ahead with the changes?

Can you subset it further, to the code that's being modified for other reasons?

OK, just did a quick manual review and tried to only clang-format "new" code.

jfirebaugh

Minor nitpicking. This is close. Can you rebase on master, fix the merge conflicts, and see where CI is at?

jfirebaugh · 2016-11-15T01:20:52Z

src/mbgl/text/bidi.cpp

+        U_SHAPE_LETTERS_SHAPE, // Add options here for handling numbers in bidirectional text
+        &errorCode);
+
+    if (U_FAILURE(


Put the comment above the if statement so this doesn't line-break so awkwardly.

jfirebaugh · 2016-11-15T01:21:30Z

src/mbgl/text/bidi.cpp

+    // This just looks for the first character with a strong direction property, it does not perform
+    // the BiDi algorithm
+    return ubidi_getBaseDirection(input.c_str(), static_cast<int32_t>(input.size())) == UBIDI_RTL
+               ? true


No need to write <boolean expression> ? true : false, just write <boolean expression>.

ChrisLoer · 2016-11-15T04:21:48Z

@1ec5 you might want to double-check the changes I made to i18n.cpp while I was rebasing -- it was just moving from uint32_t characters to uint16_t characters, which I think should be fine since we're already limited to the BMP. I talked with @jfirebaugh about the line-breaking problem and we decided to start with the simplified rule of reversing the line breaks when the base direction of the label is RTL. Your heuristic for line breaking in #6057 gets better results, but it's not as easy to apply when we're doing the transformation ahead of time with the input string. Our plan is to start with this PR as a strict improvement on "totally broken", and then follow up with a PR that uses ICU's support for bidirectional line breaking.

One other issue that we haven't discussed at all is handling of numbers in Arabic text. I just used the default behavior, and we end up with the same behavior OpenStreetMap uses (numbers are written LTR using Arabic-Indic digits: for example, see the "24" at https://www.openstreetmap.org/way/26177467). Google Maps uses European digits in the same case (see https://goo.gl/maps/E2yvjkVTp8R2). ICU shaping includes options that make it easy for us to choose either format regardless of the underlying data.

1ec5 · 2016-11-15T08:08:19Z

Your heuristic for line breaking in #6057 gets better results, but it's not as easy to apply when we're doing the transformation ahead of time with the input string.

#6057 was strictly about mirroring rather than complex text shaping. It mirrors after applying line breaking, taking advantage of the fact that the visual length of a line of text remains constant after mirroring.

Similarly, I think it would be possible to individually text-shape each individual line after line breaking, since Arabic words are still delimited by spaces (unlike, say, Thai). The caveat is that some lines would shorten during the transformation, so it would appear as though we were breaking some lines prematurely. The result for purely RTL text would probably be severely width-constrained compared to the approach you’ve implemented; on the other hand, bidirectional text would finally read correctly.

Our plan is to start with this PR as a strict improvement on "totally broken", and then follow up with a PR that uses ICU's support for bidirectional line breaking.

This PR is certainly an improvement over master. I’m looking forward to the glorious ICU future. 👍

jfirebaugh

This is ready to go from my perspective. I would like to observe and record the size increase across all SDKs, so let's get #7005 merged first and then merge this PR after we get a nightly build which confirms the binary size stats are getting tracked correctly.

cc @kkaefer

kkaefer · 2016-11-16T15:24:25Z

I ran a manual build for Android/arm-v7 and got a size change of +7%, from 4,903,788 bytes to 5,239,412 bytes

kkaefer · 2016-11-16T15:39:11Z

src/mbgl/text/bidi.cpp

+    // the BiDi algorithm
+    return ubidi_getBaseDirection(input.c_str(), static_cast<int32_t>(input.size())) == UBIDI_RTL;
+}
+


Can you please merge the code formatting changes into the previous commit? That makes the code changes a lot easier to follow.

kkaefer · 2016-11-16T15:41:05Z

src/mbgl/text/glyph_set.hpp

                             float horizontalAlign, float verticalAlign, float justify,
                             float spacing, const Point<float> &translate) const;
    void lineWrap(Shaping &shaping, float lineHeight, float maxWidth, float horizontalAlign,
                  float verticalAlign, float justify, const Point<float> &translate,
-                  bool useBalancedIdeographicBreaking) const;
+                  bool useBalancedIdeographicBreaking, const bool rightToLeft) const;


We've been moving towards using enums like enum class WritingDirection : bool { LeftToRight, RightToLeft }; to avoid ambiguity in functions like this.

This would jive nicely with the WritingMode bitfield that I'm implementing for #1682.

OK, moved this into a WritingDirection enum defined in bidi.hpp. @1ec5, I didn't see a PR that already had a WritingMode bitfield, so hopefully this still jives nicely.

kkaefer · 2016-11-16T15:41:21Z

src/mbgl/util/utf.hpp

 #include <codecvt>
+#include <locale>


null change

Removed. So clang-format will put includes in alphabetical order... Is our style "add includes in alphabetical order, but don't reorder existing ones"?

kkaefer · 2016-11-16T15:45:04Z

src/mbgl/text/bidi.cpp

+BiDi::BiDi()
+{
+    UErrorCode errorCode = U_ZERO_ERROR;
+    transform = ubiditransform_open(&errorCode); // Only error is failure to allocate memory, in that case ubidi_transform would fall back to creating transform object on the fly


We're recording the error code, but don't do anything with it. Can we either fail hard here, or add checks in BiDi::bidiTransform that don't use the transform object if it's not initialized?

kkaefer · 2016-11-16T15:45:17Z

src/mbgl/text/bidi.hpp

+    std::u16string bidiTransform( const std::u16string& );
+
+private:
+    UBiDiTransform* transform;


Initialize to nullptr?

kkaefer · 2016-11-16T15:46:45Z

src/mbgl/layout/symbol_feature.hpp

@@ -11,6 +11,7 @@ class SymbolFeature {
 public:
    GeometryCollection geometry;
    optional<std::u16string> text;
+    optional<bool> rightToLeft;


Use an enum WritingDirection here for clarity

jfirebaugh · 2016-11-17T19:55:45Z

#7098 should fix the test-suite errors.

… conversions and reduce in-memory size. Continue to use uint32 as glyph ID to maintain Glyph PBF, even though we're only using 16 bits of that uint32. Use std::codecvt instead of boost::unicode_iterator for UTF8->UTF16 conversions.

… shaping. Apply bidi and shaping in symbol_layout. Add utility functions for converting to and from UTF-16.

brunoabinader reviewed Nov 9, 2016

View reviewed changes

jfirebaugh suggested changes Nov 9, 2016

View reviewed changes

1ec5 reviewed Nov 10, 2016

View reviewed changes

jfirebaugh mentioned this pull request Nov 10, 2016

[core] Bump standard library to libstdc++5-dev and clang version to 3.8. #6991

Merged

ChrisLoer force-pushed the cloer_icu_arabic branch from 6b83cec to 5e25e6e Compare November 10, 2016 20:04

ChrisLoer changed the title ~~Use ICU for bidirectional text layout and Arabic text shaping~~ [core] Use ICU for bidirectional text layout and Arabic text shaping Nov 10, 2016

jfirebaugh suggested changes Nov 11, 2016

View reviewed changes

ChrisLoer force-pushed the cloer_icu_arabic branch 2 times, most recently from df14358 to a2a6f2b Compare November 15, 2016 01:07

jfirebaugh suggested changes Nov 15, 2016

View reviewed changes

ChrisLoer force-pushed the cloer_icu_arabic branch from a2a6f2b to 8b7a6cb Compare November 15, 2016 03:59

jfirebaugh approved these changes Nov 15, 2016

View reviewed changes

ChrisLoer force-pushed the cloer_icu_arabic branch from 8b7a6cb to 7435d4b Compare November 15, 2016 23:01

kkaefer suggested changes Nov 16, 2016

View reviewed changes

ChrisLoer force-pushed the cloer_icu_arabic branch from 7435d4b to 18568fc Compare November 16, 2016 21:19

kkaefer approved these changes Nov 17, 2016

View reviewed changes

ChrisLoer force-pushed the cloer_icu_arabic branch 2 times, most recently from 6e62c60 to ae64c57 Compare November 17, 2016 19:23

jfirebaugh force-pushed the cloer_icu_arabic branch from ae64c57 to 24c3595 Compare November 17, 2016 22:11

ChrisLoer added 2 commits November 17, 2016 14:12

[core] Add ICU package for Bidirectional text support and arabic text…

9422cc8

… shaping. Apply bidi and shaping in symbol_layout. Add utility functions for converting to and from UTF-16.

[core] Add minimal line breaking support for RTL text.

39a6d73

ChrisLoer force-pushed the cloer_icu_arabic branch from 24c3595 to 39a6d73 Compare November 17, 2016 22:12

jfirebaugh merged commit d4fc66a into master Nov 17, 2016

jfirebaugh deleted the cloer_icu_arabic branch November 17, 2016 22:33

ChrisLoer mentioned this pull request Nov 17, 2016

Complex Text Rendering mapbox/DEPRECATED-mapbox-gl#4

Closed

1ec5 mentioned this pull request Nov 18, 2016

Upright CJK characters in vertically-oriented labels #7114

Merged

ChrisLoer mentioned this pull request Nov 28, 2016

Support bidirectional text layout and Arabic text shaping mapbox/mapbox-gl-js#3708

Closed

1ec5 mentioned this pull request Dec 1, 2016

[POC] Naïve right-to-left labeling #6057

Closed

4 tasks

boundsj mentioned this pull request Jan 18, 2017

Pass ICU to libtool in package script #7773

Merged

[core] Use ICU for bidirectional text layout and Arabic text shaping #6984

[core] Use ICU for bidirectional text layout and Arabic text shaping #6984

Conversation

ChrisLoer commented Nov 9, 2016

mention-bot commented Nov 9, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

brunoabinader commented Nov 9, 2016

jfirebaugh commented Nov 9, 2016

1ec5 Nov 10, 2016 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jfirebaugh left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ChrisLoer commented Nov 15, 2016

1ec5 commented Nov 15, 2016 • edited Loading

jfirebaugh left a comment

Choose a reason for hiding this comment

kkaefer commented Nov 16, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jfirebaugh commented Nov 17, 2016

1ec5 Nov 10, 2016 •

edited

Loading

1ec5 commented Nov 15, 2016 •

edited

Loading