Skip to content
This repository has been archived by the owner on Aug 8, 2023. It is now read-only.

[core] Use ICU for bidirectional text layout and Arabic text shaping #6984

Merged
merged 3 commits into from
Nov 17, 2016

Conversation

ChrisLoer
Copy link
Contributor

No description provided.

@mention-bot
Copy link

@ChrisLoer, thanks for your PR! By analyzing the history of the files in this pull request, we identified @jfirebaugh, @kkaefer and @brunoabinader to be potential reviewers.

UErrorCode errorCode = U_ZERO_ERROR;

std::u16string input16 = util::utf32_to_utf16::convert( input32 );
std::unique_ptr<UChar*> outputText = std::make_unique<UChar*>( new UChar[input16.size() * 2] ); // Maximum output of ubidi_transform is twice the size of input according to ubidi_transform.h
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what valgrind is complaining about on the Qt5 Travis bot.

I suppose you mean std::make_unique<UChar[]>(input16.size() * 2); instead?

@@ -15,7 +15,28 @@ class utf8_to_utf32 {
boost::u8_to_u32_iterator<std::string::const_iterator> end(utf8.end());
return std::u32string(begin,end);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should rewrite this whole file in terms of std::codecvt, dropping the boost dependency.

@@ -90,7 +92,7 @@ SymbolLayout::SymbolLayout(std::string bucketName_,
u8string = platform::lowercase(u8string);
}

ft.text = util::utf8_to_utf32::convert(u8string);
ft.text = bidi.bidiTransform( util::utf8_to_utf32::convert( u8string ) );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since getGlyphRange doesn't support anything beyond the Basic Multilingual Plane, are we better off using utf16 throughout this code, rather than going utf8 ⇢ utf32 ⇢ utf16 ⇢ utf32?

UErrorCode errorCode = U_ZERO_ERROR;

std::u16string input16 = util::utf32_to_utf16::convert( input32 );
std::unique_ptr<UChar*> outputText = std::make_unique<UChar*>( new UChar[input16.size() * 2] ); // Maximum output of ubidi_transform is twice the size of input according to ubidi_transform.h
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

auto outputText = std::make_unique<UChar[]>(input16.size() * 2);

@brunoabinader
Copy link
Member

Awesome stuff @ChrisLoer! Btw, we usually add tags e.g. [core] in commit messages in GL native.

@jfirebaugh
Copy link
Contributor

We'll want some rendering tests to go along with these changes. I suggest keeping them very simple, modeled on test cases such as this one which uses a single symbol layer plus a GeoJSON source to provide simple data.

@@ -90,10 +92,10 @@ SymbolLayout::SymbolLayout(std::string bucketName_,
u8string = platform::lowercase(u8string);
}

ft.text = util::utf8_to_utf32::convert(u8string);
ft.text = bidi.bidiTransform( util::utf8_to_utf16::convert( u8string ) );
Copy link
Contributor

@1ec5 1ec5 Nov 10, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m curious about how visual mirroring at this stage interacts with line wrapping. Does this PR address the concern raised in #6057 (comment)?

In #6057, I intentionally performed the mirroring in GlyphSet::lineWrap(), which seemed to work fine for point-placed, wrapped, and line-placed labels: #6057 (comment).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No this approach has the same problem, I was punting on the line wrap issue, but we should probably do at least as much as you did and apply line wrapping before the transform. Having RTL also be bottom-to-top is obviously no good!

For fully correct line breaking support, we'd need to work with the ICU bidirectional algorithm at a lower level -- you start with ubidi_setParagraph, then you find line breaks, then you use ubidi_setLine on the previously set paragraph. It's necessary in order to correctly calculate the embedding levels. My intuition is that we'll be able to get away without doing that, at least at first.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand, the line wrapping algorithm doesn't really have correct shaping input if you don't do the transform ahead of time. One partial fix we could try is to detect the "paragraph level" direction and pass it into GlyphSet::lineWrap() -- if the paragraph is RTL, then line break up instead of down. It wouldn't do the right thing if you had, say, 3 lines of LTR text embedded in an RTL paragraph, but that might be rare enough to ignore for now?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bidi POI labels could get pretty long, for instance “مكتبة الإسكندرية‎‎ Maktabat al-Iskandarīyah” for the Library of Alexandria. In this case, the LTR text is a third longer than the RTL text. I don’t know how many cases there are where the two are roughly even in length.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrmm... that string renders pretty poorly using my base-direction hack:

screen shot 2016-11-10 at 1 46 05 pm

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed my changes so you can take a look at how I did the line breaking, but the results for the Library of Alexandria are pretty disheartening.

@ChrisLoer ChrisLoer changed the title Use ICU for bidirectional text layout and Arabic text shaping [core] Use ICU for bidirectional text layout and Arabic text shaping Nov 10, 2016
@@ -90,10 +92,12 @@ SymbolLayout::SymbolLayout(std::string bucketName_,
u8string = platform::lowercase(u8string);
}

ft.text = util::utf8_to_utf32::convert(u8string);
std::u16string u16string = util::utf8_to_utf16::convert( u8string );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No spaces after ( or before ) in our house style. You can use clang-format to fix formatting up.

@kkaefer Is there a way to run clang-format on all modified files in a PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jfirebaugh uncomment this line and run make check. This will run both clang-tidy and clang-format on modified files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran clang-tidy and clang-format on all the files, and it looks like they hadn't been tidied in a while, so it introduced a lot of diffs unrelated to my changes. Should I still go ahead with the changes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you subset it further, to the code that's being modified for other reasons?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, just did a quick manual review and tried to only clang-format "new" code.

@ChrisLoer ChrisLoer force-pushed the cloer_icu_arabic branch 2 times, most recently from df14358 to a2a6f2b Compare November 15, 2016 01:07
Copy link
Contributor

@jfirebaugh jfirebaugh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nitpicking. This is close. Can you rebase on master, fix the merge conflicts, and see where CI is at?

U_SHAPE_LETTERS_SHAPE, // Add options here for handling numbers in bidirectional text
&errorCode);

if (U_FAILURE(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put the comment above the if statement so this doesn't line-break so awkwardly.

// This just looks for the first character with a strong direction property, it does not perform
// the BiDi algorithm
return ubidi_getBaseDirection(input.c_str(), static_cast<int32_t>(input.size())) == UBIDI_RTL
? true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to write <boolean expression> ? true : false, just write <boolean expression>.

@ChrisLoer
Copy link
Contributor Author

@1ec5 you might want to double-check the changes I made to i18n.cpp while I was rebasing -- it was just moving from uint32_t characters to uint16_t characters, which I think should be fine since we're already limited to the BMP. I talked with @jfirebaugh about the line-breaking problem and we decided to start with the simplified rule of reversing the line breaks when the base direction of the label is RTL. Your heuristic for line breaking in #6057 gets better results, but it's not as easy to apply when we're doing the transformation ahead of time with the input string. Our plan is to start with this PR as a strict improvement on "totally broken", and then follow up with a PR that uses ICU's support for bidirectional line breaking.

One other issue that we haven't discussed at all is handling of numbers in Arabic text. I just used the default behavior, and we end up with the same behavior OpenStreetMap uses (numbers are written LTR using Arabic-Indic digits: for example, see the "24" at https://www.openstreetmap.org/way/26177467). Google Maps uses European digits in the same case (see https://goo.gl/maps/E2yvjkVTp8R2). ICU shaping includes options that make it easy for us to choose either format regardless of the underlying data.

@1ec5
Copy link
Contributor

1ec5 commented Nov 15, 2016

Your heuristic for line breaking in #6057 gets better results, but it's not as easy to apply when we're doing the transformation ahead of time with the input string.

#6057 was strictly about mirroring rather than complex text shaping. It mirrors after applying line breaking, taking advantage of the fact that the visual length of a line of text remains constant after mirroring.

Similarly, I think it would be possible to individually text-shape each individual line after line breaking, since Arabic words are still delimited by spaces (unlike, say, Thai). The caveat is that some lines would shorten during the transformation, so it would appear as though we were breaking some lines prematurely. The result for purely RTL text would probably be severely width-constrained compared to the approach you’ve implemented; on the other hand, bidirectional text would finally read correctly.

Our plan is to start with this PR as a strict improvement on "totally broken", and then follow up with a PR that uses ICU's support for bidirectional line breaking.

This PR is certainly an improvement over master. I’m looking forward to the glorious ICU future. 👍

Copy link
Contributor

@jfirebaugh jfirebaugh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is ready to go from my perspective. I would like to observe and record the size increase across all SDKs, so let's get #7005 merged first and then merge this PR after we get a nightly build which confirms the binary size stats are getting tracked correctly.

cc @kkaefer

@kkaefer
Copy link
Member

kkaefer commented Nov 16, 2016

I ran a manual build for Android/arm-v7 and got a size change of +7%, from 4,903,788 bytes to 5,239,412 bytes

// the BiDi algorithm
return ubidi_getBaseDirection(input.c_str(), static_cast<int32_t>(input.size())) == UBIDI_RTL;
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please merge the code formatting changes into the previous commit? That makes the code changes a lot easier to follow.

float horizontalAlign, float verticalAlign, float justify,
float spacing, const Point<float> &translate) const;
void lineWrap(Shaping &shaping, float lineHeight, float maxWidth, float horizontalAlign,
float verticalAlign, float justify, const Point<float> &translate,
bool useBalancedIdeographicBreaking) const;
bool useBalancedIdeographicBreaking, const bool rightToLeft) const;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've been moving towards using enums like enum class WritingDirection : bool { LeftToRight, RightToLeft }; to avoid ambiguity in functions like this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would jive nicely with the WritingMode bitfield that I'm implementing for #1682.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, moved this into a WritingDirection enum defined in bidi.hpp. @1ec5, I didn't see a PR that already had a WritingMode bitfield, so hopefully this still jives nicely.

#include <codecvt>
#include <locale>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

null change

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed. So clang-format will put includes in alphabetical order... Is our style "add includes in alphabetical order, but don't reorder existing ones"?

BiDi::BiDi()
{
UErrorCode errorCode = U_ZERO_ERROR;
transform = ubiditransform_open(&errorCode); // Only error is failure to allocate memory, in that case ubidi_transform would fall back to creating transform object on the fly
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're recording the error code, but don't do anything with it. Can we either fail hard here, or add checks in BiDi::bidiTransform that don't use the transform object if it's not initialized?

std::u16string bidiTransform( const std::u16string& );

private:
UBiDiTransform* transform;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initialize to nullptr?

@@ -11,6 +11,7 @@ class SymbolFeature {
public:
GeometryCollection geometry;
optional<std::u16string> text;
optional<bool> rightToLeft;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use an enum WritingDirection here for clarity

@ChrisLoer ChrisLoer force-pushed the cloer_icu_arabic branch 2 times, most recently from 6e62c60 to ae64c57 Compare November 17, 2016 19:23
@jfirebaugh
Copy link
Contributor

#7098 should fix the test-suite errors.

… conversions and reduce in-memory size.

Continue to use uint32 as glyph ID to maintain Glyph PBF, even though we're only using 16 bits of that uint32.
Use std::codecvt instead of boost::unicode_iterator for UTF8->UTF16 conversions.
… shaping.

Apply bidi and shaping in symbol_layout.
Add utility functions for converting to and from UTF-16.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants